Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
I am looking for the "efotrait-table":
efotrait = soup.find('div', id='efotrait-table-loading')
print(efotrait.prettify())
<div class="row" id="efotrait-table-loading" style="margin-top:20px">
<div class="panel panel-default" id="efotrait_panel">
<div class="panel-heading background-color-primary-accent">
<h3 class="panel-title">
<span class="efotrait_label">
Traits
</span>
<span class="efotrait_count badge available-data-btn-badge">
</span>
</h3>
<span class="pull-right">
<span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
<span class="glyphicon glyphicon-chevron-up">
</span>
</span>
</span>
</div>
<div class="panel-body">
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>
</div>
</div>
</div>
Specifically, this element:
soup.select('table#efotrait-table')[0]
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>
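The table is empty in the fetched HTML, which suggests the page fills it in with JavaScript after loading. The data you need is available from an API call: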
import requests
data = {
"q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
"max": "99999",
"group.limit": "99999",
"group.field": "resourcename",
"facet.field": "resourcename",
"hl.fl": "shortForm,efoLink",
"hl.snippets": "100",
"fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
"raw": "fq:resourcename:association or resourcename:study"
}
def main(url):
    r = requests.post(url, data=data).json()
    print(r)
main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")
You can inspect r.keys() and pull out the data you need by indexing into the dict.
But here is a quick way to load it (lazy code):
import requests
import re
import pandas as pd
data = {
"q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
"max": "99999",
"group.limit": "99999",
"group.field": "resourcename",
"facet.field": "resourcename",
"hl.fl": "shortForm,efoLink",
"hl.snippets": "100",
"fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
"raw": "fq:resourcename:association or resourcename:study"
}
def main(url):
    r = requests.post(url, data=data)
    # group(2, 1) puts mappedLabel first and traitName_s second; the set drops duplicate pairs.
    match = {item.group(2, 1) for item in re.finditer(
        r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
    df = pd.DataFrame.from_dict(match)
    print(df)
main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")
0 heel bone mineral density Heel bone mineral density
1 interleukin-8 measurement Chronic obstructive pulmonary disease-related ...
2 self reported educational attainment Educational attainment (years of education)
3 waist-hip ratio Waist-hip ratio
4 eye morphology measurement Eye morphology
5 CC16 measurement Chronic obstructive pulmonary disease-related ...
6 age-related hearing impairment Age-related hearing impairment (SNP x SNP inte...
7 eosinophil percentage of leukocytes Eosinophil percentage of white cells
8 coronary artery calcification Coronary artery calcified atherosclerotic plaq...
9 multiple sclerosis Multiple sclerosis
10 mathematical ability Highest math class taken (MTAG)
11 risk-taking behaviour General risk tolerance (MTAG)
12 coronary artery calcification Coronary artery calcified atherosclerotic plaq...
13 self reported educational attainment Educational attainment (MTAG)
14 pancreatitis Pancreatitis
15 hair colour measurement Hair color
16 breast carcinoma Breast cancer specific mortality in breast cancer
17 eosinophil count Eosinophil counts
18 self rated health Self-rated health
19 bone density Bone mineral density
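The dict-based route mentioned earlier (inspecting r.keys() instead of running a regex over the raw text) could look roughly like the sketch below. The exact nesting of the real response is not shown in this thread, so the code does not rely on it: it walks the decoded JSON and picks up every dict that has both a traitName_s and a mappedLabel key. The helper names iter_docs and trait_pairs, and the sample payload, are made up for illustration.

```python
import json


def iter_docs(node, required=("traitName_s", "mappedLabel")):
    """Recursively yield every dict in a decoded JSON tree that has all required keys."""
    if isinstance(node, dict):
        if all(key in node for key in required):
            yield node
        for value in node.values():
            yield from iter_docs(value, required)
    elif isinstance(node, list):
        for item in node:
            yield from iter_docs(item, required)


def trait_pairs(response):
    """Collect unique (mappedLabel, traitName_s) pairs in first-seen order."""
    pairs = []
    for doc in iter_docs(response):
        label = doc["mappedLabel"]
        # The regex in the answer implies mappedLabel is a list; take its first entry.
        if isinstance(label, list):
            if not label:
                continue
            label = label[0]
        pair = (label, doc["traitName_s"])
        if pair not in pairs:
            pairs.append(pair)
    return pairs


# Hypothetical, hand-written stand-in for the real response (assumed shape, not real data).
sample = json.loads("""
{"grouped": {"resourcename": {"groups": [
  {"doclist": {"docs": [
    {"traitName_s": "Pancreatitis", "mappedLabel": ["pancreatitis"]},
    {"traitName_s": "Hair color", "mappedLabel": ["hair colour measurement"]},
    {"traitName_s": "Pancreatitis", "mappedLabel": ["pancreatitis"]}
  ]}}
]}}}
""")

pairs = trait_pairs(sample)
print(pairs)
```

With the real call, the same function could be fed requests.post(url, data=data).json() and the resulting list passed to pd.DataFrame. Unlike the regex, this pairs the two fields within one document at a time, so the rows may differ slightly from the output above.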