问题：

使用Python对隐藏表进行Web刮取

阎德宇

2023-03-14

这是我的代码：

import requests
from bs4 import BeautifulSoup

url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
page = requests.get(url)

我在找“eFotrait-table”：

efotrait = soup.find('div', id='efotrait-table-loading')
print(efotrait.prettify())

<div class="row" id="efotrait-table-loading" style="margin-top:20px">
 <div class="panel panel-default" id="efotrait_panel">
  <div class="panel-heading background-color-primary-accent">
   <h3 class="panel-title">
    <span class="efotrait_label">
     Traits
    </span>
    <span class="efotrait_count badge available-data-btn-badge">
    </span>
   </h3>
   <span class="pull-right">
    <span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
     <span class="glyphicon glyphicon-chevron-up">
     </span>
    </span>
   </span>
  </div>
  <div class="panel-body">
   <table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
   </table>
  </div>
 </div>
</div>

具体来说，这一条：

soup.select('table#efotrait-table')[0]

<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>

纪成礼

2023-03-14

所需数据在API调用中可用。

import requests

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


def main(url):
    r = requests.post(url, data=data).json()
    print(r)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")

您可以遵循r.keys()并通过访问DICT加载您所需的数据。

但这里有一个快速加载（惰性代码）：

import requests
import re
import pandas as pd

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


def main(url):
    r = requests.post(url, data=data)
    match = {item.group(2, 1) for item in re.finditer(
        r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
    df = pd.DataFrame.from_dict(match)
    print(df)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")

0              heel bone mineral density                          Heel bone mineral density
1              interleukin-8 measurement  Chronic obstructive pulmonary disease-related ...
2   self reported educational attainment        Educational attainment (years of education)
3                        waist-hip ratio                                    Waist-hip ratio
4             eye morphology measurement                                     Eye morphology
5                       CC16 measurement  Chronic obstructive pulmonary disease-related ...
6         age-related hearing impairment  Age-related hearing impairment (SNP x SNP inte...
7    eosinophil percentage of leukocytes               Eosinophil percentage of white cells
8          coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
9                     multiple sclerosis                                 Multiple sclerosis
10                  mathematical ability                    Highest math class taken (MTAG)
11                 risk-taking behaviour                      General risk tolerance (MTAG)
12         coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
13  self reported educational attainment                      Educational attainment (MTAG)
14                          pancreatitis                                       Pancreatitis
15               hair colour measurement                                         Hair color
16                      breast carcinoma  Breast cancer specific mortality in breast cancer
17                      eosinophil count                                  Eosinophil counts
18                     self rated health                                  Self-rated health
19                          bone density                               Bone mineral density

使用Python对隐藏表进行Web刮取

共有1个答案

相关问答

相关文章

相关阅读

相关工具

相关文档