我正在尝试解析此站点的表。我用蟒蛇靓汤来做这个。虽然它在我的Ubuntu 14.04机器上产生了正确的输出,但在我朋友的windows机器上却产生了错误的输出。我将代码片段粘贴到此处:
from bs4 import BeautifulSoup def buildURL(agi, families): #agi and families contains space seperated string of genes and families genes = agi.split(" ") families = families.split(" ") base_url = "http://www.athamap.de/search_gene.php" url = base_url if len(genes): url = url + "?agi=" for i, gene in enumerate(genes): if i>0: url = url + "%0D%0A" url = url + gene url = url + "&upstream=-500&downstream=50&restriction=0&sortBy1=gen&sortBy2=fac&sortBy3=pos" for family in families: family = family.replace("/", "%2F") url = url +"&familySelected%5B"+family+"%5D=on" url = url + "&formSubmitted=TRUE" return url def fetch_html(agi, families): url = buildURL(agi, families) response = requests.get(url) soup = BeautifulSoup(str(response.text), "lxml") divs = soup.find_all('div') seldiv = "" for div in divs: try: if div["id"] == "geneAnalysisDetail": ''' This div contains interesting data ''' seldiv = div except: None return seldiv def parse(seldiv): soup = seldiv rows= soup.find_all('tr') attributes =["Gene", "Factor", "Family", "Position", "Relative orientation", "Relative Distance", "Max score", "Threshold Score", "Score"] print attributes save_rows = [] for i in range(2, len(rows)): cols = rows[i].find_all('td') lst = [] for j,col in enumerate(cols): if j==0: lst.append(re.sub('', '',str(col.contents[1].contents[0]))) elif j==1: lst.append(str(col.contents[1].contents[0])) elif j==2: lst.append(str(col.contents[0])) elif j==3: lst.append(str(col.contents[1].contents[0])) else: lst.append(str(col.contents[0])) save_rows.append(lst) return save_rows
你知道这里会出什么问题吗?我试过使用和不使用lxml。
提前谢谢。
一种可能是您没有为请求添加用户代理。不同的用户代理有时会得到不同的结果,特别是来自怪异的网站。这是所有可能的代理的列表,请选择一个。它不一定是你的机器
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0',
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0',
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Firefox/52.0',
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:51.0) Gecko/20100101 Firefox/51.0',
'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
]
您可以这样解析表,并且应该在这两台机器上都能很好地工作。
import requests
from bs4 import BeautifulSoup
def fetch_html(url):
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
seldiv = soup.find("div", id="geneAnalysisDetail")
return seldiv
def parse(url):
soup = fetch_html(url)
rows= soup.find_all("tr")
attributes = ["Gene", "Factor", "Family", "Position", "Relative orientation", "Relative Distance", "Max score", "Threshold Score", "Score"]
save_rows = []
for i in range(2, len(rows)):
cols = rows[i].find_all("td")
lst = []
for col in cols:
text = col.get_text()
text = text.strip(" ")
text = text.strip("\n")
lst.append(text)
save_rows.append(lst)
return save_rows
url = "http://www.athamap.de/search_gene.php?agi=At1g76540%0D%0AAt3g12280%0D%0AAt4g28980%0D%0AAt4g37630%0D%0AAt5g11300%0D%0AAt5g27620%0D%0A&upstream=-500&downstream=50&restriction=0&sortBy1=gen&sortBy2=fac&sortBy3=pos&familySelected[ARF]=on&familySelected[CAMTA]=on&familySelected[GARP%2FARR-B]=on&formSubmitted=TRUE"
save_rows = parse(url)
for row in save_rows:
print(row)
问题内容: 我真的很head头。我使用s已有一段时间了,但现在,使用SimpleDateFormat解析日期(只是有时)是完全错误的。 特别: 打印字符串。有没有搞错?-它甚至都不会一直解析为错误的日期! 更新: 修复得很漂亮。您不知道吗,在其他一些地方也滥用了它。一定喜欢调试别人的代码:) 问题答案: 我认为您要使用格式,而不是’hh’,以便您使用00-23之间的小时数。“ hh”采用12小时增
在我刚刚编写的一个测试解析器中,我遇到了一个奇怪的问题,我不太明白。 将其简化为显示问题的最小示例,让我们从以下语法开始: 这是语法的变化: 突然间,相同的测试输入被错误地分解成两个statement_list,每个statement_list都继续到一个带有“missing”;“警告,第一个返回“z=”的不完整的assignment_statement,第二个返回“x+”的不完整的assignm
我需要解析一个日期,当我解析一个无效的日期“2007年2月29日”时,它将以dd/MM/yyyy的格式返回给我,作为本地日期2007-02-28,代码如下: 但是,如果我使用ISO格式(没有DateTimeFormatter)进行解析,则会出现异常,代码如下: 例外情况: 所以我的问题是,我想考虑一下: 由于无效,我如何使用LocalDate呢。作语法分析
我正在尝试读取特定格式的日期字符串。 首先,让我们以所需格式生成当前时间的字符串。 现在,让我们尝试将其转换回datetime对象。 错误与格式有关,格式完全相同。 这是怎么回事?
问题内容: 我用来针对浮点型字段返回匹配的数值结果。看来,一旦小数点左边有4位以上的数字,就不会返回与我的小数点右边的搜索项匹配的值。这是说明情况的示例: 该查询不会返回值为12357.68的记录,但会返回值为3457.68的记录。同样使用78(而不是68)运行查询不会返回23456.78记录,但是使用76将返回1234.76记录。 因此,要提出一个问题:为什么拥有更大的数量会导致这些结果发生变化