Question:

Scraping an HTML table with Beautiful Soup into pandas

韩志专

2023-03-14

I am trying to scrape an HTML table with Beautiful Soup and import it into pandas. The table is "Team Batting" at http://www.baseball-reference.com/teams/nym/2017.shtml.

Finding the table is no problem:

table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')

for i in table.findAll('tr')[2]: #increase to 3 to get next row in table...
    print(i.get_text())

table_head = table.find('thead')

for i in table_head.findAll('th'):
    print(i.get_text())

Now I am struggling to put everything into a DataFrame. Here is what I have so far:

header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

row = []
for tr in table.findAll('tr')[2]:
    value = tr.get_text()
    row.append(value)

od = OrderedDict(zip(header, row))
df = pd.DataFrame(od, index=[0])

This only works for one row at a time. My question is: how can I do this for every row in the table at once?

1 answer

柴良哲

2023-03-14

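I have tested the code below and it should work for your purposes. Basically, you need to build a list by looping over the players, then use that list to populate a DataFrame. Creating the DataFrame row by row is not advisable, because it is likely to be considerably slower.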

import collections as co
import pandas as pd

from bs4 import BeautifulSoup

with open('team_batting.html','r') as fin:
    soup = BeautifulSoup(fin.read(),'lxml')

table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')

table_head = table.find('thead')
header = []    
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

# count the rows whose first 'th' cell is an empty string
endrows = 0
for tr in table.findAll('tr'):
    if tr.findAll('th')[0].get_text() == '':
        endrows += 1

rows = len(table.findAll('tr'))
rows -= endrows + 1  # also drop the pernicious final row that begins with 'Rk'

list_of_dicts = []
for row in range(rows):
    the_row = []
    try:
        table_row = table.findAll('tr')[row]
        for tr in table_row:
            value = tr.get_text()
            the_row.append(value)
        od = co.OrderedDict(zip(header,the_row))
        list_of_dicts.append(od)
    except AttributeError:
        continue 

df = pd.DataFrame(list_of_dicts)
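
The same idea (collect one record per table row, then build the DataFrame once) can also be sketched more compactly. The following is only an illustration under stated assumptions, not the tested code above: `SAMPLE_HTML` is a made-up miniature of the table with placeholder player names, `table_to_dataframe` is a hypothetical helper name, and the container is matched on the single class `table_container` with the built-in `html.parser` rather than `lxml`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the "Team Batting" table, for illustration only.
SAMPLE_HTML = """
<div class="overthrow table_container">
<table>
  <thead>
    <tr><th>Rk</th><th>Pos</th><th>Name</th><th>G</th></tr>
  </thead>
  <tbody>
    <tr><th>1</th><td>C</td><td>Player A</td><td>100</td></tr>
    <tr><th>2</th><td>1B</td><td>Player B</td><td>120</td></tr>
    <tr><th></th><td></td><td>Team Totals</td><td>162</td></tr>
    <tr><th>Rk</th><td>Pos</td><td>Name</td><td>G</td></tr>
  </tbody>
</table>
</div>
"""


def table_to_dataframe(html):
    """Collect every data row as a dict, then build the DataFrame in one call."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: matching on one class is enough to locate the container.
    container = soup.find("div", class_="table_container")

    header = [th.get_text(strip=True)
              for th in container.find("thead").find_all("th")]

    records = []
    for tr in container.find("tbody").find_all("tr"):
        # Each row starts with a <th> cell followed by <td> cells.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        # Skip rows with an empty first cell and repeated header rows ('Rk').
        if not cells or cells[0] in ("", header[0]):
            continue
        records.append(dict(zip(header, cells)))

    return pd.DataFrame(records, columns=header)


df = table_to_dataframe(SAMPLE_HTML)
print(df)
```

Every value comes back as a string, so numeric columns would still need a conversion step such as `pd.to_numeric` before doing arithmetic on them.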
