Question:

Unable to scrape data from a table using BeautifulSoup find_all or pandas.read_html

乐山

2023-03-14

import mechanicalsoup
from bs4 import BeautifulSoup
import pandas

root_url="https://www.pro-football-reference.com"

#Opens the main pro-football page with list of 2018 games
browser=mechanicalsoup.StatefulBrowser()
browser.open("https://www.pro-football-reference.com/years/2018/games.htm")
main_page = browser.get_current_page()
browser.close()
data=main_page.find_all("tr")

#Finds the link to box-score information for the first game.  
#Will iterate over all games in a for loop later on.
box_score_tag = data[1].find("td",{"data-stat":"boxscore_word"})
box_score_link = root_url+box_score_tag.a.get("href")

#Opens the box-score page for the first game
browser2=mechanicalsoup.StatefulBrowser()
browser2.open(box_score_link)
boxscorepage=browser2.get_current_page()
browser2.close()

#attempt to scrape all the tables using Pandas
tables = pandas.read_html(box_score_link)
print(len(tables))

BeautifulSoup attempt (replaces the last 3 lines):

#attempt to scrape the specific table in question using BeautifulSoup
game_info = boxscorepage.find_all("table",{"id":"game_info"})
print(game_info)

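This outputs nothing. On this page, finding some tags (divs, spans, etc.) works, but others fail. In this example, find_all does not find the table with the id game_info as expected.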

1 answer

宇文灿

2023-03-14

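There is no need for Selenium. The tables are inside HTML comments in the page source. Pull those comments out and you can parse all of the tables. The specific table you want is the second one (index position 1).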
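To see why find_all comes back empty, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for the box-score page, not a copy of it, and it assumes the real page wraps the table in a comment the same way. The parser treats commented-out markup as a single Comment text node, so the table never appears as a tag.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the box-score page: the table sits inside an HTML comment.
html = """
<div id="all_game_info">
<!--
<table id="game_info">
<tr><th>Roof</th><td>outdoors</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The commented-out markup is one text node, so no <table> tag exists in the tree.
visible_tables = soup.find_all("table", {"id": "game_info"})
print(len(visible_tables))

# The same markup is reachable as a Comment node.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
hidden = [c for c in comments if "game_info" in c]
print(len(hidden))
```

The div lookup succeeds while the table lookup fails, which matches the behavior described in the question.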

Code:

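import requests
from bs4 import BeautifulSoup
from bs4 import Comment
from io import StringIO
import pandas as pd


url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
# Collect every HTML comment node; the tables are hidden inside them
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            # StringIO avoids the deprecation warning newer pandas emits for literal HTML strings
            tables.append(pd.read_html(StringIO(each))[0])
        except Exception:
            # Skip comments that mention 'table' but contain no parsable table
            continue

# The table of interest is at index 1; loc[1:] drops its first row (label 0)
print(tables[1].loc[1:])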

Output:

            0                         1
1    Won Toss         Eagles (deferred)
2        Roof                  outdoors
3     Surface                     grass
4    Duration                      3:19
5  Attendance                     69696
6     Weather    81 degrees, wind 8 mph
7  Vegas Line  Philadelphia Eagles -1.0
8  Over/Under              44.5 (under)
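Relying on index position 1 can break if the page adds or reorders tables. As a hedged alternative, the sketch below looks the table up by its id, falling back to the comments when it is not in the live markup. The helper name find_commented_table and the sample HTML are illustrative assumptions, not part of the original answer, and the sample only imitates the page's structure.

```python
from bs4 import BeautifulSoup, Comment


def find_commented_table(html, table_id):
    """Return the <table> with the given id, even if it is commented out (else None)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    # Fall back to parsing each comment that mentions the id as its own document.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if table_id in comment:
            inner = BeautifulSoup(comment, "html.parser")
            table = inner.find("table", id=table_id)
            if table is not None:
                return table
    return None


# Hypothetical sample; the real page is assumed to be laid out similarly.
sample_html = """
<div id="all_game_info">
<!--
<table id="game_info">
<tr><th colspan="2">Game Info</th></tr>
<tr><th>Roof</th><td>outdoors</td></tr>
<tr><th>Surface</th><td>grass</td></tr>
</table>
-->
</div>
"""

table = find_commented_table(sample_html, "game_info")

# Keep only label/value rows; the single-cell title row is skipped.
rows = {}
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if len(cells) == 2:
        rows[cells[0]] = cells[1]
print(rows)
```

On the real page, response.text from the code above would be passed in place of sample_html.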
