问题：

使用Python和Selenium刮取难以找到的Web表

贝成业

2023-03-14

我一直在使用Python和Selenium从特定的州健康网页中获取数据，并将该表输出到本地CSV。

我在其他几个州使用类似的代码取得了很多成功。但是，我遇到了一种状态，即使用看起来像R的东西来创建动态仪表板，而我无法使用常规方法真正访问这些仪表板。

我花了很多时间梳理StackOverflow。我已经检查了是否有一个iframe可以切换，但是，我只是没有看到页面上iframe中我想要的数据。

使用Chrome的“检查”功能，我可以很容易地找到表格信息。但是，从原始URL开始，我需要的数据不在该页面上，并且我找不到表的源URL。我甚至用Fiddler来看看某处是否有电话。

所以，我不知道该怎么办。我可以看到数据——但是，我不知道在哪里告诉Selenium和BS4在哪里访问它。

页面在这里：https://coronavirus.utah.gov/case-counts/

页面加载需要一段时间...我已经让其他州有了这个问题，硒可以解决这个问题。

如有任何帮助或建议，将不胜感激。

这是我一直在使用的代码。它在这里不起作用，但是，它的结构与其他州的结构非常相似。

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20

# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
driver.minimize_window()
driver.get(url)

# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located()((By.ID, "total-number-of-lab-confirmed-covid-19-cases-living-in-utah")))

# Now, scrape table
html = driver.find_element_by_id("total-number-of-lab-confirmed-covid-19-cases-living-in-utah")
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='#DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")

tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"

exec(file_path)

# Close the chrome web driver
driver.close()

共有1个答案

陈马鲁

2023-03-14

我找到了另一种方法来获取我需要的信息。

感谢朱利安·斯坦利让我了解卡塔隆录音机产品。这让我看到了iframe是什么，桌子在哪里。

使用CSS或XPATH查找元素的旧方法会由于线程锁定而导致Pickle错误。我不知道该怎么处理。但是，这导致整个项目停滞不前。

但是，我能够通过属性获得表的文本/超文本标记语言。之后，我只是像往常一样用BS4读它。

import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20

# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
#driver.minimize_window()
driver.get(url)


# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout)

# Get name of frame (or use index=0)
frames = [frame.get_attribute('id') for frame in driver.find_elements_by_tag_name('iframe')]

# Switch to frame
#driver.switch_to_frame("coronavirus-dashboard")
driver.switch_to_frame(0)

# Now, scrape table
html = driver.find_element_by_css_selector('#DataTables_Table_0_wrapper').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")

tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"

exec(file_path)

# Close the chrome web driver
driver.close()

类似资料：

用Python和selenium刮URL

> 取文本文件booktitle.txt，它是书名列表。然后使用Python/Selenium在网站goodreads.com中搜索该标题。获取结果的URL并创建一个新的.csv文件，其中列1=书名，列2=站点URL
使用Python对隐藏表进行Web刮取

这是我的代码：我在找“eFotrait-table”：具体来说，这一条：
使用rvest进行Web刮取
尝试使用Selenium刮取数据>

我试图使用Selenium从代码中获得jpg。我已经设法找到了链接点击获得我的jpg所在的位置。（真倒霉！我刚接触硒）。所有的窗户都随着它的点击而打开。与刮刮乐相比，它真的很慢，所以如果有人能告诉我一个更快的方法，那就太好了。我试图搜索的网站是www.rosegal.com。我正在刮的类别是大尺寸的背心。这第一页有60个产品在它。如果单击这些产品，它会将您带到一个产品页面，在该页面上您可以选择所
使用DataDome的网站在使用Selenium和Python进行刮取时captcha被阻止

我实际上正在尝试从不同的网站中删除一些汽车数据，我一直在chromebrowser中使用selenium，但一些网站实际上通过验证码验证（例如:https://www.leboncoin.fr/)，阻止了selenium，而这只需要一到两个请求。我尝试在chromebrowser中更改$_cdc，但这没有解决问题，我一直在chromebrowser中使用这些选项我试图刮的网站使用DataDome
在python中使用selenium刮取动态网页失败

我正试图从这一页上删除所有5000家公司。当我向下滚动时，它的动态页面和公司被加载。但我只能刮去5家公司的钱，那我怎么能刮去全部5000家呢？当我向下滚动页面时，URL正在更改。我试过硒，但没用。https://www.inc.com/profile/onetrust注意：我想刮公司的所有信息，但刚才选择了两个。更新了代码，但页面根本不滚动。更正了BeautifulSoup代码中的一些错误谢谢

使用Python和Selenium刮取难以找到的Web表

共有1个答案

相关问答

相关文章

相关阅读

相关工具

相关文档