How to retrieve the values of dynamically loaded HTML content using Python

顾昌翰

2023-03-14

Question:

I'm using Python 3 and trying to retrieve data from a website. However, the data is loaded dynamically, and the code I have right now doesn't work:

from urllib import request

url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url)

response = request.urlopen(url)
# read the first 10000 bytes and decode them to text
data = response.read(10000).decode("utf-8", errors="replace")

print(data)

Where I expect to find a particular value, I find a template placeholder instead, e.g. "{{formatPrice median}}" rather than "4.48".

How can I retrieve the value instead of the placeholder text?

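Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}.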

Edit 2: I've installed Selenium and BeautifulSoup and set up my program to use them.

The code I have now is:

from bs4 import BeautifulSoup
from selenium import webdriver

#...

driver = webdriver.Firefox()
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

print("Finding...")

for tag in soup.find_all('formatPrice median'):
    print(tag.text)

Here is a screenshot of the program as it runs. Unfortunately, it doesn't seem to find anything matching "formatPrice median".

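Answer:

Assuming you are trying to get values from a page that is rendered with JavaScript templates (for instance, something like Handlebars), then placeholder text is exactly what you will get from any of the standard solutions (i.e. BeautifulSoup or requests).

This is because the browser uses JavaScript to alter the content it received and to create new DOM elements. urllib does the requesting part like a browser would, but not the template rendering part. This article discusses three main solutions: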

Parse the AJAX JSON directly
Use an offline JavaScript interpreter (SpiderMonkey, Crowbar) to process the request
Use a browser automation tool (Splinter)
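
As a rough illustration of the first option, the sketch below decodes a JSON response body and pulls out a single value. The payload shape and the `sell` / `median` keys are invented for this example and are not the real response format of any site; in practice you would look up the actual request and its fields in the browser's network inspector, then fetch the body with urllib.

```python
import json

# Hypothetical response body; the real endpoint and its field names
# must be taken from the browser's network inspector.
SAMPLE_BODY = '{"typeid": 34, "sell": {"median": 4.48, "volume": 1200}}'


def extract_value(body, path):
    """Decode a JSON document and walk it along `path` (keys or list indexes)."""
    node = json.loads(body)
    for key in path:
        node = node[key]
    return node


if __name__ == "__main__":
    print(extract_value(SAMPLE_BODY, ["sell", "median"]))
```

When such a JSON endpoint exists, this route avoids starting a browser at all, which tends to make it the fastest of the three.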

Edit

From your comments it looks like it is a Handlebars-driven site. This answer gives a good code example that may be useful:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

# page_source holds the HTML as rendered by the browser
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific CSS class
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)

driver.quit()

Basically, Selenium gets the rendered HTML from your browser, available through the page_source property, and you can then parse it with BeautifulSoup. Good luck :)
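
One thing to watch for is that page_source may be read before the templates have finished rendering. The sketch below is one possible way to handle that: it polls page_source until no {{...}} placeholders remain. The helper name `wait_for_rendered_source` and its parameters are made up for this example, and it assumes the page replaces its placeholders in place. If the site keeps its raw templates inside script tags, this check will never pass, and waiting for a specific element (for example with Selenium's WebDriverWait) is likely a better fit.

```python
import re
import time

# Matches an unrendered Handlebars-style placeholder such as {{formatPrice median}}.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def wait_for_rendered_source(driver, url, timeout=10.0, poll=0.5):
    """Load `url` and return page_source once no {{...}} placeholders remain.

    `driver` is any object exposing Selenium's `get()` method and
    `page_source` attribute. Raises TimeoutError if placeholders are
    still present after `timeout` seconds.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if not PLACEHOLDER.search(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("page still contains unrendered placeholders")
        time.sleep(poll)
```

With a real browser this would be called as `html = wait_for_rendered_source(webdriver.Firefox(), url)`, and the result passed to BeautifulSoup as before.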
