Question:

BeautifulSoup scraping: problem with the .text attribute

解飞语

2023-03-14

I have the following code to scrape a page, https://www.hotukdeals.com:

from bs4 import BeautifulSoup
import requests

url="https://www.hotukdeals.com/hot"
r = requests.get(url)
soup = BeautifulSoup(r.text,"html.parser")
deals = soup.find_all("article")
for deal in deals:
    priceElement = deal.find("span",{"class":"thread-price"})
    try:
        print(priceElement,priceElement.text)
    except AttributeError:
        pass

For some reason this approach works for a certain number of loop iterations, scraping the deal prices, and then stops working.

Program output:

<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£9.09</span> £9.09
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£39.95</span> £39.95
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£424.98</span> £424.98
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£8.10</span> £8.10
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£14.59</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£2.50</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl text--color-greyShade">£20</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£19</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£29</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl text--color-greyShade">£49.97</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">FREE</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£2.49</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£1.99</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£54.99</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£12.85</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£1.99</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£21.03</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£5.29</span>

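As the output shows, after the first four lines the .text attribute is empty, even though the element clearly contains text.

Does anyone know what is going on here? Any ideas or workarounds?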

2 answers

徐鸿文

2023-03-14

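BeautifulSoup needs the html5lib parser to parse this site correctly (html5lib is a separate package and has to be installed, e.g. with pip install html5lib). For example: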

import requests
from bs4 import BeautifulSoup

url = "https://www.hotukdeals.com/"

soup = BeautifulSoup(requests.get(url).content, "html5lib")  # <-- use html5lib

for price in soup.select(".thread-price"):
    print(price.text)

Prints:

£149.99
£7
£21.03
£31.79
£359.10
£19.99
£60
£0.60
£168
£4.99
£20
£119
Free P&P
Free
£5
£89.99
FREE
£10.96
£1.79
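
To check whether the parser choice really is what makes the difference, one option is to parse the same markup with several parsers and count how many price elements come back with empty text. The following is only a minimal sketch and not part of the original answer: count_empty_prices and compare_parsers are hypothetical helper names, and it assumes the thread-price class from the question still matches the site's markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_empty_prices(html, parser):
    """Return (total, empty) counts of thread-price spans for one parser."""
    soup = BeautifulSoup(html, parser)
    prices = soup.find_all("span", class_="thread-price")
    # An element counts as empty when it has no text after stripping whitespace.
    empty = sum(1 for p in prices if not p.get_text(strip=True))
    return len(prices), empty


def compare_parsers(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each available parser name to its (total, empty) counts."""
    results = {}
    for parser in parsers:
        try:
            results[parser] = count_empty_prices(html, parser)
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
    return results


# Example use (needs network access and the requests package):
#   html = requests.get("https://www.hotukdeals.com/hot").text
#   for name, (total, empty) in compare_parsers(html).items():
#       print(name, total, empty)
```

If html.parser reports empty elements where html5lib reports none, that would support the parser explanation; if all parsers agree, the cause probably lies elsewhere.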

吴丁雷

2023-03-14

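I'm not sure exactly what causes the problem, but I found a workaround: convert the element to a string, locate where the text starts and ends with str.find(), and slice between those indices. Here is an implementation: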

from bs4 import BeautifulSoup
import requests

url = "https://www.hotukdeals.com/hot"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
deals = soup.find_all("article")
for deal in deals:
    priceElement = deal.find("span", {"class": "thread-price"})

    if priceElement is not None:
        price = str(priceElement)
        start_price = price.find('">') + len('">')  # finds the start of the price
        end_price = price.find("</span")  # finds the end of the price area
        price = price[start_price:end_price]
    else:
        price = None
    print(priceElement, price)
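
The slicing above searches for '">', so it depends on the opening tag ending in a quoted attribute. A slightly more defensive variant of the same idea might look like the sketch below. span_inner_text is a hypothetical helper name, not something from the original answer. It assumes no attribute value contains a ">" character, and it returns any nested tags as raw markup rather than stripping them.

```python
from typing import Optional


def span_inner_text(markup: str) -> Optional[str]:
    """Return the text between the opening tag and the closing </span>.

    Returns None when the string does not look like a single span element.
    """
    open_end = markup.find(">")            # end of the opening <span ...> tag
    close_start = markup.rfind("</span")   # start of the closing tag
    if open_end == -1 or close_start == -1 or close_start <= open_end:
        return None
    return markup[open_end + 1:close_start].strip()
```

Inside the loop it could be called as span_inner_text(str(priceElement)) whenever priceElement is not None.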
