How do I correctly parse UTF-8 encoded HTML into Unicode strings with BeautifulSoup?

孙项禹

2023-03-14

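Question:

I'm running a Python program that fetches a UTF-8 encoded web page and uses BeautifulSoup to extract some text from the HTML.

However, when I write this text to a file (or print it to the console), it comes out in an unexpected encoding.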

Example program:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])

Running this gives the result:

# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'

But I would expect a Python Unicode string to render the ö in the word können as \xf6:

# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

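I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and I've tried calling read() and decode() on the response object, but it either makes no difference or raises an error.

With the command curl www.voxnow.de | hexdump -C, I can see that the web page really is UTF-8 encoded for the character ö (i.e. it contains 0xc3 0xb6):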

      20 74 69 74 6c 65 3d 22  48 69 65 72 20 6b c3 b6  | title="Hier k..|
      6e 6e 65 6e 20 53 69 65  20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
      73 74 65 6e 6c 6f 73 20  72 65 67 69 73 74 72 69  |stenlos registri|
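
The two outputs above differ only in how the bytes 0xc3 0xb6 from the hexdump are decoded. The following is a minimal sketch (Python 3, standard library only, not the Python 2 program above). It assumes the bytes were decoded as ISO-8859-2; that codec is an inference from the escaped output \u0102\u015b and is not stated anywhere on the page.

```python
# Python 3 sketch, standard library only.
# Bytes as seen in the hexdump: 0xc3 0xb6 is the UTF-8 encoding of "ö".
raw = b"k\xc3\xb6nnen"

# Decoding with the encoding the page declares gives the expected string.
as_utf8 = raw.decode("utf-8")

# Assumption: a wrong guess of ISO-8859-2 treats each byte as one character.
as_latin2 = raw.decode("iso-8859-2")

print(ascii(as_utf8))    # escaped form of the correctly decoded string
print(ascii(as_latin2))  # escaped form of the mis-decoded string
```

If that assumption holds, the unexpected output is not a printing problem: the bytes were already decoded with the wrong codec before the string was printed.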

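I've reached the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?

Answer:

The HTML content reports itself as UTF-8 encoded and, for the most part, it is, except for one or two rogue invalid UTF-8 characters.

This apparently confuses BeautifulSoup about which encoding is in use. When I tried to decode the content as UTF-8 first, before passing it to BeautifulSoup, like this: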

soup = BeautifulSoup(response.read().decode('utf-8'))

I got the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: 
                    invalid continuation byte

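Looking more closely at the output, there was one instance of the character Ü that was wrongly encoded as the invalid byte sequence 0xe3 0x9c instead of the correct 0xc3 0x9c.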
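
To see why that byte pair fails, here is a small sketch in Python 3 (standard library only). The surrounding `<p>…ber</p>` markup is a made-up sample, not taken from the actual page. The byte 0xe3 announces a three-byte sequence, so the decoder expects two continuation bytes after it and rejects the ordinary ASCII letter that follows 0x9c.

```python
# Python 3 sketch, standard library only.
valid = b"\xc3\x9c"   # correct UTF-8 encoding of "Ü" (U+00DC)
broken = b"\xe3\x9c"  # byte pair reported in the answer

# The correct pair decodes cleanly.
decoded_valid = valid.decode("utf-8")

# Hypothetical surrounding markup, only to give the decoder a following byte.
sample = b"<p>" + broken + b"ber</p>"

error = None
try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; "exc" is unbound after the except block

print(ascii(decoded_valid))
print(error.reason, "at byte offset", error.start)
```

The `start` attribute of the exception gives the byte offset of the bad sequence, which is how a position such as the one in the traceback above can be located in the raw response.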

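As the currently highest-rated answer to this question suggests, the invalid UTF-8 characters can be dropped while decoding, so that only valid data is passed to BeautifulSoup: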

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
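
The effect of the 'ignore' error handler can be checked without fetching the page. The sketch below (Python 3, standard library only) uses a hypothetical byte string that mixes valid UTF-8 with the bad 0xe3 0x9c pair. It also shows 'replace' as an alternative handler, which the answer does not use but which leaves a visible marker instead of silently dropping data.

```python
# Python 3 sketch, standard library only.
# Hypothetical sample: valid UTF-8 for "ö" plus the invalid pair 0xe3 0x9c.
data = b"Hier k\xc3\xb6nnen Sie \xe3\x9cber uns lesen"

# 'ignore' drops the undecodable bytes and keeps everything else.
ignored = data.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (the replacement character) for them.
replaced = data.decode("utf-8", "replace")

print(ascii(ignored))
print(ascii(replaced))
```

With 'ignore' the damaged Ü disappears without a trace, so the extracted text may be missing a character. Which handler fits better depends on whether a silent loss or a visible U+FFFD is preferable for the text being extracted.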
