
Python 3 UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d

向锦
2023-03-14
Question

I want to build a search engine and am following a tutorial I found online. I want to test parsing HTML:

from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
        d = {'text': html.pre.text}
        if html.title is not None:
            d['title'] = html.title.text
        for meta in html.findAll('meta'):
            try:
                if meta['name'] in ('Author', 'Title'):
                    d[meta['name'].lower()] = meta['content']
            except KeyError:
                continue
        return d

parse_html("C:\\pdf\\pydf\\data\\muellner2011.html")

It fails with this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 867: character maps to <undefined>

I found some solutions online that use encode(), but I don't know how to fit the encode() function into my code. Can anyone help me?


Answer:

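In Python 3, open() gives you the file in text mode, so its contents are already decoded to Unicode when you read them. You do not need to tell BeautifulSoup which codec to decode with.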
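To illustrate the difference between text mode and binary mode, here is a minimal sketch that uses only the standard library. The file name sample.html and its contents are made up for this illustration, and BeautifulSoup is left out so the snippet runs without third-party packages.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "wb") as f:
    f.write("<pre>\u201cquoted\u201d</pre>".encode("utf-8"))

# Text mode: open() decodes the bytes, so read() returns str.
with open(path, encoding="utf-8") as infile:
    text = infile.read()

# Binary mode: no decoding takes place, so read() returns bytes.
with open(path, "rb") as infile:
    raw = infile.read()

print(type(text).__name__, type(raw).__name__)
```

In the text-mode case BeautifulSoup receives a str that is already decoded, so a from_encoding argument has nothing to act on. That argument only matters when the parser is handed bytes.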

If the data fails to decode, it is because you did not tell the open() call which codec to use when reading the file. Add the correct codec with the encoding argument:

with open(filename, encoding='utf8') as infile:
    html = BeautifulSoup(infile, "html.parser")

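Otherwise the file is opened with your system's default codec, which depends on the operating system and its locale settings. On Windows this is frequently a legacy code page such as cp1252, which has no character assigned to byte 0x9d, hence the 'charmap' error.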
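If you want to see this for yourself, the sketch below prints the codec open() falls back to and reproduces the same kind of error in isolation. It assumes the offending byte comes from a UTF-8 encoded right double quotation mark (U+201D) and that the default codec is cp1252. Both are plausible guesses for a Windows machine, not something the traceback confirms.

```python
import locale

# Codec that open() falls back to when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print("default codec:", default_encoding)

# Assumption: the file holds UTF-8 text. U+201D encodes to E2 80 9D.
utf8_bytes = "\u201d".encode("utf-8")

# Decoding those bytes as cp1252 fails, because 0x9d is unassigned there.
try:
    utf8_bytes.decode("cp1252")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
print("cp1252:", error_message)

# Decoding with the codec the data was written in succeeds.
decoded = utf8_bytes.decode("utf-8")
```

If the first line printed is not a UTF-8 codec, passing encoding='utf8' to open() explicitly is what keeps the result independent of the machine the script runs on.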


