UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

马国源

2023-03-14

Question:

I am trying to read Twitter data from a JSON file using Python 2.7.12.

The code I am using is this:

    import json
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    def get_tweets_from_file(file_name):
        tweets = []
        with open(file_name, 'rw') as twitter_file:
            for line in twitter_file:
                if line != '\r\n':
                    line = line.encode('ascii', 'ignore')
                    tweet = json.loads(line)
                    if u'info' not in tweet.keys():
                        tweets.append(tweet)
        return tweets

This is what I get:

    Traceback (most recent call last):
      File "twitter_project.py", line 100, in <module>
        main()                  
      File "twitter_project.py", line 95, in main
        tweets = get_tweets_from_dir(src_dir, dest_dir)
      File "twitter_project.py", line 59, in get_tweets_from_dir
        new_tweets = get_tweets_from_file(file_name)
      File "twitter_project.py", line 71, in get_tweets_from_file
        line = line.encode('ascii', 'ignore')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

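I went through all the answers to similar questions and came up with this code, which worked the last time I ran it. I have no idea why it isn't working now… Any help would be appreciated!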

Answer:

It doesn't help that you have sys.setdefaultencoding('utf-8') in there; it only confuses matters further. It is a nasty hack and you should remove it from your code.

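The error happens because line is a byte string (str) and you are calling encode() on it. encode() only makes sense on a Unicode string, so Python first tries to convert the byte string to Unicode using the default encoding (UTF-8 in your case, although it should be ASCII) and only then encodes it. That implicit decode is what raises the UnicodeDecodeError, and the 'ignore' argument does not apply to it. Either way, 0x80 is not valid ASCII and not a valid UTF-8 start byte, so the conversion fails.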
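To make the implicit step visible, here is a minimal sketch that spells out the two stages explicitly. The helper py2_style_encode and the sample line are hypothetical, written only for illustration; the helper approximates what Python 2 does behind the scenes and is not an actual library function.

```python
def py2_style_encode(byte_string, target, default_encoding='ascii'):
    """Rough emulation of Python 2's str.encode(): decode implicitly, then encode."""
    # Step 1 (hidden in Python 2): bytes -> unicode using the default encoding.
    # This decode is strict, so an undecodable byte raises UnicodeDecodeError.
    text = byte_string.decode(default_encoding)
    # Step 2: unicode -> bytes; the 'ignore' handler only applies here.
    return text.encode(target, 'ignore')


line = b'price 5\x80'  # hypothetical line containing the offending byte

failures = {}
for default in ('ascii', 'utf-8'):
    try:
        py2_style_encode(line, 'ascii', default)
    except UnicodeDecodeError as exc:
        # Record why the hidden decode step failed for this default encoding.
        failures[default] = exc.reason

print(failures)
```

Both default encodings fail in step 1, before the 'ignore' handler in step 2 is ever reached.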

0x80 is valid in some character sets. In windows-1252 (cp1252) it is €.
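As a quick check, the same byte can be decoded with a couple of single-byte codecs. This is only a sketch: the fact that a codec decodes the byte without raising does not prove it is the encoding the file was actually written in.

```python
byte = b'\x80'

# windows-1252 maps 0x80 to the euro sign (U+20AC).
euro = byte.decode('windows-1252')

# latin-1 also accepts the byte, but maps it to an invisible control character (U+0080).
control = byte.decode('latin-1')

print(repr(euro), repr(control))
```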

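The trick here is to understand the encoding of your data all the way through your code. At the moment, you are leaving too much to chance. The Unicode string type is a handy Python feature that lets you decode encoded strings and forget about the encoding until you need to write or transmit the data.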
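The idea of decoding once on the way in and encoding once on the way out can be sketched as follows. The sample bytes, the assumption that the input is windows-1252, and the choice of UTF-8 for output are all illustrative, not taken from the question.

```python
raw_bytes = b'Total: 5\x80'  # hypothetical input, assumed to be windows-1252

# 1. Decode once, at the input boundary.
text = raw_bytes.decode('windows-1252')

# 2. Work with Unicode text only; no encoding concerns in this part of the code.
shouted = text.upper()

# 3. Encode once, at the output boundary (UTF-8 chosen here as an example).
out_bytes = shouted.encode('utf-8')
```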

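Use the io module to open the file in text mode and let it decode the file as it reads. No more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I have set the encoding to windows-1252.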

    import io

    with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
        for line in twitter_file:
            # line is now a <type 'unicode'>
            tweet = json.loads(line)

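The io module also provides universal newlines. This means \r\n is detected as a line ending (and translated to \n), so you don't have to watch for it.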
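Putting the pieces together, the function from the question might look roughly like this. It is a sketch under assumptions: the encoding parameter and its windows-1252 default are guesses about the data, and the blank-line check with strip() is one possible replacement for the original '\r\n' comparison.

```python
import io
import json


def get_tweets_from_file(file_name, encoding='windows-1252'):
    """Return the parsed tweets in file_name, skipping blank and 'info' lines."""
    tweets = []
    # io.open decodes each line to text while reading, so no encode()/decode()
    # calls are needed inside the loop.
    with io.open(file_name, 'r', encoding=encoding) as twitter_file:
        for line in twitter_file:
            # Universal newlines turn '\r\n' into '\n'; strip() handles both.
            if not line.strip():
                continue
            tweet = json.loads(line)
            if u'info' not in tweet:
                tweets.append(tweet)
    return tweets
```

If the files turn out to use a different encoding, only the encoding argument needs to change; the rest of the function works on Unicode text either way.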
