'utf8' codec can't decode byte 0x96 in Python

徐奇

2023-03-14

Question:

I'm trying to check whether a certain word appears on the pages of many websites. The script runs fine for 15 sites and then stops.

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15344: invalid start byte

I searched Stack Overflow and found many similar questions, but I can't work out what is going wrong in my case.

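I'd like either to fix it or, if a site raises an error, to skip that site. Please keep the advice beginner-friendly; the code below took me a full day to write. By the way, the site the script stops on is http://www.homestead.com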

import re
import urllib

filetocheck = open("bloglistforcommenting", "r")
resultfile = open("finalfile", "w")

for countofsites in filetocheck.readlines():
    sitename = countofsites.strip()
    htmlfile = urllib.urlopen(sitename)
    # This line raises UnicodeDecodeError when the page is not valid UTF-8
    page = htmlfile.read().decode('utf8')
    match = re.search("Enter your name", page)
    if match:
        print "match found  : " + sitename
        resultfile.write(sitename + "\n")
    else:
        print "sorry did not find the pattern " + sitename

print "Finished Operations"

Following Mark's comment, I changed the code to use BeautifulSoup:

htmlfile = urllib.urlopen("http://www.homestead.com")
page = BeautifulSoup((''.join(htmlfile)))
print page.prettify()

Now I get this error:

page = BeautifulSoup((''.join(htmlfile)))
TypeError: 'module' object is not callable

I'm working from the Quick Start example at http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick%20Start. If I copy and paste that example, the code works fine.

I finally got it working. Thanks, everyone, for the help. Here is the final code.

import urllib
import re
from BeautifulSoup import BeautifulSoup

filetocheck = open("listfile","r")

resultfile = open("finalfile","w")
error ="for errors"

for countofsites in filetocheck.readlines():
    sitename = countofsites.strip()
    htmlfile = urllib.urlopen(sitename)
    page = BeautifulSoup(''.join(htmlfile))
    pagetwo = str(page)
    match = re.search("Enter YourName", pagetwo)
    if match:
        print "match found  : " + sitename
        resultfile.write(sitename + "\n")
    else:
        print "sorry did not find the pattern " + sitename

print "Finished Operations"
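
A note on the TypeError above: "'module' object is not callable" generally means the name being called is bound to a module, not to a class or function inside it. The failing import line is not shown in the question, so the cause here is an inference, but the final code imports the class with "from BeautifulSoup import BeautifulSoup", which is consistent with it. The sketch below reproduces the same mistake with the standard-library datetime module standing in for BeautifulSoup (that substitution is an assumption made so the example runs without third-party packages). It is written for Python 3, whereas the code in the question is Python 2.

```python
import datetime  # binds the name "datetime" to the module

try:
    datetime(2023, 3, 14)  # calls the module itself, not the class inside it
except TypeError as exc:
    message = str(exc)
    print(message)  # the text includes: 'module' object is not callable

# Importing the class from the module rebinds the name to something callable.
from datetime import datetime

stamp = datetime(2023, 3, 14)
print(stamp.year)
```

The same pattern applies to any module that shares its name with its main class.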

Answer:

Many web pages are encoded incorrectly. For parsing HTML, try BeautifulSoup, since it can handle many kinds of bad HTML found in the wild.

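Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.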

Emphasis mine.
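
For readers who want to see what the error means at the byte level, here is a minimal sketch, assuming the offending page was actually saved as Windows-1252 (a common situation, but not something the question confirms). In that encoding, byte 0x96 is an en dash, while in UTF-8 it can never start a character, hence "invalid start byte". The helper name decode_page and the sample bytes are hypothetical, and the code is written for Python 3, not the Python 2 used in the question.

```python
def decode_page(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacement)"


# Hypothetical page fragment: 0x96 is an en dash in Windows-1252,
# but it is not a valid start byte in UTF-8.
sample = b"Enter your name \x96 then press OK"
text, used = decode_page(sample)
print(used)
print("Enter your name" in text)
```

To skip a problem site instead of guessing its encoding, wrap the strict raw.decode("utf-8") call in try/except UnicodeDecodeError and continue to the next site in the except branch.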
