常见问题(Troubleshooting)

优质
小牛编辑
131浏览
2023-12-01

This section covers common problems people have with Beautiful Soup. 这一节是使用BeautifulSoup时会遇到的一些常见问题的解决方法。

为什么Beautiful Soup不能打印我的no-ASCII字符?

If you're getting errors that say: "'ascii' codec can't encode character 'x' in position y: ordinal not in range(128)", the problem is probably with your Python installation rather than with Beautiful Soup. Try printing out the non-ASCII characters without running them through Beautiful Soup and you should have the same problem. For instance, try running code like this:
如果你遇到这样的错误: "'ascii' codec can't encode character 'x' in position y: ordinal not in range(128)", 这个错误可能是Python的问题而不是BeautifulSoup。
(译者注:在已知文档编码类型的情况下,可以先将编码转换为unicode形式,在转换为utf-8编码,然后才传递给BeautifulSoup。 例如HTML的内容htm是GB2312编码:
htm=unicode(htm,'gb2312','ignore').encode('utf-8','ignore')
soup=BeautifulSoup(htm)
如果不知道编码的类型,可以使用chardet先检测一下文档的编码类型。chardet需要自己安装一下,在网上很容下到。)
试着不用Beautiful Soup而直接打印non-ASCII 字符,你也会遇到一样的问题。 例如,试着运行以下代码:

latin1word = 'Sacr\xe9 bleu!'
unicodeword = unicode(latin1word, 'latin-1')
print unicodeword

If this works but Beautiful Soup doesn't, there's probably a bug in Beautiful Soup. However, if this doesn't work, the problem's with your Python setup. Python is playing it safe and not sending non-ASCII characters to your terminal. There are two ways to override this behavior.
如果它没有问题而Beautiful Soup不行,这可能是BeautifulSoup的一个bug。 但是,如果这个也有问题,就是Python本身的问题。Python为了安全缘故不支持发送non-ASCII 到终端。有两种方法可以解决这个限制。

  1. The easy way is to remap standard output to a converter that's not afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.
    最简单的方式是将标准输出重新映射到一个转换器,不在意发送到终端的字符类型是ISO-Latin-1还是UTF-8字符串。

    import codecs
    import sys
    streamWriter = codecs.lookup('utf-8')[-1]
    sys.stdout = streamWriter(sys.stdout)
    

    codecs.lookup returns a number of bound methods and other objects related to a codec. The last one is a StreamWriter object capable of wrapping an output stream.
    codecs.lookup返回一些绑定的方法和其它和codec相关的对象。 最后一行是一个封装了输出流的StreamWriter对象。

  2. The hard way is to create a sitecustomize.py file in your Python installation which sets the default encoding to ISO-Latin-1 or to UTF-8. Then all your Python programs will use that encoding for standard output, without you having to do something for each program. In my installation, I have a /usr/lib/python/sitecustomize.py which looks like this:
    稍微困难点的方法是创建一个sitecustomize.py文件在你的Python安装中, 将默认编码设置为ISO-Latin-1或UTF-8。这样你所有的Python程序都会使用这个编码作为标准输出, 不用在每个程序里再设置一下。在我的安装中,我有一个 /usr/lib/python/sitecustomize.py,内容如下:

    import sys
    sys.setdefaultencoding("utf-8")
    

For more information about Python's Unicode support, look at Unicode for Programmers or End to End Unicode Web Applications in Python. Recipes 1.20 and 1.21 in the Python cookbook are also very helpful.
更多关于Python的Unicode支持的信息,参考 Unicode for Programmers or End to End Unicode Web Applications in Python。Python食谱的给的菜谱1.20和1.21也很有用。

Remember, even if your terminal display is restricted to ASCII, you can still use Beautiful Soup to parse, process, and write documents in UTF-8 and other encodings. You just can't print certain strings with print.
但是即使你的终端显示被限制为ASCII,你也可以使用BeautifulSoup以UTF-8和其它的编码类型来剖析,处理和修改文档。 只是对于某些字符,你不能使用print来输出。

Beautiful Soup 弄丢了我给的数据!为什么?为什么?????

Beautiful Soup can handle poorly-structured SGML, but sometimes it loses data when it gets stuff that's not SGML at all. This is not nearly as common as poorly-structured markup, but if you're building a web crawler or something you'll surely run into it.
Beautiful Soup可以处理结构不太规范的SGML,但是给它的材料非常不规范, 它会丢失数据。如果你是在写一个网络爬虫之类的程序,你肯定会遇到这种,不太常见的结构有问题的文档。

The only solution is to sanitize the data ahead of time with a regular expression. Here are some examples that I and Beautiful Soup users have discovered:
唯一的解决方法是先使用正则表达式来规范数据。 下面是一些我和一些Beautiful Soup的使用者发现的例子:

  • Beautiful Soup treats ill-formed XML definitions as data. However, it loses well-formed XML definitions that don't actually exist:
    Beautiful Soup 将不规范德XML定义处理为数据(data)。然而,它丢失了那些实际上不存在的良好的XML定义:

    from BeautifulSoup import BeautifulSoup
    BeautifulSoup("< ! FOO @=>")
    # < ! FOO @=>
    BeautifulSoup("<b><!FOO>!</b>")
    # <b>!</b>
    
  • If your document starts a declaration and never finishes it, Beautiful Soup assumes the rest of your document is part of the declaration. If the document ends in the middle of the declaration, Beautiful Soup ignores the declaration totally. A couple examples:
    如果你的文档开始了声明但却没有关闭,Beautiful Soup假定你的文档的剩余部分都是这个声明的一部分。 如果文档在声明的中间结束了,Beautiful Soup会忽略这个声明。如下面这个例子:

    from BeautifulSoup import BeautifulSoup
    
    BeautifulSoup("foo<!bar") 
    # foo 
    
    soup = BeautifulSoup("<html>foo<!bar</html>") 
    print soup.prettify()
    # <html>
    #  foo<!bar</html>
    # </html>
    

    There are a couple ways to fix this; one is detailed here.
    有几种方法来处理这种情况;其中一种在 这里有详细介绍。

    Beautiful Soup also ignores an entity reference that's not finished by the end of the document:
    Beautiful Soup 也会忽略实体引用,如果它没有在文档结束的时候关闭:

    BeautifulSoup("&lt;foo&gt")
    # &lt;foo
    

    I've never seen this in real web pages, but it's probably out there somewhere. 我从来没有在实际的网页中遇到这种情况,但是也许别的地方会出现。

  • A malformed comment will make Beautiful Soup ignore the rest of the document. This is covered as the example in Sanitizing Bad Data with Regexps.
    一个畸形的注释会是Beautiful Soup回来文档的剩余部分。在使用正则规范数据这里有详细的例子。

The parse tree built by the BeautifulSoup class offends my senses!
BeautifulSoup类构建的剖析树让我感到头痛。

To get your markup parsed differently, check out
尝试一下别的剖析方法,试试 其他内置的剖析器,或者 自定义一个剖析器.

Beautiful Soup 太慢了!

Beautiful Soup will never run as fast as ElementTree or a custom-built SGMLParser subclass. ElementTree is written in C, and SGMLParser lets you write your own mini-Beautiful Soup that only does what you want. The point of Beautiful Soup is to save programmer time, not processor time.
Beautiful Soup 不会像ElementTree或者自定义的SGMLParser子类一样快。 ElementTree是用C写的,并且做那些你想要做的事。 Beautiful Soup是用来节省程序员的时间,而不是处理器的时间。

That said, you can speed up Beautiful Soup quite a lot by only parsing the parts of the document you need, and you can make unneeded objects get garbage-collected by using extract.
但是你可以加快Beautiful Soup通过解析部分的文档,