Advanced Topics

小牛编辑
2023-12-01

That does it for the basic usage of Beautiful Soup. But HTML and XML are tricky, and in the real world they're even trickier. So Beautiful Soup keeps some extra tricks of its own up its sleeve.

Generators

The search methods described above are driven by generator methods. You can use these methods yourself: they're called nextGenerator, previousGenerator, nextSiblingGenerator, previousSiblingGenerator, and parentGenerator. Tag and parser objects also have childGenerator and recursiveChildGenerator available.

Here's a simple example that strips HTML tags out of a document by iterating over the document and collecting all the strings.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""<div>You <i>bet</i>
<a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>
rocks!</div>""")

''.join([e for e in soup.recursiveChildGenerator() 
         if isinstance(e,unicode)])
# u'You bet\nBeautifulSoup\nrocks!'

Here's a more complex example that uses recursiveChildGenerator to iterate over the elements of a document, printing each one as it gets it.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("1<a>2<b>3")
g = soup.recursiveChildGenerator()
while True:
    try:
        print g.next()
    except StopIteration:
        break
# 1
# <a>2<b>3</b></a>
# 2
# <b>3</b>
# 3
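
The order in which recursiveChildGenerator visits elements can be illustrated without Beautiful Soup itself. The following is a minimal Python 3 sketch, not the library's implementation: Node and recursive_children are hypothetical names invented for this example, and plain strings stand in for the text nodes of a real parse tree.

```python
class Node:
    """A toy element: a tag name plus a list of children (Nodes or strings)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def __repr__(self):
        return "<%s>" % self.name


def recursive_children(node):
    """Yield every descendant of node in document order (pre-order)."""
    for child in node.children:
        yield child
        if isinstance(child, Node):
            # Descend into the element before moving on to its next sibling.
            yield from recursive_children(child)


# Same shape as the "1<a>2<b>3" document above.
soup = Node("root", ["1", Node("a", ["2", Node("b", ["3"])])])

order = [repr(e) if isinstance(e, Node) else e for e in recursive_children(soup)]
print(order)   # ['1', '<a>', '2', '<b>', '3']

# Stripping the tags is just a matter of keeping the strings.
text = "".join(e for e in recursive_children(soup) if isinstance(e, str))
print(text)    # 123
```

Each element is yielded before its own children, which is why the <a> tag shows up before the "2" it contains.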

Other Built-In Parsers

Beautiful Soup comes with three parser classes besides BeautifulSoup and BeautifulStoneSoup:

  • MinimalSoup is a subclass of BeautifulSoup. It knows most facts about HTML like which tags are self-closing, the special behavior of the <SCRIPT> tag, the possibility of an encoding mentioned in a <META> tag, etc. But it has no nesting heuristics at all. So it doesn't know that <LI> tags go underneath <UL> tags and not the other way around. It's useful for parsing pathologically bad markup, and for subclassing.

  • ICantBelieveItsBeautifulSoup is also a subclass of BeautifulSoup. It has HTML heuristics that conform more closely to the HTML standard, but ignore how HTML is used in the real world. For instance, it's valid HTML to nest <B> tags, but in the real world a nested <B> tag almost always means that the author forgot to close the first <B> tag. If you run into someone who actually nests <B> tags, then you can use ICantBelieveItsBeautifulSoup.

  • BeautifulSOAP is a subclass of BeautifulStoneSoup. It's useful for parsing documents like SOAP messages, which use a subelement when they could just use an attribute of the parent element. Here's an example:

    from BeautifulSoup import BeautifulStoneSoup, BeautifulSOAP
    xml = "<doc><tag>subelement</tag></doc>"
    print BeautifulStoneSoup(xml)
    # <doc><tag>subelement</tag></doc>
    print BeautifulSOAP(xml)
    # <doc tag="subelement"><tag>subelement</tag></doc>
    

    With BeautifulSOAP you can access the contents of the <TAG> tag without descending into the tag.
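
The transformation BeautifulSOAP performs can be approximated with the Python 3 standard library. This is only a sketch of the idea under stated assumptions, not the class's actual code: promote_text_children is a hypothetical helper, and it copies a child element's text into an attribute of its parent whenever the child contains nothing but text and the parent has no attribute of that name yet.

```python
import xml.etree.ElementTree as ET


def promote_text_children(elem):
    """Copy text-only child elements into attributes of their parent."""
    for child in elem:
        promote_text_children(child)          # handle deeper levels first
        is_text_only = len(child) == 0 and child.text
        if is_text_only and child.tag not in elem.attrib:
            elem.set(child.tag, child.text)   # the child itself stays in place
    return elem


doc = ET.fromstring("<doc><tag>subelement</tag></doc>")
promote_text_children(doc)

print(ET.tostring(doc, encoding="unicode"))
# <doc tag="subelement"><tag>subelement</tag></doc>
print(doc.get("tag"))
# subelement
```

The value is now reachable from the parent element directly, without descending into the child.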

Customizing the Parser

When the built-in parser classes won't do the job, you need to customize. This usually means customizing the lists of nestable and self-closing tags. You can customize the list of self-closing tags by passing a selfClosingTags argument into the soup constructor. To customize the lists of nestable tags, though, you'll have to subclass.

The most useful classes to subclass are MinimalSoup (for HTML) and BeautifulStoneSoup (for XML). I'm going to show you how to override RESET_NESTING_TAGS and NESTABLE_TAGS in a subclass. This is the most complicated part of Beautiful Soup and I'm not going to explain it very well here, but I'll get something written and then I can improve it with feedback.

When Beautiful Soup is parsing a document, it keeps a stack of open tags. Whenever it sees a new start tag, it tosses that tag on top of the stack. But before it does, it might close some of the open tags and remove them from the stack. Which tags it closes depends on the qualities of tag it just found, and the qualities of the tags in the stack.

The best way to explain it is through example. Let's say the stack looks like ['html', 'p', 'b'], and Beautiful Soup encounters a <P> tag. If it just tossed another 'p' onto the stack, this would imply that the second <P> tag is within the first <P> tag, not to mention the open <B> tag. But that's not the way <P> tags work. You can't stick a <P> tag inside another <P> tag. A <P> tag isn't "nestable" at all.

So when Beautiful Soup encounters a <P> tag, it closes and pops all the tags up to and including the previously encountered tag of the same type. This is the default behavior, and this is how BeautifulStoneSoup treats every tag. It's what you get when a tag is not mentioned in either NESTABLE_TAGS or RESET_NESTING_TAGS. It's also what you get when a tag shows up in RESET_NESTING_TAGS but has no entry in NESTABLE_TAGS, the way the <P> tag does.

from BeautifulSoup import BeautifulSoup
BeautifulSoup.RESET_NESTING_TAGS['p'] == None
# True
BeautifulSoup.NESTABLE_TAGS.has_key('p')
# False

print BeautifulSoup("<html><p>Para<b>one<p>Para two")
# <html><p>Para<b>one</b></p><p>Para two</p></html>
#                      ^---^--The second <p> tag made those two tags get closed

Let's say the stack looks like ['html', 'span', 'b'], and Beautiful Soup encounters a <SPAN> tag. Now, <SPAN> tags can contain other <SPAN> tags without limit, so there's no need to pop up to the previous <SPAN> tag when you encounter one. This is represented by mapping the tag name to an empty list in NESTABLE_TAGS. This kind of tag should not be mentioned in RESET_NESTING_TAGS: there are no circumstances when encountering a <SPAN> tag would cause any tags to be popped.

from BeautifulSoup import BeautifulSoup
BeautifulSoup.NESTABLE_TAGS['span']
# []
BeautifulSoup.RESET_NESTING_TAGS.has_key('span')
# False

print BeautifulSoup("<html><span>Span<b>one<span>Span two")
# <html><span>Span<b>one<span>Span two</span></b></span></html>

Third example: suppose the stack looks like ['ol','li','ul']: that is, we've got an ordered list, the first element of which contains an unordered list. Now suppose Beautiful Soup encounters a <LI> tag. It shouldn't pop up to the first <LI> tag, because this new <LI> tag is part of the unordered sublist. It's okay for an <LI> tag to be inside another <LI> tag, so long as there's a <UL> or <OL> tag in the way.

from BeautifulSoup import BeautifulSoup
print BeautifulSoup("<ol><li>1<ul><li>A").prettify()
# <ol>
#  <li>
#   1
#   <ul>
#    <li>
#     A
#    </li>
#   </ul>
#  </li>
# </ol>

But if there is no intervening <UL> or <OL>, then one <LI> tag can't be underneath another:

print BeautifulSoup("<ol><li>1<li>A").prettify()
# <ol>
#  <li>
#   1
#  </li>
#  <li>
#   A
#  </li>
# </ol>

We tell Beautiful Soup to treat <LI> tags this way by putting "li" in RESET_NESTING_TAGS, and by giving "li" a NESTABLE_TAGS entry that lists the tags under which it can nest.

BeautifulSoup.RESET_NESTING_TAGS.has_key('li')
# True
BeautifulSoup.NESTABLE_TAGS['li']
# ['ul', 'ol']

This is also how we handle the nesting of table tags:

BeautifulSoup.NESTABLE_TAGS['td']
# ['tr']
BeautifulSoup.NESTABLE_TAGS['tr']
# ['table', 'tbody', 'tfoot', 'thead']
BeautifulSoup.NESTABLE_TAGS['tbody']
# ['table']
BeautifulSoup.NESTABLE_TAGS['thead']
# ['table']
BeautifulSoup.NESTABLE_TAGS['tfoot']
# ['table']
BeautifulSoup.NESTABLE_TAGS['table']
# []

That is: <TD> tags can be nested within <TR> tags. <TR> tags can be nested within <TABLE>, <TBODY>, <TFOOT>, and <THEAD> tags. <TBODY>, <TFOOT>, and <THEAD> tags can be nested in <TABLE> tags, and <TABLE> tags can be nested in other <TABLE> tags. If you know about HTML tables, these rules should already make sense to you.

One more example. Say the stack looks like ['html', 'p', 'table'] and Beautiful Soup encounters a <P> tag.

At first glance, this looks just like the example where the stack is ['html', 'p', 'b'] and Beautiful Soup encounters a <P> tag. In that example, we closed the <B> and <P> tags, because you can't have one paragraph inside another.

Except... you can have a paragraph that contains a table, and then the table contains a paragraph. So the right thing to do is to not close any of these tags. Beautiful Soup does the right thing:

from BeautifulSoup import BeautifulSoup
print BeautifulSoup("<p>Para 1<b><p>Para 2").prettify()
# <p>
#  Para 1
#  <b>
#  </b>
# </p>
# <p>
#  Para 2
# </p>

print BeautifulSoup("<p>Para 1<table><p>Para 2").prettify()
# <p>
#  Para 1
#  <table>
#   <p>
#    Para 2
#   </p>
#  </table>
# </p>

What's the difference? The difference is that <TABLE> is in RESET_NESTING_TAGS and <B> is not. A tag that's in RESET_NESTING_TAGS doesn't get popped off the stack as easily as a tag that's not.
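
The popping rules walked through above can be condensed into a small model. What follows is a Python 3 sketch that loosely follows those rules; it is not Beautiful Soup's actual code. The two tables are trimmed-down, hypothetical copies of the real ones, and push_tag is a name invented for this example. It works on a plain list of tag names rather than on Tag objects.

```python
# Hypothetical, trimmed-down tables in the spirit of the ones described above.
NESTABLE_TAGS = {
    "span": [], "table": [], "ul": [], "ol": [],
    "li": ["ul", "ol"],
    "tr": ["table", "tbody", "tfoot", "thead"],
    "td": ["tr"],
}
RESET_NESTING_TAGS = {"p", "li", "ul", "ol", "table", "tr", "td"}


def push_tag(stack, name):
    """Return the open-tag stack after a start tag called name is seen."""
    triggers = NESTABLE_TAGS.get(name)       # None means "not nestable"
    resets = name in RESET_NESTING_TAGS
    keep = len(stack)                        # how many open tags survive
    for i in range(len(stack) - 1, -1, -1):
        open_name = stack[i]
        if triggers is None and open_name == name:
            keep = i                         # close up to and including it
            break
        if triggers is not None and open_name in triggers:
            keep = i + 1                     # a tag we may nest under: stop here
            break
        if triggers is None and resets and open_name in RESET_NESTING_TAGS:
            keep = i + 1                     # another reset tag shields the rest
            break
    return stack[:keep] + [name]


print(push_tag(["html", "p", "b"], "p"))         # ['html', 'p']
print(push_tag(["html", "span", "b"], "span"))   # ['html', 'span', 'b', 'span']
print(push_tag(["ol", "li", "ul"], "li"))        # ['ol', 'li', 'ul', 'li']
print(push_tag(["ol", "li"], "li"))              # ['ol', 'li']
print(push_tag(["html", "p", "table"], "p"))     # ['html', 'p', 'table', 'p']
```

The five calls reproduce the five stack examples from this section, including the last one, where the open <TABLE> keeps the outer <P> from being closed.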

Okay, hopefully you get the idea. Here's the NESTABLE_TAGS for the BeautifulSoup class. Correlate this with what you know about HTML, and you should be able to create your own NESTABLE_TAGS for bizarre HTML documents that don't follow the normal rules, and for other XML dialects that have different nesting rules.

from BeautifulSoup import BeautifulSoup
nestKeys = BeautifulSoup.NESTABLE_TAGS.keys()
nestKeys.sort()
for key in nestKeys:
    print "%s: %s" % (key, BeautifulSoup.NESTABLE_TAGS[key])
# bdo: []
# blockquote: []
# center: []
# dd: ['dl']
# del: []
# div: []
# dl: []
# dt: ['dl']
# fieldset: []
# font: []
# ins: []
# li: ['ul', 'ol']
# object: []
# ol: []
# q: []
# span: []
# sub: []
# sup: []
# table: []
# tbody: ['table']
# td: ['tr']
# tfoot: ['table']
# th: ['tr']
# thead: ['table']
# tr: ['table', 'tbody', 'tfoot', 'thead']
# ul: []

And here's BeautifulSoup's RESET_NESTING_TAGS. Only the keys are important: RESET_NESTING_TAGS is actually a list, put into the form of a dictionary for quick random access.

from BeautifulSoup import BeautifulSoup
resetKeys = BeautifulSoup.RESET_NESTING_TAGS.keys()
resetKeys.sort()
resetKeys
# ['address', 'blockquote', 'dd', 'del', 'div', 'dl', 'dt', 'fieldset', 
#  'form', 'ins', 'li', 'noscript', 'ol', 'p', 'pre', 'table', 'tbody',
#  'td', 'tfoot', 'th', 'thead', 'tr', 'ul']
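
The "list in the form of a dictionary" idea is easy to reproduce. The snippet below is a Python 3 sketch in which build_tag_map is a hypothetical helper (an assumption for this example, not a documented part of the library): every tag name becomes a key, the values carry no information, and membership tests are fast.

```python
def build_tag_map(names, default=None):
    """Turn a list of tag names into a dict whose values are all `default`."""
    return dict.fromkeys(names, default)


reset_nesting = build_tag_map(["address", "blockquote", "li", "p", "table"])

print("p" in reset_nesting)          # True
print(reset_nesting["p"] is None)    # True
print("span" in reset_nesting)       # False
```

In modern Python a set would express the same thing more directly; the dictionary form is shown because it is what the class attributes described here use.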

Since you're subclassing anyway, you might as well override SELF_CLOSING_TAGS while you're at it. It's a dictionary that maps self-closing tag names to any values at all (like RESET_NESTING_TAGS, it's actually a list in the form of a dictionary). Then you won't have to pass that list in to the constructor (as selfClosingTags) every time you instantiate your subclass.

Entity Conversion

When you parse a document, you can convert HTML or XML entity references to the corresponding Unicode characters. This code converts the HTML entity "&eacute;" to the Unicode character LATIN SMALL LETTER E WITH ACUTE, and the numeric entity "&#101;" to the Unicode character LATIN SMALL LETTER E.

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

That's if you use HTML_ENTITIES (which is just the string "html"). If you use XML_ENTITIES (or the string "xml"), then only numeric entities and the five XML entities ("&quot;", "&apos;", "&gt;", "&lt;", and "&amp;") get converted. If you use ALL_ENTITIES (or the list ["xml", "html"]), then both kinds of entities will be converted. This last one is necessary because &apos; is an XML entity but not an HTML entity.

BeautifulStoneSoup("Sacr&eacute; bleu!", 
                   convertEntities=BeautifulStoneSoup.XML_ENTITIES)
# Sacr&eacute; bleu!

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Il a dit, &lt;&lt;Sacr&eacute; bleu!&gt;&gt;", 
                   convertEntities=BeautifulStoneSoup.XML_ENTITIES)
# Il a dit, <<Sacr&eacute; bleu!>>
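
The difference between the three settings can be modelled with the Python 3 standard library alone. This is a sketch under stated assumptions, not how Beautiful Soup does it: convert_entities and XML_NAMES are hypothetical names, and the HTML entity table comes from html.entities.name2codepoint, which (like the HTML entity set described above) has no entry for "apos".

```python
import re
from html.entities import name2codepoint

# The five entities that XML predefines.
XML_NAMES = {"quot": '"', "apos": "'", "gt": ">", "lt": "<", "amp": "&"}
ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def convert_entities(text, kinds):
    """Replace entity references; kinds is a list containing "xml" and/or "html"."""
    def replace(match):
        ref = match.group(1)
        if ref[:2] in ("#x", "#X"):
            return chr(int(ref[2:], 16))      # hexadecimal numeric entity
        if ref.startswith("#"):
            return chr(int(ref[1:]))          # decimal numeric entity
        if "xml" in kinds and ref in XML_NAMES:
            return XML_NAMES[ref]
        if "html" in kinds and ref in name2codepoint:
            return chr(name2codepoint[ref])
        return match.group(0)                 # unknown here: leave untouched
    return ENTITY.sub(replace, text)


print(convert_entities("Sacr&eacute; bl&#101;u!", ["html"]))
# Sacré bleu!
print(convert_entities("Il a dit, &lt;&lt;Sacr&eacute; bleu!&gt;&gt;", ["xml"]))
# Il a dit, <<Sacr&eacute; bleu!>>
print(convert_entities("&apos;", ["html"]), convert_entities("&apos;", ["xml", "html"]))
# &apos; '
```

The last line shows why a combined setting is useful: "&apos;" survives the HTML-only conversion and is only resolved once the XML entities are included.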

If you tell Beautiful Soup to convert XML or HTML entities into the corresponding Unicode characters, then Windows-1252 characters (like Microsoft smart quotes) also get transformed into Unicode characters. This happens even if you told Beautiful Soup to convert those characters to entities.

from BeautifulSoup import BeautifulStoneSoup
smartQuotesAndEntities = "Il a dit, \x8BSacr&eacute; bleu!\x9b"

BeautifulStoneSoup(smartQuotesAndEntities, smartQuotesTo="html").contents[0]
# u'Il a dit, &lsaquo;Sacr&eacute; bleu!&rsaquo;'

BeautifulStoneSoup(smartQuotesAndEntities, convertEntities="html", 
                   smartQuotesTo="html").contents[0]
# u'Il a dit, \u2039Sacr\xe9 bleu!\u203a'

BeautifulStoneSoup(smartQuotesAndEntities, convertEntities="xml", 
                   smartQuotesTo="xml").contents[0]
# u'Il a dit, \u2039Sacr&eacute; bleu!\u203a'

It doesn't make sense to create new HTML/XML entities while you're busy turning all the existing entities into Unicode characters.

Sanitizing Bad Data with Regular Expressions

Beautiful Soup does pretty well at handling bad markup when "bad markup" means tags in the wrong places. But sometimes the markup is just malformed, and the underlying parser can't handle it. So Beautiful Soup runs regular expressions against an input document before trying to parse it.

By default, Beautiful Soup uses regular expressions and replacement functions to do search-and-replace on input documents. It finds self-closing tags that look like <BR/>, and changes them to look like <BR />. It finds declarations that have extraneous whitespace, like <! --Comment-->, and removes the whitespace: <!--Comment-->.

If you have bad markup that needs fixing in some other way, you can pass your own list of (regular expression, replacement function) tuples into the soup constructor, as the markupMassage argument.

Let's take an example: a page that has a malformed comment. The underlying SGML parser can't cope with this, and ignores the comment and everything afterwards:

from BeautifulSoup import BeautifulSoup
badString = "Foo<!-This comment is malformed.-->Bar<br/>Baz"
BeautifulSoup(badString)
# Foo

Let's fix it up with a regular expression and a function:

import re
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
BeautifulSoup(badString, markupMassage=myMassage)
# Foo<!--This comment is malformed.-->Bar

Oops, we're still missing the <BR> tag. Our markupMassage overrides the parser's default massage, so the default search-and-replace functions don't get run. The parser makes it past the comment, but it dies at the malformed self-closing tag. Let's add our new massage function to the default list, so we run all the functions.

import copy
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz

Now we've got it all.
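
What the massage step does to the markup before parsing can be reproduced with nothing but the re module. The following Python 3 sketch is illustrative only: massage is a hypothetical helper, and default_fixes holds two patterns written to match the default behaviour described above (a space before "/>" and no whitespace after "<!"), which is an assumption about their exact form.

```python
import re


def massage(markup, fixes):
    """Apply each (compiled regex, replacement function) pair in order."""
    for pattern, replacement in fixes:
        markup = pattern.sub(replacement, markup)
    return markup


# Assumed stand-ins for the default fixes described in this section.
default_fixes = [
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]
# The custom fix from the example above: repair a comment opened with "<!-".
my_fixes = [(re.compile(r"<!-([^-])"), lambda m: "<!--" + m.group(1))]

bad = "Foo<!-This comment is malformed.-->Bar<br/>Baz"

print(massage(bad, my_fixes))
# Foo<!--This comment is malformed.-->Bar<br/>Baz
print(massage(bad, default_fixes + my_fixes))
# Foo<!--This comment is malformed.-->Bar<br />Baz
```

With only the custom fix applied, the self-closing tag is left in the form the parser chokes on, which is why the custom list has to extend the defaults rather than replace them.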

If you know for a fact that your markup doesn't need any regular expressions run on it, you can get a faster startup time by passing in False for markupMassage.

Fun With SoupStrainers

Recall that all the search methods take more or less the same arguments. Behind the scenes, your arguments to a search method get transformed into a SoupStrainer object. If you call one of the methods that returns a list (like findAll), the SoupStrainer object is made available as the source property of the resulting list.

from BeautifulSoup import BeautifulStoneSoup
xml = '<person><parent rel="mother">'
xmlSoup = BeautifulStoneSoup(xml)
results = xmlSoup.findAll(rel='mother')

results.source
# <BeautifulSoup.SoupStrainer instance at 0xb7e0158c>
str(results.source)
# "None|{'rel': 'mother'}"

The SoupStrainer constructor takes most of the same arguments as find: name, attrs, text, and **kwargs. You can pass in a SoupStrainer as the name argument to any search method:

xmlSoup.findAll(results.source) == results
# True

from BeautifulSoup import SoupStrainer
customStrainer = SoupStrainer(rel='mother')
xmlSoup.findAll(customStrainer) == results
#  True

Yeah, who cares, right? You can carry around a method call's arguments in many other ways. But another thing you can do with SoupStrainer is pass it into the soup constructor to restrict the parts of the document that actually get parsed. That brings us to the next section:

Improving Performance by Parsing Only Part of the Document

Beautiful Soup turns every element of a document into a Python object and connects it to a bunch of other Python objects. If you only need a subset of the document, this is really slow. But you can pass in a SoupStrainer as the parseOnlyThese argument to the soup constructor. Beautiful Soup checks each element against the SoupStrainer, and only if it matches is the element turned into a Tag or NavigableString, and added to the tree.

If an element is added to the tree, then so are its children, even if they wouldn't have matched the SoupStrainer on their own. This lets you parse only the chunks of a document that contain the data you want.

Here's a pretty varied document:

doc = '''Bob reports <a href="http://www.bob.com/">success</a>
with his plasma breeding <a
href="http://www.bob.com/plasma">experiments</a>. <i>Don't get any on
us, Bob!</i>

<br><br>Ever hear of annular fusion? The folks at <a
href="http://www.boogabooga.net/">BoogaBooga</a> sure seem obsessed
with it. Secret project, or <b>WEB MADNESS?</b> You decide!'''

Here are several different ways of parsing the document into soup, depending on which parts you want. All of these are faster and use less memory than parsing the whole document and then using the same SoupStrainer to pick out the parts you want.

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

links = SoupStrainer('a')
[tag for tag in BeautifulSoup(doc, parseOnlyThese=links)]
# [<a href="http://www.bob.com/">success</a>, 
#  <a href="http://www.bob.com/plasma">experiments</a>, 
#  <a href="http://www.boogabooga.net/">BoogaBooga</a>]

linksToBob = SoupStrainer('a', href=re.compile('bob.com/'))
[tag for tag in BeautifulSoup(doc, parseOnlyThese=linksToBob)]
# [<a href="http://www.bob.com/">success</a>, 
#  <a href="http://www.bob.com/plasma">experiments</a>]

mentionsOfBob = SoupStrainer(text=re.compile("Bob"))
[text for text in BeautifulSoup(doc, parseOnlyThese=mentionsOfBob)]
# [u'Bob reports ', u"Don't get any on\nus, Bob!"]

allCaps = SoupStrainer(text=lambda t: t.upper() == t)
[text for text in BeautifulSoup(doc, parseOnlyThese=allCaps)]
# [u'. ', u'\n', u'WEB MADNESS?']

There is one major difference between the SoupStrainer you pass into a search method and the one you pass into a soup constructor. Recall that the name argument can take a function whose argument is a Tag object. You can't do this for a SoupStrainer's name, because the SoupStrainer is used to decide whether or not a Tag object should be created in the first place. You can pass in a function for a SoupStrainer's name, but it can't take a Tag object: it can only take the tag name and a map of arguments.

shortWithNoAttrs = SoupStrainer(lambda name, attrs: \
                                len(name) == 1 and not attrs)
[tag for tag in BeautifulSoup(doc, parseOnlyThese=shortWithNoAttrs)]
# [<i>Don't get any on us, Bob!</i>, 
#  <b>WEB MADNESS?</b>]
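
Why the function can only see a tag name and its attributes becomes clearer when the filtering is done at parse time. Here is a Python 3 sketch built on the standard library's html.parser; it is an analogy to parseOnlyThese, not Beautiful Soup's mechanism, and StrainingParser, strain and VOID_TAGS are hypothetical names. The decision to keep an element is made in handle_starttag, before any object for that element exists.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never get an end tag


class StrainingParser(HTMLParser):
    """Keep only elements accepted by want(name, attrs); skip everything else."""

    def __init__(self, want):
        super().__init__()
        self.want = want    # callable taking a tag name and a dict of attributes
        self.depth = 0      # greater than 0 while inside a kept element
        self.kept = []      # one [name, attrs, text] record per kept element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1            # a child of a kept element
        elif self.want(tag, attrs):
            self.kept.append([tag, attrs, ""])
            if tag not in VOID_TAGS:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.kept[-1][2] += data       # text of a kept element or its children


def strain(markup, want):
    """Return the text of every element that want() accepted."""
    parser = StrainingParser(want)
    parser.feed(markup)
    parser.close()
    return [text for _name, _attrs, text in parser.kept]


doc = '''Bob reports <a href="http://www.bob.com/">success</a>
with his plasma breeding <a
href="http://www.bob.com/plasma">experiments</a>. <i>Don't get any on
us, Bob!</i>

<br><br>Ever hear of annular fusion? The folks at <a
href="http://www.boogabooga.net/">BoogaBooga</a> sure seem obsessed
with it. Secret project, or <b>WEB MADNESS?</b> You decide!'''

links_to_bob = strain(
    doc, lambda name, attrs: name == "a" and "bob.com/" in (attrs.get("href") or ""))
print(links_to_bob)      # ['success', 'experiments']

short_no_attrs = strain(doc, lambda name, attrs: len(name) == 1 and not attrs)
print(short_no_attrs)    # ["Don't get any on\nus, Bob!", 'WEB MADNESS?']
```

Everything outside a kept element is discarded as soon as it is seen, which is where the time and memory savings come from.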

Improving Memory Usage with extract

When Beautiful Soup parses a document, it loads into memory a large, densely connected data structure. If you just need a string from that data structure, you might think that you can grab the string and leave the rest of it to be garbage collected. Not so. That string is a NavigableString object. It's got a parent member that points to a Tag object, which points to other Tag objects, and so on. So long as you hold on to any part of the tree, you're keeping the whole thing in memory.

The extract method breaks those connections. If you call extract on the string you need, it gets disconnected from the rest of the parse tree. The rest of the tree can then go out of scope and be garbage collected, while you use the string for something else. If you just need a small part of the tree, you can call extract on its top-level Tag and let the rest of the tree get garbage collected.
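
The effect of the parent pointer on garbage collection can be observed directly. The following Python 3 sketch uses a toy Node class (hypothetical, and far simpler than a real parse tree, which also links siblings and neighbouring elements) together with weakref to check whether the root of the tree is still alive.

```python
import gc
import weakref


class Node:
    """A toy tree node with the parent/children links described above."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def extract(self):
        """Disconnect this node from its parent and return it."""
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        return self


def build():
    root = Node("html")
    body = root.append(Node("body"))
    leaf = body.append(Node("the string we want"))
    return root, leaf


root, leaf = build()
root_ref = weakref.ref(root)    # lets us watch the root without keeping it alive

del root
gc.collect()
alive_while_attached = root_ref() is not None
print(alive_while_attached)     # True: leaf.parent still leads back to the root

leaf.extract()
gc.collect()
alive_after_extract = root_ref() is not None
print(alive_after_extract)      # False: nothing reachable refers to the tree now
```

Holding on to the leaf alone was enough to keep the whole tree in memory until extract cut the link.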

This works the other way, too. If there's a big chunk of the document you don't need, you can call extract to rip it out of the tree, then abandon it to be garbage collected while retaining control of the (smaller) tree.

If you find yourself destroying big chunks of the tree, you might have been able to save time by not parsing that part of the tree in the first place.