Searching Within the Parse Tree

小牛编辑

2023-12-01

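The methods described above, findAll and find, start at some point in the parse tree and work downward. They recursively iterate through an object's contents until they reach the bottom.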

This means you cannot call these methods on NavigableString objects, because a NavigableString has no contents: it is always a leaf of the parse tree.

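But searching downward is not the only way to iterate through the parse tree. The section on navigating the parse tree covered several other members: parent, nextSibling, and so on. Each of these has two corresponding methods: one that works like findAll, and one that works like find. Since NavigableString objects also support these members, you can call these methods on them just as you would on a Tag.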

Why is this useful? Because sometimes you cannot use findAll or find to reach the Tag or NavigableString you want. For instance, consider the following HTML document:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('''<ul>
 <li>An unrelated list
</ul>

<h1>Heading</h1>
<p>This is <b>the list you want</b>:</p>
<ul><li>The data you want</ul>''')

There are many ways to navigate to the <LI> tag that contains the data you want. The most obvious is this:

soup('li', limit=2)[1]
# <li>The data you want</li>

Clearly, this is not a very stable way to get that <LI> tag. If you only scrape the page once, it doesn't matter. But if you are going to scrape it many times over a long period, it becomes a concern. If the irrelevant list grows another <LI> tag, you'll get that tag instead of the one you want, and your script will break or give the wrong data.

soup('ul', limit=2)[1].li
# <li>The data you want</li>

That's a little better, because it can survive changes to the irrelevant list. But if the document grows another irrelevant list at the top, you'll get the first <LI> tag of that list instead of the one you want. A more reliable way of referring to the <UL> tag you want would better reflect that tag's place in the structure of the document.

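Looking at that HTML, you might think of the list you want as "the <UL> tag beneath the <H1> tag". The problem is that the <UL> tag is not inside the <H1> tag; it just comes after it. Getting the <H1> tag is easy, but you cannot get from there to the <UL> tag with find or findAll (first and fetch under their older names), because those methods only search the contents of the <H1> tag. You need to navigate to the <UL> tag with the next or nextSibling members: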

s = soup.h1
while getattr(s, 'name', None) != 'ul':
    s = s.nextSibling
s.li
# <li>The data you want</li>

Or, if you think this might be more stable:

s = soup.find(text='Heading')
while getattr(s, 'name', None) != 'ul':
    s = s.next
s.li
# <li>The data you want</li>

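But that is more trouble than you should have to go through. The methods in this section provide a useful shorthand. You can use them whenever you would otherwise write a while loop over one of the navigation members. Given a starting point in the tree, they navigate it in some way and keep track of the Tag and NavigableString objects that match the criteria you specify. Instead of the first loop in the example above, you can write this: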

soup.h1.findNextSibling('ul').li
# <li>The data you want</li>

Instead of the second loop, you can write this:

soup.find(text='Heading').findNext('ul').li
# <li>The data you want</li>

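The loops are replaced with calls to findNextSibling and findNext. The rest of this section is a reference to all the methods of this kind. As before, there are two methods for every navigation member: one that returns a list, the way findAll does, and one that returns a single value, the way find does.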
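
The pattern these methods share can be sketched in plain Python, without Beautiful Soup. This is only an illustration of the idea, not the library's implementation: the `Node` class and the helpers `link_siblings`, `find_all_along`, and `find_along` are hypothetical names invented here, and the sketch assumes each node exposes a `nextSibling`-style member that is `None` at the end of the chain.

```python
class Node(object):
    """Hypothetical stand-in for a Tag: a name plus a nextSibling link."""
    def __init__(self, name):
        self.name = name
        self.nextSibling = None


def link_siblings(*nodes):
    """Chain the given nodes through nextSibling and return the first one."""
    for left, right in zip(nodes, nodes[1:]):
        left.nextSibling = right
    return nodes[0]


def find_all_along(start, member, matches, limit=None):
    """Follow `member` from `start` and collect every matching node (list result)."""
    found = []
    node = getattr(start, member)
    while node is not None:
        if matches(node):
            found.append(node)
            if limit is not None and len(found) >= limit:
                break
        node = getattr(node, member)
    return found


def find_along(start, member, matches):
    """Return the first matching node, or None (single-value result)."""
    found = find_all_along(start, member, matches, limit=1)
    return found[0] if found else None


# Siblings in order: h1, p, ul, ul
h1 = link_siblings(Node('h1'), Node('p'), Node('ul'), Node('ul'))
is_ul = lambda node: node.name == 'ul'

print(len(find_all_along(h1, 'nextSibling', is_ul)))               # 2
print(find_along(h1, 'nextSibling', is_ul).name)                   # ul
print(find_along(h1, 'nextSibling', lambda n: n.name == 'table'))  # None
```

Passing the member name as a string is a choice made for this sketch so that one helper can serve any navigation member; the starting node itself is never tested, only the nodes reached from it.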

The reference below uses the following document for its examples:

from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

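`findNextSiblings(name, attrs, text, limit, **kwargs)` and `findNextSibling(name, attrs, text, **kwargs)`

These methods repeatedly follow an object's nextSibling member, gathering Tag or NavigableString objects that match the criteria you specify. Using the document above: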

paraText = soup.find(text='This is paragraph ')
paraText.findNextSiblings('b')
# [<b>one</b>]

paraText.findNextSibling(text=lambda text: len(text) == 1)
# u'.'

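`findPreviousSiblings(name, attrs, text, limit, **kwargs)` and `findPreviousSibling(name, attrs, text, **kwargs)`

These methods repeatedly follow an object's previousSibling member, gathering Tag or NavigableString objects that match the criteria you specify. Using the document above: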

paraText = soup.find(text='.')
paraText.findPreviousSiblings('b')
# [<b>one</b>]

paraText.findPreviousSibling(text = True)
# u'This is paragraph '

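`findAllNext(name, attrs, text, limit, **kwargs)` and `findNext(name, attrs, text, **kwargs)`

These methods repeatedly follow an object's next member, gathering Tag or NavigableString objects that match the criteria you specify. Using the document above: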

pTag = soup.find('p')
pTag.findAllNext(text=True)
# [u'This is paragraph ', u'one', u'.', u'This is paragraph ', u'two', u'.']

pTag.findNext('p')
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

pTag.findNext('b')
# <b>one</b>
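
Note that `pTag.findNext('b')` returns a tag inside `pTag` itself. That is because `next` follows document order, which visits a tag's descendants before moving on to its siblings. The following is a minimal sketch of that ordering in plain Python. The `Node` class and the `document_order` and `all_after` helpers are hypothetical names made up for illustration, and the sketch ignores text nodes for brevity.

```python
class Node(object):
    """Hypothetical stand-in for a Tag: a name and a list of child nodes."""
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)


def document_order(node):
    """Yield the node and all its descendants in the order they appear in the markup."""
    yield node
    for child in node.children:
        for descendant in document_order(child):
            yield descendant


def all_after(root, start):
    """Return every node that comes after `start` in document order."""
    nodes = list(document_order(root))
    position = [n is start for n in nodes].index(True)
    return nodes[position + 1:]


# <body><p><b/></p><p><b/></p></body>
first_p = Node('p', Node('b'))
second_p = Node('p', Node('b'))
body = Node('body', first_p, second_p)

# Document order from the first <p>: its own <b> comes before the second <p>.
following = [n.name for n in all_after(body, first_p)]
print(following)  # ['b', 'p', 'b']

# Sibling order from the first <p>: only the second <p> qualifies.
siblings = [n.name for n in body.children[body.children.index(first_p) + 1:]]
print(siblings)   # ['p']
```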

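`findAllPrevious(name, attrs, text, limit, **kwargs)` and `findPrevious(name, attrs, text, **kwargs)`

These methods repeatedly follow an object's previous member, gathering Tag or NavigableString objects that match the criteria you specify. Using the document above: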

lastPTag = soup('p')[-1]
lastPTag.findAllPrevious(text=True)
# [u'.', u'one', u'This is paragraph ', u'Page title']
# Note the reverse order!

lastPTag.findPrevious('p')
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

lastPTag.findPrevious('b')
# <b>one</b>
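
The reverse order noted above follows from how a backward walk proceeds: each step moves one node closer to the start of the document, so the nearest match is collected first. Here is a small sketch of that, assuming a hypothetical `Node` class with a `previous` link and a hypothetical `all_previous` helper; neither is part of Beautiful Soup.

```python
class Node(object):
    """Hypothetical stand-in for a parse-tree node with a `previous` link."""
    def __init__(self, label, previous=None):
        self.label = label
        self.previous = previous


def all_previous(start):
    """Follow `previous` links from `start`, collecting labels nearest-first."""
    found = []
    node = start.previous
    while node is not None:
        found.append(node.label)
        node = node.previous
    return found


# Build a chain in document order: title, first, second, last
title = Node('Page title')
first = Node('This is paragraph ', previous=title)
second = Node('one', previous=first)
last = Node('.', previous=second)

print(all_previous(last))  # ['one', 'This is paragraph ', 'Page title']
```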

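`findParents(name, attrs, limit, **kwargs)` and `findParent(name, attrs, **kwargs)`

These methods repeatedly follow an object's parent member, gathering Tag objects that match the criteria you specify. They take no text argument, because no object can have a NavigableString as its parent. Using the document above: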

bTag = soup.find('b')

[tag.name for tag in bTag.findParents()]
# [u'p', u'body', u'html', '[document]']
# NOTE: '[document]' means that the parser object itself matched.

bTag.findParent('body').name
# u'body'
