修改剖析树

优质
小牛编辑
125浏览
2023-12-01

现在你已经知道如何在剖析树中寻找东西了。但也许你想对它做些修改并输出出来。 你可以仅仅将一个元素从其父母的contents中分离,但是文档的其他部分仍然 拥有对这个元素的引用。Beautiful Soup 提供了几种方法帮助你修改剖析树并保持其内部的一致性。

修改属性值

你可以使用字典赋值来修改Tag对象的属性值。

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<b id="2">Argh!</b>")
print soup
# <b id="2">Argh!</b>
b = soup.b

b['id'] = 10
print soup
# <b id="10">Argh!</b>

b['id'] = "ten"
print soup
# <b id="ten">Argh!</b>

b['id'] = 'one "million"'
print soup
# <b id='one "million"'>Argh!</b>

你也可以删除一个属性值,然后添加一个新的属性:

del(b['id'])
print soup
# <b>Argh!</b>

b['class'] = "extra bold and brassy!"
print soup
# <b class="extra bold and brassy!">Argh!</b>

删除元素

要是你引用了一个元素,你可以使用extract将它从树中抽离。 下面是将所有的注释从文档中移除的代码:

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                        <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>

这段代码是从文档中移除一个子树:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<a1></a1><a><b>Amazing content<c><d></a><a2></a2>")
soup.a1.nextSibling
# <a><b>Amazing content<c><d></d></c></b></a>
soup.a2.previousSibling
# <a><b>Amazing content<c><d></d></c></b></a>

subtree = soup.a
subtree.extract()

print soup
# <a1></a1><a2></a2>
soup.a1.nextSibling
# <a2></a2>
soup.a2.previousSibling
# <a1></a1>

extract方法将一个剖析树分离为两个不连贯的树。naviation的成员也因此变得看起来好像这两个树 从来不是一起的。

soup.a1.nextSibling
# <a2></a2>
soup.a2.previousSibling
# <a1></a1>
subtree.previousSibling == None
# True
subtree.parent == None
# True

使用一个元素替换另一个元素

replaceWith方法抽出一个页面元素并将其替换为一个不同的元素。 新元素可以为一个Tag(它可能包含一个剖析树)或者NavigableString。 如果你传一个字符串到replaceWith, 它会变为NavigableString。 这个Navigation成员会完全融入到这个剖析树中,就像它本来就存在一样。

下面是一个简单的例子:

The new element can be a Tag (possibly with a whole parse tree beneath it) or a NavigableString. If you pass a plain old string into replaceWith, it gets turned into a NavigableString. The navigation members are changed as though the document had been parsed that way in the first place.

Here's a simple example:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<b>Argh!</b>")
soup.find(text="Argh!").replaceWith("Hooray!")
print soup
# <b>Hooray!</b>

newText = soup.find(text="Hooray!")
newText.previous
# <b>Hooray!</b>
newText.previous.next
# u'Hooray!'
newText.parent
# <b>Hooray!</b>
soup.b.contents
# [u'Hooray!']

这里有一个更复杂点的,相互替换标签(tag)的例子:

from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup("<b>Argh!<a>Foo</a></b><i>Blah!</i>")
tag = Tag(soup, "newTag", [("id", 1)])
tag.insert(0, "Hooray!")
soup.a.replaceWith(tag)
print soup
# <b>Argh!<newTag id="1">Hooray!</newTag></b><i>Blah!</i>

You can even rip out an element from one part of the document and stick it in another part:
你也可以将一个元素抽出然后插入到文档的其他地方:

from BeautifulSoup import BeautifulSoup
text = "<html>There's <b>no</b> business like <b>show</b> business</html>"
soup = BeautifulSoup(text)

no, show = soup.findAll('b')
show.replaceWith(no)
print soup
# <html>There's  business like <b>no</b> business</html>

添加一个全新的元素

The Tag class and the parser classes support a method called insert. It works just like a Python list's insert method: it takes an index to the tag's contents member, and sticks a new element in that slot.
标签类和剖析类有一个insert方法,它就像Python列表的insert方法: 它使用索引来定位标签的contents成员,然后在那个位置插入一个新的元素。

This was demonstrated in the previous section, when we replaced a tag in the document with a brand new tag. You can use insert to build up an entire parse tree from scratch:
在前面那个小结中,我们在文档替换一个新的标签时有用到这个方法。你可以使用insert来重新构建整个剖析树:

from BeautifulSoup import BeautifulSoup, Tag, NavigableString
soup = BeautifulSoup()
tag1 = Tag(soup, "mytag")
tag2 = Tag(soup, "myOtherTag")
tag3 = Tag(soup, "myThirdTag")
soup.insert(0, tag1)
tag1.insert(0, tag2)
tag1.insert(1, tag3)
print soup
# <mytag><myOtherTag></myOtherTag><myThirdTag></myThirdTag></mytag>

text = NavigableString("Hello!")
tag3.insert(0, text)
print soup
# <mytag><myOtherTag></myOtherTag><myThirdTag>Hello!</myThirdTag></mytag>

An element can occur in only one place in one parse tree. If you give insert an element that's already connected to a soup object, it gets disconnected (with extract) before it gets connected elsewhere. In this example, I try to insert my NavigableString into a second part of the soup, but it doesn't get inserted again. It gets moved:
一个元素可能只在剖析树中出现一次。如果你给insert的元素已经和soup对象所关联, 它会被取消关联(使用extract)在它在被连接别的地方之前。在这个例子中,我试着插入我的NavigableString到 soup对象的第二部分,但是它并没有被再次插入而是被移动了:

tag2.insert(0, text)
print soup
# <mytag><myOtherTag>Hello!</myOtherTag><myThirdTag></myThirdTag></mytag>

This happens even if the element previously belonged to a completely different soup object. An element can only have one parent, one nextSibling, et cetera, so it can only be in one place at a time.
即使这个元素属于一个完全不同的soup对象,还是会这样。 一个元素只可以有一个parent,一个nextSibling等等,也就是说一个地方只能出现一次。