The Parse Tree

小牛编辑

2023-12-01

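So far we have only loaded a document and printed it back out. Now let's look at something more interesting: the parse tree, the data structure Beautiful Soup builds when it parses a document.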

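A parser object (an instance of BeautifulSoup or BeautifulStoneSoup) is a deeply-nested, well-connected data structure that mirrors the structure of an XML or HTML document. The parser object contains two other types of objects: Tag objects, which correspond to tags such as <TITLE> and <B>, and NavigableString objects, which correspond to strings such as "Page title" and "This is paragraph".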
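To make "deeply-nested" and "well-connected" concrete, here is a minimal sketch of the two node types. It is not Beautiful Soup's code: `TagNode`, `TextNode`, and `append` are hypothetical names invented for illustration, and unlike the other snippets on this page the sketch targets Python 3 and uses no third-party library.

```python
# Illustrative sketch only; TagNode and TextNode are hypothetical names,
# not classes from Beautiful Soup.

class TextNode(str):
    """A string that also remembers which tag contains it."""
    parent = None


class TagNode:
    """A tag with a name, a parent link, and an ordered list of children."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.contents = []

    def append(self, child):
        # Link the child in both directions so the tree can be walked
        # downwards (contents) and upwards (parent).
        child.parent = self
        self.contents.append(child)
        return child


# Build the <head><title>Page title</title></head> fragment by hand.
head = TagNode("head")
title = head.append(TagNode("title"))
text = title.append(TextNode("Page title"))

print(text)                      # the string itself
print(text.parent.name)          # the tag that holds the string
print(text.parent.parent.name)   # and that tag's own parent
```

The point of the sketch is that every node, whether a tag or a string, can reach its neighbours, which is what makes the later navigation and search features possible.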

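There are also some subclasses of NavigableString (CData, Comment, Declaration, and ProcessingInstruction) that correspond to special XML constructs. They behave just like NavigableString objects, except that some extra data is attached to them when they are printed out. Here is a document that includes a comment: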

from BeautifulSoup import BeautifulSoup
import re
hello = "Hello! <!--I've got to be nice to get what I want.-->"
commentSoup = BeautifulSoup(hello)
comment = commentSoup.find(text=re.compile("nice"))

comment.__class__
# <class 'BeautifulSoup.Comment'>
comment
# u"I've got to be nice to get what I want."
comment.previousSibling
# u'Hello! '

str(comment)
# "<!--I've got to be nice to get what I want.-->"
print commentSoup
# Hello! <!--I've got to be nice to get what I want.-->
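The behaviour shown above (a comment that compares like its plain text but is printed with its `<!--` and `-->` delimiters) can be imitated with an ordinary string subclass. The following is only a sketch under assumptions: `NavigableText`, `CommentText`, and the `output` method are hypothetical names, not Beautiful Soup's API, and the code targets Python 3.

```python
# Illustrative sketch only; NavigableText, CommentText and output() are
# hypothetical and do not come from Beautiful Soup.

class NavigableText(str):
    """Plain text inside the tree; it is written out unchanged."""

    def output(self):
        return str(self)


class CommentText(NavigableText):
    """Same text as a plain string, but wrapped in comment markers on output."""

    def output(self):
        return "<!--%s-->" % str(self)


plain = NavigableText("Hello! ")
comment = CommentText("I've got to be nice to get what I want.")

# The comment still behaves like its text for comparison and searching...
print("nice" in comment)

# ...but gets its delimiters back when the document is rendered.
rendered = plain.output() + comment.output()
print(rendered)
```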

Now let's take a closer look at the document we used at the beginning:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

The attributes of `Tag`s

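Tag and NavigableString objects have many useful members, most of which are covered in more detail in Navigating the Parse Tree and Searching the Parse Tree. For now, we look at one member of Tag objects that is used here: attributes.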

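SGML tags have attributes: for example, each <P> tag in the HTML above has an "id" attribute and an "align" attribute. You can access a tag's attributes by treating the Tag object as a dictionary: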

firstPTag, secondPTag = soup.findAll('p')

firstPTag['id']
# u'firstpara'

secondPTag['id']
# u'secondpara'

NavigableString objects do not have attributes; only Tag objects do.
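Dictionary-style access of this kind is easy to model. The sketch below is not how Beautiful Soup is implemented: `SimpleTag` and `TagCollector` are hypothetical classes, and the parsing is done with Python 3's standard `html.parser` module rather than with Beautiful Soup, purely so that the example runs without extra dependencies.

```python
from html.parser import HTMLParser


class SimpleTag:
    """Hypothetical, simplified tag: a name plus a mapping of attributes."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = dict(attrs)

    def __getitem__(self, key):
        # tag['id'] is forwarded to the attribute mapping;
        # a missing attribute raises KeyError, as with a real dict.
        return self.attrs[key]

    def get(self, key, default=None):
        # Lookup that returns a default instead of raising.
        return self.attrs.get(key, default)


class TagCollector(HTMLParser):
    """Collects a SimpleTag for every start tag with the wanted name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted:
            self.tags.append(SimpleTag(tag, attrs))


doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']

collector = TagCollector("p")
collector.feed("".join(doc))
collector.close()

first_p, second_p = collector.tags
print(first_p["id"])
print(second_p["align"])
print(first_p.get("class"))  # attribute not present in the document
```

Keeping the attributes in a plain `dict` and forwarding `__getitem__` to it is one simple design choice; it gives the tag dictionary-like lookups without making the tag itself a dictionary.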
