Learning Python: Parsing with bs4 (Part 1)

柴宝

2023-12-01

BeautifulSoup documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#Parsing%20HTML

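Preface

We can already use the requests library for simple interactions with web pages, such as fetching the content behind a URL with get. But whether we look at r.text or r.content, what comes back is one big blob, as in the code below. It is all on one line! One line!! One line!!! (You can tell because there is a single quote at each end [picks nose][picks nose][picks nose].) It is long and messy, tiring to read, and awkward to process any further.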

import requests

url = "http://www.baidu.com"
r = requests.get(url)
r.content

# The content returned for "http://www.baidu.com":
b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>\xe6\x96\xb0\xe9\x97\xbb</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>\xe5\x9c\xb0\xe5\x9b\xbe</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>\xe8\xa7\x86\xe9\xa2\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>\xe8\xb4\xb4\xe5\x90\xa7</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>\xe7\x99\xbb\xe5\xbd\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">\xe7\x99\xbb\xe5\xbd\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">\xe6\x9b\xb4\xe5\xa4\x9a\xe4\xba\xa7\xe5\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>\xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>\xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>\xe6\x84\x8f\xe8\xa7\x81\xe5\x8f\x8d\xe9\xa6\x88</a>&nbsp;\xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

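In fact, the Python gurus thought of this long ago (or perhaps suffered through it themselves) and built a third-party library called beautifulsoup (tasty soup?). beautifulsoup is an HTML/XML parser written in Python. It copes well with malformed markup and produces a parse tree, and it offers simple, commonly used operations for navigating, searching, and modifying that tree, which can save a lot of programming time. The current version of the library lives in the bs4 package. I recommend pairing it with the lxml parser, which is more capable and faster. These are the libraries we need; if you have not installed them yet, go install them now.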

- pip install requests
- pip install bs4
- pip install lxml

I. Using beautifulsoup

First import the library: from bs4 import BeautifulSoup
Pass the fetched page source to BeautifulSoup together with a parser: soup_info = BeautifulSoup(r.content, 'lxml'). Then print soup_info to see what we get.

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下，你就知道</title></head> <body link="#00
00cc"> <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><input autocomplete="off" autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span><span class="bg s_btn_wr"><input class="bg s_btn" id="su" type="submit" value=" 百度一下"/></span> </form> </div> </div> <div id="u1"> <a class="mnav" href="http://new
s.baidu.com" name="tj_trnews">新闻</a> <a class="mnav" href="http://www.hao123.com" nam
e="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a> <a class=
"mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a> <noscript> <a class="lb
" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a> </noscript> <script>document.writ
e('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a class="bri" href="//www.baidu.com/mor
e/" name="tj_briicon" style="display: block;">更多产品</a> </div> </div> </div> <div id
="ftCon"> <div id="ftConw"> <p id="lh"> <a href="http://home.baidu.com">关于百度</a> <a
 href="http://ir.baidu.com">About Baidu</a> </p> <p id="cp">©2017 Baidu <a href="http: //www.baidu.com/duty/">使用百度前必读</a>  <a class="cp-feedback" href="http://jianyi.b
aidu.com/">意见反馈</a> 京ICP证030173号  <img src="//www.baidu.com/img/gs.gif"/> </p> <
/div> </div> </div> </body> </html>

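Hmmmm, that looks a bit better. At least the single quotes at the start and end are gone, which means it is no longer just a plain blob! It has been parsed into real page source! It has become a real HTML document! It has grown up!! 2333333....
Passerby: And then? So the quotes on both sides are gone. What is so great about that [picks nose][picks nose][picks nose][picks nose][picks nose][picks nose][picks nose]?
Author: (aside: huh? what is with this passerby) nonononono! You have it wrong. Let's look at the types of the two objects:

In [28]: type(r.content)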
Out[28]: bytes

In [29]: type(soup_info)
Out[29]: bs4.BeautifulSoup

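Passerby: Wow! Amazing, the type changed! It turned into a bs4.BeautifulSoup type I have never seen before [picks nose][picks nose][picks nose][picks nose][picks nose][picks nose][picks nose][picks nose][picks nose].
Author: (aside: why does this person keep picking their nose?) We know that every data type has its own methods, so let's see what bs4.BeautifulSoup offers.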
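The steps in this section can be put together in a minimal sketch. To keep it runnable without network access, the HTML here is a short hard-coded bytes string (a hypothetical stand-in for r.content), and Python's built-in 'html.parser' is assumed in place of 'lxml' so that nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content: raw bytes, as requests would return them.
raw = b"<html><head><title>Baidu</title></head><body><a href='http://news.baidu.com'>News</a></body></html>"

# Parse the bytes into a tree. 'html.parser' ships with Python;
# swap in 'lxml' if it is installed.
soup_info = BeautifulSoup(raw, "html.parser")

print(type(raw))        # the input is plain bytes
print(type(soup_info))  # the result is a bs4.BeautifulSoup object
print(soup_info.title)  # tags are now reachable as attributes
```

The point of the type change is the last line: once the bytes have been parsed, the document can be queried by structure instead of by string slicing.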

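II. Methods of bs4.BeautifulSoup

As we know (or have at least seen), an HTML document is written with pairs of angle brackets <>, called tags, and the whole screen is full of them. BeautifulSoup parses the entire document into a tree. Each pair of tags is a node, and every node is a Python object. All of these objects fall into four types, each introduced below:

BeautifulSoup (the document)
Tag (a tag): the basic unit of organization, opened with <> and closed with </>
NavigableString (content): the non-attribute string inside a tag, that is, the text between <> and </>; accessed as <tag>.string
Comment (a comment): the comment part of the string inside a tag; a special kind of NavigableString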
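As a rough illustration of the four object types listed above, the sketch below parses a tiny hypothetical snippet (not taken from the Baidu page) and prints the class of each node. It assumes the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical snippet containing a tag, a text string, and a comment.
html = "<p class='intro'>Hello<!--a hidden note--></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p            # Tag
text = p.contents[0]  # NavigableString: the text "Hello"
note = p.contents[1]  # Comment: the text inside <!-- -->

print(type(soup).__name__)  # BeautifulSoup
print(type(p).__name__)     # Tag
print(type(text).__name__)  # NavigableString
print(type(note).__name__)  # Comment
# A Comment is itself a special kind of NavigableString.
print(isinstance(note, NavigableString))  # True
```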

1. Since it is a tree, we can print it as a tree. This is also called pretty-printing, and it is easier on the eyes.

soup_info.prettify()

In [46]: print(soup_info.prettify())
<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <link href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
  <title>
   百度一下，你就知道
  </title>
 </head>
 <body link="#0000cc">
  <div id="wrapper">
   <div id="head">
    <div class="head_wrapper">
     <div class="s_form">
      <div class="s_form_wrapper">
       <div id="lg">
        <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
       </div>
       <form action="//www.baidu.com/s" class="fm" id="form" name="f">
        <input name="bdorz_come" type="hidden" value="1"/>
        <input name="ie" type="hidden" value="utf-8"/>
        <input name="f" type="hidden" value="8"/>
        <input name="rsv_bp" type="hidden" value="1"/>
        <input name="rsv_idx" type="hidden" value="1"/>
        <input name="tn" type="hidden" value="baidu"/>
        <span class="bg s_ipt_wr">
         <input autocomplete="off" autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>
        </span>
        <span class="bg s_btn_wr">
         <input class="bg s_btn" id="su" type="submit" value="百度一下"/>
        </span>
       </form>
      </div>
     </div>
     <div id="u1">
      <a class="mnav" href="http://news.baidu.com" name="tj_trnews">
       新闻
      </a>
      <a class="mnav" href="http://www.hao123.com" name="tj_trhao123">
       hao123
      </a>
      <a class="mnav" href="http://map.baidu.com" name="tj_trmap">
       地图
      </a>
      <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">
       视频
      </a>
      <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">
       贴吧
      </a>
      <noscript>
       <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">
        登录
       </a>
      </noscript>
      <script>
       document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
      </script>
      <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">
       更多产品
      </a>
     </div>
    </div>
   </div>
   <div id="ftCon">
    <div id="ftConw">
     <p id="lh">
      <a href="http://home.baidu.com">
       关于百度
      </a>
      <a href="http://ir.baidu.com">
       About Baidu
      </a>
     </p>
     <p id="cp">
      ©2017 Baidu
      <a href="http://www.baidu.com/duty/">
       使用百度前必读
      </a>
      <a class="cp-feedback" href="http://jianyi.baidu.com/">
       意见反馈
      </a>
      京ICP证030173号
      <img src="//www.baidu.com/img/gs.gif"/>
     </p>
    </div>
   </div>
  </div>
 </body>
</html>

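2. We can work with tags and attributes. A tag can be treated like a member variable and accessed directly with the member operator ".".

Find the first title tag and its content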

In [43]: print(soup_info.title.prettify())
<title>
 百度一下，你就知道
</title>

Find the first body tag and its content

In [48]: print(soup_info.body.prettify())
<body link="#0000cc">
 <div id="wrapper">
  <div id="head">
   <div class="head_wrapper">
    <div class="s_form">
     <div class="s_form_wrapper">
      <div id="lg">
       <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
      </div>
      <form action="//www.baidu.com/s" class="fm" id="form" name="f">
       <input name="bdorz_come" type="hidden" value="1"/>
       <input name="ie" type="hidden" value="utf-8"/>
       <input name="f" type="hidden" value="8"/>
       <input name="rsv_bp" type="hidden" value="1"/>
       <input name="rsv_idx" type="hidden" value="1"/>
       <input name="tn" type="hidden" value="baidu"/>
       <span class="bg s_ipt_wr">
        <input autocomplete="off" autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>
       </span>
       <span class="bg s_btn_wr">
        <input class="bg s_btn" id="su" type="submit" value="百度一下"/>
       </span>
      </form>
     </div>
    </div>
    <div id="u1">
     <a class="mnav" href="http://news.baidu.com" name="tj_trnews">
      新闻
     </a>
     <a class="mnav" href="http://www.hao123.com" name="tj_trhao123">
      hao123
     </a>
     <a class="mnav" href="http://map.baidu.com" name="tj_trmap">
      地图
     </a>
     <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">
      视频
     </a>
     <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">
      贴吧
     </a>
     <noscript>
      <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">
       登录
      </a>
     </noscript>
     <script>
      document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
     </script>
     <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">
      更多产品
     </a>
    </div>
   </div>
  </div>
  <div id="ftCon">
   <div id="ftConw">
    <p id="lh">
     <a href="http://home.baidu.com">
      关于百度
     </a>
     <a href="http://ir.baidu.com">
      About Baidu
     </a>
    </p>
    <p id="cp">
     ©2017 Baidu
     <a href="http://www.baidu.com/duty/">
      使用百度前必读
     </a>
     <a class="cp-feedback" href="http://jianyi.baidu.com/">
      意见反馈
     </a>
     京ICP证030173号
     <img src="//www.baidu.com/img/gs.gif"/>
    </p>
   </div>
  </div>
 </div>
</body>

View the attributes of the first meta tag, returned as a dictionary. The tag is <meta content="text/html;charset=utf-8" http-equiv="content-type"/>

In [67]: print(soup_info.meta.attrs)
{'http-equiv': 'content-type', 'content': 'text/html;charset=utf-8'}

Get the content attribute of meta

In [71]: print(soup_info.meta.attrs["content"])
text/html;charset=utf-8
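The attribute access shown above can be tried on a single tag. This is a small sketch that assumes the built-in 'html.parser'; the meta tag is copied from the page, and the lookup of a missing attribute named charset is only an illustration.

```python
from bs4 import BeautifulSoup

html = '<meta content="text/html;charset=utf-8" http-equiv="content-type"/>'
soup = BeautifulSoup(html, "html.parser")
meta = soup.meta

print(meta.attrs)           # all attributes as a dict
print(meta["content"])      # shorthand for meta.attrs["content"]
print(meta.get("charset"))  # .get() returns None for a missing attribute
```

Indexing with meta["charset"] would raise KeyError here, so .get() is the safer choice when an attribute may be absent.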

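string returns a single piece of text. If a tag contains no other tags, string returns the text inside it. If a tag contains exactly one child tag, string returns the text of that innermost tag. If a tag has several child nodes, it cannot tell which child's text string should refer to, so string is None.
Return the content of the first a tag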

In [75]: print(soup_info.a.string)
新闻

Return the content of the first meta tag

In [72]: print(soup_info.meta.string)
None
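The three cases for string can be seen side by side in a short sketch. The markup is a hypothetical snippet built to contain one tag of each kind, and the built-in 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup

html = (
    "<div>"
    "<a>News</a>"                 # only a string inside
    "<p><b>Bold</b></p>"          # exactly one child tag
    "<span>One<i>Two</i></span>"  # several children
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

print(soup.a.string)     # News
print(soup.p.string)     # Bold (taken from the single child <b>)
print(soup.span.string)  # None (two children, so it is ambiguous)
```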

Find all a tags and their content

In [79]: for i in soup_info.find_all('a'):
    ...:     print(i)
    ...:
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="http://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a>
<a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;"> 更多产品</a>
<a href="http://home.baidu.com">关于百度</a>
<a href="http://ir.baidu.com">About Baidu</a>
<a href="http://www.baidu.com/duty/">使用百度前必读</a>
<a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a>

Find the first tag with the attribute value="1"

In [81]: print(soup_info.find(value="1"))
<input name="bdorz_come" type="hidden" value="1"/>

Find all tags with the attribute value="1"

In [82]: print(soup_info.find_all(value="1"))
[<input name="bdorz_come" type="hidden" value="1"/>, <input name="rsv_bp" type="hidden" value="1"/>, <input name="rsv_idx" type="hidden" value="1"/>]

3. Since it is a tree, it has nodes, and we can also navigate between node levels

View the parent node of the first title tag, with its content

In [52]: print(soup_info.title.parent.prettify())
<head>
 <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
 <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
 <meta content="always" name="referrer"/>
 <link href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
 <title>
  百度一下，你就知道
 </title>
</head>

View the tag name of the first title tag's parent node

In [54]: print(soup_info.title.parent.name)
head

View the tag name of the parent's parent (the grandparent?) of the first title tag

In [55]: print(soup_info.title.parent.parent.name)
html

contents: gets all child nodes directly under a tag, returned as a list:

In [83]: print(soup_info.head.contents)
[<meta content="text/html;charset=utf-8" http-equiv="content-type"/>, <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>, <meta content="always" name="referrer"/>, <link href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>, <title>百度一下，你就知道</title>]

children: gets all direct child nodes, returned as an iterator:

In [87]: for i in soup_info.head.children:
    ...:     print(i)
    ...:
    ...:
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道</title>

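Both contents and children return direct children only (tags and strings alike); a tag that is not a direct child will not be returned. As a vivid example: my child is called Xiao Ming, your child is also called Xiao Ming, and the two of us are brothers. Using contents or children to get my Xiao Ming will never return your child. Keep this distinct from the descendants attribute described below.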
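To make the "direct children only" rule concrete, here is a small sketch on a hypothetical list (not part of the Baidu page), assuming the built-in 'html.parser'. The <b> tag is a grandchild of <ul>, so neither contents nor children reaches it.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one <b>bold</b></li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# .contents is a list holding only the direct children of <ul>.
direct = [child.name for child in ul.contents]
print(direct)  # ['li', 'li']

# .children yields the same nodes, but lazily.
same_nodes = list(ul.children) == ul.contents
print(same_nodes)  # True

# <b> sits one level deeper, so it is not among the children of <ul>.
has_b = any(child.name == "b" for child in ul.contents)
print(has_b)  # False

# Strings count as children too: the first <li> has a string and a tag.
li_children = soup.li.contents
print(li_children)  # ['one ', <b>bold</b>]
```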
get_text() method: returns the text of the current node and all of its descendants.

In [96]: print(soup_info.find("div", id="u1").get_text())
 新闻 hao123 地图 视频 贴吧  登录  document.write('<a href="http://www.baidu.com/bdorz/
login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
更多产品
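The raw get_text() output keeps every space between tags, which is why the result above looks ragged. As a hedged sketch on a simplified, hypothetical version of the u1 block (built-in 'html.parser' assumed), get_text() also accepts a separator and a strip flag that tidy this up.

```python
from bs4 import BeautifulSoup

html = '<div id="u1"> <a>News</a> <a>Map</a> <a>Video</a> </div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="u1")

raw_text = div.get_text()            # keeps the whitespace between tags
clean = div.get_text(" ", strip=True)  # strip each piece, join with one space

print(repr(raw_text))  # ' News Map Video '
print(repr(clean))     # 'News Map Video'
```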

parents: iterates over all ancestors of a node, from its parent upward:

In [105]: for i in soup_info.a.parents:
     ...:     print(i.name)
     ...:
div
div
div
div
body
html
[document]

next_sibling: gets the next sibling of the node. The result is often a string or whitespace, because whitespace and line breaks also count as nodes.

In [106]: print(soup_info.a.next_sibling)
# the result here is whitespace

In [111]: print(soup_info.a.next_sibling.next_sibling)
<a class="mnav" href="http://www.hao123.com" name="tj_trhao123">hao123</a>
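Chaining next_sibling twice works, but it depends on exactly one whitespace node sitting between the tags. A sketch of a sturdier alternative, on a hypothetical two-link snippet with the built-in 'html.parser' assumed, is to ask for the next sibling that is a tag with find_next_sibling().

```python
from bs4 import BeautifulSoup

html = '<div><a id="first">News</a> <a id="second">Map</a></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is the whitespace string between the two tags.
print(repr(first.next_sibling))  # ' '

# find_next_sibling() skips strings and returns the next matching tag.
second = first.find_next_sibling("a")
print(second)  # <a id="second">Map</a>

# previous_sibling walks the other way and meets the same whitespace.
print(repr(second.previous_sibling))  # ' '
```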

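previous_sibling: gets the previous sibling of the node.
next_siblings, previous_siblings: iterate over all following or preceding siblings.
next_element, previous_element: get the next or previous node. These are not limited to siblings; they follow parse order across all nodes regardless of level.
next_elements, previous_elements: iterate over all following or preceding nodes.
descendants: iterates recursively over all descendants of a tag. I find this one the most fun. It works through the tag pairs level by level and outputs everything inside each pair. Note!! When there are sibling nodes, it finishes one completely before moving on to the next. Here is a vivid example that works like an echo, in which "how are you today" and "I have eaten" are sibling nodes:
Fetch: Hey, how are you today, I have eaten!
Result: how are you today, are you today, you today, today, I have eaten, have eaten, eaten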

In [120]: for i in soup_info.form.descendants:
     ...:     print(i)
     ...:
     ...:

<input name="bdorz_come" type="hidden" value="1"/>

<input name="ie" type="hidden" value="utf-8"/>

<input name="f" type="hidden" value="8"/>

<input name="rsv_bp" type="hidden" value="1"/>

<input name="rsv_idx" type="hidden" value="1"/>

<input name="tn" type="hidden" value="baidu"/>
<span class="bg s_ipt_wr"><input autocomplete="off" autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span>
<input autocomplete="off" autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>
<span class="bg s_btn_wr"><input class="bg s_btn" id="su" type="submit" value="百度一下"/></span>
<input class="bg s_btn" id="su" type="submit" value="百度一下"/>
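The visiting order is easier to see on a smaller tree. The sketch below uses a hypothetical, stripped-down form with two spans (built-in 'html.parser' assumed) and records only the tags: each span is followed by its own input before the next span is visited.

```python
from bs4 import BeautifulSoup, Tag

html = "<form><span><input name='wd'/></span><span><input name='su'/></span></form>"
soup = BeautifulSoup(html, "html.parser")

order = []
for node in soup.form.descendants:
    # Strings are descendants too; keep only tags for this listing.
    if isinstance(node, Tag):
        order.append(f"{node.name}:{node.get('name', '')}")

print(order)  # ['span:', 'input:wd', 'span:', 'input:su']
```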

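4. Searching the document tree. The most commonly used function is find_all()

(1) find_all( name , attrs , recursive , text , limit , **kwargs )

find_all() searches all tag descendants of the current tag and checks whether each one matches the filter conditions.

name parameter

The name parameter finds all tags whose name is name; string nodes are ignored automatically. It returns a list.

1) Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The example below finds all <a> tags in the document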

In [122]: print(soup_info.find_all("a"))
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>, <a class="mnav
" href="http://www.hao123.com" name="tj_trhao123">hao123</a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>, <a class="mnav" href="http://v.baidu.com"
name="tj_trvideo">视频</a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trti
eba">贴吧</a>, <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=m
n&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a>, <a cl
ass="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产
品</a>, <a href="http://home.baidu.com">关于百度</a>, <a href="http://ir.baidu.com">Abo
ut Baidu</a>, <a href="http://www.baidu.com/duty/">使用百度前必读</a>, <a class="cp-fee
dback" href="http://jianyi.baidu.com/">意见反馈</a>]

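2) Passing a regular expression

If you pass a regular expression, Beautiful Soup matches tag names with the pattern's search() method. The example below finds all tags whose names start with i (it assumes import re has already been run)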

In [17]: for i in soup_info.find_all(re.compile("^i")):
    ...:     print(i.name)
    ...:
    ...:
img
input
input
input
input
input
input
input
input
img
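Because the pattern is applied with search(), the ^ anchor matters. The following sketch, on a hypothetical snippet with the built-in 'html.parser' assumed, compares an anchored and an unanchored pattern.

```python
import re

from bs4 import BeautifulSoup

html = "<div><img src='a.png'/><input name='q'/><li>item</li><b>bold</b></div>"
soup = BeautifulSoup(html, "html.parser")

# '^i' anchors at the start of the tag name: img and input.
starts_with_i = [t.name for t in soup.find_all(re.compile("^i"))]
print(starts_with_i)  # ['img', 'input']

# Without the anchor the pattern may match anywhere in the name,
# so 'div' and 'li' are returned as well.
contains_i = [t.name for t in soup.find_all(re.compile("i"))]
print(contains_i)  # ['div', 'img', 'input', 'li']
```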

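3) Passing a list

If you pass a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all <a> tags and <title> tags in the document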

In [22]: print(soup_info.find_all(["a","title"]))
[<title>百度一下，你就知道</title>, <a class="mnav" href="http://news.baidu.com" name="
tj_trnews">新闻</a>, <a class="mnav" href="http://www.hao123.com" name="tj_trhao123">ha
o123</a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>, <a clas
s="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a>, <a class="mnav" href="ht
tp://tieba.baidu.com" name="tj_trtieba">贴吧</a>, <a class="lb" href="http://www.baidu.
com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a>, <a class="bri" href="//www.baidu.com/more/" name="tj_brii
con" style="display: block;">更多产品</a>, <a href="http://home.baidu.com">关于百度</a>
, <a href="http://ir.baidu.com">About Baidu</a>, <a href="http://www.baidu.com/duty/"> 使用百度前必读</a>, <a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a>
]

4) Passing True

True matches any value. The code below finds every tag but returns no string nodes. Here True is searched for under the form tag

In [25]: for i in soup_info.form.find_all(True):
    ...:     print(i.name)
    ...:
input
input
input
input
input
input
span
input
span
input

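5) Passing a function

If no other filter fits, you can define a function that takes a single element as its argument. It returns True if the current element matches and should be included, and False otherwise. The function below checks the current element and matches it if it has a class attribute but no id attribute; find_all then returns the list of matching tags.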

In [26]: def has_class_but_no_id(tag):
    ...:     return tag.has_attr('class') and not tag.has_attr('id')

In [27]: print(soup_info.find_all(has_class_but_no_id))
[<div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><input autocomplete="off" autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span><span class="bg s_btn_wr"><input class="bg s_btn" id="su" type="submit" value="百度一下"/></span> </form> </div> </div> <
div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a> <a
class="mnav" href="http://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a> <a class="mnav" href="http://v.ba
idu.com" name="tj_trvideo">视频</a> <a class="mnav" href="http://tieba.baidu.com" name=
"tj_trtieba">贴吧</a> <noscript> <a class="lb" href="http://www.baidu.com/bdorz/login.g
if?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/log
in.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</scr
ipt> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div> </div>, <div class="s_form"> <div class="s_form_wrapper"> <div
id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><input autocomplete="off" autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span><span class="bg s_btn_wr"><input class="bg s_btn" id="su" type="submit" value="百度一下"/></span> </form> </div> </di
v>, <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><input autocomplete="off" autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span><span class="bg s_btn_wr"><input class="bg s_btn" id="su" type="submit" value="百度一下"/></span> </form> </div>, <span class="bg s_ipt_wr"><input autocomplete="off"
autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span>, <span class="bg s_btn_wr"><input class="bg s_btn" id="su" type="submit" value="百度一下"/></sp
an>, <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>, <a class="
mnav" href="http://www.hao123.com" name="tj_trhao123">hao123</a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>, <a class="mnav" href="http://v.baidu.c
om" name="tj_trvideo">视频</a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_
trtieba">贴吧</a>, <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;t
pl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a>, <
a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更 多产品</a>, <a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a>]
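The output above is hard to read because several matching divs contain each other. The same filter is easier to follow on a small, hypothetical snippet (built-in 'html.parser' assumed) where only one tag qualifies.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="u1" class="nav">'                             # class and id: rejected
    '<a class="mnav" href="http://news.baidu.com">News</a>'  # class only: accepted
    '<a href="http://ir.baidu.com">About</a>'               # no class: rejected
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Return True for tags that carry a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
print(matches)  # only the first <a> tag
```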

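keyword parameters

Note that if a named argument is not one of the built-in search parameter names, it is treated as a filter on the tag attribute of that name. If you pass an argument named style, BeautifulSoup checks the "style" attribute of every tag: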

In [29]: print(soup_info.find_all(style="display: block;"))
[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a>]

attrs parameter

Some tag attributes cannot be used as keyword arguments, such as the HTML5 "data-*" custom attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

# In this case we need the attrs parameter
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
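The keyword and attrs styles can be compared in one runnable sketch. The snippet is hypothetical, the built-in 'html.parser' is assumed, and the class_ keyword is an addition not covered in the text above: class is a reserved word in Python, so bs4 accepts class_ in its place.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="main" data-foo="value" class="box wide">foo!</div>'
    '<div class="box">bar</div>'
)
soup = BeautifulSoup(html, "html.parser")

# An ordinary attribute name works as a keyword argument.
by_keyword = soup.find_all(id="main")

# 'data-foo' is not a valid Python identifier, so it goes through attrs.
by_attrs = soup.find_all(attrs={"data-foo": "value"})

# class_ stands in for the reserved word class; it matches any one
# of a tag's space-separated class values.
by_class = soup.find_all(class_="box")

print(len(by_keyword), len(by_attrs), len(by_class))  # 1 1 2
```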

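text parameter

The text parameter searches the string content of the document. Like the name parameter, text accepts a string, a regular expression, a list, or True. (From Beautiful Soup 4.4.0 onward this argument is called string; text is the older name.)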

In [31]: print(soup_info.find_all(text=re.compile("百度")))
['百度一下，你就知道', '关于百度', '使用百度前必读']

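limit parameter

find_all() returns every search result, which can be slow when the document tree is large. If we do not need all the results, the limit parameter caps how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches limit, the search stops and the results are returned.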

In [32]: print(soup_info.find_all(text=re.compile("百度"), limit=2))
['百度一下，你就知道', '关于百度']
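A self-contained sketch of string search combined with limit follows. It uses a hypothetical four-paragraph snippet, the built-in 'html.parser', and the newer string= spelling of the text argument; all three are assumptions made for this example.

```python
import re

from bs4 import BeautifulSoup

html = "<p>Baidu home</p><p>About Baidu</p><p>Feedback</p><p>Baidu terms</p>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("Baidu")

# Searching by string alone returns the matching strings themselves.
all_hits = soup.find_all(string=pattern)
print(all_hits)  # ['Baidu home', 'About Baidu', 'Baidu terms']

# limit stops the search once enough results have been collected.
first_two = soup.find_all(string=pattern, limit=2)
print(first_two)  # ['Baidu home', 'About Baidu']

# Adding a tag name returns the tags whose .string matches instead.
tags = soup.find_all("p", string=pattern, limit=1)
print(tags)  # [<p>Baidu home</p>]
```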

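recursive parameter

When you call find_all() on a tag, BeautifulSoup examines all of its descendants. To search only the direct children of the tag, pass recursive=False. In the example below, title is not a direct child of the document root, so the first call returns an empty list.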

In [33]: print(soup_info.find_all("title", recursive=False))
[]

In [34]: print(soup_info.find_all("title"))
[<title>百度一下，你就知道</title>]
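Whether recursive=False finds anything depends on where the search starts. In this sketch (hypothetical minimal page, built-in 'html.parser' assumed), the same non-recursive search fails from the document root but succeeds from head, where title is a direct child.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Baidu</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The only direct child of the document is <html>, so nothing is found.
top_level = soup.find_all("title", recursive=False)
print(top_level)  # []

# The default recursive search looks through every level.
anywhere = soup.find_all("title")
print(anywhere)  # [<title>Baidu</title>]

# <title> is a direct child of <head>, so this non-recursive search works.
in_head = soup.head.find_all("title", recursive=False)
print(in_head)  # [<title>Baidu</title>]
```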

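(2) find( name , attrs , recursive , text , **kwargs )

The only difference from find_all() is that find_all() returns a list of all matches (a list with a single element when only one tag matches), while find() returns the first match directly. When nothing matches, find_all() returns an empty list and find() returns None.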
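The difference in return values, including the no-match case, can be checked with a short sketch on a hypothetical list (built-in 'html.parser' assumed).

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")          # a single Tag
all_items = soup.find_all("li")  # a list of Tags
print(first)      # <li>one</li>
print(all_items)  # [<li>one</li>, <li>two</li>]

# With no match, find() gives None and find_all() gives an empty list.
missing = soup.find("table")
missing_all = soup.find_all("table")
print(missing, missing_all)  # None []

# find() returns the same node as the first result of find_all(limit=1).
same = soup.find_all("li", limit=1)[0] is first
print(same)  # True
```

Since find() can return None, code that chains attribute access onto its result (for example soup.find("table").string) should check for None first.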

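(3) find_parents() and find_parent()

find_all() and find() search only downward, through the children, grandchildren, and further descendants of the current node. find_parents() and find_parent() search the ancestors of the current node instead, using the same filters as an ordinary tag search.

(4) find_next_siblings() and find_next_sibling()

These two methods iterate over the siblings that come after the current tag, using the .next_siblings attribute. find_next_siblings() returns all matching following siblings, and find_next_sibling() returns only the first matching one.

(5) find_previous_siblings() and find_previous_sibling()

These two methods iterate over the siblings that come before the current tag, using the .previous_siblings attribute. find_previous_siblings() returns all matching preceding siblings, and find_previous_sibling() returns the first matching one.

(6) find_all_next() and find_next()

These two methods iterate over the tags and strings that come after the current tag, using the .next_elements attribute. find_all_next() returns all matching nodes, and find_next() returns the first matching one.

(7) find_all_previous() and find_previous()

These two methods iterate over the tags and strings that come before the current node, using the .previous_elements attribute. find_all_previous() returns all matching nodes, and find_previous() returns the first matching one.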
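The directional search methods in (3) to (7) differ mainly in which nodes they walk over. The sketch below applies several of them to one starting tag in a hypothetical snippet (the ids a1 to a3 are made up for illustration, and the built-in 'html.parser' is assumed).

```python
from bs4 import BeautifulSoup

html = (
    '<div id="head">'
    '<p id="lh"><a id="a1">Home</a><a id="a2">About</a></p>'
    '<p id="cp"><a id="a3">Terms</a></p>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
a2 = soup.find("a", id="a2")

# Upward: the nearest enclosing <div>.
parent_div = a2.find_parent("div")
print(parent_div["id"])  # head

# Sideways: siblings share the same parent <p>.
prev_a = a2.find_previous_sibling("a")
next_sibs = a2.find_next_siblings("a")
print(prev_a["id"], next_sibs)  # a1 []

# Forward in parse order: this crosses into the next <p>.
next_a = a2.find_next("a")
print(next_a["id"])  # a3

# Backward in parse order: every earlier <a> in the document.
earlier = [t["id"] for t in a2.find_all_previous("a")]
print(earlier)  # ['a1']
```

Note that find_next_siblings() finds nothing for a2 because a3 lives under a different parent, while find_next() reaches it by following parse order.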
