Python Reading HTML Pages

小牛编辑

2023-12-01

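The library used here is called BeautifulSoup. With it, we can search for the values of HTML tags and extract specific data, such as the page title and the list of headings on the page.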

Installing BeautifulSoup

Use the Anaconda package manager to install the required package and its dependencies.

conda install beautifulsoup4
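
A quick way to confirm the installation worked is to import the package and print its version. This is only a minimal check; it assumes the package was installed under its usual distribution name, beautifulsoup4, which is imported as bs4.

```python
# Sanity check after installation.
# Assumption: the package is installed as "beautifulsoup4" and imported as "bs4".
import bs4

version = bs4.__version__
print("BeautifulSoup version:", version)
```

If the import fails with ModuleNotFoundError, the package is most likely not installed in the environment that is running the script.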

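Reading the HTML File

In the example below, we request a URL and load its contents into the Python environment. We then parse the whole HTML file using the html.parser argument. Finally, we print the first few lines of the HTML page.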

# urllib2 exists only in Python 2; Python 3 provides urlopen in urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the HTML file
response = urlopen('https://www.xnip.cn/doc/qcciyeyy')
html_doc = response.read()
# Parse the HTML file
soup = BeautifulSoup(html_doc, 'html.parser')
# Format the parsed HTML file
strhtm = soup.prettify()
# Print the first 225 characters
print(strhtm[:225])

When we execute the above code, it produces the following result.

<!DOCTYPE html>
<!--[if IE 8]><html> <![endif]-->
<!--[if IE 9]><html> <![endif]-->
<!--[if gt IE 9]><!-->
<html>
 <!--<![endif]-->
 <head>
  <!-- Basic -->
  <meta charset="utf-8"/>
  <title>
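
The same parse-and-prettify steps can also be tried without a network connection. The sketch below uses a small inline HTML string in place of the downloaded page; the sample markup and the name sample_html are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here in place of a downloaded document.
sample_html = """<!DOCTYPE html>
<html><head><title>Sample Page</title></head>
<body><p>Hello, <b>world</b>!</p></body></html>"""

# Parse with Python's built-in parser, as in the example above.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the document as a string with one tag per line.
pretty = soup.prettify()

# Print only the beginning of the formatted document.
print(pretty[:60])
```

Because prettify() returns an ordinary string, slicing it limits the output to a fixed number of characters rather than a fixed number of lines.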

Extracting Tag Values

We can extract the value of a tag from its first instance using the following code.

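# urllib2 exists only in Python 2; Python 3 provides urlopen in urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('https://www.xnip.cn/doc/qcciyeyy')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')
# soup.<tag> returns the first matching tag in the document
print(soup.title)
# .string returns the text inside that tag
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)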

When we execute the above code, it produces the following result.

<title>Python Overview</title>
Python Overview
None
Python is Interpreted
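
The None in the output is worth a note. In BeautifulSoup, .string returns text only when a tag has a single child; when a tag contains several children, it returns None. That is one plausible reason the first link prints None here, although the actual page markup is not shown. The sketch below illustrates the behavior on a made-up snippet (sample_html and its contents are assumptions) and shows get_text() as an alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup chosen to show when .string is None.
sample_html = """<html><head><title>Sample Page</title></head>
<body>
<a href="/home"><img src="logo.png"/><span>Home</span></a>
<p>Python is <b>interpreted</b> and <b>interactive</b>.</p>
</body></html>"""

soup = BeautifulSoup(sample_html, "html.parser")

title_text = soup.title.string   # one text child, so its text is returned
link_string = soup.a.string      # two child tags, so .string is None
link_text = soup.a.get_text()    # joins all text nested inside the tag
first_bold = soup.b.string       # attribute access returns the first <b> only
missing = soup.find("h1")        # no <h1> in the sample, so find() gives None

print(title_text, link_string, link_text, first_bold, missing)
```

Checking for None before reading .string on a tag that may be absent avoids an AttributeError.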

Extracting All Tags

We can extract the tag values from all instances of a tag using the following code.

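# urllib2 exists only in Python 2; Python 3 provides urlopen in urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('https://www.xnip.cn/doc/qcciyeyy')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all returns every <b> tag in the document, in order
for x in soup.find_all('b'):
    print(x.string)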

When we execute the above code, it produces the following result.

Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
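
The results of find_all can also be collected into lists instead of being printed one by one. The following is a minimal sketch on a made-up snippet (sample_html and its contents are assumptions, not the page used above); it also shows reading an attribute from each tag and capping the number of matches with the limit argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for a downloaded page.
sample_html = """<html><body>
<p><b>Easy-to-learn</b> and <b>Easy-to-read</b></p>
<p><b>Portable <i>everywhere</i></b></p>
<a href="/docs">Docs</a> <a href="/faq">FAQ</a>
</body></html>"""

soup = BeautifulSoup(sample_html, "html.parser")

# .string is None for the third <b>, which holds text plus a nested <i>.
bold_strings = [tag.string for tag in soup.find_all("b")]

# get_text() gathers nested text too; pieces are stripped and joined by a space.
bold_texts = [tag.get_text(" ", strip=True) for tag in soup.find_all("b")]

# Tag attributes are read with dictionary-style access.
links = [a["href"] for a in soup.find_all("a")]

# limit stops the search after the given number of matches.
first_two = soup.find_all("b", limit=2)

print(bold_texts)
print(links)
```

Using get_text() rather than .string is usually the safer choice when tags may contain nested markup.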