8. Web Crawler in Practice

Xiaoniu Editor (小牛编辑)

2023-12-01

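Example: scrape the news headlines from the Baidu News home page
URL: http://news.baidu.com/
Steps:
- Import the urllib library and the re regular-expression module
- Create a request object with urllib.request.Request()
- Fetch the page with urllib.request.urlopen(), which returns a response object
- Read the body with read() and decode it with decode()
- Parse the result with a regular expression
- Loop over the matches and print them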

The code is as follows:

import urllib.request
import re

url = "http://news.baidu.com/"
# Pretend to be a browser by sending a User-Agent header
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'}
req = urllib.request.Request(url,headers=headers)

# Send the request and get the response object
res = urllib.request.urlopen(req)

# Read the response body and decode it
html = res.read().decode("utf-8")

#print(len(html))
# Use a regular expression to extract the headlines:
# group 1 is the link URL, group 2 is the link text
pat = r'<a href="(.*?)" .*? target="_blank">(.*?)</a>'
dlist = re.findall(pat,html)

# Loop over the results and print "title:url"
for v in dlist:
    print(v[1]+":"+v[0])
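
To see what re.findall returns for this pattern without touching the network, the sketch below runs the same pattern against a small hand-written HTML fragment. The fragment, its example.com URLs, and the helper name extract_links are made up for illustration; the real page markup may differ and can change over time.

```python
import re

# Same pattern as in the crawler: group 1 is the href, group 2 is the link text.
PAT = r'<a href="(.*?)" .*? target="_blank">(.*?)</a>'

# Hypothetical HTML fragment, NOT the real Baidu News markup.
SAMPLE_HTML = '''
<ul>
<li><a href="http://example.com/a" class="t" target="_blank">First headline</a></li>
<li><a href="http://example.com/b" mon="x=1" target="_blank">Second headline</a></li>
<li><a href="http://example.com/c">No target attribute</a></li>
</ul>
'''

def extract_links(html):
    """Return a list of (url, title) tuples, one per matching <a> tag."""
    return re.findall(PAT, html)

links = extract_links(SAMPLE_HTML)
for url, title in links:
    print(title + ":" + url)
```

Because the pattern has two capture groups, each match comes back as a (url, title) tuple, which is why the loop in the crawler indexes v[0] and v[1]. The third link is skipped because it has no target="_blank" attribute. Note also that the pattern expects at least one other attribute between href and target, so a tag written as href="..." target="_blank" with nothing in between would not match.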