Using the Python 3 urllib.request.urlopen() API

萧玮

2023-12-01

Basic usage of the urllib library: the request module

Most of the code in this article comes from the book Python3网络爬虫开发实战.

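The urllib library consists of the following four modules:

request: the most basic HTTP request module; it simulates sending a request.
error: the exception handling module.
parse: a utility module that provides functions for splitting, parsing, and joining URLs.
robotparser: mainly used to parse a site's robots.txt file, which declares crawler permissions, that is, which crawlers the server allows to fetch which pages.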
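To make the role of the parse module more concrete, here is a minimal sketch of splitting, rebuilding, and joining a URL. The example URL and its query string are illustrative assumptions and do not come from the book.

```python
from urllib.parse import urlparse, urlunparse, urljoin

# Split a URL into its components
parts = urlparse('https://www.httpbin.org/get?name=Tom#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Rebuild the original URL from the components
rebuilt = urlunparse(parts)
print(rebuilt)

# Resolve a relative reference against a base URL
joined = urljoin('https://www.httpbin.org/get', 'post')
print(joined)  # https://www.httpbin.org/post
```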

This article records how to use some of the basic API functions of the request module.

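Requesting a web page: urllib.request.urlopen()

Sending a web request directly with urllib.request.urlopen()

API specification:

urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)

Parameters:

url: the URL to request. It can be a string or a urllib.request.Request object.

data: the data sent to the given URL with the request. When this parameter is given, the request method becomes POST; when it is omitted, the method is GET. The value must be converted to bytes with bytes() before it is passed, as the example further down shows.

timeout: the timeout in seconds. If no response is received within this time, an exception is raised.

cafile and capath specify a CA certificate file and a directory of CA certificate files, respectively, used to verify HTTPS requests. These two parameters and cadefault have been deprecated since Python 3.6 in favor of context, and were removed in Python 3.13. cadefault and context are not covered here.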

Usage example:
import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print(type(response))  # print the type of the response object
print(response.read().decode('utf-8'))  # print the HTML source of the page
urlopen() stores the server's response in response. Printing the type of this object shows that it is an http.client.HTTPResponse.
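Besides read(), the HTTPResponse object also exposes the status code and the response headers. The following is a minimal sketch of reading them; the helper name describe_response is hypothetical and introduced only for illustration.

```python
import urllib.request


def describe_response(url):
    """Send a GET request and return (status, content type, body text)."""
    # The with statement closes the connection once the body has been read
    with urllib.request.urlopen(url) as response:
        status = response.status                           # e.g. 200
        content_type = response.getheader('Content-Type')  # a single header
        body = response.read().decode('utf-8')             # decoded body
    return status, content_type, body
```

Calling describe_response('https://www.baidu.com') would return the status code, the Content-Type header, and the page source as a tuple, assuming the page is served as UTF-8.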

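To send data with the request, use the data parameter.

Usage example:
import urllib.request
import urllib.parse

dic = {
    'name': 'Tom'
}

# URL-encode the dictionary, then convert the resulting string to bytes
data = bytes(urllib.parse.urlencode(dic), encoding='utf-8')

# Passing data turns this into a POST request
response = urllib.request.urlopen('https://www.httpbin.org/post', data=data)
print(response.read().decode('utf-8'))
The dictionary passed through data must first be converted to a string with urllib.parse.urlencode() and then converted to bytes with bytes().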
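The two conversion steps can be inspected without sending a request. This is a small sketch; the extra 'city' field is an assumption added only to show how spaces are encoded.

```python
from urllib.parse import urlencode

form = {'name': 'Tom', 'city': 'New York'}

# Step 1: dictionary -> URL-encoded string
encoded = urlencode(form)
print(encoded)  # name=Tom&city=New+York

# Step 2: string -> bytes, the type that the data parameter expects
data = encoded.encode('utf-8')
print(data)  # b'name=Tom&city=New+York'
```

encoded.encode('utf-8') is equivalent to bytes(encoded, encoding='utf-8'); either form works.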

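timeout: specifies the timeout in seconds.

response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)

If no response arrives from the server within 0.01 seconds, an exception is raised, usually a urllib.error.URLError whose reason attribute is a socket.timeout.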
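In practice the timeout exception is usually caught so that the program can skip the page or retry. Below is a minimal sketch, assuming that returning None on a timeout is acceptable; the helper name fetch_or_none is hypothetical. A timeout while connecting arrives wrapped in URLError, while a timeout while waiting for the response can surface as socket.timeout directly, so the sketch handles both.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, seconds):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=seconds) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeout while connecting: URLError wraps the socket.timeout
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL error is not handled here
    except socket.timeout:
        # Timeout while waiting for or reading the response
        return None
```

For example, fetch_or_none('https://www.baidu.com', 0.01) would most likely return None, because 0.01 seconds is rarely enough to receive a response.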

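As this shows, urlopen() accepts only a few parameters, so a request made by passing just a URL cannot set request headers such as User-Agent.

Building a more complete request: use a urllib.request.Request object. This object encapsulates the request, including its headers. With a Request object the headers can be set separately, instead of only passing a URL as in the previous approach.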

Request constructor:
class urllib.request.Request(url, data=None, headers={},
            origin_req_host=None, unverifiable=False, method=None)
Usage example:
from urllib import request, parse

url = 'https://www.httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'www.httpbin.org'
}
form = {'name': 'Tom'}  # form fields to send
data = bytes(parse.urlencode(form), encoding='utf-8')
# Build the request from the URL, body, headers, and method
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
A Request object is constructed first and then passed as the argument to urlopen().
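A Request object can also be inspected and modified before it is sent. The sketch below adds a header after construction with add_header() and checks which method would be used; it sends nothing over the network. The variable names are chosen for illustration.

```python
from urllib import parse, request

data = parse.urlencode({'name': 'Tom'}).encode('utf-8')
req = request.Request('https://www.httpbin.org/post', data=data)

# Headers can be added after the object has been created
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

# With data set and no explicit method, the request defaults to POST
method = req.get_method()
print(method)  # POST

# Header names are stored capitalized, so look up 'User-agent'
agent = req.get_header('User-agent')
print(agent)
```

Adding headers one at a time like this can be convenient when they are determined at run time rather than known up front.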

References

崔庆才 (Cui Qingcai), Python3网络爬虫开发实战.
