9. Exception Handling in Web Crawlers
小牛编辑 · 2023-12-01
When an exception occurs while a web crawler is running, an unhandled error terminates the program and interrupts the crawl, so exception handling matters a great deal.
urllib.error receives the exceptions raised by urllib.request. It defines two classes, URLError and HTTPError. URLError has one attribute: reason, which returns the cause of the error.
# Test URLError exception handling
from urllib import request
from urllib import error

url = "http://www.wer3214e13wer3.com/"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except error.URLError as e:
    print(e.reason)  # print the error message
print("ok")
- The error is reported, but the program keeps running:
[Errno 8] nodename nor servname provided, or not known
ok
HTTPError has three attributes: code returns the HTTP status code, e.g. 404; reason returns the cause of the error; headers returns the response headers.
# Test HTTPError exception handling
from urllib import request
from urllib import error

url = "https://img-ads.csdn.net/2018/20180420184005werqwefsd9410.png"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except error.HTTPError as e:
    print(e.reason)  # print the error message
    print(e.code)    # print the HTTP status code
print("ok")
- The error reported:
Not Found
404
ok
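Since code, reason, and headers all live on the exception object itself, they can also be inspected without a network call; a minimal sketch that builds an HTTPError by hand (the URL and header values here are made up for illustration):

```python
from urllib import error
import io

# Hand-built HTTPError; url and hdrs are made-up illustration values.
e = error.HTTPError(
    url="http://example.com/missing.png",
    code=404,
    msg="Not Found",
    hdrs={"Content-Type": "text/html"},
    fp=io.BytesIO(b""),
)
print(e.code)    # 404
print(e.reason)  # Not Found
print(e.headers)
```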
URLError is a subclass of OSError, and HTTPError is a subclass of URLError. Note: when catching both, the parent class must come last, so the HTTPError clause goes before the URLError clause:
from urllib import request
from urllib import error

#url = "https://img-ads.csdn.net/2018/20180420184005werqwefsd9410.png"
url = "http://www.wer3214e13wer3.com/"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except error.HTTPError as e:
    print("HTTPError")
    print(e.reason)
    print(e.code)
except error.URLError as e:
    print("URLError")
    print(e.reason)  # print the error message
print("ok")
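The ordering rule above follows directly from the class hierarchy, which can be checked without any network access:

```python
from urllib import error

# except clauses are tried top to bottom; since HTTPError is a
# subclass of URLError, a URLError clause placed first would also
# catch every HTTPError, so the subclass clause must come first.
print(issubclass(error.HTTPError, error.URLError))  # True
print(issubclass(error.URLError, OSError))          # True
```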
- Handle the error whatever it is:
from urllib import request
from urllib import error

url = "https://img-ads.csdn.net/2018/20180420184005werqwefsd9410.png"
#url = "http://www.wer3214e13wer3.com/"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except Exception as e:
    if hasattr(e, 'reason'):
        print(e.reason)
    if hasattr(e, 'code'):
        print(e.code)
print("ok")
Not Found
404
ok
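In a real crawler, the point of catching these exceptions is to keep the run alive after one bad URL; a hedged sketch of a small fetch helper that swallows both error types (the helper name and the 5-second timeout are my own choices, not from the original):

```python
from urllib import request, error

def fetch(url, timeout=5):
    """Return the decoded page body, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except error.HTTPError as e:   # subclass first
        print("HTTPError:", e.code, e.reason)
    except error.URLError as e:    # then the parent class
        print("URLError:", e.reason)
    return None

# The loop keeps going no matter how each URL fails:
for url in ["http://www.wer3214e13wer3.com/"]:
    body = fetch(url)
    print(url, "->", "ok" if body else "skipped")
```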