9. Exception Handling in Web Crawlers

小牛编辑

2023-12-01

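If an exception raised while a web crawler is running goes unhandled, the program terminates with an error and the crawl is interrupted, so exception handling is essential.
The urllib.error module defines the exceptions raised by urllib.request. Its two main classes are URLError and HTTPError.
URLError has one attribute: reason, which gives the cause of the error.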

# Test exception handling for URLError
from urllib import request
from urllib import error

url = "http://www.wer3214e13wer3.com/"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except error.URLError as e:
    print(e.reason)  # print the reason for the error

print("ok")

The error is reported, but the program keeps running:

[Errno 8] nodename nor servname provided, or not known
ok
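
The reason attribute is not always a plain string. For a failed DNS lookup it is typically the underlying socket exception, which is why the message above starts with [Errno 8]. The offline sketch below builds such a URLError by hand to show what e.reason holds; the hand-built exception and the helper name reason_text are assumptions for illustration, and no network request is made.

```python
import socket
from urllib import error

# Hand-built stand-in for the exception a failed DNS lookup produces
# (assumption: illustrative only, no network request is made).
dns_failure = socket.gaierror(8, "nodename nor servname provided, or not known")
simulated = error.URLError(dns_failure)


def reason_text(exc):
    """Return the reason of a URLError as a string, whatever its type."""
    return str(exc.reason)


try:
    raise simulated
except error.URLError as e:
    print(type(e.reason).__name__)  # class of the wrapped exception
    print(reason_text(e))           # human-readable message
```

Converting the reason with str() keeps logging code simple regardless of whether the reason is a string or an exception object.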

HTTPError has three attributes: code gives the HTTP status code (for example 404), reason gives the cause of the error, and headers gives the HTTP response headers.

# Test exception handling for HTTPError
from urllib import request
from urllib import error

url = "https://img-ads.csdn.net/2018/20180420184005werqwefsd9410.png"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except error.HTTPError as e:
    print(e.reason)  # print the reason for the error
    print(e.code)    # print the HTTP status code

print("ok")

Output:

Not Found
404
ok
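
The example above prints reason and code but not headers. As a minimal offline sketch, the snippet below constructs an HTTPError by hand and reads all three attributes. The URL, header value, and body are hypothetical values chosen for illustration, and summarize is a helper name introduced here.

```python
import io
from email.message import Message
from urllib import error

# Hand-built 404 response (assumption: illustrative values, no request is sent).
hdrs = Message()
hdrs["Content-Type"] = "text/html"
simulated = error.HTTPError(
    "http://example.invalid/missing.png",  # hypothetical URL
    404,
    "Not Found",
    hdrs,
    io.BytesIO(b"<h1>Not Found</h1>"),
)


def summarize(exc):
    """Collect code, reason, and one response header from an HTTPError."""
    return exc.code, exc.reason, exc.headers["Content-Type"]


try:
    raise simulated
except error.HTTPError as e:
    print(summarize(e))
    print(e.read())  # the error body can be read like a normal response
```

Because HTTPError also behaves like a response object, a crawler can inspect the headers or body of an error page before deciding whether to retry.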

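URLError is a subclass of OSError, and HTTPError is a subclass of URLError.
Note: when catching both, the except clause for the parent class (URLError) must come after the one for the subclass (HTTPError). Otherwise the URLError clause would also catch every HTTPError and the HTTPError clause would never run.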

from urllib import request
from urllib import error

#url = "https://img-ads.csdn.net/2018/20180420184005werqwefsd9410.png"
url = "http://www.wer3214e13wer3.com/"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except error.HTTPError as e:
    print("HTTPError")
    print(e.reason)
    print(e.code)
except error.URLError as e:
    print("URLError")
    print(e.reason)  # print the reason for the error

print("ok")
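
The effect of clause order can be checked without any network access. The sketch below verifies the inheritance chain and compares the two orderings on a hand-built HTTPError; the function names classify and classify_wrong_order and the example URL are assumptions introduced for illustration.

```python
import io
from email.message import Message
from urllib import error

# The inheritance chain that makes the order of except clauses matter.
print(issubclass(error.HTTPError, error.URLError))
print(issubclass(error.URLError, OSError))


def classify(exc):
    """Re-raise exc and report which except clause handles it."""
    try:
        raise exc
    except error.HTTPError:
        return "HTTPError"
    except error.URLError:
        return "URLError"


def classify_wrong_order(exc):
    """Same as classify, but with the parent class listed first."""
    try:
        raise exc
    except error.URLError:
        return "URLError"
    except error.HTTPError:  # never reached: URLError already matched
        return "HTTPError"


# Hand-built exception (assumption: hypothetical URL, no request is sent).
not_found = error.HTTPError(
    "http://example.invalid/", 404, "Not Found", Message(), io.BytesIO(b"")
)
print(classify(not_found))
print(classify_wrong_order(not_found))
```

With the parent class first, the 404 is reported as a generic URLError, so the code that relies on e.code would be skipped.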

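To handle any error with a single except clause, regardless of its type, catch Exception and check which attributes are present: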

from urllib import request
from urllib import error

url = "https://img-ads.csdn.net/2018/20180420184005werqwefsd9410.png"
#url = "http://www.wer3214e13wer3.com/"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except Exception as e:
    if hasattr(e,'reason'):
        print(e.reason)

    if hasattr(e,'code'):
        print(e.code)

print("ok")

Not Found
404
ok
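
Putting the pieces together, a crawler usually wants to record a failure and move on to the next URL rather than stop. The following is a minimal sketch of that idea, not a definitive implementation: the function name crawl, its opener parameter (added so the loop can be exercised without a network), and the 10-second default timeout are all assumptions.

```python
from urllib import error, request


def crawl(urls, opener=request.urlopen, timeout=10):
    """Fetch each URL in turn; record failures instead of stopping.

    opener is a hypothetical injection point: any callable with the same
    (url, timeout=...) signature as urllib.request.urlopen can be passed.
    """
    pages, failures = {}, {}
    for url in urls:
        try:
            with opener(url, timeout=timeout) as response:
                pages[url] = response.read()
        except error.HTTPError as e:  # subclass first
            failures[url] = f"HTTP {e.code}: {e.reason}"
        except error.URLError as e:   # parent class after
            failures[url] = f"URL error: {e.reason}"
    return pages, failures
```

Calling crawl with a list of URLs returns two dictionaries: the raw bodies of the pages that loaded and a short error description for each URL that failed, so one bad link no longer interrupts the rest of the run.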