9. Exception Handling in Web Crawlers

小牛编辑
2023-12-01
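  • If an exception raised while a web crawler is running goes unhandled, the program terminates with an error and the crawl is cut short, so exception handling matters.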

  • urllib.error defines the exceptions raised by urllib.request. Its two main classes are URLError and HTTPError.

  • URLError has one attribute, reason, which gives the cause of the error.

# Test exception handling for URLError
from urllib import request
from urllib import error

url = "http://www.wer3214e13wer3.com/"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except error.URLError as e:
    print(e.reason)  # print the error reason

print("ok")
  • The error is printed, but the program keeps running:
[Errno 8] nodename nor servname provided, or not known
ok
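The same idea extends to a crawl over several URLs: catching the exception per URL lets the loop move on instead of stopping at the first failure. The following is a minimal sketch, not a definitive implementation. The names `crawl` and `fetch_length`, the 10-second timeout, and the choice to record failures in a dict are all assumptions made for illustration.

```python
from urllib import error, request


def fetch_length(url):
    """Return the number of characters in the decoded page at `url`."""
    # The 10-second timeout is an illustrative choice, not a recommendation.
    with request.urlopen(url, timeout=10) as response:
        return len(response.read().decode("utf-8"))


def crawl(urls, fetch=fetch_length):
    """Fetch every URL, recording failures instead of stopping at the first one."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except error.URLError as e:
            # URLError and its subclass HTTPError both carry a `reason`.
            failures[url] = str(e.reason)
    return results, failures
```

Passing the fetch function as a parameter is a design choice that makes the loop easy to try out with a stand-in function that raises URLError, without touching the network.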
  • HTTPError has three attributes: code, the HTTP status code (for example 404); reason, the cause of the error; and headers, the HTTP response headers.
# Test exception handling for HTTPError
from urllib import request
from urllib import error

url = "https://img-ads.csdn.net/2018/20180420184005werqwefsd9410.png"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except error.HTTPError as e:
    print(e.reason)  # print the error reason
    print(e.code)    # print the HTTP status code

print("ok")
  • The reported error:
Not Found
404
ok
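To look at all three HTTPError attributes without depending on a live server, one option is to build the exception by hand. This is only a sketch: the helper name `describe_http_error`, the example URL, and the `Content-Type` header value are assumptions, and a real response would carry whatever headers the server sent.

```python
import io
from email.message import Message
from urllib import error


def describe_http_error(e):
    """Collect the three attributes of an HTTPError into a dict."""
    return {
        "code": e.code,      # HTTP status code, e.g. 404
        "reason": e.reason,  # short explanation, e.g. "Not Found"
        # headers holds the response headers; read one of them by name
        "content_type": e.headers.get("Content-Type"),
    }


# Build an HTTPError by hand so the example runs without network access.
headers = Message()
headers["Content-Type"] = "text/html"
sample = error.HTTPError(
    "http://example.com/missing", 404, "Not Found", headers, io.BytesIO(b"")
)
info = describe_http_error(sample)
print(info)
```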
  • URLError is a subclass of OSError, and HTTPError is a subclass of URLError:

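  • Note: the except clause for the parent class must come last. If URLError were listed first it would also catch every HTTPError, and the HTTPError clause would never run: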

from urllib import request
from urllib import error

#url = "https://img-ads.csdn.net/2018/20180420184005werqwefsd9410.png"
url = "http://www.wer3214e13wer3.com/"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except error.HTTPError as e:
    print("HTTPError")
    print(e.reason)
    print(e.code)
except error.URLError as e:
    print("URLError")
    print(e.reason)  # print the error reason

print("ok")
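The effect of clause order can be checked offline by raising hand-built exceptions. The sketch below is illustrative only; `classify` and `classify_parent_first` are hypothetical helper names, and the example URL and messages are made up.

```python
import io
from email.message import Message
from urllib import error


def classify(exc):
    """Subclass first: HTTPError is checked before its parent URLError."""
    try:
        raise exc
    except error.HTTPError:
        return "HTTPError"
    except error.URLError:
        return "URLError"


def classify_parent_first(exc):
    """Parent first: the HTTPError clause below can never run."""
    try:
        raise exc
    except error.URLError:
        return "URLError"
    except error.HTTPError:  # unreachable, URLError already matched
        return "HTTPError"


http_error = error.HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), io.BytesIO(b"")
)
url_error = error.URLError("name or service not known")

# Confirm the inheritance chain described above.
print(issubclass(error.URLError, OSError))
print(issubclass(error.HTTPError, error.URLError))

# Compare the two clause orders on the same exceptions.
print(classify(http_error), classify(url_error))
print(classify_parent_first(http_error))
```

With the parent class first, an HTTPError is reported as a plain URLError, so the status code is never printed.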
  • Catching every error, whatever its type:
from urllib import request
from urllib import error

url = "https://img-ads.csdn.net/2018/20180420184005werqwefsd9410.png"
#url = "http://www.wer3214e13wer3.com/"
req = request.Request(url)
try:
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(len(html))
except Exception as e:
    if hasattr(e,'reason'):
        print(e.reason)

    if hasattr(e,'code'):
        print(e.code)

print("ok")
Not Found
404
ok
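The two hasattr checks can also be written with getattr and a default value. This is a sketch under assumptions: `summarize` is a hypothetical helper, and returning None for a missing attribute is one possible convention. Note that `except Exception` catches more than urllib errors. For example, decoding binary data such as a PNG as UTF-8 raises UnicodeDecodeError, which also happens to have a reason attribute.

```python
import io
from email.message import Message
from urllib import error


def summarize(exc):
    """Return (reason, code); either is None when the exception lacks it."""
    return getattr(exc, "reason", None), getattr(exc, "code", None)


http_error = error.HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), io.BytesIO(b"")
)
url_error = error.URLError("timed out")

# A decoding failure is not a urllib error, yet `except Exception` catches it too.
try:
    b"\x89PNG".decode("utf-8")
except UnicodeDecodeError as e:
    decode_error = e

for exc in (http_error, url_error, decode_error, ValueError("unrelated")):
    print(type(exc).__name__, summarize(exc))
```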