I am trying to download a PDF file using the requests module. My code is below:
import requests
url = ""
r = requests.get(url, stream=True, timeout=(60, 120), headers={'Connection': 'keep-alive','User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136'})
print(r.headers)
print(r.status_code)
try:
    with open('blah.pdf', 'wb') as f:
        for chunk in r:
            # print(chunk)
            f.write(chunk)
except Exception as e:
    print(e)
The output is:
{'Cache-Control': 'private', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/pdf', 'Server': 'Microsoft-IIS/7.5', 'X-AspNet-Version': '4.0.30319', 'X-Powered-By': 'ASP.NET', 'Date': 'Wed, 02 Oct 2019 05:17:11 GMT', 'Set-Cookie': 'bbb=rd102o00000000000000000000ffff978433aao80; path=/; Httponly; Secure'}
200
('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
Here is the full stack trace:
Traceback (most recent call last):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 755, in read_chunked
    chunk = self._handle_chunk(amt)
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 709, in _handle_chunk
    self._fp._safe_read(2)  # Toss the CRLF at the end of the chunk.
  File "/storage/anaconda3/lib/python3.7/http/client.py", line 612, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/storage/anaconda3/lib/python3.7/site-packages/requests/models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 560, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
    self._original_response.close()
  File "/storage/anaconda3/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    for chunk in r:
  File "/storage/anaconda3/lib/python3.7/site-packages/requests/models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
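For context, the "2 more expected" in the trace refers to the CRLF that must follow each chunk's data in a chunked response; the `_handle_chunk` frame is reading exactly those two bytes and gets nothing. The following is a minimal sketch of a chunked-body decoder that reproduces this situation. It is not urllib3's implementation: `decode_chunked` and the sample bodies are hypothetical and only illustrate how a body that stops right after a chunk's data leads to this kind of error.

```python
def decode_chunked(raw: bytes):
    """Decode an HTTP/1.1 chunked body.

    Returns (payload, error), where error is None for a well-formed body
    and a short message describing where decoding stopped otherwise.
    """
    payload = bytearray()
    pos = 0
    while True:
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            return bytes(payload), "missing chunk-size line"
        # The chunk size is hexadecimal; drop any chunk extension after ';'.
        size = int(raw[pos:line_end].split(b";")[0], 16)
        pos = line_end + 2
        if size == 0:
            return bytes(payload), None  # zero-length chunk ends the body
        data = raw[pos:pos + size]
        payload += data
        if len(data) < size:
            return bytes(payload), f"chunk truncated, {size - len(data)} more expected"
        pos += size
        # Every chunk's data must be followed by CRLF.
        crlf = raw[pos:pos + 2]
        if crlf != b"\r\n":
            return bytes(payload), f"{len(crlf)} bytes read, {2 - len(crlf)} more expected"
        pos += 2


# A well-formed body, and one that ends right after the last chunk's data.
good_body = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
bad_body = b"4\r\nWiki\r\n5\r\npedia"

good_payload, good_error = decode_chunked(good_body)
bad_payload, bad_error = decode_chunked(bad_body)
print(good_payload, good_error)
print(bad_payload, bad_error)
```

In this sketch the lenient decoder still returns the bytes it saw before the framing error. Whether the server in question sends the complete PDF before breaking the framing is an assumption that I have not confirmed.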
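When I open the PDF in a web browser such as Google Chrome, Chrome's built-in PDF plugin loads it correctly and I can read it in the browser. However, if I try to download it by clicking the download icon, it fails with "Failed - Network Error". Firefox cannot load or download it at all. (Both Firefox and Chrome are upgraded to the latest version.) When I tested on a Windows machine, Microsoft Edge was able to download the PDF, but...
I also tried several command-line tools, such as curl, wget and aria2c (with headers set to match the browser request). None of them could download the PDF.
wget output: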
connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/pdf]
Saving to: ‘blah.pdf’
[ <=> ] 101.68K 66.1KB/s in 1.5s
2019-10-02 11:29:50 (69.1 KB/s) - Read error at byte 108786 (Success).
The file downloaded with wget is corrupted.
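To make "corrupted" more concrete, a quick heuristic check can tell whether a partial download is merely missing its tail. This is a rough sketch, not a PDF validator: `looks_complete_pdf` is a hypothetical helper, and it relies only on the fact that a PDF file starts with a `%PDF-` header and normally ends with a `%%EOF` marker.

```python
def looks_complete_pdf(data: bytes) -> bool:
    """Heuristic: header at the start and %%EOF marker near the end."""
    has_header = data.startswith(b"%PDF-")
    # Some writers add whitespace or a few bytes after the marker,
    # so search the last kilobyte rather than the final bytes only.
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer


# Tiny synthetic samples (not real PDFs) to show both outcomes.
complete_sample = b"%PDF-1.4\n...objects...\nstartxref\n0\n%%EOF\n"
truncated_sample = b"%PDF-1.4\n...objects..."

print(looks_complete_pdf(complete_sample))
print(looks_complete_pdf(truncated_sample))

# Usage on a downloaded file (path taken from the wget run above):
# with open("blah.pdf", "rb") as f:
#     print(looks_complete_pdf(f.read()))
```

If the check fails only on the trailer, the download was probably cut short rather than garbled throughout.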
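Another thing I tried was inspecting the response with mitm and a chromedriver + Selenium combination.
The automated Chrome browser failed to load the PDF and showed this error: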
502 Bad Gateway
HttpSyntaxException('Malformed chunked body',)
How can I download this PDF file using the requests module? Any help would be greatly appreciated.
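One direction that could be tried is to keep whatever bytes arrive before the stream breaks instead of discarding them. The following is a minimal sketch under stated assumptions: `save_stream` and `flaky_chunks` are hypothetical names, the stream is simulated so the example runs without a network, and it is not verified that requests hands over the bytes of the final chunk before raising, so the saved file may still be missing its tail.

```python
import os
import tempfile


def save_stream(chunks, path, tolerated=(Exception,)):
    """Write chunks to path, keeping what arrived if a tolerated error ends the stream.

    Returns (bytes_written, error), where error is None on a clean finish.
    """
    written = 0
    error = None
    with open(path, "wb") as f:
        try:
            for chunk in chunks:
                f.write(chunk)
                written += len(chunk)
        except tolerated as exc:
            error = exc  # keep the partial file and report the failure
    return written, error


def flaky_chunks():
    """Simulated stream: two chunks, then a broken connection."""
    yield b"%PDF-1.4\n"
    yield b"body"
    raise ConnectionError("simulated broken stream")


demo_path = os.path.join(tempfile.mkdtemp(), "demo.pdf")
written, error = save_stream(flaky_chunks(), demo_path, tolerated=(ConnectionError,))
print(written, repr(error))

# With requests, the call might look like this (untested against this server):
# import requests
# r = requests.get(url, stream=True, timeout=(60, 120))
# written, error = save_stream(
#     r.iter_content(chunk_size=8192), "blah.pdf",
#     tolerated=(requests.exceptions.ChunkedEncodingError,))
```

Returning the error instead of swallowing it lets the caller decide whether the partial file is usable.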