IncompleteRead using httplib

闽涵蓄

2023-03-14

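Question:

I have had a persistent problem fetching an RSS feed from one particular website. I ended up writing a rather ugly procedure to do the job, but I am curious why this happens and whether any higher-level interface handles it properly. It is not a showstopper, since I do not need to retrieve the feed very often.

I have read about a solution that catches the exception and returns the partial content, but since the incomplete reads differ in the number of bytes actually retrieved, I am not sure that such a solution would really work.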

#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead

url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'

content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)

print "If you see this, please tell me what happened."

# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using mechanize", e

# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using urllib2", e


# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead, e:
    print "IncompleteRead using requests", e

# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to 
# learn what's happening.  Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    read_it = True
    content = ''
    while read_it:
        try:
            content += response.read(1)
        except IncompleteRead:
            read_it = False
    return content, response.info()


content, info = get_rss_feed(url)

feed = feedparser.parse(content)

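As I said, this is not mission critical, just a curiosity. I can expect urllib2 to have this problem, but I am surprised to see the same error with mechanize and requests as well. The feedparser module does not even raise an error, so checking for errors depends on the presence of the 'bozo_exception' key.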

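Edit: I just wanted to mention that both wget and curl perform the function flawlessly, retrieving the full payload correctly every time. Apart from my ugly hack, I have yet to find a pure Python method that works, and I am very curious to know what is happening on the back end of httplib. On a lark, I decided to also try this with twill the other day and got the same httplib error.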

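P.S. There is one more thing that strikes me as very odd. The IncompleteRead always happens at one of two breakpoints in the payload. feedparser and requests seem to fail after reading 926 bytes, while mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I have no explanation for it.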

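Answer:

At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib, which is where the exception is being thrown.

First things first: I also downloaded this feed with wget, and the resulting file was 1854 bytes. Next, I tried urllib2: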

>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
 'Content-Type: text/xml; charset=utf-8\r\n',
 'Server: Microsoft-IIS/7.5\r\n',
 'X-AspNet-Version: 4.0.30319\r\n',
 'X-Powered-By: ASP.NET\r\n',
 'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
 'Via: 1.1 BC1-ACLD\r\n',
 'Transfer-Encoding: chunked\r\n',
 'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)

So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes, it works:

>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'

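Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents: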

>>> import httplib
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> print contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
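
For readers on Python 3, where httplib was renamed http.client, the same idea can be sketched offline. The example below is only an illustration under stated assumptions: FakeSocket and read_body are hypothetical helpers, not part of any library, and the canned response assumes that the server's fault is a chunked body that ends without its terminating zero-length chunk, which the original post does not confirm.

```python
import http.client
import io


class FakeSocket:
    """Hypothetical stand-in for a socket that replays canned bytes."""

    def __init__(self, data):
        self._data = data

    def makefile(self, *args, **kwargs):
        # HTTPResponse only needs a readable binary file object.
        return io.BytesIO(self._data)


def read_body(response):
    """Return (body, complete); keep the partial data on IncompleteRead."""
    try:
        return response.read(), True
    except http.client.IncompleteRead as exc:
        return exc.partial, False


# Assumed server behavior: a chunked response that ends without the
# final "0\r\n\r\n" chunk before the connection is closed.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"5\r\nhello\r\n"
)

response = http.client.HTTPResponse(FakeSocket(RAW))
response.begin()  # parse the status line and headers
body, complete = read_body(response)
print(body, complete)
```

The boolean lets the caller decide whether a truncated body is acceptable. For the feed discussed here, the partial data happened to be the whole document.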

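This blog post suggests that this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes: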

import httplib

def patch_http_response_read(func):
    def inner(*args, **kwargs):
        try:
            # Forward keyword arguments too, so read(amt=...) keeps working.
            return func(*args, **kwargs)
        except httplib.IncompleteRead, e:
            return e.partial

    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)

I applied the patch, and then feedparser worked:

>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
 'encoding': 'utf-8',
 'entries': ...
 'status': 200,
 'version': 'rss20'}
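
Because assigning to httplib.HTTPResponse.read changes behavior for every HTTP call in the process, a narrower variant may be preferable. The following is a minimal Python 3 sketch (http.client is the Python 3 name of httplib) that applies the same patch only inside a with block. tolerate_incomplete_read and CannedSocket are hypothetical names introduced for illustration, and the canned response is an assumed stand-in for the misbehaving server.

```python
import contextlib
import functools
import http.client
import io


@contextlib.contextmanager
def tolerate_incomplete_read():
    """Temporarily make HTTPResponse.read return partial data."""
    original = http.client.HTTPResponse.read

    @functools.wraps(original)
    def patched(self, *args, **kwargs):
        try:
            return original(self, *args, **kwargs)
        except http.client.IncompleteRead as exc:
            return exc.partial

    http.client.HTTPResponse.read = patched
    try:
        yield
    finally:
        # Always restore the original method, even if the body raised.
        http.client.HTTPResponse.read = original


class CannedSocket:
    """Hypothetical in-memory socket used only for this demonstration."""

    def __init__(self, data):
        self._data = data

    def makefile(self, *args, **kwargs):
        return io.BytesIO(self._data)


# Assumed fault: chunked body with no terminating zero-length chunk.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"5\r\nhello\r\n"
)

original_read = http.client.HTTPResponse.read
with tolerate_incomplete_read():
    response = http.client.HTTPResponse(CannedSocket(RAW))
    response.begin()
    patched_body = response.read()  # partial data instead of an exception
restored = http.client.HTTPResponse.read is original_read
```

Only read() is wrapped in this sketch; other methods such as readinto() would still raise. Outside the with block, truncated responses raise IncompleteRead as usual, so genuinely broken downloads elsewhere in the program are not silently accepted.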

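This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocol to say for sure whether the server is doing something wrong or whether httplib is mishandling an edge case.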
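
To make that question concrete, here is a rough sketch of chunked transfer framing as defined by the HTTP/1.1 specification: each chunk is a hexadecimal size line, the data, and a CRLF, and the body ends with a zero-size chunk. decode_chunked is a hypothetical helper written for this illustration, not httplib's actual code, and the sample bytes are made up. It assumes, without confirmation from the server in question, that a missing terminator is what triggers the error.

```python
def decode_chunked(raw):
    """Decode a chunked HTTP body; return (body, saw_terminator).

    Simplified sketch: trailer headers after the last chunk are ignored.
    """
    body = b""
    pos = 0
    while True:
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            # The stream ended where a chunk-size line was expected.
            return body, False
        # The chunk size is hexadecimal; drop any ";extension" suffix.
        size = int(raw[pos:line_end].split(b";")[0], 16)
        if size == 0:
            # Zero-size chunk: the sender says the body is complete.
            return body, True
        start = line_end + 2
        chunk = raw[start:start + size]
        body += chunk
        if len(chunk) < size:
            # The stream ended in the middle of a chunk.
            return body, False
        pos = start + size + 2  # skip the data and its trailing CRLF


# Made-up sample: two chunks followed by the terminating chunk.
complete = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
# The same body with the final "0\r\n\r\n" removed.
truncated = complete[:-5]

print(decode_chunked(complete))
print(decode_chunked(truncated))
```

Both calls recover the same payload; only the flag differs. That would be consistent with the behavior seen above, where all 1854 bytes arrive and the error is raised afterwards.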
