Fetching Web Page Data with Python httplib2 (Basic Usage)

章哲彦

2023-12-01

Installing httplib2

Command: pip install httplib2

C:\Users\yulei10>pip install httplib2
Collecting httplib2
  Downloading httplib2-0.10.3.tar.gz (204kB)
Installing collected packages: httplib2
  Running setup.py install for httplib2 ... done
Successfully installed httplib2-0.10.3

Fetching page content

import httplib2
hObj = httplib2.Http('test_result')  # fetched responses are cached under the test_result directory
response, content = hObj.request('http://www.w3.org/2005/Atom')

hObj is an Http object. After the request, the following file is created in the test_result directory:

www.w3.org,2005,Atom,26412cbca625df198e21ace51847c6bb

Open the file www.w3.org,2005,Atom,26412cbca625df198e21ace51847c6bb in Notepad to view its contents:

status: 200
date: Sun, 18 Mar 2018 07:07:31 GMT
content-location: Atom.html
vary: negotiate,Accept-Encoding,upgrade-insecure-requests
tcn: choice
last-modified: Sat, 13 Oct 2007 02:19:32 GMT
etag: "90a-43c56773a3500;4bc4eec134740-gzip"
cache-control: max-age=21600
expires: Sun, 18 Mar 2018 13:07:31 GMT
p3p: policyref="http://www.w3.org/2014/08/p3p.xml"
content-length: 2314
content-type: text/html; charset=utf-8
accept-ranges: bytes
age: 4941
via: 1.1 wsg-cn-3.hikvision.com
-content-encoding: gzip
-varied-accept-encoding: gzip, deflate

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator" content=
  "HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org" />

  <title>Atom Syndication Format namespace</title>
  <link rel="stylesheet" type="text/css" href=
  "http://www.w3.org/StyleSheets/TR/base" />
</head>

<body>
  <div class="head">
    <a href="../"><img src="http://www.w3.org/Icons/w3c_home" alt=
    "W3C" /></a> <a href="http://www.ietf.org/"><img src=
    "ietf_logo.png" alt="IETF" /></a>
.........
.........
.........
(remainder omitted)

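The main interface to httplib2 is the Http object. Always pass a directory name when creating an Http object.
The directory does not need to exist beforehand; httplib2 creates it if necessary.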

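Once we have an Http object, retrieving data is as simple as calling its request() method with the address of the data we want. This issues an HTTP GET request for that URL.
The request() method returns two values:
1. The first is an httplib2 Response object, which contains all the HTTP headers the server returned. For example, a status code of 200 indicates that the request succeeded.
2. The content variable holds the actual data returned by the HTTP server. The data is returned as a bytes object, not a string. If we want it as a string, we need to determine the character encoding and convert it ourselves.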
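
The conversion described in item 2 can be sketched as follows. This is a minimal illustration and not part of httplib2: the helper names charset_from_content_type and decode_body are hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper: falls back to `default` (an assumption) when
    the header is missing or carries no charset parameter.
    """
    if not content_type:
        return default
    msg = Message()
    msg["content-type"] = content_type
    # get_content_charset() lower-cases the charset it finds.
    return msg.get_content_charset(failobj=default)


def decode_body(content, content_type=None, default="utf-8"):
    """Decode a bytes body into str using the declared charset."""
    encoding = charset_from_content_type(content_type, default)
    # errors="replace" keeps undecodable bytes from raising an exception.
    return content.decode(encoding, errors="replace")
```

With the objects from the earlier example, a call such as decode_body(content, response.get('content-type')) should return the page as a str, since the Response object behaves like a dictionary of headers.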

httplib2 caching

To use the cache, always create the httplib2 Http object with a directory name.

import httplib2
httplib2.debuglevel = 1
hObj = httplib2.Http('test_result')
response, content = hObj.request('http://www.w3.org/2005/Atom')
print("\nresponse.status is : ", response.status)
print("content len is : ", len(content))
print("is from cache : ", response.fromcache)

The output is as follows:

send: b'GET /2005/Atom HTTP/1.1\r\nHost: www.w3.org\r\nuser-agent: Python-httplib2/0.10.3 (gzip)\r\naccept-encoding: gzip, deflate\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date header: Content-Location header: Vary header: TCN header: Last-Modified header: ETag header: Content-Encoding header: Cache-Control header: Expires
header: P3P header: Content-Length header: Content-Type header: Accept-Ranges header: Age header: Via
response.status is :  200
content len is :  2314
is from cache :  True

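Setting httplib2.debuglevel = 1 turns on debugging, so httplib2 prints all the data it sends to the server and some of the information the server sends back.
We created an httplib2 Http object with the same directory name as before, then requested the same URL as before. When the cached copy is still fresh, nothing is sent to the server and nothing comes back from it: there is no network activity at all.
Yet we did receive some data; in fact, we received everything. We also received an HTTP status code of 200, indicating that the request succeeded.
This response was actually generated from httplib2's local cache.
The directory name passed in when creating the httplib2.Http object is where httplib2 keeps its cache of every operation it has performed.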

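We previously requested the data at this URL, and that request succeeded (status: 200).
The response contained not only the page data but also a set of caching headers, which tell anyone listening that they may cache this resource for up to 6 hours (cache-control: max-age=21600, which is 6 hours expressed in seconds).
httplib2 understands and respects these caching headers, and it stored the previous response in the test_result directory (the path we passed in when creating the Http object). That cache entry has not expired yet, so when we request the data at this URL a second time, httplib2 simply returns the cached result without hitting the network.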
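
To make the 6-hour figure concrete, the sketch below works out when the cached response shown earlier stops being fresh, using only its date and cache-control headers. It is a simplified illustration and not httplib2's internal logic: the function names are hypothetical, and the Age and Expires headers and all cache-control directives other than max-age are deliberately ignored.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip('"')
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def fresh_until(date_header, cache_control):
    """Return the moment the response stops being fresh (simplified)."""
    max_age = max_age_seconds(cache_control)
    if max_age is None:
        return None
    return parsedate_to_datetime(date_header) + timedelta(seconds=max_age)


def is_fresh(date_header, cache_control, now):
    """True if `now` (a timezone-aware datetime) is before the expiry time."""
    expiry = fresh_until(date_header, cache_control)
    return expiry is not None and now < expiry


# Header values copied from the cached file shown earlier.
expiry = fresh_until("Sun, 18 Mar 2018 07:07:31 GMT", "max-age=21600")
print(expiry.astimezone(timezone.utc).strftime("%H:%M:%S"))  # 13:07:31
```

The computed time, 13:07:31 GMT, matches the expires header in the cached file, which is the date header plus 21600 seconds.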

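httplib2 handles HTTP caching automatically by default

Now suppose we have data cached, but we want to bypass the cache and request it again from the remote server.
Browsers sometimes do this when the user specifically asks for it.
For example, pressing F5 refreshes the current page, while pressing Ctrl+F5 bypasses the cache and requests the current page again from the remote server. We might think: "I'll just delete the data from my local cache and then request it again."
We could do that, but remember that more parties may be involved than just us and the remote server. What about intermediate proxy servers? They are completely beyond our control, and they may still have the data cached and will happily return it to us, because (as far as they are concerned) their cache is still valid.
Instead of manipulating our local cache by hand, we should use the features of HTTP to make sure our request actually reaches the remote server.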

This is done by passing an extra argument to request():

response2, content2 = hObj.request('http://www.w3.org/2005/Atom', headers={'cache-control':'no-cache'})

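httplib2 lets us add arbitrary HTTP headers to any outgoing request. To bypass all caches (not only the local disk cache, but also any caching proxies between us and the remote server), add a no-cache header to the headers dictionary (headers={'cache-control':'no-cache'} above).
httplib2 notices the no-cache header we added, so it bypasses its local cache entirely and goes out to the network to request the data.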
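
One way to keep both behaviours behind a single call is a small wrapper like the one below. It is only a sketch: fetch and NO_CACHE_HEADERS are hypothetical names, and the http argument is assumed to be an httplib2.Http instance (or any object whose request() method accepts a headers keyword in the same way).

```python
# Header that asks every cache on the path to go back to the origin server.
NO_CACHE_HEADERS = {"cache-control": "no-cache"}


def fetch(http, url, bypass_cache=False):
    """Request `url` through `http`, optionally bypassing all caches.

    `http` is assumed to behave like httplib2.Http: request() returns a
    (response, content) pair and accepts a `headers` dictionary.
    """
    # Copy the constant so callers cannot mutate the shared dictionary.
    headers = dict(NO_CACHE_HEADERS) if bypass_cache else {}
    response, content = http.request(url, headers=headers)
    return response, content
```

With the hObj object from the earlier examples, fetch(hObj, 'http://www.w3.org/2005/Atom') may be answered from the test_result cache, while fetch(hObj, 'http://www.w3.org/2005/Atom', bypass_cache=True) sends the no-cache header; response.fromcache can then be checked to see which path was taken.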

Original article:

http://www.bogotobogo.com/python/python_http_web_services.php
