The Connection field in the HTTP/HTTPS header distinguishes short connections from persistent (long) connections, through the values close and keep-alive. A short HTTP/HTTPS connection is torn down once a single read/write exchange completes; a persistent connection can carry multiple transfers for as long as it stays open. Both kinds are still bound by read/write time limits. The point of the Connection header is to improve the utilization of network resources and the quality of service.
Some one-sided "schools of thought" in China recommend persistent connections across the board. In fact the two serve different purposes: the most common case, downloading a file, suits a short connection, while browsing multimedia content works better over a persistent one. Even so, you will often find sites that run almost everything over keep-alive!
In theory the two are essentially the same: a persistent connection behaves like a chain of short connections with the repeated handshakes omitted. Over the whole HTTP/HTTPS connection lifecycle, a short connection is a short session and a persistent connection a long one. A short connection discards its session as soon as one complete exchange finishes; a persistent connection carries one or more complete exchanges within its session window.
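The saving from the omitted handshakes is easy to see from Python. The sketch below is a minimal illustration with the requests library (example.com is only a placeholder endpoint): it issues the same GET five times, once with a fresh connection per request and once over a pooled keep-alive session, and prints the elapsed time of each style.

import time
import requests

url = 'https://example.com/'  # placeholder endpoint

# Short connections: requests.get() opens a new session per call,
# so every iteration pays the full TCP/TLS handshake again.
start = time.time()
for _ in range(5):
    requests.get(url, headers={'Connection': 'close'})
print('close:      %.2fs' % (time.time() - start))

# Long connection: a Session pools sockets per host, so after the first
# handshake the remaining requests reuse the same connection.
start = time.time()
with requests.Session() as s:
    for _ in range(5):
        s.get(url)
print('keep-alive: %.2fs' % (time.time() - start))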
Install the Python requests library with pip under bash (much the same from the Windows command prompt):
~$ pip install requests
Session is the requests library's way of keeping a connection alive. If you add a "Connection: keep-alive" header to a urllib request.Request, you will find the reply most likely comes back as a Connection: close short connection: the standard library in fact overwrites the header with "Connection: close" before sending, so a plain urllib HTTP/HTTPS connection is effectively a short one. Most modern server software, of course, supports keep-alive by default. A keep-alive connection requires maintaining a time-limited interactive session.
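A minimal sketch of that difference (httpbin.org serves only as a placeholder endpoint; any HTTPS server will do):

import urllib.request
import requests

url = 'https://httpbin.org/get'

# urllib.request silently replaces our header with "Connection: close"
# before sending, so the server usually answers with a short connection.
req = urllib.request.Request(url, headers={'Connection': 'keep-alive'})
with urllib.request.urlopen(req) as resp:
    print('urllib   ->', resp.headers.get('Connection'))

# requests.Session pools connections, so keep-alive works out of the box.
with requests.Session() as s:
    print('requests ->', s.get(url).headers.get('Connection'))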
Updated custom HTTP toolkit: httpkit.py.
# -*- coding: utf-8 -*-
"""
@file: httpkit
@author: MR.N
@created: 2022/4/2
@updated: 2022/5/23
@version: 1.0
@blog: https://blog.csdn.net/qq_21264377
"""
import time
import urllib.parse
import urllib.request
import requests
import urllib3
import http.cookiejar
import ssl
import socket
import gzip
from uas import *  # expected to provide RemoteTask, valid_https, unspecific_ua
import random
SOCKET_TIMEOUT = 30  # global default socket timeout, in seconds
HTTPS_TIMEOUT = 10  # per-request timeout passed to requests, in seconds
# ... (omitted)
def request_res(remote_task=None, ret=None, dtype=0, max_retry=3):
    # ret=None avoids the shared mutable-default pitfall; results are appended
    # to the caller's list as [data, url, cookies, status_code].
    if ret is None:
        ret = []
    # isinstance(None, RemoteTask) is False, so this also covers remote_task=None.
    if not isinstance(remote_task, RemoteTask):
        ret += ['', '', '', -1]
        return 'err'
url = remote_task.url
if not valid_https(url):
ret += ['', '', '', -1]
return 'err'
referer = remote_task.referer
cookies = remote_task.cookies
ua = remote_task.ua
if ua is None:
ua = unspecific_ua()
headers = {
'User-Agent': ua,
        # 'Accept-Encoding': 'gzip, deflate, br',  # br left disabled: decoding it would need the brotli package
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Site': 'none',
'Upgrade-Insecure-Requests': '1',
'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
'Connection': 'keep-alive',
}
if referer is None:
headers['Referer'] = url
else:
headers['Referer'] = referer
if cookies is not None:
headers['Cookie'] = cookies
    headers['Host'] = url.split('/')[2]  # netloc part of the URL, e.g. example.com
# print(headers)
    socket.setdefaulttimeout(SOCKET_TIMEOUT)
    # Note: this monkey-patch skips certificate verification for the stdlib
    # urllib only; requests/urllib3 build their own SSL context (pass
    # verify=False to the request if unverified HTTPS is really wanted there).
    ssl._create_default_https_context = ssl._create_unverified_context
session = requests.Session()
attempts = 0
status_code = -1
response = None
    while attempts < max_retry and status_code != 200:
        attempts += 1
        try:
            response = session.request(method='GET', url=url, headers=headers, timeout=HTTPS_TIMEOUT)
            status_code = response.status_code
        except (TimeoutError, requests.exceptions.Timeout, requests.exceptions.ConnectionError):
            # ReadTimeout/ConnectTimeout are subclasses of Timeout; ConnectionError
            # is included so a DNS or connection failure also triggers a retry.
            status_code = 404  # sentinel this toolkit uses for a failed attempt
        finally:
            if status_code != 200:
                time.sleep(0.11)  # brief back-off before the next attempt
    if response is not None and response.status_code == 200:
        data = response.content
        content_encoding = response.headers.get('Content-Encoding')
        if content_encoding is not None and content_encoding.strip().lower() in ('gzip', 'deflate'):
            # requests already decompresses gzip/deflate bodies in .content,
            # so no explicit gzip.decompress(data) is needed here.
            pass
        content_type = response.headers.get('Content-Type')
        if content_type is not None and 'charset=' in content_type:
            encoding = content_type.split(';')[-1].split('=')[-1].strip().strip('"\'')
        else:
            encoding = response.apparent_encoding
        if encoding is not None and encoding.strip() != '':
            data = data.decode(encoding=encoding, errors='replace')
        else:
            data = data.decode('UTF-8', errors='replace')
cookies = response.cookies
cookie_res = ''
for cookie in cookies:
cookie_res += cookie.name + '=' + cookie.value + ';'
ret += [data, url, cookie_res, status_code]
        session.close()
        response.close()
        return 'success'
    else:
        session.close()
        if response is not None:
            response.close()
        ret += ['', '', '', status_code]
        return 'failure'
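A hypothetical usage sketch of the function above. RemoteTask, its constructor arguments, and the uas module are assumptions inferred from how httpkit.py uses them; adjust to the real definitions.

from httpkit import request_res
from uas import RemoteTask  # assumed to provide RemoteTask

task = RemoteTask(url='https://example.com/')  # constructor signature assumed
result = []
if request_res(task, ret=result) == 'success':
    html, final_url, cookie_str, status = result
    print(status, len(html), cookie_str)
else:
    print('request failed:', result[-1] if result else 'unknown')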
Besides improving the reuse of network resources, the author has also seen the Connection header serve other purposes on some sites, such as traffic analytics and security auditing. For traffic analytics the work happens mostly on the server side, in the logs. The two approaches in wide use today are front-end-first measurement with back-end assistance, and pure back-end logging; the front-end-first approach has the edge in the richness of the user-behaviour features it records and the fidelity of the user profiles it builds. On the security side, connection behaviour helps identify crawlers and recognize certain kinds of traffic attacks: by comparing incoming requests against the server's configuration and expected usage scenarios, anomalies can be filtered out quickly and simply, and non-compliant or illegal behaviour analysed from there.