Add a delay after 500 requests in Scrapy

汤英豪

2023-03-14

Question:

I have a list of 2000 start URLs, and I am using:

DOWNLOAD_DELAY = 0.25

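to control the speed of the requests, but I also want to add a longer delay after every n requests. For example, I want a delay of 0.25 seconds for each request and a delay of 100 seconds after every 500 requests.

Edit:

Sample code: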

import os
from os.path import join
import time

import scrapy

# Date stamp used as the name of the output sub-directory, e.g. 14_03_2023.
date = time.strftime("%d_%m_%Y")

# Maps each start URL to the file name its home page is saved under.
list_of_pages = {'http://www.lapatilla.com/site/': 'la_patilla',
                 'http://runrun.es/': 'runrunes',
                 'http://www.noticierodigital.com/': 'noticiero_digital',
                 'http://www.eluniversal.com/': 'el_universal',
                 'http://www.el-nacional.com/': 'el_nacional',
                 'http://globovision.com/': 'globovision',
                 'http://www.talcualdigital.com/': 'talcualdigital',
                 'http://www.maduradas.com/': 'maduradas',
                 'http://laiguana.tv/': 'laiguana',
                 'http://www.aporrea.org/': 'aporrea'}

root_dir = os.getcwd()
output_dir = join(root_dir, 'data', date)

class TestSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 1  # seconds between consecutive requests

    start_urls = list(list_of_pages)

    def parse(self, response):
        # Create the output directory on first use.
        os.makedirs(output_dir, exist_ok=True)

        # Note: this lookup assumes response.url still equals the start URL
        # (it will not if the site redirected the request).
        filename = list_of_pages[response.url]
        print(time.time())
        with open(join(output_dir, filename), 'wb') as f:
            f.write(response.body)

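In this case the list is shorter, but the idea is the same. I want two levels of delay: one applied to every request and one applied after every N requests. I am not crawling links, just saving the main pages.

Answer:

You could look into the AutoThrottle extension. It does not give you tight control over the delays; instead it has its own algorithm for slowing the spider down, adjusting the delay on the fly based on the response time and the number of concurrent requests.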
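As a rough illustration, AutoThrottle is switched on through project settings. The sketch below uses Scrapy's documented AUTOTHROTTLE_* setting names; the numeric values are assumptions chosen to mirror the question, not recommendations.

```python
# settings.py (sketch; all numbers are illustrative assumptions)

DOWNLOAD_DELAY = 0.25                  # AutoThrottle does not go below this delay

AUTOTHROTTLE_ENABLED = True            # turn the extension on
AUTOTHROTTLE_START_DELAY = 0.25        # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 100.0         # upper bound for the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = True              # log the adjusted delay for every response
```

Note that this only adapts the delay to server latency; it does not by itself produce a fixed pause after every 500 requests.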

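If you need more control over the delays at certain stages of the scraping process, you might need a custom middleware or a custom extension (similar to AutoThrottle; see its source).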
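One possible shape for such a custom extension is sketched below. It counts responses and, after every N of them, pauses the engine and schedules it to resume later. The class name PauseEveryN and the settings PAUSE_EVERY_N and PAUSE_SECONDS are hypothetical names made up for this example. The sketch assumes the engine's pause() and unpause() methods (the ones Scrapy's telnet console documentation mentions) and Twisted's reactor.callLater.

```python
class PauseEveryN:
    """Pause the crawl for `pause_seconds` after every `every` responses.

    Hypothetical extension: the class and setting names are assumptions.
    """

    def __init__(self, every, pause_seconds, pause, resume, call_later):
        self.every = every
        self.pause_seconds = pause_seconds
        self._pause = pause            # callable that pauses the engine
        self._resume = resume          # callable that resumes the engine
        self._call_later = call_later  # callable(delay, fn), e.g. reactor.callLater
        self.seen = 0                  # responses received so far

    @classmethod
    def from_crawler(cls, crawler):
        # Imported lazily so the counting logic can be exercised without Scrapy.
        from scrapy import signals
        from twisted.internet import reactor

        ext = cls(
            every=crawler.settings.getint("PAUSE_EVERY_N", 500),
            pause_seconds=crawler.settings.getfloat("PAUSE_SECONDS", 100.0),
            # crawler.engine does not exist yet at this point, so look it up late.
            pause=lambda: crawler.engine.pause(),
            resume=lambda: crawler.engine.unpause(),
            call_later=reactor.callLater,
        )
        crawler.signals.connect(ext.response_received,
                                signal=signals.response_received)
        return ext

    def response_received(self, response, request, spider):
        self.seen += 1
        if self.seen % self.every == 0:
            # Stop scheduling new requests, then resume after the long pause.
            self._pause()
            self._call_later(self.pause_seconds, self._resume)
```

It would be enabled with something like EXTENSIONS = {"myproject.extensions.PauseEveryN": 500}, where the module path is again an assumption. The per-request DOWNLOAD_DELAY keeps applying between pauses. Requests already in flight when the pause starts will still complete, so the pause is approximate when concurrency is high.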

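You can also change the .download_delay attribute of your spider on the fly. By the way, this is exactly what the AutoThrottle extension does under the hood: it updates the .download_delay value on the fly.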
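A hedged sketch of changing the delay on the fly follows. As far as I can tell from the AutoThrottle source, the value it adjusts while the crawl is running is the delay attribute of the per-domain download slot, which it looks up through request.meta["download_slot"]. The helper below follows that pattern. It relies on Scrapy internals (crawler.engine.downloader.slots) that may change between versions, and both function names are hypothetical.

```python
def delay_for(request_number, base=0.25, every=500, long_pause=100.0):
    """Return the delay to apply after the given (1-based) request number."""
    return long_pause if request_number % every == 0 else base


def set_slot_delay(crawler, request, delay):
    """Set the delay of the download slot that handled `request`.

    Returns True if the slot was found and updated, False otherwise.
    Assumption: slots live in crawler.engine.downloader.slots, keyed by
    the value Scrapy stores in request.meta["download_slot"].
    """
    key = request.meta.get("download_slot")
    slot = crawler.engine.downloader.slots.get(key)
    if slot is None:
        return False
    slot.delay = delay
    return True
```

In a spider callback this might be used as set_slot_delay(self.crawler, response.request, delay_for(self.count)), where self.count is a counter the spider increments for each response. Because slots are per domain, this fits best when all the URLs are on the same site.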

Some related topics:

Per request delay
Request delay configurable for each Request
