Question:

How to run a spider several times in Scrapy, changing part of the URL in "def start_requests(self)"

敖涵容
2023-03-14

I have a question about the logic of this spider. One of the categories of the Castbox website I want to crawl has infinite pagination. My idea was to split the URL of the JSON endpoint, slice it, and rejoin it so the spider can parse it. I then used a while loop as the condition for the spider to keep crawling the elements I need.

Let me explain.

When I inspected the JSON URL of the Castbox website, I noticed that each time the page reloads as you scroll down, only one part of the URL changes. That part is called "skip", and it varies between 0 and 200, as you can see in the URL. So I thought that if I could write a "def start_requests(self)" in which the "skip" part of the URL goes from 0 to 200, I would get what I want. Is it possible for such a function to change the URL on every request? And if so, what is wrong with the "def start_requests(self)" part of my spider?
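Just to illustrate the pattern I mean, these are the URLs I would expect "def start_requests(self)" to generate (a rough sketch of the idea only; everything except the skip value stays fixed):

# Illustration only: the skip value is the only part of the URL that changes
base = ('https://everest.castbox.fm/data/top_channels/v2'
        '?category_id=10021&country=us&skip={}&limit=60&web=1'
        '&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1')
for skip in range(0, 201):
    print(base.format(skip))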

By the way, when running it I get the following error: ModuleNotFoundError: No module named 'urlparse'

This is my spider:

# -*- coding: utf-8 -*-
import scrapy
import json

class ArtsPodcastsSpider(scrapy.Spider):
    name = 'arts_podcasts'
    allowed_domains = ['www.castbox.fm']
    

    def start_requests(self):
        
        try:
            if response.request.meta['skip']:
                skip=response.request.meta['skip']
            else:
                skip=0
                
            while skip < 201:
                url = 'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=0&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1'
                split_url = urlparse.urlsplit(url)
                path = split_url.path
                path.split('&')
                path.split('&')[:-5]
                '&'.join(path.split('&')[:-5])
                parsed_query = urlparse.parse_qs(split_url.query)
                query = urlparse.parse_qs(split_url.query, keep_blank_values=True)
                query['skip'] = skip
                updated = split_url._replace(path='&'.join(base_path.split('&')[:-5]+['limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1', '']),
                    query=urllib.urlencode(query, doseq=True))
                updated_url=urlparse.urlunsplit(updated)
                
                
                yield scrapy.Request(url= updated_url, callback= self.parse_id, meta={'skip':skip})
    
                def parse_id(self, response):

                    skip=response.request.meta['skip']
                    data=json.loads(response.body)
                    category=data.get('data').get('category').get('name')
                    arts_podcasts=data.get('data').get('list')
                    for arts_podcast in arts_podcasts:
                        yield scrapy.Request(url='https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={0}&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1'.format(arts_podcast.get('list')[2].get('cid')), meta={'category':category,'skip':skip}, callback= self.parse)


                def parse(self, response):

                    skip=response.request.meta['skip']
                    category=response.request.meta['category']
                    arts_podcast=json.loads(response.body).get('data')
                    yield scrapy.Request(callback=self.start_requests,meta={'skip':skip+1})
                    yield{

                        'title':arts_podcast.get('title'),
                        'category':arts_podcast.get('category'),
                        'sub_category':arts_podcast.get('categories'),
                        'subscribers':arts_podcast.get('sub_count'),
                        'plays':arts_podcast.get('play_count'),
                        'comments':arts_podcast.get('comment_count'),
                        'episodes':arts_podcast.get('episode_count'),
                        'website':arts_podcast.get('website'),
                        'author':arts_podcast.get('author'),
                        'description':arts_podcast.get('description'),
                        'language':arts_podcast.get('language')
                        }

Thank you!

--- EDIT ---

This is part of the log I got after running the spider, Patrick Klein:

2020-11-14 15:51:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=0&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1> (referer: None)
2020-11-14 15:51:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=0&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1> (referer: None)
Traceback (most recent call last):
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\shima\projects\castbox_arts_podcasts\castbox_arts_podcasts\spiders\arts_podcasts.py", line 27, in parse_id
    url = f'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={arts_podcast.get("list")[2].get("cid")}&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1'
TypeError: 'NoneType' object is not subscriptable

--- EDIT 2 ---

2020-11-15 13:14:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=2583691&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1> (referer: https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=8&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1)
2020-11-15 13:14:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=2946683&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1>
{'sub_category': None, 'title': None, 'subscribers': None, 'plays': None, 'comments': None, 'episodes': None, 'downloads': None, 'website': None, 'author': None, 'description': None, 'language': None}
2020-11-15 13:14:47 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
2020-11-15 13:14:47 [scrapy.core.downloader.handlers.http11] WARNING: Got data loss in https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=12&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1. If you want to process broken responses set the setting DOWNLOAD_FAIL_ON_DATALOSS = False -- This message won't be shown in further requests
2020-11-15 13:14:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=12&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>, <twisted.python.failure.Failure twisted.web.http._DataLoss: Chunked decoder in 'CHUNK_LENGTH' state, still expecting more data to get to 'FINISHED' state.>]

Part of the JSON object for one of the items that needs to be scraped:

{
    "msg": "OK",
    "code": 0,
    "data": {
        "category": {
            "sub_categories": [
                {
                    "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "id": "10022",
                    "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "name": "Books"
                },
                {
                    "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "id": "10023",
                    "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "name": "Design"
                },
                {
                    "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "id": "10024",
                    "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "name": "Fashion & Beauty"
                },
                {
                    "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "id": "10025",
                    "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "name": "Food"
                },
                {
                    "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "id": "10026",
                    "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "name": "Performing Arts"
                },
                {
                    "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "id": "10027",
                    "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
                    "name": "Visual Arts"
                }
            ],
            "id": "10021",
            "name": "Arts"
        },
        "list": [
            {
                "provider_id": 125443881,
                "episode_count": 256,
                "x_play_base": 0,
                "stat_cover_ext_color": false,
                "keywords": [
                    "Arts",
                    "Literature",
                    "TV & Film",
                    "Society & Culture",
                    "freshair",
                    "npr",
                    "terrygross",
                    "news",
                    "facts",
                    "interesting",
                    "worldwide",
                    "international",
                    "best",
                    "awardwinning",
                    "jay z"
                ],
                "cover_ext_color": "-8610134",
                "mongo_id": "5e74365585a4e5dcff18d769",
                "show_id": "56a0a3399eb9a8dd9758c9c2",
                "copyright": "Copyright 2015-2019 NPR - For Personal Use Only",
                "author": "NPR",
                "is_key_channel": true,
                "audiobook_categories": [],
                "comment_count": 29,
                "website": "http://www.npr.org/programs/fresh-air/",
                "rss_url": "https://feeds.npr.org/381444908/podcast.xml",
                "description": "Fresh Air from WHYY, the Peabody Award-winning weekday magazine of contemporary arts and issues, is one of public radio's most popular programs. Hosted by Terry Gross, the show features intimate conversations with today's biggest luminaries.",
                "tags": [
                    "from-itunes"
                ],
                "editable": true,
                "play_count": 8890966,
                "link": "http://www.npr.org/programs/fresh-air/",
                "twitter_names": [
                    "nprfreshair"
                ],
                "categories": [
                    10021,
                    10022,
                    10125,
                    10001,
                    10101,
                    10014,
                    10015
                ],
                "x_subs_base": 25254,
                "small_cover_url": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/76/32/0c/76320cb7-7805-5ffc-6d48-18b311dd9be8/mza_18321298089187816075.jpg/200x200bb.jpg",
                "big_cover_url": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/76/32/0c/76320cb7-7805-5ffc-6d48-18b311dd9be8/mza_18321298089187816075.jpg/600x600bb.jpg",
                "language": "en",
                "cid": 2698788,
                "latest_eid": 326888897,
                "topic_tags": [
                    "FreshAir",
                    "NPR"
                ],
                "release_date": "2020-11-14T05:01:15Z",
                "title": "Fresh Air",
                "uri": "/ch/2698788",
                "https_cover_url": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/76/32/0c/76320cb7-7805-5ffc-6d48-18b311dd9be8/mza_18321298089187816075.jpg/400x400bb.jpg",
                "channel_type": "private",
                "channel_id": "47b5be27cc1ca68aa80f8f7bbccedb47a40992d3",
                "sub_count": 361101,
                "internal_product_id": "cb.ch.2698788",
                "social": {
                    "website": "http://www.npr.org/programs/fresh-air/",
                    "youtube": [
                        {
                            "name": "channel/UCwly5-E5e0EUY-SsnttN4Sg"
                        }
                    ],
                    "twitter": [
                        {
                            "name": "nprfreshair"
                        }
                    ],
                    "facebook": [
                        {
                            "name": "freshairwithterrygross"
                        }
                    ],
                    "instagram": [
                        {
                            "name": "nprfreshair"
                        }
                    ]
                }
            }

1 Answer

席宜修
2023-03-14

I noticed that you are passing category and skip to your parse functions, but never actually use them in the spider. There are also quite a few unused and probably unnecessary imports. Besides, the URL you use in parse_id is almost identical to the one in the start_requests method.
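As a side note, the ModuleNotFoundError you got is because in Python 3 the old urlparse module was merged into urllib.parse, which is where urlsplit, parse_qs and the rest now live. If you ever really do need to rewrite a single query parameter, a small helper along these lines would do it (just a sketch; the name with_skip is mine), though for your case the simpler approach below avoids the URL surgery altogether:

from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

def with_skip(url, skip):
    # Return a copy of `url` with its `skip` query parameter replaced.
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query['skip'] = [str(skip)]  # parse_qs returns each value as a list
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))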

I have rewritten your spider into something that I think is close to what you are trying to achieve.

import scrapy
import json

class ArtsPodcastsSpider(scrapy.Spider):
    name = 'arts_podcasts'

    def start_requests(self):
        for skip in range(201):
            url = f'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={skip}&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1'
            yield scrapy.Request(
                url=url, 
                callback=self.parse_id, 
            )

    def parse_id(self, response):
        data = json.loads(response.body)
        arts_podcasts = data.get('data').get('list')
        for arts_podcast in arts_podcasts:
            url = f'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={arts_podcast["cid"]}&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1'
            yield scrapy.Request(
                url=url, 
                callback=self.parse
            )

    def parse(self, response):
        arts_podcasts=json.loads(response.body).get('data')
        for arts_podcast in arts_podcasts['list']:
            yield {
                'title':arts_podcast.get('title'),
                'category':arts_podcast.get('category'),
                'sub_category':arts_podcast.get('categories'),
                'subscribers':arts_podcast.get('sub_count'),
                'plays':arts_podcast.get('play_count'),
                'comments':arts_podcast.get('comment_count'),
                'episodes':arts_podcast.get('episode_count'),
                'website':arts_podcast.get('website'),
                'author':arts_podcast.get('author'),
                'description':arts_podcast.get('description'),
                'language':arts_podcast.get('language')
            }
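If you want to run it from a plain Python script instead of the scrapy crawl command, something like this should work from the file where the spider class is defined (assuming a reasonably recent Scrapy where the FEEDS setting is available):

from scrapy.crawler import CrawlerProcess

# Export the scraped items to a JSON file while the spider runs
process = CrawlerProcess(settings={
    'FEEDS': {'arts_podcasts.json': {'format': 'json'}},
})
process.crawl(ArtsPodcastsSpider)
process.start()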