Question:

Programming: my Scrapy spider never gets any data. How can I fix it?

应瀚
2023-12-20

Help: my Scrapy crawl keeps failing to collect data. I have been debugging for a long time and still cannot find the problem.

Goal: scrape basic information about the major attractions of one city from the Xinxin Travel site (cncn.com).

Here are my spider and item code.

spider:

from scrapy import Request
from scrapy.spiders import Spider
from XXtourism.items import XxtourismItem


class TourismSpider(Spider):
    name = "tourism"

    # Initial request
    def start_requests(self):
        url = "https://tianjin.cncn.com/jingdian/"
        yield Request(url, dont_filter=True)

    # Parse the list page
    def parse(self, response, *args, **kwargs):
        spots_list = response.xpath('//div[@class="city_spots_list"]/ul/li')
        for i in spots_list:
            try:
                # Attraction name
                name = i.xpath('./a/div[@class="title"]/b/text()').extract_first()
                # Attraction introduction
                introduce = i.xpath('./div[@class="text_con"]/p/text()').extract_first()

                item = XxtourismItem()
                item["name"] = name
                item["introduce"] = introduce

                # Generate the detail-page request
                url = i.xpath("./a/@href").extract_first()
                yield Request(url, meta={"item": item}, callback=self.pif_parse, dont_filter=True)
            except:
                pass

    def pif_parse(self, response):
        try:
            address = response.xpath("//div[@class='type']/dl[1]/dd/text()").extract_first()
            time = response.xpath("//div[@class='type']/dl[4]/dd/p/text()").extract_first()
            ticket = response.xpath("//div[@class='type']/dl[5]/dd/p/text()").extract_first()
            response.find_element_by_xpath("//div[@class='type']/dl[3]//dd/a/text()")
            type = response.xpath("//div[@class='type']/dl[3]//dd/a/text()").extract_first()
            if type:
                type = type
            else:
                type = ' '

            item = response.meta["item"]
            item["address"] = address
            item["time"] = time
            item["ticket"] = ticket
            item["type"] = type

            yield item
            # url = response.xpath("//div[@class='spots_info']/div[@class='type']/div[@class='introduce']/dd/a/@href").extract_first()
            # yield Request(url, meta={"item": item}, callback=self.fin_parse)
        except:
            type = ' '

    # def fin_parse(self, response):
    #     try:
    #         traffic = response.xpath("//div[@class='type']/div[@class='top']/div[3]/text()").extract()
    #
    #         item = response.meta["item"]
    #         item["traffic"] = traffic
    #
    #         yield item
    #
    #     except:
    #         pass

item:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class XxtourismItem(scrapy.Item):
    # define the fields for your item here like:
    # Attraction name
    name = scrapy.Field()
    # Attraction address
    address = scrapy.Field()
    # Attraction introduction
    introduce = scrapy.Field()
    # Attraction type
    type = scrapy.Field()
    # Opening hours
    time = scrapy.Field()
    # Ticket overview
    ticket = scrapy.Field()
    # Transportation overview
    traffic = scrapy.Field()

This is the execution log:

PS D:\Python\XXtourism\XXtourism> scrapy crawl tourism -o tourism.csv
2023-12-20 18:16:56 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: XXtourism)
2023-12-20 18:16:56 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.0.12 24 Oct 2023), cryptography 41.0.7, Platform Windows-10-10.0.19045-SP0
2023-12-20 18:16:56 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'XXtourism',
 'COOKIES_ENABLED': False,
 'DOWNLOAD_DELAY': 3,
 'NEWSPIDER_MODULE': 'XXtourism.spiders',
 'SPIDER_MODULES': ['XXtourism.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
2023-12-20 18:16:56 [py.warnings] WARNING: D:\Ana\lib\site-packages\scrapy\utils\request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)
2023-12-20 18:16:56 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-12-20 18:16:56 [scrapy.extensions.telnet] INFO: Telnet Password: c388126d14d4b80a
2023-12-20 18:16:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-12-20 18:16:57 [scrapy.core.engine] INFO: Spider opened
2023-12-20 18:16:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:16:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-12-20 18:16:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/> (referer: None)
2023-12-20 18:17:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjindaxue/> from <GET http://Tianjin.cncn.com/jingdian/tianjindaxue/>
2023-12-20 18:17:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/shuishanggongyuan/> from <GET http://Tianjin.cncn.com/jingdian/shuishanggongyuan/>
2023-12-20 18:17:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/nankaidaxue/> from <GET http://Tianjin.cncn.com/jingdian/nankaidaxue/>
2023-12-20 18:17:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tanggubinhaishijiguangchang/> from <GET http://Tianjin.cncn.com/jingdian/tanggubinhaishijiguangchang/>
2023-12-20 18:17:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/shijizhong/> from <GET http://Tianjin.cncn.com/jingdian/shijizhong/>
2023-12-20 18:17:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/jingyuan/> from <GET http://Tianjin.cncn.com/jingdian/jingyuan/>
2023-12-20 18:17:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/dagukoupaotai/> from <GET http://Tianjin.cncn.com/jingdian/dagukoupaotai/>
2023-12-20 18:17:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/binhaihangmuzhutigongyuan/> from <GET http://Tianjin.cncn.com/jingdian/binhaihangmuzhutigongyuan/>
2023-12-20 18:17:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/huoyuanjiaguju/> from <GET http://Tianjin.cncn.com/jingdian/huoyuanjiaguju/>
2023-12-20 18:17:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinhaichangjidihaiyangshijie/> from <GET http://Tianjin.cncn.com/jingdian/tianjinhaichangjidihaiyangshijie/>
2023-12-20 18:17:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/xikaijiaotang/> from <GET http://Tianjin.cncn.com/jingdian/xikaijiaotang/>
2023-12-20 18:17:45 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinziranbowuguan/> from <GET http://Tianjin.cncn.com/jingdian/tianjinziranbowuguan/>
2023-12-20 18:17:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/dongwuyuan/> from <GET http://Tianjin.cncn.com/jingdian/dongwuyuan/>
2023-12-20 18:17:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinhuanlegu/> from <GET http://Tianjin.cncn.com/jingdian/tianjinhuanlegu/>
2023-12-20 18:17:57 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:17:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/cifangzi/> from <GET http://Tianjin.cncn.com/jingdian/cifangzi/>
2023-12-20 18:18:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/haiheyishifengqingqu/> from <GET http://Tianjin.cncn.com/jingdian/haiheyishifengqingqu/>
2023-12-20 18:18:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjindaxue/> (referer: None)
2023-12-20 18:18:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/shuishanggongyuan/> (referer: None)
2023-12-20 18:18:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/nankaidaxue/> (referer: None)
2023-12-20 18:18:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tanggubinhaishijiguangchang/> (referer: None)
2023-12-20 18:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/shijizhong/> (referer: None)
2023-12-20 18:18:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/jingyuan/> (referer: None)
2023-12-20 18:18:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/dagukoupaotai/> (referer: None)
2023-12-20 18:18:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/binhaihangmuzhutigongyuan/> (referer: None)
2023-12-20 18:18:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/huoyuanjiaguju/> (referer: None)
2023-12-20 18:18:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinhaichangjidihaiyangshijie/> (referer: None)
2023-12-20 18:18:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/xikaijiaotang/> (referer: None)
2023-12-20 18:18:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinziranbowuguan/> (referer: None)
2023-12-20 18:18:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/dongwuyuan/> (referer: None)
2023-12-20 18:18:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinhuanlegu/> (referer: None)
2023-12-20 18:18:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/cifangzi/> (referer: None)
2023-12-20 18:18:57 [scrapy.extensions.logstats] INFO: Crawled 16 pages (at 15 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:19:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/haiheyishifengqingqu/> (referer: None)
2023-12-20 18:19:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinguwenhuajie/> from <GET http://Tianjin.cncn.com/jingdian/tianjinguwenhuajie/>
2023-12-20 18:19:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/wudadao/> from <GET http://Tianjin.cncn.com/jingdian/wudadao/>
2023-12-20 18:19:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinzhiyanmotianlun/> from <GET http://Tianjin.cncn.com/jingdian/tianjinzhiyanmotianlun/>
2023-12-20 18:19:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinguwenhuajie/> (referer: None)
2023-12-20 18:19:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/wudadao/> (referer: None)
2023-12-20 18:19:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinzhiyanmotianlun/> (referer: None)
2023-12-20 18:19:20 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-20 18:19:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 12810,
 'downloader/request_count': 39,
 'downloader/request_method_count/GET': 39,
 'downloader/response_bytes': 151805,
 'downloader/response_count': 39,
 'downloader/response_status_count/200': 20,
 'downloader/response_status_count/301': 19,
 'elapsed_time_seconds': 142.80337,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 12, 20, 10, 19, 20, 220207),
 'httpcompression/response_bytes': 458357,
 'httpcompression/response_count': 20,
 'log_count/DEBUG': 40,
 'log_count/INFO': 12,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 20,
 'scheduler/dequeued': 39,
 'scheduler/dequeued/memory': 39,
 'scheduler/enqueued': 39,
 'scheduler/enqueued/memory': 39,
 'start_time': datetime.datetime(2023, 12, 20, 10, 16, 57, 416837)}
2023-12-20 18:19:20 [scrapy.core.engine] INFO: Spider closed (finished)

I followed the teacher's lesson step by step, and added a few extra fields of my own (opening each attraction's detail page and scraping it).
The spider never captures any information. I have also tried many fixes for the 301 redirects, but none of them solved it. Please help me out, gurus!
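For example, one of the redirect workarounds was roughly this: rewriting the detail links to lowercase https before yielding them (a sketch of the idea, not my exact code):

    # Hypothetical variant of the detail-request lines in parse():
    # force lowercase https so the http://Tianjin.cncn.com links stop 301-redirecting
    url = i.xpath("./a/@href").extract_first()
    if url:
        url = url.replace("http://Tianjin.cncn.com", "https://tianjin.cncn.com")
    yield Request(url, meta={"item": item}, callback=self.pif_parse, dont_filter=True)

The 301 entries disappear from the log, but I still get zero items.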

1 answer in total

范安歌
2023-12-20

Running this under Scrapy is a bit of a hassle; use the requests + lxml script below instead.

import requests as r
from lxml.etree import HTML


def main():
    # Fetch the list page; the site serves gb2312-encoded HTML
    resp = r.get('https://tianjin.cncn.com/jingdian/')
    content = resp.content.decode('gb2312')

    html = HTML(content)
    nodes = html.xpath('//div[@class="city_spots_list"]/ul/li')
    for n in nodes:
        title = n.xpath('./a/div[@class="title"]//b//text()')
        print(title)


if __name__ == '__main__':
    main()
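Running it prints one single-element list per attraction on the page: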

['天津之眼摩天轮']
['五大道']
['天津古文化街']
['海河意式风情区']
['瓷房子']
['天津欢乐谷']
['动物园']
['天津自然博物馆']
['西开教堂']
['海昌极地海洋世界']
['霍元甲故居']
['天津航母主题公园']
['大沽口炮台']
['静园']
['世纪钟']
['塘沽滨海世纪广场']
['南开大学']
['水上公园']
['天津大学']
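If you would rather stay with Scrapy, the likely culprit is this stray Selenium-style line in pif_parse:

    response.find_element_by_xpath("//div[@class='type']/dl[3]//dd/a/text()")

A Scrapy Response has no find_element_by_xpath method, so every detail page raises AttributeError; the bare except swallows it without yielding anything, which matches your log: 20 pages crawled, 0 items scraped. A minimal corrected pif_parse, assuming your XPaths are otherwise right:

    def pif_parse(self, response):
        # Selenium-style call removed; everything else as in the original
        address = response.xpath("//div[@class='type']/dl[1]/dd/text()").extract_first()
        time = response.xpath("//div[@class='type']/dl[4]/dd/p/text()").extract_first()
        ticket = response.xpath("//div[@class='type']/dl[5]/dd/p/text()").extract_first()
        # fall back to ' ' when the type link is missing
        type = response.xpath("//div[@class='type']/dl[3]//dd/a/text()").extract_first() or ' '

        item = response.meta["item"]
        item["address"] = address
        item["time"] = time
        item["ticket"] = ticket
        item["type"] = type
        yield item

The 301 redirects themselves are harmless: RedirectMiddleware follows them, and the log shows every detail page coming back 200 afterwards.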
Similar resources:
  • Installing MySQL-python: [root@centos7vm ~]# pip install MySQL-python. If the following starts without errors, the installation succeeded: [root@centos7vm ~]# python Python 2.7.5 (default, Nov 20 2015, 02:00:19) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2 T…

  • This article introduces scraping web-page data with a Python crawler and parsing it, including usage tips and caveats; refer to it if you need it. 1. The basic concept of a web crawler: a web crawler (also called a web spider or robot) simulates a client sending network requests and receiving responses; it is a program that automatically grabs Internet information according to certain rules. In principle, a crawler can do anything a browser can do. 2. What a web crawler can do: a crawler can take over a lot of manual work, for example it can…

  • Main contents: downloading and installing Scrapy, creating a Scrapy crawler project, the Scrapy crawler workflow, and the settings configuration file. Scrapy is an asynchronous crawler framework built on Twisted and written in pure Python. It is widely used for data collection, network monitoring, and automated testing. Tip: Twisted is an event-driven network engine framework, likewise implemented in Python. Downloading and installing Scrapy: Scrapy supports the common mainstream platforms such as Linux, Mac, and Windows, so it is easy to install…

  • This article introduces examples of scraping and parsing data with a Python crawler, including usage tips and caveats. It walks through Python crawler scraping and parsing operations, shared for your reference, as follows: crawl Dangdang, http://search.dangdang.com/?key=python&act=input&page_index=1, to fetch book information in an object-oriented style, using…

  • This article explains the implementation of multi-page data scraping with Python Scrapy, including usage tips and caveats, with a sketch of the pattern below. 1. First define a generic URL template: url = 'https://www.qiushibaike.com/text/page/%d/'  # generic url template; pageNum = 1. 2. Make the parse method recurse: the first call to parse handles the first page…
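A minimal sketch of that recursive multi-page pattern (it reuses the qiushibaike URL template from the teaser; the spider name, page cutoff, and item extraction are placeholders):

    import scrapy

    class TextSpider(scrapy.Spider):
        name = "text"
        url = 'https://www.qiushibaike.com/text/page/%d/'  # generic url template
        pageNum = 1
        start_urls = [url % pageNum]

        def parse(self, response):
            # ...extract and yield the current page's items here...
            # then recurse: request the next page with parse as its own callback
            if self.pageNum < 5:  # hypothetical page cutoff
                self.pageNum += 1
                yield scrapy.Request(self.url % self.pageNum, callback=self.parse)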