I am retrieving information from the National Gallery of Art's online catalog. Because of the way the catalog is structured, I can't navigate by extracting and following links from entry to entry. Fortunately, each object in the collection has a predictable URL, so I want my spider to navigate the collection by generating start URLs.
I tried to solve my problem by implementing the solution from this thread. Unfortunately, it seems to break another part of my spider. The error log shows that my URLs are being generated successfully, but that they are not being processed correctly. If I am interpreting the log correctly, and I suspect I am not, there is a conflict between the redefinition of start_urls that lets me generate the URLs I need and the rules section of the spider. As things stand, the spider also doesn't respect the number of pages I've asked it to crawl.
You will find my spider and a typical error below. I appreciate any help you can offer.
Spider:
URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 10
class NGASpider(CrawlSpider):
name = 'ngamedallions'
allowed_domains = ['nga.gov']
start_urls = [URL % starting_number]
rules = (
Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
follow=True))
def __init__(self):
self.page_number = starting_number
def start_requests(self):
for i in range (self.page_number, number_of_pages, -1):
yield Request(url = URL % i + ".html" , callback=self.parse)
def parse_CatalogRecord(self, response):
CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
CatalogRecord.default_output_processor = TakeFirst()
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
keywords = "medal|medallion"
r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
return CatalogRecord.load_item()
Typical error:
2016-04-29 15:35:00 [scrapy] ERROR: Spider error processing <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1178.html> (referer: None)
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
for x in result:
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 73, in _parse_response
for request_or_item in self._requests_to_follow(response):
File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 51, in _requests_to_follow
for n, rule in enumerate(self._rules):
AttributeError: 'NGASpider' object has no attribute '_rules'
Update based on eLRuLL's solution
Simply removing def __init__ and start_urls lets my spider crawl the generated URLs. However, it also seems to stop 'def parse_CatalogRecord(self, response)' from being applied. When I run the spider now, it only scrapes a page from outside the generated range of URLs. My revised spider and the log output follow.
Spider:
URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 1311
class NGASpider(CrawlSpider):
name = 'ngamedallions'
allowed_domains = ['nga.gov']
rules = (
Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
follow=True))
def start_requests(self):
self.page_number = starting_number
for i in range (self.page_number, number_of_pages, -1):
yield Request(url = URL % i + ".html" , callback=self.parse)
def parse_CatalogRecord(self, response):
CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
CatalogRecord.default_output_processor = TakeFirst()
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
keywords = "medal|medallion"
r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
return CatalogRecord.load_item()
Log:
2016-05-02 15:50:02 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-02 15:50:02 [scrapy] INFO: Optional features available: ssl, http11
2016-05-02 15:50:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-05-02 15:50:02 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-02 15:50:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-02 15:50:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-02 15:50:02 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-05-02 15:50:02 [scrapy] INFO: Spider opened
2016-05-02 15:50:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-02 15:50:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-02 15:50:02 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-05-02 15:50:02 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-05-02 15:50:05 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-05-02 15:50:05 [scrapy] DEBUG: File (uptodate): Downloaded image from <GET http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg> referred in <None>
2016-05-02 15:50:05 [scrapy] DEBUG: Scraped from <200 http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html>
{'accession': u'1942.9.163.b',
'image_urls': [u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'],
'images': [{'checksum': '9d5f2e30230aeec1582ca087bcde6bfa',
'path': 'full/3a692347183d26ffefe9ba0af80b0b6bf247fae5.jpg',
'url': 'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'}],
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
'title': u'House between Two Hills [reverse]'}
2016-05-02 15:50:05 [scrapy] INFO: Closing spider (finished)
2016-05-02 15:50:05 [scrapy] INFO: Stored json feed (1 items) in: items.json
2016-05-02 15:50:05 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 631,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 26324,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'dupefilter/filtered': 3,
'file_count': 1,
'file_status_count/uptodate': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 2, 19, 50, 5, 810570),
'item_scraped_count': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 5, 2, 19, 50, 2, 455508)}
2016-05-02 15:50:05 [scrapy] INFO: Spider closed (finished)
Don't override the __init__ method if you aren't going to call super.
Now, if you are going to use start_requests, you don't need to declare start_urls for your spider to work. Just remove your def __init__ method; start_urls doesn't need to exist.
Update
Ok, my mistake; it looks like CrawlSpider needs the start_urls attribute, so just create it instead of using the start_requests method:
    start_urls = [URL % i + '.html' for i in range(starting_number, number_of_pages, -1)]
and remove start_requests.
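Putting the answer together, a sketch of the revised spider might look like the following; the imports and the ngamedallions.items path are assumptions based on the code shown in the question, not something the answer spells out:

    import re

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst, Identity

    from ngamedallions.items import NgamedallionsItem  # assumed project layout

    URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
    starting_number = 1312
    number_of_pages = 1311

    class NGASpider(CrawlSpider):
        name = 'ngamedallions'
        allowed_domains = ['nga.gov']

        # Seed every start URL up front instead of overriding start_requests,
        # so CrawlSpider's rule-following machinery stays intact. Note that
        # range(1312, 1311, -1) yields only 1312; widen the bounds to seed
        # more pages.
        start_urls = [URL % i + '.html'
                      for i in range(starting_number, number_of_pages, -1)]

        # The trailing comma makes rules a one-element tuple rather than a
        # bare Rule object.
        rules = (
            Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')),
                 callback='parse_CatalogRecord', follow=True),
        )

        def parse_CatalogRecord(self, response):
            CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
            CatalogRecord.default_output_processor = TakeFirst()
            CatalogRecord.image_urls_out = Identity()

            # Only load an item when the page mentions a medal or medallion.
            keywords = "medal|medallion"
            r = re.compile('.*(%s).*' % keywords,
                           re.IGNORECASE | re.MULTILINE | re.UNICODE)
            if r.search(response.body_as_unicode()):
                CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
                CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
                CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
                CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
                return CatalogRecord.load_item()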