Question:

Problem downloading images with Scrapy

施喜

2023-03-14

When I try to download images with my Scrapy spider, I get the following error.

File "C:\Python27\lib\site-packages\scrapy\http\request\__init__.py",
line 61, in _set_url
            raise ValueError('Missing scheme in request url: %s' % self._url)
        exceptions.ValueError: Missing scheme in request url: h

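As best I can tell, it looks like I am missing an "h" from a URL somewhere? But I can't for the life of me see where. Everything works fine if I'm not trying to download images. But once I add the appropriate code to the four files below, I can't get anything to work properly. Could anyone help me make sense of this error?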

items.py

import scrapy

class ProductItem(scrapy.Item):
    model = scrapy.Field()
    shortdesc = scrapy.Field()
    desc = scrapy.Field()
    series = scrapy.Field()
    imageorig = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py

BOT_NAME = 'allenheath'

SPIDER_MODULES = ['allenheath.spiders']
NEWSPIDER_MODULE = 'allenheath.spiders'

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}

IMAGES_STORE = 'c:/allenheath/images'

pipelines.py

class AllenheathPipeline(object):
    def process_item(self, item, spider):
        return item

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

products.py (my spider)

import scrapy

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/ahproducts/ilive-80/",
        "http://www.allen-heath.com/ahproducts/ilive-112/"
    ]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent img').extract()[0]
            item['image_urls'] = 'http://www.allen-heath.com' + item['image_urls']
            yield item

Any help would be greatly appreciated.

1 answer

罗渝

2023-03-14

The problem is here:

def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        yield scrapy.Request(image_url)

And here:

item['image_urls'] = sel.css('#tab1 #productcontent img').extract()[0]

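You are extracting the field and taking the first element, so image_urls ends up as a single string rather than a list. When the pipeline then iterates over it, it is actually iterating over the characters of that string, which begins with http. The very first character, "h", is passed to scrapy.Request as if it were a URL, which explains the error message you are seeing: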

Missing scheme in request url: h
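
To see why the message ends in a lone "h", here is a minimal sketch in plain Python (no Scrapy needed). The URL is a hypothetical placeholder, not taken from the site; the only point is that a loop over a string yields single characters, while a loop over a list yields whole URLs.

```python
# Hypothetical image URL, used only for illustration.
url = "http://www.example.com/images/product.jpg"

# image_urls as a single string: roughly what extract()[0] leaves you with.
as_string = url
first_from_string = [u for u in as_string][0]

# image_urls as a list of strings: what the images pipeline iterates over.
as_list = [url]
first_from_list = [u for u in as_list][0]

print(first_from_string)  # a single character
print(first_from_list)    # the whole URL
```

In the first case the pipeline would build a request for each character in turn, and the first one already fails because "h" has no scheme.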

Remove the [0] from that line. While you are at it, extract the image's src attribute rather than the whole element:

item['image_urls'] = sel.css('#tab1 #productcontent img').xpath('./@src').extract()

After that, if the image URLs are relative, you should also update the next line so that it converts them to absolute URLs:

import urlparse  # put this at the top of the script
item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]

But if the image URLs in src are actually absolute, you don't need this last part, so just remove it.
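
As a rough illustration of what the urljoin step does, the sketch below uses Python 3's urllib.parse, where the Python 2 urlparse module shown above now lives. The page URL is one of the start URLs from the question; the two src values are made up for the example.

```python
from urllib.parse import urljoin

# One of the start URLs from the question.
page_url = "http://www.allen-heath.com/ahproducts/ilive-80/"

# Hypothetical src values: one root-relative, one already absolute.
srcs = [
    "/media/ilive-80.jpg",
    "http://cdn.example.com/img/ilive-80.jpg",
]

# Resolve each src against the URL of the page it was found on.
absolute = [urljoin(page_url, src) for src in srcs]

for u in absolute:
    print(u)
```

The relative src is resolved against the site root, and the already-absolute one passes through unchanged, so applying urljoin to absolute URLs is harmless.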
