How do I create a custom Scrapy Item Exporter?

花品

2023-03-14

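Question:

I am trying to create a custom Scrapy Item Exporter based on JsonLinesItemExporter so that I can slightly alter the structure it produces.

I have read the documentation at http://doc.scrapy.org/en/latest/topics/exporters.html, but it does not state how to create a custom exporter, where to store it, or how to link it to a pipeline.

I have worked out how to customise the feed exporters, but that does not meet my requirements, because I want to call this exporter from my pipeline.

This is the code I have come up with, stored in a file named exporters.py in the project root: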

from scrapy.contrib.exporter import JsonLinesItemExporter

class FanItemExporter(JsonLinesItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write("""{
            'product': [""")

    def finish_exporting(self):
        self.file.write("]}")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict))

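I have simply tried calling this from my pipeline using FanItemExporter, and tried variations of the import, but it produces no results.

Answer:

It is true that the Scrapy documentation does not clearly state where to place an Item Exporter. To use an Item Exporter, follow these steps.

1. Choose an Item Exporter class and import it into pipelines.py in the project directory. It can be a pre-defined Item Exporter (e.g. XmlItemExporter) or a user-defined one (like the FanItemExporter defined in the question).
2. Create an Item Pipeline class in pipelines.py and instantiate the imported Item Exporter in this class. Details are explained later in the answer.
3. Register the pipeline class in settings.py.

A detailed explanation of each step follows. The solution to the question is included in each step.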

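Step 1

If you are using a pre-defined Item Exporter class, import it from the scrapy.exporters module.
For example: from scrapy.exporters import XmlItemExporter
If you need a custom exporter, define a custom class in a file. I suggest placing the class in exporters.py, in the project folder (where settings.py and items.py reside).

When creating a new subclass, it is always a good idea to import BaseItemExporter. Extending it directly is appropriate if we intend to change the functionality entirely. In this question, however, most of the functionality is close to that of JsonItemExporter (the question names JsonLinesItemExporter, but its comma-separated output matches JsonItemExporter).

Therefore, I am attaching two versions of the same Item Exporter. One version extends the BaseItemExporter class and the other extends the JsonItemExporter class.

Version 1: Extending BaseItemExporter

Since BaseItemExporter is the parent class, start_exporting(), finish_exporting() and export_item() must be overridden to suit our needs.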

from scrapy.exporters import BaseItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy.utils.python import to_bytes

class FanItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write(b'{\'product\': [')

    def finish_exporting(self):
        self.file.write(b'\n]}')

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(to_bytes(self.encoder.encode(itemdict)))

Version 2: Extending JsonItemExporter

JsonItemExporter provides exactly the same implementation of the export_item() method. Therefore, only the start_exporting() and finish_exporting() methods are overridden.

The implementation of JsonItemExporter can be seen in python_dir\pkgs\scrapy-1.1.0-py35_0\Lib\site-packages\scrapy\exporters.py

from scrapy.exporters import JsonItemExporter

class FanItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        # Initialize the object using JsonItemExporter's constructor,
        # forwarding any exporter options such as fields_to_export
        super().__init__(file, **kwargs)

    def start_exporting(self):
        self.file.write(b'{\'product\': [')

    def finish_exporting(self):
        self.file.write(b'\n]}')

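Note: When writing data to a file, it is important to note that the standard Item Exporter classes expect binary files. The file must therefore be opened in binary mode (b). For the same reason, the write() calls in both versions write bytes to the file.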
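The binary-mode requirement can be illustrated without Scrapy. The following is a minimal sketch that uses io.BytesIO as a stand-in for a file opened with 'wb'; the payload value is made up for illustration.

```python
import io

# A JSON encoder returns a str, so it has to be encoded before writing
payload = '{"name": "fan"}'

# io.BytesIO behaves like a file opened in binary-write mode ('wb')
binary_file = io.BytesIO()
binary_file.write(payload.encode('utf-8'))  # bytes are accepted

try:
    binary_file.write(payload)  # a str is rejected by a binary file
    str_rejected = False
except TypeError:
    str_rejected = True
```

This is why Version 1 wraps the encoder output in to_bytes() and why the literal strings in both versions carry the b prefix.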

Step 2

Create an Item Pipeline class.

from project_name.exporters import FanItemExporter

class FanExportPipeline(object):
    def __init__(self, file_name):
        # Storing output filename
        self.file_name = file_name
        # Placeholder for the file handle; the file is opened in open_spider()
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        # getting the value of FILE_NAME field from settings.py
        output_file_name = crawler.settings.get('FILE_NAME')

        # cls() calls FanExportPipeline's constructor
        # Returning a FanExportPipeline object
        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')

        # Opening file in binary-write mode
        file = open(self.file_name, 'wb')
        self.file_handle = file

        # Creating a FanItemExporter object and initiating export
        self.exporter = FanItemExporter(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')

        # Ending the export to file from the FanItemExporter object
        self.exporter.finish_exporting()

        # Closing the opened output file
        self.file_handle.close()

    def process_item(self, item, spider):
        # Passing the item to the FanItemExporter object for exporting to file
        self.exporter.export_item(item)
        return item
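To see how the pipeline hooks fit together, here is a rough sketch that calls them by hand in the order Scrapy calls them during a crawl: open_spider() once, process_item() per item, close_spider() once. It does not import Scrapy. StandInExporter, DemoExportPipeline and run_demo are hypothetical names invented for this illustration, and the stand-in writes "product" in double quotes so that the result parses with json.loads (the exporters in Step 1 write it in single quotes, as in the question).

```python
import json
import os
import tempfile


class StandInExporter:
    """Hypothetical stand-in for FanItemExporter; it does not use Scrapy."""

    def __init__(self, file):
        self.file = file
        self.first_item = True

    def start_exporting(self):
        self.file.write(b'{"product": [')

    def export_item(self, item):
        # Separate items with a comma, except before the first one
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        self.file.write(json.dumps(dict(item)).encode('utf-8'))

    def finish_exporting(self):
        self.file.write(b'\n]}')


class DemoExportPipeline:
    """Same hook methods as FanExportPipeline, minus from_crawler()."""

    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None

    def open_spider(self, spider):
        self.file_handle = open(self.file_name, 'wb')
        self.exporter = StandInExporter(self.file_handle)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file_handle.close()


def run_demo(items):
    """Drive the hooks manually and return the bytes written to the file."""
    fd, path = tempfile.mkstemp(suffix='.json')
    os.close(fd)
    try:
        pipeline = DemoExportPipeline(path)
        pipeline.open_spider(spider=None)
        for item in items:
            pipeline.process_item(item, spider=None)
        pipeline.close_spider(spider=None)
        with open(path, 'rb') as handle:
            return handle.read()
    finally:
        os.remove(path)


result = run_demo([{"name": "fan"}, {"name": "lamp"}])
```

In a real project Scrapy makes these calls itself, so the pipeline only needs to be registered as described in the next step.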

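Step 3

Now that the item export pipeline is defined, register it in settings.py. Also add a FILE_NAME field to settings.py. This field contains the filename of the output file.

Add the following lines to settings.py.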

FILE_NAME = 'path/outputfile.ext'
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline' : 600,
}

If ITEM_PIPELINES is already uncommented, add the following line to the ITEM_PIPELINES dict instead.

'project_name.pipelines.FanExportPipeline' : 600,
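The number 600 is the pipeline's order value: items pass through the registered pipelines from the lowest value to the highest, and values are customarily chosen in the 0-1000 range. The sketch below only illustrates that ordering with a plain dict; CleanItemPipeline is a hypothetical second pipeline that is not part of this answer.

```python
# Example ITEM_PIPELINES dict with a hypothetical extra pipeline
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline': 600,
    'project_name.pipelines.CleanItemPipeline': 300,  # hypothetical
}

# Items visit pipelines in ascending order of their value
run_order = [
    path
    for path, order in sorted(ITEM_PIPELINES.items(), key=lambda entry: entry[1])
]
```

With these values the export pipeline runs last, so it writes each item only after any earlier pipeline has processed it.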

This is one way to create a custom Item Exporter pipeline.
