Scrapy: Feed exports

百里雅珺

2023-12-01

Feed exports

New in version 0.10.

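When implementing scrapers, one of the most frequently needed features is storing the scraped data properly. Quite often that means generating an export file with the scraped data (commonly called an "export feed") to be consumed by other systems.

Scrapy provides this functionality out of the box with feed exports, which let you generate feeds from the scraped items using multiple serialization formats and storage backends.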

Serialization formats

To serialize the scraped data, the feed exports use the Item exporters. These formats are supported out of the box:

JSON
JSON lines
CSV
XML

You can also extend the supported formats through the FEED_EXPORTERS setting.

JSON

Value for the format key in the FEEDS setting: json
Exporter used: JsonItemExporter
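Note: JSON does not scale well to large feeds, because most JSON parsers load the whole document into memory. For large outputs, consider JSON lines instead.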

JSON lines

Value for the format key in the FEEDS setting: jsonlines
Exporter used: JsonLinesItemExporter

CSV

Value for the format key in the FEEDS setting: csv
Exporter used: CsvItemExporter
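To specify the columns to export and their order, use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it matters most for CSV because, unlike many other export formats, CSV uses a fixed header.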

XML

Value for the format key in the FEEDS setting: xml
Exporter used: XmlItemExporter

Pickle

Value for the format key in the FEEDS setting: pickle
Exporter used: PickleItemExporter

Marshal

Value for the format key in the FEEDS setting: marshal
Exporter used: MarshalItemExporter

Storages

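When using the feed exports, you define where to store the feed using one or more URIs (through the FEEDS setting). The feed exports support multiple storage backend types, which are selected by the URI scheme.

The storage backends supported out of the box are: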

Local filesystem
FTP
S3 (requires botocore)
Standard output

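Some storage backends may be unavailable if the external libraries they require are not installed. For example, the S3 backend is only available if the botocore library is installed.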

Storage URI parameters

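The storage URI can also contain parameters that get replaced when the feed is being created. These parameters are:

%(time)s - gets replaced by a timestamp when the feed is being created
%(name)s - gets replaced by the spider name

Any other named parameter gets replaced by the spider attribute of the same name. For example, %(site_id)s would get replaced by the spider.site_id attribute the moment the feed is being created.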

Here are some examples to illustrate:

Store in FTP using one directory per spider:
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json

Store in S3 using one directory per spider:
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
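The substitution described above behaves like Python's printf-style formatting with named keys. The following is a minimal sketch of that idea, not Scrapy's own code: build_feed_uri is a hypothetical helper, and the timestamp layout (UTC, no microseconds, ":" replaced by "-") is an assumption made for illustration.

```python
from datetime import datetime, timezone


def build_feed_uri(template, spider_name, spider_attrs=None, timestamp=None):
    """Fill a storage URI template using printf-style named parameters."""
    if timestamp is None:
        # Assumed layout: UTC time without microseconds, with ":" swapped
        # for "-" so the value is safe inside file names.
        now = datetime.now(timezone.utc).replace(microsecond=0, tzinfo=None)
        timestamp = now.isoformat().replace(":", "-")
    # Extra spider attributes (for example site_id) become extra parameters.
    params = dict(spider_attrs or {})
    params["name"] = spider_name
    params["time"] = timestamp
    return template % params


uri = build_feed_uri(
    "ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json",
    spider_name="books",
    timestamp="2023-12-01T08-30-00",
)
print(uri)
```

Passing the timestamp explicitly, as in the call above, keeps the result predictable; leaving it out lets the helper use the current time.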

Storage backends

Local filesystem

The feeds are stored in the local filesystem.

URI scheme: file
Example URI: file:///tmp/export.csv
Required external libraries: none

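Note that for the local filesystem storage (only) you can omit the scheme if you specify an absolute path such as /tmp/export.csv. This only works on Unix systems.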

FTP

The feeds are stored on an FTP server.

URI scheme: ftp
Example URI: ftp://user:pass@ftp.example.com/path/to/export.csv
Required external libraries: none

FTP supports two connection modes: active and passive. Scrapy uses the passive connection mode by default. To use the active connection mode instead, set the FEED_STORAGE_FTP_ACTIVE setting to True.
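As a small illustration, a settings.py fragment that exports to FTP in active mode might look like the sketch below. The host, credentials, and path are the placeholder values from the example URI above, not real ones.

```python
# settings.py (sketch): export a CSV feed over FTP using active mode.
FEED_STORAGE_FTP_ACTIVE = True  # default is False, i.e. passive mode

FEEDS = {
    # Placeholder credentials and host taken from the example URI.
    "ftp://user:pass@ftp.example.com/path/to/export.csv": {
        "format": "csv",
    },
}
```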

S3

The feeds are stored on Amazon S3.

URI scheme: s3
Example URIs:
s3://mybucket/path/to/export.csv
s3://aws_key:aws_secret@mybucket/path/to/export.csv

Required external libraries: botocore

The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY

You can also define a custom ACL for exported feeds using this setting:

FEED_STORAGE_S3_ACL
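Putting the S3 settings together, a settings.py fragment could look like the following sketch. Reading the credentials from environment variables is a choice made here to keep secrets out of the URI, not something the text above requires; the bucket name is the placeholder from the example URIs, and "bucket-owner-full-control" is just one of Amazon's canned ACL values.

```python
import os

# settings.py (sketch): credentials come from the environment, so the
# feed URI itself does not need to carry a key or secret.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")

# One of the canned ACL values documented by Amazon S3.
FEED_STORAGE_S3_ACL = "bucket-owner-full-control"

FEEDS = {
    "s3://mybucket/path/to/export.csv": {"format": "csv"},
}
```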

Standard output

The feeds are written to the standard output of the Scrapy process.

URI scheme: stdout
Example URI: stdout:
Required external libraries: none

Settings

These are the settings used to configure the feed exports:

FEEDS (mandatory)
FEED_EXPORT_ENCODING
FEED_STORE_EMPTY
FEED_EXPORT_FIELDS
FEED_EXPORT_INDENT
FEED_STORAGES
FEED_STORAGE_FTP_ACTIVE
FEED_STORAGE_S3_ACL
FEED_EXPORTERS

FEEDS

New in version 2.1.

Default: {}

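A dictionary in which every key is a feed URI (or a pathlib.Path object) and every value is a nested dictionary containing configuration parameters for that feed. This setting is required to enable the feed export feature.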

See Storage backends for supported URI schemes.

For instance:

{
    'items.json': {
        'format': 'json',
        'encoding': 'utf8',
        'store_empty': False,
        'fields': None,
        'indent': 4,
    },
    '/home/user/documents/items.xml': {
        'format': 'xml',
        'fields': ['name', 'price'],
        'encoding': 'latin1',
        'indent': 8,
    },
    pathlib.Path('items.csv'): {
        'format': 'csv',
        'fields': ['price', 'name'],
    },
}

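The following is a list of the accepted keys and the setting that is used as a fallback value if that key is not provided for a specific feed definition:

format: the serialization format of the feed. See Serialization formats for possible values. Mandatory, no fallback setting.
encoding: falls back to FEED_EXPORT_ENCODING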
fields: falls back to FEED_EXPORT_FIELDS
indent: falls back to FEED_EXPORT_INDENT
store_empty: falls back to FEED_STORE_EMPTY
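The fallback rules listed above can be sketched in a few lines of plain Python. This is an illustration of the described behavior, not Scrapy's implementation: resolve_feed_options and the plain dict standing in for the project settings are both hypothetical.

```python
# Per-feed key -> global setting used when the key is missing.
FALLBACK_SETTINGS = {
    "encoding": "FEED_EXPORT_ENCODING",
    "fields": "FEED_EXPORT_FIELDS",
    "indent": "FEED_EXPORT_INDENT",
    "store_empty": "FEED_STORE_EMPTY",
}


def resolve_feed_options(feed_options, settings):
    """Return a copy of feed_options with missing keys filled from settings."""
    if "format" not in feed_options:
        # format is mandatory and has no fallback setting.
        raise ValueError("every feed definition needs a 'format' key")
    resolved = dict(feed_options)
    for key, setting_name in FALLBACK_SETTINGS.items():
        # setdefault keeps a value the feed sets explicitly, even None.
        resolved.setdefault(key, settings.get(setting_name))
    return resolved


# A plain dict standing in for the project settings (default values).
settings = {
    "FEED_EXPORT_ENCODING": None,
    "FEED_EXPORT_FIELDS": None,
    "FEED_EXPORT_INDENT": 0,
    "FEED_STORE_EMPTY": False,
}
resolved = resolve_feed_options({"format": "xml", "indent": 8}, settings)
print(resolved)
```

Here the feed keeps its own indent of 8, while encoding, fields, and store_empty are taken from the global settings.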

FEED_EXPORT_ENCODING

Default: None

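The encoding to be used for the feed.

If unset or set to None (default), UTF-8 is used for everything except JSON output, which for historical reasons uses safe numeric encoding (\uXXXX sequences).

Set it to utf-8 if you want UTF-8 for JSON too.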
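The difference between the two JSON behaviors can be seen with the standard library alone. The sketch below uses json.dumps and its ensure_ascii flag to illustrate the idea; it shows the two output styles, and it is not a claim about how the Scrapy exporter is written internally.

```python
import json

item = {"name": "café"}

# Safe numeric encoding: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# UTF-8 style output: characters are written as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# In settings.py, the equivalent choice for feeds would be:
FEED_EXPORT_ENCODING = "utf-8"
```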

FEED_EXPORT_FIELDS

Default: None

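A list of fields to export, optional. Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"]

Use the FEED_EXPORT_FIELDS option to define the fields to export and their order.

When FEED_EXPORT_FIELDS is empty or None (default), Scrapy uses the fields defined in the dicts or Item subclasses that the spider yields.

If an exporter requires a fixed set of fields (this is the case for the CSV export format) and FEED_EXPORT_FIELDS is empty or None, Scrapy tries to infer the field names from the exported data. Currently it uses the field names from the first item.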
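To make the "infer from the first item" rule concrete, here is a minimal sketch built on the standard csv module. export_csv is a hypothetical helper written for this note, and dropping fields that are not in the header is an assumption of the sketch, not documented Scrapy behavior.

```python
import csv
import io


def export_csv(items, fields=None):
    """Write items as CSV; if fields is empty, use the first item's keys."""
    items = list(items)
    if not items:
        return ""
    if not fields:
        # Inference step: the header is fixed by the first item only.
        fields = list(items[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        extrasaction="ignore",  # assumption: silently drop unknown fields
        lineterminator="\n",
    )
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()


items = [
    {"name": "a", "price": 1},
    {"name": "b", "price": 2, "stock": 3},  # "stock" is not in the header
]
inferred = export_csv(items)
explicit = export_csv(items, fields=["price", "name"])
print(inferred)
print(explicit)
```

Because the header comes from the first item, the "stock" field of the second item never reaches the inferred output. Listing the fields explicitly avoids that kind of surprise.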

FEED_EXPORT_INDENT

Default: 0

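Amount of spaces used to indent the output on each level. If FEED_EXPORT_INDENT is a non-negative integer, array elements and object members are pretty-printed with that indent level. An indent level of 0 (the default), or negative, puts each item on a new line. None selects the most compact representation.

Currently implemented only by JsonItemExporter and XmlItemExporter, that is, when you are exporting to .json or .xml.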
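The three cases (None, 0, and a positive integer) follow the same convention as the indent argument of Python's json.dumps, so they can be illustrated with the standard library. The exact bytes written by Scrapy's exporters may differ in detail; this sketch only shows the general effect of each value.

```python
import json

data = [{"name": "a"}]

compact = json.dumps(data, indent=None)   # most compact, single line
newlines_only = json.dumps(data, indent=0)  # line breaks, no leading spaces
pretty = json.dumps(data, indent=4)       # line breaks plus 4 spaces per level

print(compact)
print(newlines_only)
print(pretty)
```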

FEED_STORE_EMPTY

Default: False

Whether to export empty feeds (i.e. feeds with no items).

FEED_STORAGES

Default: {}

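A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the values are paths to storage classes.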

FEED_STORAGE_FTP_ACTIVE

Default: False

Whether to use the active connection mode when exporting feeds to an FTP server (True) or use the passive connection mode instead (False, default).

For information about FTP connection modes, see What is the difference between active and passive FTP?.

FEED_STORAGE_S3_ACL

Default: '' (empty string)

A string containing a custom ACL for feeds exported to Amazon S3 by your project.

For a complete list of available values, see the Canned ACL section of the Amazon S3 docs.

FEED_STORAGES_BASE

Default:

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}

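A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by assigning None to their URI scheme in FEED_STORAGES. For example, to disable the built-in FTP storage backend (without replacement), place this in your settings.py: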

FEED_STORAGES = {
    'ftp': None,
}
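One way to picture how these two dictionaries work together is the sketch below: the project's FEED_STORAGES is layered over FEED_STORAGES_BASE, entries set to None are dropped, and the URI scheme picks the backend. The helper names enabled_storages and storage_for are hypothetical, and looking up the scheme with urllib.parse.urlparse is an assumption made for illustration.

```python
from urllib.parse import urlparse

FEED_STORAGES_BASE = {
    "": "scrapy.extensions.feedexport.FileFeedStorage",
    "file": "scrapy.extensions.feedexport.FileFeedStorage",
    "stdout": "scrapy.extensions.feedexport.StdoutFeedStorage",
    "s3": "scrapy.extensions.feedexport.S3FeedStorage",
    "ftp": "scrapy.extensions.feedexport.FTPFeedStorage",
}
FEED_STORAGES = {"ftp": None}  # disable the built-in FTP backend


def enabled_storages(base, overrides):
    """Layer project overrides over the base and drop disabled schemes."""
    merged = {**base, **overrides}
    return {scheme: path for scheme, path in merged.items() if path is not None}


def storage_for(uri, storages):
    """Return the storage class path for a feed URI, or None if unavailable."""
    # A path with no scheme, such as /tmp/export.csv, maps to the "" key.
    return storages.get(urlparse(uri).scheme)


storages = enabled_storages(FEED_STORAGES_BASE, FEED_STORAGES)
print(storage_for("/tmp/export.csv", storages))
print(storage_for("stdout:", storages))
print(storage_for("ftp://user:pass@ftp.example.com/path/to/export.csv", storages))
```

With the FTP backend disabled, the last lookup returns None, while the local file and stdout URIs still resolve to their built-in classes.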

FEED_EXPORTERS

Default: {}

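A dict containing additional exporters supported by your project. The keys are serialization formats and the values are paths to Item exporter classes.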
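As a sketch of how a project might register an extra format, the fragment below maps a made-up "tsv" format to an exporter class. Both the format name and the class path myproject.exporters.TsvItemExporter are hypothetical; the class itself would have to be written separately.

```python
# settings.py (sketch): register a hypothetical exporter under a new format.
FEED_EXPORTERS = {
    # Hypothetical class path; no such exporter ships with Scrapy.
    "tsv": "myproject.exporters.TsvItemExporter",
}

FEEDS = {
    # The format key refers to the name registered above.
    "items.tsv": {"format": "tsv"},
}
```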

FEED_EXPORTERS_BASE

Default:

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}

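A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None to their serialization format in FEED_EXPORTERS. For example, to disable the built-in CSV exporter (without replacement), place this in your settings.py: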

FEED_EXPORTERS = {
    'csv': None,
}

Honestly, I did not really understand much of this.
