[Feed exports] - Data Export Configuration Explained

2018-07-13 seven1010

To export scraped data to a file, add the optional -o argument when running the spider:
scrapy runspider toscrape-css.py -o quotes.json
The saved data looks like this:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
...
]

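The data contains sequences such as \u201c and \u201d that are hard to read. These are JSON escape sequences for the curly quotation marks “ and ”: no export encoding was set, so non-ASCII characters were written as \uXXXX escapes. This is a good point to go through the settings that control how data is saved.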
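The same effect can be reproduced outside Scrapy with Python's standard json module. This is only a sketch of the escaping behaviour, not Scrapy's own exporter code, and the `item` dictionary is made up for illustration.

```python
import json

# Hypothetical item containing curly quotes (U+201C and U+201D).
item = {"text": "\u201cHello\u201d", "author": "Someone"}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written as-is,
# which is readable once the file is saved as UTF-8.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to exactly the same data.
assert json.loads(escaped) == json.loads(readable) == item
```

The escaped form is still valid JSON and loses no information; it is only harder for a person to read.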

Feed exports settings in detail

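FEED_URI (where the output file goes)
FEED_FORMAT (data format)
FEED_STORAGES (additional storage backends, i.e. where the data is stored)
FEED_STORAGES_BASE (built-in storage backends)
FEED_EXPORTERS (additional exporters)
FEED_EXPORTERS_BASE (built-in exporters)
FEED_STORE_EMPTY (whether to export empty feeds; off by default)
FEED_EXPORT_ENCODING (output file encoding)
FEED_EXPORT_FIELDS (which fields to export, and in what order)
FEED_EXPORT_INDENT (indentation for pretty-printed output)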

The settings are explained in turn below.
1. FEED_URI

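Sets the location and name of the output file. Supported destinations include the following (S3 and standard output are also built in; see FEED_STORAGES_BASE below):

Local file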

D://tmp/filename.csv

FTP

ftp://user:pass@ftp.example.com/path/to/filename.csv
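According to the Scrapy documentation, FEED_URI may also contain the placeholders %(name)s (the spider name) and %(time)s (a timestamp taken when the feed is created). The sketch below only imitates that substitution with ordinary Python string formatting. The template, the `expand_uri` helper and the values are illustrative assumptions, not taken from the original post.

```python
from datetime import datetime

# Hypothetical URI template; %(name)s and %(time)s are the documented placeholders.
uri_template = "ftp://user:pass@ftp.example.com/path/to/%(name)s-%(time)s.csv"


def expand_uri(template, spider_name, now):
    """Fill the placeholders using plain %-formatting (illustrative helper)."""
    params = {
        "name": spider_name,
        # Colons are replaced so the timestamp is safe inside file names.
        "time": now.replace(microsecond=0).isoformat().replace(":", "-"),
    }
    return template % params


expanded = expand_uri(uri_template, "quotes", datetime(2018, 7, 13, 9, 30, 0))
print(expanded)
```

A timestamped name like this keeps each run in its own file instead of reusing the previous one.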

2. FEED_FORMAT

Sets the output data format. The supported formats, each with an example, are:

JSON
FEED_FORMAT: json
Exporter used: JsonItemExporter

Example output:

[{"name": "Color TV", "price": "1200"},
{"name": "DVD player", "price": "200"}]

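Note: JSON is not recommended for large amounts of data. The whole feed is a single array, and most JSON parsers load the entire object into memory when reading it back. For large exports, use jsonlines instead, or split the output across several files.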

JSON lines
FEED_FORMAT: jsonlines
Exporter used: JsonLinesItemExporter

Example output (one JSON object per line):

{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}
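Because every line is a complete JSON document, a JSON lines file can be written and read one item at a time. A minimal sketch using only the standard library, where the in-memory buffer stands in for a real file:

```python
import io
import json

items = [
    {"name": "Color TV", "price": "1200"},
    {"name": "DVD player", "price": "200"},
]

# Write: one JSON object per line, appended as each item arrives.
buffer = io.StringIO()
for item in items:
    buffer.write(json.dumps(item) + "\n")

# Read: iterate line by line, so only one item is decoded at a time.
buffer.seek(0)
names = []
for line in buffer:
    record = json.loads(line)
    names.append(record["name"])

print(names)
```

A plain JSON array cannot be consumed this way with json.loads, because the closing bracket only arrives at the very end of the file.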

CSV
FEED_FORMAT: csv
Exporter used: CsvItemExporter
To specify columns to export and their order use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.

Example output:

product,price
Color TV,1200
DVD player,200

The first row holds the field names; each following row is one item.
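The fixed header is why field selection and order matter for CSV. The sketch below uses the standard csv module rather than Scrapy's CsvItemExporter. Here `export_fields` plays the role that FEED_EXPORT_FIELDS plays in a Scrapy project, and the sample items (including the extra "stock" field) are made up.

```python
import csv
import io

# Plays the role of FEED_EXPORT_FIELDS: the chosen columns, in order.
export_fields = ["product", "price"]

items = [
    {"price": "1200", "product": "Color TV", "stock": "3"},
    {"price": "200", "product": "DVD player", "stock": "0"},
]

output = io.StringIO()
# extrasaction="ignore" drops fields that are not listed (here: "stock").
writer = csv.DictWriter(output, fieldnames=export_fields, extrasaction="ignore")
writer.writeheader()  # the header is written once, before any data row
writer.writerows(items)

rows = output.getvalue().splitlines()
print(rows)
```

Since the header is written before any data, the column list has to be known up front; it cannot grow if a later item carries a new field.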

XML
FEED_FORMAT: xml
Exporter used: XmlItemExporter

Example output:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <name>Color TV</name>
    <price>1200</price>
  </item>
  <item>
    <name>DVD player</name>
    <price>200</price>
  </item>
</items>

The remaining formats, Pickle and Marshal, are not covered here.

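3. Storage backends

FEED_STORAGES

Defaults to {}. To register additional storage backends, use the URI scheme as the key and the import path of the storage class as the value.

FEED_STORAGES_BASE

The built-in storage backends. The default is: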

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
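How a URI ends up at one of these classes can be pictured as a lookup on the URI scheme. The function below is a simplified sketch, not Scrapy's actual code. `storage_for` is an illustrative helper, and `myproject.storages.SftpFeedStorage` is a hypothetical class path used only to show an extra entry coming from FEED_STORAGES.

```python
from urllib.parse import urlparse

FEED_STORAGES_BASE = {
    "": "scrapy.extensions.feedexport.FileFeedStorage",
    "file": "scrapy.extensions.feedexport.FileFeedStorage",
    "stdout": "scrapy.extensions.feedexport.StdoutFeedStorage",
    "s3": "scrapy.extensions.feedexport.S3FeedStorage",
    "ftp": "scrapy.extensions.feedexport.FTPFeedStorage",
}

# Hypothetical project-level addition (the class path is made up).
FEED_STORAGES = {"sftp": "myproject.storages.SftpFeedStorage"}


def storage_for(uri):
    """Return the storage class path registered for the URI's scheme."""
    storages = dict(FEED_STORAGES_BASE)
    storages.update(FEED_STORAGES)  # project entries extend the defaults
    scheme = urlparse(uri).scheme
    if scheme not in storages:
        raise ValueError("Unknown feed storage scheme: %s" % scheme)
    return storages[scheme]


print(storage_for("quotes.json"))
print(storage_for("ftp://user:pass@ftp.example.com/path/to/filename.csv"))
print(storage_for("sftp://example.com/out.csv"))
```

A bare file name has an empty scheme, which is why the '' key maps to FileFeedStorage.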

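4. Output formats (exporters)

FEED_EXPORTERS

Defaults to {}. Registers additional exporters: the key is the format name and the value is the import path of the exporter class.

FEED_EXPORTERS_BASE

The built-in exporters are: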

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
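FEED_EXPORTERS is combined with FEED_EXPORTERS_BASE, and the Scrapy documentation notes that a built-in exporter can be disabled by mapping its format to None. The merge can be sketched as follows. This is not Scrapy's internal code: `enabled_exporters` is an illustrative helper and `myproject.exporters.TsvItemExporter` is a hypothetical class path.

```python
FEED_EXPORTERS_BASE = {
    "json": "scrapy.exporters.JsonItemExporter",
    "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
    "jl": "scrapy.exporters.JsonLinesItemExporter",
    "csv": "scrapy.exporters.CsvItemExporter",
    "xml": "scrapy.exporters.XmlItemExporter",
    "marshal": "scrapy.exporters.MarshalItemExporter",
    "pickle": "scrapy.exporters.PickleItemExporter",
}

# What a project might put in settings.py (both entries are assumptions).
FEED_EXPORTERS = {
    "tsv": "myproject.exporters.TsvItemExporter",  # add a custom format
    "marshal": None,                               # switch off a built-in one
}


def enabled_exporters(base, extra):
    """Overlay project exporters on the defaults and drop disabled formats."""
    merged = dict(base)
    merged.update(extra)
    return {fmt: path for fmt, path in merged.items() if path is not None}


exporters = enabled_exporters(FEED_EXPORTERS_BASE, FEED_EXPORTERS)
print(sorted(exporters))
```

Keeping the defaults in a separate _BASE setting means a project only lists what it changes.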

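5. Encoding and output fields

FEED_EXPORT_ENCODING

The encoding of the output file. The default is None, in which case UTF-8 is used for every format except JSON, which falls back to \uXXXX escapes for non-ASCII characters. It is usually set to utf-8.

FEED_EXPORT_FIELDS

Sets which fields are exported and in what order. For example: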

FEED_EXPORT_FIELDS = ["foo", "bar", "baz"]

FEED_EXPORT_INDENT

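The default is 0. When the value is 0 or negative, each item is written on its own line. A value greater than 0 pretty-prints the output, indenting each nesting level by that many spaces. According to the Scrapy documentation, only the JSON and XML exporters implement this setting.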
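What "that many spaces per level" means can be seen with the indent argument of the standard json module. This is only an analogy for the positive-indent case, using a made-up item; Scrapy's exporter handles a value of 0 in its own way, as described above.

```python
import json

# Hypothetical item with one nested level (the list of tags).
item = {"text": "Hello", "tags": ["change", "world"]}

compact = json.dumps(item)             # no indentation: a single line
indented = json.dumps(item, indent=4)  # 4 spaces per nesting level

print(compact)
print(indented)
```

In the indented output the top-level keys are indented by 4 spaces and the list elements by 8.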

Usage example

# -*- coding: utf-8 -*-
import scrapy


class QuotesItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

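    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        # category comes from "-a category=..." on the command line;
        # without it, start from the front page instead of /tag/None/
        if category:
            self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category]
        else:
            self.start_urls = ['http://quotes.toscrape.com/']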

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            # item = dict(text=text, author=author)
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

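From a command prompt in the same directory, run:

(Saves the data to the file defined in the spider)

scrapy runspider Quotes_Spider.py -a category=love

(Saves the data to the file given on the command line)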

scrapy runspider Quotes_Spider.py -a category=love -o new_quotes.json
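For readers on newer Scrapy releases: since Scrapy 2.1 the FEED_URI and FEED_FORMAT settings are deprecated in favour of a single FEEDS dictionary. The fragment below is a sketch of how the custom_settings of the spider above might look in that style. It is not part of the original post and should be checked against the documentation for the installed version.

```python
# Possible replacement for the spider's custom_settings on Scrapy 2.1+.
# Each key is an output URI; its value holds the options for that feed.
custom_settings = {
    "FEEDS": {
        "quotes.jsonlines": {
            "format": "jsonlines",
            "encoding": "utf8",
        },
    },
}
```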
