[Feed exports] - Feed Export Configuration in Detail

2018-07-13  seven1010
Sample of exported data (quotes in JSON format):

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
...
]

Feed exports parameters in detail

The parameters are explained one by one below (the ones covered here are the most practical and worth mastering):
1. FEED_URI

Specifies the storage location and filename of the exported file. Supported destinations include:

Local file

D://tmp/filename.csv

FTP

ftp://user:pass@ftp.example.com/path/to/filename.csv
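Scrapy also substitutes placeholder variables in the URI at export time, such as %(name)s (the spider name) and %(time)s (a timestamp). A minimal settings.py sketch (the paths here are illustrative examples, not from the article):

```python
# settings.py -- example FEED_URI values (illustrative paths)
# %(name)s and %(time)s are replaced by Scrapy when the feed is created
FEED_URI = 'file:///tmp/%(name)s/%(time)s.csv'

# FTP destination (credentials embedded in the URI):
# FEED_URI = 'ftp://user:pass@ftp.example.com/path/to/filename.csv'
```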

2. FEED_FORMAT

Specifies the output serialization format. The supported formats are listed below, each with an example:

json — implemented by JsonItemExporter. Example:

[{"name": "Color TV", "price": "1200"},
{"name": "DVD player", "price": "200"}]

Note: the json format is not recommended for large datasets, because the exporter buffers the entire object in memory before writing it out. For large volumes, use jsonlines instead, or split the output into chunks across files.

jsonlines — implemented by JsonLinesItemExporter. Example:

{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}

csv — implemented by CsvItemExporter. Example:

name,price
Color TV,1200
DVD player,200

The first row contains the field names; each following row is one record.

xml — implemented by XmlItemExporter. Example:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <name>Color TV</name>
    <price>1200</price>
  </item>
  <item>
    <name>DVD player</name>
    <price>200</price>
  </item>
</items>

The remaining formats, pickle and marshal, are not covered here.

3. Storage backends

FEED_STORAGES

Defaults to {}. To add a storage backend, use the URI scheme as the key and the import path of the storage class as the value.

FEED_STORAGES_BASE

The built-in storage backends. The default is:

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
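A custom backend is registered by mapping a new URI scheme in FEED_STORAGES. The class path below is a hypothetical example for illustration, not a real Scrapy class:

```python
# settings.py -- registering a (hypothetical) custom storage backend.
# URIs such as 'sftp://host/path/file.csv' in FEED_URI would then be
# dispatched to this class.
FEED_STORAGES = {
    'sftp': 'myproject.feedstorages.SFTPFeedStorage',  # hypothetical class
}
```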

4. File output formats

FEED_EXPORTERS

Defaults to {}. Defines additional exporters: the key is the format name and the value is the import path of the exporter class.

FEED_EXPORTERS_BASE

The default exporters are:

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}

5. Encoding and output fields

FEED_EXPORT_ENCODING

The encoding of the exported file. Defaults to None; it is usually set to utf-8.
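The effect resembles json's ensure_ascii switch: without an encoding set, non-ASCII characters come out as \uXXXX escapes, which is exactly why the sample output at the top of this article shows \u201c instead of curly quotes. A stdlib illustration (not Scrapy itself):

```python
import json

text = "\u201cHello\u201d"  # text containing curly quotes

# Default: non-ASCII characters are escaped to \uXXXX sequences
escaped = json.dumps({"text": text})

# With ensure_ascii=False, characters are written literally --
# analogous to setting FEED_EXPORT_ENCODING = 'utf-8'
literal = json.dumps({"text": text}, ensure_ascii=False)
```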

FEED_EXPORT_FIELDS

Specifies which fields to export and their order. Example:

FEED_EXPORT_FIELDS = ["foo", "bar", "baz"]

FEED_EXPORT_INDENT

Defaults to 0. When the value is 0 or negative, each item is written on a new line without indentation; when it is greater than 0, each nesting level is indented by that many spaces.
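For the JSON exporter this behaves like the indent argument of json.dumps, which can be used to preview the effect (a stdlib illustration):

```python
import json

item = {"name": "Color TV", "price": "1200"}

compact = json.dumps([item])           # no indent: single compact line
pretty = json.dumps([item], indent=4)  # 4 spaces per nesting level
```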

Usage example

# -*- coding: utf-8 -*-
import scrapy


class QuotesItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            # item = dict(text=text, author=author)
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

In the same directory, run the spider from the command line:

(saves the data to the file defined in the spider's custom_settings)

scrapy runspider Quotes_Spider.py -a category=love

(saves the data to the file specified on the command line)

scrapy runspider Quotes_Spider.py -a category=love -o new_quotes.json