Scrapy Notes: Common Commands

2018-06-21 hhuua

Common commands

Creating a project

Set up a new Scrapy project:

scrapy startproject projectname

Running a spider

scrapy crawl spidername

Testing data extraction

scrapy shell 'http://www.xxx.com'

CSS selectors

In the shell, you can try selecting elements with CSS through the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

To extract the text from the title above:

>>> response.css('title::text').extract()
['Quotes to Scrape']

We added ::text to the CSS query, which means we only want to select the text nodes directly inside the <title> element. Without ::text, we get the full title element, including its tags:

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

The re method extracts data using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
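
A rough way to reason about these results: on each selected string, re behaves much like re.findall from Python's standard library, with any capture groups flattened into a single list. The sketch below uses only the standard library; the helper re_like is a hypothetical name, and it is an approximation for illustration, not Scrapy's actual implementation.

```python
import re


def re_like(pattern, text):
    """Approximate Selector.re() on a single string (illustrative helper)."""
    flat = []
    for match in re.findall(pattern, text):
        # findall returns tuples when the pattern has several groups.
        if isinstance(match, tuple):
            flat.extend(match)
        else:
            flat.append(match)
    return flat


title = 'Quotes to Scrape'
print(re_like(r'Quotes.*', title))
print(re_like(r'Q\w+', title))
print(re_like(r'(\w+) to (\w+)', title))
```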

XPath

Scrapy selectors also support XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

Storing data

Feed

The simplest way to store scraped data is with Feed exports:

scrapy crawl spidername -o xxxx.json

This generates an xxxx.json file containing all scraped items, serialized as JSON.

Other formats are available, such as JSON Lines:

scrapy crawl spidername -o xxxx.jl

Because each record is on its own line, you can process large files without loading everything into memory.
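
To illustrate this line-by-line property, a JSON Lines file can be consumed one record at a time with the standard library. The function name iter_jsonl and the sample records are assumptions made for this sketch.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one decoded record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a file such as xxxx.jl (sample data is made up).
sample = io.StringIO('{"text": "a", "tags": ["x"]}\n{"text": "b", "tags": []}\n')
for record in iter_jsonl(sample):
    # Only the current record is held in memory at this point.
    print(record['text'])
```

With a real export, an open file object can be passed in place of the StringIO, for example open('xxxx.jl', encoding='utf-8').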

Spider arguments

When running a spider, you can pass command-line arguments to it with the -a option:

scrapy crawl spidername -o xxxx-humor.json -a tag=xxx

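These arguments are passed to the Spider's __init__ method and become spider attributes by default.

You can use this to have the spider build its URL from the argument, so that it only scrapes data with a specific tag: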
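
This default behaviour can be pictured as the constructor copying its keyword arguments onto the instance. The class below is a simplified stand-in written for illustration (SketchSpider is a hypothetical name, not Scrapy's Spider class). Note also that values passed with -a arrive as strings.

```python
class SketchSpider:
    """Simplified stand-in for a spider base class (not the real scrapy.Spider)."""

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Each extra keyword argument becomes an instance attribute.
        self.__dict__.update(kwargs)


# Roughly what `-a tag=xxx` amounts to.
spider = SketchSpider(name='spidername', tag='xxx')
print(spider.tag)
# Arguments that were not passed are simply absent.
print(getattr(spider, 'category', None))
```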

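def start_requests(self):
    # Start from the site root unless a tag was supplied with -a tag=...
    url = 'http://quotes.toscrape.com/'
    tag = getattr(self, 'tag', None)
    if tag is not None:
        # Narrow the crawl to the listing page for that tag.
        url = url + 'tag/' + tag
    yield scrapy.Request(url, self.parse)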
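
The URL-building branch can be checked in isolation, without running a crawl. The helper build_start_url below is a hypothetical refactoring of the logic above, not something Scrapy provides, and the 'humor' tag is only an example value.

```python
def build_start_url(tag=None, base='http://quotes.toscrape.com/'):
    """Return the start URL, narrowed to one tag when a tag is given."""
    if tag is not None:
        return base + 'tag/' + tag
    return base


print(build_start_url())
print(build_start_url('humor'))
```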