Downloading Files with Scrapy

2017-12-17 松爱家的小秦

This time we download every example source file linked from http://matplotlib.org/examples/index.html

In settings.py, enable the pipeline in ITEM_PIPELINES and set FILES_STORE to the directory where downloaded files should be saved

ITEM_PIPELINES = {
    # 'matplot_example.pipelines.MatplotExamplePipeline': 300,
    # The dotted path must match the real module and class name,
    # otherwise Scrapy fails to load the pipeline.
    'matplot_example.pipelines.MyFilesPipeline': 1,
}

MyFilesPipeline itself is implemented later, in pipelines.py

FILES_STORE = 'examples_src3'

Define the Item in items.py

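> import scrapy
>
> class ExampleItem(scrapy.Item):
>     file_urls = scrapy.Field()  # URLs to download; you fill this in
>     files = scrapy.Field()      # download results; filled in by the pipeline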

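You populate file_urls yourself. Once the downloads finish, the pipeline fills files automatically with information about each downloaded file (its URL, storage path, and checksum)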
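To make the files field more concrete: by default, FilesPipeline stores each download under a path built from the SHA-1 hash of its URL, which is why the saved names look like gibberish. The sketch below approximates that naming scheme using only the standard library. The function name default_style_path is hypothetical, and the real pipeline does a little more (for example, it drops extensions it does not recognise).

```python
import hashlib
import os
from urllib.parse import urlparse


def default_style_path(url):
    """Approximate the default FilesPipeline path: full/<sha1 of URL><extension>.

    Hypothetical helper for illustration only, not part of Scrapy.
    """
    # The hash of the whole URL becomes the file name.
    media_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Keep the original extension, e.g. ".py".
    media_ext = os.path.splitext(urlparse(url).path)[1]
    return f"full/{media_guid}{media_ext}"


print(default_style_path("http://matplotlib.org/examples/animation/animate_decay.py"))
```

The printed path has the form full/<40 hex characters>.py, and a path of this kind is what ends up in the path entry of each files record.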

Use scrapy shell to work out the links to all the example pages you want to crawl

from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')

links = le.extract_links(response)

[link.url for link in links]

Set up your spider

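You can generate a default spider skeleton with scrapy genspider <name> matplotlib.org, where <name> is the spider name you choose (genspider needs both a name and a domain)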

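The overall approach:

1. Use scrapy shell to work out the links to the files you want to download

2. Define your Item in items.py; it must have the fields file_urls and files

3. Set the file storage directory (FILES_STORE) in settings.py

4. Write your spider

5. Override FilesPipeline so that downloaded files keep readable names instead of the default hash-based names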