5. Hands-on project: crawling the matplotlib example source files

2018-03-07 橄榄的世界

Target URL: https://matplotlib.org/examples/
Target data: the source code of every example
Crawling method: the Scrapy framework
Storage: FilesPipeline

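matplotlib is a well-known Python plotting library. From the list of examples you can open each example's page to read its code, and clicking the 'source code' link on that page downloads the source file.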

1. Analyze the page with scrapy shell:
scrapy shell https://matplotlib.org/examples/index.html

image.png

This gives the detail-page link of every example.

image.png

- Get the download link for the source code on an example's detail page.

image.png

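2. Implementation:
1) Create the matplotlib_download project (scrapy startproject matplotlib_download) and generate the spider with genspider (scrapy genspider matplot matplotlib.org).
2) Configure FilesPipeline and specify the download directory.
3) Implement the Item.
4) Implement the spider file.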

Create the project:

image.png
Configure settings.py and specify the download directory:

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline':1,
}
FILES_STORE = 'source_download'

In items.py, add the two fields file_urls and files.

import scrapy

class MatplotlibDownloadItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

Implement spiders/matplot.py:

import scrapy
from scrapy.linkextractors import LinkExtractor
from matplotlib_download.items import MatplotlibDownloadItem

class MatplotSpider(scrapy.Spider):
    name = 'matplot'
    allowed_domains = ['matplotlib.org']
    start_urls = ['https://matplotlib.org/examples/index.html']

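    def parse(self, response):
        # Each second-level toctree entry links to one example's detail page.
        le = LinkExtractor(restrict_xpaths='//li[@class="toctree-l2"]/a')
        detail_links = le.extract_links(response)
        for detail_link in detail_links:
            yield scrapy.Request(detail_link.url, callback=self.parse_url)

    def parse_url(self, response):
        # The first paragraph of the section holds the 'source code' link.
        le2 = LinkExtractor(restrict_xpaths='//div[@class="section"]/p[1]/a')
        download_links = le2.extract_links(response)
        if not download_links:
            return  # no source link on this page, so skip it
        item = MatplotlibDownloadItem()
        # FilesPipeline downloads every URL listed in file_urls.
        item['file_urls'] = [download_links[0].url]
        yield item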

Run the spider with scrapy crawl matplot -o matplot.json. The result:
image.png

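The downloaded files are placed in the source_download/full directory, and their names are odd-looking hexadecimal strings of equal length. Each name is the SHA-1 hash of the file's download URL. This avoids name collisions, but the names are not descriptive and are hard to match to file contents, so the files need to be renamed, for example with a separate script driven by the information in matplot.json.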
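
Such a renaming script might look roughly like the sketch below. It assumes that each entry in matplot.json has a files list whose elements carry url and path keys (the layout FilesPipeline writes into the files field) and that the files live under source_download. The function name rename_downloads is made up for illustration.

```python
import json
import os
import shutil
from urllib.parse import urlparse


def rename_downloads(json_path, store_dir):
    """Move hash-named downloads to <category>/<filename> under store_dir."""
    with open(json_path, encoding="utf-8") as f:
        items = json.load(f)

    moved = []
    for item in items:
        # Assumed layout: "files" holds dicts with the source "url" and the
        # stored "path" relative to store_dir (e.g. full/<sha1>.py).
        for info in item.get("files", []):
            parts = urlparse(info["url"]).path.split("/")
            category, filename = parts[-2], parts[-1]
            src = os.path.join(store_dir, info["path"])
            dst = os.path.join(store_dir, category, filename)
            if not os.path.exists(src):
                continue  # download failed or the file was already moved
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
            moved.append(dst)
    return moved
```

After a crawl, calling rename_downloads('matplot.json', 'source_download') would move each file into a folder named after its category and return the new paths.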

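Below we create a subclass of FilesPipeline and override the naming rule in its file_path method, so that files are saved under readable names directly. Take one file as an example:
https://matplotlib.org/examples/animation/animate_decay.py
Here animation is the category, animate_decay.py is the file name, and animation/animate_decay.py is the file path.
Add the following code to pipelines.py: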

from scrapy.pipelines.files import FilesPipeline
import os

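class MyFilesPipeline(FilesPipeline):

    # Newer Scrapy versions also pass a keyword-only item argument.
    def file_path(self, request, response=None, info=None, *, item=None):
        # e.g. https://matplotlib.org/examples/animation/animate_decay.py
        folder = request.url.split('/')[-2]    # category: animation
        filename = request.url.split('/')[-1]  # file name: animate_decay.py
        return os.path.join(folder, filename)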
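
To check the naming rule without running a crawl, the same derivation can be tried on a plain URL string. This is a small sketch: the helper name category_path is hypothetical, and it uses urlparse and posixpath instead of splitting the raw URL, so a query string would not leak into the file name.

```python
import posixpath
from urllib.parse import urlparse


def category_path(url):
    """Return '<category>/<filename>' from the last two URL path segments."""
    path = urlparse(url).path  # '/examples/animation/animate_decay.py'
    folder = posixpath.basename(posixpath.dirname(path))
    filename = posixpath.basename(path)
    return posixpath.join(folder, filename)


print(category_path("https://matplotlib.org/examples/animation/animate_decay.py"))
# animation/animate_decay.py
```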

Update ITEM_PIPELINES in settings.py as follows:

ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline':1,
    'matplotlib_download.pipelines.MyFilesPipeline':1,
}

The result is as follows, exactly what we wanted:

image.png
