8. CrawlSpider (incremental template spider)

2018-10-31, by 学飞的小鸡
To create this kind of spider, use: scrapy genspider -t crawl <spider name> <domain>
For this example: scrapy genspider -t crawl dushu dushu.com
dushu.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
# LinkExtractor is imported to extract links from responses
from scrapy.spiders import CrawlSpider, Rule
# A Rule defines one crawling rule; its LinkExtractor extracts the URLs that match it

from CrawlSpiderDemo.items import CrawlspiderdemoItem

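# The two Scrapy spider classes covered here are Spider (the basic spider) and CrawlSpider (the incremental template spider)
# CrawlSpider is a subclass of Spider. By design, a plain Spider only requests the URLs in start_urls,
# whereas CrawlSpider defines rules that follow links, so every link on a page that matches a rule
# is extracted and put into the scheduler
# As the spider keeps visiting URLs, the set of matched URLs keeps growing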

class DushuSpider(CrawlSpider):
    name = 'dushu'
    allowed_domains = ['dushu.com']
    start_urls = ['https://www.dushu.com/book/1002.html']

    rules = (
        Rule(LinkExtractor(allow=r'/book/1002_\d+\.html'), callback='parse_item', follow=True),
    )
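    # rules: a collection of Rule objects, each describing one way of crawling the site.
    # The links matched by a Rule's LinkExtractor are extracted and pushed through the engine
    # into the scheduler's queue. The scheduler has them downloaded, and each response is passed
    # to parse_item (the callback is given as a string here). Because follow=True, the same
    # LinkExtractor rules are then applied to those pages as well (duplicate requests are filtered
    # out automatically), and so on until every matching page has been reached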

    # Ways for LinkExtractor to select links:
    # by regular expression: LinkExtractor(allow=r"some regex")  # e.g. /book/1002_\d+\.html
    # by XPath region: LinkExtractor(restrict_xpaths="some XPath")
    # by CSS selector region: LinkExtractor(restrict_css="some CSS selector")

    def parse_item(self, response):
        print(response.url)
        # Parse the listing page
        book_list = response.xpath("//div[@class='bookslist']//li")
        for book in book_list:
            item = CrawlspiderdemoItem()
            item["book_name"] = book.xpath(".//h3/a/text()").extract_first()

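            # Parse any other fields you need here

            # Build the URL of the detail (second-level) page
            href = book.xpath(".//h3/a/@href").extract_first()
            if href is None:
                continue  # no link for this entry, so there is nothing to follow
            next_url = response.urljoin(href)

            # Hand the partially filled item to the detail-page callback via meta
            yield scrapy.Request(url=next_url, callback=self.parse_next, meta={"item": item})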

    def parse_next(self, response):
        item = response.meta["item"]
        item["price"] = response.xpath("//span[@class='num']/text()").extract_first()
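        # The third "text txtsummary" block is taken as the table of contents ("mulu");
        # fall back to an empty list on pages that have fewer than three such blocks
        blocks = response.xpath("//div[@class='text txtsummary']")
        item["mulu"] = blocks[2].xpath(".//text()").extract() if len(blocks) > 2 else []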

        yield item
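
The follow-and-deduplicate behaviour described in the rules comment can be sketched without Scrapy at all. The snippet below is only an illustration under stated assumptions: SITE is a made-up in-memory link graph (the page URLs other than the start URL are hypothetical), crawl is a hypothetical helper rather than a Scrapy API, and the visiting order is simplified and need not match the order Scrapy's scheduler would use. It applies the same allow pattern with re.search, keeps a set of already-seen URLs, and records which pages would reach the rule callback. As in CrawlSpider, the start URL itself is not handed to the rule callback.

```python
import re
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "https://www.dushu.com/book/1002.html": [
        "https://www.dushu.com/book/1002_2.html",
        "https://www.dushu.com/book/1002_3.html",
        "https://www.dushu.com/book/1003.html",    # other category, does not match
    ],
    "https://www.dushu.com/book/1002_2.html": [
        "https://www.dushu.com/book/1002_3.html",  # duplicate, already queued
        "https://www.dushu.com/book/1002_10.html",
    ],
    "https://www.dushu.com/book/1002_3.html": [
        "https://www.dushu.com/book/1002_2.html",  # duplicate, already visited
    ],
    "https://www.dushu.com/book/1002_10.html": [],
}

# Same pattern as the Rule in dushu.py.
ALLOW = re.compile(r"/book/1002_\d+\.html")


def crawl(start_url, site, allow):
    """Return the URLs that would reach the rule callback, in visiting order."""
    seen = {start_url}          # plays the role of the duplicate filter
    queue = deque([start_url])  # plays the role of the scheduler queue
    parsed = []
    while queue:
        url = queue.popleft()
        if url != start_url:    # the start page is not passed to the rule callback
            parsed.append(url)
        for link in site.get(url, []):
            # follow=True: matching links on every visited page are queued once
            if allow.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return parsed


pages = crawl("https://www.dushu.com/book/1002.html", SITE, ALLOW)
for page in pages:
    print(page)
```

In this sketch each matching page is visited exactly once even though several pages link to each other. It also shows why the pattern uses \d+ rather than \d: with a single \d, a two-digit page such as 1002_10.html would not match and would never be followed.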