Using Scrapy rules

2018-07-14, read by 381 people, seven1010

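Reference
A typical crawler works like this: given a start page, request it, parse out all the other links the page contains, put those links in a queue, and visit the queued links one by one until a boundary condition ends the crawl. To handle the common "list page + detail page" pattern, the link extraction (link extractor) logic has to be restricted. Fortunately Scrapy already provides this; the key is to know the interface and use it flexibly.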
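The queue-driven loop described above can be sketched without any network access. This is only an illustration of the idea, not Scrapy's scheduler: `SITE` is a made-up in-memory link graph standing in for real pages, and `max_pages` is an assumed boundary condition.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "/": ["/list/1", "/about"],
    "/list/1": ["/list/2", "/item/a"],
    "/list/2": ["/list/1", "/item/b"],
    "/item/a": [],
    "/item/b": [],
    "/about": ["/"],
}


def crawl(start, max_pages=100):
    """Visit pages breadth-first and return the URLs in visiting order."""
    queue = deque([start])
    seen = {start}  # URLs already queued, so no page is requested twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        # "Parse" the page: look up its outgoing links and queue the new ones.
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/"))
```

Without any restriction every reachable link is visited, including `/about`, which is why the list + detail pattern needs rules that narrow down which links are queued.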

rules = (
    # List pages: pagination links inside the left-hand column
    Rule(LinkExtractor(allow=(r'category/20/index_\d+\.html',), restrict_xpaths=("//div[@class='left']",))),
    # Detail pages: parsed by parse_item
    Rule(LinkExtractor(allow=(r'a/\d+/\d+\.html',), restrict_xpaths=("//div[@class='left']",)), callback='parse_item'),
)

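Explanation:

Meaning of the parameters
A Rule defines how links are extracted. The two rules above correspond to the paginated list pages and to the detail pages respectively. The key point is that restrict_xpaths limits link extraction to one specific part of the page, so only links found there are scheduled for crawling.
The rules attribute of a CrawlSpider extracts URLs from the response returned for each start URL and automatically creates new requests for them. The responses to those requests are then parsed by the callback of the rule that extracted the URL.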
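To see which URLs each of the two rules above would accept, the `allow` patterns can be tried directly with Python's `re` module. This is only a sketch of the matching step, assuming the patterns are applied with `re.search` against the absolute URL, as Scrapy's link extractor does. The `example.com` URLs and the `classify` helper are made up for illustration, and the `restrict_xpaths` filtering is not reproduced here.

```python
import re

# The two allow patterns from the rules above.
LIST_PAGE = re.compile(r'category/20/index_\d+\.html')
DETAIL_PAGE = re.compile(r'a/\d+/\d+\.html')


def classify(url):
    """Return which rule would accept the URL, or 'ignored' if neither does."""
    # search() matches anywhere in the URL, so the patterns are not anchored.
    if LIST_PAGE.search(url):
        return 'list'
    if DETAIL_PAGE.search(url):
        return 'detail'
    return 'ignored'


# Hypothetical URLs, for illustration only.
for url in [
    'http://example.com/category/20/index_2.html',
    'http://example.com/a/2018/1234.html',
    'http://example.com/category/21/index_2.html',
]:
    print(classify(url), url)
```

Because the patterns are unanchored, `a/\d+/\d+\.html` would also accept a path such as `data/1/2.html`. Adding a leading `/` to the pattern is one way to tighten it if that matters for the target site.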
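What follow does
For example, this is the rule I used to crawl new books on Douban: rules = (Rule(LinkExtractor(allow=(r'^https://book.douban.com/subject/[0-9]*/'),), callback='parse_item', follow=False), ). With this rule, only the links on the start pages (start_urls) that match the rule are crawled. If I change follow to True, the spider also looks for matching URLs in every page it crawls, and repeats this until no new matching links turn up, which in practice means every matching page reachable on the site gets crawled. If follow is not given, it defaults to True when the rule has no callback and to False otherwise.
CrawlSpider implements the parse method itself, so a CrawlSpider subclass should not override parse. Responses to the automatically created requests, whether or not the rule has a callback, all pass through the same internal _parse_response method, which calls the callback if there is one and extracts further links if follow is true.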
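The per-response decision described above can be sketched as a small function. This is a simplified stand-in, not Scrapy's actual `_parse_response`: `handle_response`, `resolve_follow`, `page_links` and the dict-based page are hypothetical names used only for illustration.

```python
def resolve_follow(callback, follow=None):
    """Apply Rule's default: follow links only when no callback is set."""
    return follow if follow is not None else (callback is None)


def handle_response(page, callback=None, follow=None, extract_links=None):
    """Return (items, next_urls) for one downloaded page."""
    # Step 1: run the callback, if any, to produce scraped items.
    items = list(callback(page)) if callback else []
    # Step 2: extract further links only when the rule says to follow.
    next_urls = []
    if resolve_follow(callback, follow) and extract_links:
        next_urls = extract_links(page)
    return items, next_urls


# A made-up page, loosely modelled on a paginated index.
page = {"url": "/page/2/", "links": ["/page/1/", "/page/3/", "/tag/love/"]}


def parse_item(page):
    yield {"url": page["url"]}


def page_links(page):
    # Stand-in for a link extractor with allow=r'/page/'.
    return [url for url in page["links"] if "/page/" in url]


print(handle_response(page, parse_item, follow=False, extract_links=page_links))
print(handle_response(page, parse_item, follow=True, extract_links=page_links))
```

With follow=False the page still yields its item but contributes no new URLs, so the crawl stops one level below the start page. With follow=True the same page also feeds `/page/1/` and `/page/3/` back into the queue.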

Example

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ToscrapeRuleSpider(CrawlSpider):
    name = 'toscrape-rule'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'rule1.json'
    }
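    # rules can be any iterable of Rule objects (a list or a tuple)
    rules = [
        # follow=False (do not follow): extract matching URLs from the start page only, crawl those pages, and parse them with the callback
        # follow=True (follow links): keep looking for matching URLs in every crawled page, repeating until no new matching links are found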
        Rule(LinkExtractor(allow=(r'/page/'), deny=(r'/tag/')), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a/text()').extract()
            }

Result (follow=True): all of the index pages are crawled

2018-07-14 22:36:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 22:36:41 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/3/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/1/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/4/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/5/
2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/6/
2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/7/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/8/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/9/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/10/
2018-07-14 22:36:44 [scrapy.core.engine] INFO: Closing spider (finished)

Result (follow=False): only page 2 is crawled, because /page/2/ is the only matching link extracted from the start page

2018-07-14 22:44:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 22:44:08 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
2018-07-14 22:44:08 [scrapy.core.engine] INFO: Closing spider (finished)
