Using Scrapy rules

2018-07-14  seven1010
# SgmlLinkExtractor is deprecated; LinkExtractor (from scrapy.linkextractors) accepts the same arguments used here
rules = (
    Rule(LinkExtractor(allow=(r'category/20/index_\d+\.html',), restrict_xpaths=("//div[@class='left']",))),
    Rule(LinkExtractor(allow=(r'a/\d+/\d+\.html',), restrict_xpaths=("//div[@class='left']",)), callback='parse_item'),
)

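Explanation:

The first rule has no callback, so matching list pages (category/20/index_N.html) are only followed to discover more links. The second rule sends matching article pages (a/N/N.html) to parse_item; because it sets a callback, follow defaults to False for that rule. Both rules only look for links inside //div[@class='left'].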
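To make the two allow patterns concrete, here is a minimal sketch using only the standard re module. It assumes, as a simplification, that a link is matched by searching the pattern anywhere in the URL and that the first matching rule wins. The classify helper and the example.com URLs are made up for illustration, and the restrict_xpaths filtering is not modeled.

```python
import re

# Patterns copied from the rules above; everything else here is hypothetical.
LIST_PAGE = re.compile(r'category/20/index_\d+\.html')
ITEM_PAGE = re.compile(r'a/\d+/\d+\.html')


def classify(url):
    """Return which rule's allow pattern the URL would match, if any."""
    # Rules are checked in definition order, so the list-page rule comes first.
    if LIST_PAGE.search(url):
        return 'list page (followed, no callback)'
    if ITEM_PAGE.search(url):
        return 'item page (sent to parse_item)'
    return 'ignored'


sample_urls = [
    'http://example.com/category/20/index_2.html',
    'http://example.com/a/2018/123.html',
    'http://example.com/about.html',
]
for url in sample_urls:
    print(url, '->', classify(url))
```

Because the patterns are searched rather than anchored, any URL that contains one of them anywhere in its path would match, which is worth keeping in mind when writing allow expressions.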

Example

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ToscrapeRuleSpider(CrawlSpider):
    name = 'toscrape-rule'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'rule1.json'
    }
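    # A list of Rule objects (a tuple also works)
    rules = [
        # follow=False (do not follow): only extract matching URLs from the start page, crawl those pages, and parse them with the callback
        # follow=True (follow links): keep looking for matching URLs on every crawled page, repeating until the whole site has been crawled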
        Rule(LinkExtractor(allow=(r'/page/'), deny=(r'/tag/')), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a/text()').extract()
            }

Log output with follow=True:

2018-07-14 22:36:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 22:36:41 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/3/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/1/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/4/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/5/
2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/6/
2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/7/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/8/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/9/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/10/
2018-07-14 22:36:44 [scrapy.core.engine] INFO: Closing spider (finished)

Log output with follow=False:

2018-07-14 22:44:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 22:44:08 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
2018-07-14 22:44:08 [scrapy.core.engine] INFO: Closing spider (finished)
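
The difference between the two runs can be reproduced without Scrapy. The sketch below is a simplified model, not Scrapy's implementation: it walks a tiny hand-written link graph (the SITE dictionary is hypothetical and far smaller than the real site), applies the same allow and deny patterns as the spider, and either does or does not extract links from pages reached through the rule. Links on the start page are always extracted.

```python
import re
from collections import deque

# Toy link graph standing in for the site: page -> links found on that page.
# Hypothetical and heavily simplified; the real pages contain many more links.
SITE = {
    '/': ['/page/2/', '/tag/love/page/1/'],
    '/page/2/': ['/page/1/', '/page/3/'],
    '/page/1/': ['/page/2/'],
    '/page/3/': ['/page/2/'],
    '/tag/love/page/1/': [],
}
ALLOW = re.compile(r'/page/')
DENY = re.compile(r'/tag/')


def crawl(start, follow):
    """Return the pages handed to the rule's callback, in breadth-first order."""
    seen = {start}                    # stands in for duplicate filtering
    queue = deque([(start, False)])   # (url, reached_through_rule)
    parsed = []
    while queue:
        url, from_rule = queue.popleft()
        if from_rule:
            parsed.append(url)        # the callback runs on this page
            if not follow:
                continue              # follow=False: no link extraction here
        for link in SITE.get(url, []):
            if link in seen:
                continue
            # A link qualifies only if it matches allow and does not match deny.
            if ALLOW.search(link) and not DENY.search(link):
                seen.add(link)
                queue.append((link, True))
    return parsed


print(crawl('/', follow=True))
print(crawl('/', follow=False))
```

With this toy graph, follow=True reaches every /page/ URL, while follow=False stops after /page/2/, the only matching link on the start page. That mirrors the two logs above. The /tag/love/page/1/ link is skipped in both cases because it matches the deny pattern even though it also matches allow.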