Scrapy Basics Notes 2: A First Small Example

2019-02-27 BigBigTang

1. In the spider file quotes.py, write the following code in the parse method:

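    # Assumes quotes.py already has `import scrapy` and imports QuoteItem
    # from the project's items module.
    def parse(self, response):
        # Each quote on the page sits in an element with class "quote".
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            # extract_first() returns the first match, or None if there is none.
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            # extract() returns every match as a list.
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # Relative href of the "next" link; None on the last page.
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        print('page: {}'.format(next_page))
        # Turn the relative href into an absolute URL.
        url = response.urljoin(next_page)
        yield scrapy.Request(url=url, callback=self.parse)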

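Here is what the code in parse does:
1. Select the outer box of every quote, i.e. each element with class "quote".
2. Extract the content of each box (text, author, tags), store it in an item of type QuoteItem(), and return it with yield.
3. Get the URL of the next page and join it into a complete URL.
4. Yield a scrapy.Request with parse itself as the callback. This works much like recursion, so steps 1-3 repeat for every page. On the last page next_page is None, and response.urljoin(None) resolves to the URL of the current page. Scrapy checks whether a URL has already been crawled and skips it if so, so this final scrapy.Request is dropped, parse is not called again, and the loop ends.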
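
The way the loop in step 4 ends can be imitated without Scrapy. The following is only a rough sketch using the standard library: the PAGES dictionary, the example.com URLs and the crawl function are made up for illustration, and the seen set merely stands in for Scrapy's duplicate filter rather than reproducing how it is implemented.

```python
from urllib.parse import urljoin

# Hypothetical three-page site: each URL maps to the href of its "next" link,
# or None on the last page (what extract_first() returns when nothing matches).
PAGES = {
    "http://example.com/page/1/": "/page/2/",
    "http://example.com/page/2/": "/page/3/",
    "http://example.com/page/3/": None,
}


def crawl(start_url):
    """Follow "next" links, skipping any URL that was already requested."""
    seen = set()           # stands in for the duplicate filter
    visited = []           # pages in the order they were "parsed"
    pending = [start_url]  # requests waiting to be processed
    while pending:
        url = pending.pop()
        if url in seen:
            continue       # duplicate request: dropped, callback never runs
        seen.add(url)
        visited.append(url)
        next_page = PAGES[url]
        # Mirrors parse(): with no "next" link the join falls back to the
        # current URL, so the request that follows is a duplicate.
        pending.append(urljoin(url, next_page) if next_page else url)
    return visited


if __name__ == "__main__":
    for page in crawl("http://example.com/page/1/"):
        print(page)
```

In a real spider, a more explicit alternative is to wrap the last two lines of parse in `if next_page is not None:`, so that no extra request is created on the last page at all.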
