
Python Learning, Day 89: Using the CrawlSpider Template

2019-05-21  暖A暖

1. Spider Template

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CsdnSpider(CrawlSpider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['https://www.csdn.net/']

    # Each Rule describes which links to extract and what to do with them
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Fill in and return an item for every page matched by the rule
        item = {}
        return item
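A skeleton like this does not have to be written by hand: Scrapy ships with a built-in crawl template, and a spider of this shape can be generated from the command line (the exact output may differ slightly between Scrapy versions):

scrapy genspider -t crawl csdn www.csdn.net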

2. The CrawlSpider Class

CrawlSpider is a subclass of scrapy.Spider designed for crawling whole sites by following links. On top of the usual Spider attributes it adds a rules attribute: a set of Rule objects that describe which links to extract from each response and how to handle the pages they lead to. Note that CrawlSpider uses the parse method internally to drive these rules, so your own callbacks must use a different name (for example parse_item).

3. The rules List

rules is a tuple (or list) of Rule objects. Each Rule takes a LinkExtractor that decides which links to pull out of a response, an optional callback that processes the downloaded pages, and a follow flag that controls whether links are extracted again from those pages. If no callback is given, follow defaults to True; if a callback is given, follow defaults to False. A minimal sketch with two rules follows.
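A minimal sketch of a rules definition; the URL patterns here are made up for illustration, not taken from a real site:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # Hypothetical list pages: follow their links but do not parse them (no callback)
    Rule(LinkExtractor(allow=r'/list/\d+'), follow=True),
    # Hypothetical article pages: parse with parse_item and keep following links found on them
    Rule(LinkExtractor(allow=r'/article/\d+'), callback='parse_item', follow=True),
)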

4. LinkExtractors

A LinkExtractor extracts links from responses according to a set of filters. Its main parameters are:

allow: regular expression(s) a URL must match to be extracted (everything matches if empty)
deny: regular expression(s) that exclude a URL even if it matches allow
allow_domains: domains the extracted links must belong to
deny_domains: domains whose links are never extracted
restrict_xpaths: XPath expressions limiting the regions of the page from which links are taken
restrict_css: CSS selectors serving the same purpose as restrict_xpaths

A short usage sketch is given below.
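A minimal sketch of a LinkExtractor used on its own; the URL patterns and the CSS selector are assumptions for illustration, not the structure of a real page:

from scrapy.linkextractors import LinkExtractor

# Extract only article links found inside the main content area
link_extractor = LinkExtractor(
    allow=r'/article/details/\d+',    # hypothetical URL pattern the link must match
    deny=r'/login',                   # never extract login links
    allow_domains=['blog.csdn.net'],  # stay on the blog domain
    restrict_css='main',              # only look for links inside <main>
)
# Inside a spider callback: links = link_extractor.extract_links(response)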

5. Crawling CSDN Articles and Extracting the URL and Article Title

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CsdnSpider(CrawlSpider):
    name = 'csdn'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net']

    # Rules that specify which links to extract
    rules = (
        # follow=True: after crawling a page, keep extracting links from it and crawl on
        Rule(LinkExtractor(allow=r'.*/article/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('-' * 100)
        print(response.url)
        # The article title is the first <h1> on the page; get() returns None if it is missing
        title = response.css('h1::text').get()
        print(title)
        print('-' * 100)
        return {'url': response.url, 'title': title}
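To try it out, save the spider inside a Scrapy project and run it by name; the crawl prints the URL and title of every article page it visits:

scrapy crawl csdn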

Reference: https://www.9xkd.com/user/plan-view.html?id=3716132715
