The Scrapy Framework

2018-11-01  梦亦殇灬

Before diving in, here is an overview image:

[Figure: Scrapy mind map]

1. Introduction

Scrapy is organized into five main modules: the Engine, the Scheduler, the Downloader, the Spiders, and the Item Pipelines (with downloader and spider middlewares sitting between them).

2. Creating a Project

1. Basic spider:

scrapy startproject project_name
cd project_name
scrapy genspider baidu baidu.com

2. Generic crawler (CrawlSpider):

scrapy startproject project_name
cd project_name
scrapy genspider -t crawl baidu baidu.com

3. Usage

import scrapy

# import the item class; the package name matches your project name
from bdlyproject.items import BdlyprojectItem


class BdlySpider(scrapy.Spider):
    name = 'bdly'  # spider name
    allowed_domains = ['lvyou.baidu.com']  # allowed domains
    start_urls = ['https://lvyou.baidu.com/scene/t-all_s-all_a-all_l-all']  # start URL

    def parse(self, response):
        con_url_list = response.xpath('//ul[@class="filter-result"]/li')
        for con_url in con_url_list:
            ful_url = 'https://lvyou.baidu.com' + con_url.xpath('.//div[@class="img-wrap"]/a/@href').extract_first()
            # request the detail page and pass the response to the next callback
            yield scrapy.Request(url=ful_url, callback=self.con_parse)

    def con_parse(self, response):
        con_item = BdlyprojectItem()
        con_item['title'] = response.xpath('//span[@class="main-name clearfix"]/a/text()').extract_first()
        con_item['pingfen'] = ''.join(response.xpath('//div[@class="main-score"]/text()').extract()).replace('\n', '')
        con_item['intor'] = response.xpath('//div[@class="main-desc"]/p/text()').extract_first().replace(' ', '')
        con_item['pinglun'] = response.xpath('//div[@class="main-score"]//a/text()').extract_first()
        con_item['jianyi'] = response.xpath(
            '//div[@class="main-intro"]/span[@class="main-besttime"]/span/text()').extract_first()
        con_item['days'] = response.xpath(
            '//div[@class="main-intro"]/span[@class="main-dcnt"]/span/text()').extract_first()
        yield con_item  # handed off to pipelines.py

You must use `yield` here: it turns the function into a generator, which Scrapy iterates over to collect the requests and items it produces.
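The generator behavior is plain Python, not Scrapy magic. A minimal sketch without Scrapy, using a fake response, shows that none of the function body runs until the engine iterates over the result:

```python
def parse(response):
    # nothing below executes until the caller iterates over the generator
    for li in response["items"]:
        yield {"title": li}

fake_response = {"items": ["place-a", "place-b"]}

gen = parse(fake_response)  # no work done yet: this is just a generator object
items = list(gen)           # iterating is what drives the function body
print(items)  # [{'title': 'place-a'}, {'title': 'place-b'}]
```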

import scrapy


class BdlyprojectItem(scrapy.Item):
    # define the fields for your item here, one scrapy.Field() per field
    title = scrapy.Field()
    pingfen = scrapy.Field()
    intor = scrapy.Field()
    pinglun = scrapy.Field()
    jianyi = scrapy.Field()
    days = scrapy.Field()

The Item class plays a role similar to a model: it declares the fields your data is stored in. Import the class in the spider file and assign values to it.
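The point of declaring fields is that a `scrapy.Item` behaves like a dict but rejects any key that was not declared, which catches typos early. A toy stand-in class (not Scrapy's actual implementation) illustrates the idea:

```python
class FieldDict(dict):
    """Toy stand-in for scrapy.Item: a dict that only accepts declared fields."""
    fields = ("title", "pingfen")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        super().__setitem__(key, value)


item = FieldDict()
item["title"] = "some place"   # declared field: accepted
try:
    item["rating"] = 4.5       # undeclared field: rejected
except KeyError as e:
    print("rejected:", e)
```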

class BdlyprojectPipeline(object):
    def process_item(self, item, spider):
        # persist each item, e.g. by appending it to a file
        with open('bdly_data.txt', 'a', encoding='utf-8') as f:
            f.write(str(dict(item)) + '\n')
        return item  # pass the item on to any later pipelines

`item` is the parsed data returned by the spider file.
This is where you can filter items and persist them.
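Filtering in a pipeline means raising an exception to discard an item; in Scrapy that exception is `scrapy.exceptions.DropItem`. A sketch of the idea (`DropItem` is defined locally here so the snippet runs without Scrapy installed):

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class FilterPipeline:
    def process_item(self, item, spider):
        # discard items that are missing a title
        if not item.get("title"):
            raise DropItem("missing title")
        return item


pipeline = FilterPipeline()
good = pipeline.process_item({"title": "ok"}, spider=None)

try:
    pipeline.process_item({"title": ""}, spider=None)
except DropItem:
    print("item dropped")
```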

from scrapy import signals


class BdlyprojectSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # Scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

The middleware layer is where you can add request headers, intercept requests and responses, and so on.
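For example, a downloader middleware's `process_request` hook runs before each request goes out, so it can set headers there. A sketch using a stand-in request object (in a real project this class lives in middlewares.py and `request` is a `scrapy.Request`; the User-Agent string here is just an example):

```python
class UserAgentMiddleware:
    """Sketch: set a User-Agent header on every outgoing request."""
    UA = "Mozilla/5.0 (compatible; my-spider)"

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.UA
        return None  # None tells Scrapy to continue processing this request


# quick check with a stand-in request object
class FakeRequest:
    def __init__(self):
        self.headers = {}


req = FakeRequest()
UserAgentMiddleware().process_request(req, spider=None)
print(req.headers["User-Agent"])  # Mozilla/5.0 (compatible; my-spider)
```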

Keep at it, and good luck!
