Using scrapy and scrapy-redis (Part 1): Basics

2018-07-11, by 蜡笔小姜和畅畅


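The scrapy crawler framework

This post covers the basic use of the main scrapy components, namely the Spider, the Scheduler, the Downloader, and the Pipeline (the item data channel), as well as the basic use of scrapy-redis.

Details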

scrapy
- Spider
  Using XPath
    # HtmlXPathSelector is deprecated in later Scrapy releases; response.xpath(...) is the current equivalent
    hxs = HtmlXPathSelector(response=response)
    pages = hxs.xpath('//div[@id="page-area"]//a[@class="ct_pagepa"]/@href').extract()
  Yielding the next URL to the scheduler
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        a = soup.find(name='a', attrs={'class': 'p_n_p_prefix'})  # find the <a> tag
        pattern = re.compile(r'\d+')  # matches every run of digits
        post_id = pattern.findall(a.get('href'))[0]  # the first run is the post id
        # build the URL of the next page
        next_url = 'http://www.cnblogs.com/post/prevnext?postId={0}&blogId=133379&dateCreated=2018%2F5%2F23+20%3A28%3A00&postType=1'.format(post_id)
        yield Request(url=next_url, callback=self.parse)
- Scheduler
  Entry point of the scheduler: from_crawler builds it from the settings
    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings  # project settings
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])  # duplicate filter
        dupefilter = dupefilter_cls.from_settings(settings)
        pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])  # priority queue
        dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])  # disk-backed queue
        mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])  # in-memory queue
        # whether to log requests that cannot be serialized
        logunser = settings.getbool('LOG_UNSERIALIZABLE_REQUESTS', settings.getbool('SCHEDULER_DEBUG'))
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser, stats=crawler.stats, pqclass=pqclass, dqclass=dqclass, mqclass=mqclass)
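- Downloader
  Downloader middleware
- Pipeline
  Data persistence
    def process_item(self, item, spider):  # called for every item
    def open_spider(self, spider):  # called when the spider starts
    def close_spider(self, spider):  # called when the spider finishes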
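
The three pipeline hooks listed above can be sketched as a small class. This is only a minimal illustration, not code from the project: the class name `JsonLinesPipeline` and the output file `items.jl` are hypothetical, and it assumes items can be converted with `dict(item)`.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that stores every item as one JSON line."""

    def __init__(self, path="items.jl"):
        # "items.jl" is an assumed default output path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: acquire resources here.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item; return the item so later pipelines receive it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources here.
        self.file.close()
```

To try it in a project, the class would be registered in ITEM_PIPELINES under whatever module path it lives in, for example a hypothetical 'myproject.pipelines.JsonLinesPipeline': 300.
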
scrapy-redis
- Deduplication
  DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
- Setting the start URLs
  import redis
  conn = redis.Redis(host='127.0.0.1', port=6379)
  # push the start URL; the key for start URLs is chouti:start_urls
  conn.lpush("chouti:start_urls", 'https://dig.chouti.com')
  # clear the current redis database
  conn.flushdb()
- Data persistence
  Enable pipelines
  ITEM_PIPELINES = { 'myproject.pipelines.PricePipeline': 300, }
  Write your own item pipeline
  process_item(self, item, spider)
- Scheduler
  SCHEDULER = "scrapy_redis.scheduler.Scheduler"
  SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue' # the priority queue is the default; options: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
  SCHEDULER_QUEUE_KEY = '%(spider)s:requests' # redis key under which the scheduler stores requests
  SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat" # serializer for data saved to redis; pickle by default
  SCHEDULER_PERSIST = True # whether to keep the request queue and dedup records on close; True = keep, False = clear
  SCHEDULER_FLUSH_ON_START = False # whether to clear the request queue and dedup records before starting; True = clear, False = keep
  SCHEDULER_IDLE_BEFORE_CLOSE = 10 # maximum time to wait when fetching from an empty queue (gives up if nothing arrives)
  SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter' # redis key under which the dedup records are stored
  SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter' # class that implements the dedup rule
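
To make the key templates and the three queue types above more concrete, here is a rough sketch that uses only the standard library as a stand-in for redis. It is not scrapy-redis code: `expand_key` and `pop_order` are hypothetical helpers, and the priority case assumes that ties are served in insertion order, which a real redis sorted set does not guarantee.

```python
import heapq
from collections import deque


def expand_key(template, spider_name):
    """Fill the %(spider)s placeholder used by the *_KEY settings."""
    return template % {"spider": spider_name}


def pop_order(kind, requests):
    """Return URLs in the order a queue of the given kind would hand them out.

    `requests` is a list of (url, priority) pairs, pushed in list order.
    """
    if kind == "fifo":
        queue = deque()
        for url, _ in requests:
            queue.appendleft(url)  # push on the left
        return [queue.pop() for _ in range(len(queue))]  # pop from the right
    if kind == "lifo":
        queue = deque()
        for url, _ in requests:
            queue.appendleft(url)  # push on the left
        return [queue.popleft() for _ in range(len(queue))]  # pop from the left
    if kind == "priority":
        heap = []
        for order, (url, priority) in enumerate(requests):
            # negate the priority so that higher priorities are popped first
            heapq.heappush(heap, (-priority, order, url))
        return [heapq.heappop(heap)[2] for _ in range(len(heap))]
    raise ValueError("unknown queue kind: %r" % kind)
```

For a spider named chouti, expand_key('%(spider)s:requests', 'chouti') yields the key chouti:requests, which matches the naming of the chouti:start_urls key used earlier.
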

Project repository

lll-scrapy
