Web Crawling

Scrapy AutoThrottle

2020-03-06  cdz620

Scrapy offers two main ways to limit the request rate.

Combine DOWNLOAD_DELAY with either CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP

RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_DELAY = 0.75
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0

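These settings cap the number of requests: DOWNLOAD_DELAY spaces out consecutive requests to the same site, and the concurrency setting limits how many are in flight at once. A value of 0 for CONCURRENT_REQUESTS_PER_IP disables the per-IP limit, so the per-domain limit applies.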
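As a rough illustration of what RANDOMIZE_DOWNLOAD_DELAY does to the wait time, here is a minimal sketch. It assumes the documented behavior that the wait is drawn from between 0.5 and 1.5 times DOWNLOAD_DELAY; the function names `randomized_delay` and `approx_requests_per_second` are hypothetical helpers for this post and are not part of Scrapy.

```python
import random


def randomized_delay(download_delay, rng=random):
    """Draw a wait time uniformly from 0.5x to 1.5x of download_delay.

    Hypothetical helper: mirrors the documented RANDOMIZE_DOWNLOAD_DELAY
    range, not Scrapy's internal code.
    """
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


def approx_requests_per_second(download_delay):
    """Rough per-site request rate when the delay is the bottleneck.

    The randomized wait averages out to download_delay, so the mean rate
    is about 1 / download_delay.
    """
    if download_delay <= 0:
        return float("inf")
    return 1.0 / download_delay


if __name__ == "__main__":
    delay = 0.75  # DOWNLOAD_DELAY from the settings above
    print("wait range: %.3f to %.3f s" % (0.5 * delay, 1.5 * delay))
    print("approx rate: %.2f req/s" % approx_requests_per_second(delay))
```

With DOWNLOAD_DELAY = 0.75 the wait falls between 0.375 and 1.125 seconds, which works out to roughly 1.3 requests per second against a single site.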

Use the AutoThrottle extension

The full algorithm is in scrapy.extensions.throttle, in the _adjust_delay method.

DOWNLOAD_DELAY = 12
CONCURRENT_REQUESTS_PER_IP = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = True
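  1. At startup the delay is AUTOTHROTTLE_START_DELAY, raised to DOWNLOAD_DELAY if that is larger, so previous_delay = max(DOWNLOAD_DELAY, AUTOTHROTTLE_START_DELAY). With the settings above the crawl therefore starts at 12 seconds, not 1.0.
  2. Once a response arrives, its download time is the latency. The target delay is target_delay = latency / AUTOTHROTTLE_TARGET_CONCURRENCY.
  3. The next delay is the average of the previous delay and the target delay, but never less than the target delay: next_delay = max(target_delay, (previous_delay + target_delay) / 2).
  4. Responses with a status other than 200 never lower the delay; they can only raise it.
  5. The delay is never smaller than DOWNLOAD_DELAY or larger than AUTOTHROTTLE_MAX_DELAY.
  6. AutoThrottle estimates how fast the server can respond and adapts the delay to that estimate. It does not conflict with DOWNLOAD_DELAY: DOWNLOAD_DELAY is respected as the lower bound of the delay rather than added on top of it.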
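The adjustment policy can be written as a small pure function. This is a minimal sketch modeled on the _adjust_delay method, not Scrapy's actual code: the function name `next_delay` and its parameter names are hypothetical, and the defaults simply mirror the settings shown in this post. It assumes the policy averages the previous delay with the target delay, never goes below the target delay, and clamps the result between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.

```python
def next_delay(previous_delay, latency, status,
               target_concurrency=1.0, min_delay=12.0, max_delay=60.0):
    """Return the delay to use after a response (hypothetical helper).

    min_delay stands in for DOWNLOAD_DELAY, max_delay for
    AUTOTHROTTLE_MAX_DELAY, target_concurrency for
    AUTOTHROTTLE_TARGET_CONCURRENCY.
    """
    # Sending one request every latency / N seconds keeps about N
    # requests in flight against the server.
    target_delay = latency / target_concurrency

    # Move halfway from the previous delay toward the target.
    new_delay = (previous_delay + target_delay) / 2.0

    # If the server is slower than the average suggests, jump straight
    # to the target delay.
    new_delay = max(target_delay, new_delay)

    # Clamp to [min_delay, max_delay].
    new_delay = min(max(min_delay, new_delay), max_delay)

    # Error pages and redirects tend to be small and fast, so a non-200
    # response is not allowed to lower the delay.
    if status != 200 and new_delay <= previous_delay:
        return previous_delay

    return new_delay


if __name__ == "__main__":
    print(next_delay(12.0, 2.0, 200))    # 12.0 (held at the DOWNLOAD_DELAY floor)
    print(next_delay(12.0, 30.0, 200))   # 30.0 (slow server, jump to target)
    print(next_delay(30.0, 2.0, 200))    # 16.0 (average of 30 and 2)
    print(next_delay(30.0, 2.0, 404))    # 30.0 (non-200 cannot lower the delay)
    print(next_delay(12.0, 100.0, 200))  # 60.0 (capped at AUTOTHROTTLE_MAX_DELAY)
```

Note that with DOWNLOAD_DELAY = 12, any server that answers in under 12 seconds keeps the delay pinned at the floor, so AutoThrottle only takes effect when the server slows down.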