Using scrapy-splash and Setting a Proxy IP
2017-08-23
sunoath
First, let's walk through how to use scrapy-splash:
1. Install it: $ pip install scrapy-splash
2. Start Splash in Docker (a quick way to verify it is running follows this list): $ docker run -p 8050:8050 scrapinghub/splash
3. Configure the following in settings.py:
3.1. SPLASH_URL = 'http://192.168.59.103:8050'
3.2. DOWNLOADER_MIDDLEWARES = {
         'scrapy_splash.SplashCookiesMiddleware': 723,
         'scrapy_splash.SplashMiddleware': 725,
         'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
     }
3.3. SPIDER_MIDDLEWARES = {
         'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
     }
3.4. DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
3.5. HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
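Before going further, it's worth checking that the Splash container from step 2 is actually up: it serves a small web UI at http://localhost:8050, and its render.html endpoint returns the rendered HTML of any page. Below is a quick sanity check with the requests library; the target URL and wait value are just illustrative, and it assumes Docker publishes port 8050 on localhost (with older boot2docker setups, use the VM address shown in 3.1 instead):

import requests

# Ask Splash to render a page and hand back the resulting HTML
resp = requests.get('http://localhost:8050/render.html',
                    params={'url': 'https://example.com', 'wait': 0.5})
print(resp.status_code)  # 200 means Splash rendered the page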
With the above, scrapy-splash is fully configured; next, let's see how to use it.
As an example, let's scrape a product page from JD.com:
spider.py
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest

class TaoBaoSpider(CrawlSpider):
    name = 'taobao_spider'
    start_urls = ['https://item.jd.com/4736647.html?cpdad=1DLSUE']

    def start_requests(self):
        # Route requests through Splash so the page's JavaScript gets rendered
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        # The price span is filled in by JavaScript, so it only exists in the rendered page
        price = response.xpath('//span[@class="price J-p-4736647"]/text()').extract()[0]
        print(price)
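Run it from the project directory:
$ scrapy crawl taobao_spider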
Running the spider prints the product price.
Now we need to add a proxy middleware to our Scrapy project:
middlewares.py
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Set the proxy on the Splash args rather than on the Scrapy request itself
        request.meta['splash']['args']['proxy'] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
- Note that the proxy is no longer set with
request.meta['proxy'] = proxyServer
but with request.meta['splash']['args']['proxy'] = proxyServer, so that Splash itself routes its outgoing requests through the proxy (a complete, runnable sketch of the middleware follows).
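For reference, here is a minimal self-contained version of middlewares.py. The proxy host, port, username, and password below are placeholders, not real values; substitute whatever your proxy provider gives you. proxyAuth is a standard HTTP Basic auth header value:

import base64

# Placeholder proxy endpoint and credentials -- substitute your provider's values
proxyServer = "http://proxy.example.com:9010"
proxyUser = "proxy_user"
proxyPass = "proxy_pass"
proxyAuth = "Basic " + base64.b64encode(
    ("%s:%s" % (proxyUser, proxyPass)).encode()).decode()

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Only touch requests that are going through Splash
        if 'splash' in request.meta:
            request.meta['splash']['args']['proxy'] = proxyServer
            request.headers["Proxy-Authorization"] = proxyAuth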
Next, register the ProxyMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'Spider.middlewares.ProxyMiddleware': 843,  # 'Spider' is this project's package name
}
- The custom middleware's priority value must be higher than (i.e. come after) the scrapy-splash middlewares: Scrapy calls process_request in ascending order of these numbers, so ProxyMiddleware (843) runs after SplashMiddleware (725).
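As an aside, the same effect can also be achieved per request instead of via a middleware: whatever is placed in SplashRequest's args is forwarded to Splash, and Splash's render endpoints accept a proxy argument (credentials, if any, can be embedded in the proxy URL in user:password@host:port form, per the Splash HTTP API). A sketch, reusing the placeholder proxyServer from the middleware example:

# Inside the spider's start_requests(), instead of using ProxyMiddleware:
yield SplashRequest(url=url, callback=self.parse,
                    args={'wait': 0.5, 'proxy': proxyServer})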