crawlspider-zhihu: summary notes
2017-03-08
gogoforit
1) Resolving 500, 423, and 403 errors
Setting request headers in settings.py fixes the 500 errors.
Throttling the download rate fixes the 423 errors.
As for 403 errors: once the proxy-IP middleware is in use, a 403 most likely means that IP has already been banned by the site.
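A minimal settings.py sketch of the first two fixes; the User-Agent string and the delay value below are placeholder assumptions, not the original project's values:

# settings.py
# Send browser-like headers on every request (fixes the 500 responses)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'   # placeholder UA
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}
# Throttle the crawl (fixes the 423 rate-limit responses)
DOWNLOAD_DELAY = 2                    # seconds between requests; an assumption
CONCURRENT_REQUESTS_PER_DOMAIN = 4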
2) allowed_domains
The domain list matters: it defines the range of URLs the spider is allowed to visit. Requests created with dont_filter=True are not subject to this restriction.
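A short illustration of both behaviors (the spider name and URLs are illustrative, not from the original project):

import scrapy

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['zhihu.com']   # off-domain requests are dropped by default

    def parse(self, response):
        # this off-domain request would normally be filtered out, but
        # dont_filter=True lets it through the offsite middleware
        yield scrapy.Request('https://example.com/page',
                             callback=self.parse, dont_filter=True)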
3)异常处理
try:
except Exception as e:
print(e)
4) Check response.status and response.url when debugging failed requests.
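For example (a minimal illustration; by default Scrapy only delivers 2xx responses to callbacks, so error statuses must be whitelisted with handle_httpstatus_list):

import scrapy

class StatusLoggingSpider(scrapy.Spider):
    name = 'status_logging'                    # illustrative name
    handle_httpstatus_list = [403, 423, 500]   # let error responses reach parse

    def parse(self, response):
        # which request was it, and what came back?
        print(response.status, response.url)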
5) Handling exceptions from bad proxy IPs. I don't fully understand the mechanism, but the idea seems to be that the built-in RetryMiddleware only retries the exceptions in its EXCEPTIONS_TO_RETRY list, and in this Scrapy version that list does not cover TunnelError (raised when an HTTPS tunnel through a proxy fails), so the subclass adds it:
from scrapy.core.downloader.handlers.http11 import TunnelError
from scrapy.downloadermiddlewares.retry import RetryMiddleware as BaseRetryMiddleware

class RetryMiddleware(BaseRetryMiddleware):
    def process_exception(self, request, exception, spider):
        # retry on the built-in exception list, and additionally on TunnelError
        if ((isinstance(exception, self.EXCEPTIONS_TO_RETRY)
                or isinstance(exception, TunnelError))
                and 'dont_retry' not in request.meta):
            return self._retry(request, exception, spider)
Configure settings.py as follows:

DOWNLOADER_MIDDLEWARES = {
    # 'zhihu_basic.middlewares.UAMiddleware': 543,
    'zhihu_basic.middlewares.RetryMiddleware': 200,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
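Priority 200 registers the custom middleware, while mapping the built-in retry middleware to None disables it, so each failed request goes through exactly one retry policy rather than being retried twice.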
6) Headers and cookies stored in settings can be used to access pages that require a login:

def make_requests_from_url(self, url):
    # attach the login headers and cookies from settings.py to every start URL
    return scrapy.Request(url, method='GET',
                          headers=self.settings['ZHIHU_HEADER'],
                          cookies=self.settings['ZHIHU_COOKIE'])
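The corresponding settings.py entries might look like the sketch below; the header values and cookie names are placeholders, since the real values come from a logged-in browser session:

# settings.py
ZHIHU_HEADER = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # placeholder
    'Referer': 'https://www.zhihu.com/',
}
ZHIHU_COOKIE = {
    # copy real name/value pairs from your browser after logging in
    'z_c0': '...',
    '_xsrf': '...',
}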