crawlspider-zhihu Summary

2017-03-08  gogoforit

1) Fixing 500, 423, and 403 errors
Setting request headers in settings fixes the 500 errors.
Throttling the download rate fixes the 423 errors.
A 403 error after enabling a proxy-IP middleware usually means that IP has already been banned by the site.
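A minimal settings.py sketch of the first two fixes (the header contents and delay value are my assumptions, not the original project's):

# settings.py -- illustrative values only
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 ...',  # placeholder browser-like UA
}
DOWNLOAD_DELAY = 2             # throttle requests to avoid the 423 responses
# AUTOTHROTTLE_ENABLED = True  # optional: let Scrapy adapt the delay automatically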
2) The allowed_domains list is important: it determines the range of URLs the spider may visit. Requests created with dont_filter=True are exempt from that restriction.
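A hypothetical illustration (spider name and URLs are made up): OffsiteMiddleware drops any request whose host falls outside allowed_domains, unless the request sets dont_filter=True, which also bypasses the duplicate filter.

import scrapy

class ZhihuSpider(scrapy.Spider):  # hypothetical spider, for illustration only
    name = 'zhihu'
    allowed_domains = ['zhihu.com']   # requests outside this domain are dropped
    start_urls = ['https://www.zhihu.com/']

    def parse(self, response):
        # dont_filter=True exempts this request from the offsite check
        # (and from the duplicate filter)
        yield scrapy.Request('http://example.com/page', dont_filter=True)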
3) Exception handling: wrap the fragile step in a try/except so one bad response does not kill the spider.

try:
    ...  # the request / parsing step being guarded
except Exception as e:
    print(e)

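In a real callback the except usually guards one concrete step; a sketch (the spider name, endpoint, and JSON fields are my assumptions):

import json

import scrapy

class AnswerSpider(scrapy.Spider):  # hypothetical, for illustration only
    name = 'answers'
    start_urls = ['https://www.zhihu.com/api/v4/answers']  # made-up endpoint

    def parse(self, response):
        try:
            data = json.loads(response.text)   # the step most likely to fail
        except ValueError as e:
            self.logger.warning('bad JSON at %s: %s', response.url, e)
            return
        yield {'title': data.get('title')}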
4) response.status and response.url: the HTTP status code and the final (post-redirect) URL of each response; both are handy when diagnosing the errors above.
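A fragment meant to live inside a spider class (where to log them is my assumption):

    def parse(self, response):
        # response.status is the HTTP code; response.url is where the
        # request actually ended up after any redirects
        self.logger.info('%d %s', response.status, response.url)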
5) Handling exceptions from bad proxy IPs. I still don't fully understand the mechanism, but presumably TunnelError is raised when the HTTPS CONNECT tunnel through a proxy fails, so retrying the request gives the proxy middleware a chance to rotate to a different IP:

from scrapy.core.downloader.handlers.http11 import TunnelError
# note: on newer Scrapy versions this path is scrapy.downloadermiddlewares.retry
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware as BaseRetryMiddleware


class RetryMiddleware(BaseRetryMiddleware):
    """Built-in retry behaviour, extended to also retry on proxy TunnelError."""

    def process_exception(self, request, exception, spider):
        # retry the usual network exceptions, plus TunnelError from a dead proxy
        if (isinstance(exception, self.EXCEPTIONS_TO_RETRY)
                or isinstance(exception, TunnelError)) \
                and 'dont_retry' not in request.meta:
            return self._retry(request, exception, spider)

Configure settings.py as follows, replacing the built-in RetryMiddleware with the custom one:

DOWNLOADER_MIDDLEWARES = {
    # 'zhihu_basic.middlewares.UAMiddleware': 543,
    'zhihu_basic.middlewares.RetryMiddleware': 200,
    # disable the stock retry middleware so the custom one takes over
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
}

6) The headers and cookies kept in settings can be attached to every request, which is how the spider accesses pages that require a login:

    def make_requests_from_url(self, url):
        # ZHIHU_HEADER / ZHIHU_COOKIE are defined in settings.py;
        # self.settings is available on any running spider
        return scrapy.Request(url, method='GET',
                              headers=self.settings.get('ZHIHU_HEADER'),
                              cookies=self.settings.get('ZHIHU_COOKIE'))
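For completeness, a sketch of what those settings entries might look like (the values are placeholders, and the cookie key is an assumption; a real cookie would be copied from a logged-in browser session):

# settings.py -- placeholder values
ZHIHU_HEADER = {
    'User-Agent': 'Mozilla/5.0 ...',      # browser-like UA
    'Referer': 'https://www.zhihu.com/',
}
ZHIHU_COOKIE = {
    'z_c0': '...',  # assumed auth-cookie name, taken from a logged-in session
}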
   