
Scrapy Crawlers: Integrating Selenium into Scrapy

2017-10-03  youyuge

If you can, please support the official iMooc hands-on course; this blog is only my own summary, for personal use.

In theory, as long as you throttle request rates, Selenium fully simulates a human visitor and will not be blocked by anti-scraping measures. Integrating it into Scrapy makes it far more useful.

I. Background

  1. The official Selenium documentation (English)

  2. The Scrapy v1.0 downloader-middleware documentation (Chinese translation; read the English original if you can)

II. A custom downloader middleware

Version 1: a new browser instance is launched for every single request. Simple, but expensive.

from selenium import webdriver
from scrapy.http import HtmlResponse
import time

class JSPageMiddleware(object):
    # Fetch dynamic pages with Edge instead of Scrapy's downloader
    def process_request(self, request, spider):
        if spider.name == "jobbole":
            browser = webdriver.Edge(
                executable_path='F:/PythonProjects/Scrapy_Job/JobSpider/tools/MicrosoftWebDriver.exe')
            browser.get(request.url)
            time.sleep(3)
            print("Visited: {0}".format(request.url))

            # Return the response straight to the spider; the downloader is skipped
            return HtmlResponse(url=browser.current_url, body=browser.page_source,
                                encoding="utf-8", request=request)
Version 2: the browser is created once in the middleware's __init__ and reused for every request.

from selenium import webdriver
from scrapy.http import HtmlResponse
import time

class JSPageMiddleware(object):

    def __init__(self):
        self.browser = webdriver.Edge(
            executable_path='F:/PythonProjects/Scrapy_Job/JobSpider/tools/MicrosoftWebDriver.exe')
        super(JSPageMiddleware, self).__init__()

    # Fetch dynamic pages with Edge instead of Scrapy's downloader
    def process_request(self, request, spider):
        # Only intercept requests from our target spider
        if spider.name == "jobbole":
            self.browser.get(request.url)
            time.sleep(3)
            print("Visited: {0}".format(request.url))

            # Return the response straight to the spider; the downloader is skipped
            return HtmlResponse(url=self.browser.current_url, body=self.browser.page_source,
                                encoding="utf-8", request=request)
Version 3: the browser instance lives on the spider itself, so there is one driver per spider and it can be shut down when the spider closes. The middleware just uses whatever browser the spider carries:

from scrapy.http import HtmlResponse
import time

class JSPageMiddleware(object):

    # Fetch dynamic pages with Edge instead of Scrapy's downloader
    def process_request(self, request, spider):
        # Only intercept spiders that carry a browser attribute
        if getattr(spider, 'browser', None):
            spider.browser.get(request.url)
            time.sleep(3)
            print("Visited: {0}".format(request.url))

            # Return the response straight to the spider; the downloader is skipped
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)

The following methods go in the spider class (JobboleSpider): create the browser in __init__ and bind the spider_closed signal so the driver quits together with the spider.

    def __init__(self):
        from selenium import webdriver

        self.browser = webdriver.Edge(
            executable_path='F:/PythonProjects/Scrapy_Job/JobSpider/tools/MicrosoftWebDriver.exe')
        super(JobboleSpider, self).__init__()

        from scrapy.xlib.pydispatch import dispatcher
        from scrapy import signals

        # Bind the signal: call our handler when the spider closes
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        print('spider closed')
        self.browser.quit()
Finally, register the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'JobSpider.middlewares.JSPageMiddleware': 1,
}
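Note that scrapy.xlib.pydispatch was removed in later Scrapy releases; the currently documented way to connect signals is the from_crawler classmethod. A sketch of the same lifecycle wired that way, with the browser owned by the middleware (imports are kept inside methods, like in the spider above, so the class definition stands alone; the driver path is omitted and must match your environment):

```python
class JSPageMiddleware(object):
    """Selenium-backed downloader middleware, signals wired via from_crawler."""

    def __init__(self):
        from selenium import webdriver
        # Launch the driver once for the middleware's lifetime
        self.browser = webdriver.Edge()

    @classmethod
    def from_crawler(cls, crawler):
        from scrapy import signals
        middleware = cls()
        # Connect spider_closed so the browser is shut down with the crawl
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.browser.quit()
```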

III. Drawbacks

We have replaced Scrapy's asynchronous downloads with synchronous browser requests, which greatly reduces performance. To make the Selenium side asynchronous as well, you would need to rewrite Scrapy's downloader, which requires familiarity with the Twisted async framework. There are plenty of plugins on GitHub for this, such as scrapy-phantomjs-downloader.
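One smaller improvement is possible regardless of the sync/async question: the fixed time.sleep(3) in process_request wastes time on fast pages and may be too short on slow ones. Selenium's explicit waits block only until a condition holds. A sketch (imports kept inside the function; waiting on the body tag and the 10-second timeout are my assumptions, not from the original post):

```python
def wait_for_page(browser, timeout=10):
    """Block until the document body is present, instead of sleeping a fixed 3s."""
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    return WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, "body"))
    )
```

In the middleware, `wait_for_page(self.browser)` would then replace the `time.sleep(3)` call.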
