Dashixiong's Python Study Notes (31): Web Crawlers (12)
2020-09-21
superkmi
Previous: Dashixiong's Python Study Notes (30): Web Crawlers (11)
Next: Dashixiong's Python Study Notes (32): Web Crawlers (13)
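XI. The Scrapy Framework
10. Scraping Dynamic Pages with Selenium
- Scrapy can be combined with Selenium, which drives a real browser, to scrape dynamically rendered pages.
- The example below scrapes the search results page of the Huawei online store (Vmall).
1) Define the Item
- Define the fields to scrape and the database collection. Here we scrape only the product name and the price.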
    # items.py
    import scrapy

    class HuaweiMainPageItem(scrapy.Item):
        # Name of the MongoDB collection the pipeline writes to
        collection = 'items'
        title = scrapy.Field()  # product name
        price = scrapy.Field()  # product price
2) A First Version of the Spider
- For now the spider only opens the page that the search starts from.
    # huawei.py
    import scrapy
    from huawei_main_page.items import HuaweiMainPageItem

    class HuaweiSpider(scrapy.Spider):
        name = 'huawei'
        allowed_domains = ['www.vmall.com']
        url = 'https://www.vmall.com/'

        def start_requests(self):
            yield scrapy.Request(url=self.url, callback=self.parse)
- Also set the search keyword in settings.py.
    # settings.py
    # A plain string is enough: the middleware types it into the search box as-is
    KEYWORD = '笔记本'  # "laptop"
3) Connecting Selenium Through a Downloader Middleware
- First, configure the middleware and the Selenium parameters.
    # settings.py
    # ... ...
    SELENIUM_TIMEOUT = 20
    MAX_PAGE = 10
    # ... ...
    DOWNLOADER_MIDDLEWARES = {
        'huawei_main_page.middlewares.HuaweiMainPageDownloaderMiddleware': 543,
    }
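SELENIUM_TIMEOUT is the number of seconds the middleware's explicit waits are allowed to block. As a rough illustration of that idea only (this is not Selenium's implementation), the sketch below polls a condition until it returns a truthy value or the timeout elapses. The names `wait_until` and `WaitTimeout` and the 0.05 s poll interval are hypothetical.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout, poll_interval=0.05):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Usage: a fake "page" whose element only shows up on the third poll.
attempts = {"count": 0}


def element_present():
    attempts["count"] += 1
    return "search-box" if attempts["count"] >= 3 else None


found = wait_until(element_present, timeout=2)
```

A larger timeout tolerates slow pages at the cost of a longer stall when an element never appears, which is why the value is kept in settings.py rather than hard-coded.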
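- Write the middleware: use Selenium to type the search keyword, click through to the results page, and return the rendered page as the response, so that Scrapy's own downloader is bypassed for that request.
- I use the Firefox browser.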
    # middlewares.py
    from logging import getLogger

    from scrapy import signals
    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    class HuaweiMainPageDownloaderMiddleware:
        def __init__(self, kw, timeout=None):
            self.logger = getLogger(__name__)
            self.kw = kw
            self.timeout = timeout
            self.options = Options()
            self.options.add_argument('-headless')  # run Firefox without a window
            # Newer Selenium releases take `options=` instead of the deprecated `firefox_options=`
            self.browser = webdriver.Firefox(options=self.options)
            self.browser.set_window_size(1980, 1200)
            self.browser.set_page_load_timeout(self.timeout)
            self.wait = WebDriverWait(self.browser, self.timeout)

        @classmethod
        def from_crawler(cls, crawler):
            middleware = cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                             kw=crawler.settings.get('KEYWORD'))
            # Connect the handlers so the browser is shut down together with the spider
            crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
            return middleware

        def process_request(self, request, spider):
            self.logger.debug('Firefox is starting')
            try:
                self.browser.get(request.url)
                # Type the keyword into the search box
                search = self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, '.text'))
                )
                search.clear()
                search.send_keys(self.kw)
                button = self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, '.button'))
                )
                button.click()
                # Switch to the most recently opened window, where the results load
                self.browser.switch_to.window(self.browser.window_handles[-1])
                self.wait.until(
                    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.pro-panels'))
                )
                # Returning a response here means Scrapy skips its own download
                return HtmlResponse(url=self.browser.current_url, body=self.browser.page_source,
                                    request=request, encoding='utf-8', status=200)
            except TimeoutException:
                return HtmlResponse(url=request.url, status=500, request=request)

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)

        def spider_closed(self, spider):
            # Quit the browser so no headless Firefox process is left behind
            self.browser.quit()
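When process_request returns a response object, Scrapy does not call its downloader for that request and passes the returned response onward instead. The toy model below imitates only that control flow in plain Python. `FakeRequest`, `FakeResponse`, `BrowserMiddleware`, `PassThroughMiddleware`, `download`, and `fetch` are hypothetical stand-ins, not Scrapy classes.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    url: str


@dataclass
class FakeResponse:
    url: str
    body: str
    source: str  # which component produced the response


class BrowserMiddleware:
    """Stand-in for a Selenium middleware: it renders the page itself."""

    def process_request(self, request):
        return FakeResponse(request.url, "<html>rendered</html>", "middleware")


class PassThroughMiddleware:
    """Returns None, so the request continues to the next step."""

    def process_request(self, request):
        return None


def download(request):
    """Stand-in for the default downloader."""
    return FakeResponse(request.url, "<html>raw</html>", "downloader")


def fetch(request, middlewares):
    """Ask each middleware in order; the first response returned wins."""
    for middleware in middlewares:
        response = middleware.process_request(request)
        if response is not None:
            return response  # the downloader is never called
    return download(request)


request = FakeRequest("https://www.vmall.com/")
rendered = fetch(request, [PassThroughMiddleware(), BrowserMiddleware()])
plain = fetch(request, [PassThroughMiddleware()])
```

In this model `rendered` comes from the middleware and `plain` from the downloader, which mirrors why the spider's callback receives the browser-rendered HTML.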
4) Adding the Parse Method to the Spider
- Parse the page returned by the downloader middleware.
    # huawei.py
    import scrapy
    from huawei_main_page.items import HuaweiMainPageItem

    class HuaweiSpider(scrapy.Spider):
        name = 'huawei'
        allowed_domains = ['www.vmall.com']
        url = 'https://www.vmall.com/'

        def start_requests(self):
            yield scrapy.Request(url=self.url, callback=self.parse)

        def parse(self, response):
            products = response.css('.pro-panels')
            for product in products:
                item = HuaweiMainPageItem()
                item['title'] = product.css('.p-name::text').extract_first()
                item['price'] = product.css('.p-price b::text').extract_first()
                yield item
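extract_first() returns None when a selector matches nothing, and text nodes often carry surrounding whitespace or a currency symbol. Below is a minimal sketch of a cleanup step that could run before an item is yielded, assuming prices look like "¥5,999.00" (the article does not confirm the actual page format). `clean_title` and `parse_price` are hypothetical helper names, and the sample strings are made up.

```python
import re
from typing import Optional


def clean_title(raw: Optional[str]) -> Optional[str]:
    """Collapse runs of whitespace; return None for missing or empty text."""
    if raw is None:
        return None
    title = " ".join(raw.split())
    return title or None


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Pull the first number out of a price string, ignoring thousands separators."""
    if raw is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None


# Made-up sample values, shaped like what a selector might return.
title = clean_title("\n   HUAWEI MateBook 14  \n")
price = parse_price(" ¥5,999.00 ")
```

Storing the price as a number rather than a string makes later sorting and filtering in the database simpler, though whether that is wanted depends on how the data is used.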
5) Saving the Results to a Database Through a Pipeline
- First, configure the database settings in settings.py.
    # settings.py
    ITEM_PIPELINES = {
        'huawei_main_page.pipelines.HuaweiMainPagePipeline': 300,
    }
    MONGO_URI = 'localhost'
    MONGO_DB = 'project_huawei'
- Then implement the storage in pipelines.py.
    # pipelines.py
    import pymongo

    class HuaweiMainPagePipeline:
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                       mongo_db=crawler.settings.get('MONGO_DB'))

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def process_item(self, item, spider):
            # insert_one() replaces insert(), which was removed in PyMongo 4
            self.db[item.collection].insert_one(dict(item))
            return item

        def close_spider(self, spider):
            self.client.close()
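To try the per-collection storage logic without a running MongoDB server, the database handle can be swapped for an in-memory stand-in. This is only a sketch: `FakeCollection`, `FakeDatabase`, and `store_item` are hypothetical names, `insert_one` merely mirrors the name of the real PyMongo method without reproducing its return value, and the sample items are made up.

```python
from collections import defaultdict


class FakeCollection:
    """In-memory stand-in for a MongoDB collection (hypothetical)."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        # Store a copy so later changes to the caller's dict do not leak in
        self.documents.append(dict(document))


class FakeDatabase:
    """Creates collections lazily on first access, like db['name']."""

    def __init__(self):
        self._collections = defaultdict(FakeCollection)

    def __getitem__(self, name):
        return self._collections[name]


def store_item(db, collection, item):
    """Same shape as process_item: insert a plain dict, then return the item."""
    db[collection].insert_one(dict(item))
    return item


db = FakeDatabase()
store_item(db, "items", {"title": "MateBook 14", "price": "5999"})
store_item(db, "items", {"title": "MateBook X Pro", "price": "8999"})
stored = db["items"].documents
```

Because process_item returns the item after storing it, any pipeline with a higher order number would still receive it.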
Author: Dashixiong (superkmi)