[Novel Crawling Series, Part 3] Ghost Blows Out the Light (鬼吹灯)
2020-05-09
松龄学编程
"When a person lights a candle, the ghost blows it out" is said to be an unspoken rule of the Mojin school of tomb raiders: on entering an ancient tomb, light a candle in the southeast corner before opening the coffin; if the candle goes out, retreat at once and take nothing. Legend has it this is a pact between the living and the dead laid down by the school's founder, handed down for a thousand years and never to be broken. Ghost Blows Out the Light was published in 2006, but I only started reading it recently and was immediately drawn in by its characters and plot: sighing over the rise and fall of Chen Yulou, chief of the Xieling faction; envying Zhegu Shao's hair-raising adventures and one-in-ten-thousand skills; enjoying Big Gold Tooth's pragmatism and banter; following the three Mojin partners through ancient tombs and wild country... One twist follows another, and once you start you can't put it down, always wondering what happens next. A quick Baidu search turned up the full text on the cxbz958 site. Hu Bayi, here I come.
Without scrapy, you would have to write your own framework and keep adjusting it until it met every requirement, which is a lot of work. Reinventing the wheel over and over is inefficient; we can study how the wheel works, but we won't rebuild it ourselves. Thanks to open source, and to those who planted the trees before us. But...
My original plan was to continue from the hoho project and keep using scrapy. After a quick try, though, I found that this site has anti-crawling measures that block scrapy. The first tool that comes to mind for getting around anti-crawling is selenium, so this time let's crawl the data with selenium!
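For context, here is a minimal sketch of getting a Selenium driver up (it assumes Firefox with geckodriver on the PATH; in the actual project the driver is created elsewhere and passed into the crawler):
# Minimal sketch: start a Firefox driver and open the listing page.
# Assumes geckodriver is installed and available on the PATH.
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.headless = True                 # no visible browser window
driver = webdriver.Firefox(options=options)
driver.get('http://www.cxbz958.com/guichuideng/')
print(driver.title)
driver.quit()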
Page Analysis
- Open the Firefox web inspector and find the urls to crawl.
- The title and content we want live in the h1 under the div with class "content" and in the div with id "content", respectively, as shown in the snippet below.
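For reference, once a chapter page is loaded, those two elements can be read with the same Selenium 3 API used later in the post (a sketch, assuming driver already points at a chapter page):
# Sketch: pull the chapter title and body out of a loaded chapter page.
title = driver.find_element_by_xpath('//div[@class="content"]/h1').text
content = driver.find_element_by_xpath('//div[@id="content"]').text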



Requirements Analysis
- Save the Ghost Blows Out the Light novel to a txt file
Code Implementation
Manually set up a new project app and add the file guichuideng_crawler.py:
tree
.
├── config.py
├── crawler.py
├── crawler_manager.py
├── crawlerlogger.py
├── guichuideng_crawler.py
├── issue_builder.py
└── util.py
With the project initialized, let's write the crawler:
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from crawler import NovelCrawler
from crawlerlogger import CrawlerLogger
import time


class GuichuidengCrawler(NovelCrawler):
    BASEURL = 'http://www.cxbz958.com/guichuideng/'

    def setupLocator(self):
        # The chapter list lives in a div with class "listmain".
        self.locator = (By.CLASS_NAME, 'listmain')
        self.logger.info(f'locator: {self.locator}')

    def setupURL(self):
        self.urls = [GuichuidengCrawler.BASEURL]

    def setupLogger(self):
        self.logger = CrawlerLogger(__name__).logger

    def parse(self):
        driver = self.driver
        # The second "listmain" div holds the full chapter list.
        titles = driver.find_elements_by_xpath('//div[@class="listmain"]')[1]
        titles = titles.find_elements_by_tag_name('a')
        data = []
        count = len(titles)
        self.logger.info(f"Opened chapter list, length: {count}")
        for index in range(count):
            # Clicking a chapter navigates away, so reload the list page each time.
            self.fetch(GuichuidengCrawler.BASEURL, self.locator)
            titles = driver.find_elements_by_xpath('//div[@class="listmain"]')[1]
            titles = titles.find_elements_by_tag_name('a')
            self.logger.info(f"Opening chapter [{index + 1}]")
            titles[index].click()
            locator = (By.CLASS_NAME, 'content')
            WebDriverWait(driver, 20, 0.5).until(EC.presence_of_element_located(locator))
            self.logger.info("Passed the page-load wait")
            title = driver.find_element_by_xpath('//div[@class="content"]/h1').text
            content = driver.find_element_by_xpath('//div[@id="content"]').text
            data.append({'title': title, 'content': content})
            time.sleep(self.interval)
        return data
Let's break it down:
- Set up the locator to confirm that the page has actually loaded
- Set up the url and the logger
- After getting the chapter count, go from the table-of-contents page to each chapter's detail page in turn and grab the data
- Add a short wait between requests to reduce the load on the server
Now let's see what the parent classes do:
# crawler.py
# Imports needed by the methods below; IssueBuilder is assumed to live in
# issue_builder.py, as listed in the project tree above.
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from issue_builder import IssueBuilder


class Crawler():
    def crawl(self):
        pass

    def fetch(self):
        pass

    def parse(self):
        pass

    def _setup(self):
        pass

    def _teardown(self):
        pass


class NovelCrawler(Crawler):
    def __init__(self, driver):
        # The selenium browser object is injected from outside.
        self.driver = driver

    def crawl(self):
        self._setup()
        for url in self.urls:
            self.fetch(url, self.locator)
            data = self.parse()
            self.build_issue(data)
        self._teardown()

    def fetch(self, url, locator):
        try:
            self.driver.get(url)
            self.logger.info(f"Visited page: {url}")
            self.logger.info(f'Page locator: {locator}')
            WebDriverWait(self.driver, 20, 0.5).until(EC.presence_of_element_located(locator))
            self.logger.info("Passed the page-load wait")
        except TimeoutException as e:
            self.logger.info(f"Failed to visit page: {url}, error: {e}")
        except Exception as e:
            self.logger.info(f"Failed to visit page: {url}, error: {e}")

    def parse(self):
        return []

    def build_issue(self, data):
        # Derive the output file name from the subclass name,
        # e.g. GuichuidengCrawler -> Guichuideng.
        cls_name = self.__class__.__name__
        filename = cls_name.replace('Crawler', '')
        builder = IssueBuilder(data)
        builder.build(filename)

    def _setup(self):
        self.setupLogger()
        self.setupURL()
        self.setupLocator()
        self.setupInterval()

    def _teardown(self):
        pass

    def setupLocator(self):
        self.locator = None

    def setupURL(self):
        self.urls = []

    def setupLogger(self):
        self.logger = None

    def setupInterval(self):
        self.interval = 3
Let's break it down:
- Crawler is the most abstract layer and fixes the structure of a crawler. A crawler consists mainly of crawl, fetch (getting the page content) and parse (parsing the page), plus some setup and teardown operations.
- NovelCrawler is the abstraction layer for novel crawlers and defines their operations. The driver in its constructor is a selenium browser object passed in from outside; see the runner sketch below.
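The post doesn't show how the driver actually gets created and handed to the crawler, so here is a hypothetical runner; the file name and wiring are my assumptions (the real entry point is presumably crawler_manager.py):
# Hypothetical runner sketch -- the project's real crawler_manager.py may differ.
from selenium import webdriver
from guichuideng_crawler import GuichuidengCrawler

def main():
    driver = webdriver.Firefox()   # assumes geckodriver on the PATH
    try:
        crawler = GuichuidengCrawler(driver)
        crawler.crawl()            # fetch, parse and write out the txt file
    finally:
        driver.quit()

if __name__ == '__main__':
    main()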
Some information about the other helper classes:
# -*- coding: utf-8 -*-
# crawlerlogger.py
import logging


class CrawlerLogger():
    def __init__(self, name):
        logger = logging.getLogger(name)
        logger.setLevel(level=logging.INFO)
        # Write the log both to a file and to the console.
        handler = logging.FileHandler('../log/' + name + '.log', 'w', encoding='utf-8')
        handler.setLevel(logging.INFO)
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        console = logging.StreamHandler()
        console.setLevel(logging.INFO)
        logger.handlers.clear()
        logger.addHandler(handler)
        logger.addHandler(console)
        logger.info(f"{name} start print log")
        self.logger = logger


# issue_builder.py
class IssueBuilder:
    BASEDIR = ' path to save /fruits'   # placeholder: replace with your output directory
    EXTENSION = '.txt'

    def __init__(self, data):
        self.data = data

    def build(self, filename):
        file_path = IssueBuilder.BASEDIR + '/' + filename + IssueBuilder.EXTENSION
        with open(file_path, 'a', encoding='utf-8') as f:
            for item in self.data:
                f.write(item.get('title', '') + '\n')
                f.write(item.get('content', '') + '\n\n')
Let's break it down:
- Define CrawlerLogger to make logging convenient
- Rework IssueBuilder: its parameter is now a list of dicts containing title and content; see the usage sketch below
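As a quick, hypothetical usage example of the new interface (the data here is placeholder text, not real chapters):
# Sketch: IssueBuilder now takes a list of dicts with 'title' and 'content'.
data = [
    {'title': 'chapter 1 title', 'content': 'chapter 1 text'},
    {'title': 'chapter 2 title', 'content': 'chapter 2 text'},
]
builder = IssueBuilder(data)
builder.build('Guichuideng')   # appends to BASEDIR/Guichuideng.txt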
Let's take a look at the result:

Writing a crawler may lack the thrills of tomb raiding, but the constant trial and error still has its share of exploration, of hoping for results and despairing at errors, which is not so different from hunting for treasure in a tomb. Hu Bayi, here's to your next adventure!!!