Scraping Novel Chapters with Scrapy and Saving Them as Text Files
2018-12-09
whong736
Target site: Quanshuwang (全书网), http://www.quanshuwang.com/
Goal: save the fantasy/magic novels as text files, scraping the novel name, the chapter names, and the chapter content.
The result looks like this:
Scraped novels (screenshot)
Scraped novel chapters (screenshot)
Scraping approach:
1. Analyze the pages.
2. URL pattern of the listing pages (a small sketch of generating these URLs follows below):
http://www.quanshuwang.com/list/1_1.html  # page 1
http://www.quanshuwang.com/list/1_2.html  # page 2
http://www.quanshuwang.com/list/1_3.html  # page 3
The pattern is:
http://www.quanshuwang.com/list/X_Y.html
X is the novel category: here 1 is fantasy/magic (玄幻魔法) and 2 is wuxia/xiuzhen (武侠修真).
Y is the page number.
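A minimal sketch of building the listing-page URLs from this pattern; the category index and page count below are example values, not taken from the site:

# Sketch: build listing-page URLs from the X_Y pattern (example values).
def list_page_urls(category=1, pages=3):
    return ['http://www.quanshuwang.com/list/%d_%d.html' % (category, page)
            for page in range(1, pages + 1)]

print(list_page_urls())  # the three URLs listed above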
3. Page hierarchy
Level 1: the category listing page
http://www.quanshuwang.com/list/1_1.html
Level 2: a single novel's detail page
http://www.quanshuwang.com/book_167173.html  # novel link
From here, the "read now" button leads to the chapter list page; note that different novel sites link to the chapter list in different ways:
http://www.quanshuwang.com/book/167/167173
Level 3: the chapter list page
http://www.quanshuwang.com/book/167/167173
Level 4: the chapter content page
http://www.quanshuwang.com/book/167/167173/48194124.html
1. Create the project:
scrapy startproject Fiction
cd Fiction
2. Generate the spider file:
scrapy genspider -t basic novel quanshuwang.com
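After these two commands the project layout should look roughly like this (a typical Scrapy layout; the exact files can vary slightly between Scrapy versions):

Fiction/
├── scrapy.cfg
└── Fiction/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── novel.py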
3. Decide which fields to scrape: the novel name, chapter name, and chapter content; other fields can be added later. Start with the item definition:
# -*- coding: utf-8 -*-
import scrapy


class FictionItem(scrapy.Item):
    # define the fields for your item here like:
    # novel name
    name = scrapy.Field()
    # chapter name
    chapter_name = scrapy.Field()
    # chapter content
    chapter_content = scrapy.Field()
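A FictionItem behaves like a dict; a quick illustration of filling one in (not part of the project files, the values are placeholders):

# Quick illustration of FictionItem usage; the values are placeholders.
from Fiction.items import FictionItem

item = FictionItem()
item['name'] = 'Some Novel'
item['chapter_name'] = 'Chapter 1'
item['chapter_content'] = 'Chapter text...'
print(dict(item))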
4. Write the spider file:
# -*- coding: utf-8 -*-
import scrapy
import re
from Fiction.items import FictionItem
from scrapy.http import Request


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['quanshuwang.com']
    start_urls = [
        'http://www.quanshuwang.com/list/1_2.html',
        'http://www.quanshuwang.com/list/1_3.html',
    ]  # pages 2 and 3 of Quanshuwang's fantasy/magic category

    # Get the URL of every book on the listing page
    def parse(self, response):
        book_urls = response.xpath('//li/a[@class="l mr10"]/@href').extract()
        for book_url in book_urls:
            yield Request(book_url, callback=self.parse_read)

    # Get the URL behind the "read now" button and enter the chapter list
    def parse_read(self, response):
        read_url = response.xpath('//a[@class="reader"]/@href').extract()[0]
        yield Request(read_url, callback=self.parse_chapter)

    # Get the URL of every chapter
    def parse_chapter(self, response):
        chapter_urls = response.xpath('//div[@class="clearfix dirconone"]/li/a/@href').extract()
        for chapter_url in chapter_urls:
            yield Request(chapter_url, callback=self.parse_content)

    # Get the novel name, the chapter name and the chapter content
    def parse_content(self, response):
        # novel name
        name = response.xpath('//div[@class="main-index"]/a[@class="article_title"]/text()').extract_first()
        result = response.text
        # chapter name
        chapter_name = response.xpath('//strong[@class="l jieqi_title"]/text()').extract_first()
        # chapter content: the body sits between two script tags, so a regex is used here
        chapter_content_reg = r'style5\(\);</script>(.*?)<script type="text/javascript">'
        chapter_content_2 = re.findall(chapter_content_reg, result, re.S)[0]
        # strip spaces and <br /> tags from the extracted HTML
        chapter_content_1 = chapter_content_2.replace(' ', '')
        chapter_content = chapter_content_1.replace('<br />', '')

        item = FictionItem()
        item['name'] = name
        item['chapter_name'] = chapter_name
        item['chapter_content'] = chapter_content
        yield item
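Before running the full crawl, the XPath selectors can be sanity-checked interactively in the Scrapy shell; the selectors below are the same ones used in the spider:

scrapy shell "http://www.quanshuwang.com/list/1_1.html"
>>> response.xpath('//li/a[@class="l mr10"]/@href').extract()[:5]
>>> fetch('http://www.quanshuwang.com/book_167173.html')
>>> response.xpath('//a[@class="reader"]/@href').extract_first()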
5. Write the pipeline that saves each chapter as a text file:
# -*- coding: utf-8 -*-
import os


class FictionPipeline(object):
    def process_item(self, item, spider):
        # Replace /Users/vincentwen/MyCode/Scrapy with a directory on your own machine
        curPath = '/Users/vincentwen/MyCode/Scrapy'
        tempPath = str(item['name'])
        targetPath = curPath + os.path.sep + tempPath
        # One folder per novel
        if not os.path.exists(targetPath):
            os.makedirs(targetPath)
        # One text file per chapter inside the novel's folder
        filename_path = targetPath + os.path.sep + str(item['chapter_name']) + '.txt'
        with open(filename_path, 'w', encoding='utf-8') as f:
            f.write(item['chapter_content'] + "\n")
        return item
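Chapter names occasionally contain characters that are not valid in file names (for example / or ?). A small defensive helper could be added to the pipeline for this; the safe_filename helper below is an assumption for illustration, not part of the original code:

import re

def safe_filename(name):
    # Replace characters that are unsafe in file names with underscores
    # (a defensive helper, not part of the original pipeline).
    return re.sub(r'[\\/:*?"<>|]', '_', str(name)).strip() or 'untitled'

# usage inside process_item:
# filename_path = targetPath + os.path.sep + safe_filename(item['chapter_name']) + '.txt'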
6. Edit the settings file:
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'Fiction.pipelines.FictionPipeline': 300,
}
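If the site rejects requests sent with the default Scrapy user agent, a browser-like USER_AGENT can also be set in settings.py; this is an optional addition, not part of the original settings, and the string below is just an example:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'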
7. Run the spider to test it. Start with log output enabled; once the crawl works, switch to no-log mode:
scrapy crawl novel
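Once everything works, the same command can be run without log output:

scrapy crawl novel --nolog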
8. Source code:
https://github.com/wzw5566/Fiction