Scraping Novel Chapters with Scrapy and Saving Them as Text

2018-12-09 whong736

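The target is the novel site Quanshuwang (全书网): http://www.quanshuwang.com/
We scrape its fantasy/magic (玄幻魔法) category and save the novels as text files, capturing each novel's title, its chapter titles, and the chapter text.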

The end result: one folder per scraped novel, each holding one text file per chapter.

Approach:

1. Analyze the page

2. URL pattern:

http://www.quanshuwang.com/list/1_1.html  # page 1
http://www.quanshuwang.com/list/1_2.html  # page 2
http://www.quanshuwang.com/list/1_3.html  # page 3

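The pattern is:
http://www.quanshuwang.com/list/X_Y.html
X is the novel category; for example, 1 is fantasy/magic (玄幻魔法) and 2 is martial arts/cultivation (武侠修真).
Y is the page number within that category.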
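
The pattern above can be sketched as a small helper that builds list-page URLs. The name build_list_urls and its parameters are illustrative and not part of the original project; the sketch only assumes that the X_Y.html scheme described above holds for every category.

```python
BASE_URL = 'http://www.quanshuwang.com/list/{category}_{page}.html'


def build_list_urls(category, first_page, last_page):
    """Return list-page URLs for one category, pages first_page..last_page inclusive."""
    return [
        BASE_URL.format(category=category, page=page)
        for page in range(first_page, last_page + 1)
    ]


# Category 1 is fantasy/magic; pages 1 to 3 match the sample URLs above.
for url in build_list_urls(1, 1, 3):
    print(url)
```

A list built this way could be assigned to start_urls in the spider instead of typing each page by hand.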

3. Page hierarchy

Level 1: the category list

http://www.quanshuwang.com/list/1_1.html

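Level 2: an individual novel's page

http://www.quanshuwang.com/book_167173.html  # link to the novel

From there, follow the link into level 3, the chapter list page. Different novel sites reach the chapter list in different ways.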

http://www.quanshuwang.com/book/167/167173

Level 3: the novel's chapter list page

http://www.quanshuwang.com/book/167/167173

Level 4: the chapter content page

http://www.quanshuwang.com/book/167/167173/48194124.html
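
To make the four levels concrete, here is a minimal sketch that classifies a URL by level. The regular expressions are inferred only from the sample URLs in this article, and classify_url and the level names are hypothetical helpers, so the real site may use forms that these patterns miss.

```python
import re

# Patterns inferred from the sample URLs in this article; the site may use other forms.
LEVEL_PATTERNS = [
    ('category_list', re.compile(r'/list/(\d+)_(\d+)\.html$')),
    ('novel_page', re.compile(r'/book_(\d+)\.html$')),
    ('chapter_content', re.compile(r'/book/(\d+)/(\d+)/(\d+)\.html$')),
    ('chapter_list', re.compile(r'/book/(\d+)/(\d+)/?$')),
]


def classify_url(url):
    """Return the page level name for a URL, or None if no pattern matches."""
    for level, pattern in LEVEL_PATTERNS:
        if pattern.search(url):
            return level
    return None


print(classify_url('http://www.quanshuwang.com/book/167/167173'))
```

Each level maps to one callback in the spider: the category list to parse, the novel page to parse_read, the chapter list to parse_chapter, and the chapter page to parse_content.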

1. Create the project

scrapy startproject Fiction

cd Fiction

2. Create the spider file

scrapy genspider -t basic novel quanshuwang.com

3. Decide which fields to scrape: the novel title, the chapter title, and the chapter text. Other fields can be added later. Start by writing the item:

# -*- coding: utf-8 -*-
import scrapy


class FictionItem(scrapy.Item):
    # Novel title
    name = scrapy.Field()
    # Chapter title
    chapter_name = scrapy.Field()
    # Chapter text
    chapter_content = scrapy.Field()

4. Write the spider

# -*- coding: utf-8 -*-
import scrapy
import re
from Fiction.items import FictionItem
from scrapy.http import Request


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['quanshuwang.com']
    start_urls = [
        'http://www.quanshuwang.com/list/1_2.html',
        'http://www.quanshuwang.com/list/1_3.html',
    ]  # Pages 2 and 3 of the fantasy/magic category on quanshuwang

    # Collect the URL of every book on the list page
    def parse(self, response):
        book_urls = response.xpath('//li/a[@class="l mr10"]/@href').extract()
        for book_url in book_urls:
            yield Request(book_url, callback=self.parse_read)

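    # Follow the "Read now" button to reach the chapter list
    def parse_read(self, response):
        read_url = response.xpath('//a[@class="reader"]/@href').extract_first()
        # Skip pages without a reader button instead of raising IndexError
        if read_url:
            yield Request(read_url, callback=self.parse_chapter)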

    # Collect the URL of every chapter
    def parse_chapter(self, response):
        chapter_urls = response.xpath('//div[@class="clearfix dirconone"]/li/a/@href').extract()
        for chapter_url in chapter_urls:
            yield Request(chapter_url, callback=self.parse_content)

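    # Extract the novel title, the chapter title, and the chapter text
    def parse_content(self, response):
        # Novel title
        name = response.xpath('//div[@class="main-index"]/a[@class="article_title"]/text()').extract_first()

        result = response.text
        # Chapter title
        chapter_name = response.xpath('//strong[@class="l jieqi_title"]/text()').extract_first()
        # Chapter text: everything between the style5() script and the next script tag
        chapter_content_reg = r'style5\(\);</script>(.*?)<script type="text/javascript">'
        matches = re.findall(chapter_content_reg, result, re.S)
        if not matches:
            # The page does not match the expected layout; skip this chapter
            return
        chapter_content_2 = matches[0]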
        chapter_content_1 = chapter_content_2.replace('    ', '')
        chapter_content = chapter_content_1.replace('<br />', '')

        item = FictionItem()
        item['name'] = name
        item['chapter_name'] = chapter_name
        item['chapter_content'] = chapter_content
        yield item
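
The regex step in parse_content can be tried outside Scrapy. The sketch below runs the same pattern on a simplified, hypothetical chapter page: the sample markup, the style6() call, and the helper name extract_chapter_text are assumptions for illustration, and the real pages may differ.

```python
import re

# Hypothetical, simplified chapter page; the real markup on the site may differ.
sample_html = (
    '<div id="content"><script type="text/javascript">style5();</script>'
    '    First paragraph.<br />\n'
    '    Second paragraph.<br />\n'
    '<script type="text/javascript">style6();</script></div>'
)

# Same pattern as in the spider; re.S lets "." match newlines inside the chapter.
CONTENT_RE = re.compile(
    r'style5\(\);</script>(.*?)<script type="text/javascript">', re.S
)


def extract_chapter_text(html):
    """Return the cleaned chapter text, or None when the markers are not found."""
    match = CONTENT_RE.search(html)
    if match is None:
        return None
    text = match.group(1)
    # Drop the four-space paragraph indentation and the <br /> tags.
    return text.replace('    ', '').replace('<br />', '')


print(extract_chapter_text(sample_html))
```

Returning None when the markers are missing lets the caller skip a page rather than fail on an empty match list.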

5. Write the pipeline to save each chapter as a text file

# -*- coding: utf-8 -*-
import os


class FictionPipeline(object):
    # Replace /Users/vincentwen/MyCode/Scrapy with a directory on your own machine
    base_path = '/Users/vincentwen/MyCode/Scrapy'

    def process_item(self, item, spider):
        # One folder per novel, created the first time the novel is seen
        target_path = os.path.join(self.base_path, str(item['name']))
        os.makedirs(target_path, exist_ok=True)
        # One .txt file per chapter, named after the chapter title
        filename_path = os.path.join(target_path, str(item['chapter_name']) + '.txt')
        # The with block closes the file, so no explicit close() is needed
        with open(filename_path, 'w', encoding='utf-8') as f:
            f.write(item['chapter_content'] + "\n")
        return item
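
Because the pipeline uses scraped titles as folder and file names, a title containing a character such as / or ? could produce an invalid path. The helper below is a possible safeguard, not part of the original project: safe_filename, its default value, and the choice of replacement character are all assumptions.

```python
import re

# Characters that Windows rejects in file names, including both path separators.
UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, default='untitled'):
    """Replace characters that are unsafe in file names and trim whitespace."""
    cleaned = UNSAFE_CHARS.sub('_', str(title)).strip()
    # Fall back to a placeholder when nothing usable is left.
    return cleaned or default


print(safe_filename('Chapter 1: What/Why?'))
```

In the pipeline, it could wrap item['name'] and item['chapter_name'] before they are joined into a path.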

6. Edit the settings file

ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

ITEM_PIPELINES = {
   'Fiction.pipelines.FictionPipeline': 300,
}
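
Since this crawl requests every chapter of every listed novel, it may be worth slowing it down. The lines below are an optional addition to settings.py, not part of the original project; the setting names are standard Scrapy settings, but the values are illustrative assumptions.

```python
# Optional politeness settings; the values are illustrative assumptions.
DOWNLOAD_DELAY = 1                  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests to one domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server latency
```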

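7. Run the spider and check the result. Start with logging enabled; once it works, switch to no-log mode (scrapy crawl novel --nolog).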

scrapy crawl novel

8. Source code:
https://github.com/wzw5566/Fiction
