Python Web Scraping in Practice: Scraping the Doupo Cangqiong (斗破苍穹) Novel Again

2023-07-31  libdream

A long time ago, when I was first learning web scraping, I posted some notes on scraping the Doupo novel. Then a job change kept me away from scraping for quite a while; now I want to pick the skill back up and learn it again from scratch.

This time the scraping starts directly from the index page URL:
http://book.doupoxs.com/doupocangqiong/
The main idea is:
1- Scrape the hyperlinks to all chapters from the index page, in chapter order (a quick way to check the page structure is sketched right after this list).
2- Get the title and body text of each chapter.
3- Save the scraped content to a text file.
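
Before writing the full script, it can help to confirm the assumption behind step 1, namely that the chapter links sit inside the "xsbox clearfix" container used in the code below. This is only a quick interactive sketch, not part of the final script:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get('http://book.doupoxs.com/doupocangqiong/', headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
box = soup.find('div', class_='xsbox clearfix')  # chapter-list container assumed from the script below
for a in box.find_all('a')[:5]:                  # peek at the first few links to confirm the structure
    print(a.get('href'), a.text.strip())

If the printed hrefs look like /doupocangqiong/<number>.html, the extraction logic in the full script should work.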

The code is as follows:

import requests
from bs4 import BeautifulSoup
import time
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

def get_chapter_urls(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # Method 1: grab the chapter links from the chapter-list container
    chapters = soup.find('div', class_='xsbox clearfix').find_all('a')
    hrefs = [a.get('href') for a in chapters]
    # Method 2: collect every <a> on the page, keep the chapter-shaped ones, then sort numerically
##    chapters = soup.find_all('a')
##    hrefs = [a.get('href') for a in chapters if a.get('href').startswith('/doupocangqiong/') and a.get('href').split('/')[-1].split('.')[0].isdigit()]
##    hrefs.sort(key=lambda x: int(x.split('/')[-1].split('.')[0]))  # sort in chapter order
    return ['http://book.doupoxs.com' + href for href in hrefs]

def get_chapter_content(url):
    time.sleep(random.randint(1, 3))  # random 1-3 second delay between requests
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.find('div', class_='entry-tit').text  # chapter title
    # chapter body: the site pads paragraphs with runs of eight non-breaking spaces,
    # so replace them with blank lines between paragraphs
    content = soup.find('div', class_='m-post').text.replace('\xa0'*8, '\n\n')
    return title, content

def save_to_file(title, content, filename):
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(content + '\n\n')

def main(url, filename):
    chapter_urls = get_chapter_urls(url)
    for chapter_url in chapter_urls:
        title, content = get_chapter_content(chapter_url)
        save_to_file(title, content, filename)

if __name__ == "__main__":
    main('http://book.doupoxs.com/doupocangqiong/', 'doupo.txt')
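
The script above has no error handling: a single timeout or a non-200 response will raise an exception and abort the whole run, leaving the output file half-written. One way to harden it is a small retry wrapper. This is only a sketch; the name fetch_with_retry and the retry/timeout values are my own assumptions, not part of the original script:

def fetch_with_retry(url, retries=3, timeout=10):
    # Sketch of a retry wrapper; the retry count and timeout are arbitrary choices.
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()  # treat 4xx/5xx responses as errors
            response.encoding = 'utf-8'
            return response
        except requests.RequestException as e:
            print(f'Request failed ({e}), attempt {attempt + 1}/{retries}')
            time.sleep(2)
    raise RuntimeError(f'Failed to fetch {url} after {retries} attempts')

get_chapter_urls and get_chapter_content could then call fetch_with_retry(url) in place of requests.get(url, headers=headers).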

The scraped novel ends up looking like this:


[Screenshot of the scraped text file: 2023-08-01_12-01-21.png]