Scraping Douban E-Book Fiction with Python

2017-10-24 Treehl

Preparation

Install the BeautifulSoup module

pip install BeautifulSoup4

BeautifulSoup4 documentation

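Step 1: Analyze the page

Open https://read.douban.com/kind/100 in Chrome and open the developer tools (F12) to inspect the page structure.

From the page we learn the following:

The list of novels is a ul tag whose class attribute is list-lined ebook-list column-list
Each novel's details are in an li tag with class item store-item under that ul tag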

Step 2: Download the page source with requests

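import requests

download_url = 'https://read.douban.com/kind/100'
# Browser-like User-Agent, so the server does not see the library's default one
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

def download_page(url):
    # Fetch the page and return the raw response body (bytes)
    data = requests.get(url, headers=headers).content
    return data

def main():
    url = download_url
    # print() with a single argument runs on both Python 2 and Python 3
    print(download_page(url))

if __name__ == '__main__':
    main()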

A quick test run produces output like this:

<!DOCTYPE html><html lang="zh-CN" class="ua-windows ua-chrome ua-chrome61 ua-webkit is-anonymous" ><head><meta charset="utf-8"><meta http-equiv="Pragma" content="no-cache"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta http-equiv="Expires" content="Sun, 6 Mar 2005 01:00:00 GMT"><meta name="google-site-verification" content="Wyui6dsg_StZ9K3rJ6xtsNbxPhELyhMkEp2MsAhviBQ"><meta property="weixin:image"  ...............................

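In the code above, the User-Agent header tells the server which browser and user agent made the request. Servers check the User-Agent of incoming requests to identify crawlers, which is about the simplest anti-scraping mechanism there is.
You can find your browser's User-Agent in the Network tab of the developer tools (F12).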
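To make the User-Agent check concrete, here is a minimal sketch that compares a library's default User-Agent with a browser-like one. It uses the standard library's urllib.request as a stand-in for requests so that it runs without any network access or third-party package; requests behaves analogously and identifies itself as python-requests by default. The names BROWSER_UA, default_ua and sent_ua are made up for this illustration.

```python
import urllib.request

URL = 'https://read.douban.com/kind/100'
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')

# The default User-Agent urllib would send names the library, not a browser.
default_ua = dict(urllib.request.build_opener().addheaders)['User-agent']

# Building a Request object does not contact the server.
req = urllib.request.Request(URL, headers={'User-Agent': BROWSER_UA})

# urllib stores header names via str.capitalize(), hence 'User-agent'.
sent_ua = req.get_header('User-agent')

print(default_ua)
print(sent_ua)
```

A server that filters on this header can tell the first value apart from a real browser at a glance, which is why the script overrides it.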

Once we have the page source, we need to parse the HTML. We use BeautifulSoup for this.

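from bs4 import BeautifulSoup  # (1)

def parse_html(html):  # (2)
    soup = BeautifulSoup(html, 'html.parser')  # (3)
    read_list_soup = soup.find('ul', attrs={'class': 'list-lined ebook-list column-list'})  # (4)
    for read_li in read_list_soup.find_all('li', attrs={'class': 'item store-item'}):  # (5)
        detail = read_li.find('div', attrs={'class': 'title'})  # (6)
        read_name = detail.find('a').getText()  # (7)
        print(read_name)

Let's walk through the code, following the numbered comments:

(1) Import the module.
(2) Define the parse_html function, which takes the HTML source as input and prints the novel titles on the page to the console.
(3) Parse the page with BeautifulSoup.
(4) Locate the novel-list element in the page's HTML source.
(5) Loop over each li tag with a for loop.
(6) and (7) Extract the novel's title.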
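The same extraction can be tried offline against a small hand-written snippet. The sketch below is an illustration, not the post's code: SAMPLE_HTML is hypothetical markup that only mimics the structure described in Step 1, the titles are placeholders, and extract_titles is a made-up helper. It uses BeautifulSoup's select() with a CSS selector, which matches each class separately and so does not depend on the exact class string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure from Step 1;
# the titles are placeholders, not real Douban data.
SAMPLE_HTML = """
<ul class="list-lined ebook-list column-list">
  <li class="item store-item">
    <div class="title"><a href="/ebook/1/">Title One</a></div>
  </li>
  <li class="item store-item">
    <div class="title"><a href="/ebook/2/">Title Two</a></div>
  </li>
</ul>
"""

def extract_titles(html):
    """Return the text of every title link inside the novel list."""
    soup = BeautifulSoup(html, 'html.parser')
    # Each class in the selector is matched on its own, so extra
    # classes or a different class order do not break the lookup.
    links = soup.select('ul.ebook-list li.store-item div.title a')
    return [a.get_text(strip=True) for a in links]

titles = extract_titles(SAMPLE_HTML)
print(titles)
```

Testing the parser on a fixed snippet like this makes it easier to tell a parsing bug from a change in the live page.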

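At this point we have the novel titles. Next we need to parse the pagination element.

Step 3: Locate the next-page link

Using the developer tools (F12) again, we find that the next-page element is an li tag with class next, and the link itself is in the a tag inside it. On the last page, the a tag inside this li disappears, so there is nothing more to fetch.
The code is as follows: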

def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    read_list_soup = soup.find('ul', attrs={'class': 'list-lined ebook-list column-list'})

    read_name_list = []

    for read_li in read_list_soup.find_all('li', attrs={'class': 'item store-item'}):
        detail = read_li.find('div', attrs={'class': 'title'})
        read_name = detail.find('a').getText()
        read_name_list.append(read_name)

    # On the last page the <a> inside li.next is gone; also guard against a missing li
    next_li = soup.find('li', attrs={'class': 'next'})
    next_page = next_li.find('a') if next_li is not None else None
    if next_page is not None:
        return read_name_list, download_url + next_page['href']
    return read_name_list, None

After parsing the HTML, the function returns the data we need; on the last page it returns None in place of the next-page URL.
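How the next-page URL is built depends on what the href attribute actually contains, which the post does not show. As a hedged sketch (Python 3), the standard library's urljoin resolves an href against the base URL whether it is a bare query string or a rooted path. The href values below are assumptions for illustration, and resolve_next is a made-up helper name.

```python
from urllib.parse import urljoin

BASE_URL = 'https://read.douban.com/kind/100'

def resolve_next(base_url, href):
    """Return the absolute next-page URL, or None when there is no link."""
    if not href:
        return None
    return urljoin(base_url, href)

# Hypothetical href values; the real ones may differ.
query_only = resolve_next(BASE_URL, '?start=20')
rooted = resolve_next(BASE_URL, '/kind/100?start=20')
last_page = resolve_next(BASE_URL, None)

print(query_only)
print(rooted)
print(last_page)
```

Plain string concatenation only gives a valid URL for the query-only form, so urljoin is the safer choice when the shape of the href is uncertain.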

Step 4: Handle Chinese text encoding with codecs

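import codecs

def main():
    url = download_url

    # codecs.open encodes the unicode text as UTF-8 when writing
    with codecs.open('readings', 'wb', encoding='utf-8') as f:
        # parse_html returns None as the next URL on the last page, ending the loop
        while url:
            html = download_page(url)
            readings, url = parse_html(html)
            f.write(u'{readings}\n'.format(readings=u'\n'.join(readings)))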

When the program finishes, all the novel titles have been written to the file readings.
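To check that Chinese titles survive the trip to disk, a small round trip is enough. The sketch below is an illustration under stated assumptions: it uses io.open with an explicit encoding (equivalent to the built-in open on Python 3) instead of codecs.open, writes to a temporary directory rather than the working directory, and the two titles are placeholders.

```python
import io
import os
import tempfile

# Placeholder titles, not real scraped data.
titles = [u'示例小说一', u'示例小说二']

# Write to a temporary directory so nothing in the working directory is touched.
path = os.path.join(tempfile.mkdtemp(), 'readings')

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(titles) + u'\n')

# Read the file back with the same encoding and compare.
with io.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read().splitlines()

print(read_back == titles)
```

Naming the encoding explicitly on both the write and the read avoids depending on the platform's default encoding.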

Full code
You are welcome to visit my blog: Treehl's blog