Scraping Popular Zhihu Questions with Python: A Crawler Primer

2017-03-31, by 章开晴

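Introduction

Zhihu is a fairly easy site to crawl: it has no elaborate anti-scraping measures, which makes it good practice for anyone learning to write crawlers.
Since I have only just started with Python, this post simply saves the main details of Zhihu's popular questions to a database so they can be used for data analysis later. The pages crawled come from a collection of answers with more than 1,000 upvotes.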

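Page analysis

1. Analyze the page structure

The goal is to extract each popular question's title, answerer, upvote count, comment count, and similar fields.

[Figure: page layout analysis]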

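2. Analyze the page elements

Select the elements that hold the content to be scraped and note their distinguishing attributes (class, id, and so on). BeautifulSoup will use these attributes to extract the content later.

[Figure: HTML analysis]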

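3. Parse the fetched pages with BeautifulSoup

The page number in these URLs increases by one per page, so first build all the URLs and store them in a list, then fetch the pages from it one by one.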

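def get_zhihu():  # the fragments in this section form the body of this function
    urls = []  # page URLs to crawl
    lists = []  # records scraped from the pages
    url_part = "https://www.zhihu.com/collection/19928423?page="  # collection of answers with over 1,000 upvotes
    for i in range(1, 5):  # pages 1 to 4
        urls.append(url_part + str(i))  # build the URL of each page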
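
The URL-building step can also be written as a small function, which makes the inclusive page range explicit. This is only a sketch: the helper name `build_page_urls` and the constant `URL_PART` are illustrative and do not appear in the original code.

```python
# Same URL prefix as url_part in the fragment above.
URL_PART = "https://www.zhihu.com/collection/19928423?page="


def build_page_urls(first_page, last_page):
    """Return the collection URLs for first_page..last_page, inclusive."""
    return [f"{URL_PART}{page}" for page in range(first_page, last_page + 1)]


urls = build_page_urls(1, 4)
print(urls[0])    # URL of the first page
print(len(urls))  # number of pages queued
```

Passing the last page number directly avoids the off-by-one that is easy to introduce when calling `range` by hand.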

The parsing code that uses BeautifulSoup:

    for url in urls:
        get_html = requests.get(url, headers=headers)  # fetch the page; headers holds the User-Agent defined in the complete code
        Soup = BeautifulSoup(get_html.text, 'lxml')  # parse the page with BeautifulSoup
        items = Soup.find_all('div', class_="zm-item")  # every popular-question entry on the page
        for i in items:
            data = {
                "question": i.find("h2", class_="zm-item-title").text,  # question title
                "like": i.find("div", class_="zm-item-vote").text,  # upvote count
                "user_info_name": i.find("div",class_="answer-head").find("span", class_="author-link-line").text if(i.find("div",class_="answer-head").find("span", class_="author-link-line")) else "",  # answerer's name; some elements may be missing, and this if-else avoids an error when find() returns None
                "user_info_sign": i.find("div", class_="answer-head").find("span", class_="bio").text if(i.find("div", class_="answer-head").find("span", class_="bio")) else "",  # answerer's signature
                "answer": i.find("div", class_="zh-summary summary clearfix").text,  # answer summary
                "time": i.find("p", class_="visible-expanded").find("a",class_="answer-date-link meta-item").text,  # time the answer was edited
                "comment":i.find("div", class_="zm-meta-panel").find("a",class_="meta-item toggle-comment js-toggleCommentBox").text,  # comment count
                "link": i.find("link", itemprop="url").get("href")  # link; only class needs the trailing underscore (class_), itemprop does not
                }  # temporary record for one entry
            lists.append(data)  # collect the record
    return lists
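
The if-else guard above repeats each `find()` call twice. One possible way to factor it out is a small helper that returns a default when the element is missing. This is a minimal sketch: `text_or_default` is a hypothetical helper, the sample HTML is made up to mirror the class names used above, and it uses the built-in `html.parser` instead of `lxml` so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up markup that mirrors the class names used in the article.
SAMPLE_HTML = """
<div class="zm-item">
  <h2 class="zm-item-title">Sample question?</h2>
  <div class="answer-head">
    <span class="author-link-line">Alice</span>
  </div>
</div>
"""


def text_or_default(parent, name, class_name, default=""):
    """Return the stripped text of the first matching child, or default if absent."""
    node = parent.find(name, class_=class_name) if parent is not None else None
    return node.get_text(strip=True) if node is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
item = soup.find("div", class_="zm-item")
head = item.find("div", class_="answer-head")
record = {
    "question": text_or_default(item, "h2", "zm-item-title"),
    "user_info_name": text_or_default(head, "span", "author-link-line"),
    "user_info_sign": text_or_default(head, "span", "bio"),  # missing in the sample
}
print(record)
```

Here the missing `bio` element yields an empty string instead of raising an `AttributeError`.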

4. Save the page data to a MongoDB database

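from pymongo import MongoClient  # MongoDB client library


def storing_data(lists):  # save the scraped records to the database
    client = MongoClient('mongodb://localhost:27017/')  # connect to MongoDB
    db = client.lib  # open the database "lib"
    collection = db.Books  # open the collection "Books"
    for alist in lists:
        collection.insert_one(alist)  # insert_one() replaces insert(), which PyMongo 4 removed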
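
Plain inserts add a new document every time the crawler runs, so a second run over the same pages would store duplicates. A possible alternative, sketched below, is an upsert keyed on the answer's link. Treating `link` as a unique key is an assumption, and `upsert_answers` is a hypothetical helper; it expects a PyMongo collection and relies only on `update_one(..., upsert=True)` and the `upserted_id` attribute of its result.

```python
def upsert_answers(collection, answers):
    """Insert new answers or refresh existing ones, matching on the "link" field.

    Returns the number of documents that were newly inserted.
    """
    inserted = 0
    for answer in answers:
        result = collection.update_one(
            {"link": answer["link"]},  # assumption: the link identifies an answer
            {"$set": answer},          # overwrite the stored fields with fresh values
            upsert=True,               # create the document if no match exists
        )
        if result.upserted_id is not None:
            inserted += 1
    return inserted


# Usage, assuming a local MongoDB instance as in the article:
# from pymongo import MongoClient
# collection = MongoClient('mongodb://localhost:27017/').lib.zhihu
# print(upsert_answers(collection, lists))
```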

5. Complete code

from pymongo import MongoClient  # connects to the database
import requests  # fetches the pages
from bs4 import BeautifulSoup  # extracts content from the pages
import time  # measures how long the program runs


def get_zhihu():
    headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/"
                             "56.0.2924.87 Safari/537.36"}  # request headers
    urls = []  # page URLs to crawl
    lists = []  # records scraped from the pages
    url_part = "https://www.zhihu.com/collection/19928423?page="  # collection of answers with over 1,000 upvotes
    client = MongoClient('mongodb://localhost:27017/')  # connect to MongoDB
    db = client.lib  # open the database "lib" (rename it if you like)
    collection = db.zhihu  # open the collection "zhihu" (rename it if you like)
    t1 = time.time()  # start time
    a = 0  # number of answers crawled so far
    page = 0  # number of pages crawled so far
    lastpage = 4198  # page number of the collection's last page; adjust it to the current count
    for i in range(1, lastpage + 1):  # pages are numbered from 1, and lastpage is included
        urls.append(url_part + str(i))  # build the URL of each page
    for url in urls:
        page = page + 1
        try:
            get_html = requests.get(url, headers=headers, timeout=10)  # fetch the page; the timeout keeps a stalled request from hanging the crawl
            Soup = BeautifulSoup(get_html.text, 'lxml')  # parse the page with BeautifulSoup
            items = Soup.find_all('div', class_="zm-item")  # every popular-question entry on the page
            try:
                for i in items:
                    a = a + 1
                    data = {
                        "question": i.find("h2", class_="zm-item-title").text,  # question title
                        "like": i.find("div", class_="zm-item-vote").text,  # upvote count
                        "user_info_name": i.find("div",class_="answer-head").find("span", class_="author-link-line").text,  # answerer's name
                        "user_info_sign": i.find("div", class_="answer-head").find("span", class_="bio").text,  # answerer's signature
                        "answer": i.find("div", class_="zh-summary summary clearfix").text,  # answer summary
                        "time": i.find("p", class_="visible-expanded").find("a",class_="answer-date-link meta-item").text,  # time the answer was edited
                        "comment":i.find("div", class_="zm-meta-panel").find("a",class_="meta-item toggle-comment js-toggleCommentBox").text,  # comment count
                        "link": i.find("link").get("href")  # link
                        }  # temporary record for one entry
                    collection.insert_one(data)  # insert into the collection; insert_one() replaces insert(), which PyMongo 4 removed
                    t2 = time.time()
                    print("-----No."+str(a)+"-----Page:"+str(page)+"-----")  # show progress and elapsed time
                    print("usage time: ", str(t2-t1))
                time.sleep(0.5)  # pause so that crawling too fast does not cause errors
            except Exception as e:  # not a bare except, so Ctrl-C still stops the crawler
                print("some errors happened, time will sleep 5s:", e)  # parsing failed, e.g. the original answer no longer exists (deleted)
                time.sleep(5)
        except requests.RequestException as e:
            print("Internet connection error, time will sleep 10s:", e)  # the request failed, e.g. the IP is temporarily blocked
            time.sleep(10)
    return lists  # stays empty because records go straight to the database; append data to lists if you need the return value


if __name__ == "__main__":
    get_zhihu()  # run the crawler
    print("OK")  # signal that the run has finished
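
The error handling above sleeps and then moves on, so a page whose request fails is skipped for good. A possible refinement is to retry the same page a few times with a growing delay. The sketch below is not part of the original program: `fetch_with_retry` and its parameters are hypothetical, and it takes the fetch function as an argument so that it does not depend on any particular HTTP library.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait and try again, doubling the delay each time.

    Re-raises the last error once all attempts have failed.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({error}); retrying in {delay}s")
            sleep(delay)


# Usage inside the crawl loop, assuming requests and headers as in the code above:
# get_html = fetch_with_retry(
#     lambda u: requests.get(u, headers=headers, timeout=10), url)
```

Note that `requests.get` raises only on connection-level problems; to also retry on HTTP error status codes, the fetch function would need to call `raise_for_status()` on the response.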

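The environment used is Ubuntu 16.10 with PyCharm 2017.1, and the code is written for Python 3. Before running it, use pip to install pymongo, BeautifulSoup (the `beautifulsoup4` package), requests, and lxml, and install the MongoDB database along with Robomongo, a database visualization tool. The final contents of the database are shown in the figure below.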

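[Figure: crawl results]
In total, nearly 40,000 answers were crawled. To export the data from the database to the file zhihu.csv (UTF-8 encoded), run the following command in a terminal: mongoexport -d lib -c zhihu --type=csv -o zhihu.csv -f question,user_info_name,link,like,comment,time,answer,user_info_sign (older versions of mongoexport use --csv in place of --type=csv). Changing the command's arguments exports the data as JSON instead.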
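
Once the data is in a CSV file, it can be loaded for the kind of analysis mentioned in the introduction. As a minimal sketch, the function below counts how many crawled answers each author wrote, using only the standard library. The function name `top_answerers` is illustrative, and it assumes the CSV has a header row containing the `user_info_name` column, as produced by the export command above.

```python
import csv
from collections import Counter


def top_answerers(csv_file, limit=10):
    """Count answers per author in an exported CSV and return the most frequent ones."""
    reader = csv.DictReader(csv_file)  # the first row supplies the column names
    counts = Counter(
        row["user_info_name"] for row in reader if row["user_info_name"]
    )  # rows with an empty author name are skipped
    return counts.most_common(limit)


# Usage, assuming zhihu.csv was produced by mongoexport:
# with open("zhihu.csv", newline="", encoding="utf-8") as f:
#     for name, count in top_answerers(f):
#         print(name, count)
```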

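0. Installation and database basics

I first tried MySQL but ran into many difficulties, so I switched to MongoDB and found it much simpler. Listed below is the tutorial I used while learning the database; thanks to its author.

Reference tutorial:

How to save data to a MongoDB database in Python (如何在python中将数据保存到MongoDB数据库)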

0.1 Closing remarks

The code for this part has been published on my GitHub. Feel free to browse it.