How a Python Newbie Can Use a Crawler to Automatically Scrape 《三生三世十里桃花》 from Douban Movie

2017-08-15 大吉大利小米酱

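Preface:
This article is mainly for people who have some programming background but have never learned Python, and who want a quick way to use a crawler to extract a batch of similar information from web pages.

A couple of days ago I watched 《三生三世十里桃花》. My face was blank, my heart was crushed, and I felt like crying. As soon as I got home I opened Douban and started scrolling through reviews. Seeing more than 60,000 short reviews, I suddenly wanted to dig deeper into what Douban users generally thought of 《三》, so I reached for Python, a language I have gone from getting started to giving up on many times. Since every previous attempt began on page one of a tutorial and ended on page one, I can safely be classed as a Python newbie.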

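1. Preparation

Compared with other languages, Python is a remarkably free-spirited one. The fact that it insists on whitespace indentation to mark blocks is enough to show how different it is, with a unique take-it-or-leave-it temperament. A dashing, willful young gentleman like this naturally calls for some preparation before you can handle him.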

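Before you start using Python you need to:

Understand how programming works (having learned any one programming language is enough)

Understand the differences between Python 2.x and 3.x, and pick the version that suits your own needs (I did not look into this)

Once the version is chosen, learn the Python syntax for that version (I went from getting started to giving up)

Install Python on your computer (I went and begged a programmer friend for help)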

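In short, after reading the Baidu Baike definition of Python, and in order to finish this job nimbly yet gracefully, I carefully (read: lazily) chose to go straight into hands-on practice (read: copying code). Enough chatter, let's begin.

Since what I wanted was content from Douban, I picked a crawler example that scrapes the Douban Movie Top 250 as my reference; for details see: 抓取豆瓣电影Top250 (Scraping the Douban Movie Top 250). That example uses Python 2, so I decisively chose Python 2.7 (yes, I am that decisive!).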

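2. How a crawler works

As I understand it, a simple crawler is a program that imitates what a user does, quickly carrying out work that would be repetitive and time-consuming for a person.

Simulating user actions
Take the short reviews of 《三》 as an example. First enter the URL of the Douban short-review page, https://movie.douban.com/subject/25823277/comments?status=P. Once the page loads, each user's short review appears below the user name and rating (which serve as the locating information). With 60,000 short reviews, we would have to keep clicking "next page" to read them all, which is clearly both repetitive and time-consuming.

With a Python crawler, it only takes a few minutes (even though I actually spent 2 days fiddling with it).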

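3. Analyzing the URL

Some people might say: that's not how I read short reviews, I go to the home page first → search for 三生三世 → then click the film's detail page… (hush). Every step taken before reaching the target page can be replaced by entering the target URL directly.

URL of the first page of hot short reviews for 《三》:
movie.douban.com/subject/25823277/comments?status=P

URL of the second page of hot short reviews for 《三》:
movie.douban.com/subject/25823277/comments?start=21&limit=20&sort=new_score&status=P

URL of the third page of hot short reviews for 《三》:
movie.douban.com/subject/25823277/comments?start=44&limit=20&sort=new_score&status=P

…When paging further, the only part of the URL that changes is the number after start=

As you can see, only the first page of hot short reviews has a different structure from the other pages. So I tried changing the number after start= in the second-page URL to 0, giving:
movie.douban.com/subject/25823277/comments?start=0&limit=20&sort=new_score&status=P
and got the same page as the first page of short reviews.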

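URL components

movie.douban.com is the Douban Movie home page.

/subject/25823277/ is the ID of the film. To read short reviews of another film, replace the ID here with that film's ID.

comments corresponds to the short-review list; when you choose friends' short reviews it becomes follows_comments.

limit=20 means each page shows only 20 short reviews. sort=new_score means the list is sorted as hot short reviews; when sorted by newest, sort=time, but then the numbers after start are out of order.

The only thing that changes is the parameter after start=. It does not simply go up by 20 per page the way 20 reviews per page would suggest. After turning a few pages, I found no obvious pattern other than that the increment is greater than 20. And since the start values are out of order when sorting by newest, my inference is that start counts users who have rated the film, including those who wrote a short review and those who only gave stars without writing anything, ordered by popularity (this is my guess).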
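
The URL parts described above can be pulled apart programmatically. The following is only a minimal sketch, written for Python 3 (the article itself uses Python 2.7); the function name describe_comment_url is made up for illustration, and it assumes the path always has the /subject/<id>/<view> shape shown above.

```python
from urllib.parse import urlsplit, parse_qs

def describe_comment_url(url):
    """Split a Douban short-review URL into the parts discussed above."""
    parts = urlsplit(url)
    # The path is assumed to look like /subject/<film id>/<view>
    segments = [s for s in parts.path.split('/') if s]
    # parse_qs returns lists of values; keep the first value of each parameter
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        'host': parts.netloc,
        'subject_id': segments[1],
        'view': segments[2],
        'query': query,
    }

info = describe_comment_url(
    'https://movie.douban.com/subject/25823277/comments'
    '?start=21&limit=20&sort=new_score&status=P')
print(info)
```

Keeping the query as a plain dictionary makes it easy to compare two page URLs and confirm that only start differs.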

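Since I don't know the real rule by which start increases, I still used start += 20 to turn pages. The drawback is that these values drift away from the ones produced by real paging, so duplicate short reviews show up across pages; this gets fixed in the output. (Experts are welcome to suggest improvements.)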
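
The fixed-step paging can be sketched as a small URL builder. This is an illustration in Python 3 (the article's own code is Python 2.7); comment_page_url and page_urls are hypothetical helper names, and the step of 20 is the article's workaround, not Douban's real paging rule.

```python
from urllib.parse import urlencode

BASE = 'https://movie.douban.com/subject/{subject_id}/comments'

def comment_page_url(subject_id, start, limit=20, sort='new_score', status='P'):
    """Build one short-review page URL with the parameters from section 3."""
    query = urlencode([('start', start), ('limit', limit),
                       ('sort', sort), ('status', status)])
    return BASE.format(subject_id=subject_id) + '?' + query

def page_urls(subject_id, pages, step=20):
    """URLs for the first `pages` pages, advancing start by a fixed step."""
    return [comment_page_url(subject_id, page * step) for page in range(pages)]

urls = page_urls('25823277', 3)
for url in urls:
    print(url)
```

Because the real start values grow by more than 20, neighbouring pages built this way may overlap, so the results still need de-duplication.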

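4. Analyzing the web page

With the target URL set, we can open the target page. First look at the page in the wild; the information we need is what's inside the red box.

[Figure: Douban short-review page for 《三生三世十里桃花》]

Open the page in Chrome, right-click and choose "View page source", and find the code block that corresponds to the short reviews.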

<div class="comment-item" data-cid="1224176725">
    <div class="avatar">
        <a title="王大根" href="https://www.douban.com/people/diewithme/">
            <img src="https://img.haomeiwen.com/i5588611/337752531f6c3511.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240"/>
        </a>
    </div>
    <div class="comment">
        <h3>
            <span class="comment-vote">
                <span class="votes">13622</span>
                <input value="1224176725" type="hidden"/>
                <a href="javascript:;" class="j a_vote_comment">有用</a>
            </span>
            <span class="comment-info">
                <a href="https://www.douban.com/people/diewithme/" class="">王大根</a>
                <span>看过</span>
                <span class="allstar10 rating" title="很差"></span>
                <span class="comment-time " title="2017-08-03 19:21:41">2017-08-03</span>
            </span>
        </h3>
        <p class=""> 年度最烂，真的是年度最烂，前半部像杨洋和刘亦菲主持的少儿节目，后半部毁天灭地只为衬托他们狗屎一样的爱情。杨洋现在表演时的自恋感，已经到了他深情望着女演员的时候，你都会怀疑他是不是在看对方眼里映出的他自己帅气的倒影……</p>
        <a class="js-irrelevant irrelevant" href="javascript:;">这条短评跟影片无关</a>
    </div>
</div>

<div class="comment-item" data-cid="1223910856">
    <div class="avatar">
        <a title="天是红河岸" href="https://www.douban.com/people/ronghua1983/">
            <img src="https://img.haomeiwen.com/i5588611/0f1426e8d4f3600c.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240"/>
        </a>
    </div>
    <div class="comment">
        <h3>
            <span class="comment-vote">
                <span class="votes">14832</span>
                <input value="1223910856" type="hidden"/>
                <a href="javascript:;" class="j a_vote_comment">有用</a>
            </span>
            <span class="comment-info">
                <a href="https://www.douban.com/people/ronghua1983/" class="">天是红河岸</a>
                <span>看过</span>
                <span class="allstar10 rating" title="很差"></span>
                <span class="comment-time " title="2017-08-03 07:53:41">2017-08-03</span>
            </span>
        </h3>
        <p class=""> 杨洋之后再无如此猥琐油腻之夜华。</p>
        <a class="js-irrelevant irrelevant" href="javascript:;">这条短评跟影片无关</a>
    </div>
</div>

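This is the source for two short reviews. We can tell that a short-review block begins with <div class="comment-item" data-cid="ID"> (the ID is most likely the comment's ID, since the same value reappears in the hidden vote input). Within the block, everything is identical except the user name, avatar, star rating, review time, "useful" count, and review text. The review text sits between <p class=""> and </p>.

This is where regular expressions come in.

Write the regular expression based on the source:

<div.*?class="comment-item".*?>.*?<p.*?class="">(.*?)</p>

Here .*? is a lazy (non-greedy) match, and (.*?) is a capture group: whatever matches at that position is saved, which is exactly the short-review text we are after.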
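
To see the expression at work, here is a minimal sketch in Python 3 (the article's code is Python 2.7). SAMPLE is a made-up, stripped-down imitation of the page source, and extract_comments is a hypothetical helper name. The re.S flag matters: without it "." does not match newlines, and the pattern could not span a multi-line block.

```python
import re

# Made-up sample that mimics the structure of two short-review blocks
SAMPLE = '''
<div class="comment-item" data-cid="1">
    <div class="comment">
        <p class=""> first sample review </p>
    </div>
</div>
<div class="comment-item" data-cid="2">
    <div class="comment">
        <p class=""> second sample review </p>
    </div>
</div>
'''

# re.S lets "." match newlines, so one pattern can cover a whole block
COMMENT_RE = re.compile(
    r'<div.*?class="comment-item".*?>.*?<p.*?class="">(.*?)</p>', re.S)

def extract_comments(html):
    """Return the text captured by the single group, stripped of whitespace."""
    return [text.strip() for text in COMMENT_RE.findall(html)]

comments = extract_comments(SAMPLE)
print(comments)
```

Regular expressions are fragile against markup changes; they are used here only because the article does so.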

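5. Building (read: borrowing) the beta code

If you want the final code, skip straight to section 9

I took the Douban Top 250 crawler code directly and modified its contents.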

# -*- coding:utf-8 -*-
import urllib2
import re
import sys

class MovieComment:
    def __init__(self):
        # Set the default encoding to utf-8 (a Python 2 idiom)
        reload(sys)
        sys.setdefaultencoding('utf-8')  
        self.start = 0    # where the crawler starts
        self.param = '&filter=&type='
        # User-Agent tells the server which OS, OS version and browser the client uses; here it serves as the crawler's disguise.
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
        self.commentList = []
        self.filePath = '/Users/xiaomi/Desktop/comment.txt'
    
    def getPage(self):
        try:
            URL = 'https://movie.douban.com/subject/25823277/comments?start=' + str(self.start)
            request = urllib2.Request(url = URL, headers = self.headers)
            response = urllib2.urlopen(request)
            page = response.read().decode('utf-8')
            pageNum = (self.start + 20)/20
            print 'Fetching page ' + str(pageNum) + '...' 
            self.start += 20
            return page
        except (urllib2.URLError,Exception), e:
            # On failure the reason is printed and the method falls through, returning None
            if hasattr(e, 'reason'):
                print 'Fetch failed, reason:', e.reason
    
    def getMovie(self):
        pattern = re.compile(u'<div.*?class="avatar">.*?'
                             + u'<a.*?title="(.*?)".*?href=".*?">.*?</a>.*?'
                             + u'<p.*?class="">(.*?)</p>',re.S)  # regex: captures user name and review text
        while self.start <= 100:  # where the crawler stops
            page = self.getPage()
            comments = re.findall(pattern, page)
            for comment in comments:
                self.commentList.append([comment[0], comment[1].strip()]) # store the captured groups in the comment list
    
    def writeTxt(self):
        fileComment = open(self.filePath, 'w')
        try:
            for comment in self.commentList:
                fileComment.write( comment[1] + '\r\n\r\n') # write out the review text only
            print 'File written successfully...'
        finally:
            fileComment.close()
    
    def main(self):
        print 'Scraping short reviews of 《三生三世十里桃花》...'
        self.getMovie()
        self.writeTxt()
        print 'Scraping finished...'

DouBanSpider = MovieComment()
DouBanSpider.main()

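To debug faster, I set the crawler's end value for start to 100.

The regex here was modified a little, because when only one capture group was set, each output item contained just a single character. For example, when set to output the user name, only the first character of the user name came out; with 2 capture groups, the output went back to normal. (I haven't found the cause; anyone interested is welcome to debug it and tell me why, hhhhh)

So the regex here captures 2 groups, and only the one I need is actually written out.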
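
One likely explanation for the single-character output: re.findall returns a list of plain strings when the pattern has exactly one capture group, and a list of tuples when it has two or more. Indexing each item with [0] therefore yields the first character in the one-group case and the first group in the two-group case. A small Python 3 sketch on made-up HTML (the behaviour of re.findall is the same in Python 2.7):

```python
import re

# Made-up HTML with two user names and two reviews
html = ('<a title="Alice">x</a><p class="">great</p>'
        '<a title="Bob">y</a><p class="">awful</p>')

# One capture group: findall returns a list of strings
one_group = re.findall(r'<a title="(.*?)">', html)

# Two capture groups: findall returns a list of tuples
two_groups = re.findall(r'<a title="(.*?)">.*?<p class="">(.*?)</p>', html)

print(one_group)
print(two_groups)

# Indexing with [0] means different things in the two cases
first_of_one = [item[0] for item in one_group]
first_of_two = [item[0] for item in two_groups]
print(first_of_one)
print(first_of_two)
```

With a single group, the fix is simply to use each item directly instead of indexing it.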

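[Figure: the program running]

[Figure: short-review output]
You can see that some source code shows up in the scraped short reviews. It corresponds to

[Figure: the marker for reviews posted from a mobile device]
This is a snippet of code that marks short reviews posted from a mobile client. There are three kinds: Android, iPhone, and web. They could be filtered out programmatically, but for a quicker fix (read: out of laziness) I used a text de-duplication tool, which removes large repeated chunks from text in one click. It not only strips the repeated source code but also removes the duplicate short reviews caused by the fixed start step in section 3.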
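
The same cleanup could also be done in code. The sketch below (Python 3; the article's code is Python 2.7) is an assumption-laden illustration: clean_comment and dedupe are hypothetical helpers, and the marker in the sample is invented, since the article does not show the real mobile-marker markup. It strips leftover tags only, not any text between them, and keeps the first occurrence of each review.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean_comment(text):
    """Remove leftover HTML tags and collapse runs of whitespace."""
    without_tags = TAG_RE.sub('', text)
    return ' '.join(without_tags.split())

def dedupe(comments):
    """Clean each review and keep only its first occurrence, in order."""
    seen = set()
    unique = []
    for raw in comments:
        text = clean_comment(raw)
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

# Invented sample: a hypothetical marker tag, a duplicate and a blank entry
raw_comments = [
    'good film<img class="hypothetical-mobile-marker"/>',
    'bad film',
    'bad film',
    '   ',
]
cleaned = dedupe(raw_comments)
print(cleaned)
```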

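6. Simulated login

Once the crawler was debugged, I set out on my ambitious quest to scrape 60,000 short reviews. I set the end value of start to 64000, ready to let the code run and the crawler crawl while I leisurely had a cup of tea and a meal before coming back for the results. But…

TypeError: expected string or buffer

After a long stretch of debugging I roughly worked out the cause: in visitor mode Douban only lets you view the first 10 pages of short reviews; to reach later pages you must log in. (When a request fails, getPage prints the reason and returns nothing, so re.findall is handed None instead of a string, hence the TypeError.)

There are 2 ways to simulate a login: one is to log in by POSTing the form data, the other is to log in with a cookie.

I tried several ways found online to inspect the POST data: the Firefox plugin HttpFox doesn't support Mac, the debugging proxy Fiddler has no Mac version either, its Mac counterpart Charles inexplicably failed to record Douban's POST data, and Chrome's inspector showed no form data (I could almost see the goddess of fate looking down on me…).

Viewing the cookie
Logging in with a cookie is much simpler. Log in to Douban in Chrome, right-click the page and choose "Inspect", then in the panel that opens go to Network → www.douban.com (refresh the page if it isn't there) → Headers and find the cookie. Copy everything after cookie: and add it to the headers in the code, after the User-Agent entry.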

self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
                'cookie': 'paste the copied cookie here'}
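
For readers on Python 3, the same headers can be attached with urllib.request. This is a sketch only: build_request is a hypothetical helper, the User-Agent and cookie values are placeholders, and nothing is sent over the network here. A copied cookie works like a password for the logged-in session, so it should not be published along with the code.

```python
import urllib.request

def build_request(url, user_agent, cookie):
    """Create a request that carries a browser User-Agent and a login cookie."""
    headers = {'User-Agent': user_agent, 'Cookie': cookie}
    return urllib.request.Request(url, headers=headers)

request = build_request(
    'https://movie.douban.com/subject/25823277/comments?start=0',
    'Mozilla/5.0 (placeholder)',
    'PASTE_COPIED_COOKIE_HERE')

# urllib stores header names via str.capitalize(), hence 'User-agent'
print(request.get_header('Cookie'))
print(request.get_header('User-agent'))
```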

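7. Short-review crawler 1.0

Update: the optimized code 2.0 is now available. After reading the text you can skip code 1.0 and read on; the 2.0 source is in section 9.
After the change the program kept running but occasionally stalled, so I added timeout handling: if there is no response within 3 seconds, the request is sent again. The final code is as follows: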

# encoding:UTF-8
import urllib2
import re
import sys
import time

class MovieComment:
    def __init__(self):
        # Set the default encoding to utf-8 (a Python 2 idiom)
        reload(sys)
        sys.setdefaultencoding('utf-8')
        self.start = 0
        self.param = '&limit=20&sort=new_score&status=P'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
                        'cookie': 'cookie信息'}
        self.commentList = []
        self.filePath = '/Users/xiaomi/Desktop/ssssbc.txt'

    def getPage(self):
        # self.param carries limit/sort/status so every page uses the URL layout from section 3
        URL = 'https://movie.douban.com/subject/25823277/comments?start=' + str(self.start) + self.param
        request = urllib2.Request(url = URL, headers = self.headers)
        for attempt in range(2):  # first try, plus one retry after a timeout or error
            try:
                response = urllib2.urlopen(request, timeout = 3)  # stop waiting after 3 seconds
                page = response.read().decode('utf-8')
                pageNum = (self.start + 20)/20
                print 'Fetching page ' + str(pageNum) + '...' 
                self.start += 20
                return page
            except Exception, e:
                print 'Fetch failed, reason:', getattr(e, 'reason', e)
        return None  # both attempts failed
    
    def getMovie(self):
        pattern = re.compile(u'<div.*?class="avatar">.*?'
                             + u'<a.*?title="(.*?)".*?href=".*?">.*?</a>.*?'
                             + u'<p.*?class="">(.*?)</p>',re.S)  # captures user name and review text
        while self.start <= 100:  # where the crawler stops
            page = self.getPage()
            if page is None:  # stop instead of passing None to re.findall
                break
            comments = re.findall(pattern, page)
            for comment in comments:
                self.commentList.append([comment[0], comment[1].strip()])
    
    def writeTxt(self):
        fileComment = open(self.filePath, 'w')
        try:
            for comment in self.commentList:
                fileComment.write( comment[1] + '\r\n\r\n')
            print 'File written successfully...'
        finally:
            fileComment.close()
    
    def main(self):
        print 'Scraping short reviews of 《三生三世十里桃花》...'
        self.getMovie()
        self.writeTxt()
        print 'Scraping finished...'

DouBanSpider = MovieComment()
DouBanSpider.main()
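
The "send the request again after a timeout" idea can be isolated into a small helper. This is a generic Python 3 sketch, not the article's implementation: fetch_with_retry and flaky_fetch are hypothetical names, and a stand-in function simulates the download so the example runs without touching the network.

```python
import time

def fetch_with_retry(fetch, retries=2, wait=0.0):
    """Call fetch() until it succeeds or `retries` extra attempts are used up."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as error:  # broad on purpose: timeouts and HTTP errors alike
            last_error = error
            if attempt < retries:
                time.sleep(wait)  # optional pause before trying again
    raise last_error

# Stand-in for a real download: fails twice, then returns a page
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return '<html>ok</html>'

page = fetch_with_retry(flaky_fetch, retries=2)
print(page, calls['count'])
```

Bounding the number of attempts keeps a dead connection from hanging the crawl, and re-raising the last error makes the failure visible to the caller.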

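8. Some problems with crawlers getting banned

If a single IP or a single user requests data from the site too fast within a short time, Douban's anti-crawler mechanism will detect it, judge it to be a machine, and ban it. There are a few solutions.

1. Use a pool of proxy IPs and switch to a random IP every so often (I haven't figured this out yet)

2. Slow the crawl down by setting an interval between requests (I haven't figured this out yet either)

3. Split the crawl into chunks and run them at different times (crawl a bit, rest a bit ㄒ_ㄒ this is the method I used, which is why it's called a semi-automatic crawler ㄒ_ㄒ)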
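
Methods 2 and 3 can be combined in a simple pause schedule. The sketch below is in Python 3 (the article's code is Python 2.7); pause_seconds is a hypothetical helper and every number in it is an arbitrary assumption, not a threshold known to satisfy Douban.

```python
import random

def pause_seconds(page_index, base=2.0, jitter=1.0,
                  batch_size=50, batch_rest=60.0, rng=random):
    """How long to sleep after fetching page `page_index` (0-based).

    Every page gets a short, slightly random pause; after each full
    batch of pages a longer rest is added on top.
    """
    delay = base + rng.uniform(0, jitter)
    if (page_index + 1) % batch_size == 0:
        delay += batch_rest
    return delay

# Show the schedule around the end of the first batch (jitter off for clarity)
for index in (0, 48, 49, 50):
    print(index, pause_seconds(index, jitter=0))
```

In a crawl loop the returned value would be passed to time.sleep after each page.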

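It took half a day, but I finally scraped all 60,000-plus short reviews of the film version of 《三生三世十里桃花》 (thanks to the programmer friends who gave me pointers! Thanks to the internet! Thanks to the country!) and extracted the high-frequency words. Everyone's high opinion of the film was just as I expected, and I couldn't help shedding tears of excitement…

[Figure: keywords from 60,000 Douban short reviews of 《三生三世十里桃花》]

2017.08.17 I never expected a lazy, goofy post like this to make the front page (ㄒ_ㄒ my other posts cost me countless brain cells and nobody read them), so I might as well add a word cloud I made just yesterday: keywords from 10,000 Douban short reviews of the documentary 《二十二》. To you who are reading this, I recommend going to the cinema to see these elderly people.

[Figure: keywords from 10,000 Douban short reviews of 《二十二》]

The world is lovely. Thank you all :-D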

2017.08.19

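9. Proxy settings

After 2 days of effort, I can use proxies now!
For why a proxy is needed, see section 8.
Once again I followed an example from the same expert: Python爬虫技巧---设置代理IP (Python crawler tips: setting a proxy IP)
The approach is to first scrape a batch of proxy IPs from an open proxy site into a list (get_ip_list), then pick a single IP at random from that list (get_random_ip) to use as the proxy.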

def get_ip_list():
    url = 'http://www.xicidaili.com/nn/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

def get_random_ip():
    ip_list = get_ip_list()
    proxy_list = []
    for ip in ip_list:
        proxy_list.append(ip)
    proxy_ip = random.choice(proxy_list)
    return proxy_ip
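
The parsing step can be tried offline. The sketch below (Python 3, standard library only, whereas the article uses requests and BeautifulSoup under Python 2.7) is an illustration under assumptions: SAMPLE_ROWS is an invented table whose columns merely mirror the tds[1] / tds[2] indexing above, the addresses are private placeholder IPs, and ProxyPool is a hypothetical class. Unlike get_random_ip, it keeps the parsed list so the proxy site is not downloaded again for every pick.

```python
import random
import re

# Invented table: column 1 holds the IP and column 2 the port
SAMPLE_ROWS = '''
<table>
<tr><th>flag</th><th>IP</th><th>port</th></tr>
<tr><td></td><td>10.0.0.1</td><td>8080</td></tr>
<tr><td></td><td>10.0.0.2</td><td>3128</td></tr>
</table>
'''

ROW_RE = re.compile(r'<tr>(.*?)</tr>', re.S)
CELL_RE = re.compile(r'<td>(.*?)</td>', re.S)

def parse_proxy_rows(html):
    """Turn table rows into 'ip:port' strings."""
    proxies = []
    for row in ROW_RE.findall(html):
        cells = CELL_RE.findall(row)
        if len(cells) >= 3:  # the header row has no <td> cells and is skipped
            proxies.append(cells[1].strip() + ':' + cells[2].strip())
    return proxies

class ProxyPool:
    """Holds an already-fetched proxy list and hands out random entries."""
    def __init__(self, proxies, rng=random):
        self.proxies = list(proxies)
        self.rng = rng

    def pick(self):
        return self.rng.choice(self.proxies)

pool = ProxyPool(parse_proxy_rows(SAMPLE_ROWS))
print(pool.proxies)
print(pool.pick())
```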

Request the site through the proxy IP; from here on it connects with everything from section 3 onward, so I won't repeat it.

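URL = 'the URL of the site you want to visit'
# urllib2 picks the proxy by URL scheme: this mapping covers http:// URLs only; an https:// URL would need an "https" entry too
proxy = urllib2.ProxyHandler({"http": self.proxies})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
request = urllib2.Request(url = URL, headers = self.headers)
response = urllib2.urlopen(request, timeout = 5)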
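
In Python 3 the same setup lives in urllib.request. The following is a hedged sketch, not the article's code: make_proxy_opener is a hypothetical helper and 10.0.0.1:8080 is a placeholder address. It registers the proxy for both http and https, because the handler is chosen by the scheme of the requested URL; whether a particular free proxy can actually carry https traffic is not guaranteed. No request is sent here.

```python
import urllib.request

def make_proxy_opener(proxy_address):
    """Build an opener that routes http and https requests through one proxy."""
    proxies = {
        'http': 'http://' + proxy_address,
        'https': 'http://' + proxy_address,
    }
    handler = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(handler)
    return opener, handler

opener, handler = make_proxy_opener('10.0.0.1:8080')  # placeholder address
print(sorted(handler.proxies))
```

Calling opener.open(request, timeout=5) uses the proxy for that call only, which avoids the process-wide side effect of install_opener.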

With the proxy in place, the idea is to switch to a new proxy IP after every 20 pages of Douban scraped with one IP.

if self.start % 400==0:
    self.proxies = get_random_ip()
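
The number 400 is simply 20 pages times 20 reviews per page. A tiny Python 3 sketch makes the rotation points explicit; should_rotate is a hypothetical helper, and the start > 0 guard is an added assumption so that nothing rotates before the first page.

```python
def should_rotate(start, page_size=20, pages_per_proxy=20):
    """True when `start` has just reached a multiple of pages_per_proxy pages."""
    return start > 0 and start % (page_size * pages_per_proxy) == 0

# start values produced by start += 20, up to page 60
rotation_points = [s for s in range(0, 1201, 20) if should_rotate(s)]
print(rotation_points)
```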

The complete code is as follows:

# encoding:UTF-8
import urllib2
from bs4 import BeautifulSoup
import requests
import re
import random
import sys
import time

def get_ip_list():
    url = 'http://www.xicidaili.com/nn/'
    headers = {
       'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
    }
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

def get_random_ip():
    ip_list = get_ip_list()
    proxy_list = []
    for ip in ip_list:
        proxy_list.append(ip)
    proxy_ip = random.choice(proxy_list)
    return proxy_ip

class MovieComment:
    def __init__(self):
        # Set the default encoding to utf-8 (a Python 2 idiom)
        reload(sys)
        sys.setdefaultencoding('utf-8')
        self.start = 0 # where the crawler starts
        self.param = '&limit=20&sort=new_score&status=P'
        # User-Agent tells the server which OS, OS version and browser the client uses; here it serves as the crawler's disguise.
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
                        'cookie': 'cookie string'}
        self.commentList = []
        self.filePath = '/Users/xiaomi/Desktop/eebc.txt'
        self.proxies = get_random_ip() # the proxy IP to use

    def getPage(self):
        # self.param carries limit/sort/status so every page uses the URL layout from section 3
        URL = 'https://movie.douban.com/subject/26430107/comments?start=' + str(self.start) + self.param
        request = urllib2.Request(url = URL, headers = self.headers)
        for attempt in range(2):  # first try, plus one retry after a timeout or error
            try:
                # urllib2 picks the proxy by URL scheme: this mapping covers http:// URLs only;
                # an https:// URL would need an "https" entry too
                proxy = urllib2.ProxyHandler({"http": self.proxies})
                opener = urllib2.build_opener(proxy)
                urllib2.install_opener(opener)
                response = urllib2.urlopen(request, timeout = 5)
                page = response.read().decode('utf-8')
                pageNum = (self.start + 20)/20
                print 'Fetching page ' + str(pageNum) + '...' 
                self.start += 20
                if self.start % 400==0: # switch proxy IP every 20 pages
                    self.proxies = get_random_ip()
                return page
            except Exception, e:
                print 'Fetch failed, reason:', getattr(e, 'reason', e)
        return None  # both attempts failed
    
    def getMovie(self):
        pattern = re.compile(u'<div.*?class="avatar">.*?'
                             + u'<a.*?title="(.*?)".*?href=".*?">.*?</a>.*?'
                             + u'<p.*?class="">(.*?)</p>',re.S) # regex: captures user name and review text
        while self.start <= 20000:  # where the crawler stops
            page = self.getPage()
            time.sleep(2)  # pause between pages to lower the request rate
            if page is None:  # stop instead of passing None to re.findall
                break
            comments = re.findall(pattern, page)
            for comment in comments:
                self.commentList.append([comment[0], comment[1].strip()]) # store the captured groups in the comment list
    
    def writeTxt(self):
        fileComment = open(self.filePath, 'w')
        try:
            for comment in self.commentList:
                fileComment.write( comment[1] + '\r\n\r\n')
            print 'File written successfully...'
        finally:
            fileComment.close()
    
    def main(self):
        print 'Scraping short reviews of 《二十二》...'
        self.getMovie()
        self.writeTxt()
        print 'Scraping finished...'

DouBanSpider = MovieComment()
DouBanSpider.main()

I have joined the copyright protection program of "维权骑士" (https://rightknights.com/material/author?id=45845).