II. Mining and Analysis: Douban Comment Data Analysis of 《无问西东》
2018-04-03
何大炮
Building on the data scraped from Douban by the crawler in the previous article, I will do some simple data analysis.
Since all of the later processing uses xlsx, I first changed how the scraped data is stored, as follows:
def parse(self, response):
    self.log('start')
    divs = response.xpath('//div[@class="comment"]')
    self.log('open file')
    if self.pages == 0:
        global workbook
        global worksheet
        global style
        workbook = xlwt.Workbook(encoding='ascii')
        worksheet = workbook.add_sheet('My Worksheet')
        style = xlwt.XFStyle()  # initialise the style
        font = xlwt.Font()  # create a font for the style
        font.name = 'Times New Roman'
        font.bold = True  # bold
        # font.underline = True  # underline
        # font.italic = True  # italic
        style.font = font  # apply the font to the style
        worksheet.write(0, 0, "ID", style)
        worksheet.write(0, 1, "Time", style)
        worksheet.write(0, 2, "Comment", style)
    for i in range(0, len(divs)):
        comment_votes = divs[i].xpath('./h3/span[@class="comment-vote"]/span[@class="votes"]/text()')[0].extract()
        if int(comment_votes) > 10:
            comment = divs[i].xpath('./p/text()').extract_first().strip()
            time = divs[i].xpath(
                './h3/span[@class="comment-info"]/span[@class="comment-time "]/text()').extract_first().strip()
            self.bh += 1
            worksheet.write(self.bh, 0, self.bh, style)
            worksheet.write(self.bh, 1, time, style)
            worksheet.write(self.bh, 2, comment, style)
    self.log('close file')
    # next page (https://docs.scrapy.org/en/latest/intro/tutorial.html#following-links)
    self.pages += 1
    url = response.css('a.next::attr(href)').extract_first()
    if url:
        url = "https://movie.douban.com/subject/6874741/comments" + url
        return scrapy.Request(url=url, headers=self.headers, cookies=self.cookie, callback=self.parse)
    else:
        workbook.save('comments.xls')
        self.log('finished, totally: ' + str(self.pages) + ' pages')
The package used here is xlwt, a dedicated Excel-writing package.
A workbook is a big container that can hold many worksheets; a worksheet holds the content we write into it.
Writing uses the function write(row, col, content, style):
row: which row to write to
col: which column to write to
content: the content to write
style: the font style of the content
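As a minimal standalone sketch of this API (assuming xlwt is installed; the file and sheet names here are arbitrary):

```python
import os
import xlwt

workbook = xlwt.Workbook()
worksheet = workbook.add_sheet('demo')

# a bold Times New Roman style, as in the spider above
style = xlwt.XFStyle()
font = xlwt.Font()
font.name = 'Times New Roman'
font.bold = True
style.font = font

# write(row, col, content, style): row 0 is the header row here
worksheet.write(0, 0, "ID", style)
worksheet.write(0, 1, "Time", style)
worksheet.write(1, 0, 1)  # a data cell; the style argument is optional
workbook.save('demo.xls')
print(os.path.exists('demo.xls'))
```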
Start from here
I noticed that some comments were already heaping praise on the film before it was even released. I was curious: how many such people are there, and what proportion do they make up?
So I started the data analysis:
Tools
Spyder (an IDE for scientific computing in Python 3; it ships with many more scientific packages than PyCharm)
the pandas package, for reading and processing the data
the matplotlib package, for plotting
Processing (the details are in the code comments, likewise below):
import pandas as pd

address = "/Users/LiweiHE/acquisition/comments.xls"
# we need a single sheet,
# so sheetname=0 rather than sheetname=[0]
comment = pd.read_excel(address, sheetname=0, index_col=None, na_values=["NA"])
# normalise the 'Time' attribute to datetime
comment['Time'] = pd.to_datetime(comment['Time'])
# sort the records by Time
comment = comment.sort_values(by='Time')
# split into two new DataFrames around the release date (2018-01-12)
old = comment[comment.Time < "2018-01-12"]
new = comment[comment.Time >= "2018-01-12"]
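The split above can be checked on a tiny hand-made DataFrame (the dates below are hypothetical, not from the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Time": ["2018-01-10", "2018-01-11", "2018-01-13"],
    "Comment": ["a", "b", "c"],
})
df["Time"] = pd.to_datetime(df["Time"])
df = df.sort_values(by="Time")

# boolean masks split the frame into pre- and post-release comments
old = df[df.Time < "2018-01-12"]
new = df[df.Time >= "2018-01-12"]
print(len(old), len(new))  # 2 1
```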
Plotting:
import matplotlib
import matplotlib.pyplot as plt

# a Chinese font, so matplotlib can display Chinese labels
font = r'/Users/LiweiHE/PycharmProjects/simfang.ttf'
myfont = matplotlib.font_manager.FontProperties(fname=font)
# create a window called figure 1 for the plot
plt.figure(1)
# left: positions of the bars, height: heights of the bars, width: width of each bar,
# yerr: tiny error bar so the bars do not touch the top of the picture
rects = plt.bar(left=(0.2, 0.6), height=(len(old), len(new)), color=('r', 'g'),
                width=0.2, align="center", yerr=0.000001)

# add the count above each bar
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        # (x, y, value, size)
        plt.text(rect.get_x() + 0.03, 1.05 * height, '%s' % int(height), size=15)

autolabel(rects)
# name the bars
plt.xticks((0.2, 0.6), ("上映前就发表的评论", "上映后才发表的评论"), fontproperties=myfont)
# x-axis range
plt.xlim(0, 1)
# y-axis range
plt.ylim(0, 350)
plt.ylabel("人数", fontproperties=myfont)
plt.title('无问东西评论的分类', fontproperties=myfont)
plt.savefig("无问东西评论的分类_bar.png")
无问东西评论的分类_bar
Oh dear, 67 people posted comments before the film was even released, and with quite a few upvotes too (a filter applied when collecting the data). So what proportion do they make up?
Plotting:
# pie chart
plt.figure(2)
# label of each slice
labels = ["Before", "After"]
# weight of each slice, automatically turned into percentages
sizes = [len(old), len(new)]
# colour of each slice
colors = ['red', 'green']
# how far each slice is pulled out of the pie
explode = (0.05, 0)
# labeldistance: distance of the labels from the centre, in radii;
# autopct: percentage format; pctdistance: distance of the percentage text from the centre
patches, l_text, p_text = plt.pie(sizes, colors=colors, explode=explode,
                                  labels=labels, labeldistance=1.1,
                                  startangle=90, pctdistance=0.6,
                                  autopct='%3.1f%%', shadow=False)
# equal x and y scales, so the pie is a circle
plt.axis('equal')
# https://blog.csdn.net/helunqu2017/article/details/78641290
# legend in the top-right corner
plt.legend(prop=myfont)
plt.title('无问东西评论的分类', fontproperties=myfont)
plt.savefig("无问东西评论的分类_pie.png")
无问东西评论的分类_pie
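autopct does the percentage arithmetic for us; the same numbers by hand, with hypothetical slice sizes (replace them with [len(old), len(new)]):

```python
# hypothetical slice sizes, only to illustrate the '%3.1f%%' format
sizes = [67, 283]
total = sum(sizes)
shares = ['%3.1f%%' % (100.0 * s / total) for s in sizes]
print(shares)  # ['19.1%', '80.9%']
```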
Looking closely, nearly 20% of the comments were posted before the release. Could that mislead the audience?
So I ran a word-cloud analysis on the comments from the two periods.
# jieba word segmentation cannot read Excel, so export to CSV first
old.to_csv("old_comments.csv")
new.to_csv("new_comments.csv")
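The word clouds themselves are built from word frequencies. A minimal sketch with collections.Counter, using whitespace tokenisation on toy English text as a stand-in for running jieba over the real Chinese comments:

```python
from collections import Counter

# toy comments; the real pipeline would run jieba.lcut over each Chinese comment
comments = [
    "great story great cast",
    "great director",
]
words = []
for c in comments:
    words.extend(c.split())

# the most frequent words become the biggest words in the cloud
freq = Counter(words)
print(freq.most_common(1))  # [('great', 3)]
```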
Word cloud before the release
Word cloud after the release
Comparing the word clouds from the two periods, we can see that the pre-release comments mostly discuss the actors and the director, while the post-release comments mostly discuss the film's story and how true to life it is. So we can draw a conclusion:
comments posted after the release pay more attention to the film's content than to things around the film; that is, they are more reliable than pre-release comments.
Now let's see which date had the most comments:
Analysis
# line chart
plt.figure(3)
# grab the Time attribute of all records
record = comment['Time']
# delete all the duplicates
record = record.drop_duplicates()
# count the comments posted on each distinct day
time_array = []
for time in record:
    count = len(comment[comment.Time == time])
    time_array.append(count)
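The same per-day count can be done in a single pandas call; a small sketch on hypothetical dates:

```python
import pandas as pd

times = pd.Series(pd.to_datetime(["2018-01-13", "2018-01-13", "2018-01-14"]))
# value_counts gives, for each unique day, how many comments fall on it
counts = times.value_counts().sort_index()
print(counts.tolist())  # [2, 1]
```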
Plotting
无问东西评论的发布时间_line-chart
It looks like the 13th to the 15th was the peak period for posting comments. Can we guess from this that 《无问西东》 was released on the 13th... hahaha.
Comment length vs. rating
I was also curious whether the length of a comment is related to the rating it gives, so I used k-means clustering here.
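Before the full script: KMeans alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points (Lloyd's algorithm). A minimal pure-Python sketch on hypothetical 1-D data:

```python
def kmeans_1d(points, centroids, iters=10):
    """Tiny sketch of Lloyd's algorithm for 1-D data."""
    labels = []
    for _ in range(iters):
        # assignment step: index of the nearest centroid for each point
        labels = [min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
                  for p in points]
        # update step: each centroid becomes the mean of its assigned points
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids

labels, centroids = kmeans_1d([1.0, 1.2, 9.8, 10.0], [0.0, 5.0])
print(labels)  # [0, 0, 1, 1]
print([round(c, 6) for c in centroids])  # [1.1, 9.9]
```

sklearn's KMeans does the same thing in any number of dimensions, with smarter initialisation.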
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Apr 15 17:34:12 2018
@author: LiweiHE
"""
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
address = "/Users/LiweiHE/acquisition/comments.xls"
comment = pd.read_excel(address, sheetname=0, index_col=None)
# drop all records whose Grades value is 0
comment = comment[comment['Grades'] > 0]
# grab the Grades and comment-length attributes of all records
record = comment.loc[:, ['Grades', 'length of Comment']]
clf = KMeans(n_clusters=3)
y_pred = clf.fit_predict(record)
x = comment['Grades']
y = comment['length of Comment']
plt.scatter(x, y, c=y_pred, marker='x')
plt.title("Kmeans-comments")
plt.xlabel("Grades")
plt.ylabel("length of Comments")
plt.legend(["Rank"])
plt.savefig("Kmeans-comments.png")
Kmeans-comments.png
Judging from the clusters, comments of every length are spread evenly across each rating range; in other words, comment length and rating are not related!