II. Mining and Analysis: Douban Comment Data Analysis of 《无问西东》
2018-04-03
何大炮
Building on the data scraped from Douban by the crawler in the previous article, I will do some simple data analysis.
Since all of the later processing uses xlsx, I first changed how the scraped data is stored, as follows:
def parse(self, response):
    self.log('start')
    divs = response.xpath('//div[@class="comment"]')
    self.log('open file')
    if self.pages == 0:
        global workbook
        global worksheet
        global style
        workbook = xlwt.Workbook(encoding='ascii')
        worksheet = workbook.add_sheet('My Worksheet')
        style = xlwt.XFStyle()  # initialise the style
        font = xlwt.Font()  # create a font for the style
        font.name = 'Times New Roman'
        font.bold = True  # bold
        # font.underline = True  # underline
        # font.italic = True  # italic
        style.font = font  # apply the font to the style
        worksheet.write(0, 0, "ID", style)
        worksheet.write(0, 1, "Time", style)
        worksheet.write(0, 2, "Comment", style)
    for i in range(0, len(divs)):
        comment_votes = divs[i].xpath('./h3/span[@class="comment-vote"]/span[@class="votes"]/text()')[0].extract()
        if int(comment_votes) > 10:
            comment = divs[i].xpath('./p/text()').extract_first().strip()
            time = divs[i].xpath(
                './h3/span[@class="comment-info"]/span[@class="comment-time "]/text()').extract_first().strip()
            self.bh += 1
            worksheet.write(self.bh, 0, self.bh, style)
            worksheet.write(self.bh, 1, time, style)
            worksheet.write(self.bh, 2, comment, style)
    self.log('close file')
    # next page (https://docs.scrapy.org/en/latest/intro/tutorial.html#following-links)
    self.pages += 1
    url = response.css('a.next::attr(href)').extract_first()
    if url:
        url = "https://movie.douban.com/subject/6874741/comments" + url
        return scrapy.Request(url=url, headers=self.headers, cookies=self.cookie, callback=self.parse)
    else:
        workbook.save('comments.xls')
        self.log('finished, totally: ' + str(self.pages) + ' pages')
The package used here is xlwt, a dedicated Excel-writing package.
A workbook is a big container that can hold many worksheets; a worksheet holds the content we write into it.
Writing uses the function write(row, col, content, style):
row: which row to write to
col: which column to write to
content: the content to write
style: the font style of the content
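As a minimal standalone sketch of this API (assuming xlwt is installed; the file and sheet names here are arbitrary):

```python
import os
import xlwt

workbook = xlwt.Workbook()
worksheet = workbook.add_sheet('demo')

# a bold Times New Roman style, as in the spider above
style = xlwt.XFStyle()
font = xlwt.Font()
font.name = 'Times New Roman'
font.bold = True
style.font = font

# write(row, col, content, style): row 0 is the header row here
worksheet.write(0, 0, "ID", style)
worksheet.write(0, 1, "Time", style)
worksheet.write(1, 0, 1)  # a data cell; the style argument is optional
workbook.save('demo.xls')
print(os.path.exists('demo.xls'))
```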
Start from here
I noticed that some comments were already heaping praise on the film before it was even released. I was curious: how many such people are there, and what proportion do they make up?
So I started the data analysis:
Tools
Spyder (an IDE for scientific computing in Python 3; it ships with many more scientific packages than PyCharm)
the pandas package, for reading and processing the data
the matplotlib package, for plotting
Processing (the details are in the code comments, likewise below):
import pandas as pd

address = "/Users/LiweiHE/acquisition/comments.xls"
# we need a single sheet,
# so sheetname=0 rather than sheetname=[0]
comment = pd.read_excel(address, sheetname=0, index_col=None, na_values=["NA"])
# normalise the 'Time' attribute to datetime
comment['Time'] = pd.to_datetime(comment['Time'])
# sort the records by Time
comment = comment.sort_values(by='Time')
# split into two new DataFrames around the release date (2018-01-12)
old = comment[comment.Time < "2018-01-12"]
new = comment[comment.Time >= "2018-01-12"]
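The split above can be checked on a tiny hand-made DataFrame (the dates below are hypothetical, not from the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Time": ["2018-01-10", "2018-01-11", "2018-01-13"],
    "Comment": ["a", "b", "c"],
})
df["Time"] = pd.to_datetime(df["Time"])
df = df.sort_values(by="Time")

# boolean masks split the frame into pre- and post-release comments
old = df[df.Time < "2018-01-12"]
new = df[df.Time >= "2018-01-12"]
print(len(old), len(new))  # 2 1
```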
Plotting:
import matplotlib
import matplotlib.pyplot as plt

# a Chinese font, so matplotlib can display Chinese labels
font = r'/Users/LiweiHE/PycharmProjects/simfang.ttf'
myfont = matplotlib.font_manager.FontProperties(fname=font)
# create a window called figure 1 for the plot
plt.figure(1)
# left: positions of the bars, height: heights of the bars, width: width of each bar,
# yerr: tiny error bar so the bars do not touch the top of the picture
rects = plt.bar(left=(0.2, 0.6), height=(len(old), len(new)), color=('r', 'g'),
                width=0.2, align="center", yerr=0.000001)

# add the count above each bar
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        # (x, y, value, size)
        plt.text(rect.get_x() + 0.03, 1.05 * height, '%s' % int(height), size=15)

autolabel(rects)
# name the bars
plt.xticks((0.2, 0.6), ("上映前就发表的评论", "上映后才发表的评论"), fontproperties=myfont)
# x-axis range
plt.xlim(0, 1)
# y-axis range
plt.ylim(0, 350)
plt.ylabel("人数", fontproperties=myfont)
plt.title('无问东西评论的分类', fontproperties=myfont)
plt.savefig("无问东西评论的分类_bar.png")
无问东西评论的分类_bar
Oh dear, 67 people posted comments before the film was even released, and with quite a few upvotes too (a filter applied when collecting the data). So what proportion do they make up?
Plotting:
# pie chart
plt.figure(2)
# label of each slice
labels = ["Before", "After"]
# weight of each slice, automatically turned into percentages
sizes = [len(old), len(new)]
# colour of each slice
colors = ['red', 'green']
# how far each slice is pulled out of the pie
explode = (0.05, 0)
# labeldistance: distance of the labels from the centre, in radii;
# autopct: percentage format; pctdistance: distance of the percentage text from the centre
patches, l_text, p_text = plt.pie(sizes, colors=colors, explode=explode,
                                  labels=labels, labeldistance=1.1,
                                  startangle=90, pctdistance=0.6,
                                  autopct='%3.1f%%', shadow=False)
# equal x and y scales, so the pie is a circle
plt.axis('equal')
# https://blog.csdn.net/helunqu2017/article/details/78641290
# legend in the top-right corner
plt.legend(prop=myfont)
plt.title('无问东西评论的分类', fontproperties=myfont)
plt.savefig("无问东西评论的分类_pie.png")
无问东西评论的分类_pie
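autopct does the percentage arithmetic for us; the same numbers by hand, with hypothetical slice sizes (replace them with [len(old), len(new)]):

```python
# hypothetical slice sizes, only to illustrate the '%3.1f%%' format
sizes = [67, 283]
total = sum(sizes)
shares = ['%3.1f%%' % (100.0 * s / total) for s in sizes]
print(shares)  # ['19.1%', '80.9%']
```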
Looking closely, nearly 20% of the comments were posted before the release. Could that mislead the audience?
So I ran a word-cloud analysis on the comments from the two periods.
# jieba word segmentation cannot read Excel, so export to CSV first
old.to_csv("old_comments.csv")
new.to_csv("new_comments.csv")
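The word clouds themselves are built from word frequencies. A minimal sketch with collections.Counter, using whitespace tokenisation on toy English text as a stand-in for running jieba over the real Chinese comments:

```python
from collections import Counter

# toy comments; the real pipeline would run jieba.lcut over each Chinese comment
comments = [
    "great story great cast",
    "great director",
]
words = []
for c in comments:
    words.extend(c.split())

# the most frequent words become the biggest words in the cloud
freq = Counter(words)
print(freq.most_common(1))  # [('great', 3)]
```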
Word cloud before the release
Word cloud after the release
Comparing the word clouds from the two periods, we can see that the pre-release comments mostly discuss the actors and the director, while the post-release comments mostly discuss the film's story and how true to life it is. So we can draw a conclusion:
comments posted after the release pay more attention to the film's content than to things around the film; that is, they are more reliable than pre-release comments.
Now let's see which date had the most comments:
Analysis
# line chart
plt.figure(3)
# grab the Time attribute of all records
record = comment['Time']
# delete all the duplicates
record = record.drop_duplicates()
# count the comments posted on each distinct day
time_array = []
for time in record:
    count = len(comment[comment.Time == time])
    time_array.append(count)
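The same per-day count can be done in a single pandas call; a small sketch on hypothetical dates:

```python
import pandas as pd

times = pd.Series(pd.to_datetime(["2018-01-13", "2018-01-13", "2018-01-14"]))
# value_counts gives, for each unique day, how many comments fall on it
counts = times.value_counts().sort_index()
print(counts.tolist())  # [2, 1]
```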
Plotting
无问东西评论的发布时间_line-chart
It looks like the 13th to the 15th was the peak period for posting comments. Can we guess from this that 《无问西东》 was released on the 13th... hahaha.
Comment length vs. rating
I was also curious whether the length of a comment is related to the rating it gives, so I used k-means clustering here.
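Before the full script: KMeans alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points (Lloyd's algorithm). A minimal pure-Python sketch on hypothetical 1-D data:

```python
def kmeans_1d(points, centroids, iters=10):
    """Tiny sketch of Lloyd's algorithm for 1-D data."""
    labels = []
    for _ in range(iters):
        # assignment step: index of the nearest centroid for each point
        labels = [min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
                  for p in points]
        # update step: each centroid becomes the mean of its assigned points
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids

labels, centroids = kmeans_1d([1.0, 1.2, 9.8, 10.0], [0.0, 5.0])
print(labels)  # [0, 0, 1, 1]
print([round(c, 6) for c in centroids])  # [1.1, 9.9]
```

sklearn's KMeans does the same thing in any number of dimensions, with smarter initialisation.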
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Apr 15 17:34:12 2018
@author: LiweiHE
"""
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
address = "/Users/LiweiHE/acquisition/comments.xls"
comment = pd.read_excel(address, sheetname=0, index_col=None)
# drop all records whose Grades value is 0
comment = comment[comment['Grades'] > 0]
# grab the Grades and comment-length attributes of all records
record = comment.loc[:, ['Grades', 'length of Comment']]
clf = KMeans(n_clusters=3)
y_pred = clf.fit_predict(record)
x = comment['Grades']
y = comment['length of Comment']
plt.scatter(x, y, c=y_pred, marker='x')
plt.title("Kmeans-comments")
plt.xlabel("Grades")
plt.ylabel("length of Comments")
plt.legend(["Rank"])
plt.savefig("Kmeans-comments.png")
Kmeans-comments.png
Judging from the clusters, comments of every length are spread evenly across each rating range; in other words, comment length and rating are not related!