Web Scraping in Practice 1: What the Data Says About How Onlookers View the Luo Zhixiang Incident
2020-04-25
有趣的数据
Many hours have passed since the story broke, and every version of the rumor is making the rounds. How do the onlookers actually see this incident? Below we analyze it from a data perspective.
Data source:
https://m.weibo.cn
The code:
# Main function: scrape the hot comments for each Weibo post in list_id
def get_comment1(list_id):
    with open('weibo_comment_zhouyangqing1.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile, dialect='excel')
        writer.writerow(['微博id', '用户id', '用户名称', '性别', '身份认证',
                         '描述', '评论', '回复', '点赞', '认证'])
    for id in list_id:
        max_id = ""
        Data_all = []
        while True:
            # max_id is the paging cursor returned by the previous page
            if max_id == "":
                p_url = ("https://m.weibo.cn/comments/hotflow?id=" + str(id) +
                         "&mid=" + str(id) + "&max_id_type=0")
            else:
                p_url = ("https://m.weibo.cn/comments/hotflow?id=" + str(id) +
                         "&mid=" + str(id) + "&max_id=" + str(max_id) + "&max_id_type=1")
            print(p_url)
            dic_data = get_text_one(p_url)  # helper: fetch the URL and return parsed JSON
            try:
                if dic_data is not None:
                    max_id = dic_data["data"]["max_id"]
                    datalist = dic_data['data']['data']
                    for d in datalist:
                        userid = d['user']['id']
                        print(userid)
                        username = d['user']['screen_name']
                        gender = d['user']['gender']
                        verified = d['user']['verified']   # identity-verification flag
                        comment = d['text']                # comment body
                        total_number = d['total_number']   # reply count
                        like_count = d['like_count']       # like count
                        try:
                            verified_reason = d['user']['verified_reason']
                            description = d['user']['description']
                        except KeyError:
                            verified_reason = 0
                            description = 0
                        Data = [id, userid, username, gender, verified, description,
                                comment, total_number, like_count, verified_reason]
                        Data_all.append(Data)
                    csv_w('weibo_comment_zhouyangqing1.csv', Data_all)  # helper: append rows
                else:
                    break
            except (KeyError, TypeError):
                break  # malformed page: stop paging instead of looping forever
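The snippet above relies on two helpers the post never shows, `get_text_one` (fetch a URL and return parsed JSON) and `csv_w` (append rows to a CSV file). Their exact implementations are unknown; below is one minimal sketch of what they might look like, using only the standard library:

```python
import csv
import json
from urllib.request import Request, urlopen


def get_text_one(url):
    """Fetch one page of the comment API and return the parsed JSON, or None on failure."""
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urlopen(req, timeout=10) as resp:
            return json.loads(resp.read().decode('utf-8'))
    except Exception:
        return None


def csv_w(path, rows):
    """Append a list of rows to the given CSV file."""
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f, dialect='excel').writerows(rows)
```

Note that in practice the Weibo endpoint usually also requires a valid `Cookie` header from a logged-in session, which is omitted here.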
The data:
Gender split among the onlookers: 34% male vs. 66% female.
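The split above can be computed from the scraped `性别` column, which the crawler fills from the API's `gender` field (`'m'` for male, `'f'` for female). A sketch using a small inline frame in place of the real CSV:

```python
import pandas as pd

# Toy stand-in for the scraped file;
# in practice: df = pd.read_csv('weibo_comment_zhouyangqing1.csv')
df = pd.DataFrame({'性别': ['f', 'm', 'f', 'f', 'm', 'f']})

# Share of each gender among commenters
ratio = df['性别'].value_counts(normalize=True)
print(ratio)
```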
Hot-word distribution:
Comment word-frequency distribution:
Overall comment distribution
Distribution of comments about the male lead
Distribution of comments about the female lead
import pandas as pd
import jieba
import time
import csv
import re
from wordcloud import WordCloud
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('./text3.csv', encoding='gb18030')
items = data['评论'].astype(str).tolist()
print(len(data))

# Build the stop-word list
def stopwordslist():
    stopwords = [line.strip() for line in
                 open('./stop_word.txt', 'r', encoding='utf-8').readlines()]
    return stopwords

# Strip digits and Latin letters; returns '' for tokens made only of those
def remove_sub(input_str):
    punc1 = u'123456789.a-zA-Z'
    output_str = re.sub(r'[{}]+'.format(punc1), '', input_str)
    return output_str

alice_mask = np.array(Image.open('./b2.png'))
cloud = WordCloud(
    font_path="./ziti.ttf",   # set a Chinese font, otherwise characters render garbled
    background_color='white',
    mask=alice_mask,          # word-cloud shape
    max_words=200,            # maximum number of words shown
    max_font_size=200,        # largest font size
    random_state=1,
    width=400,
    height=800)

outstr = ''
stopwords = stopwordslist()   # build the stop-word list once, not per comment
for item in items:
    for j in jieba.cut(item, cut_all=False):
        if j not in stopwords:
            if not remove_sub(j):
                continue
            if j != '\t':
                outstr += j
                outstr += " "
print(len(outstr))

with open('./text.txt', 'a') as f:
    f.write(outstr)
    f.write('\n')

cloud.generate(outstr)
cloud.to_file('./pic6.png')
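The word-frequency charts can be produced from the same segmented text that feeds the word cloud. A minimal sketch with `collections.Counter`, assuming the comments have already been cut into tokens (by `jieba.cut` in the script above) and stop words removed; the sample tokens here are illustrative, not actual scraped data:

```python
from collections import Counter

# Tokens as produced by jieba.cut after stop-word filtering (toy sample)
tokens = ['多人运动', '时间管理', '周扬青', '时间管理', '分手', '时间管理']

freq = Counter(tokens)
for word, count in freq.most_common(3):
    print(word, count)
```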
One thing not scraped here is the comment timestamps. With those you could plot when the onlookers were commenting and see whether they are just as tireless around the clock. Worth a try if you are interested.
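If you do fetch the timestamps: each comment object in the API response appears to carry a `created_at` string (in the mobile API it looks like `'Sat Apr 25 23:10:00 +0800 2020'`, though verify this against a live response). A sketch of bucketing comments by hour of day, with toy timestamps standing in for real data:

```python
from collections import Counter
from datetime import datetime


def hour_of(created_at):
    """Parse a Weibo-style timestamp and return its hour (0-23)."""
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y').hour


# Toy timestamps in the format the comment API appears to use
times = ['Sat Apr 25 23:10:00 +0800 2020',
         'Sun Apr 26 01:45:00 +0800 2020',
         'Sat Apr 25 23:59:59 +0800 2020']

by_hour = Counter(hour_of(t) for t in times)
print(by_hour)
```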