Web Scraping in Practice 1: What the Data Says About How Onlookers View the Luo Zhixiang Incident
2020-04-25
有趣的数据
Many hours have passed since the story broke, and every version of the rumor is making the rounds. How do the onlookers actually see this incident? Below we analyze it from a data perspective.
Data source:
https://m.weibo.cn
The code:
# Main function: scrape the hot comments for each Weibo post in list_id
def get_comment1(list_id):
    with open('weibo_comment_zhouyangqing1.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile, dialect='excel')
        writer.writerow(['微博id', '用户id', '用户名称', '性别', '身份认证',
                         '描述', '评论', '回复', '点赞', '认证'])
    for id in list_id:
        max_id = ""
        Data_all = []
        while True:
            # max_id is the paging cursor returned by the previous page
            if max_id == "":
                p_url = ("https://m.weibo.cn/comments/hotflow?id=" + str(id) +
                         "&mid=" + str(id) + "&max_id_type=0")
            else:
                p_url = ("https://m.weibo.cn/comments/hotflow?id=" + str(id) +
                         "&mid=" + str(id) + "&max_id=" + str(max_id) + "&max_id_type=1")
            print(p_url)
            dic_data = get_text_one(p_url)  # helper: fetch the URL and return parsed JSON
            try:
                if dic_data is not None:
                    max_id = dic_data["data"]["max_id"]
                    datalist = dic_data['data']['data']
                    for d in datalist:
                        userid = d['user']['id']
                        print(userid)
                        username = d['user']['screen_name']
                        gender = d['user']['gender']
                        verified = d['user']['verified']   # identity-verification flag
                        comment = d['text']                # comment body
                        total_number = d['total_number']   # reply count
                        like_count = d['like_count']       # like count
                        try:
                            verified_reason = d['user']['verified_reason']
                            description = d['user']['description']
                        except KeyError:
                            verified_reason = 0
                            description = 0
                        Data = [id, userid, username, gender, verified, description,
                                comment, total_number, like_count, verified_reason]
                        Data_all.append(Data)
                    csv_w('weibo_comment_zhouyangqing1.csv', Data_all)  # helper: append rows
                else:
                    break
            except (KeyError, TypeError):
                break  # malformed page: stop paging instead of looping forever
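The snippet above relies on two helpers the post never shows, `get_text_one` (fetch a URL and return parsed JSON) and `csv_w` (append rows to a CSV file). Their exact implementations are unknown; below is one minimal sketch of what they might look like, using only the standard library:

```python
import csv
import json
from urllib.request import Request, urlopen


def get_text_one(url):
    """Fetch one page of the comment API and return the parsed JSON, or None on failure."""
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urlopen(req, timeout=10) as resp:
            return json.loads(resp.read().decode('utf-8'))
    except Exception:
        return None


def csv_w(path, rows):
    """Append a list of rows to the given CSV file."""
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f, dialect='excel').writerows(rows)
```

Note that in practice the Weibo endpoint usually also requires a valid `Cookie` header from a logged-in session, which is omitted here.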
The data:
Gender split among the onlookers: 34% male vs. 66% female.
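The split above can be computed from the scraped `性别` column, which the crawler fills from the API's `gender` field (`'m'` for male, `'f'` for female). A sketch using a small inline frame in place of the real CSV:

```python
import pandas as pd

# Toy stand-in for the scraped file;
# in practice: df = pd.read_csv('weibo_comment_zhouyangqing1.csv')
df = pd.DataFrame({'性别': ['f', 'm', 'f', 'f', 'm', 'f']})

# Share of each gender among commenters
ratio = df['性别'].value_counts(normalize=True)
print(ratio)
```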
Hot-word distribution:
Comment word-frequency distribution:
Overall comment distribution
Distribution of comments about the male lead
Distribution of comments about the female lead
import pandas as pd
import jieba
import time
import csv
import re
from wordcloud import WordCloud
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('./text3.csv', encoding='gb18030')
items = data['评论'].astype(str).tolist()
print(len(data))

# Build the stop-word list
def stopwordslist():
    stopwords = [line.strip() for line in
                 open('./stop_word.txt', 'r', encoding='utf-8').readlines()]
    return stopwords

# Strip digits and Latin letters; returns '' for tokens made only of those
def remove_sub(input_str):
    punc1 = u'123456789.a-zA-Z'
    output_str = re.sub(r'[{}]+'.format(punc1), '', input_str)
    return output_str

alice_mask = np.array(Image.open('./b2.png'))
cloud = WordCloud(
    font_path="./ziti.ttf",   # set a Chinese font, otherwise characters render garbled
    background_color='white',
    mask=alice_mask,          # word-cloud shape
    max_words=200,            # maximum number of words shown
    max_font_size=200,        # largest font size
    random_state=1,
    width=400,
    height=800)

outstr = ''
stopwords = stopwordslist()   # build the stop-word list once, not per comment
for item in items:
    for j in jieba.cut(item, cut_all=False):
        if j not in stopwords:
            if not remove_sub(j):
                continue
            if j != '\t':
                outstr += j
                outstr += " "
print(len(outstr))

with open('./text.txt', 'a') as f:
    f.write(outstr)
    f.write('\n')

cloud.generate(outstr)
cloud.to_file('./pic6.png')
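The word-frequency charts can be produced from the same segmented text that feeds the word cloud. A minimal sketch with `collections.Counter`, assuming the comments have already been cut into tokens (by `jieba.cut` in the script above) and stop words removed; the sample tokens here are illustrative, not actual scraped data:

```python
from collections import Counter

# Tokens as produced by jieba.cut after stop-word filtering (toy sample)
tokens = ['多人运动', '时间管理', '周扬青', '时间管理', '分手', '时间管理']

freq = Counter(tokens)
for word, count in freq.most_common(3):
    print(word, count)
```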
One thing not scraped here is the comment timestamps. With those you could plot when the onlookers were commenting and see whether they are just as tireless around the clock. Worth a try if you are interested.
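If you do fetch the timestamps: each comment object in the API response appears to carry a `created_at` string (in the mobile API it looks like `'Sat Apr 25 23:10:00 +0800 2020'`, though verify this against a live response). A sketch of bucketing comments by hour of day, with toy timestamps standing in for real data:

```python
from collections import Counter
from datetime import datetime


def hour_of(created_at):
    """Parse a Weibo-style timestamp and return its hour (0-23)."""
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y').hour


# Toy timestamps in the format the comment API appears to use
times = ['Sat Apr 25 23:10:00 +0800 2020',
         'Sun Apr 26 01:45:00 +0800 2020',
         'Sat Apr 25 23:59:59 +0800 2020']

by_hour = Counter(hour_of(t) for t in times)
print(by_hour)
```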