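Scraping and Data Analysis Case Study: Comments

2021-08-10, by 皮皮大

Following the drama on Weibo 🍉

A while ago, the incident involving Wu and Du caused an uproar on Weibo and gave everyone plenty of gossip to follow. Peter collected some user comment data from the web to analyze how Weibo users view the matter. How things develop from here is for the law to decide fairly, justly, and openly; this article is only for presenting and analyzing data.

Page patterns

How was the data in this article obtained?

Weibo comments are loaded dynamically via AJAX: the URL in the address bar stays the same while different data is returned, but the URL of the request actually sent does change. Loading more comments four times in Chrome produced four different request URLs: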

main_url = "https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4660583661568436&is_show_bulletin=2&is_mix=0&count=20&uid=3591355593"

url2 = "https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4660583661568436&is_show_bulletin=2&is_mix=0&max_id=27722026381139524&count=20&uid=3591355593"

url3 = "https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4660583661568436&is_show_bulletin=2&is_mix=0&max_id=11156509319242784&count=20&uid=3591355593"

url4 = "https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4660583661568436&is_show_bulletin=2&is_mix=0&max_id=5162109363544403&count=20&uid=3591355593"

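main_url is the URL for the first page of top-level comments, and it differs from the others in having no max_id parameter; url2, url3, and url4 differ from one another only in the value of max_id. After some digging, the key turned out to be that the data returned by main_url contains the max_id for the next (second) page: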

image

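In the same way, the max_id returned by the second page is the max_id value used in the third page's URL.

⚠️ Summary: the max_id value returned in one page's data becomes the max_id value in the next page's URL.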
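The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: build_comments_url and crawl are hypothetical helper names, fetch_json stands for whatever function you use to send the request and parse the JSON, and max_id is assumed to sit at the top level of the response, with a missing or zero value meaning there is no further page.

```python
from urllib.parse import urlencode

BASE_URL = "https://weibo.com/ajax/statuses/buildComments"


def build_comments_url(max_id=None):
    """Build the request URL; the first page carries no max_id."""
    params = {
        "is_reload": 1,
        "id": 4660583661568436,
        "is_show_bulletin": 2,
        "is_mix": 0,
    }
    if max_id:
        params["max_id"] = max_id
    params["count"] = 20
    params["uid"] = 3591355593
    return BASE_URL + "?" + urlencode(params)


def crawl(fetch_json, max_pages=4):
    """Follow max_id from one page to the next.

    fetch_json is a caller-supplied (hypothetical) function that takes a URL
    and returns the parsed JSON response as a dict.
    """
    pages, max_id = [], None
    for _ in range(max_pages):
        page = fetch_json(build_comments_url(max_id))
        pages.append(page)
        max_id = page.get("max_id")  # assumed location of max_id in the response
        if not max_id:  # assumed: missing or 0 means no further page
            break
    return pages
```

Called without arguments, build_comments_url reproduces main_url; called with 27722026381139524 it reproduces url2.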

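Fields to scrape

1. User id
2. Comment time: comment_time
3. Weibo registration time: register_time
4. User city: city
5. User gender: gender
6. Comment text: comment
7. Comment likes: comment_like
8. Comment replies: comment_reply

Send a request to main_url to get the data and locate the fields to scrape (this is the response rendered as JSON):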

image

Now look at the data for a single user:

image

The fields scraped in this article:

1. User id

image

2. Comment time: comment_time

image

3. Weibo registration time: register_time

image

4. Comment text: comment

image

5. Comment likes: comment_like

image

6. Comment replies: comment_reply

image

7. User gender: gender

image

8. User city: city

image
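Pulling these eight fields out of one comment record might look like the sketch below. The response structure is only visible in the screenshots, so every JSON key name here (user, created_at, location, text_raw, like_counts, total_number) is an assumption to be checked against the real response; extract_fields is a hypothetical helper and the sample record is made up.

```python
def extract_fields(item):
    """Flatten one comment record into the eight columns used in this article.

    The JSON key names read here are assumed for illustration only.
    """
    user = item.get("user") or {}
    return {
        "userid": user.get("id"),
        "comment_time": item.get("created_at"),
        "register_time": user.get("created_at"),
        "city": user.get("location"),
        "gender": user.get("gender"),
        "comment": item.get("text_raw"),
        "comment_like": item.get("like_counts"),
        "comment_reply": item.get("total_number"),
    }


# Made-up record, only to show the shape the function expects
sample = {
    "created_at": "Mon Jul 19 08:06:52 +0800 2021",
    "text_raw": "加油[doge]",
    "like_counts": 12,
    "total_number": 3,
    "user": {
        "id": 1001,
        "created_at": "Thu Nov 30 07:47:02 +0800 2017",
        "location": "江苏 南京",
        "gender": "f",
    },
}
row = extract_fields(sample)
```

A list of such rows can be passed straight to pd.DataFrame to get one column per field.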

Preliminaries

Importing libraries

import pandas as pd
import numpy as np
from snownlp import SnowNLP
import time
import datetime as dt


# Word segmentation
import jieba

# Plotting
import matplotlib.pyplot as plt
from pyecharts.globals import CurrentConfig, OnlineHostType
from pyecharts import options as opts  # configuration options
from pyecharts.charts import Bar, Pie, Line, HeatMap, Funnel, WordCloud, Grid, Page  # chart classes
from pyecharts.commons.utils import JsCode
from pyecharts.globals import ThemeType,SymbolType

import plotly.express as px
import plotly.graph_objects as go

Loading the data

We read the data in with pandas and look at the first 5 rows:

image
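The loading code only survives as a screenshot, so here is a minimal sketch of the step. The real file name is not given in the article; an in-memory CSV with two made-up rows stands in for it, using the column names from the field list.

```python
import io

import pandas as pd

# Stand-in for the scraped file; both rows are made up for illustration
csv_text = """userid,comment_time,register_time,city,gender,comment,comment_like,comment_reply
1001,Mon Jul 19 08:06:52 +0800 2021,Thu Nov 30 07:47:02 +0800 2017,江苏 南京,f,加油[doge],12,3
1002,Mon Jul 19 09:15:10 +0800 2021,Sat Mar 05 21:30:00 +0800 2011,海外 美国,m,相信法律,5,0
"""

# With a real file this would be pd.read_csv("path/to/your_file.csv")
df = pd.read_csv(io.StringIO(csv_text))
df.head()  # first 5 rows (here only 2 exist)
```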

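Data exploration

The data exploration step covers:

Size of the data
Whether any values are missing
Column types; for example, the two time columns are strings, which we need to handle later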

image
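The three checks listed above map onto three standard pandas calls. The sketch below runs them on a tiny made-up DataFrame so it is self-contained; on the real data you would call the same methods on df.

```python
import pandas as pd

# Tiny made-up frame: one missing city, time column stored as text
df = pd.DataFrame({
    "userid": [1001, 1002, 1003],
    "comment_time": ["Mon Jul 19 08:06:52 +0800 2021"] * 3,
    "city": ["江苏 南京", None, "北京"],
})

print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # missing values per column
print(df.dtypes)          # column types; the time column is still text here
```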

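Data preprocessing

Preprocess the scraped data:

Convert the comment time comment_time and registration time register_time into a familiar format
Comments contain emoticon tags in square brackets, such as [doge]; we keep the part before the emoticon
Convert gender from f and m to 女 (female) and 男 (male)

Time preprocessing

Times are handled with the datetime library, imported at the top under the alias dt. The scraped timestamps are strings in the format below; note that they carry a +0800 (UTC+8) offset, so they are not Greenwich Mean Time. They are converted as follows:

Mon Jul 19 08:06:52 +0800 2021
Thu Nov 30 07:47:02 +0800 2017

# Format of the raw timestamps: weekday, month, day, time, UTC offset, year
transfer_std = "%a %b %d %H:%M:%S %z %Y"

df["comment_time"] = df["comment_time"].apply(lambda x: dt.datetime.strptime(x, transfer_std))
# register_time uses the same format and is needed later to compute account age
df["register_time"] = df["register_time"].apply(lambda x: dt.datetime.strptime(x, transfer_std))
df.head()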

image

Comment processing

This mainly removes the emoticons:

image
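The original code is only in the screenshot. One simple way to keep the part before the emoticon, as described in the preprocessing list, is to cut the string at the first opening bracket. strip_emoticons is a hypothetical helper name, and the approach assumes square brackets appear only in emoticon tags.

```python
def strip_emoticons(text):
    """Keep only the text before the first '[' emoticon tag, e.g. '[doge]'."""
    return str(text).split("[")[0].strip()


print(strip_emoticons("加油[doge]"))
print(strip_emoticons("相信法律"))
```

On the DataFrame this would be applied as df["comment"] = df["comment"].apply(strip_emoticons). A comment that consists only of an emoticon becomes an empty string, which is why the word cloud step later filters on df["comment"] != "".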

Gender processing

Change f to 女 (female) and m to 男 (male) in the data, which is more intuitive and easier to read

image
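A dictionary passed to Series.map is one straightforward way to do this replacement; the sketch below uses a made-up three-row column rather than the author's exact code.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["f", "m", "f"]})  # made-up sample

# Values missing from the mapping would become NaN
df["gender"] = df["gender"].map({"f": "女", "m": "男"})
print(df)
```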

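User profile

The user profile analyzes the commenters along several dimensions: gender, city, Weibo account age, comment likes, comment replies, and so on

Gender

Group the users by gender and count them: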

image
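A grouped count of this kind can be sketched as follows, assuming the userid and gender columns from the field list; the four rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "userid": [1, 2, 3, 4],
    "gender": ["女", "男", "女", "女"],
})  # made-up sample

# Number of commenting users per gender
gender = df.groupby("gender")["userid"].count().reset_index()
print(gender)
```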

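Although there are only 1000+ top-level comments, the result shows that Wu's fans are still mostly women, far outnumbering men

City

The goal is to see which cities pay the most attention to Wu. For convenience, we keep only each user's province: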

image
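The province step is shown only as a screenshot. The sketch below assumes the location string has the form "province city" separated by a space, which is not confirmed by the source, and builds a city table with the userid and city columns that the bar chart code expects.

```python
import pandas as pd

df = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5],
    "city": ["江苏 南京", "江苏 苏州", "北京", "海外 美国", "广东 广州"],
})  # made-up sample

# Keep the part before the first space (assumed to be the province)
df["city"] = df["city"].str.split(" ").str[0]

# Users per province, largest first
city = (
    df.groupby("city")["userid"]
    .count()
    .reset_index()
    .sort_values("userid", ascending=False)
)
print(city)
```

Sorting in descending order and then plotting city[::-1] puts the largest bar at the top of a horizontal bar chart, since plotly draws the first row at the bottom.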

fig = px.bar(city[::-1],
             x="userid",
             y="city",
             text="userid",
             color="userid",
             orientation="h"
            )

fig.update_traces(textposition="outside")
fig.update_layout(height=800,width=1000)
fig.show()

image

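The bar chart shows:

Many users have no province (city) information
Among users who did fill it in, developed provinces and municipalities such as Jiangsu, Zhejiang, Beijing, and Guangdong pay more attention to Wu
Quite a few overseas users are following as well

User Weibo account age

This is the interval between a user's registration and their comment on this post

Compute the interval between the comment time and the registration time
Extract the number of days (days) from the interval
Convert days to years, dropping any partial year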

image
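The three steps above can be sketched as follows. The first row uses the two sample timestamps shown in the time preprocessing section; the second row's registration time is made up to show an account under one year old. Dividing by 365 is an assumption about how the author converted days to years.

```python
import datetime as dt

import pandas as pd

fmt = "%a %b %d %H:%M:%S %z %Y"


def parse(s):
    return dt.datetime.strptime(s, fmt)


df = pd.DataFrame({
    "comment_time": [parse("Mon Jul 19 08:06:52 +0800 2021")] * 2,
    "register_time": [
        parse("Thu Nov 30 07:47:02 +0800 2017"),
        parse("Fri Jan 01 00:00:00 +0800 2021"),  # made-up new account
    ],
})

# Step 1 and 2: interval between the two times, then its whole-day part
df["days"] = (df["comment_time"] - df["register_time"]).dt.days
# Step 3: whole years, dropping the remainder (365-day year assumed)
df["year"] = df["days"] // 365
print(df[["days", "year"]])
```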

px.scatter(df,x="comment_time",y="days",color="year",size="days")

image

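Summary of account age:

Wu's post was published on July 19, and most of the comments were made that same day
The oldest accounts are as much as 11 years old, while some are less than a year old, that is, registered this year

Likes

The goal is to see which comments received the most likes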

image
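A ranking like this can be built by sorting on comment_like and keeping the top rows. The sketch below produces a dianzan table with the comment and comment_like columns that the bar chart code expects; the cut-off of 10 rows and the sample data are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "comment": ["comment a", "comment b", "comment c"],
    "comment_like": [5, 120, 40],
})  # made-up sample

# Most-liked comments first; keep the top 10 (cut-off assumed)
dianzan = (
    df.sort_values("comment_like", ascending=False)
    .head(10)[["comment", "comment_like"]]
)
print(dianzan)
```

The same pattern with comment_reply in place of comment_like gives the ranking by replies.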

fig = px.bar(dianzan[::-1],
             x="comment_like",
             y="comment",
             text="comment_like",
             orientation="h"
            )

fig.show()

image

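In the data Peter scraped at the time, the most-liked comment was: 滚！！！ ("Get lost!!!")

How blunt!

Replies

The result shows it is the same comment again: 滚！

image

Distribution of likes and replies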

px.scatter(df,
           x="comment_like",
           y="comment_reply",
           size="days",
           facet_col="year",
           facet_col_wrap=3,  # at most 3 panels per row
           color="year")

image

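Looking at likes and replies for users of different account ages:

Most commenters have accounts 5 to 10 years old; new users commented relatively little
For accounts 5 or 6 years old, likes and replies are more concentrated, suggesting these users lean the same way; users in other age groups are more scattered

Comment word cloud

Use jieba word segmentation to build a word cloud of the user comments: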

df1 = df[df["comment"] != ""]  # keep only rows with a non-empty comment
comment_list = df1["comment"].tolist()

# Word segmentation
comment_jieba_list = []
for comment in comment_list:
    # jieba precise mode (cut_all=False)
    seg_list = jieba.cut(str(comment).strip(), cut_all=False)
    comment_jieba_list.extend(seg_list)

# Load the stopword list, one word per line
def StopWords(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

# Path to the stopword file
stopwords = StopWords("/Users/peter/Desktop/WeChat/文章/spider/nlp_stopwords.txt")

# Drop stopwords and blank tokens (assumed step; stopword_list is otherwise undefined)
stopword_list = [w for w in comment_jieba_list if w.strip() and w not in stopwords]

# Word frequency count
comment_result = pd.Series(stopword_list).value_counts().rename_axis("word").reset_index(name="number")
comment_result

image

Draw the word cloud for all comments:

rec_words = [tuple(z) for z in zip(comment_result["word"].tolist(), comment_result["number"].tolist())]


# Draw with the WordCloud class
c = (
    WordCloud()
    .add("", rec_words, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
    .set_global_opts(title_opts=opts.TitleOpts(title="微博评论词云图"))
)

c.render_notebook()

image

Next we take the 50 most frequent words and draw the word cloud:

rec_words = [tuple(z) for z in zip(comment_result["word"].tolist(), comment_result["number"].tolist())]


# Draw with the WordCloud class
c = (
    WordCloud()
    .add("", rec_words[:50], word_size_range=[20, 100], shape=SymbolType.DIAMOND)
    .set_global_opts(title_opts=opts.TitleOpts(title="微博评论词云图"))
)

c.render_notebook()

image

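The full word cloud and the top-50 word cloud show:

滚 ("get lost"): still as eye-catching as ever 😭
Many users used Wu's name or nicknames in their comments: 凡凡, 凡哥, or even 哥哥 ("big brother")
Many users were also cheering Wu on: 加油 ("keep going"), 支持 ("support"), 喜欢 ("like"), and so on
Many law-related words appear: 法律 (law), 监狱 (prison), 坐牢 (serving time), 违法 (breaking the law), 司法公正 (judicial fairness), 真相 (truth), which suggests many fans are viewing the matter fairly rationally

Once again, a solemn statement: this article is only for learning and demonstrating data analysis; whatever the outcome, we trust the law will reach a fair, just, and open conclusion 🍉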
