Homework Analysis (1): Copying the Homework

2018-12-20 Lykit01

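I typed out both @群主's (the group owner's) and @小佳's code by hand to learn the basics of web crawling and pandas. I ran into quite a few problems along the way, so here is a summary.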

I. The crawler code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import pymysql
import re
import datetime

headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}

def get_page_index(number):
    url = 'https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page=%s'%number
    try:
        responses=requests.get(url,headers=headers)
        if responses.status_code==200:
            return responses.text
        return None
    except requests.exceptions.RequestException:
        print('Request ERR!')
        return None

def parse_index_page(html):
    soup = BeautifulSoup(html, 'html.parser')
    note_list = soup.find_all('ul', class_='note-list')[0]
    content_li = note_list.find_all('li')
    links={}
    for item in content_li:
        anchor = item.find_all('a', class_='title')[0]
        title=anchor.contents[0]
        link='http://www.jianshu.com'+anchor.get('href')
        #titles may repeat, but links are unique, so key on the link
        links[link]=title
    return links

def get_page_detail(url):
    try:
        responses=requests.get(url,headers=headers)
        if responses.status_code==200:
            return responses.text
        return None
    except requests.exceptions.RequestException:
        print('DETAIL Requests ERR!')
        return None

def parse_detail_page(title,html):
    soup=BeautifulSoup(html,'html.parser')
    name=soup.find_all('div',class_='info')[0].find_all('span',class_='name')[0].find_all('a')[0].contents[0]
    content_detail=soup.find_all('div',class_='info')[0].find_all('div',class_='meta')[0].find_all('span')

    content_detail=[info.contents[0] for info in content_detail]
    publish_time=content_detail[0].strip('*')#strip the * marks up front
    word=content_detail[1]
    word_num=int(word.split()[1])

    texts=soup.find_all('div',class_='show-content-free')
    #strip all HTML tags with a regular expression
    reg1 = re.compile("<[^>]*>")
    contents = reg1.sub('', str(texts))
    #drop every newline character
    contents=contents.replace('\n','')

    return title,name,publish_time,word_num,contents

def save_to_mysql(title,name,publish_time,word_num,contents):
    cur=CONN.cursor()
    insert_cmd="INSERT INTO exercise(name,title,publish_time,word_num,contents)" "VALUES(%s,%s,%s,%s,%s)"
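    val=(name,title,publish_time,word_num,contents)#passing val straight into insert_data raises an error
    try:
        cur.execute(insert_cmd,val)
        CONN.commit()
    except pymysql.MySQLError as err:
        #report the failed insert instead of swallowing it, then roll back
        print('MySQL ERR!',err)
        CONN.rollback()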

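#keyword arguments: recent PyMySQL releases no longer accept positional ones
CONN=pymysql.connect(host='localhost',user='root',password='123456',database='crazydatadb')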
def main():
    #the filter window is fixed, so parse it once outside the loops
    start_time=datetime.datetime.strptime('2018.12.10 00:00','%Y.%m.%d %H:%M')
    end_time=datetime.datetime.strptime('2018.12.17 00:00','%Y.%m.%d %H:%M')
    for number in range(1,12):
        html=get_page_index(number)
        if html is None:#skip index pages that failed to download
            continue
        for link,title in parse_index_page(html).items():
            detail=get_page_detail(link)
            if detail is None:#skip articles that failed to download
                continue
            title,name,publish_time,word_num,contents=parse_detail_page(title,detail)
            #print(title, name, publish_time, word_num,contents)
            if start_time<datetime.datetime.strptime(publish_time,'%Y.%m.%d %H:%M')<end_time:
                save_to_mysql(title,name,publish_time,word_num,contents)

if __name__=='__main__':
    main()

A few details I added:

1. Scrape the article body and strip the HTML tags with a regular expression:

    texts=soup.find_all('div',class_='show-content-free')
    reg1 = re.compile("<[^>]*>")
    contents = reg1.sub('', str(texts))
    contents=contents.replace('\n','')
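To make the tag-stripping step easier to try in isolation, here is a minimal standalone sketch of the same idea. The helper name `strip_tags` and the sample HTML are made up for illustration; the regular expression is the one used above. The `html.unescape` call is an extra step that is not in the crawler, added because a tag-only regex leaves entities such as `&amp;` untouched.

```python
import html
import re

# Same pattern as in the crawler: anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(markup: str) -> str:
    """Remove HTML tags, decode entities, and drop newline characters."""
    text = TAG_RE.sub("", markup)
    text = html.unescape(text)      # extra step: '&amp;' becomes '&'
    return text.replace("\n", "")   # same effect as the join/strip line

# Hypothetical sample, not taken from a real article page.
sample = '<div class="show-content-free"><p>Hello,\n<b>pandas</b> &amp; SQL!</p></div>'
print(strip_tags(sample))
```

A regex like this is fine for a quick cleanup, but it does not understand HTML: it will happily strip a literal `<` ... `>` that appears inside article text or a code sample.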

2. Filter by time, keeping only the articles from the first week

This uses the datetime package, whose objects can be compared directly:

start_time=datetime.datetime.strptime('2018.12.10 00:00','%Y.%m.%d %H:%M')
end_time=datetime.datetime.strptime('2018.12.17 00:00','%Y.%m.%d %H:%M')
if start_time<datetime.datetime.strptime(publish_time,'%Y.%m.%d %H:%M')<end_time:
    save_to_mysql(title,name,publish_time,word_num,contents)
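The comparison can be tested without the crawler or the database. A small sketch, where the function name `in_first_week` and the sample timestamps are illustrative only:

```python
from datetime import datetime

FMT = "%Y.%m.%d %H:%M"
START = datetime.strptime("2018.12.10 00:00", FMT)
END = datetime.strptime("2018.12.17 00:00", FMT)

def in_first_week(publish_time: str) -> bool:
    """True if publish_time falls strictly between START and END."""
    published = datetime.strptime(publish_time, FMT)
    return START < published < END

print(in_first_week("2018.12.16 23:38"))  # inside the window
print(in_first_week("2018.12.17 00:00"))  # equal to END, so excluded
print(in_first_week("2018.12.10 00:00"))  # equal to START, so excluded too
```

Because both comparisons are strict, an article stamped exactly 2018.12.10 00:00 would be dropped; writing `START <= published < END` would keep it.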

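3. A caveat

The HTML fetched by requests and parsed with 'html.parser' does not contain the comment, like, or view counts, even though their tags are visible in the browser's Inspect tool. That is because this part of the page is filled in dynamically by JavaScript. I have not found a way to scrape it yet; if anyone has worked it out, please let me know!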

II. Data analysis code

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import pymysql
from os import path
#data analysis and visualization
import pandas as pd
import matplotlib.pyplot as plt

#word cloud
from wordcloud import WordCloud
import jieba

from dateutil.parser import parse
#directory containing this script
DIR = path.dirname(__file__)

#load the data
CONN=pymysql.connect(host='localhost',user='root',password='123456',database='crazydatadb')
select_cmd="SELECT * from exercise"
data=pd.read_sql(select_cmd,CONN)
data.info()
print(data.head())
print(data.name.nunique())

#data cleaning: convert the publish_time strings to timestamps
#such as 2018-12-16 23:38:00, for the whole column at once
data.publish_time=pd.to_datetime(data.publish_time)

#date
data['date']=data.publish_time.dt.day
data.groupby(['date'])['name'].count().plot(kind='bar')
plt.show()

data['hour']=data.publish_time.dt.hour
data.groupby(['hour'])['name'].count().plot(kind='bar')
plt.show()

#word count
data.word_num=data.word_num.astype('int')#change the data type
print(data.groupby(['name'])['word_num'].sum().describe())
print(data.groupby(['name'])['word_num'].sum().sort_values(ascending=False).head(5))

#number of articles
print(data.groupby(['name'])['title'].count().sort_values(ascending=False).head(5))

#word cloud
#title analysis: concatenate all titles into one string
titles=''.join(data['title'])
#full-text analysis: concatenate all article bodies into one string
contents=''.join(data['contents'])

def word_segment(text):
    #load the stop-word list (one word per line; assumed to be saved as UTF-8)
    with open(path.join(DIR,'crazydata_stopwords.txt'),encoding='utf-8') as f:
        stopwords=set(f.read().split('\n'))
    #load the user-defined dictionary
    jieba.load_userdict(path.join(DIR,'crazydatadict.txt'))
    segs=jieba.cut(text)
    text_list=[]
    #text cleaning: drop stop words, blanks, and single-character tokens
    for seg in segs:
        if seg not in stopwords and seg!=' ' and len(seg)!=1:
            text_list.append(seg.replace(' ',''))
    cloud_text=','.join(text_list)
    #print(len(cloud_text))

    wc=WordCloud(background_color='White',max_words=200,font_path='./fonts/simhei.ttf',min_font_size=15, max_font_size=60, width=400)
    wc.generate(cloud_text)
    wc.to_file(text_list[0]+'crazy_word.png')

word_segment(titles)
word_segment(contents)

See @小佳's post for a detailed walkthrough. Because I restricted the time window, I collected 64 articles. Here are the results:

Data summary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 6 columns):
id              64 non-null int64
title           64 non-null object
name            64 non-null object
publish_time    64 non-null object
word_num        64 non-null int64
contents        64 non-null object
dtypes: int64(2), object(4)
memory usage: 3.1+ KB

Articles: 64; authors: 48
Word-count statistics (total words per author):

count      48.000000
mean      916.625000
std      1575.740159
min        64.000000
25%       299.750000
50%       421.500000
75%       647.750000
max      8617.000000

Total words (top 5 authors):

夜希辰          8617
Lykit01      5884
1点点De小任性丶    4305
262153       3559
凡人求索         2005

Number of articles (top 5 authors):

夜希辰          5
凡人求索         3
Lykit01      3
1点点De小任性丶    3
肖月_1d28      3

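The word counts vary enormously: the standard deviation (about 1576) is larger than the mean (about 917). 夜希辰 is the most prolific author, and I should learn from him!
The plots below may make the comparison clearer. (The Chinese text in the plots does not render correctly; I need to look into how to fix that later.)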

Figure: word count
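For the garbled Chinese labels mentioned above, one commonly used approach is to point matplotlib's default sans-serif family at a font that contains Chinese glyphs. This is only a sketch, not something tested in this project. It assumes a font named SimHei (the same family as the simhei.ttf used for the word cloud) is installed where matplotlib can find it; with another font, substitute its name.

```python
import matplotlib.pyplot as plt

# Assumption: a font named 'SimHei' is installed and visible to matplotlib.
plt.rcParams['font.sans-serif'] = ['SimHei']
# Many CJK fonts lack the Unicode minus sign, so use the ASCII hyphen instead.
plt.rcParams['axes.unicode_minus'] = False
```

These settings have to run before the plots are drawn.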

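Publication time

The chart shows only 6 days. This is not a counting error: nobody posted on the 10th. You can see that posts are concentrated on the weekend.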

Figure: distribution by date
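A day with no posts simply never appears in a grouped count, which is why the 10th is missing from the chart. The sketch below shows one way to make such gaps explicit by filling every day in the window with a count, using only the standard library rather than the pandas code from this post. The function name `posts_per_day` and the sample timestamps are invented for illustration.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y.%m.%d %H:%M"

def posts_per_day(publish_times, first_day, last_day):
    """Count posts per day of month, keeping days that have no posts as 0."""
    counts = Counter(datetime.strptime(t, FMT).day for t in publish_times)
    return {day: counts.get(day, 0) for day in range(first_day, last_day + 1)}

# Made-up timestamps, not the real data set.
sample = ["2018.12.11 10:58", "2018.12.15 22:10",
          "2018.12.16 23:38", "2018.12.16 11:02"]
per_day = posts_per_day(sample, 10, 16)
print(per_day)
```

With the zero entries present, a bar chart built from `per_day` would show all seven days, including the empty ones.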

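Publication times cluster around 11:00 and between 22:00 and 24:00. After half a day or a full day of study, that is a good moment to write a summary.

Figure: distribution by hour

Word cloud

When building the word clouds I loaded a stop-word list to filter out noise words. I also filtered out these common words: 时间 (time), 事件 (event), 知识 (knowledge), 内容 (content), 学习 (study).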

 f=open(path.join(DIR,'crazydata_stopwords.txt'))

I also loaded some custom words, although they do not show up in the results below; I plan to add more later. The current entries are: 凡人求索, 小佳, 简书, 专栏, 群主, 疯狂数据分析

 jieba.load_userdict(path.join(DIR,'crazydatadict.txt'))

(When loading a custom dictionary, make sure the txt file is saved as UTF-8; txt files are saved as ANSI by default.)
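The encoding point applies to any word list read from disk. Here is a minimal sketch of reading a stop-word file explicitly as UTF-8 and applying the same kind of filter as in `word_segment`. It uses a temporary demo file and a hand-written token list instead of jieba output; `load_stopwords`, `filter_tokens`, and the file name are hypothetical.

```python
import os
import tempfile

def load_stopwords(file_path):
    """Read one stop word per line, decoding the file explicitly as UTF-8."""
    with open(file_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def filter_tokens(tokens, stopwords):
    """Drop stop words, blanks, and single-character tokens."""
    return [t for t in tokens if t not in stopwords and len(t.strip()) > 1]

# Write a tiny demo file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_stopwords.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("时间\n学习\n")
    stopwords = load_stopwords(demo_path)

tokens = ["学习", "计划", "的", "python", "时间", " "]
kept = filter_tokens(tokens, stopwords)
print(kept)
```

Passing `encoding="utf-8"` makes the result independent of the operating system's default encoding, which is where ANSI-saved files usually cause trouble.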

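Figure: word cloud of article titles

Judging by the titles, the focus is on planning (计划), which fits the first week's theme well. The four topics python, SQL, statistics, and hands-on practice all show up, with statistics-related words being especially common. Beginner words such as 小白 (newbie), 入门 (getting started), and 初步 (preliminary) are also frequent, so there seem to be many newcomers. Let's all keep at it!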

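Figure: word cloud of article bodies

The body text carries much richer information. Beyond what the titles show, there are many details; for tools, for example, excel appears alongside python. The lower-left corner even has 头发 (hair) twice, haha. I hope we all get stronger! In the lower-right corner, 方言 (dialect) and 调查 (survey) sneaked in. Those are my own contribution: I used these words a lot in my examples, and I tend to be wordy.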

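Once this week is over, a week-over-week comparison for each author would also be possible, which sounds fun. Another idea is to analyze each person's articles to see what their study focuses on. I will try these if I have spare time; that is enough analysis for now.
小佳 already summarized things very well, so I won't repeat it. Here are a few problems I struggled with as a beginner:
1. The line below is meant for Jupyter Notebook. When you call a matplotlib.pyplot plotting function such as plot(), or create a figure, it renders the image directly in the notebook output. In Spyder or PyCharm you can comment it out.
%matplotlib inline
In PyCharm, call plt.show() to display the figure instead.
2. When importing pandas, numpy, etc. fails, the installation may be broken. Try uninstalling and then reinstalling: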

pip uninstall numpy
pip install numpy

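Reinstalling also sets up the package's environment configuration again. There are other methods online, but in my experience none worked as well as this one.
3. With Anaconda, try to avoid pip install pyqt5 and conda upgrade --all, since they may leave Anaconda unable to start. I have not figured out the exact cause; I only fixed it by reinstalling Anaconda, which was a hassle.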

Tomorrow: LeetCode solutions!