Decoding Big Data: Dianxin Reading Club

Re-studying fish's Qianliao course, part 03

2017-09-22  Bog5d

This lesson covers basic statistics. Many of the concepts are worth memorizing.

Descriptive statistics

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
df = pd.read_csv('HRSalaries.csv')
df.head()
len(df)

30284
The original column names are cumbersome, so let's rename them to something shorter.
df.columns=['id','p_ti','dep','an_s','re_s']
df.head()


df.dep.value_counts()
POLICE                   12461
FIRE                      4798
SANITATION                2092
WATER MGMT                1796
AVIATION                  1252
TRANSPORTATION            1196
EMERGENCY MGMT            1182
GENERAL SERVICES           922
PUBLIC LIBRARY             874
FAMILY & SUPPORT           719
HEALTH                     568
FINANCE                    533
LAW                        455
CITY COUNCIL               265
BUILDINGS                  261
COMMUNITY DEVELOPMENT      216
BUSINESS AFFAIRS           177
DoIT                        97
MAYOR'S OFFICE              96
PROCUREMENT                 77
CULTURAL AFFAIRS            76
HUMAN RESOURCES             61
ANIMAL CONTRL               57
DISABILITIES                29
TREASURER                   24
Name: dep, dtype: int64
df.groupby('dep').size()

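This gives the same counts as value_counts(). The only difference is the order: groupby('dep').size() is sorted by department name, while value_counts() is sorted by count, largest first.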

dep
ANIMAL CONTRL               57
AVIATION                  1252
BUILDINGS                  261
BUSINESS AFFAIRS           177
CITY COUNCIL               265
COMMUNITY DEVELOPMENT      216
CULTURAL AFFAIRS            76
DISABILITIES                29
DoIT                        97
EMERGENCY MGMT            1182
FAMILY & SUPPORT           719
FINANCE                    533
FIRE                      4798
GENERAL SERVICES           922
HEALTH                     568
HUMAN RESOURCES             61
LAW                        455
MAYOR'S OFFICE              96
POLICE                   12461
PROCUREMENT                 77
PUBLIC LIBRARY             874
SANITATION                2092
TRANSPORTATION            1196
TREASURER                   24
WATER MGMT                1796
dtype: int64
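To make the equivalence concrete, here is a minimal sketch on a made-up toy frame (the rows below are invented for illustration and are not taken from HRSalaries.csv). It checks that both calls produce the same counts once they are put in the same order.

```python
import pandas as pd

# Toy stand-in for the 'dep' column; values are made up for illustration
toy = pd.DataFrame({"dep": ["POLICE", "FIRE", "POLICE", "LAW", "FIRE", "POLICE"]})

by_count = toy["dep"].value_counts()   # sorted by count, largest first
by_name = toy.groupby("dep").size()    # sorted by group label

# After aligning the order, the counts are identical
same = by_count.sort_index().tolist() == by_name.sort_index().tolist()
print(same)
```

If a count-sorted result is wanted from the groupby version, chaining .sort_values(ascending=False) should give the same ordering idea as value_counts().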
len(df.dep.unique())

25

Central tendency

Mean

Arithmetic mean

$$ \frac{\sum_i x_i}{N}$$

salary = df.an_s
salary.head()
0    16140
1    71506
2    61085
3    81928
4    50379
Name: an_s, dtype: int64
salary.sum()/len(salary)
## Sum all salaries, then divide by the number of people paid. This is the long way, of course
60836.98560295866
np.mean(salary)
## The NumPy way
60836.98560295866
salary.mean()

60836.98560295866
df.groupby('dep').an_s.mean().sort_values(ascending=False)

When I tried to write this line from memory, I made all sorts of mistakes.

dep
DoIT                     73831.979381
BUILDINGS                72137.885057
FIRE                     69383.989996
MAYOR'S OFFICE           68953.677083
WATER MGMT               64760.186526
COMMUNITY DEVELOPMENT    64262.597222
GENERAL SERVICES         63747.808026
TREASURER                63497.500000
POLICE                   63127.904984
TRANSPORTATION           62947.504181
PROCUREMENT              61452.584416
HEALTH                   61213.503521
CULTURAL AFFAIRS         61181.894737
DISABILITIES             58058.586207
BUSINESS AFFAIRS         57216.067797
HUMAN RESOURCES          57108.163934
LAW                      55917.958242
AVIATION                 55816.200479
SANITATION               55555.813576
FINANCE                  54286.375235
ANIMAL CONTRL            47604.473684
PUBLIC LIBRARY           44241.731121
EMERGENCY MGMT           42845.754653
CITY COUNCIL             38046.547170
FAMILY & SUPPORT         31193.307371
Name: an_s, dtype: float64

Median

len(salary)
# Count how many people draw a salary
30284
sa_s = salary.sort_values()
sa_s.head()
16629    3128
2247     3132
20961    3133
13423    3135
25422    3135
Name: an_s, dtype: int64
(sa_s.iloc[15142]+sa_s.iloc[15141])/2
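The median by hand: with 30284 sorted values, it is the average of the two middle ones (positions 15141 and 15142, counting from 0).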
61836.0
salary.median()
61836.0
np.median(salary)
61836.0
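The hand calculation above generalizes as follows. This is only a sketch with a hypothetical helper name (median_by_hand) and made-up sample values; it picks the middle value for an odd count and averages the two middle values for an even count, which is what the iloc expression does for 30284 rows.

```python
import numpy as np

def median_by_hand(values):
    """Median via sorting: the middle value for odd n, the mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2

even_sample = [3, 1, 4, 2]   # made-up numbers
odd_sample = [5, 1, 3]

print(median_by_hand(even_sample), np.median(even_sample))
print(median_by_hand(odd_sample), np.median(odd_sample))
```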
plt.hist(salary,bins=30,rwidth=0.5)
plt.show()
output_21_0.png
np.mean(salary)>np.median(salary)

Is the mean greater than the median? No. The mean is smaller than the median, so the distribution is skewed to the left.

False
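Comparing mean and median is a quick heuristic for skew rather than a strict rule. As a rough cross-check, pandas also offers Series.skew(), whose sign is negative for a left tail. The sample below is made up for illustration.

```python
import pandas as pd

# Made-up sample with a long tail toward small values
left_tailed = pd.Series([1, 7, 8, 8, 9, 9, 9, 10])

mean_val = left_tailed.mean()
median_val = left_tailed.median()
skewness = left_tailed.skew()   # sample skewness; negative suggests a left tail

print(mean_val < median_val, skewness < 0)
```

On this toy sample the two signals agree: the single small value pulls the mean below the median, and the skewness comes out negative.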

Income distribution in the Fire department

f_s = df[df.dep=='FIRE'].an_s

plt.hist(f_s,bins=30,rwidth=0.6)
plt.show()

My variable names are all abbreviations. That would not do in formal work, because nobody but me can read them.

output_24_0.png
f_s.median()
66260.0
f_s.mean()
69383.9899958316
f_s.median()<f_s.mean()
True

The mean is greater than the median, so the distribution is skewed to the right.

Dispersion

Range

salary.max()-salary.min()
The range computed the long way
198320
salary.range()

This was a failed attempt to see whether a Series has a built-in method for the range. It does not, as the error below shows.

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-54-1a6fc05747b6> in <module>()
----> 1 salary.range()


F:\Program Files\conda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   2670             if name in self._info_axis:
   2671                 return self[name]
-> 2672             return object.__getattribute__(self, name)
   2673 
   2674     def __setattr__(self, name, value):


AttributeError: 'Series' object has no attribute 'range'
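Although a Series has no range() method, NumPy provides np.ptp ("peak to peak"), which returns max minus min. A minimal sketch, using a handful of salary values copied from the outputs above rather than the full column:

```python
import numpy as np
import pandas as pd

# A few values borrowed from the outputs above; not the full salary column
toy_salary = pd.Series([3128, 50379, 61085, 81928])

range_manual = toy_salary.max() - toy_salary.min()
range_ptp = np.ptp(toy_salary.to_numpy())   # peak to peak = max - min

print(range_manual, range_ptp)
```

Converting with to_numpy() first keeps the call independent of how a given pandas version interacts with np.ptp.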

Interquartile range (IQR)

Q1 = salary.quantile(0.25)
Q1

55671.75
Q3 = salary.quantile(0.75)
Q3
68558.5
IQR=Q3-Q1
IQR
12886.75
salary.quantile(0.5) - salary.median()

The median is simply the 0.5 quantile.

0.0
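The quartiles also drive the box plots drawn next. As a sketch on made-up numbers: matplotlib's default box plot flags points lying more than 1.5 IQR beyond the box as outliers, and the same fences can be computed by hand (the variable names here are illustrative, not from the course).

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])  # made-up values with one extreme point

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy[(toy < lower_fence) | (toy > upper_fence)]

print(q1, q3, iqr, outliers.tolist())
```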
salary.plot(kind='box',vert=False,figsize=(6,3))

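My understanding: this style of plotting starts from the data and then asks for a plot. It is like having a pile of cotton and then deciding to make a quilt.
There is another style: first say that you want a quilt, then fetch the cotton, for example plt.plot(salary).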

<matplotlib.axes._subplots.AxesSubplot at 0x23883b030b8>
output_37_1.png
plt.boxplot(salary)
{'boxes': [<matplotlib.lines.Line2D at 0x23883bd7748>],
 'caps': [<matplotlib.lines.Line2D at 0x23883bddac8>,
  <matplotlib.lines.Line2D at 0x23883be39b0>],
 'fliers': [<matplotlib.lines.Line2D at 0x23883be8a20>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x23883be3b38>],
 'whiskers': [<matplotlib.lines.Line2D at 0x23883bd7978>,
  <matplotlib.lines.Line2D at 0x23883bdd940>]}
output_38_1.png

Salaries in the IT department (DoIT)

it_s =df[df.dep=='DoIT'].an_s.tolist()


What does this do? tolist() converts the Series into a plain Python list.

plt.boxplot(it_s)


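Question: without converting to a list, the data could not be read here. Why? A likely cause: the filtered Series keeps its original row labels (708, 879, ...), and older matplotlib versions look up x[0] on the input, which fails when no row is labelled 0. Passing a list (tolist()) or an array (.values) sidesteps the label lookup.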
{'boxes': [<matplotlib.lines.Line2D at 0x23884b272b0>],
 'caps': [<matplotlib.lines.Line2D at 0x23884b2cd30>,
  <matplotlib.lines.Line2D at 0x23884b2ceb8>],
 'fliers': [<matplotlib.lines.Line2D at 0x23884a8c940>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x23884a8c358>],
 'whiskers': [<matplotlib.lines.Line2D at 0x23884b27cc0>,
  <matplotlib.lines.Line2D at 0x23884b27e48>]}
output_41_1.png
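One plausible explanation for the tolist() question is the difference between label-based and position-based lookup. The sketch below uses made-up rows and does not reproduce the matplotlib error itself; it only shows that a filtered Series has no label 0, while a list is always 0-based.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({"dep": ["LAW", "DoIT", "LAW", "DoIT"],
                    "an_s": [50000, 64986, 52000, 80746]})

it_series = toy[toy.dep == "DoIT"].an_s   # keeps the original row labels 1 and 3

try:
    it_series[0]            # label-based lookup: there is no row labelled 0
    label_lookup_ok = True
except KeyError:
    label_lookup_ok = False

first_by_position = it_series.iloc[0]     # positional lookup works
as_list = it_series.tolist()              # a plain list is always 0-based

print(label_lookup_ok, first_by_position, as_list)
```

Calling reset_index(drop=True) on the filtered Series would be another way to get labels starting at 0 again.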

Comparing salaries in the Buildings and IT departments
Two groups can also be drawn in the same plot

bld_s =df[df.dep=='BUILDINGS'].an_s
bld_s.head()
158    60255
395    75517
523    60633
608    66145
884    68463
Name: an_s, dtype: int64
plt.boxplot([it_s,bld_s],labels=["IT","BUILDING"])

{'boxes': [<matplotlib.lines.Line2D at 0x23884ba1588>,
  <matplotlib.lines.Line2D at 0x23884e439b0>],
 'caps': [<matplotlib.lines.Line2D at 0x23884ba7ef0>,
  <matplotlib.lines.Line2D at 0x23884baf860>,
  <matplotlib.lines.Line2D at 0x23884e4f978>,
  <matplotlib.lines.Line2D at 0x23884e4fb00>],
 'fliers': [<matplotlib.lines.Line2D at 0x23884e438d0>,
  <matplotlib.lines.Line2D at 0x23884e54b70>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x23884baf9e8>,
  <matplotlib.lines.Line2D at 0x23884e54358>],
 'whiskers': [<matplotlib.lines.Line2D at 0x23884ba1f98>,
  <matplotlib.lines.Line2D at 0x23884ba77f0>,
  <matplotlib.lines.Line2D at 0x23884e48908>,
  <matplotlib.lines.Line2D at 0x23884e48a90>]}
output_44_1.png

Error notes:

Box plot of salaries for all departments

import seaborn as sns
sns.boxplot(data=df,x='dep',y='an_s')
plt.show()

## Pleasant surprise: installing from cmd is really convenient
# The plot is too crowded, though. Can it be made larger?

I had not installed seaborn before. Typing conda install seaborn in the cmd window took care of it, which was very convenient.

Question: how do you set the figure size in seaborn?

output_47_0.png
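One common way to address the figure-size question is to create the matplotlib figure first and hand its axes to seaborn through the ax parameter. The sketch below uses made-up rows; the Agg backend line is only there so it runs without a display and would not be needed inside a notebook with %matplotlib inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the real data
toy = pd.DataFrame({"dep": ["POLICE", "POLICE", "FIRE", "FIRE", "LAW", "LAW"],
                    "an_s": [63000, 65000, 69000, 70000, 55000, 57000]})

# Choose the size on the matplotlib side, then let seaborn draw into these axes
fig, ax = plt.subplots(figsize=(14, 6))
sns.boxplot(data=toy, x="dep", y="an_s", ax=ax)
ax.tick_params(axis="x", labelrotation=90)   # rotate long department names
fig.tight_layout()
```

With 25 departments, rotating the x tick labels usually helps as much as widening the figure.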
mean = salary.mean()
var =np.sum((salary-mean)**2)/(len(salary)-1)
var

271490393.4177519
std =np.sqrt(var)
std

16476.965540346071
# Next, the short way
np.var(salary)
271481428.6048666
salary.var()
271490393.4177519
salary.std()

16476.96554034607

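The two approaches give slightly different variances and standard deviations. This is not rounding error: np.var divides by N by default (ddof=0), while salary.var() and the manual formula above divide by N-1 (ddof=1).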
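The gap between np.var and Series.var comes from the divisor. A small sketch on made-up numbers, where the sum of squared deviations is 32 so the two variances are easy to check by hand:

```python
import numpy as np
import pandas as pd

toy = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values, mean is 5
n = len(toy)
squared_dev = ((toy - toy.mean()) ** 2).sum()   # 32.0 for this sample

population_var = squared_dev / n        # divide by N     (NumPy default, ddof=0)
sample_var = squared_dev / (n - 1)      # divide by N - 1 (pandas default, ddof=1)

print(np.var(toy.to_numpy()), population_var)
print(toy.var(), sample_var)
print(np.var(toy.to_numpy(), ddof=1))   # passing ddof=1 makes NumPy match pandas
```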

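Verifying the rule of thumb

The rule of thumb says that, for roughly normal data, about 68% of the values fall within one standard deviation of the mean, and about 95% within two.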

len(salary[salary.between(mean - std, mean + std)])/len(salary)


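  1. between needs a dot in front of it, because it is a method of the Series
  2. I used the wrong brackets: the inner square brackets should have been parentheses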
0.7666094307224938
len(salary[salary.between(mean - 2*std, mean + 2*std)])/len(salary)
0.933364152687888

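Reasonably close: 76.7% against the expected 68% for one standard deviation, and 93.3% against 95% for two.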
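For comparison, the same check can be run on data that really is normal. This is a sketch with a hypothetical helper (share_within) and a simulated sample whose mean and spread are arbitrary choices, not estimates from the salary data.

```python
import numpy as np
import pandas as pd

def share_within(series, k):
    """Fraction of values within k sample standard deviations of the mean."""
    mean = series.mean()
    std = series.std()
    # between() returns booleans; their mean is the fraction of True values
    return series.between(mean - k * std, mean + k * std).mean()

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
normal_sample = pd.Series(rng.normal(loc=60000, scale=15000, size=100_000))

one_sd = share_within(normal_sample, 1)
two_sd = share_within(normal_sample, 2)
print(one_sd, two_sd)
```

On simulated normal data the shares land near 68% and 95%, so the gap seen for salaries reflects that the salary distribution is not exactly normal.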

Relationship between two variables

Covariance

$$ cov(x,y) = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{N-1} $$

df.head()
score = df.re_s
score.tail()

30279    4.8
30280    4.9
30281    6.2
30282    6.7
30283    5.5
Name: re_s, dtype: float64
mean_s=np.mean(score)
np.sum((salary-mean)*(score-mean_s))/len(score)
 
7.747344090350171

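The value above is the covariance, which can be seen as an extension of the variance. The variance describes how one variable deviates from its own mean. The covariance describes how two variables deviate from their respective means together, so it is both at once: a measure of the relationship between two variables, built from each variable's deviations from its mean. Note that the manual calculation above divides by N, whereas the formula and np.cov divide by N-1, which explains the small difference from the result below.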

cov = np.cov(salary,score)[0,1]
cov
7.7475999218100222
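The effect of the divisor on the covariance can be checked on a tiny made-up sample, where the sum of cross-deviations is 6 and both versions are easy to verify by hand:

```python
import numpy as np

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

cross_dev = np.sum((x - x.mean()) * (y - y.mean()))   # 6.0 for this sample

cov_div_n = cross_dev / n          # 1.2, what dividing by len() gives
cov_div_n1 = cross_dev / (n - 1)   # 1.5, the formula with N - 1

print(cov_div_n, np.cov(x, y, ddof=0)[0, 1])
print(cov_div_n1, np.cov(x, y)[0, 1])   # np.cov uses N - 1 by default
```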

Correlation coefficient

$$ \rho = \frac{cov(x,y)}{\sigma_x \sigma_y} $$

corrcoef =cov/(np.std(salary)*np.std(score))
corrcoef
0.00045634837652977272
np.corrcoef(salary,score)[0,1]
0.00045633330757003586

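The two results differ slightly. The cause is a mismatch of divisors: cov was computed with N-1, while np.std divides by N by default. Using np.std(..., ddof=1) in the manual formula makes it agree with np.corrcoef.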
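A small sketch of that mismatch on made-up numbers: mixing an N-1 covariance with N-based standard deviations inflates the result by a factor of N/(N-1), while using ddof=1 throughout reproduces np.corrcoef.

```python
import numpy as np

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

cov_xy = np.cov(x, y)[0, 1]   # divides by N - 1

rho_mismatched = cov_xy / (np.std(x) * np.std(y))                   # std divides by N
rho_consistent = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))   # std divides by N - 1

print(rho_mismatched, rho_consistent, np.corrcoef(x, y)[0, 1])
```

With 30284 rows the factor N/(N-1) is tiny, which is why the discrepancy in the salary data only shows up in the fifth significant digit.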

plt.scatter(salary,score,alpha=0.5)
plt.show()
output_69_0.png
position = df[df.p_ti == 'FIREFIGHTER']
print(np.corrcoef(position.an_s, position.re_s)[1,0])
plt.scatter(position.an_s, position.re_s)
plt.show()
0.0571267765462
output_70_1.png
df.head()

Lesson 3 homework

1. Compute the mean and median of Review_Score in the HRSalaries data, and decide whether its distribution is skewed left or right.

print(df.re_s.mean())
print(df.re_s.median())
df.re_s.median()<df.re_s.mean()
plt.hist(df.re_s)
6.4558908994849205
6.5





(array([   25.,   101.,   560.,  2480.,  5199.,  8858.,  7821.,  4088.,
         1140.,    12.]),
 array([ 2.  ,  2.78,  3.56,  4.34,  5.12,  5.9 ,  6.68,  7.46,  8.24,
         9.02,  9.8 ]),
 <a list of 10 Patch objects>)
output_73_2.png

The mean is smaller than the median, so the distribution is skewed to the left.

print("Review_ScoreIQR:",df.re_s.quantile(0.75)-df.re_s.quantile(0.25))

Review_ScoreIQR: 1.4
df.re_s.plot(kind="box",vert=False,figsize=(9,6))

<matplotlib.axes._subplots.AxesSubplot at 0x23883c2e780>
output_76_1.png
plt.boxplot(df.re_s)
{'boxes': [<matplotlib.lines.Line2D at 0x2388914f048>],
 'caps': [<matplotlib.lines.Line2D at 0x2388914aa20>,
  <matplotlib.lines.Line2D at 0x2388914aba8>],
 'fliers': [<matplotlib.lines.Line2D at 0x2388914dc18>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x2388914d400>],
 'whiskers': [<matplotlib.lines.Line2D at 0x238891639b0>,
  <matplotlib.lines.Line2D at 0x23889163b38>]}
output_77_1.png
sns.boxplot(data=df.re_s)
<matplotlib.axes._subplots.AxesSubplot at 0x2388fcb31d0>
output_78_1.png

3. What is the standard deviation of Review_Score?

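score.std()
# The original run called score.var() here, which returned the variance 1.0617336150160495;
# the standard deviation is its square root, about 1.03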

4. For Review_Score, find the percentage of the data that falls within two standard deviations of the mean.

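mean_rs = score.mean()
std_rs = score.std()

#len(np.between[mean_rs-var_rs,mean_rs+var_rs])/len(score)
# Mistake log: between is a Series method, so it should be score.between, but I wrote np.between
# Second mistake: the original run used score.var() and a one-unit interval, which gave 0.7091203275657113;
# the question asks for two standard deviations, so use std and a factor of 2
len(score[score.between(mean_rs-2*std_rs,mean_rs+2*std_rs)])/len(score)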

5. For the DoIT department, compute the correlation coefficient between salary and review score.

it_s=df[df.dep=="DoIT"].an_s
print(it_s.head())
it_rs = df[df.dep=="DoIT"].re_s
print(it_rs.head())
708     64986
879     80746
1656    63777
1904    91184
2038    90967
Name: an_s, dtype: int64
708     5.5
879     6.3
1656    8.4
1904    7.4
2038    5.1
Name: re_s, dtype: float64
np.corrcoef(it_s,it_rs)[0,1]
0.0060245710104947512
plt.scatter(it_s,it_rs)
plt.show()
output_87_0.png

There is almost no visible correlation.

