Data aggregation with the python/pandas groupby function
2018-08-22
LeeMin_Z
Dataset link:
https://github.com/kojoidrissa/pydata-book
# Define a function that returns the largest difference (max minus min)
In [49]: def PeakToPeak(arr):
    ...:     return arr.max() - arr.min()
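# Assumes `import pandas as pd` and `import numpy as np` ran earlier in the session
In [51]: tips = pd.read_csv('ch08/tips.csv')
# tip_pct: the tip as a share of the total bill (a fraction, not multiplied by 100)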
In [52]: tips['tip_pct'] = tips['tip'] / tips['total_bill']
In [53]: tips[:6]
Out[53]:
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808
5       25.29  4.71    Male     No  Sun  Dinner     4  0.186240
# Group by the keys ['sex','smoker'], then select 'tip_pct' as the column to aggregate
In [54]: grouped = tips.groupby(['sex','smoker'])
In [55]: grouped_pct = grouped['tip_pct']
In [56]: grouped_pct.agg('mean')
Out[56]:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64
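The PeakToPeak function defined at In [49] is not applied anywhere in this session. As a minimal sketch of how a custom function can be passed to agg, the example below uses a small made-up frame (the `toy` data is hypothetical and only stands in for tips.csv):

```python
import pandas as pd

def PeakToPeak(arr):
    """Return the spread (max minus min) of a Series or array."""
    return arr.max() - arr.min()

# Hypothetical toy data, not the real tips.csv
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male"],
    "tip_pct": [0.10, 0.25, 0.15, 0.20],
})

# agg calls PeakToPeak once per group, passing that group's values as a Series
spread = toy.groupby("sex")["tip_pct"].agg(PeakToPeak)
print(spread)
```

On the real data the equivalent call would presumably be grouped_pct.agg(PeakToPeak), giving one spread value per (sex, smoker) group.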
# The mean column is labeled 'foo' and the std column 'bar'; the first element of each tuple only sets the column name
In [59]: grouped_pct.agg([('foo','mean'),('bar',np.std)])
Out[59]:
                    foo       bar
sex    smoker
Female No      0.156921  0.036421
       Yes     0.182150  0.071595
Male   No      0.160669  0.041849
       Yes     0.152771  0.090588
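Besides the (name, function) tuples used in In [59], pandas 0.25 and later also accept keyword arguments for naming aggregated columns. The sketch below shows that alternative on a small made-up frame (the `toy` data is hypothetical); it is not the form used in the session above:

```python
import pandas as pd

# Hypothetical toy data, not the real tips.csv
toy = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "tip_pct": [0.10, 0.20, 0.10, 0.30],
})

# Each keyword becomes an output column name; its value names the aggregation
named = toy.groupby("smoker")["tip_pct"].agg(foo="mean", bar="std")
print(named)
```

Here 'std' is the pandas sample standard deviation (ddof=1), the same kind of value shown in the 'bar' column above.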
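# Apply several aggregations to several columns in one call
In [60]: functions = ['count','mean','max']
# Select multiple columns with a list (double brackets)
In [61]: result = grouped[['tip_pct','total_bill']].agg(functions)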
In [62]: result
Out[62]:
             tip_pct                     total_bill
               count      mean       max      count       mean    max
sex    smoker
Female No         54  0.156921  0.252672         54  18.105185  35.83
       Yes        33  0.182150  0.416667         33  17.977879  44.30
Male   No         97  0.160669  0.291990         97  19.791237  48.33
       Yes        60  0.152771  0.710345         60  22.284500  50.81
In [63]: result['total_bill']
Out[63]:
               count       mean    max
sex    smoker
Female No         54  18.105185  35.83
       Yes        33  17.977879  44.30
Male   No         97  19.791237  48.33
       Yes        60  22.284500  50.81
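Aggregating several columns with a list of functions yields two-level (hierarchical) columns, which is why result['total_bill'] returns a sub-table. The sketch below, on a small made-up frame (the `toy` data and the flattened column names are illustrative assumptions), shows selecting one block, selecting one inner column with a tuple, and one possible way to flatten the column names:

```python
import pandas as pd

# Hypothetical toy data, not the real tips.csv
toy = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "tip_pct": [0.10, 0.20, 0.10, 0.30],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
})

result = toy.groupby("smoker")[["tip_pct", "total_bill"]].agg(["count", "mean", "max"])

# Outer-level key: a DataFrame with columns count, mean, max
bill_stats = result["total_bill"]

# (outer, inner) tuple: a single Series
tip_mean = result[("tip_pct", "mean")]

# Join the two levels into flat names such as 'tip_pct_mean'
flat = result.copy()
flat.columns = ["_".join(col) for col in result.columns]
print(flat)
```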
2018.8.22
The part of the data analysis that just narrates the charts is omitted.