Pandas - 10.2 Transform and Filter
2022-07-30
陈天睡懒觉
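transform
Unlike aggregation, which reduces each group to a single value, a transformation returns as many values as it receives. Standardization is one example: the number of rows stays the same, but each value is standardized within its own group.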
import pandas as pd
df = pd.read_csv('data/gapminder.tsv', sep='\t')
# Standardize x to z-scores (pandas' std() uses the sample standard deviation)
def my_zscore(x):
    return (x - x.mean()) / x.std()
transform_z = df.groupby('year').lifeExp.transform(my_zscore)
print(transform_z.shape) # (1704,)
print(df.shape) # (1704, 6)
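To make the contrast with aggregation concrete, here is a minimal sketch on a hypothetical toy DataFrame (`toy`, `group`, and `value` are made-up names for illustration, not part of the gapminder data). It shows that an aggregation returns one value per group, while `transform` returns one value per original row.

```python
import pandas as pd

# Hypothetical toy data: two groups with three values each
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Aggregation collapses each group to a single value
group_means = toy.groupby('group')['value'].mean()
print(group_means.shape)  # (2,)

# transform returns one value per original row, aligned with toy.index
centered = toy.groupby('group')['value'].transform(lambda x: x - x.mean())
print(centered.tolist())  # [-1.0, 0.0, 1.0, -10.0, 0.0, 10.0]
```

Because the result keeps the original index, it can be assigned straight back to the DataFrame as a new column.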
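Compare grouped and ungrouped standardization: the two grouped results are close to each other, while the ungrouped result differs substantially.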
from scipy.stats import zscore
sp_z_grouped = df.groupby('year').lifeExp.transform(zscore)
sp_z_nogroup = zscore(df.lifeExp)
print(transform_z.head())
'''
0 -1.656854
1 -1.731249
2 -1.786543
3 -1.848157
4 -1.894173
Name: lifeExp, dtype: float64
'''
print(sp_z_grouped.head())
'''
0 -1.662719
1 -1.737377
2 -1.792867
3 -1.854699
4 -1.900878
Name: lifeExp, dtype: float64
'''
print(sp_z_nogroup[:5])
# [-2.37533395 -2.25677417 -2.1278375 -1.97117751 -1.81103275]
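The small gap between the two grouped results most likely comes from the standard deviation convention: pandas' `std()` defaults to `ddof=1` (sample), while `scipy.stats.zscore` defaults to `ddof=0` (population). The sketch below illustrates this on a hypothetical toy DataFrame (`toy` and the helper `zscore_ddof` are assumptions for illustration, not part of the original code), using only pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for lifeExp grouped by year
toy = pd.DataFrame({
    'year': [1952, 1952, 1952, 1957, 1957, 1957],
    'lifeExp': [30.0, 40.0, 50.0, 35.0, 45.0, 61.0],
})

def zscore_ddof(x, ddof):
    # Standardize x using a standard deviation with the given ddof
    return (x - x.mean()) / x.std(ddof=ddof)

grouped = toy.groupby('year')['lifeExp']
z_sample = grouped.transform(lambda x: zscore_ddof(x, 1))      # pandas default
z_population = grouped.transform(lambda x: zscore_ddof(x, 0))  # scipy.stats.zscore default

# Within a group of n rows the two differ by the constant factor sqrt(n / (n - 1));
# rows whose z-score is exactly 0 give 0/0 = NaN and are dropped
ratio = (z_population / z_sample).dropna()
print(np.allclose(ratio, np.sqrt(3 / 2)))  # True
```

Assuming each year in the gapminder data holds 142 rows (1704 rows over 12 survey years), the factor would be sqrt(142/141), about 1.0035, which is consistent with the ratio between the two grouped outputs above.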
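Filling missing values is another example: replace each missing value with the mean of its group rather than the mean of the whole dataset. For instance, if men and women tend to spend differently, computing the mean separately for each sex gives a more reasonable replacement value.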
import seaborn as sns
import numpy as np
np.random.seed(42)
# Draw a sample of 10 rows
tips_10 = sns.load_dataset('tips').sample(10)
# Randomly set 'total_bill' to missing in four of the sampled rows
tips_10.loc[np.random.permutation(tips_10.index)[:4], 'total_bill'] = np.nan
print(tips_10)
'''
     total_bill   tip     sex  smoker   day    time  size
24        19.82  3.18    Male      No   Sat  Dinner     2
6          8.77  2.00    Male      No   Sun  Dinner     2
153         NaN  2.00    Male      No   Sun  Dinner     4
211         NaN  5.16    Male     Yes   Sat  Dinner     4
198         NaN  2.00  Female     Yes  Thur   Lunch     2
176         NaN  2.00    Male     Yes   Sun  Dinner     2
192       28.44  2.56    Male     Yes  Thur   Lunch     2
124       12.48  2.52  Female      No  Thur   Lunch     2
9         14.78  3.23    Male      No   Sun  Dinner     2
101       15.38  3.00  Female     Yes   Fri  Dinner     2
'''
# Count non-missing values by sex: total_bill is missing in 3 Male rows and 1 Female row
count_sex = tips_10.groupby('sex').count()
print(count_sex)
'''
        total_bill  tip  smoker  day  time  size
sex
Male             4    7       7    7     7     7
Female           2    3       3    3     3     3
'''
# Fill the missing values of the given vector with its mean
def fill_na_mean(x):
    avg = x.mean()
    return x.fillna(avg)
total_bill_group_mean = tips_10.groupby('sex').total_bill.transform(fill_na_mean)
tips_10['fill_total_bill'] = total_bill_group_mean
print(tips_10)
'''
     total_bill   tip     sex  smoker   day    time  size  fill_total_bill
24        19.82  3.18    Male      No   Sat  Dinner     2          19.8200
6          8.77  2.00    Male      No   Sun  Dinner     2           8.7700
153         NaN  2.00    Male      No   Sun  Dinner     4          17.9525
211         NaN  5.16    Male     Yes   Sat  Dinner     4          17.9525
198         NaN  2.00  Female     Yes  Thur   Lunch     2          13.9300
176         NaN  2.00    Male     Yes   Sun  Dinner     2          17.9525
192       28.44  2.56    Male     Yes  Thur   Lunch     2          28.4400
124       12.48  2.52  Female      No  Thur   Lunch     2          12.4800
9         14.78  3.23    Male      No   Sun  Dinner     2          14.7800
101       15.38  3.00  Female     Yes   Fri  Dinner     2          15.3800
'''
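The same group-wise fill can also be written without a custom function. The following is a sketch on hypothetical toy data (`toy` and its values are made up for illustration): `transform('mean')` broadcasts each group's mean to every row of that group, and `fillna` then uses it only where a value is missing. A fill with the overall mean is included for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing bill per group
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'total_bill': [10.0, 20.0, np.nan, 8.0, np.nan, 12.0],
})

# Each group's mean (NaN values are skipped), repeated for every row of the group
group_mean = toy.groupby('sex')['total_bill'].transform('mean')

# Fill each missing value with the mean of its own group
filled = toy['total_bill'].fillna(group_mean)
print(filled.tolist())   # [10.0, 20.0, 15.0, 8.0, 10.0, 12.0]

# For comparison: fill with the mean of the whole column
overall = toy['total_bill'].fillna(toy['total_bill'].mean())
print(overall.tolist())  # [10.0, 20.0, 12.5, 8.0, 12.5, 12.0]
```

In this toy case the group means (15.0 and 10.0) differ from the overall mean (12.5), so the two strategies fill in different values.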
filter
import pandas as pd
import seaborn as sns
tips = sns.load_dataset('tips')
print(tips.shape) # (244, 7)
print(tips['size'].value_counts())
'''
2 156
3 38
4 37
5 5
6 4
1 4
Name: size, dtype: int64
'''
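The output shows that party sizes of 1, 5, and 6 are uncommon. These rows should be filtered out by requiring every group to contain at least 30 observations.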
tips_filtered = tips.groupby('size').filter(lambda x: x['size'].count() >= 30)
print(tips_filtered.shape) # (231, 7)
print(tips_filtered['size'].value_counts())
'''
2 156
3 38
4 37
Name: size, dtype: int64
'''
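As a minimal sketch of how group filtering behaves, the example below uses a hypothetical toy DataFrame and a made-up threshold `min_rows` (both are assumptions for illustration). It keeps only the groups with enough rows, once with `filter` and once with an equivalent boolean mask built from `transform('size')`.

```python
import pandas as pd

# Hypothetical toy data: party size 6 appears only once
toy = pd.DataFrame({
    'size': [2, 2, 2, 3, 3, 6],
    'tip': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
min_rows = 2  # hypothetical threshold; the tips example above uses 30

# filter keeps every row of each group for which the function returns True
kept = toy.groupby('size').filter(lambda g: len(g) >= min_rows)
print(kept['size'].tolist())  # [2, 2, 2, 3, 3]

# Alternative: broadcast each group's row count and use it as a boolean mask
group_rows = toy.groupby('size')['tip'].transform('size')
kept_mask = toy[group_rows >= min_rows]
print(kept_mask['size'].tolist())  # [2, 2, 2, 3, 3]
```

Both approaches keep the rows in their original order; the mask version avoids calling a Python function once per group, which may matter when there are many groups.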