Pandas - 10.2 Transform and Filter

2022-07-30  陈天睡懒觉

transform

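Unlike aggregation, which reduces each group to a single value, a transformation returns one value per input row, so the number of observations does not change. Standardization is an example: each value is standardized within its own group rather than across the whole dataset.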

import pandas as pd
df = pd.read_csv('data/gapminder.tsv', sep='\t')

def my_zscore(x):
    return ((x - x.mean())/x.std())

transform_z = df.groupby('year').lifeExp.transform(my_zscore)
print(transform_z.shape) # (1704,)
print(df.shape) # (1704, 6)
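
To make the contrast with aggregation concrete, here is a minimal sketch on a small made-up frame (the toy data and column names are assumptions for illustration, not part of the gapminder dataset). It shows that mean() collapses each group to one value, while transform returns one value per original row.

```python
import pandas as pd

# Hypothetical toy data: two groups with three values each
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

def my_zscore(x):
    # Same z-score as in the text; x.std() uses ddof=1 by default
    return (x - x.mean()) / x.std()

# Aggregation: one value per group
agg_result = toy.groupby('group')['value'].mean()
# Transformation: one value per original row, aligned to the original index
transform_result = toy.groupby('group')['value'].transform(my_zscore)

print(agg_result.shape)           # (2,)
print(transform_result.shape)     # (6,)
print(transform_result.tolist())  # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

Both groups end up with the same z-scores even though their raw values differ by a factor of ten, because each group is standardized with its own mean and standard deviation.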

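Compare grouped and ungrouped standardization. The two grouped results are close but not identical, because x.std() in pandas defaults to the sample standard deviation (ddof=1) while scipy.stats.zscore defaults to the population standard deviation (ddof=0). The ungrouped result differs substantially from both.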

from scipy.stats import zscore

sp_z_grouped = df.groupby('year').lifeExp.transform(zscore)
sp_z_nogroup = zscore(df.lifeExp)

print(transform_z.head())
'''
0   -1.656854
1   -1.731249
2   -1.786543
3   -1.848157
4   -1.894173
Name: lifeExp, dtype: float64
'''

print(sp_z_grouped.head())
'''
0   -1.662719
1   -1.737377
2   -1.792867
3   -1.854699
4   -1.900878
Name: lifeExp, dtype: float64
'''

print(sp_z_nogroup[:5])
# [-2.37533395 -2.25677417 -2.1278375  -1.97117751 -1.81103275]
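
The small gap between the two grouped results can be checked without scipy. The sketch below uses made-up values (an assumption for illustration) and computes the z-score with both ddof settings; the two versions should differ by the constant factor sqrt(n / (n - 1)). With 1704 rows over 12 years, each gapminder year has n = 142 rows, which gives a factor of about 1.0035, consistent with the outputs above.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# pandas default: sample standard deviation (ddof=1)
z_sample = (x - x.mean()) / x.std()
# Population standard deviation (ddof=0), the default of scipy.stats.zscore
z_population = (x - x.mean()) / x.std(ddof=0)

n = len(x)
factor = np.sqrt(n / (n - 1))

# The ddof=0 z-scores equal the ddof=1 z-scores scaled by sqrt(n / (n - 1))
print(np.allclose(z_population, z_sample * factor))  # True
```

If identical results are wanted, one option is to pass ddof=0 to x.std() inside my_zscore, or ddof=1 to zscore.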

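Take missing-value imputation as an example: replace each missing value with the mean of its group rather than the mean of the whole dataset. For instance, if men and women tend to spend different amounts, filling a missing value with the mean for the same sex is more reasonable.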

import seaborn as sns
import numpy as np

np.random.seed(42)
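# Draw 10 sample rows
tips_10 = sns.load_dataset('tips').sample(10)
# Randomly set 'total_bill' to missing in four of the sampled rows
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
tips_10.loc[np.random.permutation(tips_10.index)[:4], 'total_bill'] = np.nan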
print(tips_10)
'''
     total_bill   tip     sex smoker   day    time  size
24        19.82  3.18    Male     No   Sat  Dinner     2
6          8.77  2.00    Male     No   Sun  Dinner     2
153         NaN  2.00    Male     No   Sun  Dinner     4
211         NaN  5.16    Male    Yes   Sat  Dinner     4
198         NaN  2.00  Female    Yes  Thur   Lunch     2
176         NaN  2.00    Male    Yes   Sun  Dinner     2
192       28.44  2.56    Male    Yes  Thur   Lunch     2
124       12.48  2.52  Female     No  Thur   Lunch     2
9         14.78  3.23    Male     No   Sun  Dinner     2
101       15.38  3.00  Female    Yes   Fri  Dinner     2
'''
# Count non-missing values by sex: total_bill is missing in 3 Male rows and 1 Female row
count_sex = tips_10.groupby('sex').count()
print(count_sex)
'''
        total_bill  tip  smoker  day  time  size
sex                                             
Male             4    7       7    7     7     7
Female           2    3       3    3     3     3
'''
# Fill the missing values of the given vector with its mean
def fill_na_mean(x):
    avg = x.mean()
    return (x.fillna(avg))

total_bill_group_mean = tips_10.groupby('sex').total_bill.transform(fill_na_mean)
tips_10['fill_total_bill'] = total_bill_group_mean
print(tips_10)
'''
     total_bill   tip     sex smoker   day    time  size  fill_total_bill
24        19.82  3.18    Male     No   Sat  Dinner     2          19.8200
6          8.77  2.00    Male     No   Sun  Dinner     2           8.7700
153         NaN  2.00    Male     No   Sun  Dinner     4          17.9525
211         NaN  5.16    Male    Yes   Sat  Dinner     4          17.9525
198         NaN  2.00  Female    Yes  Thur   Lunch     2          13.9300
176         NaN  2.00    Male    Yes   Sun  Dinner     2          17.9525
192       28.44  2.56    Male    Yes  Thur   Lunch     2          28.4400
124       12.48  2.52  Female     No  Thur   Lunch     2          12.4800
9         14.78  3.23    Male     No   Sun  Dinner     2          14.7800
101       15.38  3.00  Female    Yes   Fri  Dinner     2          15.3800
'''
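
The same imputation can also be written without a helper function. The sketch below is one possible alternative on made-up data (the toy frame is an assumption, standing in for tips_10): it broadcasts each group's mean to its rows with transform('mean') and then fills only the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for tips_10
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Female'],
    'total_bill': [10.0, np.nan, 20.0, 8.0, np.nan],
})

# Mean of each row's group, repeated for every row (NaN is skipped in the mean)
group_mean = toy.groupby('sex')['total_bill'].transform('mean')
# Use the group mean only where the original value is missing
toy['fill_total_bill'] = toy['total_bill'].fillna(group_mean)

print(toy['fill_total_bill'].tolist())  # [10.0, 15.0, 20.0, 8.0, 8.0]
```

The missing Male value becomes 15.0 (the mean of 10.0 and 20.0) and the missing Female value becomes 8.0, so each gap is filled from its own group only.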

filter

import pandas as pd
import seaborn as sns

tips = sns.load_dataset('tips')
print(tips.shape) # (244, 7)

print(tips['size'].value_counts())
'''
2    156
3     38
4     37
5      5
6      4
1      4
Name: size, dtype: int64
'''

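The output shows that party sizes of 1, 5, and 6 are uncommon. Filter these rows out by requiring each group to contain at least 30 observations.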

tips_filtered = tips.groupby('size').filter(lambda x: x['size'].count() >= 30)
print(tips_filtered.shape) # (231, 7)
print(tips_filtered['size'].value_counts())
'''
(231, 7)
2    156
3     38
4     37
Name: size, dtype: int64
'''
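
As a rough illustration of what filter does, the sketch below uses a small made-up frame and a threshold of 3 (both assumptions for illustration) and compares groupby(...).filter with an equivalent boolean mask built from transform('size'). Note that len(g) counts rows, whereas the count() used above counts non-missing values; the two agree when the column has no missing values.

```python
import pandas as pd

# Hypothetical toy data: party size 2 appears three times, sizes 1 and 5 once each
toy = pd.DataFrame({
    'size': [2, 2, 2, 1, 5],
    'tip': [2.0, 3.0, 2.5, 1.0, 5.0],
})

min_count = 3  # illustrative threshold

# filter keeps every row of each group for which the function returns True
by_filter = toy.groupby('size').filter(lambda g: len(g) >= min_count)

# Alternative: broadcast each group's row count to its rows, then mask
group_sizes = toy.groupby('size')['size'].transform('size')
by_mask = toy[group_sizes >= min_count]

print(by_filter.index.tolist())  # [0, 1, 2]
print(by_mask.index.tolist())    # [0, 1, 2]
```

Both approaches keep the original row order and index; the mask version can be convenient when the group sizes are needed for something else as well.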