Pandas 分组 groupby
2022-05-21 本文已影响0人
陈天睡懒觉
import pandas as pd
df = pd.read_csv('data/gapminder.tsv',sep='\t')
print(df.head())
# 输出结果
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
print(df.dtypes)
# 输出结果
country object
continent object
year int64
lifeExp float64
pop int64
gdpPercap float64
dtype: object
按照单变量分组和计算
按照year分组并求每组的lifeExp的均值,得到一个Series
mean_lifeExp_by_year = df.groupby('year')['lifeExp'].mean()
print(mean_lifeExp_by_year)
print(mean_lifeExp_by_year.iloc[0])
print(mean_lifeExp_by_year.loc[1952])
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
49.05761971830987
49.05761971830987
取出值或者索引
print(mean_lifeExp_by_year.values)
print(mean_lifeExp_by_year.index)
[49.05761972 51.50740113 53.60924901 55.67828958 57.64738648 59.57015746
61.53319718 63.21261268 64.16033803 65.01467606 65.69492254 67.00742254]
Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
2007],
dtype='int64', name='year')
转成DataFrame,to_frame()方法,仍然只有一列
mean_lifeExp_by_year_df = mean_lifeExp_by_year.to_frame()
print(mean_lifeExp_by_year_df.shape) # (12, 1)
按照多变量分组和计算多个值
按照year和continent分组并求每组的lifeExp,GDP的均值,得到一个DataFrame
multi_group_var = df.groupby(['year','continent'])[['lifeExp','gdpPercap']].mean()
print(type(multi_group_var))
print(multi_group_var.head(10))
# 输出结果
<class 'pandas.core.frame.DataFrame'>
lifeExp gdpPercap
year continent
1952 Africa 39.135500 1252.572466
Americas 53.279840 4079.062552
Asia 46.314394 5195.484004
Europe 64.408500 5661.057435
Oceania 69.255000 10298.085650
1957 Africa 41.266346 1385.236062
Americas 55.960280 4616.043733
Asia 49.318544 5787.732940
Europe 66.703067 6963.012816
Oceania 70.295000 11598.522455
观察到year不全,需要平铺DataFrame,使用reset_index方法。
flat = multi_group_var.reset_index()
print(flat.head(10))
# 输出结果
year continent lifeExp gdpPercap
0 1952 Africa 39.135500 1252.572466
1 1952 Americas 53.279840 4079.062552
2 1952 Asia 46.314394 5195.484004
3 1952 Europe 64.408500 5661.057435
4 1952 Oceania 69.255000 10298.085650
5 1957 Africa 41.266346 1385.236062
6 1957 Americas 55.960280 4616.043733
7 1957 Asia 49.318544 5787.732940
8 1957 Europe 66.703067 6963.012816
9 1957 Oceania 70.295000 11598.522455
分组计数
- 唯一值计数 nunique()
- 频数计数 value_counts()
唯一值计数 nunique()
country_nunique_by_continent = df.groupby('continent')['country'].nunique()
print(type(country_nunique_by_continent))
print(country_nunique_by_continent)
# 输出结果
<class 'pandas.core.series.Series'>
continent
Africa 52
Americas 25
Asia 33
Europe 30
Oceania 2
Name: country, dtype: int64
频数计数 value_counts(),频数计数的结果无法铺平
country_count_by_continent = df.groupby('continent')['country'].value_counts()
print(type(country_count_by_continent))
print(country_count_by_continent.head(20))
# 输出结果
<class 'pandas.core.series.Series'>
continent country
Africa Algeria 12
Angola 12
Benin 12
Botswana 12
Burkina Faso 12
Burundi 12
Cameroon 12
Central African Republic 12
Chad 12
Comoros 12
Congo, Dem. Rep. 12
Congo, Rep. 12
Cote d'Ivoire 12
Djibouti 12
Egypt 12
Equatorial Guinea 12
Eritrea 12
Ethiopia 12
Gabon 12
Gambia 12
Name: country, dtype: int64
获取分组结果
单变量分组结果获取
groupby_year = df.groupby('year')
year_1952 = groupby_year.get_group(1952)
print(year_1952.head(10))
# 输出结果
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
12 Albania Europe 1952 55.230 1282697 1601.056136
24 Algeria Africa 1952 43.077 9279525 2449.008185
36 Angola Africa 1952 30.015 4232095 3520.610273
48 Argentina Americas 1952 62.485 17876956 5911.315053
60 Australia Oceania 1952 69.120 8691212 10039.595640
72 Austria Europe 1952 66.800 6927772 6137.076492
84 Bahrain Asia 1952 50.939 120447 9867.084765
96 Bangladesh Asia 1952 37.484 46886859 684.244172
108 Belgium Europe 1952 68.000 8730405 8343.105127
多变量分组结果获取,元组形式
groupby_year_continent = df.groupby(['year','continent'])
year_1952_Asia = groupby_year_continent.get_group((1952, 'Asia'))
print(year_1952.head(10))
# 输出结果
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
12 Albania Europe 1952 55.230 1282697 1601.056136
24 Algeria Africa 1952 43.077 9279525 2449.008185
36 Angola Africa 1952 30.015 4232095 3520.610273
48 Argentina Americas 1952 62.485 17876956 5911.315053
60 Australia Oceania 1952 69.120 8691212 10039.595640
72 Austria Europe 1952 66.800 6927772 6137.076492
84 Bahrain Asia 1952 50.939 120447 9867.084765
96 Bangladesh Asia 1952 37.484 46886859 684.244172
108 Belgium Europe 1952 68.000 8730405 8343.105127