Pandas 分组 groupby

2022-05-21  本文已影响0人  陈天睡懒觉
import pandas as pd
df = pd.read_csv('data/gapminder.tsv',sep='\t')
print(df.head())

# 输出结果
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
print(df.dtypes)

# 输出结果
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

按照单变量分组和计算

按照year分组并求每组的lifeExp的均值,得到一个Series

mean_lifeExp_by_year = df.groupby('year')['lifeExp'].mean()
print(mean_lifeExp_by_year)
print(mean_lifeExp_by_year.iloc[0])
print(mean_lifeExp_by_year.loc[1952])
year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
49.05761971830987
49.05761971830987

取出值或者索引

print(mean_lifeExp_by_year.values)
print(mean_lifeExp_by_year.index)
[49.05761972 51.50740113 53.60924901 55.67828958 57.64738648 59.57015746
 61.53319718 63.21261268 64.16033803 65.01467606 65.69492254 67.00742254]
Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
            2007],
           dtype='int64', name='year')

转成DataFrame,to_frame()方法,仍然只有一列

mean_lifeExp_by_year_df = mean_lifeExp_by_year.to_frame()
print(mean_lifeExp_by_year_df.shape) # (12, 1)

按照多变量分组和计算多个值

按照year和continent分组并求每组的lifeExp,GDP的均值,得到一个DataFrame

multi_group_var = df.groupby(['year','continent'])[['lifeExp','gdpPercap']].mean()
print(type(multi_group_var))
print(multi_group_var.head(10))

# 输出结果
<class 'pandas.core.frame.DataFrame'>
                  lifeExp     gdpPercap
year continent                         
1952 Africa     39.135500   1252.572466
     Americas   53.279840   4079.062552
     Asia       46.314394   5195.484004
     Europe     64.408500   5661.057435
     Oceania    69.255000  10298.085650
1957 Africa     41.266346   1385.236062
     Americas   55.960280   4616.043733
     Asia       49.318544   5787.732940
     Europe     66.703067   6963.012816
     Oceania    70.295000  11598.522455

观察到year不全,需要平铺DataFrame,使用reset_index方法。

flat = multi_group_var.reset_index()
print(flat.head(10))

# 输出结果
   year continent    lifeExp     gdpPercap
0  1952    Africa  39.135500   1252.572466
1  1952  Americas  53.279840   4079.062552
2  1952      Asia  46.314394   5195.484004
3  1952    Europe  64.408500   5661.057435
4  1952   Oceania  69.255000  10298.085650
5  1957    Africa  41.266346   1385.236062
6  1957  Americas  55.960280   4616.043733
7  1957      Asia  49.318544   5787.732940
8  1957    Europe  66.703067   6963.012816
9  1957   Oceania  70.295000  11598.522455

分组计数

唯一值计数 nunique()

country_nunique_by_continent = df.groupby('continent')['country'].nunique()
print(type(country_nunique_by_continent))
print(country_nunique_by_continent)

# 输出结果
<class 'pandas.core.series.Series'>
continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64

频数计数 value_counts(),频数计数的结果无法铺平

country_count_by_continent = df.groupby('continent')['country'].value_counts()
print(type(country_count_by_continent))
print(country_count_by_continent.head(20))

# 输出结果
<class 'pandas.core.series.Series'>
continent  country                 
Africa     Algeria                     12
           Angola                      12
           Benin                       12
           Botswana                    12
           Burkina Faso                12
           Burundi                     12
           Cameroon                    12
           Central African Republic    12
           Chad                        12
           Comoros                     12
           Congo, Dem. Rep.            12
           Congo, Rep.                 12
           Cote d'Ivoire               12
           Djibouti                    12
           Egypt                       12
           Equatorial Guinea           12
           Eritrea                     12
           Ethiopia                    12
           Gabon                       12
           Gambia                      12
Name: country, dtype: int64

获取分组结果

单变量分组结果获取

groupby_year = df.groupby('year')
year_1952 = groupby_year.get_group(1952)
print(year_1952.head(10))

# 输出结果
         country continent  year  lifeExp       pop     gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333    779.445314
12       Albania    Europe  1952   55.230   1282697   1601.056136
24       Algeria    Africa  1952   43.077   9279525   2449.008185
36        Angola    Africa  1952   30.015   4232095   3520.610273
48     Argentina  Americas  1952   62.485  17876956   5911.315053
60     Australia   Oceania  1952   69.120   8691212  10039.595640
72       Austria    Europe  1952   66.800   6927772   6137.076492
84       Bahrain      Asia  1952   50.939    120447   9867.084765
96    Bangladesh      Asia  1952   37.484  46886859    684.244172
108      Belgium    Europe  1952   68.000   8730405   8343.105127

多变量分组结果获取,元组形式

groupby_year_continent = df.groupby(['year','continent'])
year_1952_Asia = groupby_year_continent.get_group((1952, 'Asia'))
print(year_1952.head(10))

# 输出结果
         country continent  year  lifeExp       pop     gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333    779.445314
12       Albania    Europe  1952   55.230   1282697   1601.056136
24       Algeria    Africa  1952   43.077   9279525   2449.008185
36        Angola    Africa  1952   30.015   4232095   3520.610273
48     Argentina  Americas  1952   62.485  17876956   5911.315053
60     Australia   Oceania  1952   69.120   8691212  10039.595640
72       Austria    Europe  1952   66.800   6927772   6137.076492
84       Bahrain      Asia  1952   50.939    120447   9867.084765
96    Bangladesh      Asia  1952   37.484  46886859    684.244172
108      Belgium    Europe  1952   68.000   8730405   8343.105127
上一篇下一篇

猜你喜欢

热点阅读