aggregate | Grouped statistics in R
2021-04-24
木舟笔记
A short guide to using aggregate
Group means
# Load a built-in dataset
df <- chickwts
# Inspect the dataset
head(df)
> head(df)
weight feed
1 179 horsebean
2 160 horsebean
3 136 horsebean
4 227 horsebean
5 217 horsebean
6 168 horsebean
aggregate
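There are two ways to compute group means.
- Method 1:
# First argument: the numeric variable
# Second argument: the grouping variable(s), passed as a list
# Third argument: the summary function to apply (here the mean)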
group_mean <- aggregate(df$weight, list(df$feed), mean)
> group_mean
Group.1 x
1 casein 323.5833
2 horsebean 160.2000
3 linseed 218.7500
4 meatmeal 276.9091
5 soybean 246.4286
6 sunflower 328.9167
Note that the columns of the result are named Group.1 and x rather than feed and weight. They can be renamed with the colnames function.
colnames(group_mean) <- c("Group", "Mean")
group_mean
> group_mean
Group Mean
1 casein 323.5833
2 horsebean 160.2000
3 linseed 218.7500
4 meatmeal 276.9091
5 soybean 246.4286
6 sunflower 328.9167
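The renaming step can also be avoided by naming the inputs. The following is a minimal sketch on the same chickwts data; the names Group, Mean and group_mean_named are chosen here for illustration and are not part of the original post.

```r
# Built-in dataset used throughout this post
df <- chickwts

# Naming the elements of both lists carries those names into the result,
# so no call to colnames() is needed afterwards
group_mean_named <- aggregate(list(Mean = df$weight),
                              by = list(Group = df$feed),
                              FUN = mean)
group_mean_named
```

The values are the same as above; only the column names differ.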
- Method 2 (formula interface, which keeps the original column names):
group_mean <- aggregate(weight ~ feed, data = df, mean)
> group_mean
feed weight
1 casein 323.5833
2 horsebean 160.2000
3 linseed 218.7500
4 meatmeal 276.9091
5 soybean 246.4286
6 sunflower 328.9167
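As a quick sanity check, the formula result can be compared with tapply. This is only a sketch; by_formula and by_tapply are illustrative names, and the dot on the left-hand side is shorthand for all columns of the data that do not appear on the right-hand side (here just weight).

```r
df <- chickwts

# "." stands for every column of df that is not used as a grouping variable
by_formula <- aggregate(. ~ feed, data = df, FUN = mean)

# tapply() computes the same group means, returned as a named vector
by_tapply <- tapply(df$weight, df$feed, mean)

# Compare the two sets of means, ignoring names
all.equal(by_formula$weight, as.vector(by_tapply))
```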
Group counts
group_count <- aggregate(df$feed, by = list(df$feed), FUN = length)
group_count
> group_count
Group.1 x
1 casein 12
2 horsebean 10
3 linseed 12
4 meatmeal 11
5 soybean 14
6 sunflower 12
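For plain counts, base R's table function gives the same numbers and is a common alternative. A small sketch for comparison; counts_aggregate and counts_table are illustrative names.

```r
df <- chickwts

# Counts via aggregate(), as in the example above
counts_aggregate <- aggregate(df$feed, by = list(df$feed), FUN = length)

# Counts via table(), converted to a data frame with columns Var1 and Freq
counts_table <- as.data.frame(table(df$feed))

# Both approaches should agree group by group
all(counts_aggregate$x == counts_table$Freq)
```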
Group quantiles
# Build a dataset: about one year of simulated daily returns for a fund
set.seed(1)
# install.packages(c("lubridate", "xts"))  # run once if not yet installed
library(lubridate)
Dates <- seq(dmy("01/01/2014"), dmy("01/01/2015"), by = "day")
Return <- rnorm(length(Dates))
library(xts)
tserie <- xts(Return, Dates)
head(tserie)
> head(tserie)
[,1]
2014-01-01 -0.6264538
2014-01-02 0.1836433
2014-01-03 -0.8356286
2014-01-04 1.5952808
2014-01-05 0.3295078
2014-01-06 -0.8204684
The 5% and 95% quantiles of the returns can then be computed for each month:
dat <- aggregate(tserie ~ month(index(tserie)), FUN = quantile,
probs = c(0.05, 0.95))
colnames(dat)[1] <- "Month"
dat
> dat
Month V1.5% V1.95%
1 1 -1.704122 1.427575
2 2 -1.099533 1.316474
3 3 -1.388600 1.819083
4 4 -1.083452 1.639272
5 5 -1.652789 1.259811
6 6 -1.406464 2.147217
7 7 -1.337666 1.637731
8 8 -1.669366 1.308261
9 9 -1.635192 1.155433
10 10 -1.371251 1.874883
11 11 -1.445358 1.505385
12 12 -2.091900 1.525886
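When FUN returns more than one value, as quantile does here, aggregate stores the results in a single matrix column rather than in separate columns. The sketch below illustrates this with base R only; it assumes plain Date vectors instead of lubridate and xts and covers 2014 only, so its numbers will differ from the output above. The names returns, q and flat are illustrative.

```r
set.seed(1)

# One simulated return per day of 2014 (base R dates, no extra packages)
dates <- seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "day")
returns <- data.frame(Month = format(dates, "%m"),
                      Return = rnorm(length(dates)))

# Extra arguments such as probs are passed on to FUN
q <- aggregate(Return ~ Month, data = returns, FUN = quantile,
               probs = c(0.05, 0.95))

# The two quantiles live in one matrix column, so q has only two columns
is.matrix(q$Return)
dim(q)

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, q)
dim(flat)
```

Flattening is useful before writing the result to a file or joining it with other tables, since many functions expect plain vector columns.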
Aggregating by multiple columns
# Create the dataset
set.seed(1)
cat_var <- sample(c("A", "B", "C"), nrow(df), replace = TRUE)
df_2 <- cbind(df, cat_var)
head(df_2)
> head(df_2)
weight feed cat_var
1 179 horsebean A
2 160 horsebean C
3 136 horsebean A
4 227 horsebean B
5 217 horsebean A
6 168 horsebean C
- Statistics can be computed over several categorical variables at once:
aggregate(df_2$weight, by = list(df_2$feed, df_2$cat_var), FUN = sum)
aggregate(weight ~ feed + cat_var, data = df_2, FUN = sum) # equivalent; keeps the original column names
feed cat_var weight
casein A 1005
horsebean A 532
linseed A 1079
meatmeal A 242
soybean A 1738
sunflower A 882
casein B 1131
horsebean B 494
linseed B 780
meatmeal B 2244
soybean B 1355
sunflower B 2109
casein C 1747
horsebean C 576
linseed C 766
meatmeal C 560
soybean C 357
sunflower C 956
# Create a new dataset
set.seed(1)
num_var <- rnorm(nrow(df))
df_3 <- cbind(num_var, df)
head(df_3)
> head(df_3)
num_var weight feed
1 -0.6264538 179 horsebean
2 0.1836433 160 horsebean
3 -0.8356286 136 horsebean
4 1.5952808 227 horsebean
5 0.3295078 217 horsebean
6 -0.8204684 168 horsebean
- When working with two or more numeric variables, combine them with the cbind function:
aggregate(cbind(df_3$num_var, df_3$weight), list(df_3$feed), mean)
Group.1 V1 V2
casein 0.4043795 323.5833
horsebean 0.1322028 160.2000
linseed 0.3491303 218.7500
meatmeal 0.2125804 276.9091
soybean -0.2314387 246.4286
sunflower 0.1651836 328.9167
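The cbind call above loses the variable names (the result columns are V1 and V2). One way to keep them, sketched here under the same setup, is to pass the numeric columns as a data frame subset and to name the grouping list; res_named is an illustrative name.

```r
# Rebuild df_3 as in the post: a random numeric column next to chickwts
set.seed(1)
df_3 <- cbind(num_var = rnorm(nrow(chickwts)), chickwts)

# Selecting columns by name keeps num_var and weight as column names,
# and the named list labels the grouping column as feed
res_named <- aggregate(df_3[, c("num_var", "weight")],
                       by = list(feed = df_3$feed),
                       FUN = mean)
res_named
```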
Of course, the function can also be applied to several numeric variables and several categorical variables at the same time.
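A minimal sketch of that combined case, assuming a data frame that holds both extra columns used earlier in the post. The name df_4 is hypothetical, and because the random draws are made in a different order than above, the values will not match the earlier tables.

```r
set.seed(1)

# Hypothetical data frame with two numeric and two categorical variables
df_4 <- data.frame(chickwts,
                   num_var = rnorm(nrow(chickwts)),
                   cat_var = sample(c("A", "B", "C"), nrow(chickwts),
                                    replace = TRUE))

# cbind() on the left lists the numeric variables,
# "+" on the right lists the grouping variables
res_multi <- aggregate(cbind(num_var, weight) ~ feed + cat_var,
                       data = df_4, FUN = mean)
head(res_multi)
```

The result has one row per combination of feed and cat_var that actually occurs in the data, with one mean column per numeric variable.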