Grouped statistics in R: the base aggregate() function and the reshape2 package
2021-09-15
小贝学生信
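Grouped statistics means taking one or more sets of observations (for example, scores in each subject), splitting them by the levels of a grouping variable (for example, class), and computing summary statistics for each group (mean, sum, and so on). I know of three ways to do this: the aggregate() function from the base stats package; the melt() and dcast() pair from the reshape2 package; and group_by() with summarise() from the dplyr package. I already covered dplyr in an earlier note, dplyr表格操作 - 简书 (jianshu.com), so this section focuses on the first two approaches.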
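1. aggregate()
Form 1: aggregate(observations, grouping, summary function)
- The observations are a data frame and may have several columns (a matrix such as state.x77 is converted to a data frame first);
- The grouping is a list and may hold several grouping variables, but every element of the list must have as many entries as the observations have rows;
- The summary function is something like mean or sum.
The example below makes this concrete.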
head(state.x77)
# Population Income Illiteracy Life Exp Murder HS Grad Frost Area
# Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
# Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
# Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
# Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
# California 21198 5114 1.1 71.71 10.3 62.6 20 156361
# Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
dim(state.x77)
#[1] 50 8
str(state.region)
#Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
table(state.region)
# state.region
# Northeast South North Central West
# 9 16 12 13
aggregate(state.x77, list(Region = state.region), mean)
# Region Population Income Illiteracy Life Exp Murder HS Grad Frost Area
# 1 Northeast 5495.111 4570.222 1.000000 71.26444 4.722222 53.96667 132.7778 18141.00
# 2 South 4208.125 4011.938 1.737500 69.70625 10.581250 44.34375 64.6250 54605.12
# 3 North Central 4803.000 4611.083 0.700000 71.76667 5.275000 54.51667 138.8333 62652.00
# 4 West 2915.308 4702.615 1.023077 71.23462 7.215385 62.00000 102.1538 134463.00
# Two grouping variables
aggregate(state.x77[,1:4], # observations: selected columns only
          list(Region = state.region,
               Cold = state.x77[,"Frost"] > 130), # two grouping variables
          mean)
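As a small sketch of the length requirement in Form 1, the snippet below uses the same state.x77 and state.region datasets. The names obs, ok, and bad are made up for illustration. It runs one well-formed call and one call whose grouping vector is too short, and captures the error instead of stopping.

```r
# Observations as a data frame: two columns of state.x77
obs <- as.data.frame(state.x77[, c("Population", "Income")])

# Grouping vector has 50 entries, one per row of obs: this works
ok <- aggregate(obs, list(Region = state.region), mean)
print(ok)

# Grouping vector has only 10 entries: aggregate() raises an error,
# which tryCatch() turns into a message string here
bad <- tryCatch(
  aggregate(obs, list(Region = state.region[1:10]), mean),
  error = function(e) conditionMessage(e)
)
print(bad)
```

Subsetting the observations and the grouping vectors from the same source with the same row filter is the simplest way to keep their lengths in step.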
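Form 2: aggregate(observations ~ grouping, dataset, summary function)
- This form requires a single data frame that contains both the observations and the grouping variables.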
# value ~ group
aggregate(weight ~ feed, data = chickwts, mean)
# value ~ group1 + group2
aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
# cbind(value1, value2) ~ group
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
# cbind(value1, value2) ~ group1 + group2
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)
## Dot notation: . stands for every column not otherwise named in the formula
aggregate(. ~ Species, data = iris, mean)
aggregate(len ~ ., data = ToothGrowth, mean)
## Often followed by xtabs():
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
# supp dose len
# 1 OJ 0.5 13.23
# 2 VC 0.5 7.98
# 3 OJ 1.0 22.70
# 4 VC 1.0 16.77
# 5 OJ 2.0 26.06
# 6 VC 2.0 26.14
xtabs(len ~ ., data = ag)
# dose
# supp 0.5 1 2
# OJ 13.23 22.70 26.06
# VC 7.98 16.77 26.14
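One difference between the two forms is easy to miss: the formula form drops incomplete rows before grouping (its default is na.action = na.omit), so with cbind() a missing value in one column also removes that row from the other columns. Here is a minimal sketch on airquality, where Ozone has missing values and Temp does not. The names by_formula and by_list are illustrative.

```r
# Formula form: rows with NA in Ozone are dropped for Temp as well
by_formula <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

# List form: extra arguments are passed to the summary function,
# so na.rm = TRUE is applied to each column separately
by_list <- aggregate(airquality[, c("Ozone", "Temp")],
                     list(Month = airquality$Month),
                     mean, na.rm = TRUE)

# Ozone means agree; Temp means differ because the formula form
# used fewer rows
merge(by_formula, by_list, by = "Month", suffixes = c(".formula", ".list"))
```

Which behaviour is appropriate depends on the analysis; the point is only that the two forms are not interchangeable when the data contain NA.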
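2. The reshape2 package
Approach: first melt() the data into long format, then use dcast() to compute the grouped statistics. As before, the grouping variables and the observations must be in the same data frame.
- melt: reshape to long format (also the shape ggplot2 needs for plotting)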
library(reshape2)
library(dplyr)
airquality %>% head
# Ozone Solar.R Wind Temp Month Day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
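# id (short for id.vars) names the categorical, grouping, or ID columns (by name or by position)
# every column not listed in id is treated as an observation column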
melt(airquality, id=c("Month", "Day")) %>% head
# Month Day variable value
# 1 5 1 Ozone 41
# 2 5 2 Ozone 36
# 3 5 3 Ozone 12
# 4 5 4 Ozone 18
# 5 5 5 Ozone NA
# 6 5 6 Ozone 28
melt(ChickWeight, id=2:4) %>% head
# variable.name sets the name of the variable column (default: variable)
# value.name sets the name of the value column (default: value)
melt(airquality, id=c("Month", "Day"),
variable.name = "AA",
value.name = "aa") %>% head
# Month Day AA aa
# 1 5 1 Ozone 41
# 2 5 2 Ozone 36
# 3 5 3 Ozone 12
# 4 5 4 Ozone 18
# 5 5 5 Ozone NA
# 6 5 6 Ozone 28
melt() also works on a list, and it produces the layout I previously had to assemble by hand, which is very convenient.
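A minimal sketch of melt() on a list, assuming reshape2 is installed. The list scores and its element names are made up for illustration.

```r
library(reshape2)

# A named list whose elements have different lengths (hypothetical data)
scores <- list(classA = c(90, 85, 77), classB = c(60, 72))

# melt() stacks the elements into one long data frame:
# a value column plus a column (L1) holding the element names
long <- melt(scores)
long
```

Because each element may have a different length, this is a convenient way to turn ragged per-group results into a single table.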
- dcast: grouped statistics, dcast(dataset, grouping ~ variable, summary function)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
head(aqm)
# group ~ variable column
dcast(aqm, Month ~ variable, mean)
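# Several grouping variables
dcast(aqm, Day + Month ~ variable, mean)
# margins: whether to also compute the statistic across all groups
dcast(aqm, Month ~ variable, mean, margins = TRUE)
# With no summary function and one value per cell, this returns the original wide table
dcast(aqm, Day + Month ~ variable) %>% head
# dcast(dataset, group ~ group) with no summary function counts the rows in each cell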
chick_m <- melt(ChickWeight, id=2:4, na.rm=TRUE)
head(chick_m)
dcast(chick_m, Time ~ variable, mean) # average effect of time
dcast(chick_m, Diet ~ variable, mean) # average effect of diet
dcast(chick_m, Diet ~ Chick) # counts for each combination of two grouping variables
dcast(chick_m, Time + Diet ~ Chick) # counts for each combination of three grouping variables
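When several values fall into one cell and no summary function is given, dcast() falls back to length and prints a message saying so. The sketch below passes length explicitly and checks the counts against table(). It assumes reshape2 is installed, and counts is an illustrative name.

```r
library(reshape2)

chick_m <- melt(ChickWeight, id = 2:4, na.rm = TRUE)

# Explicit fun.aggregate = length: number of weight records per diet
counts <- dcast(chick_m, Diet ~ variable,
                fun.aggregate = length, value.var = "value")
counts

# The same numbers from base R
table(ChickWeight$Diet)
```

Naming the summary function and value.var explicitly avoids the fallback message and makes the intent of the call clear.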
The long-format table produced by melt() is also the format ggplot2 expects for plotting.