
Grouped statistics in R: the base aggregate() function and the reshape2 package

2021-09-15  小贝学生信

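Grouped statistics means summarizing one or more sets of observations (for example, scores in each subject) by the levels of a grouping variable (for example, class), giving the distribution characteristics of each group (mean, sum, ...). I currently know three ways to do this: (1) the aggregate() function from the base stats package; (2) the melt() and dcast() pair from the reshape2 package; (3) group_by() and summarise() from the dplyr package. dplyr operations are already covered in my earlier note "dplyr表格操作 - 简书 (jianshu.com)", so this section focuses on the first two approaches.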

1. aggregate()

Form 1: aggregate(observations, grouping information, summary function)

head(state.x77)
#           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
# Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
# Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
# Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
# Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
# California      21198   5114        1.1    71.71   10.3    62.6    20 156361
# Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
dim(state.x77)
#[1] 50  8
str(state.region)
#Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
table(state.region)
# state.region
#     Northeast         South North Central          West 
#             9            16            12            13 

aggregate(state.x77, list(Region = state.region), mean)
#       Region Population   Income Illiteracy Life Exp    Murder  HS Grad    Frost      Area
# 1     Northeast   5495.111 4570.222   1.000000 71.26444  4.722222 53.96667 132.7778  18141.00
# 2         South   4208.125 4011.938   1.737500 69.70625 10.581250 44.34375  64.6250  54605.12
# 3 North Central   4803.000 4611.083   0.700000 71.76667  5.275000 54.51667 138.8333  62652.00
# 4          West   2915.308 4702.615   1.023077 71.23462  7.215385 62.00000 102.1538 134463.00


# Two grouping variables
aggregate(state.x77[,1:4], # select the observation columns to summarize
          list(Region = state.region, 
               Cold = state.x77[,"Frost"] > 130), # two grouping variables
          mean)
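
As a sanity check on form 1, the sketch below (base R only) aggregates a single column and compares the result with tapply(), which computes the same per-group means. The names income, by_region and check are illustrative and do not come from the original post.

```r
# Form 1 on a single column: mean Income per region
income <- state.x77[, "Income"]
by_region <- aggregate(income, list(Region = state.region), mean)

# tapply() computes the same group means, in factor-level order
check <- tapply(income, state.region, mean)

# The aggregated column is named "x" because a bare vector was passed in
all.equal(by_region$x, as.vector(check))
```

Passing a data frame or matrix instead of a bare vector keeps the original column names, as in the state.x77 calls above.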

Form 2: aggregate(observations ~ grouping information, data set, summary function)

# value ~ group
aggregate(weight ~ feed, data = chickwts, mean)    
# value ~ group1 + group2
aggregate(breaks ~ wool + tension, data = warpbreaks, mean) 
# cbind(value1, value2) ~ group
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean) 
# cbind(value1, value2) ~ group1 + group2
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)

## Dot notation: . stands for all columns not otherwise named in the formula
aggregate(. ~ Species, data = iris, mean)
aggregate(len ~ ., data = ToothGrowth, mean)
## Often followed by xtabs():
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
#   supp dose   len
# 1   OJ  0.5 13.23
# 2   VC  0.5  7.98
# 3   OJ  1.0 22.70
# 4   VC  1.0 16.77
# 5   OJ  2.0 26.06
# 6   VC  2.0 26.14
xtabs(len ~ ., data = ag)
#     dose
# supp   0.5     1     2
#   OJ 13.23 22.70 26.06
#   VC  7.98 16.77 26.14
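
One difference between the two forms is worth knowing when the data contain missing values. The formula interface (form 2) by default drops every row that has an NA in any variable named in the formula, while form 1 keeps all rows and leaves NA handling to the summary function. The sketch below illustrates this with airquality, where Ozone has missing values and Temp does not. The object names are illustrative.

```r
# Form 2: rows with NA in any formula variable are dropped before summarizing
temp_only  <- aggregate(Temp ~ Month, data = airquality, mean)
ozone_temp <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

# Temp has no NAs, but adding Ozone to the formula removes rows,
# so the two Temp columns are not expected to agree
isTRUE(all.equal(temp_only$Temp, ozone_temp$Temp))

# Form 1 keeps every row; NAs are removed per column through na.rm = TRUE
per_column <- aggregate(airquality[, c("Ozone", "Temp")],
                        list(Month = airquality$Month),
                        mean, na.rm = TRUE)

# Here the Temp means match the Temp-only formula result
isTRUE(all.equal(per_column$Temp, temp_only$Temp))
```

If several columns with different missing patterns are summarized together, form 1 with na.rm = TRUE is usually the safer choice.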

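2. reshape2

Approach: first melt() the data into long format, then use dcast() to compute the grouped statistics. As with the formula form of aggregate(), the grouping information and the observations must be in the same data frame.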

library(reshape2)
library(dplyr)
airquality %>% head
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6

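# id (partially matched to id.vars) names the columns that hold category, grouping, or ID information (by name or by position)
# All columns not listed in id are treated as measured (observation) columns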
melt(airquality, id=c("Month", "Day")) %>% head
#   Month Day variable value
# 1     5   1    Ozone    41
# 2     5   2    Ozone    36
# 3     5   3    Ozone    12
# 4     5   4    Ozone    18
# 5     5   5    Ozone    NA
# 6     5   6    Ozone    28
melt(ChickWeight, id=2:4) %>% head

# variable.name sets the name of the variable column (default: variable)
# value.name sets the name of the value column (default: value)
melt(airquality, id=c("Month", "Day"), 
     variable.name = "AA",   
     value.name = "aa") %>% head   
#   Month Day    AA aa
# 1     5   1 Ozone 41
# 2     5   2 Ozone 36
# 3     5   3 Ozone 12
# 4     5   4 Ozone 18
# 5     5   5 Ozone NA
# 6     5   6 Ozone 28

melt() also works on lists and produces a format that I previously had to assemble by hand, which is very convenient.
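
A minimal sketch of melting a list, assuming reshape2 is installed. The list scores and its element names are made up for illustration. For a named list of vectors, melt() returns one row per element, with the values in a column called value and the list names in a column called L1.

```r
library(reshape2)

# Hypothetical data: scores for two classes of different sizes
scores <- list(classA = c(90, 85, 77), classB = c(88, 92))

# One row per score; L1 records which list element each value came from
long <- melt(scores)
long

# The long result can go straight into aggregate()
aggregate(value ~ L1, data = long, mean)
```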

aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
head(aqm)
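# group ~ variable: one output column per measured variable
dcast(aqm, Month ~ variable, mean)
# Several grouping variables
dcast(aqm, Day + Month ~ variable, mean)
# margins: whether to add marginal (overall) summaries
dcast(aqm, Month ~ variable, mean, margins = TRUE)
# Without an aggregation function, Day + Month identifies each row, so this restores the original wide layout
dcast(aqm, Day + Month ~ variable) %>% head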

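# With a group ~ group formula and no aggregation function, dcast() falls back to length() (counts) whenever a cell holds more than one value
chick_m <- melt(ChickWeight, id=2:4, na.rm=TRUE)
head(chick_m)
dcast(chick_m, Time ~ variable, mean) # average effect of time
dcast(chick_m, Diet ~ variable, mean) # average effect of diet
dcast(chick_m, Diet ~ Chick) # number of records in each Diet x Chick cell (two grouping variables)
dcast(chick_m, Time + Diet ~ Chick) # three grouping variables: each cell holds a single weight, so the values themselves are returned; pass length explicitly to get counts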
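
The fallback to length() only happens when a cell contains more than one value, so it is safer to state the aggregation function explicitly. The sketch below, assuming reshape2 is installed, contrasts an explicit count with a cast whose cells are unique. The object names n_by_diet and wide are illustrative.

```r
library(reshape2)

chick_m <- melt(ChickWeight, id.vars = 2:4, na.rm = TRUE)

# Explicit count: number of weight records per diet
n_by_diet <- dcast(chick_m, Diet ~ variable,
                   fun.aggregate = length, value.var = "value")
n_by_diet

# Time + Diet + Chick identifies every weight uniquely, so no aggregation
# takes place and the cells hold the weights themselves (NA where a chick
# was not on that diet)
wide <- dcast(chick_m, Time + Diet ~ Chick, value.var = "value")
wide[wide$Time == 0 & wide$Diet == 1, "1"]  # weight of chick 1 at time 0
```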

The long-format table produced by melt() is also the format that ggplot2 expects for plotting.
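
A short sketch of that hand-off, assuming both reshape2 and ggplot2 are installed. The choice of a boxplot with one panel per variable is only an illustration, not something taken from the original post.

```r
library(reshape2)
library(ggplot2)

# Long format: one row per (Month, Day, variable) measurement
aqm <- melt(airquality, id.vars = c("Month", "Day"), na.rm = TRUE)

# The variable column drives the facets, the value column the y axis
p <- ggplot(aqm, aes(x = factor(Month), y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "Month", y = "Value")
```

Calling print(p) draws the figure. Because every measured variable sits in the same value column, adding or removing a variable requires no change to the plotting code.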
