R 数据可视化 —— ggplot 统计图层

2021-04-22 本文已影响0人名本无名

前言

虽然我们介绍了这么多节的 ggplot2，我们在绘制图层时基本上使用的都是 geom_*() 函数，却很少使用 stat_*() 函数。

当然，使用 geom_*() 函数已经可以完成绝大部分的绘图工作了，那还有必要使用 stat_*() 函数吗？

我们来看一例子，假设有如下数据

> select(diamonds, cut, price)
# A tibble: 53,940 x 2
   cut       price
   <ord>     <int>
 1 Ideal       326
 2 Premium     326
 3 Good        327
 4 Premium     334
 5 Good        335
 6 Very Good   336
 7 Very Good   336
 8 Very Good   337
 9 Fair        337
10 Very Good   338
# … with 53,930 more rows

我们想要绘制一个柱状图，用于展示每种切工的平均价格。

常规的方法是，使用 tidyverse 的函数来对数据进行整理，然后计算出需要的统计数值，并映射到相应的图形属性，即

select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    .groups = "drop"
  ) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col()

现在，我们并不满足于此。现在，我们想要在柱状图上添加误差线

当然，这也很简单，我们可以再对数据进行统计计算，然后绘制

select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    .groups = "drop",
    se = sqrt(var(price)/length(price))
  ) %>%
  mutate(lower = mean_price - se, upper = mean_price + se) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.5)

en...，为了绘制这么一个简单的图片，我们写的代码比图片都长。

因为我们的观念还停留在，先准备好数据，然后将数据映射到图形属性。

这样就导致需要对数据进行很多统计计算，并不符合数据的整洁之道。

我们可以这样想，既然所有的统计信息都来源于同一个数据，那我们何不直接将数据传递给 ggplot，让数据的统计计算在内部进行呢？

我们可以这样改写

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(geom = "errorbar", width = 0.5)

两行代码就能搞定，为啥要写那么多呢，节约的时间喝杯茶多好。

原理解析

学习和理解了 stat_summary 函数的工作原理，那么其他的 stat_* 函数也就很好理解了。

那我们该如何理解 stat_summary 呢？还是来举个例子吧

使用上面的数据，我们绘制切工与价格的点图

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  geom_point()

然后使用不带参数的 stat_summary 来替换 geom_point 看看会发生什么

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary()

绘制的是 pointrange 对象。

我们先看看 stat_summary 函数

stat_summary(
  mapping = NULL,
  data = NULL,
  geom = "pointrange",
  position = "identity",
  ...,
  fun.data = NULL,
  fun = NULL,
  fun.max = NULL,
  fun.min = NULL,
  fun.args = list(),
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  fun.y,
  fun.ymin,
  fun.ymax
)

默认绘制的是 pointrange，那 pointrange 需要定义哪些属性映射呢？

x 或 y
ymin 或 xmin
ymax 或 xmax

但是，我们并没有定义 ymin、ymax，那应该是 stat_summary 计算出了相应的值，并传递给 pointrange

如何验证我们的猜想？首先，我们看到运行上述代码会输出一个警告信息

No summary function supplied, defaulting to `mean_se()`

也就是说，默认情况下会应用 mean_se() 函数变换

我们来看看 mean_se() 做了什么操作

> mean_se
function (x, mult = 1) 
{
    x <- stats::na.omit(x)
    se <- mult * sqrt(stats::var(x)/length(x))
    mean <- mean(x)
    new_data_frame(list(y = mean, ymin = mean - se, ymax = mean + 
        se), n = 1)
}
<bytecode: 0x7fca56dfa5d0>
<environment: namespace:ggplot2>

我们可以看到，该函数返回的数据框包含三个值，正好是 pointrange 所需要传入的参数

我们可以使用 layer_data() 函数，来提取图层中使用的数据

> p <- select(diamonds, cut, price) %>%
+   ggplot(aes(cut, price, colour = cut)) +
+   stat_summary()
>
> layer_data(p, 1)
No summary function supplied, defaulting to `mean_se()`
     colour x group        y     ymin     ymax PANEL flipped_aes size linetype shape fill alpha stroke
1 #440154FF 1     1 4358.758 4270.025 4447.491     1       FALSE  0.5        1    19   NA    NA      1
2 #3B528BFF 2     2 3928.864 3876.302 3981.426     1       FALSE  0.5        1    19   NA    NA      1
3 #21908CFF 3     3 3981.760 3945.953 4017.567     1       FALSE  0.5        1    19   NA    NA      1
4 #5DC863FF 4     4 4584.258 4547.223 4621.293     1       FALSE  0.5        1    19   NA    NA      1
5 #FDE725FF 5     5 3457.542 3431.600 3483.484     1       FALSE  0.5        1    19   NA    NA      1

然后与使用 mean_se() 函数的计算结果对比

> select(diamonds, cut, price) %>%
+   group_by(cut) %>%
+   summarise(mean_se(price))
# A tibble: 5 x 4
  cut           y  ymin  ymax
* <ord>     <dbl> <dbl> <dbl>
1 Fair      4359. 4270. 4447.
2 Good      3929. 3876. 3981.
3 Very Good 3982. 3946. 4018.
4 Premium   4584. 4547. 4621.
5 Ideal     3458. 3432. 3483.

我们可以看到，y、ymin、 ymax 这三个参数的值与 mean_se() 的计算结果是一致的

使用

既然可以定了变换函数，那我们定义自己的统计变换，就可以根据需要对图形进行一些个性化调整了。

stat_summary() 函数的参数 fun.data 可以指定统计变换函数，默认为 mean_se()

fun.data 传入的函数，要求返回数据框，而数据框变量名为属性映射参数

下面我们来绘制一些个性化的图片

1. 95% 置信区间误差线

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(
    geom = "errorbar", width = 0.5,
    fun.data = ~mean_se(., mult = 1.96)
  )

注意：我们使用 ~ 符号来构造匿名函数，相当于

function(x) {mean_se(x, mult = 1.96)}

2. 指定填充色

我们使用变换函数来设置满足条件的分组的颜色，将分组的中值大于和小于阈值的组用颜色分开

func_median_color <- function(x, cut_off) {
  tibble(y = median(x)) %>%
    mutate(fill = if_else(y < cut_off, "#80b1d3", "#fb8072"))
}

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price)) +
  stat_summary(
    fun.data = func_median_color,
    fun.args = c(cut_off = 2800),
    geom = "bar"
  )

image

我们将额外的参数传递给 fun.args，替换匿名函数的方式，即相当于

fun.data = ~ func_median_color(., cut_off = 2800)

3. 设置点线图点的大小

我们根据分组中的观测值的数目来设置点线图中点的大小

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary(
    fun.data = function(x) {
      mean_se(x) %>%
        mutate(size = length(x) * 5 / nrow(diamonds))
    }
  )

R 数据可视化 —— ggplot 统计图层

前言

原理解析

使用

1. 95% 置信区间误差线

2. 指定填充色

3. 设置点线图点的大小

猜你喜欢

热点阅读