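ggplot2 in R: Chapter 5, Toolbox
Contents
- 5.1 Introduction
- 5.2 Overall layering strategy
- 5.3 Basic plot types
- 5.4 Displaying distributions
- 5.5 Dealing with overplotting
- 5.6 Surface plots
- 5.7 Drawing maps
- 5.8 Revealing uncertainty
- 5.9 Statistical summaries
- 5.10 Annotating a plot
- 5.11 Weighted data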
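5.1 Introduction
This chapter uses a mix of ggplot() and qplot() calls to survey the basic geoms and statistical transformations (stats).
5.2 Overall layering strategy
Layers serve three purposes:
- To display the data itself.
- To display a statistical summary of the data.
- To add extra metadata, context, and annotations.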
library(ggplot2)
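As a rough illustration of the three purposes, the sketch below stacks one layer of each kind on the mpg data set that ships with ggplot2. The choice of variables and the label position are assumptions made for this example only.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  # Layer 1: the data itself
  geom_point() +
  # Layer 2: a statistical summary (a fitted straight line)
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Layer 3: metadata / annotation (label text and position are arbitrary)
  annotate("text", x = 6, y = 42, label = paste("n =", nrow(mpg)))
```

Printing p draws all three layers on top of each other, in the order they were added.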
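5.3 Basic plot types
The basic geoms are area plots, bar charts, line plots, scatterplots, polygons, text labels, and tile plots (level plots). The code below draws each of them from the same small data set: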
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)
p <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
p + geom_point() + labs(title = "geom_point")
p + geom_bar(stat = "identity") + labs(title = "geom_bar(stat = \"identity\")")
p + geom_line() + labs( title = "geom_line")
p + geom_area() + labs(title = "geom_area")
p + geom_path() + labs(title = "geom_path")
p + geom_text(aes(label = label)) + labs(title = "geom_text")
p + geom_tile() + labs(title = "geom_tile")
p + geom_polygon() + labs(title = "geom_polygon")
These geoms are simple, so the resulting plots are not shown here.
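5.4 Displaying distributions
Example: for a one-dimensional continuous distribution, the most important geoms are the histogram and the frequency polygon. Both use stat_bin and show counts by default; the examples below map y to ..density.. so that groups of different sizes can be compared. Never rely on the default parameters to give a revealing view of the distribution: always experiment with binwidth.
All three plots below show an interesting pattern: as diamond quality increases, the depth distribution shifts to the left and becomes more symmetric.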
depth_dist <- ggplot(diamonds, aes(depth)) + xlim(58, 68)
depth_dist +
geom_histogram(aes(y = ..density..), binwidth = 0.1) +
facet_grid(cut ~.)
depth_dist + geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill")
depth_dist + geom_freqpoly(aes(y = ..density.., colour = cut), binwidth = 0.1)
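To make concrete what ..density.. means, the following sketch builds one density histogram and checks that the bar areas sum to 1. It assumes ggplot2 3.3.0 or later, where after_stat(density) is the newer spelling of ..density..; layer_data() is used here only to inspect the computed values.

```r
library(ggplot2)

p_dens <- ggplot(diamonds, aes(depth)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1)

# One row per bin, as computed by stat_bin
bins <- layer_data(p_dens)

# Bar heights times bar widths: the total area under the histogram
area <- sum(bins$density * (bins$xmax - bins$xmin))
area
```

Because every group is rescaled to unit area, the shapes of groups with very different sizes can be compared directly.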
Example: boxplots conditioned on a categorical variable and on a continuous variable
library(plyr)
qplot(cut, depth, data = diamonds, geom = "boxplot")
qplot(carat, depth, data = diamonds, geom = "boxplot", group = round_any(carat, 0.1, floor),xlim = c(0, 3))
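The grouping expression round_any(carat, 0.1, floor) rounds each carat value down to a multiple of 0.1, so every boxplot covers one 0.1-wide bin. A base R sketch of the same idea, with a hypothetical helper named bin_floor and made-up values:

```r
# Hypothetical helper: round x down to a multiple of `width`
bin_floor <- function(x, width) floor(x / width) * width

carat_values <- c(0.23, 0.31, 0.38, 1.09)  # made-up example values
bins <- bin_floor(carat_values, 0.1)
bins

# How many observations fall in each bin
table(bins)
```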
Example: a jittered plot adds random noise to a discrete distribution to avoid overplotting. It is a fairly crude method.
qplot(class, cty, data = mpg, geom = "jitter")
qplot(class, drv, data = mpg, geom = "jitter")
Example: density plots. Use a density plot only when you know that the underlying density is smooth, continuous, and unbounded.
qplot(depth, data = diamonds, geom = "density", xlim = c(54, 70))
qplot(depth, data = diamonds, geom = "density", xlim = c(54, 70), fill = cut, alpha = I(0.2))
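5.5 Dealing with overplotting
The scatterplot is an important tool for studying the relationship between two continuous variables. With large data sets, however, points often overlap and hide the true relationship, so any conclusion drawn from such a plot is questionable. This problem is called overplotting.
- Method 1: small amounts of overplotting can be reduced by drawing smaller points or hollow glyphs.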
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y))
norm + geom_point()
norm + geom_point(shape = 1)
norm + geom_point(shape = ".") ## points are one pixel in size
- Method 2: for larger data sets, make the points transparent with alpha. The smallest alpha value that R supports is 1/256.
norm + geom_point(colour = "black", alpha = 1/3)
norm + geom_point(colour = "black", alpha = 1/5)
norm + geom_point(colour = "black", alpha = 1/10)
- Method 3: add random jitter to the points to reduce overlap.
td <- ggplot(diamonds, aes(table, depth)) + xlim(50, 70) + ylim(50, 70)
td + geom_point()
td + geom_jitter()
jit <- position_jitter(width = 0.5)
td + geom_jitter(position = jit)
td + geom_jitter(position = jit, colour = "black", alpha = 1/10)
td + geom_jitter(position = jit, colour = "black", alpha = 1/50)
td + geom_jitter(position = jit, colour = "black", alpha = 1/200)
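Jittering spreads out points that share the same coordinates. A minimal base R sketch of the idea, using made-up data and uniform noise of at most 0.5 in either direction (an assumption chosen to mirror width = 0.5, not a copy of the ggplot2 implementation):

```r
set.seed(42)

# Made-up data: 300 points piled onto three x values
x <- rep(c(55, 56, 57), each = 100)

# Add uniform noise in [-0.5, 0.5] so each pile spreads over its own unit interval
x_jit <- x + runif(length(x), min = -0.5, max = 0.5)

length(unique(x))      # distinct positions before jittering
length(unique(x_jit))  # distinct positions after jittering
```

Keeping the noise within half the spacing between values means neighbouring piles do not bleed into each other.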
- Method 4: borrowing the idea of a 2D kernel density plot, divide the plane into bins, count the points in each bin, and visualise the counts.
d <- ggplot(diamonds, aes(carat, price)) + xlim(1,3) +theme(legend.position = "none")
d + stat_bin2d()
d + stat_bin2d(bins = 10)
d + stat_bin2d(binwidth = c(0.02, 200))
d + stat_binhex()
d + stat_binhex(bins = 10)
d + stat_binhex(binwidth = c(0.02, 200))
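The counts behind a binned plot can be reproduced roughly by hand. The sketch below cuts carat and price into 10 intervals each and tabulates the pairs; the exact bin edges are an assumption of this example and need not match those chosen by stat_bin2d.

```r
library(ggplot2)  # only for the diamonds data

d1 <- subset(diamonds, carat >= 1 & carat <= 3)

# Cross-tabulate the two binned variables: one row per rectangular cell
counts <- as.data.frame(table(
  carat_bin = cut(d1$carat, breaks = seq(1, 3, length.out = 11),
                  include.lowest = TRUE),
  price_bin = cut(d1$price, breaks = 10)
))

# The most crowded cells
head(counts[order(-counts$Freq), ])
```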
- Method 5: use stat_density2d to compute a 2D density estimate, then show the density with contour lines, with coloured tiles, or with points whose size is proportional to the density.
d <- ggplot(diamonds, aes(carat, price)) + xlim(1, 3) + theme(legend.position = "none")
d + geom_point() + geom_density2d()
d + stat_density2d(geom = "point", aes(size = ..density..), contour = F) + scale_size_area()
d + stat_density2d(geom = "tile", aes(fill = ..density..), contour = F)
last_plot() + scale_fill_gradient(limits = c(1e-5, 8e-4))
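stat_density2d relies on a two-dimensional kernel density estimate. A comparable estimate can be computed directly with MASS::kde2d, as sketched below; the grid size n = 50 is an arbitrary choice for the example.

```r
library(ggplot2)  # only for the diamonds data
library(MASS)     # for kde2d()

d2 <- subset(diamonds, carat >= 1 & carat <= 3)

# Evaluate the density on a 50 x 50 grid over the data range
dens <- kde2d(d2$carat, d2$price, n = 50)

# A list with grid coordinates x and y and the density matrix z
str(dens)
```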
5.6 Surface plots
Common tools: coloured tiles, contour plots, and bubble plots.
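A short sketch of two of these tools, using the faithfuld data set that ships with ggplot2 (a precomputed 2D density of the Old Faithful eruptions). The data set and aesthetics are choices made for this example, not taken from the original notes.

```r
library(ggplot2)

surf <- ggplot(faithfuld, aes(waiting, eruptions))

# Coloured tiles: fill encodes the height of the surface
tiles <- surf + geom_tile(aes(fill = density))

# Contour lines: z gives the height of the surface
contours <- surf + geom_contour(aes(z = density))
```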
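5.7 Drawing maps
The maps package combines easily with ggplot2. There are two main reasons to use map data: to add reference outlines to a plot of spatial data, and to build a choropleth map by filling regions with colour.
Map outlines can be added with borders(), as in the following example.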
library(maps)
data(us.cities)
big_cities <- subset(us.cities, pop > 500000)
qplot(long, lat, data = big_cities) +borders("state", size = 0.5)
tx_cities <- subset(us.cities, country.etc == "TX")
ggplot(tx_cities, aes(long, lat))+
borders("county", "texas", colour = "grey70") +
geom_point(colour = "black", alpha = 0.5)
Choropleth maps: use map_data() to convert map data into a data frame, combine that data frame with your own data using merge(), and then draw the filled polygons, as shown below:
library(maps)
states <- map_data("state")
arrests <- USArrests
names(arrests) <- tolower(names(arrests))
arrests$region <- tolower(rownames(USArrests))
choro <- merge(states, arrests, by = "region")
choro <- choro[order(choro$order),]
qplot(long, lat, data = choro, group = group, fill = assault, geom = "polygon")
qplot(long, lat, data = choro, group = group, fill = assault / murder, geom = "polygon")
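The reordering step in the code above matters because merge() sorts its result by the key column, while polygons are drawn by connecting points in row order. A toy illustration with made-up data:

```r
# Made-up outline: two regions, points numbered in drawing order
outline <- data.frame(
  region = c("b", "b", "a", "a"),
  order  = 1:4
)
values <- data.frame(region = c("a", "b"), value = c(10, 20))

merged <- merge(outline, values, by = "region")
merged$order   # no longer 1, 2, 3, 4: rows are now sorted by region

# Restore the drawing order before plotting
merged <- merged[order(merged$order), ]
merged$order
```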
Example: labelling map data
library(plyr)
ia <- map_data("county", "iowa")
mid_range <- function(x) mean(range(x, na.rm = TRUE))
centres <- ddply(ia, .(subregion), colwise(mid_range, .(lat, long)))
ggplot(ia, aes(long, lat))+
geom_polygon(aes(group = group), fill = NA, colour = "grey60") +
geom_text(aes(label = subregion), data = centres, size = 2, angle = 45)
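5.8 Revealing uncertainty
ggplot2 has four families of geoms for visualising uncertainty, depending on whether x is continuous or discrete and on whether you want to show only the interval or the middle value as well:
- Continuous x: geom_ribbon (interval only), geom_smooth(stat = "identity") (interval and middle value)
- Discrete x: geom_errorbar (interval only), geom_crossbar (interval and middle value); geom_linerange (interval only), geom_pointrange (interval and middle value)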
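All of these geoms expect the interval as ymin and ymax values. As a sketch, the following code derives such columns from a simple linear model on the mpg data and draws them with geom_ribbon; the model and prediction grid are assumptions made for illustration and are unrelated to the diamonds model used later in this section.

```r
library(ggplot2)

fit <- lm(hwy ~ displ, data = mpg)
grid <- data.frame(displ = seq(2, 7, by = 0.5))

# predict() returns columns fit, lwr and upr for a 95% confidence interval
pred <- cbind(grid, predict(fit, newdata = grid, interval = "confidence"))

ci_plot <- ggplot(pred, aes(displ, fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +  # interval
  geom_line()                                               # middle value
```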
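For linear models, the effects package (Fox, 2008) is well suited to extracting these values. The example below fits a two-way regression model with an interaction and shows how to extract the marginal and conditional effects.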
d <- subset(diamonds, carat <2.5 & rbinom(nrow(diamonds), 1, 0.2) == 1)
d$lcarat <- log10(d$carat)
d$lprice <- log10(d$price)
# Remove the overall linear trend
detrend <- lm(lprice ~ lcarat, data = d)
d$lprice2 <- resid(detrend)
mod <- lm(lprice2 ~ lcarat*color, data = d)
library(effects)
effectdf <- function(...){
suppressWarnings(as.data.frame(effect(...)))
}
color <- effectdf("color", mod)
both1 <- effectdf("lcarat:color", mod)
carat <- effectdf("lcarat", mod, default.levels = 50)
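both2 <- effectdf("lcarat:color", mod, default.levels = 3)
## Figure: transforming the data to remove obvious effects. Plot 1 takes log10 of both x and y to remove non-linearity; plot 2 also removes the main linear trend.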
qplot(lcarat, lprice, data = d, colour = color)
qplot(lcarat, lprice2, data = d, colour = color)
## Figure: uncertainty in the model estimates for color. The left plot shows the marginal effect of color; the right plot shows the conditional effects of color at different levels of carat. Error bars show 95% pointwise confidence intervals.
fplot <- ggplot(mapping = aes(y = fit, ymin = lower, ymax = upper)) +
ylim(range(both2$lower, both2$upper))
fplot %+% color + aes(x = color) + geom_point() + geom_errorbar(aes(ymin = lower, ymax = upper))
fplot %+% both2 +
aes(x = color, colour = lcarat, group = interaction(color, lcarat)) +
geom_errorbar(aes(ymin = lower, ymax = upper)) +
geom_line(aes(group = lcarat)) +
scale_colour_gradient()
## Figure: uncertainty in the model estimates for carat
fplot %+% carat + aes(x = lcarat) + geom_smooth(stat = "identity", se = TRUE)
ends <- subset(both1, lcarat == max(lcarat))
fplot %+% both1 + aes(x = lcarat, colour = color)+
geom_smooth(stat = "identity", se = TRUE) +
scale_colour_hue() +
theme(legend.position = "none")+
geom_text(aes(label = color, x = lcarat +0.02),ends)
5.9 Statistical summaries
stat_summary() computes a summary of the y values at each x value.
5.9.1 Individual summary functions
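# m2 is a ggplot object that is not defined in these notes; it needs x and y mapped.
# In ggplot2 3.3.0 and later, fun.y is deprecated in favour of fun.
midm <- function(x) mean(x, trim = 0.5)
m2 + stat_summary(aes(colour = "trimmed"), fun.y = midm, geom = "point") +
  stat_summary(aes(colour = "raw"), fun.y = mean, geom = "point") +
  scale_colour_hue("Mean")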
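Note that mean(x, trim = 0.5) discards everything except the middle of the data, so midm is in effect the median. A quick check in base R, followed by a hedged stat_summary sketch on the mpg data (the data set is a stand-in for the undefined m2, and the fun argument assumes ggplot2 3.3.0 or later):

```r
library(ggplot2)

midm <- function(x) mean(x, trim = 0.5)

x <- c(1, 2, 3, 4, 100)
midm(x)   # 3, the same as median(x)
mean(x)   # 22, pulled up by the outlier

# Stand-in plot: one trimmed and one raw mean per vehicle class
means <- ggplot(mpg, aes(class, hwy)) +
  stat_summary(aes(colour = "trimmed"), fun = midm, geom = "point") +
  stat_summary(aes(colour = "raw"), fun = mean, geom = "point") +
  scale_colour_hue("Mean")
```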
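5.9.2 Unified summary functions
fun.data accepts more complex summary functions that return several values at once (for example ymin and ymax), such as the summary functions from the Hmisc package.
# m is a ggplot object that is not defined in these notes.
iqr <- function(x, ...) {
  qs <- quantile(as.numeric(x), c(0.25, 0.75), na.rm = TRUE)
  names(qs) <- c("ymin", "ymax")
  qs
}
m + stat_summary(fun.data = "iqr", geom = "ribbon")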
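Current ggplot2 documentation describes fun.data as a function that returns a data frame with ymin, y and ymax. A sketch along those lines, using a hypothetical helper named quartiles and the mpg data as a stand-in for the undefined m:

```r
library(ggplot2)

# Hypothetical helper: lower quartile, median and upper quartile as one row
quartiles <- function(x) {
  qs <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE)
  data.frame(ymin = qs[[1]], y = qs[[2]], ymax = qs[[3]])
}

q <- quartiles(1:9)
q

# One point (median) and one range (interquartile range) per vehicle class
iqr_plot <- ggplot(mpg, aes(class, hwy)) +
  stat_summary(fun.data = quartiles, geom = "pointrange")
```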
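5.10 Annotating a plot
Annotations are just extra data. You can add them one at a time or many at once.
The example below adds information about US presidents to the economics data.
Plot the raw unemployment series: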
(unemp <- qplot(date, unemploy, data = economics, geom = "line", xlab = "", ylab = "No. unemployed (1000s)"))
# Add a vertical line at the start of each presidential term
presidential <- presidential[-(1:3),]
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
unemp + geom_vline(aes(xintercept = as.numeric(start)), data = presidential)
library(scales)
unemp + geom_rect(aes(NULL, NULL, xmin = start, xmax = end, fill = party), ymin = yrng[1], ymax = yrng[2], data = presidential, alpha = 0.2)+
scale_fill_manual(values = c("blue","red"))
last_plot() + geom_text(aes(x = start, y = yrng[1],label = name), data = presidential, size = 3, hjust = 0, vjust = 0)
caption <- paste(strwrap("Unemployment rates in the US have varied a lot over the years", 40), collapse = "\n")
unemp + geom_text(aes(x, y, label = caption), data = data.frame(x = xrng[2], y = yrng[2]), hjust = 1, vjust = 1, size = 4)
highest <- subset(economics, unemploy == max(unemploy))
unemp + geom_point(data = highest, size = 3, colour = "red", alpha = 0.5)
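For a single annotation, annotate() avoids passing a separate data frame to the layer. The sketch below marks the highest point this way; it is an alternative shown for illustration, not the approach used in the notes above.

```r
library(ggplot2)

base <- ggplot(economics, aes(date, unemploy)) + geom_line()

# Locate the peak once, then pass its coordinates directly to annotate()
peak <- economics[which.max(economics$unemploy), ]
marked <- base +
  annotate("point", x = peak$date, y = peak$unemploy,
           size = 3, colour = "red", alpha = 0.5)
```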
5.11 Weighted data
Example: using point size to represent weight
qplot(percwhite, percbelowpoverty, data = midwest)
qplot(percwhite, percbelowpoverty, data = midwest, size = poptotal / 1e6) +
scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))
qplot(percwhite, percbelowpoverty, data = midwest, size = area) +
scale_size_area()
Example: using population density as the weight when looking at the relationship between the percentage of white residents and the percentage of people below the poverty line
lm_smooth <- geom_smooth(method = lm, size = 1)
qplot(percwhite, percbelowpoverty, data = midwest) + lm_smooth
qplot(percwhite, percbelowpoverty, data = midwest, weight = popdensity, size = popdensity) +lm_smooth
Example: the unweighted histogram shows the number of counties; the weighted histogram shows the number of people
qplot(percbelowpoverty, data = midwest, binwidth = 1)
qplot(percbelowpoverty, data = midwest, weight = poptotal, binwidth = 1) +ylab("population")
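The weight aesthetic changes what is summed: each county contributes its weight instead of 1. A base R sketch of both sets of bin totals, using 1-unit-wide bins (the bin alignment is an assumption and may differ from the one qplot picks):

```r
library(ggplot2)  # only for the midwest data

# Assign each county to a 1-unit-wide poverty bin
bin <- floor(midwest$percbelowpoverty)

# Unweighted: every county counts as 1
counties <- tapply(rep(1, nrow(midwest)), bin, sum)

# Weighted: every county counts as its total population
population <- tapply(as.numeric(midwest$poptotal), bin, sum)

head(cbind(counties, population))
```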
That's the end of this chapter!