R-ggplot2: Fixing out-of-order axis labels when plotting

2019-12-14 TroyShen

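0. Introducing the problem
1. Generating random example data
2. Default visualization (Figure 1)
3. Fixing the built-in ordering of letter+number labels
4. Visualization after fixing the letter+number ordering (Figure 2)
5. Generating example data for the string-ordering problem
6. Example of R's default string ordering (Figure 3)
7. Customizing the x-axis order
8. Visualization after customizing the x-axis order (Figure 4)
9. Summary
10. Packages used in this article (install with install.packages if missing)
11. Acknowledgements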

0. Introducing the problem

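When plotting, the x-axis or y-axis is often a discrete variable rather than a continuous one such as time or an indicator value, and in that case R's built-in ordering comes into play.
For labels that combine a letter with a number of two or more digits, such as S1, S2, S3 ... S10, S11, S12, R does not sort by numeric value. It compares the labels as strings, character by character, which produces an out-of-order sequence like S1, S10, S11, S12, S2, ..., S9 (see Figure 1). Sometimes we need to compare samples in their natural order, so how do we fix this? This article gives a solution.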
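
The effect can be reproduced without any plotting. The following is a minimal base-R sketch; the variable names sample_ids and sorted_ids are chosen only for illustration:

```r
# Build the labels S1..S12 and sort them the way R sorts any character vector
sample_ids <- paste0('S', 1:12)
sorted_ids <- sort(sample_ids)

# 'S10' lands before 'S2' because the comparison is decided at the first
# differing character: '1' versus '2'
print(sorted_ids)
```

The printed vector starts with S1, S10, S11, S12 and only then continues with S2, which is the same sequence that appears on the axis in Figure 1.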

Figure 1. Example of out-of-order letter+number labels

1. Generating random example data

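# Sample labels S1..S20 and 20 random values drawn uniformly from [-5, 5]
x = paste0('S',1:20)
y = runif(20,-5,5)

pl_df = data.frame(x = x,y = y)

# Flag each sample by the sign of its value
neg_index = which(pl_df$y < 0)
pl_df$fill = 'Positive Zone'
pl_df$fill[neg_index] = 'Negative Zone'

# Rename the columns to the names used in the plots below
colnames(pl_df) = c('Sample','Value','Type')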

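Overview of the data structure:

Sample is the sample ID; in geographic applications these could be raster cells or sampling points;
Value is the indicator value of each sample;
Type records whether the indicator value of each sample is below 0.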

head(pl_df)
  Sample     Value          Type
1     S1  3.596238 Positive Zone
2     S2  3.188980 Positive Zone
3     S3 -1.661844 Negative Zone
4     S4  4.682527 Positive Zone
5     S5 -3.507134 Negative Zone
6     S6  4.383677 Positive Zone

2. Default visualization (Figure 1)

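library(ggplot2)  # provides ggplot(), geom_bar(), theme_bw(), ...

fontsize  = 12    # font size shared by all text elements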

p = ggplot()+
  geom_bar(data = pl_df, aes(x = Sample, y = Value, fill = Type),position = 'dodge',stat = 'identity')+
  theme_bw()+
  theme(
    axis.text = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    axis.title = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    legend.position = 'bottom',
    legend.direction = 'horizontal',
    legend.text = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    legend.title = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5)
  )+
  xlab('Sample Index')+
  ylab('Testing Value')

dir.create('plot', showWarnings = FALSE)  # make sure the output directory exists
png('plot/plot1.png',
    height = 15, 
    width = 25,
    units = 'cm',
    res = 800)
print(p)
dev.off()

3. Fixing the built-in ordering of letter+number labels

The fix is a single line. Tracking it down usually means searching through many posts and can easily take an hour or two, so this article gives the solution directly.

pl_df$Sample = factor(pl_df$Sample,
                      levels = x)

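With the solution in place, here is the reasoning behind it. When letters and digits are combined into a label, R compares the strings character by character, starting from the first character and following the collation order of the current locale. It never interprets the trailing digits as a number, so 'S10' is placed before 'S2' because '1' sorts before '2' at the second character. That is the cause of the out-of-order axis.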
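
The difference between string and numeric comparison, and its effect on factor levels, can be checked directly. This is a small base-R sketch; the names ids, default_levels and custom_levels are illustrative only:

```r
# String comparison works character by character, not by numeric value
string_cmp  <- 'S10' < 'S2'   # TRUE: '1' sorts before '2'
numeric_cmp <- 10 < 2         # FALSE

# Without a `levels` argument, factor() uses the sorted unique values ...
ids <- paste0('S', 1:12)
default_levels <- levels(factor(ids))

# ... while an explicit `levels` argument keeps the order that is passed in
custom_levels <- levels(factor(ids, levels = ids))

print(default_levels)
print(custom_levels)
```

Since ggplot2 draws a discrete axis in the order of the factor levels, setting the levels explicitly is what changes the axis order.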

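The solution is to set the levels of the Sample column explicitly, in the order we need.
Note: the output below shows the levels of pl_df$Sample before the update. It comes from a version of R earlier than 4.0, where data.frame() converted strings to factors by default; in R 4.0 and later the column is a plain character vector, which ggplot2 still places in alphabetical order.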

unique(pl_df$Sample)
 [1] S1  S2  S3  S4  S5  S6  S7  S8  S9  S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20
Levels: S1 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S2 S20 S3 S4 S5 S6 S7 S8 S9

After the conversion, the levels are in the order we want (the levels of pl_df$Sample after the update):

 pl_df$Sample
 [1] S1  S2  S3  S4  S5  S6  S7  S8  S9  S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20
Levels: S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20
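
The one-line fix works here because x was created in the desired order. When the labels arrive in arbitrary order, for example from a file, the levels can be derived from the numeric part of each label. The sketch below uses only base R and assumes every label contains exactly one run of digits; the vector ids is hypothetical:

```r
# Hypothetical labels arriving in arbitrary order
ids <- c('S10', 'S2', 'S1', 'S20', 'S3')

# Strip every non-digit character and convert the remainder to a number
num_part <- as.numeric(gsub('[^0-9]', '', ids))

# Order the unique labels by that number and use them as factor levels
natural_levels <- unique(ids[order(num_part)])
ids_factor <- factor(ids, levels = natural_levels)

print(levels(ids_factor))
```

The resulting levels are S1, S2, S3, S10, S20, and they can be assigned to the plotting column in the same way as above.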

4. Visualization after fixing the letter+number ordering (Figure 2)

As the red box in Figure 2 shows, the letter+number labels no longer follow R's built-in ordering and appear in the order we specified.

p2 = ggplot()+
  geom_bar(data = pl_df, aes(x = Sample, y = Value, fill = Type),position = 'dodge',stat = 'identity')+
  theme_bw()+
  theme(
    axis.text = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    axis.title = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    legend.position = 'bottom',
    legend.direction = 'horizontal',
    legend.text = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    legend.title = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5)
  )+
  xlab('Sample Index')+
  ylab('Testing Value')

dir.create('plot', showWarnings = FALSE)  # make sure the output directory exists
png('plot/plot2.png',
    height = 15, 
    width = 25,
    units = 'cm',
    res = 800)
print(p2)
dev.off()

Figure 2. Visualization after fixing the letter+number ordering

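While writing this, however, another question came to mind. Sometimes the items we compare are labeled purely with text, such as the city names "Beijing" and "Shanghai", or Patient A, Patient B and so on. By default, R's plotting system arranges them alphabetically. But is that really what we want?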

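For example, we may need to compare the GDP of several regions such as Beijing, Shanghai and AnHui. With alphabetical ordering, AnHui comes before Beijing, and Shanghai ends up separated from Beijing. If the focus of the study is a side-by-side comparison of Beijing and Shanghai, how do we handle this? This article gives a solution for that case as well.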

5. Generating example data for the string-ordering problem

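# Random two-letter labels: one uppercase letter followed by one lowercase letter
index = round(runif(20,1,26))
index2 = round(runif(20,1,26))
x2 = paste0(LETTERS[index],letters[index2])

# Reuse the values y from section 1 as the GDP column
pl_df2 = data.frame(City_initial = x2, GDP = y)
pl_df2$Type = 'Positive Value'
neg_index = which(pl_df2$GDP <0)
pl_df2$Type[neg_index] = 'Negative Value'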

Preview of the data structure:

  City_initial       GDP           Type
1           We  3.596238 Positive Value
2           Kh  3.188980 Positive Value
3           Dd -1.661844 Negative Value
4           Pi  4.682527 Positive Value
5           Xw -3.507134 Negative Value
6           Hm  4.383677 Positive Value

6. Example of R's default string ordering (Figure 3)

If we plot with the code below without any preprocessing, R arranges the labels in string order, as shown in Figure 3.

p3 = ggplot()+
  geom_bar(data = pl_df2, aes(x = City_initial, y = GDP, fill = Type),position = 'dodge',stat = 'identity')+
  theme_bw()+
  theme(
    axis.text = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    axis.title = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    legend.position = 'bottom',
    legend.direction = 'horizontal',
    legend.text = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    legend.title = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5)
  )+
  xlab('City Initials')+
  ylab('Testing Value')

dir.create('plot', showWarnings = FALSE)  # make sure the output directory exists
png('plot/plot3.png',
    height = 15, 
    width = 25,
    units = 'cm',
    res = 800)
print(p3)
dev.off()

Figure 3. Example of R's default string ordering

7. Customizing the x-axis order

Customizing the x-axis order works on the same principle as the letter+number case. We need the following settings:

# Structure of City_initial before the conversion
unique(pl_df2$City_initial)
 [1] We Kh Dd Pi Xw Hm Sf Pn Oc Ux Qm Fi Om Et Ks Wm Eb Ou Hv On
Levels: Dd Eb Et Fi Hm Hv Kh Ks Oc Om On Ou Pi Pn Qm Sf Ux We Wm Xw
# Customize the x-axis order; unique() guards against repeated labels,
# which factor() does not accept as levels
pl_df2$City_initial = factor(pl_df2$City_initial,
                             levels = unique(x2))
# Order of City_initial after the conversion
unique(pl_df2$City_initial)
 [1] We Kh Dd Pi Xw Hm Sf Pn Oc Ux Qm Fi Om Et Ks Wm Eb Ou Hv On
Levels: We Kh Dd Pi Xw Hm Sf Pn Oc Ux Qm Fi Om Et Ks Wm Eb Ou Hv On
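
One caveat applies to randomly generated labels: the same two-letter code can be drawn more than once, and current versions of R reject a levels vector that contains duplicates. The base-R sketch below illustrates this with a hypothetical vector codes that is not part of the article's data:

```r
# Hypothetical labels with a repeated value, as random sampling can produce
codes <- c('We', 'Kh', 'We', 'Dd')

# Passing the raw vector as `levels` fails because 'We' appears twice
result <- tryCatch(factor(codes, levels = codes), error = function(e) e)
print(inherits(result, 'error'))

# De-duplicating first keeps the order of first appearance and works
codes_factor <- factor(codes, levels = unique(codes))
print(levels(codes_factor))
```

Wrapping the levels in unique() is harmless when there are no repeats, so it is a reasonable default whenever the order comes from the data itself.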

8. Visualization after customizing the x-axis order (Figure 4)

p4 = ggplot()+
  geom_bar(data = pl_df2, aes(x = City_initial, y = GDP, fill = Type),position = 'dodge',stat = 'identity')+
  theme_bw()+
  theme(
    axis.text = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    axis.title = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    legend.position = 'bottom',
    legend.direction = 'horizontal',
    legend.text = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5),
    legend.title = element_text(face = 'bold',color = 'black',size = fontsize,hjust = 0.5)
  )+
  xlab('City Initials')+
  ylab('Testing Value')

dir.create('plot', showWarnings = FALSE)  # make sure the output directory exists
png('plot/plot4.png',
    height = 15, 
    width = 25,
    units = 'cm',
    res = 800)
print(p4)
dev.off()

Figure 4. Visualization after customizing the string order

9. Summary

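This article addressed two problems:

1. When letter+number labels, or numbers stored as strings, are used on the x-axis, how do we fix out-of-order sequences such as 1, 11, ...?
2. How do we override R's built-in ordering and arrange the samples on the x-axis in whatever order we need, for example placing Beijing and Shanghai next to each other?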
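
The second point can be sketched with the city names mentioned earlier. This is a minimal base-R illustration; the vectors cities and focus are hypothetical and not part of the article's data:

```r
cities <- c('Beijing', 'Shanghai', 'AnHui')

# Default: factor levels are alphabetical, so AnHui comes first
default_order <- levels(factor(cities))

# Put the cities we want to compare side by side first, then all the others
focus <- c('Beijing', 'Shanghai')
custom_order <- c(focus, setdiff(sort(unique(cities)), focus))
cities_factor <- factor(cities, levels = custom_order)

print(default_order)
print(levels(cities_factor))
```

Using setdiff() for the remaining labels means only the items of interest have to be listed by hand.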

10. Packages used in this article (install with install.packages if missing)

library('ggplot2')

11. Acknowledgements

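First of all, thank you for following along. I will keep working at it and keep the updates coming!

If you found this useful, please follow and like, and feel free to share it with your friends. Thanks a lot!

If you run into problems while using the code in this article, leave a comment or send me a private message.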
