Bioinformatics learning log

Data visualization with ggplot2

2020-04-26  忍冬_a284

2020-04-25

1.1. First step: a ggplot2 plot is built up from layers, and each added command contributes one layer

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))


ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph.
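The layer idea can be checked directly. The sketch below is only an illustration: it peeks at the `layers` element of the plot object, which is an internal detail of ggplot2 rather than a documented interface, and the names `p_empty` and `p_points` are made up for this example.

```r
library(ggplot2)

# Data only: a plot object that has no layers yet.
p_empty <- ggplot(data = mpg)
length(p_empty$layers)

# Adding a geom with + appends one layer to the plot.
p_points <- p_empty + geom_point(mapping = aes(x = displ, y = hwy))
length(p_points$layers)
```

The first call reports zero layers and the second reports one, which matches the "one command, one layer" picture.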

1.2. The function geom_point() adds a layer of points to your plot, which creates a scatterplot.

The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.
Terminology: ggplot() and geom_point() are functions; mapping is an argument.
Mapping a third variable to an aesthetic:
ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,color=Species))

ggplot(data=iris)+geom_point(mapping = aes(x=Species,y=Sepal.Length,color=Sepal.Width))

Several aesthetics can be mapped in the same call:

ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,size=Species,color=Species))
Warning message:
Using size for a discrete variable is not advised. 

1.3. Aesthetics can also be set manually

ggplot(data=iris) + geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,color="grey"))


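Here color is set inside aes(), so the string "grey" is treated as data: ggplot2 maps a constant variable whose only value is "grey" to color. The points are drawn in the first color of the default palette, not in grey, and a legend is added.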
ggplot(data=iris) + geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length),color="grey")

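Here color is set outside aes(). No variable is mapped; the setting only changes the appearance of the points drawn by geom_point(), which are all grey.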
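A small check of the difference, as a sketch rather than a definitive test: layer_data() returns the values ggplot2 actually draws for a layer, so the color that ends up on the points can be compared for the two calls. The names `p_mapped` and `p_fixed` are made up for this example.

```r
library(ggplot2)

# color inside aes(): "grey" is mapped like a data value and goes through a color scale.
p_mapped <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length, color = "grey"))

# color outside aes(): "grey" is a fixed setting of the layer.
p_fixed <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length), color = "grey")

# Colors actually used when drawing layer 1 of each plot.
unique(layer_data(p_mapped, 1)$colour)  # a palette color chosen by the scale
unique(layer_data(p_fixed, 1)$colour)   # the literal "grey"
```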

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven’t accidentally written code like this:

ggplot(data = mpg) 
+ geom_point(mapping = aes(x = displ, y = hwy))
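For comparison, a working version keeps the + at the end of the first line. This is a minimal sketch; the object name `p` is only for illustration.

```r
library(ggplot2)

# Correct: the + ends the line, so R knows the expression continues.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# With a leading + on the second line, R would treat the first line as a
# complete statement and then fail on the second, because a layer on its own
# cannot follow a unary +.
length(p$layers)
```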

1.4. Faceting

Note: facet_wrap() is added to the plot with +, at the same level as geom_point(); it does not go inside aes().

ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))+facet_wrap(~Species,nrow=2)

Note: Species is a discrete variable. Faceting on the continuous variable Sepal.Width instead produces one panel per distinct value:

ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))+facet_wrap(~Sepal.Width,nrow=4)

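Counting the distinct values in iris (distinct() and count() come from dplyr):

> library(dplyr)
> p<-iris
> distinct(p)                  # distinct rows across all columns
> distinct(p,Sepal.Length)     # show the distinct values of one column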
   Sepal.Length
1           5.1
2           4.9
3           4.7
4           4.6
5           5.0
6           5.4
7           4.4
8           4.8
9           4.3
10          5.8
11          5.7
12          5.2
13          5.5
14          4.5
15          5.3
16          7.0
17          6.4
18          6.9
19          6.5
20          6.3
21          6.6
22          5.9
23          6.0
24          6.1
25          5.6
26          6.7
27          6.2
28          6.8
29          7.1
30          7.6
31          7.3
32          7.2
33          7.7
34          7.4
35          7.9
> count(p,Sepal.Length)    # count how often each distinct value occurs
# A tibble: 35 x 2
   Sepal.Length     n
          <dbl> <int>
 1          4.3     1
 2          4.4     3
 3          4.5     1
 4          4.6     4
 5          4.7     2
 6          4.8     5
 7          4.9     6
 8          5      10
 9          5.1     9
10          5.2     4
# … with 25 more rows
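Because a continuous variable has many distinct values, faceting on it directly gives many nearly empty panels. One possible workaround, sketched here under the assumption that three equal-width bins are enough, is to bin the variable first with cut_interval() from ggplot2. The column name `width_bin` and the bin count are made up for this example.

```r
library(ggplot2)

# One panel would be drawn per distinct value of the faceting variable.
n_panels_raw <- length(unique(iris$Sepal.Width))

# Assumption for illustration: bin Sepal.Width into three equal-width intervals.
iris_binned <- transform(iris, width_bin = cut_interval(Sepal.Width, n = 3))
n_panels_binned <- nlevels(iris_binned$width_bin)

# Facet on the binned variable instead of the raw one.
p_binned <- ggplot(data = iris_binned) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length)) +
  facet_wrap(~ width_bin, nrow = 1)
```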

1.5. Comparing facet_grid(): the variable with more unique values should usually go in the columns

# points only, no facets
ggplot(data=mpg)+
    geom_point(mapping = aes(x=drv,y=cyl))

# drv in rows, cyl in columns
ggplot(data=mpg)+
    geom_point(mapping = aes(x=drv,y=cyl))+
    facet_grid(drv~cyl)

# cyl in rows, drv in columns
ggplot(data=mpg)+
    geom_point(mapping = aes(x=drv,y=cyl))+
    facet_grid(cyl~drv)

# drv in rows only
ggplot(data=mpg)+
    geom_point(mapping = aes(x=drv,y=cyl))+
    facet_grid(drv~.)
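The rule of thumb about columns can be applied by counting unique values first. The following is only a sketch of that idea; the names `n_drv`, `n_cyl` and `p_grid` are made up, and displ and hwy are used for the axes so the facets are easier to read.

```r
library(ggplot2)

# Count the unique values of each candidate faceting variable.
n_drv <- length(unique(mpg$drv))
n_cyl <- length(unique(mpg$cyl))

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# facet_grid(rows ~ cols): the variable with more unique values goes on the right.
p_grid <- if (n_cyl >= n_drv) {
  base_plot + facet_grid(drv ~ cyl)
} else {
  base_plot + facet_grid(cyl ~ drv)
}
```

In mpg, cyl has more unique values than drv, so this ends up as facet_grid(drv ~ cyl).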
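About stroke

ggplot(data=iris)+
    geom_point(mapping = aes(x=Sepal.Length,y=Sepal.Width,color=Species),shape=21,stroke=1,fill="lightpink")

For shape 21, stroke sets the width of the outline, color sets the outline color, and fill sets the interior. stroke and fill are constants here, so they belong outside aes(); with fill="lightpink" inside aes(), the string would be mapped as data and the fill would come from the default palette instead. Zoomed in, the interior of each outlined point is filled with lightpink.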

1.6. Geometric objects

> ggplot(data=iris)+
+     geom_smooth(mapping = aes(x=Sepal.Length,y=Sepal.Width,color=Species))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

When several layers share the same mappings, set them once in ggplot():

> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_point()+
+     geom_smooth()

(The fully spelled-out equivalent repeats the mapping in every layer:)

> ggplot(data = iris)+
+     geom_point(mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_smooth(mapping = aes(x=Sepal.Length,y=Sepal.Width))
Both versions produce the same plot, as expected.
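The equivalence can be checked on the computed data instead of by eye. This is a rough sketch that compares only the point layer; `p_global`, `p_local` and `same_points` are names made up for the example, and the smoothing method is written out explicitly to avoid the informational message.

```r
library(ggplot2)

# Global mapping, inherited by every layer.
p_global <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# The same mapping repeated in each layer.
p_local <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_smooth(mapping = aes(x = Sepal.Length, y = Sepal.Width),
              method = "loess", formula = y ~ x)

# Compare the computed positions of the point layer (layer 1) in both plots.
same_points <- isTRUE(all.equal(
  layer_data(p_global, 1)[c("x", "y")],
  layer_data(p_local, 1)[c("x", "y")]
))
same_points
```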

A mapping can also be added to one layer only:

> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_point(mapping = aes(color=Species))+
+     geom_smooth()

In the same way, each layer can use its own data: a layer-level setting overrides the global one (filter() comes from dplyr)

> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_point(mapping = aes(color=Species),show.legend = F)+
+     geom_smooth(data=filter(iris,Species=="setosa"))

Exercise

p1 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5) +
      geom_smooth(se = F, size = 1.5)

p2 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5) +
      geom_smooth(se = F, size = 1.5, mapping = aes(group = drv))

p3 <- ggplot(data = mpg, mapping = aes(displ, hwy, color = drv)) +
      geom_point(size = 2.5) +
      geom_smooth(se = F, size = 1.5, mapping = aes(group = drv, color = drv))

p4 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5, mapping = aes(color = drv)) +
      geom_smooth(se = F, size = 1.5)

p5 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5, mapping = aes(color = drv)) +
      geom_smooth(se = F, size = 1.5, mapping = aes(group = drv, linetype = drv))

p6 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5, mapping = aes(color = drv))

library(gridExtra)     # arrange several plots on one page
grid.arrange(p1, p2, p3, p4, p5, p6, ncol= 2, nrow = 3)

1.7. Statistical transformations

geom_bar
view(diamonds)
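The default stat of geom_bar is stat_count. stat_count computes two new variables: count and prop (proportion).

By default the y axis of a bar chart is the count at each x value. In this example x is cut (cut quality) with five levels, and the bar chart counts the diamonds at each level automatically. To show the share of each quality level instead of the count, use prop: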

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

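group = 1 treats all diamonds as one group, so each bar shows the share of that cut in the whole dataset. Without it, each cut level forms its own group, and every proportion is 100%.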
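To see what prop contains, the proportions can be computed by hand and compared with what stat_count produces. This is a sketch, assuming ggplot2 3.3.0 or later, where after_stat(prop) is the newer spelling of ..prop..; the names `manual_prop`, `p_prop` and `stat_prop` are made up for the example.

```r
library(ggplot2)

# Proportions by hand: the count of each cut divided by the total.
manual_prop <- as.numeric(prop.table(table(diamonds$cut)))

# The same proportions as computed by stat_count with group = 1.
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
built <- layer_data(p_prop, 1)
stat_prop <- built$prop[order(built$x)]

all.equal(manual_prop, stat_prop)
```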

> ggplot(data = diamonds) + 
+     stat_summary(
+         mapping = aes(x = cut, y = depth),
+         fun.min = min,
+         fun.max = max,
+         fun = median
+     )
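The start of the stat_summary() signature:

stat_summary(
  mapping = NULL,
  data = NULL,
  geom = "pointrange",    # default geom of stat_summary
  position = "identity",  # position adjustment, not a stat
  ...
)

The pairing is not symmetric: stat_summary() draws with geom_pointrange() by default, but the default stat of geom_pointrange() is "identity". So, to get the stat_summary result from the geom function instead of the stat function, the stat has to be set explicitly: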

ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary"
  )

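(No summary functions are passed here, so ggplot2 falls back to mean_se(): the point is the mean and the range is the mean ± one standard error, not the min-to-max range of the previous plot.)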
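To reproduce the min, median and max summary from the geom side, the same summary functions can be passed through geom_pointrange(). This is a sketch of that variant; the names `p_range` and `built` are made up, and the last line only prints the computed values for inspection.

```r
library(ggplot2)

# Same summaries as the stat_summary() call, written from the geom side.
# The fun* arguments are handed on to the "summary" stat.
p_range <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

# Computed values per cut: lower end, point, upper end.
built <- layer_data(p_range, 1)
built[order(built$x), c("x", "ymin", "y", "ymax")]
```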

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

Complementary geoms and stats

geom stat
geom_bar() stat_count()
geom_bin2d() stat_bin_2d()
geom_boxplot() stat_boxplot()
geom_contour() stat_contour()
geom_count() stat_sum()
geom_density() stat_density()
geom_density_2d() stat_density_2d()
geom_hex() stat_bin_hex()
geom_freqpoly() stat_bin()
geom_histogram() stat_bin()
geom_qq_line() stat_qq_line()
geom_qq() stat_qq()
geom_quantile() stat_quantile()
geom_smooth() stat_smooth()
geom_violin() stat_ydensity()
geom_sf() stat_sf()
geom_pointrange() stat_identity()

They tend to have their names in common, for example stat_smooth() and geom_smooth(). However, this is not always the case: geom_bar() with stat_count(), and geom_histogram() with stat_bin(), are notable counter-examples.
If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.
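A short sketch of the geom_col() versus geom_bar() difference: the counts are computed up front for geom_col(), while geom_bar() counts the raw rows itself. The data frame `cut_counts` and its column `n` are made up for this example.

```r
library(ggplot2)

# Pre-computed counts: one row per cut with its frequency.
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")

# geom_col() draws the values as they are (stat_identity).
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))

# geom_bar() counts the raw rows itself (stat_count).
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Both layers end up with the same bar heights.
all.equal(as.numeric(layer_data(p_col, 1)$y), as.numeric(layer_data(p_bar, 1)$y))
```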

ggplot2 geom layers and their default stats

geom default stat
geom_abline() -
geom_hline() -
geom_vline() -
geom_bar() stat_count()
geom_col() -
geom_bin2d() stat_bin_2d()
geom_blank() -
geom_boxplot() stat_boxplot()
geom_contour() stat_contour()
geom_count() stat_sum()
geom_density() stat_density()
geom_density_2d() stat_density_2d()
geom_dotplot() -
geom_errorbarh() -
geom_hex() stat_bin_hex()
geom_freqpoly() stat_bin() x
geom_histogram() stat_bin() x
geom_crossbar() -
geom_errorbar() -
geom_linerange() -
geom_pointrange() -
geom_map() -
geom_point() -
geom_map() -
geom_path() -
geom_line() -
geom_step() -
geom_point() -
geom_polygon() -
geom_qq_line() stat_qq_line() x
geom_qq() stat_qq() x
geom_quantile() stat_quantile() x
geom_ribbon() -
geom_area() -
geom_rug() -
geom_smooth() stat_smooth() x
geom_spoke() -
geom_label() -
geom_text() -
geom_raster() -
geom_rect() -
geom_tile() -
geom_violin() stat_ydensity() x
geom_sf() stat_sf() x

ggplot2 stat layers and their default geoms

stat default geom
stat_ecdf() geom_step()
stat_ellipse() geom_path()
stat_function() geom_path()
stat_identity() geom_point()
stat_summary_2d() geom_tile()
stat_summary_hex() geom_hex()
stat_summary_bin() geom_pointrange()
stat_summary() geom_pointrange()
stat_unique() geom_point()
stat_count() geom_bar()
stat_bin_2d() geom_tile()
stat_boxplot() geom_boxplot()
stat_contour() geom_contour()
stat_sum() geom_point()
stat_density() geom_area()
stat_density_2d() geom_density_2d()
stat_bin_hex() geom_hex()
stat_bin() geom_bar()
stat_qq_line() geom_path()
stat_qq() geom_point()
stat_quantile() geom_quantile()
stat_smooth() geom_smooth()
stat_ydensity() geom_violin()
stat_sf() geom_rect()

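About geom_smooth: three fitting methods are compared here
glm fits generalized linear models; it can also fit an ordinary linear regression
lm fits linear regression models; it cannot fit generalized linear models
loess fits a local polynomial regression, which gives a smooth curve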

 >p1<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+     geom_point() +
+     geom_smooth( method = glm,se=FALSE)
> p2<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+     geom_point() +
+     geom_smooth( method = lm,se=FALSE)
> p3<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+     geom_point() +
+     geom_smooth( method = loess,se=FALSE)
library(gridExtra)
> grid.arrange(p1,p2,p3,ncol=2,nrow=2)
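The statement that glm can also do linear regression can be checked outside ggplot2: with the default gaussian family, glm() and lm() give the same coefficients. The sketch below also shows one way to pass extra arguments to the fitting function through method.args; the names `fit_lm`, `fit_glm` and `p_glm` are made up for the example.

```r
library(ggplot2)

# With the gaussian family, glm() fits the same model as lm().
fit_lm <- lm(hwy ~ displ, data = mpg)
fit_glm <- glm(hwy ~ displ, data = mpg, family = gaussian())
all.equal(coef(fit_lm), coef(fit_glm))

# In geom_smooth(), extra arguments for the fitting function go in method.args.
p_glm <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ x,
              method.args = list(family = gaussian()), se = FALSE)
```

This is why the glm and lm panels look identical unless a different family is supplied.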

About group = 1

> p1=ggplot(data = diamonds) +
+     geom_bar(mapping = aes(x = cut, y = ..prop..))
> p2=ggplot(data = diamonds) +
+     geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
> p3=ggplot(data = diamonds) +
+     geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..,group=1))
> grid.arrange(p1,p2,p3,ncol=2,nrow=2)

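The y axis is ..prop.., the share of the total taken by each category. group = 1 puts all categories into a single group, so each category's share is computed against the whole; that is why group = 1 is needed.
Otherwise each category is its own group by default. For counts this gives an ordinary bar chart, but for proportions each category is 100% of its own group, so every bar reaches the same height of 1. This is the plot from the first command.
With a fill mapping such as fill = color, each of the 7 levels of color has height 1 within every bar; the 7 pieces are stacked, so every bar tops out at 7. This is the plot from the second command.
Author: 咕噜咕噜转的ATP合酶
Link: https://www.jianshu.com/p/f36c3f8cfb24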

1.8 Position adjustments

ggplot(data=iris)+    
+ geom_bar(mapping = aes(x=Sepal.Width,y=Sepal.Length,fill=Species),stat="identity")
> p1=ggplot(data=iris)+  
+ geom_bar(mapping = aes(x=Sepal.Width,fill=Species),alpha=3/5,position = "identity")
>  p2=ggplot(data=iris)+    
+ geom_bar(mapping = aes(x=Sepal.Width,fill=Species),alpha=3/5)
> p3=ggplot(data=iris)+    
+  geom_bar(mapping = aes(x=Sepal.Width,color=Species),fill=NA,position = "identity")
> p4=ggplot(data=iris)+    
+ geom_bar(mapping = aes(x=Sepal.Width,fill=Species),position = "fill")
> grid.arrange(p1,p2,p3,p4,ncol=2,nrow=2)
p5=ggplot(data = iris)+
+ geom_bar(mapping=aes(x=Sepal.Width,fill=Species),position="dodge")
> grid.arrange(p1,p2,p3,p4,p5,ncol=2,nrow=3)
Comparing the plots with and without position = "identity" shows that with it the bars overlap each other instead of being stacked.
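The overlap can also be read off the computed bar positions. In this sketch, layer_data() is used to look at where each bar starts; `p_base`, `stacked` and `overlaid` are names made up for the example.

```r
library(ggplot2)

p_base <- ggplot(data = iris, mapping = aes(x = Sepal.Width, fill = Species))

# Default for geom_bar(): position = "stack", bars sit on top of each other.
stacked <- layer_data(p_base + geom_bar(), 1)

# position = "identity": every bar starts at zero, so bars overlap.
overlaid <- layer_data(p_base + geom_bar(position = "identity"), 1)

all(overlaid$ymin == 0)  # every bar starts at the baseline
any(stacked$ymin > 0)    # some bars start on top of another bar
```

This is also why alpha or fill = NA is needed with position = "identity": otherwise the bars in front hide the ones behind.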

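About overplotting:
The recorded values are rounded, so many points share exactly the same coordinates and the overlapping points cannot be told apart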

 p6=ggplot(data=iris)+
+     geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length),position="jitter")
> p7=ggplot(data=iris)+
+     geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))
> grid.arrange(p6,p7,ncol=2,nrow=2)
position = "jitter" adds a small amount of random noise to each point
ggplot(data=iris,mapping = aes(x=Sepal.Width,y=Sepal.Length))+
+ geom_jitter()

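geom_jitter() is a shorthand for the same thing (the exact positions differ from run to run because the noise is random)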

Fine-tuning jitter

 p8=ggplot(data=mpg,mapping=aes(x=cty,y=hwy))+
+     geom_jitter(aes(color=class))
>p <- ggplot(mpg, aes(cyl, hwy)) 
p9 <- p+geom_jitter(aes(color=class))
> grid.arrange(p8,p9,ncol=2,nrow=2)
p10=ggplot(data=mpg,mapping=aes(x=cyl,y=hwy))+
+     geom_jitter(aes(color=class))
> grid.arrange(p8,p9,p10,ncol=2,nrow=2)
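The amount of jitter can be controlled with position_jitter(), which takes width, height and seed arguments. The sketch below uses illustrative values (0.2 and seed 42 are arbitrary choices, not recommendations) and made-up object names.

```r
library(ggplot2)

p_base <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy))

# width and height set the maximum shift in each direction (in data units);
# seed makes the random shifts reproducible.
jit <- position_jitter(width = 0.2, height = 0.2, seed = 42)

first <- layer_data(p_base + geom_point(position = jit), 1)
second <- layer_data(p_base + geom_point(position = jit), 1)

identical(first$x, second$x)   # same seed, same positions
max(abs(first$x - mpg$cty))    # shifts stay within the chosen width
```

A smaller width and height keep the points closer to their true values at the cost of more overlap.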

Compare and contrast geom_jitter() with geom_count().

The geom geom_jitter() adds random variation to the locations of the points in the graph. In other words, it “jitters” the locations of points slightly. This method reduces overplotting since two points with the same location are unlikely to have the same random variation.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()

However, the reduction in overlapping comes at the cost of slightly changing the x and y values of the points.

The geom geom_count() sizes the points relative to the number of observations. Combinations of (x, y) values with more observations will be larger than those with fewer observations.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

The geom_count() geom does not change x and y coordinates of the points. However, if the points are close together and counts are large, the size of some points can itself create overplotting. For example, in the following example, a third variable mapped to color is added to the plot. In this case, geom_count() is less readable than geom_jitter() when adding a third variable as a color aesthetic.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_count()

As that example shows, there is unfortunately no universal solution to overplotting. The costs and benefits of different approaches will depend on the structure of the data and the goal of the data scientist.

1.9 Coordinate systems

coord_flip(): swaps the x and y axes
coord_quickmap(): sets a suitable aspect ratio for maps
coord_polar(): uses polar coordinates

usa<-map_data("usa")
nz<-map_data("nz")
ggplot(usa, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()
ggplot(iris, aes(x = factor(1), fill = Species)) +
  geom_bar()
ggplot(iris, aes(x = factor(1), fill = Species)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

The argument theta = "y" maps y to the angle of each section. If coord_polar() is specified without theta = "y", then the resulting plot is called a bulls-eye chart.
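The two variants side by side, as a sketch: the only difference is the theta argument. The last two lines peek at the `coordinates` element of the plot object, which is an internal detail of ggplot2 and is used here only for illustration; `bars`, `pie` and `bullseye` are made-up names.

```r
library(ggplot2)

bars <- ggplot(iris, aes(x = factor(1), fill = Species)) +
  geom_bar(width = 1)

# theta = "y": the count is mapped to the angle, giving a pie chart.
pie <- bars + coord_polar(theta = "y")

# Default theta = "x": the count is mapped to the radius, giving a bullseye chart.
bullseye <- bars + coord_polar()

pie$coordinates$theta
bullseye$coordinates$theta
```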
