R_for_Data_Science_Script&EDA&pr

2020-04-16 本文已影响0人城管大队哈队长

这是英文版的第6,7,8章

6 Workflow: scripts

这章节其实没啥可说的，只是想起一篇文章，其中有一些小tips

# 看了你安装的包
installed.packages()

# 看下你加载的包
(.packages())

# 批量加载
load_package <- c("ComplexHeatmap","circlize")
if (!all(load_package %in% (.packages()))) lapply(load_package, library, character.only = TRUE)

来自An efficient way to install and load R packages
原文还有很多东西

7.1 Introduction

Exploratory Data Analysis（EDA）是一个不断迭代的过程
1. Generate questions about your data.

Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.

关于EDA的探索，《Bioinformatics Data skills》里面有一章节也讲的也挺好（应该是Chapter 8: A Rapid Introduction to the R Language），而且是关于生信的。

7.2 Questions

EDA是一个不断探索你数据的过程，而其中最简单的方法就是不断地问问题
You can loosely word these questions as:
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?
To make the discussion easier, let’s define some terms:
- A variable is a quantity, quality, or property that you can measure.
- A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
- An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.
- Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.（这应该是tidyverse生态的核心数据格式，行为观测，列为变量）
  
  但我们的生信数据一般是反过来的，行为变量（基因），列为观测（各个样本）

7.3 Variation

Variation is the tendency of the values of a variable to change from measurement to measurement. Every variable has its own pattern of variation, which can reveal interesting information（其实这话在一定程度上，也就意味着每个变量都有其特有的分布）。

像RNA-Seq的count值的分布就是属于伽马-泊松混合分布（即二项分布），而RNA-Seq得到的p-value值的分布就是零假设为真的基因p-value与有差异的基因的p-value的混合分布

The best way to understand that pattern is to visualise the distribution of the variable's values

你也可以用vcd包的goodfit来看下你的数据和某一类分布的拟合程度
norm_data <- rpois(1000,lambda = 5)
ofit = goodfit(norm_data, "poisson")
plot(ofit, xlab = "")
参考：
常见分布的拟合方法
Modern Statistics for Modern Biology_4.4.3

可视化你变量的分布取决于你的变量是连续型变量还是离散型变量

如果是离散型变量的话，一般就是用geom_bar

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

img

高度可以用

diamonds %>% 
  count(cut)
#> # A tibble: 5 x 2
#>   cut           n
#>   <ord>     <int>
#> 1 Fair       1610
#> 2 Good       4906
#> 3 Very Good 12082
#> 4 Premium   13791
#> 5 Ideal     21551

计算出来，然后

library(tidyverse)
diamonds %>% 
  count(cut) %>% 
  ggplot(aes(x = cut, y = n)) +
  geom_bar(stat = "identity")

如果是连续型变量的话，用geom_histogram

ggplot(data = diamonds) +
   geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

image

You can compute this by hand by combining dplyr::count() and ggplot2::cut_width():

diamonds %>% 
  count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#>   `cut_width(carat, 0.5)`     n
#>   <fct>                   <int>
#> 1 [-0.25,0.25]              785
#> 2 (0.25,0.75]             29498
#> 3 (0.75,1.25]             15977
#> 4 (1.25,1.75]              5313
#> 5 (1.75,2.25]              2002
#> 6 (2.25,2.75]               322
#> # … with 5 more rows

# 其实就是cut_width就是把连续型变量切割成区间
> cut_width(diamonds$carat, 0.5) %>% 
+   head()
[1] [-0.25,0.25] [-0.25,0.25] [-0.25,0.25] (0.25,0.75] 
[5] (0.25,0.75]  [-0.25,0.25]
11 Levels: [-0.25,0.25] (0.25,0.75] ... (4.75,5.25]

cut_interval makes n groups with equal range

cut_number makes n groups with (approximately) equal numbers of observations

cut_width makes groups of width width.

在用geom_histogram的时候，你需要不断调整你的binwidth或则bins数目，才能得到一个合适的图像。

binwidth太小，你的图像就会很sharp，很难看到趋势。太大的话，你的图像就会过于smooth，也很难看到趋势。

# 这里是选择了一部分数据集，binwidth也变小了
smaller <- diamonds %>% 
  filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

image

除了用geom_hist之外，你还可以用geom_freqpoly

# 用这个图的话，你就会发现其实geom_freqpoly就是hist的顶点连线
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)  +
  geom_freqpoly(binwidth = 0.1)

# 这里是用不同cut了
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

image

我们可以从可视化中发现一些趋势，比如我们从下图中就可以看到有一些亚群数据存在。

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

image

还有像这个数据

ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_histogram(binwidth = 0.25)

image

在生信中，最常见的一个亚群现象可能就是如果你的样本污染的话，你的fastQC的GC含量那里，就会出现双峰。这是因为一个物种的GC含量会呈现于类似正态分布的分布，但如果你的样本里面混入了其他物种，比如说菌等，那么你的GC含量就会有双峰，甚至几个峰。

另一个我能想起来的例子，就是“邪恶的鸢尾花”

# 你会发现Width会出现双峰
# 其实是因为Petal.width是好几种花的数据
hist(iris$Petal.Width)

image

Many of the questions above will prompt you to explore a relationship between variables, for example, to see if the values of one variable can explain the behavior of another variable.

文中提到了异常数据的发现，这让我想起breakdown point这个概念。其代表的是一个估计器对于脏数据的最大容忍程度。比如median对于异常值并不敏感，breakdown point是1/2。而mean则对异常值很敏感，其值是 1/N，N代表了你的数据量。我个人理解就是你要改变整个数据的median，你需要改变一半数据，而如果你要改变整个数据的mean，你只需要改变一个数据。这也就是为啥有时候我们对于异常值影响很大的相关性数据，会考虑使用Sperman，而不是Pearson。或者我们在做线性回归的时候，为了考虑模型的稳健性，会考虑用median，而不是mean。

纯属瞎想……

参考：

稳健回归（Robustness regression）

Modern Statistics for Modern Biology_8.7.4 Robustness

当你有很多数据的时候，你很难在柱状图中看到异常数据。比如下图

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

image

这时候你可以考虑Zoom in你的数据

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

image

(coord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.) 注意这个区别

Solution的Exercise 7.3.4会对此有些解释

让我想起了ggforce的缩放

library(ggforce)
#> Loading required package: ggplot2
ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  facet_zoom(x = Species == "versicolor")

image

It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.

Exercise 7.3.4

Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

The coord_cartesian() function zooms in on the area specified by the limits, after having calculated and drawn the geoms. Since the histogram bins have already been calculated, it is unaffected.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

image

However, the xlim() and ylim() functions influence actions before the calculation of the stats related to the histogram. Thus, any values outside the x- and y-limits are dropped before calculating bin widths and counts. This can influence how the histogram looks.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  xlim(100, 5000) +
  ylim(0, 3000)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 14714 rows containing non-finite values (stat_bin).
#> Warning: Removed 6 rows containing missing values (geom_bar).

image

7.4 Missing values

如果你遇到了一些异常值，但不想费心地处理，可以考虑两种对待异常值的方法

一种是把含有异常值的整行都丢弃

diamonds2 <- diamonds %>% 
  filter(between(y, 3, 20))

I don’t recommend this option because just because one measurement is invalid, doesn’t mean all the measurements are. Additionally, if you have low quality data, by time that you’ve applied this approach to every variable you might find that you don’t have any data left!

另一种是把异常值取代为NA

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ifelse() has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is false. Alternatively to ifelse, use dplyr::case_when(). case_when() is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables.

Exercise 7.4.1

What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

这个solution推荐看一下

Missing values are removed when the number of observations in each bin are calculated. See the warning message: Removed 9 rows containing non-finite values (stat_bin)

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ggplot(diamonds2, aes(x = y)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 9 rows containing non-finite values (stat_bin).

img

In the geom_bar() function, NA is treated as another category. The x aesthetic in geom_bar() requires a discrete (categorical) variable, and missing values act like another category.

diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
  ggplot() +
  geom_bar(mapping = aes(x = cut))

img

In a histogram, the x aesthetic variable needs to be numeric, and stat_bin() groups the observations by ranges into bins. Since the numeric value of the NA observations is unknown, they cannot be placed in a particular bin, and are dropped.

7.5 Covariation

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.

注意这跟协变量（covariance）是不一样的

下面的

一个分类变量、一个连续变量

两个分类变量

两个连续变量

都值得去看一看

7.5.1 A categorical and continuous variable

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape. For example, let’s explore how the price of a diamond varies with its quality:

geom_freqpoly()的纵坐标是数目，如果你的分类变量对应的数目差距很大的话，你如果放一个图里面，那么数目少那组的自身变化就很难看出来了。

For example, let’s explore how the price of a diamond varies with its quality:

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

img

这里紫色的Fair组你就很难看出趋势来

It’s hard to see the difference in distribution because the overall counts differ so much:

ggplot(diamonds) + 
  geom_bar(mapping = aes(x = cut))

img

To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display density, which is the count standardised so that the area under each frequency polygon is one.

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

img

我们可以看到，似乎质量差的钻石反而价格更高

这跟直接geom_density得到的密度图还是不一样的。但我也不知道density是怎么出来的

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_density(mapping = aes(colour = cut))

image

Smoothed density estimates

Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of（这个解释甚好）:
- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
- Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.
- A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
  
  image

看起来上面这个图是个不对称的分布

从boxplot同样发现了质量相对较低的钻石价格更高，而高质量的钻石价格偏低

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

image

有个小插曲，怎么对连续型变量在图中根据某一顺序进行排序

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

img

If you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that with coord_flip().

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()

img

这个比较有用的应该是你画GO plot的时候，可以按某个顺序排下来，这样图会很清爽。

Exercise 7.5.1.2

What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

这里算是前面那个质量越差反而价格越高的解释吧

这个来自solution 核心意思就是因为质量差的那些钻石，其个头比较大，所以相对来说，价格比较大

如果不想看的话，可以跳过。

What are the general relationships of each variable with the price of the diamonds? I will consider the variables: carat, clarity, color, and cut. I ignore the dimensions（就是所谓的diamond数据里面的x、y、z） of the diamond since carat measures size, and thus incorporates most of the information contained in these variables.

Since both price and carat are continuous variables, I use a scatter plot to visualize their relationship.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

img

However, since there is a large number of points in the data, I will use a boxplot by binning carat (as suggested in the chapter).

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

一个小问题就是如果你是ggplot2 3.3.0的话，是跑不出这样的结果的

img

Note that the choice of the binning width is important, as if it were too large it would obscure any relationship, and if it were too small, the values in the bins could be too variable to reveal underlying trends.

The variables color and clarity are ordered categorical variables. The chapter suggests visualizing a categorical and continuous variable using frequency polygons or boxplots. In this case, I will use a box plot since it will better show a relationship between the variables.

There is a weak negative relationship between color and price. The scale of diamond color goes from D (best) to J (worst). Currently, the levels of diamonds$color are in the wrong order. Before plotting, I will reverse the order of the color levels so they will be in increasing order of quality on the x-axis. The color column is an example of a factor variable, which is covered in the “Factors” chapter of R4DS.

diamonds %>%
  mutate(color = fct_rev(color)) %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()

img

fct_rev这个命令很奇怪

# 变不回去了
> f <- factor(c("a", "b", "c"))
> fct_rev(f)
[1] a b c
Levels: c b a
> fct_rev(f)
[1] a b c
Levels: c b a

There is also weak negative relationship between clarity and price. The scale of clarity goes from I1 (worst) to IF (best).

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))

img

For both clarity and color, there is a much larger amount of variation within each category than between categories. Carat is clearly the single best predictor of diamond prices.

Now that we have established that carat appears to be the best predictor of price, what is the relationship between it and cut? Since this is an example of a continuous (carat) and categorical (cut) variable, it can be visualized with a box plot.

ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()

img

There is a lot of variability in the distribution of carat sizes within each cut category. There is a slight negative relationship between carat and cut. Noticeably, the largest carat diamonds have a cut of “Fair” (the lowest).

This negative relationship can be due to the way in which diamonds are selected for sale. A larger diamond can be profitably sold with a lower quality cut, while a smaller diamond requires a better cut.

Y叔介绍过的翻云覆雨图

Exercise 7.5.1.4

One problem with box plots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn?

How do you interpret the plots?

基因组数据一般就是大量数据，个人感觉如果对于一些基因组数据使用小提琴图、箱式图、点图等效果都不好，因为会是密密麻麻的一团。一般都得过滤之后才会展现出一些趋势。

ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv()

img

Exercise 7.5.1.6

If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

这个很适合实验上的表型数据展现。ggbeeswarm

There are two methods:

geom_quasirandom() produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated.
geom_beeswarm() produces a plot similar to a violin plot, but by offsetting the points.

I’ll use the mpg box plot example since these methods display individual points, they are better suited for smaller datasets.

ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))

img

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukey"
  )

img

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukeyDense"
  )

img

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "frowney"
  )

img

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "smiley"
  )

img

ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))

img

7.5.2 Two categorical variables

To visualise the covariation between categorical variables, you’ll need to count the number of observations for each combination（把所有组合个数头统计出来）. One way to do that is to rely on the built-in geom_count():

# geom_count的确是个好函数唉
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

# 应该是等价于
diamonds %>% 
  group_by(cut, color) %>% 
  mutate(n = n()) %>% 
  ggplot(mapping = aes(x = cut, y = color, size = n)) +
  geom_point()

img

上面的图让我想起了Motif的图
image
这个图Y叔有讲过一点：怎样用HOMER算出的P-value画出CNS级别的泡泡图？

还有一种展现方法：visualise with geom_tile() and the fill aesthetic:

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

img

If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.

关于seriation包，搜到了这个【ComplexHeatmap包学习笔记】2.5 Seriation

Exercise 7.5.2.1

How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?

其实核心就是用百分比，类似于之前的密度图。我觉得还是很值得我们学习的

To clearly show the distribution of cut within color, calculate a new variable prop which is the proportion of each cut within a color. This is done using a grouped mutate.

diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_viridis(limits = c(0, 1)) # from the viridis colour palette library

img

Similarly, to scale by the distribution of color within cut,

diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_viridis(limits = c(0, 1))

img

I add limit = c(0, 1) to put the color scale between (0, 1). These are the logical boundaries of proportions. This makes it possible to compare each cell to its actual value, and would improve comparisons across multiple plots. However, it ends up limiting the colors and makes it harder to compare within the dataset. However, using the default limits of the minimum and maximum values makes it easier to compare within the dataset the emphasizing relative differences, but harder to compare across datasets.（数据集内部比较与数据集之间的比较）

# 如果你不限定颜色范围的话，图例颜色标度会自适应你的颜色范围
diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_viridis()
image
这样子看起来非常的清楚，但有一个问题是这个颜色标度只适应于你现在的数据集，如果你还要跟其他的同类型数据集进行比较的话，就要scale_fill_viridis(limits = c(0, 1))了。这样不同数据集之间才可以比较

geom_tile有个问题是不能聚类。但Y叔的ggtree却可以做到聚类，这一点比patchwork还要好（patchwork可以拼图，但其只是对准坐标轴）
Example of aligning tree with data side by side by creating composite plot.

Exercise 7.5.2.3

Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

It’s usually better to use the categorical variable with a larger number of categories（这个变量里面有很多种） or the longer labels（标签名字很长） on the y axis. If at all possible, labels should be horizontal because that is easier to read（y轴上的标签也应该是垂直，更易读）.

However, switching the order doesn’t result in overlapping labels.

diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(y = color, x = cut)) +
  geom_tile(mapping = aes(fill = n))

img

Another justification, for switching the order is that the larger numbers are at the top when x = color and y = cut, and that lowers the cognitive burden of interpreting the plot.

7.5.3 Two continuous variables

Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above).（尤其是你比如说一个甲基化数据，一个RNA-Seq数据，画个点图，如果不经过筛选过滤，你的数据点会铺满整个页面，毫无规律可言） You’ve already seen one way to fix the problem: using the alpha aesthetic to add transparency（透明度）
面对这种情况一个办法就是用：geom_bin2d() and geom_hex() to bin in two dimensions. -》use a fill color to display how many points fall into each bin

ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))

img

当然，我觉得用二维核密度估计图也是可以的
m <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
 geom_point() +
 xlim(0.5, 6) +
 ylim(40, 110)

m + stat_density_2d(aes(fill = after_stat(level)), geom = "polygon")
img
来自：geom_density_2d

Another option is to bin one continuous variable so it acts like a categorical variable（把连续型变量分bin，我们之前遇到过了）

# 下面代码出不来下图的结果
# 可能是版本的问题
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

根据的geom_boxplot的reference来看，ggplot2 3.3.0版本的确不是下图的结果。

但我在服务器的ggplot2 3.2.1跑出来的就是跟书里面一样的图

不过这个问题应该只存在于你的x变量是连续型变量的时候

img

One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE （如果设置了这个参数，那么你boxplot的宽度就会和你的数据数目成正比）

Another approach is to display approximately the same number of points in each bin. That’s the job of cut_number()

# 同样地，下面代码出不来下图的结果
# 可能是版本的问题
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

img

Exercise 7.5.3.1

Instead of summarizing the conditional distribution with a box plot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualization of the 2d distribution of carat and price?

Both cut_width() and cut_number() split a variable into groups. When using cut_width(), we need to choose the width, and the number of bins will be calculated automatically. When using cut_number(), we need to specify the number of bins, and the widths will be calculated automatically.

In either case, we want to choose the bin widths and number to be large enough to aggregate observations to remove noise, but not so large as to remove all the signal.

If categorical colors are used, no more than eight colors（之前ggplot2那章提到过，超过8个颜色就不会显示了） should be used in order to keep them distinct. Using cut_number, I will split carats into quantiles (five groups).

ggplot(
  data = diamonds,
  mapping = aes(color = cut_number(carat, 5), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")

img

Alternatively, I could use cut_width to specify widths at which to cut. I will choose 1-carat widths. Since there are very few diamonds larger than 2-carats, this is not as informative. However, using a width of 0.5 carats creates too many groups, and splitting at non-whole numbers is unappealing.

# boundary = 0稍微可以注意下，如果你不设置这个，似乎就不会是0-1,1-2了
ggplot(
  data = diamonds,
  mapping = aes(color = cut_width(carat, 1, boundary = 0), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")

img

Exercise 7.5.3.2

Visualize the distribution of carat, partitioned by price.

Plotted with a box plot with 10 bins with an equal number of observations, and the width determined by the number of observations.

ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")

img

Plotted with a box plot with 10 equal-width bins of 2,000. The argument boundary = 0 ensures that first bin is $0–2,000.

ggplot(diamonds, aes(x = cut_width(price, 2000, boundary = 0), y = carat)) +
  geom_boxplot(varwidth = TRUE) +
  coord_flip() +
  xlab("Price")

img

Exercise 7.5.3.4

Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.

There are many options to try, so your solutions may vary from mine. Here are a few options that I tried.

唯一需要注意的是，用离散型变量去分组

# scale_fill_viridis需要预先加载library(viridis)
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  facet_wrap(~cut, ncol = 1) +
  scale_fill_viridis()

img

ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, colour = cut)) +
  geom_boxplot()

img

ggplot(diamonds, aes(colour = cut_number(carat, 5), y = price, x = cut)) +
  geom_boxplot()

img

Exercise 7.5.3.5

Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

img

Why is a scatterplot a better display than a binned plot for this case?

In this case, there is a strong relationship between xx and yy. The outliers in this case are not extreme in either xx or yy. A binned plot would not reveal these outliers, and may lead us to conclude that the largest value of xx was an outlier even though it appears to fit the bivariate pattern well.

个人感觉，geom_hex之类的是展现宏观趋势的，对于少数点的效果并不好，当然你也可以把bin变得很多，bin_width变得很小，但这样其实跟散点图是一样的了。

7.6 Patterns and models

Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
- Could this pattern be due to coincidence (i.e. random chance)?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other variables might affect the relationship?
- Does the relationship change if you look at individual subgroups of the data?
想到一个有意思的图

A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.

ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))

img

这让我想起对于这种出现subgroup的散点图，如果你做相关性，很容易出现相关性很大的情况

下图来自elife文章《Ten common statistical mistakes to watch out for when writing or reviewing a manuscript》，也有翻译的中文版《论文写作或审稿时的十种常见统计错误》

其中A-F都是不相关的数据
image

所以说，我个人感觉用散点图做相关性的时候，是不能把几个重复放一块上去的，因为重复自身就很容易出现subgroup

所以ggstatsplot的上面和右边的barplot还是有必要的，他可以标识出你的数据是否出现了subgroup
image

Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
这里的模型给我提供了一个想法

Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals（剩下的部分应该就是carat所不能解释的） (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed.

library(modelr)

# 之所以这里要log，我猜是因为是因为carat和price的量级不一样
mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(data = diamonds2) + 
  geom_point(mapping = aes(x = carat, y = resid))

img

Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.

ggplot(data = diamonds2) + 
  geom_boxplot(mapping = aes(x = cut, y = resid))

img

You’ll learn how models, and the modelr package, work in the final part of the book, model. We’re saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.

上面描述的就是在你有三个变量下（price（这个可以说是响应变量了），carat，cut（这两个是自变量））的建模过程。其实可以类比到你有RNA-Seq、甲基化、H3K27me3数据。我们就可以首先把甲基化和RNA-Seq的数据建模，然后用残差来对H3K27me3和RNA-Seq建模。

7.7 ggplot2 calls

这里也没啥好说的，就是说了些小技巧

# 可以简写
ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

ggplot(faithful, aes(eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

--------------------------------------------------------------

# tidyvese整理好的数据直接可以导入到ggplot2函数里面画图

# 值得一提的是，H神在自己的tidyverse style guide可不是这么说的，hhhh
diamonds %>% 
  count(cut, clarity) %>% 
  ggplot(aes(clarity, cut, fill = n)) + 
    geom_tile()

# 他提到准确的写法应该是
# ggplot2的+号的规则类似于 %>%。同时要注意，如果你ggplot2的数据来源于dplyr的管道结果，则只能有一个层次的缩进。
# 这也是R studio的默认写法
diamonds %>% 
  count(cut, clarity) %>% 
  ggplot(aes(clarity, cut, fill = n)) + 
  geom_tile()

7.8 Learning more

推荐了三本书
- ggplot2: Elegant Graphics for Data Analysis (Use R!) 2nd ed 这书有github仓库，如果你足够厉害的话，可以自己编译出来pdf版本（反正我编译了好多次都失败了）。当然，网上也有pdf版本的。中文版的已经在2019年出版了，大家感兴趣可以去看看。
- R Graphics Cookbook 中文版叫《R数据可视化手册》。此书英文版有了最近刚有了第二版R Graphics Cookbook, 2nd edition。而且作者也有一本R的cookbook，也是不错的。cookbook有一个中文网络翻译版：Cookbook for R 中文版
  
  作者的cookbook for R注意和 R Cookbook区分，虽然后者也是本很好的书
- Graphical Data Analysis with R (Chapman & Hall/CRC The R Series) 这书我没看过，不知道……

8 Workflow: projects

这一章也没啥好说的，我个人感觉最重要的就是你设置下图这个东西！！！！

我个人感觉，一定要设置这个！！！

如果你有一些很耗时间跑出来的结果（比如Seurat跑出来的一些中间结果），不舍得退出的时候自动删除。你可以考虑用save或者saveRDS。

img

两个快捷键

Press Cmd/Ctrl + Shift + F10 to restart RStudio.
Press Cmd/Ctrl + Shift + S to rerun the current script.

The most important difference is how you separate the components of the path. Mac and Linux uses slashes (e.g. plots/diamonds.pdf) and Windows uses backslashes (e.g. plots\diamonds.pdf). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes.

我个人习惯每次一个任务就建一个Project，而不是setwd, getwd搞来搞去的…… 然后就是我会在建完project之后就立刻建下面5个文件夹

# plot放图
# rawdata放原始数据
# result放文本结果
# script放脚本和函数
# temp_code放我暂时丢失的脚本和函数
.
├── plot
├── rawdata
├── result
├── script
└── temp_code

R_for_Data_Science_Script&EDA&pr

6 Workflow: scripts

7.1 Introduction

7.2 Questions

7.3 Variation

7.4 Missing values

7.5 Covariation

7.5.1 A categorical and continuous variable

7.5.2 Two categorical variables

7.5.3 Two continuous variables

7.6 Patterns and models

7.7 ggplot2 calls

7.8 Learning more

8 Workflow: projects

猜你喜欢

热点阅读