R_for_Data_Science_Script&EDA&pr

2020-04-16  本文已影响0人  城管大队哈队长

这是英文版的第6,7,8章

6 Workflow: scripts


7.1 Introduction

  1. Search for answers by visualising, transforming, and modelling your data.
  2. Use what you learn to refine your questions and/or generate new questions.

关于EDA的探索,《Bioinformatics Data skills》里面有一章节也讲的也挺好(应该是Chapter 8: A Rapid Introduction to the R Language),而且是关于生信的。


7.2 Questions


7.3 Variation

像RNA-Seq的count值的分布就是属于伽马-泊松混合分布(即二项分布),而RNA-Seq得到的p-value值的分布就是零假设为真的基因p-value与有差异的基因的p-value的混合分布

The best way to understand that pattern is to visualise the distribution of the variable's values

你也可以用vcd包的goodfit来看下你的数据和某一类分布的拟合程度

norm_data <- rpois(1000,lambda = 5)
ofit = goodfit(norm_data, "poisson")
plot(ofit, xlab = "")

参考:
常见分布的拟合方法
Modern Statistics for Modern Biology_4.4.3

如果是离散型变量的话,一般就是用geom_bar

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
img

高度可以用

diamonds %>% 
  count(cut)
#> # A tibble: 5 x 2
#>   cut           n
#>   <ord>     <int>
#> 1 Fair       1610
#> 2 Good       4906
#> 3 Very Good 12082
#> 4 Premium   13791
#> 5 Ideal     21551

计算出来,然后

library(tidyverse)
diamonds %>% 
  count(cut) %>% 
  ggplot(aes(x = cut, y = n)) +
  geom_bar(stat = "identity")

如果是连续型变量的话,用geom_histogram

ggplot(data = diamonds) +
   geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
image

You can compute this by hand by combining dplyr::count() and ggplot2::cut_width():

diamonds %>% 
  count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#>   `cut_width(carat, 0.5)`     n
#>   <fct>                   <int>
#> 1 [-0.25,0.25]              785
#> 2 (0.25,0.75]             29498
#> 3 (0.75,1.25]             15977
#> 4 (1.25,1.75]              5313
#> 5 (1.75,2.25]              2002
#> 6 (2.25,2.75]               322
#> # … with 5 more rows

# 其实就是cut_width就是把连续型变量切割成区间
> cut_width(diamonds$carat, 0.5) %>% 
+   head()
[1] [-0.25,0.25] [-0.25,0.25] [-0.25,0.25] (0.25,0.75] 
[5] (0.25,0.75]  [-0.25,0.25]
11 Levels: [-0.25,0.25] (0.25,0.75] ... (4.75,5.25]
                                         

cut_interval makes n groups with equal range

cut_number makes n groups with (approximately) equal numbers of observations

cut_width makes groups of width width.

binwidth太小,你的图像就会很sharp,很难看到趋势。太大的话,你的图像就会过于smooth,也很难看到趋势。

# 这里是选择了一部分数据集,binwidth也变小了
smaller <- diamonds %>% 
  filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
image

除了用geom_hist之外,你还可以用geom_freqpoly

# 用这个图的话,你就会发现其实geom_freqpoly就是hist的顶点连线
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)  +
  geom_freqpoly(binwidth = 0.1)
# 这里是用不同cut了
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)
image
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)
image

还有像这个数据

ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_histogram(binwidth = 0.25)
image

在生信中,最常见的一个亚群现象可能就是如果你的样本污染的话,你的fastQC的GC含量那里,就会出现双峰。这是因为一个物种的GC含量会呈现于类似正态分布的分布,但如果你的样本里面混入了其他物种,比如说菌等,那么你的GC含量就会有双峰,甚至几个峰。

另一个我能想起来的例子,就是“邪恶的鸢尾花”

# 你会发现Width会出现双峰
# 其实是因为Petal.width是好几种花的数据
hist(iris$Petal.Width)
image

Many of the questions above will prompt you to explore a relationship between variables, for example, to see if the values of one variable can explain the behavior of another variable.

纯属瞎想……

参考:

稳健回归(Robustness regression)

Modern Statistics for Modern Biology_8.7.4 Robustness

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)
image

这时候你可以考虑Zoom in你的数据

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
image

(coord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.) 注意这个区别

Solution的Exercise 7.3.4会对此有些解释

让我想起了ggforce的缩放

library(ggforce)
#> Loading required package: ggplot2
ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  facet_zoom(x = Species == "versicolor")
image

Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

The coord_cartesian() function zooms in on the area specified by the limits, after having calculated and drawn the geoms. Since the histogram bins have already been calculated, it is unaffected.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
image

However, the xlim() and ylim() functions influence actions before the calculation of the stats related to the histogram. Thus, any values outside the x- and y-limits are dropped before calculating bin widths and counts. This can influence how the histogram looks.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  xlim(100, 5000) +
  ylim(0, 3000)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 14714 rows containing non-finite values (stat_bin).
#> Warning: Removed 6 rows containing missing values (geom_bar).
image

7.4 Missing values

一种是把含有异常值的整行都丢弃

diamonds2 <- diamonds %>% 
  filter(between(y, 3, 20))

I don’t recommend this option because just because one measurement is invalid, doesn’t mean all the measurements are. Additionally, if you have low quality data, by time that you’ve applied this approach to every variable you might find that you don’t have any data left!

另一种是把异常值取代为NA

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ifelse() has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is false. Alternatively to ifelse, use dplyr::case_when(). case_when() is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables.


What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

这个solution推荐看一下

Missing values are removed when the number of observations in each bin are calculated. See the warning message: Removed 9 rows containing non-finite values (stat_bin)

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ggplot(diamonds2, aes(x = y)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 9 rows containing non-finite values (stat_bin).
img

In the geom_bar() function, NA is treated as another category. The x aesthetic in geom_bar() requires a discrete (categorical) variable, and missing values act like another category.

diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
  ggplot() +
  geom_bar(mapping = aes(x = cut))
img

In a histogram, the x aesthetic variable needs to be numeric, and stat_bin() groups the observations by ranges into bins. Since the numeric value of the NA observations is unknown, they cannot be placed in a particular bin, and are dropped.


7.5 Covariation

注意这跟协变量(covariance)是不一样的

下面的

  • 一个分类变量、一个连续变量
  • 两个分类变量
  • 两个连续变量

都值得去看一看


7.5.1 A categorical and continuous variable

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape. For example, let’s explore how the price of a diamond varies with its quality:

geom_freqpoly()的纵坐标是数目,如果你的分类变量对应的数目差距很大的话,你如果放一个图里面,那么数目少那组的自身变化就很难看出来了。

For example, let’s explore how the price of a diamond varies with its quality:

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
img

这里紫色的Fair组你就很难看出趋势来

It’s hard to see the difference in distribution because the overall counts differ so much:

ggplot(diamonds) + 
  geom_bar(mapping = aes(x = cut))
img

To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display density, which is the count standardised so that the area under each frequency polygon is one.

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
img

我们可以看到,似乎质量差的钻石反而价格更高

这跟直接geom_density得到的密度图还是不一样的。但我也不知道density是怎么出来的

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_density(mapping = aes(colour = cut))
image

Smoothed density estimates

看起来上面这个图是个不对称的分布

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
image
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
img

If you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that with coord_flip().

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
img

这个比较有用的应该是你画GO plot的时候,可以按某个顺序排下来,这样图会很清爽。


这个来自solution 核心意思就是因为质量差的那些钻石,其个头比较大,所以相对来说,价格比较大

如果不想看的话,可以跳过。

What are the general relationships of each variable with the price of the diamonds? I will consider the variables: carat, clarity, color, and cut. I ignore the dimensions(就是所谓的diamond数据里面的x、y、z) of the diamond since carat measures size, and thus incorporates most of the information contained in these variables.

Since both price and carat are continuous variables, I use a scatter plot to visualize their relationship.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()
img

However, since there is a large number of points in the data, I will use a boxplot by binning carat (as suggested in the chapter).

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

一个小问题就是如果你是ggplot2 3.3.0的话,是跑不出这样的结果的

img

Note that the choice of the binning width is important, as if it were too large it would obscure any relationship, and if it were too small, the values in the bins could be too variable to reveal underlying trends.

The variables color and clarity are ordered categorical variables. The chapter suggests visualizing a categorical and continuous variable using frequency polygons or boxplots. In this case, I will use a box plot since it will better show a relationship between the variables.

There is a weak negative relationship between color and price. The scale of diamond color goes from D (best) to J (worst). Currently, the levels of diamonds$color are in the wrong order. Before plotting, I will reverse the order of the color levels so they will be in increasing order of quality on the x-axis. The color column is an example of a factor variable, which is covered in the “Factors” chapter of R4DS.

diamonds %>%
  mutate(color = fct_rev(color)) %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()
img

fct_rev这个命令很奇怪

# 变不回去了
> f <- factor(c("a", "b", "c"))
> fct_rev(f)
[1] a b c
Levels: c b a
> fct_rev(f)
[1] a b c
Levels: c b a

There is also weak negative relationship between clarity and price. The scale of clarity goes from I1 (worst) to IF (best).

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))
img

For both clarity and color, there is a much larger amount of variation within each category than between categories. Carat is clearly the single best predictor of diamond prices.

Now that we have established that carat appears to be the best predictor of price, what is the relationship between it and cut? Since this is an example of a continuous (carat) and categorical (cut) variable, it can be visualized with a box plot.

ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()
img

There is a lot of variability in the distribution of carat sizes within each cut category. There is a slight negative relationship between carat and cut. Noticeably, the largest carat diamonds have a cut of “Fair” (the lowest).

This negative relationship can be due to the way in which diamonds are selected for sale. A larger diamond can be profitably sold with a lower quality cut, while a smaller diamond requires a better cut.

Y叔介绍过的翻云覆雨图

基因组数据一般就是大量数据,个人感觉如果对于一些基因组数据使用小提琴图、箱式图、点图等效果都不好,因为会是密密麻麻的一团。一般都得过滤之后才会展现出一些趋势。

ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv()
img

这个很适合实验上的表型数据展现ggbeeswarm

There are two methods:

I’ll use the mpg box plot example since these methods display individual points, they are better suited for smaller datasets.

ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))
img
ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukey"
  )
img
ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukeyDense"
  )
img
ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "frowney"
  )
img
ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "smiley"
  )
img
ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))
img

7.5.2 Two categorical variables

# geom_count的确是个好函数唉
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

# 应该是等价于
diamonds %>% 
  group_by(cut, color) %>% 
  mutate(n = n()) %>% 
  ggplot(mapping = aes(x = cut, y = color, size = n)) +
  geom_point()
img

上面的图让我想起了Motif的图

image

这个图Y叔有讲过一点:怎样用HOMER算出的P-value画出CNS级别的泡泡图?

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))
img

If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.

关于seriation包,搜到了这个【ComplexHeatmap包学习笔记】2.5 Seriation


其实核心就是用百分比,类似于之前的密度图。我觉得还是很值得我们学习的

To clearly show the distribution of cut within color, calculate a new variable prop which is the proportion of each cut within a color. This is done using a grouped mutate.

diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_viridis(limits = c(0, 1)) # from the viridis colour palette library
img

Similarly, to scale by the distribution of color within cut,

diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_viridis(limits = c(0, 1))
img

I add limit = c(0, 1) to put the color scale between (0, 1). These are the logical boundaries of proportions. This makes it possible to compare each cell to its actual value, and would improve comparisons across multiple plots. However, it ends up limiting the colors and makes it harder to compare within the dataset. However, using the default limits of the minimum and maximum values makes it easier to compare within the dataset the emphasizing relative differences, but harder to compare across datasets.(数据集内部比较与数据集之间的比较)

# 如果你不限定颜色范围的话,图例颜色标度会自适应你的颜色范围
diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_viridis()
image

这样子看起来非常的清楚,但有一个问题是这个颜色标度只适应于你现在的数据集,如果你还要跟其他的同类型数据集进行比较的话,就要scale_fill_viridis(limits = c(0, 1))了。这样不同数据集之间才可以比较

geom_tile有个问题是不能聚类。但Y叔的ggtree却可以做到聚类,这一点比patchwork还要好(patchwork可以拼图,但其只是对准坐标轴)

Example of aligning tree with data side by side by creating composite plot.

It’s usually better to use the categorical variable with a larger number of categories(这个变量里面有很多种) or the longer labels(标签名字很长) on the y axis. If at all possible, labels should be horizontal because that is easier to read(y轴上的标签也应该是垂直,更易读).

However, switching the order doesn’t result in overlapping labels.

diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(y = color, x = cut)) +
  geom_tile(mapping = aes(fill = n))
img

Another justification, for switching the order is that the larger numbers are at the top when x = color and y = cut, and that lowers the cognitive burden of interpreting the plot.


7.5.3 Two continuous variables

ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))
img img

当然,我觉得用二维核密度估计图也是可以的

m <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
 geom_point() +
 xlim(0.5, 6) +
 ylim(40, 110)

m + stat_density_2d(aes(fill = after_stat(level)), geom = "polygon")
img

来自:geom_density_2d

# 下面代码出不来下图的结果
# 可能是版本的问题
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

根据的geom_boxplot的reference来看,ggplot2 3.3.0版本的确不是下图的结果。

但我在服务器的ggplot2 3.2.1跑出来的就是跟书里面一样的图

不过这个问题应该只存在于你的x变量是连续型变量的时候

img

One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE (如果设置了这个参数,那么你boxplot的宽度就会和你的数据数目成正比)

# 同样地,下面代码出不来下图的结果
# 可能是版本的问题
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
img

Both cut_width() and cut_number() split a variable into groups. When using cut_width(), we need to choose the width, and the number of bins will be calculated automatically. When using cut_number(), we need to specify the number of bins, and the widths will be calculated automatically.

In either case, we want to choose the bin widths and number to be large enough to aggregate observations to remove noise, but not so large as to remove all the signal.

If categorical colors are used, no more than eight colors(之前ggplot2那章提到过,超过8个颜色就不会显示了) should be used in order to keep them distinct. Using cut_number, I will split carats into quantiles (five groups).

ggplot(
  data = diamonds,
  mapping = aes(color = cut_number(carat, 5), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")
img

Alternatively, I could use cut_width to specify widths at which to cut. I will choose 1-carat widths. Since there are very few diamonds larger than 2-carats, this is not as informative. However, using a width of 0.5 carats creates too many groups, and splitting at non-whole numbers is unappealing.

# boundary = 0稍微可以注意下,如果你不设置这个,似乎就不会是0-1,1-2了
ggplot(
  data = diamonds,
  mapping = aes(color = cut_width(carat, 1, boundary = 0), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")
img

Plotted with a box plot with 10 bins with an equal number of observations, and the width determined by the number of observations.

ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")
img

Plotted with a box plot with 10 equal-width bins of 2,000. The argument boundary = 0 ensures that first bin is $0–2,000.

ggplot(diamonds, aes(x = cut_width(price, 2000, boundary = 0), y = carat)) +
  geom_boxplot(varwidth = TRUE) +
  coord_flip() +
  xlab("Price")
img

There are many options to try, so your solutions may vary from mine. Here are a few options that I tried.

唯一需要注意的是,用离散型变量去分组

# scale_fill_viridis需要预先加载library(viridis)
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  facet_wrap(~cut, ncol = 1) +
  scale_fill_viridis()
img
ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, colour = cut)) +
  geom_boxplot()
img
ggplot(diamonds, aes(colour = cut_number(carat, 5), y = price, x = cut)) +
  geom_boxplot()
img
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
img

Why is a scatterplot a better display than a binned plot for this case?

In this case, there is a strong relationship between xx and yy. The outliers in this case are not extreme in either xx or yy. A binned plot would not reveal these outliers, and may lead us to conclude that the largest value of xx was an outlier even though it appears to fit the bivariate pattern well.

个人感觉,geom_hex之类的是展现宏观趋势的,对于少数点的效果并不好,当然你也可以把bin变得很多,bin_width变得很小,但这样其实跟散点图是一样的了。


7.6 Patterns and models

ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))
img

这让我想起对于这种出现subgroup的散点图,如果你做相关性,很容易出现相关性很大的情况

下图来自elife文章《Ten common statistical mistakes to watch out for when writing or reviewing a manuscript》,也有翻译的中文版《论文写作或审稿时的十种常见统计错误》

其中A-F都是不相关的数据

image

所以说,我个人感觉用散点图做相关性的时候,是不能把几个重复放一块上去的,因为重复自身就很容易出现subgroup

所以ggstatsplot的上面和右边的barplot还是有必要的,他可以标识出你的数据是否出现了subgroup

image

Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals(剩下的部分应该就是carat所不能解释的) (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed.

library(modelr)

# 之所以这里要log,我猜是因为是因为carat和price的量级不一样
mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(data = diamonds2) + 
  geom_point(mapping = aes(x = carat, y = resid))
img

Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.

ggplot(data = diamonds2) + 
  geom_boxplot(mapping = aes(x = cut, y = resid))
img

You’ll learn how models, and the modelr package, work in the final part of the book, model. We’re saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.

上面描述的就是在你有三个变量下(price(这个可以说是响应变量了),carat,cut(这两个是自变量))的建模过程。其实可以类比到你有RNA-Seq、甲基化、H3K27me3数据。我们就可以首先把甲基化和RNA-Seq的数据建模,然后用残差来对H3K27me3和RNA-Seq建模。


7.7 ggplot2 calls

# 可以简写
ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

ggplot(faithful, aes(eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

--------------------------------------------------------------

# tidyvese整理好的数据直接可以导入到ggplot2函数里面画图

# 值得一提的是,H神在自己的tidyverse style guide可不是这么说的,hhhh
diamonds %>% 
  count(cut, clarity) %>% 
  ggplot(aes(clarity, cut, fill = n)) + 
    geom_tile()

# 他提到准确的写法应该是
# ggplot2的+号的规则类似于 %>%。同时要注意,如果你ggplot2的数据来源于dplyr的管道结果,则只能有一个层次的缩进。
# 这也是R studio的默认写法
diamonds %>% 
  count(cut, clarity) %>% 
  ggplot(aes(clarity, cut, fill = n)) + 
  geom_tile()

7.8 Learning more


8 Workflow: projects

我个人感觉,一定要设置这个!!!

如果你有一些很耗时间跑出来的结果(比如Seurat跑出来的一些中间结果),不舍得退出的时候自动删除。你可以考虑用save或者saveRDS。

img
  1. Press Cmd/Ctrl + Shift + F10 to restart RStudio.
  2. Press Cmd/Ctrl + Shift + S to rerun the current script.
# plot放图
# rawdata放原始数据
# result放文本结果
# script放脚本和函数
# temp_code放我暂时丢失的脚本和函数
.
├── plot
├── rawdata
├── result
├── script
└── temp_code
上一篇 下一篇

猜你喜欢

热点阅读