给女朋友写的生统资料_Part18
apply和function
差异基因的检验估计会用到function和apply。不过差异基因表达的function和apply估计也不会有太大的难度,所以我这里就直接放上《R语言实战》的例子。详细地题目应对我还是放到后面具体题目(当然也是因为我R语言太菜了……)
function
18_1.pngfunction可以去看下《R语言实战》第二版的P436,我这里截个图
function的基本语法为
functionname <- function(parameters){
statements
return(value)
}
functionname就是你的函数名字,你也可以写成wodehanshu这种名字。
parameters里面就是设定你要用到的参数
statement就是你执行的一系列操作,一般来说你这一系列操作会产生一个值。然后用return返回
例子的话,上面已经举了。
apply
apply的使用我还是放一个《R语言实战》的例子,在P95和P96
18_2.png 18_3.png差异基因表达
差异基因估计是用t-test做检验,然后用p.adjust做矫正(虽然照理说不能用t-test做)。我这里就不详细地讲了。举几个例子吧
题目1
我们第4次作业的第四题
Type 1 diabetes is a multigenic disease caused by T-cell mediated destruction of the insulin producing β-cells. Although conventional (targeted) approaches of identifying causative genes have advanced our knowledge of this disease, many questions remain unanswered.
Here we have a gene data from NOD mouse after(case) and before(control) treatment. The data can be found in “Data.txt”.Use the information mentioned above to answer the following questions:
1.use paired t-test to find genes which have significant expression (p<0.05) between case and control sample. Give the number of differential expressed genes and give the names of top 10 significantly differential expression genes. hint: “apply(data,1,function(x){…})” can apply function to every row in data more quickly than “for{}”, “names()” or “rownames()” can be used to extract names of differentially expressed genes
这里是输入了基因表达的数据。分别是10个control和10个case,然后做配对的t-test。我们可以手动地对每个基因(即每一列)对t-test。但这样太麻烦,所以我们就可以用apply。apply可以对每一列做批量化的操作。
# 读取和整理数据
# paste0这个操作就是把“gene”这个字符串和对应的列名粘起来,这样使得列名比较直观一点
test4 <- read.table("rawdata/test4.txt",header = T)
rownames(test4) <- paste0("gene",rownames(test4))
# 看下数据(我后来发现没有gene1了。。。。。但我最后的答案跟作业答案一样,应该是原始文件的锅)
> head(test4)
control1 control2 control3 control4 control5 control6 control7 control8 control9 control10 case1 case2
gene2 6.917468 6.308350 5.318841 5.886811 5.082975 5.629453 4.919035 3.134226 4.242564 5.783208 3.574525 5.371273
gene3 7.862730 7.065809 6.783732 6.275773 3.063104 5.131017 3.708938 2.766133 6.755942 3.954392 7.163509 5.600665
gene4 7.425996 7.707939 7.550885 5.708736 5.900326 6.888329 5.987753 5.948979 7.281995 6.758037 5.367602 7.898202
gene5 6.544029 8.364054 7.239746 9.037322 8.783692 8.863040 10.602350 9.046190 9.007177 7.866627 8.079780 8.128662
gene6 4.986893 7.248867 7.098780 6.787237 6.252609 6.428099 7.758556 6.726613 7.123883 6.584567 6.642005 7.376994
gene7 6.995863 6.448682 6.136518 7.098878 4.984325 5.896835 6.488555 5.299349 5.524948 6.763232 7.004367 6.478264
case3 case4 case5 case6 case7 case8 case9 case10
gene2 7.217279 6.786191 5.523913 5.355584 6.693710 4.364467 6.270432 6.993901
gene3 6.079167 7.125393 5.169236 4.524597 4.715383 2.946439 4.200206 5.798403
gene4 6.321147 7.810430 6.231793 6.367243 7.579535 7.207376 7.548390 6.522594
gene5 8.716447 9.940214 9.224614 9.139514 9.429897 9.373779 9.287744 9.275545
gene6 7.080039 7.100746 6.453575 6.618557 7.614747 7.141351 6.076726 7.369737
gene7 6.273696 6.413472 4.983931 5.629765 5.213316 5.931045 5.776625 6.522169
# 构建函数(我个人可能比较推荐先构建函数,然后把函数放到apply中,这样可能比较直观一点)
# 我输入数据为x,然后我提取1:10个数为data1,11:20个数为data2。然后做t.test
myFun <- function(x){
data1 <- x[1:10]
data2 <- x[11:20]
t_result <- t.test(data1,data2,paired = T)
p_value <- t_result$p.value
return(p_value)
}
# 利用apply,一行行地把数据放入myFun里面,一行行地做t.test
# 输出应该是个向量
test4_result <- apply(test4_paired_result, 1, result)
# 也可以直接一行代码
apply(test4, 1, function(x) {t.test(x[1:10],x[11:20],paired = T)$p.value})
# 数下差异基因的数量
> sum(test4_result < 0.05)
[1] 2296
# 输出前10个
## 先对得到的p-value进行排序
test4_result[order(test4_result)]
## 然后得到前10
> test4_result[order(test4_result)][1:10]
gene28801 gene27868 gene27438 gene21642 gene24019 gene12323 gene12962 gene28939 gene2387
4.277344e-06 6.093302e-05 8.229606e-05 9.889894e-05 1.124717e-04 1.261658e-04 1.298200e-04 1.462493e-04 1.522920e-04
gene18712
1.601297e-04
2.Adjust the p-values in question a) with bonferroni and FDR method to find differentially expressed genes in stringent way( list the differentially expressed gene names and the adjusted p-value)
矫正的话,直接用p.adjust就可以了。这里再用sum或者mean数一数差异基因的个数
test4_adjust_bonf <- p.adjust(test4_result, method = "bonferroni")
mean(test4_adjust_bonf < 0.05)
test4_adjust_fdr <- p.adjust(test4_result, method = "fdr")
mean(test4_adjust_fdr < 0.05)
题目2
把题目中的paired数据变成非配对的数据。那么我们就要多一步方差齐性检验了。
# 在函数里面做判断
myFun <- function(x){
data1 <- x[1:10]
data2 <- x[11:20]
if (var.test(data1,data2)$p.value > 0.05){
t_result <- t.test(data1,data2,var.equal = T)
p_value <- t_result$p.value
return(p_value)
} else {
t_result <- t.test(data1,data2,var.equal = F)
p_value <- t_result$p.value
return(p_value)
}
}
test4_result <- apply(test4, 1, myFun)
题目3
来自17年考试的第5题
Nonalcoholic steatohepatitis or NASH is a common liver disease, and was found to be linked to obesity and diabetes, suggesting an important role of adipose tissue in the pathogenesis of NASH. Therefore, the mouse model was used to investigate the interaction between adipose tissue and liver. Wildtype male C57Bl/6 mice were fed low fat food (LFD) or high fat food (HFD) for 21 weeks. The detailed data can be found in GDS4013.txt. Hint:provide the R code and result calculated with R. (本题共20分)
1.Read the data. Check whether there are NAs in the data? If NAs exist, you should deaf with them before next steps. Hint: delete the rows with NAs in data. (5分)
# 先读取数据
> test <- read.table("rawdata/GDS4013.txt",header = T)
> head(test)
LFD LFD.1 LFD.2 LFD.3 LFD.4 LFD.5 LFD.6 LFD.7 LFD.8 LFD.9 HFD HFD.1 HFD.2
COPG1 7.756187 7.513726 7.927167 7.742323 7.974731 7.632936 7.419638 8.022296 7.649602 7.694316 7.717317 7.721638 7.873688
ATP6V0D1 8.968808 9.004967 9.025234 9.139161 9.142125 8.998360 9.097204 9.181852 9.127648 9.149562 9.029930 8.948781 9.095943
GOLGA7 9.698317 9.798994 9.765514 9.581075 9.776361 9.829636 9.670068 9.675262 9.812124 9.637824 9.625283 9.822128 9.700576
PSPH 6.897664 NA 6.744140 7.225862 7.343744 6.877918 6.422594 7.188237 6.819006 6.524627 6.514469 6.949181 7.044664
TRAPPC4 5.406903 5.473714 5.749002 5.416623 5.351018 5.431658 5.606044 5.401849 5.374199 5.331412 5.439976 5.487543 5.290085
DPM2 6.532734 6.215751 6.520986 6.143827 6.597586 6.731978 6.553594 6.659309 6.288114 6.561733 6.819006 6.616581 6.349064
HFD.3 HFD.4 HFD.5 HFD.6 HFD.7
COPG1 7.637193 7.570261 7.549871 7.687248 7.603990
ATP6V0D1 9.092459 8.981442 9.014060 9.122049 9.000419
GOLGA7 9.739647 9.679455 9.790654 9.817816 9.761968
PSPH 6.708345 6.413345 6.617198 6.843793 6.467956
TRAPPC4 5.485993 5.376865 5.522681 5.430362 5.490806
DPM2 6.554808 6.416926 6.562119 6.524627 6.771978
# 利用table看看缺失值的多少
> table(is.na(test))
FALSE TRUE
481694 4
# 利用na.omit
test_new <- na.omit(test)
> table(is.na(test_new))
FALSE
481626
2.Find differentially expressed genes (DEGs) (up-regulated, HF>LF) between LFD and HFD samples, list the DEGs with adjusted p-values less than 0.05. For expressions of each gene, check the homogeneity of the variances in two groups at first. Hint: FDR should be used to adjust p-values.
# 因为不是配对检验,所以我们首先要做的是检验方差齐性
## 所以我们需要先构建一个函数来检验方差的齐性
myFun_var <- function(data){
LFD <- data[1:10]
HFD <- data[11:18]
var_result <- var.test(LFD,HFD)
p_value <- var_result$p.value
}
## 应用apply来得到p-value
var_pvalue <- apply(test_new, 1, myFun_var)
## 然后往表达矩阵上多加一列p-value,得到有附加信息的表达矩阵
> test_new_var <- cbind(test_new,var_pvalue)
> head(test_new_var)
LFD LFD.1 LFD.2 LFD.3 LFD.4 LFD.5 LFD.6 LFD.7 LFD.8 LFD.9 HFD HFD.1
COPG1 7.756187 7.513726 7.927167 7.742323 7.974731 7.632936 7.419638 8.022296 7.649602 7.694316 7.717317 7.721638
ATP6V0D1 8.968808 9.004967 9.025234 9.139161 9.142125 8.998360 9.097204 9.181852 9.127648 9.149562 9.029930 8.948781
GOLGA7 9.698317 9.798994 9.765514 9.581075 9.776361 9.829636 9.670068 9.675262 9.812124 9.637824 9.625283 9.822128
TRAPPC4 5.406903 5.473714 5.749002 5.416623 5.351018 5.431658 5.606044 5.401849 5.374199 5.331412 5.439976 5.487543
DPM2 6.532734 6.215751 6.520986 6.143827 6.597586 6.731978 6.553594 6.659309 6.288114 6.561733 6.819006 6.616581
PSMB5 10.217143 9.992716 9.974211 10.201814 10.244172 10.023191 10.072604 10.147738 10.267069 10.024750 10.142353 10.119065
HFD.2 HFD.3 HFD.4 HFD.5 HFD.6 HFD.7 var_pvalue
COPG1 7.873688 7.637193 7.570261 7.549871 7.687248 7.603990 0.1119611
ATP6V0D1 9.095943 9.092459 8.981442 9.014060 9.122049 9.000419 0.5807521
GOLGA7 9.700576 9.739647 9.679455 9.790654 9.817816 9.761968 0.6524620
TRAPPC4 5.290085 5.485993 5.376865 5.522681 5.430362 5.490806 0.1770907
DPM2 6.349064 6.554808 6.416926 6.562119 6.524627 6.771978 0.6063077
PSMB5 10.086752 10.240642 10.086752 10.246507 10.291952 10.129569 0.3887486
# 得到了方差检验的p-value,我们就可以来把这个条件加入到t.test里面,然后做t-test了
## 首先也要构建t-test的函数
myFun_t_test <- function(data){
LFD <- data[1:10] # 1:10列为LFD
HFD <- data[11:18] # 11:18列为HFD
var_result <- data[19] # 19列为方差齐性的p-value
t_result <- t.test(LFD,HFD,alternative = "less",var.equal = var_result > 0.05) ## 这里var_result是p-value,p-value如果大于0.05了(方差齐性),就会输出TRUE。
p_value <- t_result$p.value
}
## 利用apply,一行行地输入函数里面,做t-test
t_test_pvalue <- apply(test_new_var, 1, myFun_t_test)
# 得到了p-value之后,就用fdr方法做矫正
> t_test_padjust <- p.adjust(t_test_pvalue,method = "fdr")
> t_test_padjust[t_test_padjust < 0.05]
SEMA5B BC048403
0.01698630 0.04004457
大家可能会奇怪这里为什么是less了,一般我们不都是two.side么
选择单双边实际上是跟你的假设有关系的,我们这里是假设up-regulated, HF>LF。只是单边而已。那为啥又是less,而不是greater呢。因为t.test(a,b,alternative)里面的alternative是针对前一个比后一个,即如果是greater的话,就是a比b大。我们这里是LFD放在了前面,假设是HF>LF的话,那么我们写的时候应该是LFD less than HFD。
举个例子的话
# 模拟两个值
> a <- rnorm(20)
> b <- rnorm(20,mean = 1)
# 可以看到下面的统计检验,如果是greater的话,即a>b,是不显著的。而less是显著的
> t.test(a,b,alternative = "greater")
Welch Two Sample t-test
data: a and b
t = -1.5134, df = 30.883, p-value = 0.9298
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-0.7961539 Inf
sample estimates:
mean of x mean of y
0.4103696 0.7858315
> t.test(a,b,alternative = "less")
Welch Two Sample t-test
data: a and b
t = -1.5134, df = 30.883, p-value = 0.07017
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 0.04523018
sample estimates:
mean of x mean of y
0.4103696 0.7858315
然后先方差齐性检验,后做t-tes的代码,除了上面的,也可以用题目2的代码
# 用了if else
myFun <- function(x){
data1 <- x[1:10]
data2 <- x[11:18]
if (var.test(data1,data2)$p.value > 0.05){
t_result <- t.test(data1,data2,var.equal = T,alternative = "less")
p_value <- t_result$p.value
return(p_value)
} else {
t_result <- t.test(data1,data2,var.equal = F,alternative = "less")
p_value <- t_result$p.value
return(p_value)
}
}
t_test_pvalue <- apply(test_new_var, 1, myFun)
t_test_padjust <- p.adjust(t_test_pvalue,method = "fdr")
t_test_padjust[t_test_padjust < 0.05]
也可以用答案里面简略的
# 方差齐性检验
p.vartest <- apply(GDS4013_new, 1, function(x)var.test(x[1:10],x[11:18])$p.value)
# 利用cbind合并p-value和表达矩阵
GDS4013_new_data <- cbind(GDS4013_new, p.vartest)
# t.test检验
p.value <- apply(GDS4013_new_data, 1, function(x)t.test(x[1:10],x[11:18],alternative = 'less', var.equal = (x[19] >= 0.05))$p.value)
p.fdr <- p.adjust(p.value, "fdr")
> sum(p.fdr<0.05)
[1] 2
> sig.p.fdr<-p.fdr[p.fdr<0.05]
> names(sort(sig.p.fdr))
[1] "SEMA5B" "BC048403"
富集分析
富集分析的原理我应该会在考完试有空的时候再重新理一下,我这里就直接放上用fihser exact test(fisher 精确检验)做富集分析的代码。
稍微提下富集的原理。
一般差异表达的基因会有上千个,一个个手动看肯定不现实。所以我们就需要从宏观上来看这些基因,用到的一个方法就是富集分析。一般用的比较多的是GO富集分析,就是比如我们得到1000个差异基因,然后我们感兴趣的GO通路在差异基因中有50个基因。总的基因有2w个,在这2w个基因中,我们感兴趣的GO有200个。我们想知道感兴趣GO相比于总体来说,在差异基因集中是不是富集的。
fisher exact test输入的格式是列联表的格式,就像下面
差异基因 | 无差异基因 | |
---|---|---|
在通路中 | 50 | 200-50=150 |
不在通路中 | 1000-50=950 | 19000-150=18850 |
变成代码就是
> test <- matrix(c(50,950,150,18850),2)
> test
[,1] [,2]
[1,] 50 150
[2,] 950 18850
> fisher.test(test)
Fisher's Exact Test for Count Data
data: test
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
4.670869 9.230328
sample estimates:
odds ratio
6.61189
看下我们某年的一个GO富集题目
Usually you are interested in the function indicated by differentially expressed genes, for which GO enrichment is a widely used method. In order to find whether the differentially expression genes (downregulated, p<=0.05) are enriched in “leukocyte activation during immune response” (GO term), please show a conclusive result using fisher exact test. Known genes annotated with this GO are listed in “GO_2_2.txt”
全部基因总共是14082个基因,前面差异表达做出来的差异基因是617个。然后GO_2_2.txt里面给出了leukocyte activation during immune response这个GO的基因,基因数目为193。我们就需要找差异基因中有多少这个GO通路的基因,答案里用到的intersect,其实就是取交集。我举个例子
> total <- c(1:10)
> part <- c(1,4,5)
> intersect(part,total)
[1] 1 4 5
# 还可以知道有多少
> length(intersect(part,total))
[1] 3
然后答案里的intersect出来是10个。这样列联表的数据就全了。
差异基因 | 无差异基因 | |
---|---|---|
在通路中 | 10 | 193-10=183 |
不在通路中 | 617-10=607 | 14082-607-183-10=13282 |
然后就可以做fisher了
> test <- matrix(c(10,607,183,13282),nrow = 2)
> test
[,1] [,2]
[1,] 10 183
[2,] 607 13282
> fisher.test(test)
Fisher's Exact Test for Count Data
data: test
p-value = 0.5923
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.5609475 2.2650703
sample estimates:
odds ratio
1.195686
然后有人提到答案里面其实是不准的,因为GO2_2的表中,有几个基因是不在总的基因里面的。
画出列联表的步骤我觉得可以是这样的
- 差异基因数目定下了,然后GO表和差异基因取交集,那么就得到了列联表的左上。
- 然后差异基因减去左上,就得到了左下
- 然后GO表减去左上,就得到了右上
- 然后总的基因减去左上、左下、右上,就得到了右下。
我只是按照了答案的思路来,不保证对哈。大家自行斟酌,原理就是这么个原理。