R语言实现bootstrap和jackknife检验方法

2020-12-19 本文已影响0人杨康chin

写在最前面：
首先需要说一下，本文的bootstrap和jackknife都算是蒙特卡罗方法(Monte Carlo method)的一种。应用广泛的的MCMC链(马尔可夫链蒙特卡洛方法;Markov chain Monte Carlo)也是蒙特卡罗与马尔可夫链的结合。简单来说，蒙特卡罗方法就是从已知样本的分布中随机抽取新的样本集进行评估，然后放回，再次抽取的方法。根据具体方法的不同，抽取样本集的手段也不同。

bootstrap方法

bootstrap抽样方法将观测到的样本视为一个有限的总体，是唯一的信息来源，从中有放回的随机抽样来评估总体特征，以及对抽样总体进行推断统计。bootstrap 也分参数bootstrap和非参数bootstrap，前者的分布已完全知道。但在生信领域一般没有这种情况。所以下面讨论的是非参数bootstrap。

图示

直接上例子：
假设现在有bootstrap包中的law数据集如下，

> library(bootstrap)
> data(law)
> law
   LSAT  GPA
1   576 3.39
2   635 3.30
3   558 2.81
4   578 3.03
5   666 3.44
6   580 3.07
7   555 3.00
8   661 3.43
9   651 3.36
10  605 3.13
11  653 3.12
12  575 2.74
13  545 2.76
14  572 2.88
15  594 2.96

现在我们要计算LSAT成绩（美国法学入学考试）和GPA之间的相关系数。但因为样本量太少了，所以我们使用bootstrap重复抽样评估其标准误。

library(bootstrap)
data(law)
law
cor(law$LSAT, law$GPA)

print(cor(law$LSAT, law$GPA))
#设置bootstrap循环
B <- 200 #重复抽样的次数
n <- nrow(law) #样本大小（15个）
R <- numeric(B) #储存每次循环后计算得到的相关系数
#对R的标准差进行bootstrap评估（通过计算每次抽样的标准差）
for (b in 1:B) {
  #randomly select the indices
  i <- sample(1:n, size = n, replace = TRUE)
  # 从1:n的范围内有放回地抽取n个样本
  LSAT <- law$LSAT[I] 
  GPA <- law$GPA[I]
  R[b] <- cor(LSAT, GPA)
  }
#output
print(se.R <- sd(R))  #多次抽样的标准差即为标准误
hist(R, prob = TRUE)

200次循环抽样后，计算得se.R标准误为0.1474629
得到如下的图：

200次循环后后的相关系数概率分布

1e6次循环抽样后，计算得se.R标准误为0.1333802
得到如下的图：

1e6次循环后的相关系数概率分布

bootstrap包及函数

如果用bootstrap包的bootstrap函数会快一些：

n<-15
theta<-function(x,law){cor(law[x,1],law[x,2])}
result<-bootstrap(1:n,200,theta,law)$thetastar
hist(result,prob=T)
sd(result)

bootstrap函数的用法：bootstrap(抽取样本范围,重复次数,进行bootstrap的函数,bootstrap的数据集)

$bootstrap单次取样的样本分布 \to$
$bootstrap多次取样的分布(将每一次bootstrap作为一个样本的分布) \to$
$从自然真实总体中观测到的样本集分布$

利用bootstrap计算偏差(bias):

偏差定义为bootstrap结果（多个数值）与原数据统计结果（单个数值）的均值：

library(bootstrap)
data(law)
origin<-cor(law$LSAT, law$GPA)
n<-15
theta<-function(x,law){cor(law[x,1],law[x,2])}
result<-bootstrap(1:n,200,theta,law)$thetastar
hist(result,prob=T)
se<-sd(result)
bias <- mean(result - origin)
bias

得到bias大约为0.001817608，比较小

计算bootstrap置信区间(CI)

换一个包，boot包

library(boot)
data(law)
theta.boot<-function(dat,ind){
  cor(dat[ind,1],dat[ind,2])}
dat<-cbind(law$LSAT,law$GPA)
boot.obj <- boot(dat, statistic = theta.boot, R = 2000)
print(boot.obj)
print(boot.ci(boot.obj,type = c("basic", "norm", "perc")))


> print(boot.ci(boot.obj,type = c("basic", "norm", "perc")))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot.obj, type = c("basic", "norm", "perc"))

Intervals : 
Level      Normal              Basic              Percentile     
95%   ( 0.5225,  1.0503 )   ( 0.5906,  1.0958 )   ( 0.4569,  0.9621 )  
Calculations and Intervals on Original Scale

这里用了三种方法计算置信区间：basic、正态和百分数。样本相关系数分布接近正态，则正态置信区间接近百分数区间。此外还有“Better Bootstrap Confivendence Interval” 更好的bootstrap置信区间，称为BCa区间，使用偏差和偏度对百分数置信区间进行矫正。设置type="bca"即可。

Jackknife方法

简单的说，bootstrap是从原有真实样本中有放回地抽取n个。jacknife就是每次都抽取n-1个样本，也就是每次只剔除一个原样本。

Jackknife偏差：

同样地，如果以bootstrap包中的law数据进行演示：

library(bootstrap)
data(law)
n <- nrow(law)
LSAT <- law$LSAT
GPA <- law$GPA
theta.hat <- cor(LSAT,GPA)
print (theta.hat)
theta.jack <- numeric(n)
for (i in 1:n)
  theta.jack[i] <-cor(LSAT[-i],GPA[-i])
bias <- (n-1) * (mean(theta.jack) - theta.hat)

Jackknife计算的bias为-0.006473623。这里jackknife的偏差公式相比于bootstrap有一个(n-1)系数，推导就不写了。

Jackknife标准误

se <- sqrt((n-1) * mean((theta.jack - mean(theta.jack))^2))
print(se)

标准误se为0.1425186，与bootstrap得出的比较接近。

Jackknife失效的情况：

当统计量不太平滑的时候，Jacknife有很大误差。比如说对中位数进行统计，其变化很大。在进行Jacknife之后最好再跑一次bootstrap，看看是否相差很大。

jackknife-after-bootstrap方法对bootstrap的抽样计算标准差

居然还能这么嵌套着玩，针对每次bootstrap形成的数列向量计算jackknife的标准差，这样可以看出bootstrap若干次取样之间的差异。

#jackknife-after-bootstrap 
# initialize
data(law)
n <- nrow(law)
LSAT <- law$LSAT
GPA <- law$GPA
B <- 2000
theta.b <- numeric(B)
# set up storage for the sampled indices 
indices <- matrix(0, nrow = B, ncol = n)
# jackknife-after-bootstrap step 1: run the bootstrap 
for (b in 1:B) {
  i <- sample(1:n, size = n, replace = TRUE) 
  LSAT <- law$LSAT[I]
  GPA <- law$GPA[I]
  theta.b[b] <- cor(LSAT,GPA)
  #save the indices for the jackknife 
  indices[b, ] <- I
}
#jackknife-after-bootstrap to est. se(se) 
se.jack <- numeric(n)
for (i in 1:n) {
  #in i-th replicate omit all samples with x[I] 
  keep <- (1:B)[apply(indices, MARGIN = 1,FUN = function(k) {!any(k == i)})] 
  ## 根据jackknife的特性，对bootstrap抽样矩阵indices进行8次取样，每次去除存在k=I的向量。
  ## 也就是说第一次取样不能包含1，第二次不能包含2，把这些行都去掉再计算。
  se.jack[i] <- sd(theta.b[keep])
}
print(sd(theta.b))
print(sqrt((n-1) * mean((se.jack - mean(se.jack))^2)))

算出来分别为0.1344824和0.08545141。后者较小，表面bootstrap取样之间的variance较小。

交叉检验：Cross Validation

简单来说就是一种数据分割检验的方法，将数据分割为K份，称为"K-fold"交叉检验，每次第i个子集作为测试集来评估模型，其余的用来构建模型。Admixture使用的就是这个原理。Jackknife也属于Cross Validation的应用之一。

使用R语言ape包对系统发育树进行bootstrap

library(ape)
library(geiger)

现在我创建一个这样的alignment：

瞎写的alignment

data<-read.dna("sample.fasta",format="fasta")
> data
8 DNA sequences in binary format stored in a matrix.

All sequences of same length: 23 

Labels:
human1
human2
human3
human4
human5
human6
...

Base composition:
    a     c     g     t 
0.326 0.272 0.147 0.255 
(Total: 184 bases)

> dist.gene(data) ##简单计算遗传距离
       human1 human2 human3 human4 human5 human6 human7
human2      5                                          
human3      7      2                                   
human4      8      3      1                            
human5      9      5      7      8                     
human6     10      6      8      7      1              
human7     10      6      8      9      1      2       
human8      6      6      8      9     10     11     11

> stree = nj(dist.gene(data))
> stree

Phylogenetic tree with 8 tips and 6 internal nodes.

Tip labels:
  human1, human2, human3, human4, human5, human6, ...

Unrooted; includes branch lengths.
> str(stree)  ## 看看具体是怎么编码的
List of 4
 $ edge       : int [1:13, 1:2] 9 13 14 14 13 9 12 12 9 10 ...
 $ edge.length: num [1:13] 4.38 0.2 0 1 0.8 ...
 $ tip.label  : chr [1:8] "human1" "human2" "human3" "human4" ...
 $ Nnode      : int 6
 - attr(*, "class")= chr "phylo"
 - attr(*, "order")= chr "cladewise"

tree1<-root(stree,outgroup=c("human2","human4","human3")) ;plot(tree1,show.node.label =T);

这棵树长这样，符合遗传距离：

例子

进行bootstrap：

myBoots <- boot.phylo(tree, data, function(xx) {nj(dist.gene(xx))}, 
B = 1000,  mc.cores = 6)

> myBoots
[1] 100  73  91  86  99  73
## 得到支持度

phylogeny的bootstrap是对每一个节点都进行bootstrap取样并建树，比如说在9号节点，查看其bootstrap子集建的树符合系统发育关系((human2,human4,human3)(human8,human1,human6,human7,human5))的百分比（不管内部怎么样，先看这个节点）。发现Node1支持率是100（1000次都符合）。而后移到下一个节点，并且只看节点内部的分支支持率是多少。

其实原理都比较简单，计算bootstrap也会有专门的软件。

参考资料：
1）中科大张伟平教授课件
2）https://ecomorph.wordpress.com/2014/10/09/phylogenetic-trees-in-r-4/