Advanced Statistical Computing with R (An Introductory Overview)
Using R for numerical computation and algorithm optimization problems in biostatistics, plus some of the basic ideas and methods involved. The reference book is Advanced Statistical Computing.
Introduction
Principle of Optimization Transfer
- Suppose we want to find the maximum of f, but maximizing it directly is difficult.
- We can construct an "approximating function" of f, also called a local lower-bound (minorizing) function g.
- We sidestep the problem of maximizing f directly and convert it into maximizing g, the lower-bound function of f at the current point; that is, we "transfer" the problem of optimizing f into one of optimizing g.
- The previous two steps (construct g, then maximize it) are iterated until the optimum converges.
This is the optimization "transfer". It is used in the EM algorithm, for example.
The same reasoning applies to minimizing f, not just maximizing it (using a local upper-bound, or majorizing, function instead), as the sketch below illustrates.
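As a toy illustration (my own sketch, not from the book; the function name mm_median is made up): minimizing f(m) = sum(|x_i - m|), whose minimizer is the median, by repeatedly minimizing a quadratic majorizing surrogate that touches f at the current iterate.
## A minimal MM (majorize-minimize) sketch: each iteration replaces
## f(m) = sum(|x - m|) with a quadratic upper bound that touches f at the
## current m, then minimizes that bound in closed form (a weighted mean).
mm_median <- function(x, m = mean(x), tol = 1e-8, maxit = 200) {
  for (i in seq_len(maxit)) {
    w <- 1 / pmax(abs(x - m), 1e-12)  # weights from the majorizer at the current m
    m_new <- sum(w * x) / sum(w)      # closed-form minimizer of the surrogate
    if (abs(m_new - m) < tol) break
    m <- m_new
  }
  m
}
set.seed(1)
x <- rnorm(101)
c(mm_median(x), median(x))  # the two should agree up to the tolerance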
Textbooks vs. Computers
When doing statistical and numerical computation on a computer, the following problems often come up:
- Overflow - When numbers get too big, they cannot be represented on a computer, and so often NAs are produced instead;
- Underflow - Similar to overflow, numbers can get too small for computers to represent, resulting in errors, warnings, or inaccurate computations;
- Near linear dependence - the existence of linear dependence in matrix computations depends on the precision of a machine. Because computers are finite precision, there are commonly situations where one might think there is no linear dependence but the computer cannot tell the difference.
The first two problems both lead to inaccurate numerical results, because a computer's precision is finite. The third comes up constantly in practice: near-collinear column vectors frequently appear in matrices.
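As a quick demonstration (my own toy examples, not from the book) of how the first two pitfalls surface in R:
exp(1000)            # overflow: too large for a double, returns Inf
exp(-1000)           # underflow: too small for a double, returns 0
(0.1 + 0.2) == 0.3   # finite precision: FALSE, the two differ in the last bits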
Let's look at some solutions:
Using Logarithms
Take logarithms, especially when the computation involves probability densities or ratios of the form f(x)/g(x); this effectively avoids both overflow and underflow, since probability values can be extremely small numbers.
When you see expressions like f(x)/g(x), you should think that this means exp(log f(x) - log g(x)). The two are equivalent, but the latter is likely more numerically stable.
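For instance (a toy example of my own), the ratio of two far-tail Normal densities underflows when computed directly, but is fine on the log scale:
dnorm(40) / dnorm(41)                               # NaN: both densities underflow to 0
exp(dnorm(40, log = TRUE) - dnorm(41, log = TRUE))  # stable: about 3.9e17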
Linear Regression
The typical linear regression model, written in matrix form, is represented as follows:

$$y = X\beta + \varepsilon,$$

where y is an n×1 observed response, X is the n×p predictor matrix, β is the p×1 coefficient vector, and ε is the n×1 error vector.
In most textbooks the solution for estimating β, whether it be via maximum likelihood or least squares, is written as

$$\hat{\beta} = (X'X)^{-1}X'y.$$

And indeed, that is the solution. In R, this could be translated literally as
betahat <- solve(t(X) %*% X) %*% t(X) %*% y
The solve() function inverts a matrix or solves a linear system.
In practice, however, one generally does not compute the inverse directly: it is computationally expensive as well as numerically unstable and imprecise:
The primary reason is that computing the direct inverse of X′X is very expensive computationally and is a potentially unstable operation on a computer when there is high collinearity amongst the predictors.
Instead, we convert the computation to:
solve(crossprod(X), crossprod(X, y))
This avoids computing the matrix inverse; solve(A, b) obtains beta_hat directly via Gaussian elimination.
set.seed(2017-07-13)
X <- matrix(rnorm(5000 * 100), 5000, 100)
y <- rnorm(5000)
Let's compare the two approaches:
library(microbenchmark)
microbenchmark(solve(t(X) %*% X) %*% t(X) %*% y)
Unit: milliseconds
                             expr      min       lq     mean   median       uq      max neval
 solve(t(X) %*% X) %*% t(X) %*% y 9.193138 10.17331 12.03675 10.67279 13.17549 49.96505   100
microbenchmark(solve(t(X) %*% X) %*% t(X) %*% y,
               solve(crossprod(X), crossprod(X, y)))
Unit: milliseconds
                                 expr      min        lq      mean    median        uq       max neval
     solve(t(X) %*% X) %*% t(X) %*% y 9.948198 10.496105 14.089429 11.323277 12.327857 52.286925   100
 solve(crossprod(X), crossprod(X, y)) 1.813236  1.920714  2.079828  1.999431  2.159445  2.906995   100
The second approach is clearly much faster.
But it cannot handle collinear columns in the X matrix. For example:
W <- cbind(X, X[, 1] + rnorm(5000, sd = 0.0000000001))  # append an extra column nearly identical to the first; sd adds a tiny perturbation
solve(crossprod(W), crossprod(W, y))
Error in solve.default(crossprod(W), crossprod(W, y)): system is computationally singular: reciprocal condition number = 8.86918e-17
This is because W is no longer full rank: it is a singular matrix. This situation is actually quite common in real applications.
Consider using the QR decomposition to solve a system with collinear columns. Writing X as the product QR, we rewrite the normal equations

$$X'X\beta = X'y$$

as

$$R\beta = Q'y,$$

where Q has orthonormal columns and R is an upper triangular matrix, so the step of forming X'X is avoided. For details, see the derivations below.
Derivation 1 (Gram-Schmidt): the idea of Gram-Schmidt orthogonalization for constructing an orthonormal basis can be viewed as follows: because each basis vector must be perpendicular to all of the preceding ones, generating a new basis vector requires computing the new vector's projection onto each preceding vector and subtracting the sum of those projections. Derivation 2 (the rewritten system): substituting X = QR into the normal equations gives R'Q'QRβ = R'Q'y, i.e. R'Rβ = R'Q'y; cancelling R' (when R is invertible) leaves Rβ = Q'y.
The QR decomposition has the added benefit that we do not have to compute the cross product X′X at all, as this matrix can be numerically unstable if it is not properly centered or scaled.
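As a hedged sketch (my own snippet, using the full-rank X and y from above, and assuming qr() applies no column pivoting to this well-conditioned X), we can solve Rβ = Q'y directly with base R's QR helpers:
Qx <- qr(X)                                # X = QR
betahat_qr <- backsolve(qr.R(Qx),          # upper-triangular solve of R beta = Q'y
                        qr.qty(Qx, y)[1:ncol(X)])
max(abs(betahat_qr - solve(crossprod(X), crossprod(X, y))))  # essentially 0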
Moreover, even though the W matrix is singular, the QR decomposition does not throw an error:
Qw <- qr(W)
str(Qw)
List of 4
$ qr : num [1:5000, 1:101] 70.88664 -0.01277 0.01561 0.00158 0.02451 ...
$ rank : int 100
$ qraux: num [1:101] 1.01 1.03 1.01 1.02 1.02 ...
$ pivot: int [1:101] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "class")= chr "qr"
Note that because of the collinear column, the rank of W is 100 rather than 101.
# compute beta_hat
betahat <- qr.coef(Qw, y)
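A side note (my own check; worth verifying): with a rank-deficient matrix, qr.coef() returns NA for the pivoted-out redundant column rather than failing:
sum(is.na(betahat))  # expected: 1, for the rank-deficient column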
Comparing the three approaches, QR solves the collinearity problem at the cost of some computation speed:
library(ggplot2)
m <- microbenchmark(solve(t(X) %*% X) %*% t(X) %*% y,
                    solve(crossprod(X), crossprod(X, y)),
                    qr.coef(qr(X), y))
autoplot(m)
In practice, we do not use functions like qr() or qr.coef() directly because higher-level functions like lm() do the work for us. However, for certain narrow, highly optimized cases, it may be fruitful to turn to another matrix decomposition to compute linear regression coefficients, particularly if this must be done repeatedly in a loop.
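As a quick sanity check (my own snippet, assuming the X and y defined above), lm() and the explicit QR route agree:
fit <- lm(y ~ X - 1)                     # no intercept, since X has no intercept column
max(abs(coef(fit) - qr.coef(qr(X), y))) # essentially 0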
Multivariate Normal Distribution
The p-dimensional Gaussian density is written as

$$f(x) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right).$$

In the equation above, |Σ| denotes the determinant of the matrix Σ. The p-dimensional multivariate normal distribution, also called the multivariate Gaussian distribution, is specified by a mean vector μ and a covariance matrix Σ.
The most troublesome part to compute is this piece:

$$(x-\mu)'\Sigma^{-1}(x-\mu)$$

Letting z = x − μ simplifies it to:

$$z'\Sigma^{-1}z$$

Written in R code, that is:
t(z) %*% solve(Sigma) %*% z
Once again the matrix-inverse problem appears. Let's try decomposing the covariance matrix as

$$\Sigma = R'R,$$

where R is an upper triangular matrix, which can be viewed as a kind of square root of Σ. This is the Cholesky decomposition (it applies because a covariance matrix is positive definite, assuming full rank). The quadratic form becomes

$$z'\Sigma^{-1}z = z'(R'R)^{-1}z = z'R^{-1}(R')^{-1}z = v'v,$$

where v is obtained by solving the triangular system

$$R'v = z.$$

In this way, the problem of computing the complicated inverse of Σ is transformed into a cheap triangular solve followed by a very simple inner product v'v; that is the benefit of the Cholesky decomposition.
The version below uses the naive computation, inverting the matrix the old way:
set.seed(2017-07-13)
z <- matrix(rnorm(200 * 100), 200, 100)
S <- cov(z)
quad.naive <- function(z, S) {
  Sinv <- solve(S)              # explicitly invert S
  rowSums((z %*% Sinv) * z)     # quadratic form z_i' S^{-1} z_i for each row z_i
}
library(dplyr)
quad.naive(z, S) %>% summary  # a quick look at this function's output
Min. 1st Qu. Median Mean 3rd Qu. Max.
73.57 93.54 100.59 100.34 106.39 129.12
And below is the version that computes it via the Cholesky decomposition:
quad.chol <- function(z, S) {
  R <- chol(S)                               # upper triangular R with S = R'R
  v <- backsolve(R, t(z), transpose = TRUE)  # solve R'v = z for each row of z
  colSums(v * v)                             # v'v gives each quadratic form
}
quad.chol(z, S) %>% summary  # the output matches the naive method
Min. 1st Qu. Median Mean 3rd Qu. Max.
73.57 93.54 100.59 100.34 106.39 129.12
Finally, let's compare running times:
library(microbenchmark)
microbenchmark(quad.naive(z, S), quad.chol(z, S))
Unit: microseconds
             expr     min      lq      mean   median       uq      max neval
 quad.naive(z, S) 589.173 748.979 1092.4526 821.7025 895.2265 19462.21   100
  quad.chol(z, S) 271.405 459.147  626.7297 489.6825 523.8645 10488.25   100
The Cholesky version takes only about 60% of the naive computation's time.