DALS014-Linear Models线性模型03比较(co

2019-08-25 本文已影响0人 backup备份

title: DALS014-Linear Models线性模型03比较(contrast)与交互项
date: 2019-08-19 12:0:00
type: "tags"
tags:

线性回归
线性模型
categories:
生物统计

前言

这一部分是《Data Analysis for the life sciences》的第5章线性模型的第3小节。这一部分的主要内容涉及线性回归的比较与交互项。

文献解读

我们会使用一篇文献中的数据集来学习复杂的线性模型，这个数据集包括了蜘蛛不同腿上的不同摩擦系数，具体地说就是，我们要研究腿部的推拉运动是否会产生不同强度的摩擦力。

文献信息如下所示：

Jonas O. Wolff & Stanislav N. Gorb, Radial arrangement of Janus-like setae permits friction control
in spiders⁷³, Scientific Reports, 22 January 2013.

通过这篇文献的摘要我们大致知道这篇文献的主要研究内容是，研究了一种游猎型蜘蛛Cupiennius Salei（蜘蛛目，蜘蛛科）在其远端腿上有毛状附着垫(爪状毛)，这个附着垫是由定向分枝的刚毛组成的。作者测量了爪状毛在光滑玻璃上的摩擦力，从而揭示了垫内刚毛排列的功能效应。

文献的Figure1如下所示：

image

Figure1一些有关蜘蛛爪状毛(claw tufts)的电镜照片，这些都是专业知识，我们不用管，我们主要的兴趣在Figure4上，如下所示：

image

Figure4比较了拉(pulling)与推(pushing)这两个动作，也就是Figure3中的A图，如下所示：

image

导入数据

作者已经在他的Github上分享的文献的原始实验数据，如下所示：

url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extd\
ata/spider_wolff_gorb_2013.csv"
filename <- "spider_wolff_gorb_2013.csv"
library(downloader)
if (!file.exists(filename)) download(url, filename)
spider <- read.csv(filename, skip=1)

先来大概看一下数据，如下所示：

> table(spider$leg,spider$type)
    
     pull push
  L1   34   34
  L2   15   15
  L3   52   52
  L4   40   40

其中L1~L4是蜘蛛身上不同的腿，现在我们画一个箱线图来看一下这些数据，如下所示：

boxplot(spider$friction ~ spider$type * spider$leg,
col=c("grey90","grey40"), las=2,
main="Comparison of friction coefficients of different leg pairs")

image

从这张图上我们可以直接看出2点信息：

拉(pulling)比推(pushing)产生的摩擦力更大；
后腿（也就是L4）产生的拉力(puling friction)最高。

另外，如果仔细看箱线图的话，还会发现，不同组之间的平均值分布不同，这就是组内方差(within-group variance)。对于线性模型来说，有时候组内方差会成为一个问题，因为我们会假设每组的均值会在总体均值附近波动，也就是说误差 $\varepsilon_{i}$ 的分布是相同的，也就是说，每组的方差是相同的。如果忽视不同的组的不同方差，那么一些方差比较小的组就会显示过于“保守”（因为方差的总体估计会大于这些组的估计），并且具有较大方差的那些之间的比较置信度会过高。如果这些异常分布与摩擦力的范围有关，那么摩擦力值较大的数据造成的影响也大，因此可以采用log或sqrt转换来对数据进行预处理。

如果不进行数据转换（例如log转换或sqrt转换），那么我们也可以使用其他的统计检验，例如Welch t检验或Satterthwaite approximation（这两种都是t检验的扩展，可以在方差不齐的时候使用），或Wilcoxon秩和检验。但是，在这里，我们为了说明一些问题，我们会采用一个线性模型，这个模型假定不同组的方差相同。

单变量线性模型

先看简单的案例，也就是先研究一个变量，即我们现在只研究L1腿，如下所示：

head(spider)
spider.sub <- spider[spider$leg == "L1",]
head(spider.sub)
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
(coefs <- coef(fit))

结果如下所示：

 head(spider)
  leg type friction
1  L1 pull     0.90
2  L1 pull     0.91
3  L1 pull     0.86
4  L1 pull     0.85
5  L1 pull     0.80
6  L1 pull     0.87
> spider.sub <- spider[spider$leg == "L1",]
> head(spider.sub)
  leg type friction
1  L1 pull     0.90
2  L1 pull     0.91
3  L1 pull     0.86
4  L1 pull     0.85
5  L1 pull     0.80
6  L1 pull     0.87
> fit <- lm(friction ~ type, data=spider.sub)
> summary(fit)

Call:
lm(formula = friction ~ type, data = spider.sub)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33147 -0.10735 -0.04941 -0.00147  0.76853 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.92147    0.03827  24.078  < 2e-16
typepush    -0.51412    0.05412  -9.499  5.7e-14
               
(Intercept) ***
typepush    ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2232 on 66 degrees of freedom
Multiple R-squared:  0.5776,    Adjusted R-squared:  0.5711 
F-statistic: 90.23 on 1 and 66 DF,  p-value: 5.698e-14
> (coefs <- coef(fit))
(Intercept)    typepush 
  0.9214706  -0.5141176

从上面的结果可以看出，L1这一组（L1这一组中有push，有pull）的均值为0.9214706，拉(pull)与推(push)摩擦力的差值是-0.5141176 ，现在我们在R中再计算一下，看是否相符，如下所示：

s <- split(spider.sub$friction, spider.sub$type)
mean(s[["pull"]])
mean(s[["push"]]) - mean(s[["pull"]])

结果如下所示：

> s <- split(spider.sub$friction, spider.sub$type)
> mean(s[["pull"]])
[1] 0.9214706
> mean(s[["push"]]) - mean(s[["pull"]])
[1] -0.5141176

上面的是常规方法，我们也可以通过设计矩阵来进行计算，如下所示：

X <- model.matrix(~ type, data=spider.sub)
colnames(X)
head(X)
tail(X)

结果如下所示：

> X <- model.matrix(~ type, data=spider.sub)
> colnames(X)
[1] "(Intercept)" "typepush"   
> head(X)
  (Intercept) typepush
1           1        0
2           1        0
3           1        0
4           1        0
5           1        0
6           1        0
> tail(X)
   (Intercept) typepush
63           1        1
64           1        1
65           1        1
66           1        1
67           1        1
68           1        1

现在我们来绘制一个 $\mathbf{X}$ 矩阵的示意图，其中，黑块表示1，白块表示0，这种方法对于我们理解线性模型比较有用。在这个图形中，y轴表示样本数目（也就是数据的行数），x轴是设计矩阵 $\mathbf{X}$ 的列。我们使用rafalib包中的imagemat()函数即可，如下所示：

library(rafalib)
imagemat(X, main="Model matrix for linear model with one variable")

image

计算估计系数(examining the estimated coefficients)

下图是由线性模型计算而来的估计系数：

set.seed(1) #same jitter in stripchart
stripchart(split(spider.sub$friction, spider.sub$type), 
           vertical=TRUE, pch=1, method="jitter", las=2, xlim=c(0,3), ylim=c(0,2))
a <- -0.25
lgth <- .1
library(RColorBrewer)
cols <- brewer.pal(3,"Dark2")
abline(h=0)
arrows(1+a,0,1+a,coefs[1],lwd=3,col=cols[1],length=lgth)
abline(h=coefs[1],col=cols[1])
arrows(2+a,coefs[1],2+a,coefs[1]+coefs[2],lwd=3,col=cols[2],length=lgth)
abline(h=coefs[1]+coefs[2],col=cols[2])
legend("right",names(coefs),fill=cols,cex=.75,bg="white")

image

其中，绿色箭头表示截矩(Intercept)，其范围是从0到参考组均值的距离(这里的参考组是pull)。橘黄色的箭头表示push组与pull组的差值，在上面图形中，箭头向下，表示负值。小圆圈表示的是每个数据，其中为了避免相同数据的重叠，就在水平上进行了一定程度的波动。

双变量线性模型

再进一步，现在我们研究整个数据集。为了研究不同的腿（L1，L2，L3和L4）的push和pull的差异，我们需要构建一个双变量R公式(R formula)，如下所示：

X <- model.matrix(~ type + leg, data=spider)
colnames(X)
head(X)
imagemat(X, main="Model matrix for linear model with two factors")

结果如下所示：

> X <- model.matrix(~ type + leg, data=spider)
> colnames(X)
[1] "(Intercept)" "typepush"    "legL2"       "legL3"       "legL4"      
> head(X)
  (Intercept) typepush legL2 legL3 legL4
1           1        0     0     0     0
2           1        0     0     0     0
3           1        0     0     0     0
4           1        0     0     0     0
5           1        0     0     0     0
6           1        0     0     0     0

image

上面的黑白图是对公式type+leg的简单表示。

从计算结果，以及黑白示意图上我们可以知道：

第1列是截矩(intercept)，所有值都是1；

第2列：含有1的是push，从示意图上可以看出来，1是黑色，连续的黑色方块一共是4块，因此一共是4组；

第3列：含有1的是L2；

第4列：含有1的是L3；

第5列：含有1的是L4。

这里没有提到L1是因为L1是leg变量的参考水平。同样的，里面也没有提到pull，因为pull是type变量的参考水平。

如果要计算出相应的系数，需要使用lm()函数，公式为~ type + leg，如下所示：

fitTL <- lm(friction ~ type + leg, data=spider)
summary(fitTL)
(coefs <- coef(fitTL))

结果如下所示：

> fitTL <- lm(friction ~ type + leg, data=spider)
> summary(fitTL)

Call:
lm(formula = friction ~ type + leg, data = spider)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46392 -0.13441 -0.00525  0.10547  0.69509 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.05392    0.02816  37.426  < 2e-16 ***
typepush    -0.77901    0.02482 -31.380  < 2e-16 ***
legL2        0.17192    0.04569   3.763 0.000205 ***
legL3        0.16049    0.03251   4.937 1.37e-06 ***
legL4        0.28134    0.03438   8.183 1.01e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2084 on 277 degrees of freedom
Multiple R-squared:  0.7916,    Adjusted R-squared:  0.7886 
F-statistic:   263 on 4 and 277 DF,  p-value: < 2.2e-16
> (coefs <- coef(fitTL))
(Intercept)    typepush       legL2       legL3       legL4 
  1.0539153  -0.7790071   0.1719216   0.1604921   0.2813382

R的计算结果中，在coefficient这一部分中含有最小方差估计值。

数学表达式

我们拟合的线性模型可以使用下面的式子表示：
$Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \varepsilon_i, i=1,\dots,N$
其中， $x$ 表示所有的指示变量(indicator variables)，在这个案例中指提是不同腿的push与pull。例如对于第3条腿的push来说，就是 $x_{i,1}$ 与 $x_{i,3}$ 等于1，剩下的等于0。 $\beta$ 指的是它们表示的效应大小。例如 $\beta_{0}$ 表示截矩，而 $\beta_{1}$ 表示pull效应(pull effect)，而 $\beta_{2}$ 一表示L2效应等。上面计算结果里没有直接表明系数，即 $\beta_{1}$ ，但是标明了估计值，例如 $\hat{\beta_{4}}$ 。

我们还可以使用矩阵来计算最小二乘估计，如下所示：

$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}$
如下所示：

Y <- spider$friction
X <- model.matrix(~ type + leg, data=spider)
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y
t(beta.hat)
coefs

结果如下所示：

> t(beta.hat)
     (Intercept)   typepush     legL2     legL3     legL4
[1,]    1.053915 -0.7790071 0.1719216 0.1604921 0.2813382
> coefs
(Intercept)    typepush       legL2       legL3       legL4 
  1.0539153  -0.7790071   0.1719216   0.1604921   0.2813382

从上面的结果可以看出来，这个与lm()计算出来的结果一样。

计算估计值系数

现在绘制出上述结果的示意图，如下所示：

spider$group <- factor(paste0(spider$leg, spider$type))
stripchart(split(spider$friction, spider$group), 
           vertical=TRUE, pch=1, method="jitter", las=2, xlim=c(0,11), ylim=c(0,2))
cols <- brewer.pal(5,"Dark2")
abline(h=0)
arrows(1+a,0,1+a,coefs[1],lwd=3,col=cols[1],length=lgth)
abline(h=coefs[1],col=cols[1])
arrows(3+a,coefs[1],3+a,coefs[1]+coefs[3],lwd=3,col=cols[3],length=lgth)
arrows(5+a,coefs[1],5+a,coefs[1]+coefs[4],lwd=3,col=cols[4],length=lgth)
arrows(7+a,coefs[1],7+a,coefs[1]+coefs[5],lwd=3,col=cols[5],length=lgth)
arrows(2+a,coefs[1],2+a,coefs[1]+coefs[2],lwd=3,col=cols[2],length=lgth)
segments(3+a,coefs[1]+coefs[3],4+a,coefs[1]+coefs[3],lwd=3,col=cols[3])
arrows(4+a,coefs[1]+coefs[3],4+a,coefs[1]+coefs[3]+coefs[2],lwd=3,col=cols[2],length=lgth)
segments(5+a,coefs[1]+coefs[4],6+a,coefs[1]+coefs[4],lwd=3,col=cols[4])
arrows(6+a,coefs[1]+coefs[4],6+a,coefs[1]+coefs[4]+coefs[2],lwd=3,col=cols[2],length=lgth)
segments(7+a,coefs[1]+coefs[5],8+a,coefs[1]+coefs[5],lwd=3,col=cols[5])
arrows(8+a,coefs[1]+coefs[5],8+a,coefs[1]+coefs[5]+coefs[2],lwd=3,col=cols[2],length=lgth)
legend("right",names(coefs),fill=cols,cex=.75,bg="white")

image

上图是线性模型的估计值示意图。其中茶绿色(teal-green)箭头表示截矩(intercept)，它拟合的是参考组，这里指的是L1的pull。紫色(purple)、粉色(pink)，黄绿色(yello-green)表示提其它组与L1的差值。黄色箭头(orange arrow)表示是push和push的差值。

在这个案例中，每组拟合的均值是由拟合的系数推导出来的，它与我们简单地从8组中计算的均值不太一致。原因是，我们的模型使用了5个系数，而不是8个。我们假设效应是可加的(additive)。但是我们在下文中还会考虑更多的因素来进行拟合，例如考虑这个数据集中的交互因素。

现在看一下简单地计算平均值与线性模型推导出来的平均值，如下所示：

s <- split(spider$friction, spider$group)
mean(s[["L1pull"]])
coefs[1]
mean(s[["L1push"]])
coefs[1] + coefs[2]

计算结果如下所示：

> s <- split(spider$friction, spider$group)
> mean(s[["L1pull"]])
[1] 0.9214706
> coefs[1]
(Intercept) 
   1.053915 
> mean(s[["L1push"]])
[1] 0.4073529
> coefs[1] + coefs[2]
(Intercept) 
  0.2749082

上面的计算过程是比较push与pull的估计值，其中coefs[2]是每组均值差异的加权平均值。此外，权重(weight)是由每组的样本数量决定的。这里计算的过程非常简单，因为每组的样本数目(pull与push)都相等。如果样本数目不等（例如push和pull不等），那么权重的计算就比较复杂，但是，每组的样本数目，总的样本数目以及系数的数目都会计算出相应唯一的权重。这可能通过 $(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ 计算得到，如下所示：

means <- sapply(s, mean)
##the sample size of push or pull groups for each leg pair
ns <- sapply(s, length)[c(1,3,5,7)]
(w <- ns/sum(ns))
sum(w * (means[c(2,4,6,8)] - means[c(1,3,5,7)]))
coefs[2]
sum(w * (means[c(2,4,6,8)] - means[c(1,3,5,7)]))

结果如下所示：

> (w <- ns/sum(ns))
   L1pull    L2pull    L3pull    L4pull 
0.2411348 0.1063830 0.3687943 0.2836879 
> sum(w * (means[c(2,4,6,8)] - means[c(1,3,5,7)]))
[1] -0.7790071
> coefs[2]
  typepush 
-0.7790071 
> sum(w * (means[c(2,4,6,8)] - means[c(1,3,5,7)]))
[1] -0.7790071

比较系数(Contrasting coefficients)

有的时候，我们可能对模型中单个系数表示的比较结果感兴趣，例如push与pull的差值，也就是上面的coefs[2]。但是，有的时候，由单一的系数无法得到想要的比较结果，但是联合使用几个系数可以进行比较，这就是比较系数(contrast)。为了引入contrast的概念，我们先来看一下coefs这个变量，如下所示：

> coefs
(Intercept)    typepush       legL2       legL3       legL4 
  1.0539153  -0.7790071   0.1719216   0.1604921   0.2813382

这个结果包含截矩的估计值，push与pull的估计效应，L2与L1效应的估计值，L3与L1效应估计值，L4与L1效应的估计值。假设我们想要比较两组数的效应大小，并且其中一组不是L1怎么办？这个解决过程就叫比较(contrast)。

比较(contrast)是对估计效应(estimated coefficient)的联合过计算，即 $\mathbf{c^\top} \hat{\boldsymbol{\beta}}$ ，其中 $\mathbf{c}$ 是一个列向量，其行数与线性模型系数的数目相同。如果 $\mathbf{c}$ 中含有1个或多个0，那就表示相应的 $\hat{\beta}$ 中的估计系数就不参与相应的比较系数(contrast)计算。

如果我们想比较L2和L3，它就相当于在线性模型中比较这两个系数，因为它们都是与L1进行比较的，如下所示：

$(\mbox{L3} - \mbox{L1}) - (\mbox{L2} - \mbox{L1}) = \mbox{L3} - \mbox{L2 }$
对这两种进行比较的一个简单途径就是使用constrast包中的contrast()函数。我们只需要指定要比较的两组名称即可，现在我们来比较一下pull或push类型，如下所示：

library(contrast) #Available from CRAN
L3vsL2 <- contrast(fitTL,list(leg="L3",type="pull"),list(leg="L2",type="pull"))
L3vsL2

结果如下所示：

> L3vsL2
lm model parameter contrast

    Contrast       S.E.      Lower      Upper     t  df Pr(>|t|)
 -0.01142949 0.04319685 -0.0964653 0.07360632 -0.26 277   0.7915

第1列是Contrast，它表示的是L3与L2的估计值。

我们可以证明，系数的线性组合的最小二乘估计是这些估计的相同线性组合。因此，效应大小估计值(effect size estimate)就是两者之间的差值。使用contrast()函数进行比较的比较向量，被储备成为一个称为X变量中，如下所示：

coefs[4] - coefs[3]
(cT <- L3vsL2$X)
cT %*% coefs

计算结果如下所示：

> coefs[4] - coefs[3]
      legL3 
-0.01142949 
> (cT <- L3vsL2$X)
  (Intercept) typepush legL2 legL3 legL4
1           0        0    -1     1     0
attr(,"assign")
[1] 0 1 2 2 2
attr(,"contrasts")
attr(,"contrasts")$type
[1] "contr.treatment"

attr(,"contrasts")$leg
[1] "contr.treatment"

> cT %*% coefs
         [,1]
1 -0.01142949

计算结果里面还有t检验与标准误，我们以前提到过，t检验是一个估计值，它的除数就标准误。比较估计值的标准误是由比较向量 $\mathbf{c}$ 乘以估计的协方差矩阵 $\hat{\Sigma}$ 的任意一边得到的，我们对var $\hat{\beta}$ 的估计值如下所示：
$\sqrt{\mathbf{c^\top} \hat{\boldsymbol{\Sigma}} \mathbf{c}}$

系数的协方差为：
$\boldsymbol{\Sigma} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$
我们使用样本估计值 $\hat{\sigma}^2$ 来估计 ${\sigma}^2$ ，如下所示：

Sigma.hat <- sum(fitTL$residuals^2)/(nrow(X) - ncol(X)) * solve(t(X) %*% X)
signif(Sigma.hat, 2)
sqrt(cT %*% Sigma.hat %*% t(cT))
L3vsL2$SE

计算结果如下所示：

> Sigma.hat <- sum(fitTL$residuals^2)/(nrow(X) - ncol(X)) * solve(t(X) %*% X)
> signif(Sigma.hat, 2)
            (Intercept) typepush    legL2    legL3    legL4
(Intercept)     0.00079 -3.1e-04 -0.00064 -0.00064 -0.00064
typepush       -0.00031  6.2e-04  0.00000  0.00000  0.00000
legL2          -0.00064 -6.4e-20  0.00210  0.00064  0.00064
legL3          -0.00064 -6.4e-20  0.00064  0.00110  0.00064
legL4          -0.00064 -1.2e-19  0.00064  0.00064  0.00120
> sqrt(cT %*% Sigma.hat %*% t(cT))
           1
1 0.04319685
> L3vsL2$SE
[1] 0.04319685

如果我们选择参数type="push"，那么我们就会获得与L3和L2比较的一样的结果。其原因就是，在差值的两侧都添加了typepush效应（这一句不太理解）我们可以取消它，如下所示：

L3vsL2.equiv <- contrast(fitTL,list(leg="L3",type="push"),list(leg="L2",type="push"))
L3vsL2.equiv$X

运行结果如下所示：

> L3vsL2.equiv <- contrast(fitTL,list(leg="L3",type="push"),list(leg="L2",type="push"))
> L3vsL2.equiv$X
  (Intercept) typepush legL2 legL3 legL4
1           0        0    -1     1     0
attr(,"assign")
[1] 0 1 2 2 2
attr(,"contrasts")
attr(,"contrasts")$type
[1] "contr.treatment"

attr(,"contrasts")$leg
[1] "contr.treatment"

交互项(`interaction terms`)

在前面的线性模型案例中，我们假设pull与push的效应对于所有的腿(leg)都是相同的（也就是说橘黄色的箭头是一样的），但是从我们最终计算的结果来看，并不能很好的捕获数据的趋势，换句话讲，就是简单与每组的平均值并不完全一致。例如L1组，push与pull估计系数太大，而L3组，push与pull又太小。

此时，我们引入交互项(interaction terms)，这个引入的系数可以解决不同组中push与pull差值过大的问题。由于我们在模型中已经有了push与pull项，因此我们只需要再添加三个项就能够找到leg-piar特异性push和pull的差异。在下面我们会向模型的设计矩阵中添加交互项(interaction terms)，其方法是将代表现有项的设计矩阵列相乘。

我们可以在type和leg之间构建一个交互项，即type:leg，其中的冒号:表示两个变量存在交互作用，另外一种相同的做法就是~type*leg，这两种写法都是主要研究type与leg之间的交互作用，而不研究其他变量之间的交互作用，如下所示：

url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extd\
ata/spider_wolff_gorb_2013.csv"
filename <- "spider_wolff_gorb_2013.csv"
library(downloader)
if (!file.exists(filename)) download(url, filename)
spider <- read.csv(filename, skip=1)
X <- model.matrix(~ type + leg + type:leg, data=spider)
colnames(X)
X <- model.matrix(~type*leg, data=spider)
colnames(X)
head(X)
library(rafalib)
imagemat(X, main="Model matrix for linear model with interactions")

计算结果如下所示：

> X <- model.matrix(~ type + leg + type:leg, data=spider)
> colnames(X)
[1] "(Intercept)"    "typepush"       "legL2"          "legL3"         
[5] "legL4"          "typepush:legL2" "typepush:legL3" "typepush:legL4"
> X <- model.matrix(~type*leg, data=spider)
> colnames(X)
[1] "(Intercept)"    "typepush"       "legL2"          "legL3"         
[5] "legL4"          "typepush:legL2" "typepush:legL3" "typepush:legL4"
> head(X)
  (Intercept) typepush legL2 legL3 legL4 typepush:legL2 typepush:legL3
1           1        0     0     0     0              0              0
2           1        0     0     0     0              0              0
3           1        0     0     0     0              0              0
4           1        0     0     0     0              0              0
5           1        0     0     0     0              0              0
6           1        0     0     0     0              0              0
  typepush:legL4
1              0
2              0
3              0
4              0
5              0
6              0

image

其中，第6到8列表示的是：typepush:legL2，typepush:legL3与typepush:legL4，也就是说，它们是由第2列的typepush和第3到5列的legL2，legL3和legL4产生的。我们看最后1列，即typepush:legL4列，这是另外的一个系数，即 $\beta_{push,L4}$ ，此系数表示push与L4共同作用的样本。这就可能对L4-push这一组样本的均值的差异进行解释，例如当我们通过估计的截距、估计的typepush系数，估计的L4系数LegL4来对数据进行预测时产生偏差（例如数据点不在估计的直线上），当我们考虑了交互作用后，其结果如下所示：

fitX <- lm(friction ~ type + leg + type:leg, data=spider)
summary(fitX)
coefs <- coef(fitX)
coefs

结果如下所示：

> fitX <- lm(friction ~ type + leg + type:leg, data=spider)
> summary(fitX)

Call:
lm(formula = friction ~ type + leg + type:leg, data = spider)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46385 -0.10735 -0.01111  0.07848  0.76853 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.92147    0.03266  28.215  < 2e-16 ***
typepush       -0.51412    0.04619 -11.131  < 2e-16 ***
legL2           0.22386    0.05903   3.792 0.000184 ***
legL3           0.35238    0.04200   8.390 2.62e-15 ***
legL4           0.47928    0.04442  10.789  < 2e-16 ***
typepush:legL2 -0.10388    0.08348  -1.244 0.214409    
typepush:legL3 -0.38377    0.05940  -6.461 4.73e-10 ***
typepush:legL4 -0.39588    0.06282  -6.302 1.17e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1904 on 274 degrees of freedom
Multiple R-squared:  0.8279,    Adjusted R-squared:  0.8235 
F-statistic: 188.3 on 7 and 274 DF,  p-value: < 2.2e-16
> coefs
   (Intercept)       typepush          legL2          legL3          legL4 
     0.9214706     -0.5141176      0.2238627      0.3523756      0.4792794 
typepush:legL2 typepush:legL3 typepush:legL4 
    -0.1038824     -0.3837670     -0.3958824

计算估计系数

现在我们先画一下上面的计算结果，如下所示：

library(RColorBrewer)
spider$group <- factor(paste0(spider$leg, spider$type))
stripchart(split(spider$friction, spider$group), 
           vertical=TRUE, pch=1, method="jitter", las=2, xlim=c(0,11), ylim=c(0,2))
cols <- brewer.pal(8,"Dark2")
abline(h=0)
arrows(1+a,0,1+a,coefs[1],lwd=3,col=cols[1],length=lgth)
abline(h=coefs[1],col=cols[1])
arrows(2+a,coefs[1],2+a,coefs[1]+coefs[2],lwd=3,col=cols[2],length=lgth)
arrows(3+a,coefs[1],3+a,coefs[1]+coefs[3],lwd=3,col=cols[3],length=lgth)
arrows(5+a,coefs[1],5+a,coefs[1]+coefs[4],lwd=3,col=cols[4],length=lgth)
arrows(7+a,coefs[1],7+a,coefs[1]+coefs[5],lwd=3,col=cols[5],length=lgth)
#now the interactions:
segments(3+a,coefs[1]+coefs[3],4+a,coefs[1]+coefs[3],lwd=3,col=cols[3])
arrows(4+a,coefs[1]+coefs[3],4+a,coefs[1]+coefs[3]+coefs[2],lwd=3,col=cols[2],length=lgth)
arrows(4+a,coefs[1]+coefs[2]+coefs[3],4+a,coefs[1]+coefs[2]+coefs[3]+coefs[6],lwd=3,col=cols[6],length=lgth)
segments(5+a,coefs[1]+coefs[4],6+a,coefs[1]+coefs[4],lwd=3,col=cols[4])
arrows(6+a,coefs[1]+coefs[4],6+a,coefs[1]+coefs[4]+coefs[2],lwd=3,col=cols[2],length=lgth)
arrows(6+a,coefs[1]+coefs[4]+coefs[2],6+a,coefs[1]+coefs[4]+coefs[2]+coefs[7],lwd=3,col=cols[7],length=lgth)
segments(7+a,coefs[1]+coefs[5],8+a,coefs[1]+coefs[5],lwd=3,col=cols[5])
arrows(8+a,coefs[1]+coefs[5],8+a,coefs[1]+coefs[5]+coefs[2],lwd=3,col=cols[2],length=lgth)
arrows(8+a,coefs[1]+coefs[5]+coefs[2],8+a,coefs[1]+coefs[5]+coefs[2]+coefs[8],lwd=3,col=cols[8],length=lgth)
legend("right",names(coefs),fill=cols,cex=.75,bg="white")

image

估计的交互作用系数就能反映出在对push和pull进行比较时，leg-pair-specific导致的差值不同。上图的橘黄色简单表示的是与L1相比，push和pull的差值估计。如果估计的交互系数过大，那么push与pull差值的均值就与参考组相比（也就是L1组中push和pull差值的均值），非常不同。

比较(contrast)

在一些简单案例中，我们会使用contrast包来进行比较，混合计算估计系数。例如我们知道L2样本的push与pull效应。我们可以从上面箭图中看出来（也就是黄色箭头加上橘黄色简单），我们也可以在contrast()函数中指定要比较哪两组，如下所示：

library(contrast) ##Available from CRAN
L2push.vs.pull <- contrast(fitX,
list(leg="L2", type = "push"),
list(leg="L2", type = "pull"))
L2push.vs.pull
coefs[2] + coefs[6] ##we know this is also orange + yellow arrow

计算结果如下所示：

> L2push.vs.pull
lm model parameter contrast

 Contrast      S.E.      Lower      Upper     t  df Pr(>|t|)
   -0.618 0.0695372 -0.7548951 -0.4811049 -8.89 274        0
> coefs[2] + coefs[6] ##we know this is also orange + yellow arrow
typepush 
  -0.618

差值的差异(Differences of differences)

push与pull在L2中的差异和push与pull在L1中的差异是不同的，为什么会不同的，我们可以使用模型中的typepush:legL2估计系数，也就是上图中的黄色箭头。这个系数的p值是否等于0，可以从上面的summary(fitX)函数来查看。同样的，我们也可以看出来L3与L1中差值(push和pull的差值)的差异和L4与L1的差值。假设我们想知道push与pull在L3中的差异与L2中的差值是否相同。在上面的图形中，我们可以看到，除了L1之外的其他腿的push和pull效应就是typepush箭头加上该组的交互作用项。

如果我们想比较两个非参考组，那么数学公式如下所示：
$(\mbox{typepush} + \mbox{typepush:legL3}) - (\mbox{typepush} + \mbox{typepush:legL2})$
可以简写为：
$= \mbox{typepush:legL3} - \mbox{typepush:legL2}$
在这里我们无法使用contrast()函数直接比较，但是我们可以使用glht()函数进行比较（这个函数的全称是general liner hypothesis test)，这个函数在multcomp包中。为了使用glht()这个函数，我们需要构建一个矩阵，这个矩阵只有一行，其中-1表示typepush:legL2系数，其中+1表示typepush:legL3系数。我们将这个矩阵加入到linfct()函数中（这个函数是linear function的缩写），当成linfct()的参数，计算后，我们会获得一个有关估计交互系数比较的表，如下所示：

library(multcomp) ##Available from CRAN
C <- matrix(c(0,0,0,0,0,-1,1,0), 1)
C
L3vsL2interaction <- glht(fitX, linfct=C)
summary(L3vsL2interaction)

计算结果如下所示：

> C
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    0    0    0    0   -1    1    0
> L3vsL2interaction <- glht(fitX, linfct=C)
> summary(L3vsL2interaction)

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = friction ~ type + leg + type:leg, data = spider)

Linear Hypotheses:
       Estimate Std. Error t value Pr(>|t|)    
1 == 0 -0.27988    0.07893  -3.546  0.00046 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

DALS014-Linear Models线性模型03比较(co

前言

文献解读

导入数据

单变量线性模型

计算估计系数(examining the estimated coefficients)

双变量线性模型

数学表达式

计算估计值系数

比较系数(Contrasting coefficients)

交互项(`interaction terms`)

计算估计系数

比较(contrast)

差值的差异(Differences of differences)

猜你喜欢

热点阅读