GEO数据挖掘

maSigpro建模代码分析之make.design.matri

2021-01-15  本文已影响0人  小潤澤

前言

之前我们介绍了maSigpro的基本建模方式:分析时间序列的RNA-seq tool,这一次我们从代码的角度来分析其具体的建模原理
我们利用自带的数据集做运算

library(maSigPro) # load maSigPro library
data(data.abiotic)
data(edesign.abiotic)

edesign = edesign.abiotic
data = data.abiotic

design <- make.design.matrix(edesign, degree = 2)

fit <- p.vector(data, design, Q = 0.05, MT.adjust = "BH", min.obs = 20)

tstep <- T.fit(fit, step.method = "backward", alfa = 0.05)

我们将重点从这几段代码入手来解析其建模原理

make.design.matrix

那么第一步是将我们的设计矩阵做转换,首先我们的设计矩阵为这个样子:


至于该函数有一下几个参数:

"make.design.matrix" <-
function (edesign, degree = 2, time.col = 1, 
repl.col = 2, group.cols = c(3:ncol(edesign)))

当满足if (dim(as.matrix(edesign))[2] > 3)
执行:

if (dim(as.matrix(edesign))[2] > 3) {
        dummy.cols <- group.cols[2:length(group.cols)]
        treatm.label <- paste(colnames(edesign)[dummy.cols], 
            "vs", control.label, sep = "")
        groups.label <- c(control.label, treatm.label)
        matrix.dummy <- as.matrix(edesign[, dummy.cols])
        ## Shared origin
    dummy <- NULL
    j = 0
    origen  <- min(edesign[,time.col])
    origen <-edesign[edesign[,1]==origen,]
    for (i in 1:length(dummy.cols)) {
       share <- apply(origen[,c(3,dummy.cols[i])],1,sum)
       if  (!is.element(TRUE, share>1)) {
       j=j+1
          dummy <- cbind(dummy,matrix.dummy[,i])
          colnames(dummy)[j] <- treatm.label[i]
       } 
    }
        time <- as.matrix(edesign[, time.col])
        colnames(time) <- colnames(edesign)[time.col]
        dis <- cbind(dummy, time)
        rownames(dis) <- rownames(edesign)
        groups.vector <- c(colnames(dummy), control.label)
        colnames.dis <- colnames(dis)
        dis <- cbind(dis, dis[,ncol(dis)] * matrix.dummy)
        colnames(dis) <- c(colnames.dis, paste(colnames(edesign)[time.col], 
            "x", colnames(edesign)[dummy.cols], sep = ""))
        groups.vector <- c(groups.vector, treatm.label)
        if (degree >= 2) {
            for (i in 2:degree) {
                colnames.dis <- colnames(dis)
                dis <- cbind(dis, edesign[, time.col]^i, edesign[, 
                  time.col]^i * edesign[, dummy.cols])
                colnames(dis) <- c(colnames.dis, paste(colnames(edesign)[time.col], 
                  i, sep = ""), paste(colnames(edesign)[time.col], 
                  "", i, "x", colnames(edesign)[dummy.cols], 
                  sep = ""))
                groups.vector <- c(groups.vector, groups.label)
            }
        }
    }

满足这个条件意味着你的设计矩阵的列数要大于3,如果满足该条件则:

#group.cols是将control和control后面的列取出来
group.cols = c(3:ncol(edesign))
control.label <- colnames(edesign)[group.cols][1]
#指代设计矩阵第一列是时间
time.col = 1

dummy.cols <- group.cols[2:length(group.cols)]
#设计列名,即不同处理条件和control比较
treatm.label <- paste(colnames(edesign)[dummy.cols], 
                      "vs", control.label, sep = "")
groups.label <- c(control.label, treatm.label)
matrix.dummy <- as.matrix(edesign[, dummy.cols])

##初始化
dummy <- NULL
j = 0
#取出最小的时间序列,该例子中是最小的时间是3
origen  <- min(edesign[,time.col])
origen <-edesign[edesign[,1]==origen,]
origen
接下来就需要对dummy循环赋值了:
for (i in 1:length(dummy.cols)) {
  share <- apply(origen[,c(3,dummy.cols[i])],1,sum)
  if  (!is.element(TRUE, share>1)) {
    j=j+1
    dummy <- cbind(dummy,matrix.dummy[,i])
    colnames(dummy)[j] <- treatm.label[i]
  } 
}

此时设计出来的矩阵为:

dummy
通过该矩阵的形式我们可以看到所有的处理组(Cold,Heat,Salt)都是与Control组相比,比方说ColdvsControl则对应于列名含有Cold的组别,并用 1 来区别

若degree=1

#提取设计矩阵含有时间的那一列
time <- as.matrix(edesign[, time.col])
colnames(time) <- colnames(edesign)[time.col]
#合并dummy和time
dis <- cbind(dummy, time)
rownames(dis) <- rownames(edesign)
groups.vector <- c(colnames(dummy), control.label)
colnames.dis <- colnames(dis)
#dis[,ncol(dis)]实际上是提取time这一列
dis <- cbind(dis, dis[,ncol(dis)] * matrix.dummy)
colnames(dis) <- c(colnames.dis, paste(colnames(edesign)[time.col], 
                                       "x", colnames(edesign)[dummy.cols], sep = ""))
groups.vector <- c(groups.vector, treatm.label)
dis[,ncol(dis)]
matrix.dummy
matrix.dummy是设计矩阵的后三列,代表不同的处理
dis矩阵(degree=1)
dis矩阵有一个特点,比方说Control_3H因为没有任何处理所以ColdvsControl,HeatvsControl和SaltvsControl均为0,但是该样本的时间点为3,所以time那一列为3,因为没有任何处理(Cold,Heat和Salt),所以后面的(Time×Cold,Time×Heat,Time×Salt)均为0;
又例如Cold_9H,只有Cold处理,所以ColdvsControl这一项为1,其他(HeatvsControl和SaltvsControl)为0,但是该样本的时间点为9,所以time那一列为9,因为只有Cold处理,所以后面的Time×Cold为9,其他(Time×Heat,Time×Salt)为0

若degree≥2

接下来就要进行degree的判断了,如果这里我们假设degree=3,这里的degree代表的是多项式回归的最高次数,我们前面简单介绍了dis矩阵的构造,下面

degree = 3
if (degree >= 2) {
  for (i in 2:degree) {
    colnames.dis <- colnames(dis)
    dis <- cbind(dis, edesign[, time.col]^i, 
                      edesign[, time.col]^i * edesign[,dummy.cols])
    colnames(dis) <- c(colnames.dis, paste(colnames(edesign)[time.col], i, sep = ""), 
                       paste(colnames(edesign)[time.col], "", i, "x", colnames(edesign)[dummy.cols], sep = ""))
    groups.vector <- c(groups.vector, groups.label)
  }
}
dis矩阵(degree=3)左半边
dis矩阵(degree=3)右半边

我们可以看到,degree=3的dis矩阵和degree=1的dis矩阵相比多了Time2和Time3,例如Control_3HTime2是取了2次方的,即3^2=9,Time3是去了3次方的3^3=27
Cold_9HTime2是取了2次方的,即9^2=81,Time3是去了3次方的9^3=729,其他的解释与degree=1的dis矩阵一样
这么做的目的是引入与时间相关的高次多项式,更好的拟合模型

下期预告:maSigpro建模代码分析之p.vector

上一篇下一篇

猜你喜欢

热点阅读