生物节律之振荡节律检测（4）

2021-01-13 本文已影响0人 leoxiaobei

前面我们主要介绍的都是做cosinor分析的，下面介绍几个只是用来检测振荡是否显著的R包，毕竟多掌握点使用工具总是好的

1.Rain::rain

rain()使用非参数方法检测时间序列中的节奏。它在Jonckheere-Terpstra检验的基础上，使用了伞形约束的秩检验(Mack & Wolfe, 1981)，主要用来测试集合是否有趋势。

rain(x, #数值矩阵或数据框或数组，行为时间点，列为样本，允许重复
      deltat, #时间点间隔
      period, #周期
      period.delta = 0,# 允许的周期偏差
      peak.border = c(0.3, 0.7),#定义峰型，一般默认
      nr.series = 1, #每一时间点重复次数
      measure.sequence = NULL, #单独指定每一时间点重复次数，允许某一时间点没有测量，会覆盖前面nr.series参数
      method = "independent",#检测方法，"independent"代表多个周期被解释为一个周期的重复 ；"logitudinal"代表进行的是纵向数据（重复测量，不规则采样，不相等的时间间隔。）
      na.rm = FALSE, 
      adjp.method = "ABH", 
      verbose = getOption("verbose"))

# testing a biological dataset
data(menetRNASeqMouseLiver)
head(menetRNASeqMouseLiver)
#                 ZT_2_1 ZT_2_2 ZT_6_1 ZT_6_2 ZT_10_1 ZT_10_2 ZT_14_1 ZT_14_2 ZT_18_1 ZT_18_2 ZT_22_1 ZT_22_2
# 2810459M11Rik  30.29  29.61  26.33  24.07   22.40   16.60   22.69   23.27   26.78   27.49   29.49   30.01
# Abcb11        120.42 114.40  87.88  73.12   54.15   51.40   56.71   53.78   69.88   79.76   93.61   96.20
# Acot7           4.16   2.89   5.76   5.43    5.97    6.64    7.71    6.68    5.95    5.57    4.15    4.79
# Ahctf1          5.81   5.08   4.80   4.63    5.88    6.41    8.71    7.99    9.85   10.40    9.03    7.47
# AK017500        0.08   0.02   0.26   0.10    0.43    0.40    0.70    0.56    0.36    0.26    0.14    0.12
# AK030918        5.37   5.53   4.74   4.70    3.93    3.83    4.76    3.89    5.11    4.84    6.22    6.00
menet.ossc <- rain(t( menetRNASeqMouseLiver ), 
                   deltat = 4, #时间点间隔
                   period = 24,#周期
                   nr.series = 2, #每一时间点重复次数
                   peak.border = c(0.3, 0.7), 
                   verbose=TRUE)
head(menet.ossc)
#                 pVal        phase   peak.shape period
# 2810459M11Rik 2.727935e-05     4          8     24
# Abcb11        2.727935e-05     4          8     24
# Acot7         4.900195e-05    16         12     24
# Ahctf1        7.000279e-06    20         12     24
# AK017500      7.318473e-06    16         12     24
# AK030918      5.122931e-05    24         12     24

结果为一个数据框，行为基因，列分别为p值，相位（也不知道咋计算的，感觉不太可信），峰型（不关注），周期（前面设置的，不允许偏差，只能为24）
分析：只有p值可信点，相位感觉完全对不上，由head(menetRNASeqMouseLiver)可知，第一个基因2810459M11Rik肉眼判断其峰值大概在ZT0，第二个基因Abcb11肉眼判断其峰值大概在ZT2，可是head(menet.ossc)出来的分别为4和4，另外这个函数根本没有输入时间起点的参数，我想该列结果应该不太可信，建议只用来作为判断基因是否振荡的工具

2.Genecycle包：周期性表达基因的鉴定

library(GeneCycle)
data(caulobacter)# load data set
dim(caulobacter)# how many samples and how many genes?
#[1]   11 1444
caulobacter[1:11,1:6]
#       ORF06244   ORF03152   ORF03156   ORF03161  ORF00509  ORF02752
# 0-1   0.3665097 0.18985689 0.21925755 0.25281226 0.3466452 0.1362747
# 15-1  0.9726708 0.13047059 0.12333525 0.18136116 0.1808251 0.1430609
# 30-1  1.9293524 0.09408427 0.07837877 0.14845936 0.1614140 0.2361530
# 45-1  1.1993566 0.07229088 0.07498218 0.05748992 0.1614140 0.1362747
# 60-1  1.3771675 0.07229088 0.07498218 0.09974181 0.2144124 0.2529961
# 75-1  1.2179183 0.41492178 0.38232078 0.42025649 0.9933402 0.8474221
# 90-1  0.6636373 0.92885060 0.92366714 0.96674403 2.2961388 1.8085694
# 105-1 0.4513885 1.45325478 1.26560223 1.57037616 1.5763530 1.6990011
# 120-1 0.4991248 1.17140222 1.34218542 1.42520902 1.6520712 1.1989009
# 135-1 0.9820789 0.98617347 0.84588528 1.14233043 1.1189054 0.6845092
# 150-1 1.3924968 0.45861105 0.50393990 0.66502774 0.6161798 0.3957206

is.longitudinal(caulobacter)#是否为纵向数据
#如果不是，可使用as.longitudinal(x, repeats=c(n1,n2,n3,...,nn), time=c(x1,x2,x3,...,xn)转为纵向数据
# [1] TRUE
summary(caulobacter)# how many samples and how many genes?
# Longitudinal data:
#   1444 variables measured at 11 different time points
# Total number of measurements per variable: 11 
# Repeated measurements: none 
# 
# To obtain the measurement design call 'get.time.repeats()'.
get.time.repeats(caulobacter)
# $time
# [1]   0  15  30  45  60  75  90 105 120 135 150
# 
# $repeats
# [1] 1 1 1 1 1 1 1 1 1 1 1
plot(caulobacter, 1:9)# plot first nine time series

caulobacter数据集前九个基因表达趋势

#fisher.g.test Fisher’s Exact g Test for Multiple (Genetic) Time Series
#robust.g.test Robust g Test for Multiple (Genetic) Time Series

fish.g.test根据Fisher的精确g检验计算一个或多个时间序列的p-value(s)。该测试有助于检测数据集中未知频率的隐藏周期性。对于微阵列数据的应用，请参见Wichert, Fokianos和strimer（2004）。
robust.g.test计算Fisher's g-test（1929）的稳健非参数版本的p-value(s)。Ahdesmaki et al.（2005）详细描述了该方法，并对其在基因表达数据中的应用进行了广泛的讨论。从GeneCycle 1.1.0后还实现了Ahdesmaki et al.（2007）发表的基于稳健回归的方法（使用Tukey的基于双权重的m估计/回归）。
robust.spectrum计算基于稳健排名的周期图/相关图估计--详见Ahdesmaki et al.（2005）。另外，它也可以用于（自GeneCycle 1.1.0起）评估基于稳健回归的谱估计，适合处理非均匀采样数据（未知周期时间：返回谱估计，已知周期时间：返回p值）

#Usage
robust.spectrum(x, #行为时间点，列为基因的纵向数据
                algorithm = c("rank", "regression"), #算法
                t, #时间点向量，稳健回归算法需要
                periodicity.time = FALSE, #检索周期
                noOfPermutations = 300#置换反应次数
)
#示例1
robust.spectrum(caulobacter[,1:5], algorithm="rank")#采用Rank的算法，默认，返回谱估计矩阵
#           [,1]       [,2]       [,3]       [,4]       [,5]
# [1,] 0.8219205 3.43864638 3.21070515 3.35647383 2.65845748
# [2,] 0.6126124 1.52994232 1.31254973 1.57164532 1.27421407
# [3,] 1.9418157 3.18519461 3.00543980 3.31246925 3.33098836
# [4,] 4.2746610 4.35345989 4.47535225 4.30956755 4.20913835
# [5,] 1.3919670 0.01326347 0.47161824 0.11946108 0.02375192
# [6,] 0.8585238 0.88099060 0.77319178 0.86369303 0.60560641
# [7,] 0.9260288 0.19028452 0.05860190 0.26087934 0.21157125
# [8,] 0.8760912 0.59964174 0.84292195 0.56272235 0.64991590
# [9,] 0.7016780 0.02591820 0.42051324 0.01696016 0.26782486
# [10,] 0.2688421 0.23501394 0.49387822 0.14745982 0.23738684
# [11,] 0.1275503 0.36601601 0.05567804 0.35091542 0.33663486
#示例2
t=c(0,15,30,45,60,75,90,105,120,135,150)
robust.spectrum(x=caulobacter[,1:5], algorithm="regression", t=t)#采用稳健回归的算法，需要指定采样时间点，Rank算法则不需要提供
#             [,1]        [,2]         [,3]         [,4]        [,5]
# [1,] 3.967426933 0.999444157 0.9854746530 1.4290419558 2.541543680
# [2,] 0.149561620 0.402296097 0.3862254965 0.5114509386 0.596198404
# [3,] 0.058341539 0.047247760 0.0372429408 0.0433257775 0.025238915
# [4,] 0.169121561 0.001275981 0.0005069736 0.0005174022 0.003603036
# [5,] 0.081070375 0.001897917 0.0006175675 0.0009398086 0.007840526
# [6,] 0.005499027 0.002399405 0.0011976775 0.0003165645 0.001072624
#关于算法
#rank对应于基于rank的方法（Ahdesmaki et al. 2005），
#回归对应于基于回归的方法（Ahdesmaki et al. 2007），这种方法更适合于非均匀采样的时间序列（默认为rank算法）
#示例三
robust.spectrum(x=caulobacter[,1:5], algorithm="regression", t=t, periodicity.time = 150)#如果采用稳健回归的算法并且提供了检索的周期，则直接返回检索基因的p值
# [1] "Please note, returning p-values!"
#             [,1]
# [1,] 0.100774458
# [2,] 0.009113156
# [3,] 0.015642539
# [4,] 0.011468583
# [5,] 0.056042639

#Usage
robust.g.test(y, #谱估计矩阵
              index, #不懂该参数
              algorithm=c("rank", "regression"), #算法类型，默认为Rank
              perm = FALSE, #是否进行排列检验，稳健回归算法需要开启
              x, #原始纵向数据，稳健回归算法进行排列检验时需要
              t, #原始时间点向量，稳健回归算法进行排列检验时需要
              noOfPermutations = 300 #排列检验次数
)
#示例1
y=robust.spectrum(caulobacter[,1:5], algorithm="rank")
robust.g.test(y)
#[1] 0.14477124 0.08436690 0.09921162 0.10366398 0.09484573
#示例2
t=c(0,15,30,45,60,75,90,105,120,135,150)
robust.spectrum(x=caulobacter[,1:5], algorithm="regression", t=t)
robust.g.test(y = y, algorithm = "regression", perm=TRUE, x=caulobacter[,1:5], t=t)
#[1] 0.89379383 0.09560878 0.06551361 0.08384894 0.07460466

小结：无论是基于Rank的算法，还是基于稳健回归的算法，理论上来说都可以，但实际上现在还是用稳健回归的算法比较多，因为其使用范围更广泛（适合非均匀采样的时间序列），但是它也有缺点，就是需要进行排列检验，这一步可能计算时间会比较长，同时，基于回归的方法有可能出现回归不收敛的问题。恩，不管怎样，还是要以解决实际问题为主，哪种更符合预期用那个吧

3.DODR::dodr 检测两个时间序列集合之间节奏行为的差异

#Usage
dodr(val1, #行为时间点，列为基因名的矩阵1
     val2, #行为时间点，列为基因名的矩阵2
     times1, #时间点序列1
     times2 = times1, #时间点序列2，默认等于1
     norm = TRUE, #是否在分析之前对时间序列进行归一化(按平均值分割)
     period = 24,#周期
     method = "robust", #方法，默认为“robust”，可选有6种
     verbose = options("verbose")[[1]])
#示例
n=50
testTimes1 <- 0:15*3
testTimes2 <- testTimes1
tp <- length(testTimes1)
per1 <- 24
amp1 <- 0.3
ph1 <- 5
sd1 <- 0.1

per2 <- per1
amp2 <- amp1
ph2 <- ph1+4
sd2 <- sd1

#creating artificial oscillation sets
v1 <- 1 + amp1 * cos((testTimes1 - ph1)/per1*2*pi)
noise1 <- rnorm(length(testTimes1)*n, 0, sd1)
val1 <- matrix(v1 + noise1, ncol=n)

v2 <- 1 + amp2 * cos((testTimes2 - ph2)/per2*2*pi)
noise2 <- rnorm(length(testTimes2)*n, 0, sd2)
val2 <- matrix(v2 + noise2, ncol=n)

# run DODR
dodr <- dodr(val1, val2, testTimes1, testTimes2, 24, method = 'all')
dodr$p.value.table[1:3,]
#         HANOVA HarmNoisePred1 HarmNoisePred2 HarmScaleTest   robustDODR robustHarmScaleTest   meta.p.val
# 1 6.892586e-07    0.006121659   0.0048785736    0.94517113 4.825488e-06           0.2933788 4.135544e-06
# 2 6.253141e-04    0.445233962   0.0027451638    0.08228724 9.181591e-04           0.8151413 3.746025e-03
# 3 1.945558e-07    0.012435537   0.0004861898    0.33701079 1.029927e-06           0.9640713 1.167334e-06

不出意外的话，大家一般还是看最后一列meta.p.val吧，毕竟是综合的算法，可信度更高一点，其他好像也，没什么好说的，操作反正挺简单的

4.limoRhyde::limorhyde 将周期变量转换成线性模型中可用的分量

library(annotate)
library(data.table)
library(foreach)
library(GEOquery)
library(ggplot2)
library(knitr)
library(limma)
library(org.Mm.eg.db)
source(system.file('extdata', 'vignette_functions.R', package = 'limorhyde'))

period = 24
qvalRhyCutoff = 0.15
qvalDrCutoff = 0.1

eset = getGEO(filename = system.file('extdata', 'GSE34018_series_matrix.txt', package = 'limorhyde'))

sm = data.table(pData(phenoData(eset)))
sm = sm[, .(title, sample = geo_accession, genotype = `genotype/variation:ch1`)]
sm[, time := as.numeric(sub('_.*', '', sub('.*_ZT', '', title)))]
sm[, cond := factor(genotype, levels = c('wild-type', 'Reverb alpha/beta DKO'),labels = c('wild-type', 'knockout'))]
setorderv(sm, c('cond', 'time'))#通过引用对data_table进行快速行排序
kable(sm[1:5, ])

sm = cbind(sm, limorhyde(sm$time, 'time_'))
#根据傅里叶级数的一次谐波或周期平滑样条将周期时间变量分解为多个分量。
mapping = getGeneMapping(featureData(eset))
emat = log2(calcExprsByGene(eset, mapping))[, sm$sample]

emat[1:6,1:6]
#       GSM840516 GSM840517 GSM840518 GSM840519 GSM840520 GSM840521
# 11287  7.985859  7.930935  7.674688  7.899531  7.768563  7.708260
# 11298  7.719384  7.737210  7.888502  7.865563  7.772749  7.875787
# 11302 12.594006 12.557148 11.754031 11.851858 11.656702 11.712203
# 11303  7.868566  7.823121  7.907136  7.906957  7.934719  7.874863
# 11304  7.772076  7.930239  7.774563  7.768792  7.740474  7.811992
# 11305  8.889897  8.910168  8.817124  8.837143  8.694444  8.477298
sm[,c(2,4:7)]
#       sample time      cond time_cos      time_sin
# 1: GSM840516    0 wild-type      1.0  0.000000e+00
# 2: GSM840517    0 wild-type      1.0  0.000000e+00
# 3: GSM840518    4 wild-type      0.5  8.660254e-01
# 4: GSM840519    4 wild-type      0.5  8.660254e-01
# 5: GSM840520    8 wild-type     -0.5  8.660254e-01
# 6: GSM840521    8 wild-type     -0.5  8.660254e-01
# 7: GSM840522   12 wild-type     -1.0  1.224606e-16
# 8: GSM840523   12 wild-type     -1.0  1.224606e-16
# 9: GSM840524   16 wild-type     -0.5 -8.660254e-01
# 10: GSM840525   16 wild-type     -0.5 -8.660254e-01
# 11: GSM840526   20 wild-type      0.5 -8.660254e-01
# 12: GSM840527   20 wild-type      0.5 -8.660254e-01
# 13: GSM840504    0  knockout      1.0  0.000000e+00
# 14: GSM840505    0  knockout      1.0  0.000000e+00
# 15: GSM840506    4  knockout      0.5  8.660254e-01
# 16: GSM840507    4  knockout      0.5  8.660254e-01
# 17: GSM840508    8  knockout     -0.5  8.660254e-01
# 18: GSM840509    8  knockout     -0.5  8.660254e-01
# 19: GSM840510   12  knockout     -1.0  1.224606e-16
# 20: GSM840511   12  knockout     -1.0  1.224606e-16
# 21: GSM840512   16  knockout     -0.5 -8.660254e-01
# 22: GSM840513   16  knockout     -0.5 -8.660254e-01
# 23: GSM840514   20  knockout      0.5 -8.660254e-01
# 24: GSM840515   20  knockout      0.5 -8.660254e-01

接下来有三种选择：
（1）找振荡显著的基因design = model.matrix(~ time_cos + time_sin,data)
（2）找振荡模式在组间显著差异的design = model.matrix(~ cond * (time_cos + time_sin),data)
（3）找差异表达的基因（校正了时间点差异）design = model.matrix(~ cond + time_cos + time_sin, data)
示例如下（来自于官网vignettes）：

#####Identify rhythmic genes#####
rhyLimma = foreach(condNow = unique(sm$cond), .combine = rbind) %do% { #.combine参数指定以行形式合并结果
  design = model.matrix(~ time_cos + time_sin, data = sm[cond == condNow])
  fit = lmFit(emat[, sm$cond == condNow], design)
  fit = eBayes(fit, trend = TRUE)
  rhyNow = data.table(topTable(fit, coef = 2:3, number = Inf), keep.rownames = TRUE)
#coef      2              3
#      time_cos      time_sin
  setnames(rhyNow, 'rn', 'geneId')
  rhyNow[, cond := condNow]
}
design
rhyLimmaSummary = rhyLimma[, .(P.Value = min(P.Value)), by = geneId]#对于rhyLimma的结果，采取同一基因取最小p值的策略
rhyLimmaSummary[, adj.P.Val := p.adjust(P.Value, method = 'BH')]#校正p值
setorderv(rhyLimmaSummary, 'adj.P.Val')#排序

kable(rhyLimmaSummary[1:5, ])
#   |geneId | P.Value| adj.P.Val|
#   |:------|-------:|---------:|
#   |13170  |       0|     0e+00|
#   |353187 |       0|     0e+00|
#   |15496  |       0|     0e+00|
#   |12700  |       0|     1e-07|
#   |321018 |       0|     1e-07|

#####Identify differentially rhythmic genes#####
design = model.matrix(~ cond * (time_cos + time_sin), data = sm)
design
fit = lmFit(emat, design)
fit = eBayes(fit, trend = TRUE)
drLimma = data.table(topTable(fit, coef = 5:6, number = Inf), 
keep.rownames = TRUE)
#coef                5                      6
#          condknockout:time_cos    condknockout:time_sin
setnames(drLimma, 'rn', 'geneId')

drLimma = drLimma[geneId %in% rhyLimmaSummary[adj.P.Val <= qvalRhyCutoff]$geneId]
drLimma[, adj.P.Val := p.adjust(P.Value, method = 'BH')]
setorderv(drLimma, 'adj.P.Val')

kable(drLimma[1:5, ])
#   |geneId | condknockout.time_cos| condknockout.time_sin|   AveExpr|        F| P.Value| adj.P.Val|
#   |:------|---------------------:|---------------------:|---------:|--------:|-------:|---------:|
#   |12686  |            -1.5326997|            -0.0433879| 10.157657| 63.70706|       0|   2.0e-07|
#   |353187 |             0.6825313|            -0.3539853|  9.422013| 63.33300|       0|   2.0e-07|
#   |270166 |            -1.0453062|             0.2676608| 12.112605| 57.38229|       0|   4.0e-07|
#   |207818 |            -0.6412353|            -0.1518757| 10.482197| 57.06019|       0|   4.0e-07|
#   |15486  |            -0.7315828|             0.1282440| 12.503645| 44.84486|       0|   3.5e-06|

#####Identify differentially expressed genes#####
design = model.matrix(~ cond + time_cos + time_sin, data = sm)
design

fit = lmFit(emat, design)
fit = eBayes(fit, trend = TRUE)
deLimma = data.table(topTable(fit, coef = 2, number = Inf), keep.rownames = TRUE)
#coef              2
#          condknockout
setnames(deLimma, 'rn', 'geneId')

deLimma = deLimma[!(geneId %in% drLimma[adj.P.Val <= qvalDrCutoff]$geneId)]
deLimma[, adj.P.Val := p.adjust(P.Value, method = 'BH')]
setorderv(deLimma, 'adj.P.Val')

kable(deLimma[1:5, ])
#   |geneId |     logFC|   AveExpr|         t| P.Value| adj.P.Val|        B|
#   |:------|---------:|---------:|---------:|-------:|---------:|--------:|
#   |11302  | -2.271205| 11.260315| -23.07448|       0|         0| 33.11873|
#   |68680  | -1.653067|  9.096749| -18.20824|       0|         0| 27.88355|
#   |67442  | -1.208461| 14.375311| -16.67143|       0|         0| 25.88517|
#   |71904  | -1.457833| 10.350755| -16.00380|       0|         0| 24.95519|
#   |15507  | -1.158876|  9.441377| -15.27866|       0|         0| 23.89920|

总结：

Rain包和Genecycle只能用来检测基因是否振荡，
DODR包可以用来检测基因振荡模式是否存在差异
limoRhyle包配合limma包不仅可以用来检测基因是否振荡，也可以用来检测基因振荡模式是否存在差异，还可以检测差异表达的基因（在校正时间点之后），算是挺全面的包了
但是，这类包都已经脱离了cosinor分析的范畴，不能提供相位，振幅，基线等水平，有点可惜

ps：人工总结，如有不到位和理解错误的地方还请大家多多批评指正

生物节律之振荡节律检测（4）

1.Rain::rain

2.Genecycle包：周期性表达基因的鉴定

3.DODR::dodr 检测两个时间序列集合之间节奏行为的差异

4.limoRhyde::limorhyde 将周期变量转换成线性模型中可用的分量

总结：

猜你喜欢

热点阅读