利用TCseq包进行基因表达趋势分析

2020-12-20 本文已影响0人凯凯何_Boy

TCseq包提供了一个统一的套件去处理不同时序类型的数据分析，可以应用于转录组或者像ATAC-seq，Chip-seq的表观基因组时序型数据分析。该包主要的集中于不同时间点的差异分析，时间趋势分析及可视化作图。

该包发布在Bioconductor上，可去网站下载详细文档：http://bioconductor.org/packages/release/bioc/html/TCseq.html

数据准备

该包提供了内置数据集，刚开始用可以参考着文档使用内置数据跑一遍走个流程，这里我拿自己的数据做个演示，需要准备三个文件：

基因表达矩阵文件(可以直接拿原始数据)

image-20201220170608092

分组文件(样品，时间点，分组三列)

image-20201220171002198

基因的位置信息(染色体，起始及终止位点，基因id四列)

image-20201220171115149

OK，我们先输入数据

## 安装
BiocManager::install("TCseq")
library(TCseq)

# 位置信息文件
genomicIntervals <- read.delim('intervial.txt',stringsAsFactors = FALSE)
genomicIntervals$chr <- as.factor(genomicIntervals$chr)


## 表达量矩阵文件
count <- read.delim('count.txt',header = T,row.names = 1)
# as.integer 转换为整数型
countsTable <- count %>%  mutate(across(where(is.numeric), as.integer)) 
# or
#count1 <- apply(count, 2, as.integer)
rownames(countsTable) <- rownames(count)
##矩阵形式
countsTable <- as.matrix(countsTable)


# 分组文件
experiment <- read.delim('group.txt',header = T,stringsAsFactors = FALSE)
experiment$sampleid <- as.factor(experiment$sampleid)

提示：

位置信息文件可以直接根据gtf文件转换生成，这个应该不难。
另外，相应的数据类型要提前转换好，否则会报错，比如表达量文件需要先转换为整数型，再变为矩阵的形式，根据内置数据的格式调整就行了。

创建TCA对象

TCseq使用S4类TCA存储所有输入数据以进行后续分析。根据以上三个文件创造即可。

tca <- TCA(design = experiment, genomicFeature = genomicIntervals, counts = countsTable )
 tca
  ##############
  An object of class "TCA"
@design
  sampleid timepoint group
1       s1        0h     1
2       s2        0h     1
3       s3        0h     1
4       s4       24h     2
5       s5       24h     2
6       s6       24h     2
7       s7       48h     3
8       s8       48h     3
9       s9       48h     3

@counts
                    s1  s2  s3  s4  s5  s6  s7  s8  s9
ENSBTAG00000000005  60  58  70  64  60  74  53  55  50
ENSBTAG00000000008   0   0   0   1   0   1   2   0   0
ENSBTAG00000000009   0   1   3   0   1   0  11   9   9
ENSBTAG00000000010 341 300 341 170 216 214 150 236 199
ENSBTAG00000000011   4   2   3   5   3   1   6   2   3
29749 more rows ...

@genomicFeature
   chr     start       end                 id
1 chr1 100098899 100161382 ENSBTAG00000035710
2 chr1 100754351 100754458 ENSBTAG00000043293
3 chr1 101522743 101601659 ENSBTAG00000011139
4 chr1 101619573 101646742 ENSBTAG00000053582
5 chr1 101776678 101776956          MSTRG.445
29749 more rows ...

@clusterRes
An object of class "clust"

另一种创建的方式,是通过SummarizedExperiment对象：

suppressWarnings(library(SummarizedExperiment))
se <- SummarizedExperiment(assays=list(counts = countsTable), colData = experiment)
tca <- TCAFromSummarizedExperiment(se = se, genomicFeature = genomicIntervals)

现在已经基于这三个文件得到了TCA对象

差异分析

该包内置了我们常用的差异分析R包--edgeR，通过使用广义线性模型(GLM)方法去鉴定差异基因。

tca <- DBanalysis(tca）

## 我们也可以增添一些参数进行过滤
tca <- DBanalysis(tca, filter.type = "raw", filter.value = 10, samplePassfilter = 2)

#比如上述步骤保留具有两个或多个已读取样本的基因表达量超过10的基因组区域

提取差异分析结果

通过DBresult函数可以直接提取我们差异分析的结果，在通过$去查看各个组的内容。

DBres <- DBresult(tca, group1 = "0h", group2 = c("24h","48h"))
str(DBres, strict.width =  "cut")
head(DBres$`24hvs0h`)
############
GRanges object with 6 ranges and 4 metadata columns:
                     seqnames              ranges strand |     logFC      PValue         paj
                        <Rle>           <IRanges>  <Rle> | <numeric>   <numeric>   <numeric>
  ENSBTAG00000000005     chr1 100098899-100161382      * |  0.274010 1.35162e-01 2.81082e-01
  ENSBTAG00000000009     chr1 101522743-101601659      * | -1.501457 2.26674e-01 4.03419e-01
  ENSBTAG00000000010     chr1 101776678-101776956      * | -0.513593 3.36326e-06 3.80124e-05
  ENSBTAG00000000011     chr1   10231295-10543983      * |  0.198446 7.93039e-01 8.90540e-01
  ENSBTAG00000000012     chr1 102768932-102769966      * | -0.300941 5.04886e-03 2.18330e-02
  ENSBTAG00000000013     chr1 105069395-105069910      * |  0.140489 1.63446e-01 3.21469e-01
                                     id
                            <character>
  ENSBTAG00000000005 ENSBTAG00000035710
  ENSBTAG00000000009 ENSBTAG00000011139
  ENSBTAG00000000010          MSTRG.445
  ENSBTAG00000000011 ENSBTAG00000017753
  ENSBTAG00000000012 ENSBTAG00000048551
  ENSBTAG00000000013          MSTRG.449
  -------
  seqinfo: 206 sequences from an unspecified genome; no seqlengths

若只想提取满足显著差异的结果(一般abs(log2-fold > 2)且adjust p-value<0.05)，只需要加上top.sig = TRUE函数即可。

## 显著差异
DBres.sig <- DBresult(tca, group1 = "0h", group2 = c("24h","48h"), top.sig = TRUE)
str(DBres.sig, strict.width =  "cut")
#############
Formal class 'CompressedGRangesList' [package "GenomicRanges"] with 5 slots
  ..@ unlistData     :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. ..@ values         : Factor w/ 206 levels "chr1","chr2",..: 1 10 13 14 17 18 20 2..
  .. .. .. .. ..@ lengths        : int [1:201] 3 4 4 3 2 5 3 1 2 1 ...
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
  .. .. .. .. ..@ start          : int [1:1229] 145972839 46061452 69196880 100837833 21979442 ..
  .. .. .. .. ..@ width          : int [1:1229] 4721 5497 115248 442874 5105 471 5943 145553 10..
  .. .. .. .. ..@ NAMES          : chr [1:1229] "ENSBTAG00000000425" "ENSBTAG00000000706" "ENS"..
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL

时间趋势分析

接下来就到了趋势分析的换件了，为了检测数据的时序模式，TCseq包使用了非监督聚类的方法,首先通过聚类创建一个时程表，行为基因组区域信息，列为时间点，表中的数值可以选择标准化后的表达量值或者是基于初始时间点所有样品点的LogFC值，代码中通过value参数来调整，另外若设置filter = TRUE时候，将自动过滤在任何两个时间段都没有显著差异的基因组区域，结果通过tcTable函数进行查看

## 通过LogFC值
tca <- timecourseTable(tca, value = "FC", norm.method = "rpkm", filter = TRUE)

##通过标准化后的表达量
tca <- timecourseTable(tca, value = "expression", norm.method = "rpkm", filter = TRUE)

## 查看生成结果
t <- tcTable(tca)
head(t)
                             0h          24h          48h
ENSBTAG00000000425 0.0000000000  0.030165698 0.0075187827
ENSBTAG00000000706 2.6013480888 11.521986790 1.7768887654
ENSBTAG00000000936 0.0007283438  0.003506178 0.0003475789
ENSBTAG00000001325 0.0002585125  0.001290036 0.0001705818
ENSBTAG00000001687 0.3065966903  2.485377556 2.6233245852
ENSBTAG00000001745 0.2496824763  2.120932031 4.7101417152

聚类分析

两种聚类算法可供选择: hard clustering（包含hierachical,pam,kmeans）及 soft clustering (fuzzy cmeans)，算法的选择大家自己参考文献选择，这里就拿文档中cmeans算法进行分析聚类。

tca <- timeclust(tca, algo = "cm", k = 6, standardize = TRUE)

###################################################
p <- timeclustplot(tca, value = "z-score(RPKM)", cols = 3)

多个聚类

单个聚类作图：

## 聚类一作图
print(p[[1]])

单个聚类

另外，若是想查看某个聚类有哪些基因参与，使用tca@clusterRes查看聚类结果即可，比如：

a <- as.data.frame(tca@clusterRes@cluster)
names(a) <- 'Cluster'
## 查看各个聚类中的基因数目
> table(a)
a
  1   2   3   4   5   6 
511 147 165 292 180 255 
##筛选 
Cluster1 <- subset(a,Cluster == 2)

OK，根据聚类的结果，挑选自己感兴趣的趋势接着下游GO\KEGG富集分析就行了。

该包的内容到这里基本结束了，有兴趣可以去查看源文档，比如一些函数中还提供了一些设置颜色等的参数，大家就自己看吧~~