CoRegNet: 共调控网络的重建和整合分析
1. 介绍
CoRegNet包的目的是从转录组数据推断大型转录共调控网络,整合外部数据和基因调控信息去推断和分析转录组项目。这个包中独特的网络推断算法可以学习共调控网络。通过识别不同数据类型的基因共合作调节(co-operative regulators of genes)允许不同类型数据的整合到一个共表达分析中。
#安装
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CoRegNet")
library(CoRegNet)
data(CIT_BLCA_EXP,HumanTF,CIT_BLCA_Subgroup)
dim(CIT_BLCA_EXP)
# [1] 1000 183
# 1000个基因x183个sample的基因表达矩阵
length(intersect(rownames(CIT_BLCA_EXP),HumanTF))
# [1] 67 ##1000个基因中有67个转录因子
head(intersect(rownames(CIT_BLCA_EXP),HumanTF))
# [1] "EEF1A1" "INSM1" "HOXD1" "IFI16" "ID2" "GRHL1"
2. Quick user guide(CoRegNet的主要功能)
1. 从基因表达数据中重建大规模调控网络
grn = hLICORN(CIT_BLCA_EXP, TFlist=HumanTF)
#输入的是基因表达矩阵和2020个转录因子的list(包内置的)
2. 推断转录因子活性
influence = regulatorInfluence(grn,CIT_BLCA_EXP)
3. Retrieve inferred co-coregulators
coregs= coregulators(grn)
4. Analyze the network of cooperative regulators using an interactive display
display(grn,CIT_BLCA_EXP,influence,clinicalData=CIT_BLCA_Subgroup)
3详细功能和用法介绍
3.1 Construction of a large-scale co-regulatory network from gene expression data
CoRegNet包中的推理算法是LICORN 算法的hybrid版本。
网络的重构包含四步:
First, the gene expression data is discretized.
Second, all the potential sets of cooperative regulators are extracted using the apriori algorithm of frequent itemset mining.
Third, the best combinations of co-activators and co-inhibitors are identified for each gene.
Finally, a continuous model of regulation using a linear regression method with interaction terms is used to score the local gene regulatory networks of each gene.
使用CoRegNet做共表达至少需要输入两个dataset:
- 基因表达矩阵/数据框。要求行名为基因名,列名为样本名且列名要唯一。
- TF的基因list。CoRegNet包中内置了人类转录因子list,并且有official gene symbol (
HumanTF
)和EntrezGene IDs (HumanTF_entrezgene
)这两种格式。使用的时候直接用data(HumanTF)
或data(HumanTF_entrezgene)
调用即可。
# An example of how to infer a co-regulation network
grn =hLICORN(CIT_BLCA_EXP, TFlist=HumanTF)
print(grn) #grn是根据输入的表达矩阵和转录因子预测出来的转录因子,靶基因和调控互作
# [1] "63 Transcription Factors. 867 Target Genes. 8072 Regulatory interactions."
# [1] "No added evidences."
head(grn@GRN)
# Target coact corep Coef.Acts Coef.Reps Coef.coActs Coef.coReps R2 RMSE
# 1 DIRAS2 NR1H4 IFI16 0.242753364850399 -0.39999852548476 NA NA 0.24528282 0.9670125
# 2 DIRAS2 NR1H4 MYC 0.225607457464437 -0.273761935521453 NA NA 0.11392664 1.0525775
# 3 DIRAS2 NR1H4 SNAI2 0.158558921637845 -0.349914647715346 NA NA 0.23964161 0.9705609
# 4 DIRAS2 NR1H4 IRX3 0.25463526841265 -0.180678166467546 NA NA 0.12653573 1.0419588
# 5 DIRAS2 NR1H4 SPOCD1 0.29515905804531 -0.212927437477536 NA NA 0.14379615 1.0316241
# 6 DIRAS2 NR1H4 <NA> 0.308379330917705 <NA> NA NA 0.08489369 1.0651439
# 导出所有的Target
Target=unique(grn@GRN$Target) #可以拿出来做GO富集等
63 Transcription Factors x 867 Target Gene x 8072 Regulatory interactions
在上面hLICORN()
这一步是自动对输入的矩阵进行了离散化的,也就是说,实际上用于构建调控网络的input矩阵是离散化后的矩阵。下面是对矩阵进行手动离散化的两种方法
#Default discretization.
#Uses the standard deviation of the whole dataset to set a threshold.
disc1=discretizeExpressionData(CIT_BLCA_EXP)
table(disc1)
# disc1
# -1 0 1
# 23482 131019 28499
boxplot(as.matrix(CIT_BLCA_EXP)~disc1)
#Discretization with a hard threshold
disc2=discretizeExpressionData(CIT_BLCA_EXP, threshold=1)
table(disc2)
# disc2
# -1 0 1
# 45956 93063 43981
boxplot(as.matrix(CIT_BLCA_EXP)~disc2)
# more examples here
help(discretizeExpressionData)
使用矩阵前200个基因以方便运算(可以开4线程,速度更快)
# running only on the 200 first gene in the matrix for fast analysis
# Choosing to divide in 4 threads whenever possible
options("mc.cores"=4)
grn =hLICORN(head(CIT_BLCA_EXP,200), TFlist=HumanTF)
print(grn)
# [1] "18 Transcription Factors. 161 Target Genes. 642 Regulatory interactions."
# [1] "No added evidences."
options("mc.cores"=2)
grn =hLICORN(head(CIT_BLCA_EXP,200), TFlist=HumanTF)
print(grn)
# [1] "18 Transcription Factors. 161 Target Genes. 642 Regulatory interactions."
# [1] "No added evidences."
3.2 Refining the inferred regulatory networks
第二步是使用external knowledge来丰富前面得到的inferred regulatory network。有两种类型的外部数据集可以使用:1. TF对基因的调控数据比如Transcription Factor Binding Sites (TFBS)
或ChIP数据
;2. 共调控信息比如蛋白-蛋白互作
,这类数据可以提供TFs的合作信息。
❗️对于前面两种类型的数据,分别使用addEvidences
和addCooperativeEvidences
函数来添加。
# 引入四类外部数据
# ChIP data from the CHEA database
data(CHEA_sub)
#ChIP data from the ENCODE project
data(ENCODE_sub)
# Protein protein interactions between TF from the HIPPIE database
data(HIPPIE_sub)
# Protein protein interactions between TF from the STRING database
data(STRING_sub)
enrichedGRN = addEvidences(grn,CHEA_sub,ENCODE_sub)
# CHEA_sub was integrated into the network.
# ENCODE_sub was integrated into the network.
enrichedGRN = addCooperativeEvidences(enrichedGRN,HIPPIE_sub,STRING_sub)
# [1] "HIPPIE_sub"
# [1] "No evidence from HIPPIE_sub were found in the inferred network."
# [1] "STRING_sub"
# [1] "STRING_sub was integrated into the network."
得到的enrichedGRN就是引入external knowledge的grn
print(enrichedGRN)
# [1] "63 Transcription Factors. 867 Target Genes. 8072 Regulatory interactions."
# commonGene commonReg evidences commonEvidences enrichment p.value
# CHEA_sub 567 12 1050 218 1.13521053808365 0.0701008769458297
# ENCODE_sub 724 10 2749 382 1.18076947263415 0.0114797404982399
# STRING_sub <NA> 39 137 36 3.16656003302599 2.24846204249256e-06
导入外部数据之后就可以对 共调控网络进行refine,可以使用默认的refine方法。
# Default unsupervised refinement method
refinedGRN = refine(enrichedGRN)
print(refinedGRN)
# [1] "57 Transcription Factors. 867 Target Genes. 1854 Regulatory interactions."
# commonGene commonReg evidences commonEvidences enrichment p.value
# CHEA_sub 567 12 1050 176 3.55976690089225 3.1172872878133e-32
# ENCODE_sub 724 10 2749 296 4.71677404968161 3.01483641042892e-48
# STRING_sub <NA> 21 83 17 Inf 4.69277379589639e-08
可以看到refine之后,Transcription Factors和1854 Regulatory interactions的数目都发生了变化
也可以使用指定的数据集来进行refine
# Example of supervised refinement with the CHEA chip data
refinedGRN = refine(enrichedGRN, integration="supervised",referenceEvidence="CHEA_sub")
print(refinedGRN)
3.3 Identification of active transcriptional programs
CoRegNet包的目的是从给定样本或样本集中识别active cooperative TF的集合。Transcriptional activity的衡量方法是估计给定样本中特定转录调节因子的活化程度。这个衡量方法是比较transcriptional network中活化的和抑制的基因表达。它是基于一个样本中这两个基因sets间的divergence的测量(Welch’s t statistics)。从根本上来说,如果可被某个TF激活的基因出现高表达的同时可被抑制的基因出现低表达,这个TF就具有high influence。
使用一个coregnet object的共表达网络和一个基因表达矩阵,不管这个矩阵是用来推断共表达网络的矩阵还是别的矩阵。output的是一个列为样本(和基因表达矩阵一致),行为TF的矩阵。每一行的TF都在转录网络中拥有足够数量的targets(至少10 activated和10 repressed)。
CITinf =regulatorInfluence(grn,CIT_BLCA_EXP)
CITinf[1:6,1:6]
# CIT100 CIT102 CIT103 CIT105 CIT106 CIT107
# EMX2 -0.6705802 -4.134082 3.433414 8.512250 -2.129228 2.170088
# EGR1 6.2721207 14.563272 -6.805588 -4.347788 11.962142 -3.975751
# TGFB1I1 7.0348603 19.819534 -4.918518 -5.555589 12.702996 -3.141942
# EGR2 4.5810891 14.504279 -8.664420 -7.408075 11.432331 -9.668110
# PLAGL1 5.7439075 18.780449 -10.075245 -7.001515 13.497227 -7.333840
# SOX9 6.1841193 10.627645 -4.749642 -4.485600 12.322927 -1.968706
str(CITinf)
# num [1:33, 1:183] -0.671 6.272 7.035 4.581 5.744 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:33] "EMX2" "EGR1" "TGFB1I1" "EGR2" ...
# ..$ : chr [1:183] "CIT100" "CIT102" "CIT103" "CIT105" ...
这个新的transcriptional influence数据集可以被视为整个转录数据集的condensed版本。
# Coregulators of a hLICORN inferred network
head(coregulators(grn))
# Reg1 Reg2 Support NA nGRN fisherTest adjustedPvalue
# 1 PPARG VGLL1 0.017881213 1247 1247 1.436958e-68 1.638132e-67
# 2 GATA3 PPARG 0.016891795 1178 1178 2.861003e-87 1.087181e-85
# 3 FOXF1 TGFB1I1 0.013536379 944 944 2.838172e-101 3.235516e-99
# 4 EGR1 EGR2 0.012733373 888 888 4.332173e-72 6.173346e-71
# 5 MSX2 PPARG 0.012288853 857 857 2.767895e-77 6.310800e-76
# 6 GATA3 MSX2 0.009922854 692 692 9.068728e-80 2.584587e-78
也可以引入分组数据和copy number数据等
data(CIT_BLCA_CNV)
data(CIT_BLCA_Subgroup)
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf)
# Visualizing additional regulatory or co-regulatory evidences in the network
display(enrichedGRN,expressionData=CIT_BLCA_EXP,TFA=CITinf)
# Visualizing sample classification using a named factor
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf,clinicalData=CIT_BLCA_Subgroup)
# Visualizing copy number alteration of regulators
data(CIT_BLCA_CNV)
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf,clinicalData=CIT_BLCA_Subgroup,alterationData=CIT_BLCA_CNV)
参考:
http://www.bioconductor.org/packages/release/bioc/vignettes/CoRegNet/inst/doc/CoRegNet.html
差异共表达网络