GENIE3 || 基因调控网络推断
前情提要:
AUCell:在单细胞转录组中识别细胞对“基因集”的响应
RcisTarget || 从基因列表到调控网络
我们知道单细胞数据分析中,有两个明显的分支:
- 细胞
- 基因
细胞我们可以分析细胞之间的细胞分群、拟时间关系、细胞通讯,基因可以看基因的表达量、基因的调控网络,基因功能预测。对于细胞这一支主要解决的生物学问题是识别出细胞的(新的或旧的)身份,基因这一块可分析这一身份下的特征。为了分析细胞我们又Seurat,monocle,Velocity等‘;分析基因有GO,KEGG,GSVA,RcisTarget ,WGCNA等。今天我们顺着SCENIC的思路来看看其内核的基因调控网络工具:GENIE3 。
SCENIC同样请出我们的老朋友:
library(Seurat)
library(GENIE3)
library(SeuratData)
library(tidyverse)
pbmc3k.final -> pbmc
pbmc
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 2000 variable features)
2 dimensional reductions calculated: pca, umap
> levels(Idents(pbmc))
[1] "Naive CD4 T" "Memory CD4 T" "CD14+ Mono" "B" "CD8 T" "FCGR3A+ Mono" "NK"
[8] "DC" "Platelet"
GENIE3的核心函数接受一个行为基因列为样本的表达谱矩阵,可以是未经过任何均一化的矩阵,也可以接受处理后的矩阵,二者会有所区别。我们不打算就整个表达谱来做基因调控的分析,主要是因为计算耗时。我们先把数据处理一下,获取一个需要分析的矩阵。
mk <-FindAllMarkers( pbmc)
(mk %>% filter(cluster == 'Naive CD4 T') %>% top_n(100,avg_logFC))$gene -> NaiveCD4T
NaiveCD4T <- NaiveCD4T[-c(grep("^RP",NaiveCD4T))]
cd4nt <- subset(pbmc,idents = "Naive CD4 T")
exprMatr <- as.matrix(cd4nt@assays$RNA@data[NaiveCD4T,])
exprMatr[1:4,1:4]
AAACGCTGTAGCCA AAACTTGATCCAGA AAAGAGACGAGATA AAAGAGACGGACTT
LDHB 0.000000 3.088221 2.867757 2.270898
MALAT1 6.069288 5.988516 5.743655 7.045627
EEF1A1 4.374893 4.699368 4.326630 4.255680
CCR7 0.000000 2.607332 0.000000 0.000000
dim(exprMatr)
[1] 141 697
这是一个含有Naive CD4 T
特征基因的矩阵。运行GENIE3很简单,在不知道调控基因的时候,我们可以这样
set.seed(1314) # For reproducibility of results
?GENIE3
weightMat <- GENIE3(exprMatr,nCores = 4) # with the default parameters
这样会把每个基因都当做调控基因,如果已经知道调控基因,可以放在一个向量里,如:
# Genes that are used as candidate regulators
regulators <- c(2, 4, 7)
# Or alternatively:
regulators <- c("Gene2", "Gene4", "Gene7")
weightMat <- GENIE3(exprMatr, regulators=regulators)
GENIE3是基于回归树的。这些树可以学习使用任意随机森林方法Breiman L.(2001)或者是额外树方法 Geurts P., Ernst D.和Wehenkel L.(2006)。
weightMat[1:5,1:5]
AAK1 ACAP1 AES ALOX5 APBA2
AAK1 0.0000000000 0.005258715 0.0054358216 0.001129751 0.003247545
ACAP1 0.0063033230 0.000000000 0.0086014879 0.010451221 0.009816754
AES 0.0101961203 0.010506656 0.0000000000 0.009392012 0.006051315
ALOX5 0.0003247848 0.001475766 0.0009462572 0.000000000 0.006683968
APBA2 0.0025998370 0.003291497 0.0021388552 0.010121726 0.000000000
dim(weightMat)
[1] 59 59
这是一个加权矩阵,矩阵是两两基因之间的调控加权值,据预测我们可以用igraph来构建网络。
library(igraph)
net1<-graph_from_adjacency_matrix(weightMat1)
weightMat1[which(weightMat1<0.04)] =0
net1<- graph_from_incidence_matrix(weightMat1)
多个布局的可视化调节网络:
layouts <- grep("^layout_", ls("package:igraph"), value=TRUE)[-1]
# Remove layouts that do not apply to our graph.
layouts <- layouts[!grepl("bipartite|merge|norm|sugiyama|tree", layouts)]
layouts<- c("layout_as_star","layout_in_circle" , "layout_nicely",
"layout_on_grid" , "layout_on_sphere" , "layout_randomly" , "layout_with_dh",
"layout_with_drl" , "layout_with_fr" , "layout_with_gem" , "layout_with_graphopt",
"layout_with_kk" , "layout_with_lgl" , "layout_with_mds")
layouts
length(layouts)
par(mfrow=c(3,5), mar=c(1,1,1,1))
for (layout in layouts) {
print(layout)
l <- do.call(layout, list(net1))
plot(net1, edge.arrow.mode=0, layout=l, main=layout) }
在我们的过滤中,许多基因与其他基因没有任何的链接,可能是没有什么调控关系,我们想看哪些基因是有调控关系的。
V(net1)[igraph::degree(net1)>1]
+ 22/118 vertices, named, from 7a4fa09:
[1] BTG1 EEF1A1 EEF1B2 JUNB MALAT1 NPM1 TMEM66 TPT1 BTG1 CCR7 EEF1A1 EEF1B2 GLTSCR2 JUNB
[15] LCK LDHB LEF1 MAL MALAT1 SOCS3 TMEM66 TPT1
plot(induced_subgraph(net1, V(net1)[igraph::degree(net1)>1] ))
GENIE3 考虑到很多童鞋操作网络可能会有困难,就安排了一个函数根据给定的阈值删选
linkList <- getLinkList(weightMat, reportMax=5)
linkList <- getLinkList(weightMat, threshold=0.03)
head(linkList)
regulatoryGene targetGene weight
1 MALAT1 EEF1A1 0.14649515
2 EEF1A1 MALAT1 0.08383228
3 BTG1 JUNB 0.07233470
4 JUNB BTG1 0.06985202
5 MALAT1 TPT1 0.06641574
6 EEF1A1 EEF1B2 0.06631800
这个格式当然也是可以用igraph来构建网络的。
net<- graph_from_data_frame(linkList)
plot(net, edge.arrow.size=.2, edge.curved=0,
vertex.color="orange", vertex.frame.color="#555555",
vertex.label.color="black",
vertex.label.cex=.7)
我们把表达量信息画在网络图上:
avexp <- AverageExpression(cd4nt,features =names(V(net)),slot = 'data')
V(net)$size <- abs(scale(avexp$RNA$`Naive CD4 T`))
E(net)$width <- E(net)$weight*10
plot(net, edge.arrow.size=.2, edge.curved=0.5,
vertex.color="orange", vertex.frame.color="#555555",
vertex.label.color="black",
vertex.label.cex=.7)
The weights of the links returned by GENIE3() do not have any statistical meaning and only provide a way to rank the regulatory links. There is therefore no standard threshold value, and caution must be taken when choosing one.
https://bioconductor.org/packages/release/bioc/html/GENIE3.html
https://scenic.aertslab.org/examples/SCENIC_MouseBrain/
https://github.com/aertslab/SCENICprotocol