使用SCENIC分析singlecell-RNAseq实战演练
SCENIC全称:Single-Cell rEgulatory Network Inference and Clustering
实战之前需要准备的学习资料:
1.阅读关于SCENIC的介绍:SCENIC | 以single-cell RNA-seq数据推断基因调控网络和细胞功能聚类
2.阅读文献:SCENIC: single-cell regulatory network inference and clustering [PMID: 28991892]
以上两部分介绍SCENIC能做什么与SCENIC标准流程
3.下载官方教程:aertslab/SCENIC
下载的文件中重点看两部分文件SCENIC_Running.html与detailedStep_0-4.html
- SCENIC_Running.html里有每一步的说明与标准流程代码
- detailedStep_0-4.html里有标准流程代码中包装好的函数中的详细代码,主要是为了有个性化用户使用时修改代码。入门实战也需要看,主要是为了进一步了解标准流程中使用的函数具体做了什么,明白每一步产生的数据是什么,数据的名称,存储在了那一文件夹中,以及执行一个函数时调用了哪一个数据。
标准流程如下,基于3个R包:GENIE3,RcisTarget,AUCell:
操练起来
1.在R中加载表达矩阵与样品(细胞)信息
rm(list = ls())
options(stringsAsFactors = F)
#Loading data
load(file = 'input_data.Rdata')
#Filter the samples based on the standatd in the article.
table(meta$Mapping.rate>30)
meta_filter1<-meta[meta$Mapping.rate>30,]
meta_filter2<-meta_filter1[meta_filter1$Gene.number>1000,]
meta_filter3<-meta_filter2[meta_filter2$Transcript.number>=10000,]
dim(meta_filter3)
#Choose the keeped the samples in the expression matrix.
keep.samples<-as.character(meta_filter3[,1])
counts_filter<-counts[,keep.samples]
dim(counts_filter)
counts_filter[1:4,1:10]
#Check the expression matrix
fivenum(apply(counts_filter,1,function(x) sum(x>0)))
hist(apply(counts_filter,2,function(x) sum(x>0)),breaks = 100)
#Transform the counts number to CPM in counts_filter
library(edgeR)
counts_filter<-cpm(counts_filter)
载入的表达矩阵counts_filter为基因表达的count值,我们这里只通过使用edgeR包中的CPM()函数去除了文库大小,未做log处理。原因见官网描述(如下):
Expression units: The preferred expression values are gene-summarized counts. There is currently not a strong recommendation towards using the raw counts, or counts normalized through single-cell specific methods (e.g. Seurat). Other measurements, such as transcripts/counts per million (TPM) and FPKM/RPKM, are also accepted as input. However, note that some authors recommend avoiding within sample normalization (i.e. TPM) for co-expression analysis (first step of SCENIC) because they may induce artificial co-variation (Crow et al. (2016)). The choice of input expression matrix might have some effect on the co-expression analysis to create the regulons (step 1). The other steps of the workflow are not directly affected by the input expression values: (2) The expression is not taken into account for the motif analysis, and (3) AUCell, which is used for scoring the regulons on the cells, is cell ranking-based (it works as an implicit normalization). Overall, SCENIC is quite robust to this choice, we have applied SCENIC to datasets using raw (logged) UMI counts, normalized UMI counts, and TPM and they all provided reliable results (see Aibar et al. (2017)).
2. 使用Seurat包创建对象
#Create subject using Seurat R package
suppressMessages(library(Seurat))
Pollen <- CreateSeuratObject(counts = counts_filter,
meta.data =meta_filter3,
min.cells = 300,
min.features = 1500,
project = "Pollen")
Pollen
sce <- Pollen
Seurat包用法与创建对象时参数的设置详见生信技能树B站教程:全网第一的单细胞转录组实战演练
3. 提取sce对象中的表达矩阵与样品(cell)信息并保存
exprMat<-GetAssayData(object = sce,slot = "counts")
class(exprMat)
dim(exprMat)
exprMat[1:4,1:10]
cellInfo <-sce@meta.data
dim(cellInfo)
save(exprMat,cellInfo,file = 'Mat and cell.Rdata')
4.下载RcisTarget的物种特定数据库(这里是第一个坑)
RcisTarget数据库中的数据目前只支持人,鼠,果蝇三个物种。实战用的数据是人膝关节软骨细胞,因此下载了人的feather文件。为什么说这里是个坑呢?官网给的下载方式是这样的:
dbFiles <- c("https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-500bp-upstream-7species.mc9nr.feather",
"https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-tss-centered-10kb-7species.mc9nr.feather")
# dir.create("cisTarget_databases"); setwd("cisTarget_databases") # if needed
for(featherURL in dbFiles)
{
download.file(featherURL, destfile=basename(featherURL)) # saved in current dir
}
在大陆下载这个数据的速度真的很感人(也可能是我家网不行,但是玩LOL刚刚的),容易下一半直接挂掉。直接挂掉还是好的,如果你下载下来了,文件大小也与官网给的一样,但其实里面的数据是不对的。后续分析调用这个文件的时候R会报错。我第一次使用的时候就是这样。
解决的方案是:手动下载(无奈办了一个迅雷会员,1.02G的feather文件分分钟下好)并使用软件MD5 & SHA Checksum Utility检查文件的完整性,与官网给的sha256sum进行比较,sha256sum一致,即可用于后续分析。MD5 & SHA Checksum Utility的用法自行百度。
5. SCENIC的初始设置
library(SCENIC)
org <- "hgnc"
dbDir <- "cisTarget_databases"
myDatasetTitle <- "SCENIC analysis on Human cartilage" # Create a name for your analysis
data(defaultDbNames)
dbs <- defaultDbNames[[org]]
scenicOptions <- initializeScenic(org=org, dbDir=dbDir, dbs=dbs, datasetTitle=myDatasetTitle, nCores=10)
#scenicOptions <- readRDS("int/scenicOptions.Rds")
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
这里SCENIC创建了一个对象scenicOptions,后续分析都是基于这个对象来进行的。
此外,在后续分析中产生的数据会存储在R当前目录下的int文件夹,SCENIC输出的结果会存储在R当前目录下的output文件夹。
6. 将表达矩阵中未在cisTarget database中收录的基因去除
library(RcisTarget)
dbFilePath <- getDatabases(scenicOptions)[[1]]
motifRankings <- importRankings(dbFilePath)
genesInDatabase <- colnames(getRanking(motifRankings))
genesLeft_minCells<-rownames(exprMat)
length(genesLeft_minCells)
genesLeft_minCells_inDatabases <- genesLeft_minCells[which(genesLeft_minCells %in% genesInDatabase)]
length(genesLeft_minCells_inDatabases)
genesKept <- genesLeft_minCells_inDatabases
exprMat_filtered <- exprMat[genesKept, ]
dim(exprMat_filtered)
7. 计算相关性并对数据进行log处理
class(exprMat_filtered)
runCorrelation(as.matrix(exprMat_filtered), scenicOptions)
exprMat_filtered <- log2(exprMat_filtered+1)
8. 运行GENIE3
library(GENIE3)
runGenie3(as.matrix(exprMat_filtered), scenicOptions)
save(exprMat_filtered,scenicOptions,file = "input_GENIE3_data.Rdata")
官网教程中这里没有as.matrix()函数,我在运行的时候报错了,class()以后显示exprMat_filtered是一个matrix,不知道为什么会报错,但是加入as.matrix()函数就可以了。
9.建立GRN并对其打分
为了节约电脑内存,这里重启了R并重新载入数据
load('Mat and cell.Rdata')
logMat <- log2(exprMat+1)
dim(exprMat)
运行SCENIC
library(SCENIC)
scenicOptions <- readRDS("int/scenicOptions.Rds")
scenicOptions@settings$verbose <- TRUE
scenicOptions@settings$nCores <- 1
scenicOptions@settings$seed <- 123
scenicOptions@settings$dbs <- scenicOptions@settings$dbs["10kb"]
runSCENIC_1_coexNetwork2modules(scenicOptions)
runSCENIC_2_createRegulons(scenicOptions, coexMethod=c("top5perTarget"))
runSCENIC_3_scoreCells(scenicOptions, as.matrix(logMat))
#电脑内存只有16G,以上步骤相当耗时,可以睡前运行,醒来结果就出来了=。=!
newThresholds <- savedSelections$thresholds
scenicOptions@fileNames$int["aucell_thresholds",1] <- "int/newThresholds.Rds"
saveRDS(newThresholds, file=getIntName(scenicOptions, "aucell_thresholds"))
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
#网络活性的结果进行二维化
runSCENIC_4_aucell_binarize(scenicOptions)
结果可视化
aucellApp <- plotTsne_AUCellApp(scenicOptions, logMat)
savedSelections <- shiny::runApp(aucellApp)
左上角可以调不同转录因子,观察每个regulon的可视化结果。
计算得到的regulon(转录因子与其预测靶基因)结果存储在output文件夹中:Step2_regulonTargetsInfo
到这里SCENIC的主干流程即已完成,后续对结果的解读与其他可视化见官网教程。