AUCell:计算单细胞转录组的每个细胞中特定基因集的活性程度
AUC允许在单细胞rna数据集中识别具有活性基因集(如gene signature,gene module)的细胞。AUCell使用曲线下面积来计算输入基因集的一个关键子集是否在每个细胞的表达基因中富集。AUCell和GSVA的算法是一回事,都是排序。AUCell借鉴了ssGSVA的算法,但是在排序的时候,它们的有些处理不太一样。在SCENIC 转录因子预测的分析过程中,已经封装了AUCell。
AUCell的工作流程分为三步,分别通过三个函数完成:
1. AUCell_buildRankings
:Build the rank
2. AUCell_calcAUC
:Calculate the Area Under the Curve (AUC)
3. AUCell_exploreThresholds
:Set the assignment thresholds
1. 数据准备
- 1.1 R包下载
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
# To support paralell execution:
BiocManager::install(c("doMC", "doRNG","doSNOW"))
# For the main example:
BiocManager::install(c("mixtools", "GEOquery", "SummarizedExperiment"))
# For the examples in the follow-up section of the tutorial:
BiocManager::install(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh",
"dynamicTreeCut","R2HTML","Rtsne", "zoo"))
browseVignettes("AUCell")
- 1.2 数据准备:AUCell所需要的输入数据分为两部分,
矩阵数据
和基因集数据
。
#设置工作目录
dir.create("AUCell_tutorial")
setwd("AUCell_tutorial") # or in the first code chunk (kntr options), if running as Notebook
下载演示数据
library(GEOquery)
attemptsLeft <- 20
while(attemptsLeft>0)
{
geoFile <- tryCatch(getGEOSuppFiles("GSE60361", makeDirectory=FALSE), error=identity)
if(methods::is(geoFile,"error"))
{
attemptsLeft <- attemptsLeft-1
Sys.sleep(5)
}
else
attemptsLeft <- 0
}
gzFile <- grep(".txt.gz", basename(rownames(geoFile)), fixed=TRUE, value=TRUE)
message(gzFile)
#GSE60361_C1-3005-Expression.txt.gz
txtFile <- gsub(".gz", "", gzFile, fixed=TRUE)
message(txtFile)
#GSE60361_C1-3005-Expression.txt
gunzip(filename=gzFile, destname=txtFile, remove=TRUE)
library(data.table)
geoData <- fread(txtFile, sep="\t")
geneNames <- unname(unlist(geoData[,1, with=FALSE]))
exprMatrix <- as.matrix(geoData[,-1, with=FALSE])
rm(geoData)
dim(exprMatrix)
# [1] 19972 3005
rownames(exprMatrix) <- geneNames
exprMatrix[1:5,1:4]
# 1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06
# Tspan12 0 0 0 3
# Tshz1 3 1 0 2
# Fnbp1l 3 1 6 4
# Adamts15 0 0 0 0
# Cldn12 1 1 1 0
# Remove file
file.remove(txtFile)
# [1] TRUE
# Save for future use
mouseBrainExprMatrix <- exprMatrix
save(mouseBrainExprMatrix, file="exprMatrix_AUCellVignette_MouseBrain.RData")
# 仅使用数据集中的5000个基因做演示
# load("exprMatrix_AUCellVignette_MouseBrain.RData")
set.seed(333)
exprMatrix <- mouseBrainExprMatrix[sample(rownames(mouseBrainExprMatrix), 5000),]
dim(exprMatrix)
# [1] 5000 3005
准备输入基因集
library(AUCell)
library(GSEABase)
gmtFile <- paste(file.path(system.file('examples', package='AUCell')), "geneSignatures.gmt", sep="/")
geneSets <- getGmt(gmtFile)
# check how many of these genes are in the expression matrix:
geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix))
cbind(nGenes(geneSets)) #演示基因集中有6个基因集 后面的数字是基因集中对应的基因数目
## [,1]
## Astrocyte_Cahoy 538
## Neuron_Cahoy 384
## Oligodendrocyte_Cahoy 477
## Astrocyte_Lein 7
## Neuron_Lein 13
## Microglia_lavin 165
# 为了方便演示,将基因集中的基因数目添加到基因集名字中
geneSets <- setGeneSetNames(geneSets, newNames=paste(names(geneSets), " (", nGenes(geneSets) ,"g)", sep=""))
# 添加一些随机抽样出的基因组成的的基因集和在多数细胞中表达的基因(如house-keeping like gene)。(仅供演示,实操不必这样做)
# Random
set.seed(321)
extraGeneSets <- c(
GeneSet(sample(rownames(exprMatrix), 50), setName="Random (50g)"),
GeneSet(sample(rownames(exprMatrix), 500), setName="Random (500g)"))
countsPerGene <- apply(exprMatrix, 1, function(x) sum(x>0))
# Housekeeping-like
extraGeneSets <- c(extraGeneSets,
GeneSet(sample(names(countsPerGene)[which(countsPerGene>quantile(countsPerGene, probs=.95))], 100), setName="HK-like (100g)"))
geneSets <- GeneSetCollection(c(geneSets,extraGeneSets))
names(geneSets)
## [1] "Astrocyte_Cahoy (538g)" "Neuron_Cahoy (384g)"
## [3] "Oligodendrocyte_Cahoy (477g)" "Astrocyte_Lein (7g)"
## [5] "Neuron_Lein (13g)" "Microglia_lavin (165g)"
## [7] "Random (50g)" "Random (500g)"
## [9] "HK-like (100g)"
2. Build gene-expression rankings for each cell (AUCell_buildRankings
函数)
输入的是表达矩阵
cells_rankings <- AUCell_buildRankings(exprMatrix, nCores=1, plotStats=TRUE)
# Quantiles for the number of genes detected by cell:
# (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
# min 1% 5% 10% 50% 100%
# 191 270 369 445 929 2062
查看排列的结果
cells_rankings
# Ranking for 5000 genes (rows) and 3005 cells (columns).
# Top-left corner of the ranking:
# cells
# genes 1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06 1772067065_H06
# 0610009B22Rik 3417 859 672 648 1822
# 0610009L18Rik 4196 4453 2603 3537 4642
# 0610010B08Rik_loc6 1497 1645 4543 2843 1646
# 0610011F06Rik 355 861 1227 2095 847
# 0610012H03Rik 2331 4084 2533 3098 4702
# 0610039K10Rik 3390 2443 4490 2202 4239
# Quantiles for the number of genes detected by cell:
# min 1% 5% 10% 50% 100%
# 191 270 369 445 929 2062
这一步的排名是对每个基因从高到低进行排名,也就是对每一个细胞的每一个基因进行排名,得到一个排名的矩阵。
3. Calculate the Area Under the Curve (AUC)(AUCell_calcAUC
函数)
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)
save(cells_AUC, file="cells_AUC.RData")
cells_AUC
# AUC for 9 gene sets (rows) and 3005 cells (columns).
# Top-left corner of the AUC matrix:
# cells
# gene sets 1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06 1772067065_H06
# Astrocyte_Cahoy 0.14705221 0.13297992 0.15524498 0.16514056 0.14599197
# Neuron_Cahoy 0.27110040 0.29031325 0.29326908 0.31334940 0.32883534
# Oligodendrocyte_Cahoy 0.15183936 0.13073092 0.13898795 0.12356627 0.13783133
# Astrocyte_Lein 0.15389082 0.13995354 0.21777003 0.24332172 0.10046458
# Neuron_Lein 0.37828427 0.41278886 0.50743906 0.46438746 0.45868946
# Microglia_lavin 0.09373979 0.06140446 0.06481582 0.08372346 0.07410633
根据输入基因list中的基因在每个细胞中的表达量,对每个细胞都算了一个该genelist的AUC值。
因为我们是直接输入的矩阵,如果矩阵是来自于单细胞数据,这时对于每个输入的list,每个细胞都已经有一个auc值,可以将auc值投射回umap或tsne图(见5.实操)。
4. Determine the cells with the given gene signatures or active gene sets (AUCell_exploreThresholds
函数)
AUC估计了输入基因集中的基因在每个细胞高表达的比例。表达了输入基因集中较多基因的细胞具有较高的AUC值。由于AUC值可以代表细胞表达基因集中基因的比例,我们可以使用跨细胞的相对AUCs来查看单细胞转录组的细胞对输入基因集的响应情况。
由于AUC并不是一个绝对的数值,而是受到细胞类型、数据集、基因集等的影响,尤其是在单细胞水平,大多数基因并不在所有细胞中稳定表达。因此有时判断某个基因集的gene在一个细胞中是'on'还是'off'状态比较困难。最理想的方法是双峰分布bi-modal distribution,也就是数据集中部分细胞有较高AUC而大多数的细胞有较低的AUC。AUCell提供了AUCell_exploreThresholds()
函数来区分较高和较低的AUC值,也就是自动划定双峰分布的阈值,判断基因集的'on'或'off'状态。
set.seed(123)
par(mfrow=c(3,3))
cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE)
分别划定了九个基因集的阈值
getAUC(cells_AUC)[,1:5]
# cells
# gene sets 1772071015_C02 1772071017_G12 # 1772071017_A05 1772071014_B06 1772067065_H06
# Astrocyte_Cahoy (538g) 0.14621687 0.132208835 0.15315663 0.163823293 0.147855422
# Neuron_Cahoy (384g) 0.27293173 0.291887550 0.29529317 0.312481928 0.328899598
# Oligodendrocyte_Cahoy (477g) 0.15752610 0.130120482 0.13937349 0.124144578 0.134650602
# Astrocyte_Lein (7g) 0.14982578 0.145180023 0.21544715 0.242160279 0.103368177
# Neuron_Lein (13g) 0.38524850 0.404558405 0.51187085 0.467553023 0.446660336
# Microglia_lavin (165g) 0.09196153 0.062928688 0.06546906 0.082852477 0.074578116
# Random (50g) 0.01273942 0.008908686 0.01336303 0.008819599 0.003296214
# Random (500g) 0.10306827 0.102522088 0.11167871 0.113959839 0.067726908
# HK-like (100g) 0.40626566 0.427769424 0.41182957 0.419548872 0.403208020
warningMsg <- sapply(cells_assignment, function(x) x$aucThr$comment)
warningMsg[which(warningMsg!="")]
# Random (500g)
# "The AUC might follow a normal distribution (random gene-set?). The global distribution overlaps the partial distributions. "
AUCell_exploreThresholds函数计算的每个基因集的阈值储存在$aucThr
。如查看oligodendrocyte基因集的阈值:
cells_assignment$Oligodendrocyte_Cahoy$aucThr$thresholds
## threshold nCells
## Global_k1 0.2419241 739
## L_k2 0.2437017 733
## R_k3 0.1766730 1008
## minimumDens 0.2417192 740
cells_assignment$Oligodendrocyte_Cahoy$aucThr$selected
## minimumDens
## 0.2417192
查看在这个阈值下处于on状态的细胞数目
oligodencrocytesAssigned <- cells_assignment$Oligodendrocyte_Cahoy$assignment
length(oligodencrocytesAssigned)
# [1] 740
head(oligodencrocytesAssigned)
## [1] "1772067069_B02" "1772066077_G03"
## [3] "1772066070_D08" "1772067076_A07"
## [5] "1772067057_G07" "1772067076_C07"
也可以手动对阈值进行调整
绘制特定基因集的条型图,设定新阈值
geneSetName <- rownames(cells_AUC)[grep("Oligodendrocyte_Cahoy", rownames(cells_AUC))]
AUCell_plotHist(cells_AUC[geneSetName,], aucThr=0.25)
abline(v=0.25)
使用新阈值分配细胞
newSelectedCells <- names(which(getAUC(cells_AUC)[geneSetName,]>0.08))
length(newSelectedCells)
# [1] 3002
head(newSelectedCells)
# [1] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06"
# [5] "1772067065_H06" "1772071017_E02"
5. 后续分析
加载3005个细胞的tsne坐标并绘图
load(paste(file.path(system.file('examples', package='AUCell')), "cellsTsne.RData", sep="/"))
cellsTsne <- cellsTsne$Y
plot(cellsTsne, pch=16, cex=.3)
tsne数据的来源:
load("exprMatrix_AUCellVignette_MouseBrain.RData")
sumByGene <- apply(mouseBrainExprMatrix, 1, sum)
exprMatSubset <- mouseBrainExprMatrix[which(sumByGene>0),]
logMatrix <- log2(exprMatSubset+1)
library(Rtsne)
set.seed(123)
cellsTsne <- Rtsne(t(logMatrix))
rownames(cellsTsne$Y) <- colnames(logMatrix)
colnames(cellsTsne$Y) <- c("tsne1", "tsne2")
save(cellsTsne, file="cellsTsne.RData")
上面的tSNE图可以通过AUC评分来上色。为了更好的区分细胞,我们将阈值内的细胞标为pink-red,其他细胞标为black-blue。
selectedThresholds <- getThresholdSelected(cells_assignment)
par(mfrow=c(2,4)) # Splits the plot into two rows and three columns
for(geneSetName in names(selectedThresholds))
{
nBreaks <- 5 # Number of levels in the color palettes
# Color palette for the cells that do not pass the threshold
colorPal_Neg <- grDevices::colorRampPalette(c("black","blue", "skyblue"))(nBreaks)
# Color palette for the cells that pass the threshold
colorPal_Pos <- grDevices::colorRampPalette(c("pink", "magenta", "red"))(nBreaks)
# Split cells according to their AUC value for the gene set
passThreshold <- getAUC(cells_AUC)[geneSetName,] > selectedThresholds[geneSetName]
if(sum(passThreshold) >0 )
{
aucSplit <- split(getAUC(cells_AUC)[geneSetName,], passThreshold)
# Assign cell color
cellColor <- c(setNames(colorPal_Neg[cut(aucSplit[[1]], breaks=nBreaks)], names(aucSplit[[1]])),
setNames(colorPal_Pos[cut(aucSplit[[2]], breaks=nBreaks)], names(aucSplit[[2]])))
# Plot
plot(cellsTsne, main=geneSetName,
sub="Pink/red cells pass the threshold",
col=cellColor[rownames(cellsTsne)], pch=16)
}
}
查看阈值对细胞分配的影响
selectedThresholds[2] <- 0.25
par(mfrow=c(2,3))
AUCell_plotTSNE(tSNE=cellsTsne, exprMat=exprMatrix,
cellsAUC=cells_AUC[1:2,], thresholds=selectedThresholds)
6. 使用pbmc3k数据集进行实操
library(AUCell)
library(clusterProfiler)
library(ggplot2)
library(Seurat)
pbmc <- readRDS("pbmc.rds")
H <- read.gmt("h.all.v7.4.symbols.gmt.txt")
cells_rankings <- AUCell_buildRankings(pbmc@assays$RNA@data) # 关键一步
unique(H$term)
geneSets <- lapply(unique(H$term), function(x){H$gene[H$term == x]})
names(geneSets) <- unique(H$term)
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings, aucMaxRank=nrow(cells_rankings)*0.1)
length(rownames(cells_AUC@assays@data$AUC))
grep("INFLAMMATORY",rownames(cells_AUC@assays@data$AUC),value = T)
# [1] "HALLMARK_INFLAMMATORY_RESPONSE"
geneSet <- "HALLMARK_INFLAMMATORY_RESPONSE"
aucs <- as.numeric(getAUC(cells_AUC)[geneSet, ])
pbmc$AUC <- aucs
df<- data.frame(pbmc@meta.data, pbmc@reductions$umap@cell.embeddings)
head(df)
class_avg <- df %>%
group_by(cell_type) %>%
summarise(
UMAP_1 = median(UMAP_1),
UMAP_2 = median(UMAP_2)
)
ggplot(df, aes(UMAP_1, UMAP_2))+
geom_point(aes(colour= AUC)) +
viridis::scale_color_viridis(option="A") +
ggrepel::geom_label_repel(aes(label = cell_type),
data = class_avg,
size = 6,
label.size = 0,
segment.color = NA)+
theme(legend.position = "none") +
theme_bw()
参考:
http://127.0.0.1:28753/library/AUCell/doc/AUCell.html
https://www.jianshu.com/p/2fb20f44da67