irGSEA:基于秩次的单细胞基因集富集分析整合框架(修订版)
一、背景
许多Functional Class Scoring (FCS)方法,如GSEA
, GSVA
,PLAGE
, addModuleScore
, SCSE
, Vision
, VAM
, gficf
, pagoda2
和Sargent
,都会受数据集组成的影响。数据集组成的轻微变化将改变细胞的基因集富集分数。假如将新的单细胞数据集整合到现有数据中,使用这些FCS方法需要重新计算每个细胞的基因集富集分数。这个步骤可能是繁琐且资源密集的
。相反,基于单个细胞表达等级的FCS,如AUCell
、UCell
、singscore
、ssGSEA
、JASMINE
和Viper
,只需要计算新添加的单细胞数据集的富集分数,而无需重新计算所有细胞的基因集富集分数。原因是这些方法生成的富集分数仅依赖于单个细胞水平上的相对基因表达,与数据集组成无关。因此,这些方法可以节省大量的时间。
二、审视结果
在这里,我们审视了17种
常见的FCS方法:
-
GSEA
检测排序基因列表顶部或底部的基因集富集程度,该列表是分组后计算排序基因信噪比或排序基因倍数变化得到的; -
GSVA
估计所有细胞之间每个基因的累积密度函数的核。 这个过程中需要考虑所有样本,容易受到样本背景信息的影响; -
PLAGE
对跨细胞的基因表达矩阵进行标准化,并提取奇异值分解作为基因集富集分数; -
Zscore
聚合了基因集中所有基因的表达,通过细胞间的平均值和标准差缩放表达; -
AddModuleScore
需要先计算基因集中所有基因的平均值,再根据平均值把表达矩阵切割成若干份,然后从切割后的每一份中随机抽取对照基因(基因集外的基因)作为背景值。因此,在整合不同样本的情况下,即使使用相同基因集为相同细胞打分,也会产生不同的富集评分; -
SCSE
使用基因集所有基因的归一化的总和来量化基因集富集分数; -
Vision
使用随机签名的预期均值和方差对基因集富集分数进行 z 归一化从而校正基因集富集分数; -
VAM
根据经典Mahalanobis多元距离从单细胞 RNA 测序数据生成基因集富集分数; -
Gficf
利用通过非负矩阵分解获得的基因表达值的潜在因子的信息生物信号; -
Pagoda2
拟合每个细胞的误差模型,并使用其第一个加权主成分量化基因集富集分数; -
AUCell
基于单个样本中的基因表达排名,使用曲线下面积来评估输入基因集是否在单个样本的前5%表达基因内富集; -
UCell
基于单个样本的基因表达排名,使用Mann-Whitney U统计量计算单个样本的基因集富集分数; -
Singscore
根据基因表达等级评估距单个细胞中心的距离。 基因集中的基因根据单个细胞中的转录本丰度进行排序。 平均等级相对于理论最小值和最大值单独标准化,以零为中心,然后聚合,所得分数代表基因集的富集分数; -
ssGSEA
根据每个细胞的基因表达等级计算内部和外部基因集之间的经验累积分布的差异分数。 使用全局表达谱对差异分数进行标准化。 标准化这一步容易受样本构成的影响。 -
JASMINE
根据在单个细胞中表达基因中的基因排名和表达基因中基因集的富集度计算近似平均值。 这两个值均标准化为 0-1 范围,并通过平均进行组合,得出基因集的最终富集分数。 -
Viper
通过根据细胞间基因表达的排名执行three-tailed计算来估计基因集的富集分数。 -
Sargent
将给定细胞的非零表达基因从高表达到低表达进行排序,并将输入的基因逐细胞表达矩阵转换为相应的gene-set-by-cell assignment score matrix。 但Sargent
需要计算细胞间的gini-index后,将按gene-set-by-cell assignment score matrix转换为distribution of indexes。
三、工作流程
Graph Abstrast.jpg使用
AUCell、UCell、singscore、ssgsea、JASMINE 和 viper
分别对各个细胞进行评分,得到不同的富集评分矩阵。通过wilcoxon检验
计算不同的富集评分矩阵中每个细胞亚群差异表达的基因集。up或down
表示该细胞簇内差异基因集的富集程度高于或低于其他簇。 单一的基因集富集分析方法不仅只能反映有限的信息,而且也容易带来误差
。我们期待从多个角度解释复杂的生物学问题,并找到生物学问题中的共性部分。简单地为多种基因集富集分析方法的结果取共同交集,不仅容易得到少而保守的结果,而且忽略了富集分析方法中很多的其他信息,例如不同基因集的相对富集程度信息。我们希望目标基因集在大部分富集分析方法中都是富集且富集程度没有明显差异
。因此,我们通过RobustRankAggreg包中的秩聚合算法(robust rank aggregation, RRA)
对差异分析的结果进行评估,筛选出在6种方法中表现出相似的富集程度的差异基因集。
四、安装
1.irGSEA安装(基础配置)
仅使用 AUCell, UCell, singscore, ssGSEA, JASMINE和viper
# install packages from CRAN
cran.packages <- c("aplot", "BiocManager", "data.table", "devtools",
"doParallel", "doRNG", "dplyr", "ggfun", "gghalves",
"ggplot2", "ggplotify", "ggridges", "ggsci", "irlba",
"magrittr", "Matrix", "msigdbr", "pagoda2", "pointr",
"purrr", "RcppML", "readr", "reshape2", "reticulate",
"rlang", "RMTstat", "RobustRankAggreg", "roxygen2",
"Seurat", "SeuratObject", "stringr", "tibble", "tidyr",
"tidyselect", "tidytree", "VAM")
for (i in cran.packages) {
if (!requireNamespace(i, quietly = TRUE)) {
install.packages(i, ask = F, update = F)
}
}
# install packages from Bioconductor
bioconductor.packages <- c("AUCell", "BiocParallel", "ComplexHeatmap",
"decoupleR", "fgsea", "ggtree", "GSEABase",
"GSVA", "Nebulosa", "scde", "singscore",
"SummarizedExperiment", "UCell",
"viper","sparseMatrixStats")
for (i in bioconductor.packages) {
if (!requireNamespace(i, quietly = TRUE)) {
BiocManager::install(i, ask = F, update = F)
}
}
# install packages from Github
if (!requireNamespace("irGSEA", quietly = TRUE)) {
devtools::install_github("chuiqin/irGSEA", force =T)
}
2.irGSEA安装(进阶配置)
想使用VISION, gficf, Sargent, ssGSEApy, GSVApy等其他方法(这部分是可选安装)
# VISION
if (!requireNamespace("VISION", quietly = TRUE)) {
devtools::install_github("YosefLab/VISION", force =T)
}
# mdt need ranger
if (!requireNamespace("ranger", quietly = TRUE)) {
devtools::install_github("imbs-hl/ranger", force =T)
}
# gficf need RcppML (version > 0.3.7) package
if (!utils::packageVersion("RcppML") > "0.3.7") {
message("The version of RcppML should greater than 0.3.7 and install RcppML package from Github")
devtools::install_github("zdebruine/RcppML", force =T)
}
# please first `library(RcppML)` if you want to perform gficf
if (!requireNamespace("gficf", quietly = TRUE)) {
devtools::install_github("gambalab/gficf", force =T)
}
# GSVApy and ssGSEApy need SeuratDisk package
if (!requireNamespace("SeuratDisk", quietly = TRUE)) {
devtools::install_github("mojaveazure/seurat-disk", force =T)
}
# sargent
if (!requireNamespace("sargent", quietly = TRUE)) {
devtools::install_github("Sanofi-Public/PMCB-Sargent", force =T)
}
# pagoda2 need scde package
if (!requireNamespace("scde", quietly = TRUE)) {
devtools::install_github("hms-dbmi/scde", force =T)
}
# if error1 (functio 'sexp_as_cholmod_sparse' not provided by package 'Matrix')
# or error2 (functio 'as_cholmod_sparse' not provided by package 'Matrix') occurs
# when you perform pagoda2, please check the version of irlba and Matrix
# It's ok when I test as follow:
# R 4.2.2 irlba(v 2.3.5.1) Matrix(1.5-3)
# R 4.3.1 irlba(v 2.3.5.1) Matrix(1.6-1.1)
# R 4.3.2 irlba(v 2.3.5.1) Matrix(1.6-3)
#### create conda env
# If error (Unable to find conda binary. Is Anaconda installed) occurs,
# please perform `reticulate::install_miniconda()`
if (! "irGSEA" %in% reticulate::conda_list()$name) {
reticulate::conda_create("irGSEA")
}
# if python package exist
python.package <- reticulate::py_list_packages(envname = "irGSEA")$package
require.package <- c("anndata", "scanpy", "argparse", "gseapy", "decoupler")
for (i in seq_along(require.package)) {
if (i %in% python.package) {
reticulate::conda_install(envname = "irGSEA", packages = i, pip = T)
}
}
3.国内镜像加速安装
安装github包和pip包有困难的参考这里,我把github包克隆到了gitee上
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
options("repos" = c(CRAN="http://mirrors.cloud.tencent.com/CRAN/"))
# install packages from CRAN
cran.packages <- c("aplot", "BiocManager", "data.table", "devtools",
"doParallel", "doRNG", "dplyr", "ggfun", "gghalves",
"ggplot2", "ggplotify", "ggridges", "ggsci", "irlba",
"magrittr", "Matrix", "msigdbr", "pagoda2", "pointr",
"purrr", "RcppML", "readr", "reshape2", "reticulate",
"rlang", "RMTstat", "RobustRankAggreg", "roxygen2",
"Seurat", "SeuratObject", "stringr", "tibble", "tidyr",
"tidyselect", "tidytree", "VAM")
for (i in cran.packages) {
if (!requireNamespace(i, quietly = TRUE)) {
BiocManager::install(i, ask = F, update = F)
}
}
# install packages from Bioconductor
bioconductor.packages <- c("AUCell", "BiocParallel", "ComplexHeatmap",
"decoupleR", "fgsea", "ggtree", "GSEABase",
"GSVA", "Nebulosa", "scde", "singscore",
"SummarizedExperiment", "UCell", "viper")
for (i in bioconductor.packages) {
if (!requireNamespace(i, quietly = TRUE)) {
install.packages(i, ask = F, update = F)
}
}
# install packages from git
if (!requireNamespace("irGSEA", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/irGSEA.git", force =T)
}
# VISION
if (!requireNamespace("VISION", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/VISION.git", force =T)
}
# mdt need ranger
if (!requireNamespace("ranger", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/ranger.git", force =T)
}
# gficf need RcppML (version > 0.3.7) package
if (!utils::packageVersion("RcppML") > "0.3.7") {
message("The version of RcppML should greater than 0.3.7 and install RcppML package from Git")
devtools::install_git("https://gitee.com/fan_chuiqin/RcppML.git", force =T)
}
# please first `library(RcppML)` if you want to perform gficf
if (!requireNamespace("gficf", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/gficf.git", force =T)
}
# GSVApy and ssGSEApy need SeuratDisk package
if (!requireNamespace("SeuratDisk", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/seurat-disk.git",
force =T)}
# sargent
if (!requireNamespace("sargent", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/PMCB-Sargent.git",
force =T)}
# pagoda2 need scde package
if (!requireNamespace("scde", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/scde.git", force =T)
}
#### create conda env
# If error (Unable to find conda binary. Is Anaconda installed) occurs,
# please perform `reticulate::install_miniconda()`
if (! "irGSEA" %in% reticulate::conda_list()$name) {
reticulate::conda_create("irGSEA")
}
# if python package exist
python.package <- reticulate::py_list_packages(envname = "irGSEA")$package
require.package <- c("anndata", "scanpy", "argparse", "gseapy", "decoupler")
for (i in require.package) {
if (! i %in% python.package) {
reticulate::conda_install(envname = "irGSEA", packages = i, pip = T,
pip_options = "-i https://pypi.tuna.tsinghua.edu.cn/simple")
}
}
五、使用教程
1.irGSEA支持Seurat 对象(V5或V4),Assay对象(V5或V4)
# 我们通过SeuratData包加载示例数据集(注释好的PBMC数据集)作为演示
#### Seurat V4对象 ####
library(Seurat)
library(SeuratData)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
Assays(pbmc3k.final)
>[1] "RNA" "AUCell" "UCell" "singscore" "ssgsea" "JASMINE" "viper"
#### Seurat V5对象 ####
data("pbmc3k.final")
pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)
pbmc3k.final2 <- CreateSeuratObject(counts = CreateAssay5Object(GetAssayData(pbmc3k.final, assay = "RNA", slot = "counts")),
meta.data = pbmc3k.final[[]])
pbmc3k.final2 <- NormalizeData(pbmc3k.final2)
pbmc3k.final2 <- irGSEA.score(object = pbmc3k.final2,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
Assays(pbmc3k.final2)
[1] "RNA" "AUCell" "UCell" "singscore" "ssgsea" "JASMINE" "viper"
#### Assay5 对象 ####
data("pbmc3k.final")
pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)
pbmc3k.final3 <- CreateAssay5Object(counts = GetAssayData(pbmc3k.final, assay = "RNA", slot = "counts"))
pbmc3k.final3 <- NormalizeData(pbmc3k.final3)
pbmc3k.final3 <- irGSEA.score(object = pbmc3k.final3,assay = "RNA",
slot = "counts", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
Assays(pbmc3k.final3)
# Assay5对象存放在RNA assay中
> cat(object.size(pbmc3k.final) / (1024^2), "M")
339.955 M
> cat(object.size(pbmc3k.final2) / (1024^2), "M")
69.33521 M
> cat(object.size(pbmc3k.final3) / (1024^2), "M")
69.27851 M
# 看起来Seurat V5和Assay 5确实存放数据会比较小
2.irGSEA支持的基因集打分方法
测试了不同数据大小下各种评分方法使用50个Hallmark基因集进行打分所需的时间和内存峰值, 大家根据自己的电脑和时间进行酌情选择;
GSVApy、ssGSEApy 和 viperpy 分别代表 GSVA、ssGSEA 和 viper 的 Python 版本。Python版本的GSVA比R版本的GSVA节约太多时间了。
我们对singscore、ssGSEA、JASMINE、viper的内存峰值进行了优化
。 对于超过 50000 个细胞的数据集,我们实施了一种策略,将它们划分为5000 个细胞/单元进行评分。 虽然这可以缓解内存峰值问题,但确实会延长处理时间。
3.irGSEA支持的基因集打分方法
为了方便用户获取MSigDB数据库中预先定义好的基因集,我们内置了msigdbr包
进行MSigDB的基因集数据的获取。msigdbr包
支持多个物种的基因集获取,以及多种基因格式
的表达矩阵的输入。
①.irGSEA通过内置的msigdbr包进行打分
library(Seurat)
library(SeuratData)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")
#### Hallmark基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "C2",
subcategory = "CP:KEGG", geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "C5",
subcategory = "GO:BP", geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
②.irGSEA使用最新版的MSigDB基因集进行打分
遗憾的是,
msigdbr包
内置的MSigDB的版本是MSigDB 2022.1
。然而,目前MSigDB数据库已经更新到了2023.2
,包括2023.2.Hs
和2023.2.Mm
。我们可以从这个链接(https://data.broadinstitute.org/gsea-msigdb/msigdb/release/)下载2023.2.Hs
的gmt文件
或者db.zip文件
。相比gmt文件
,db.zip文件
包含了基因集的描述,可以用来筛选XX功能相关基因。下面的例子中,我将介绍如何筛选血管生成相关的基因集
。
#### work with newest Msigdb ####
# https://data.broadinstitute.org/gsea-msigdb/msigdb/release/
# In this page, you can download human/mouse gmt file or db.zip file
# The db.zip file contains metadata information for the gene set
# load library
library(clusterProfiler)
library(tidyverse)
library(DBI)
library(RSQLite)
### db.zip ###
# download zip file and unzip zip file
zip_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/msigdb_v2023.2.Hs.db.zip"
local_zip_path <- "./msigdb_v2023.2.Hs.db.zip"
download.file(zip_url, local_zip_path)
unzip(local_zip_path, exdir = "./")
# code modified by https://rdrr.io/github/cashoes/sear/src/data-raw/1_parse_msigdb_sqlite.r
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = './msigdb_v2023.2.Hs.db')
DBI::dbListTables(con)
# define tables we want to combine
geneset_db <- dplyr::tbl(con, 'gene_set') # standard_name, collection_name
details_db <- dplyr::tbl(con, 'gene_set_details') # description_brief, description_full
geneset_genesymbol_db <- dplyr::tbl(con, 'gene_set_gene_symbol') # meat and potatoes
genesymbol_db <- dplyr::tbl(con, 'gene_symbol') # mapping from ids to gene symbols
collection_db <- dplyr::tbl(con, 'collection') %>% dplyr::select(collection_name, full_name) # collection metadata
# join tables
msigdb <- geneset_db %>%
dplyr::left_join(details_db, by = c('id' = 'gene_set_id')) %>%
dplyr::left_join(collection_db, by = 'collection_name') %>%
dplyr::left_join(geneset_genesymbol_db, by = c('id' = 'gene_set_id')) %>%
dplyr::left_join(genesymbol_db, by = c('gene_symbol_id' = 'id')) %>%
dplyr::select(collection = collection_name, subcollection = full_name, geneset = standard_name, description = description_brief, symbol) %>%
dplyr::as_tibble()
# clean up
DBI::dbDisconnect(con)
unique(msigdb$collection)
# [1] "C1" "C2:CGP" "C2:CP:BIOCARTA"
# [4] "C2:CP:KEGG_LEGACY" "C2:CP:PID" "C3:MIR:MIRDB"
# [7] "C3:MIR:MIR_LEGACY" "C3:TFT:GTRD" "C3:TFT:TFT_LEGACY"
# [10] "C4:3CA" "C4:CGN" "C4:CM"
# [13] "C6" "C7:IMMUNESIGDB" "C7:VAX"
# [16] "C8" "C5:GO:BP" "C5:GO:CC"
# [19] "C5:GO:MF" "H" "C5:HPO"
# [22] "C2:CP:KEGG_MEDICUS" "C2:CP:REACTOME" "C2:CP:WIKIPATHWAYS"
# [25] "C2:CP"
unique(msigdb$subcollection)
# [1] "C1" "C2:CGP" "C2:CP:BIOCARTA"
# [4] "C2:CP:KEGG_LEGACY" "C2:CP:PID" "C3:MIR:MIRDB"
# [7] "C3:MIR:MIR_LEGACY" "C3:TFT:GTRD" "C3:TFT:TFT_LEGACY"
# [10] "C4:3CA" "C4:CGN" "C4:CM"
# [13] "C6" "C7:IMMUNESIGDB" "C7:VAX"
# [16] "C8" "C5:GO:BP" "C5:GO:CC"
# [19] "C5:GO:MF" "H" "C5:HPO"
# [22] "C2:CP:KEGG_MEDICUS" "C2:CP:REACTOME" "C2:CP:WIKIPATHWAYS"
# [25] "C2:CP"
# convert to list[Hallmark] required by irGSEA package
msigdb.h <- msigdb %>%
dplyr::filter(collection=="H") %>%
dplyr::select(c("geneset", "symbol"))
msigdb.h$geneset <- factor(msigdb.h$geneset)
msigdb.h <- msigdb.h %>%
dplyr::group_split(geneset, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.h$geneset))
#### Hallmark基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.h,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# convert to list[go bp] required by irGSEA package
msigdb.go.bp <- msigdb %>%
dplyr::filter(collection=="C5:GO:BP") %>%
dplyr::select(c("geneset", "symbol"))
msigdb.go.bp$geneset <- factor(msigdb.go.bp$geneset)
msigdb.go.bp <- msigdb.go.bp %>%
dplyr::group_split(geneset, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.go.bp$geneset))
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.go.bp,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# convert to list[KEGG] required by irGSEA package
msigdb.kegg <- msigdb %>%
dplyr::filter(collection=="C2:CP:KEGG_MEDICUS") %>%
dplyr::select(c("geneset", "symbol"))
msigdb.kegg$geneset <- factor(msigdb.kegg$geneset)
msigdb.kegg <- msigdb.kegg %>%
dplyr::group_split(geneset, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.kegg$geneset))
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.kegg,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# Look for the gene sets associated with angiogenesis from gene sets names and
# gene sets descriptions
category <- c("angiogenesis", "vessel")
msigdb.vessel <- list()
for (i in category) {
# Ignore case matching
find.index.description <- stringr::str_detect(msigdb$description, pattern = regex(all_of(i), ignore_case=TRUE))
find.index.name <- stringr::str_detect(msigdb$geneset, pattern = regex(all_of(i), ignore_case=TRUE))
msigdb.vessel[[i]] <- msigdb[find.index.description | find.index.name, ] %>% mutate(category = i)
}
msigdb.vessel <- do.call(rbind, msigdb.vessel)
head(msigdb.vessel)
# # A tibble: 6 × 6
# collection subcollection geneset description symbol category
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … HECW1 angioge…
# 2 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … JADE2 angioge…
# 3 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … SEMA3C angioge…
# 4 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … STUB1 angioge…
# 5 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … FAH angioge…
# 6 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … COL7A1 angioge…
length(unique(msigdb.vessel$geneset))
# [1] 112
# convert gene sets associated with angiogenesis to list
# required by irGSEA package
msigdb.vessel <- msigdb.vessel %>%
dplyr::select(c("geneset", "symbol"))
msigdb.vessel$geneset <- factor(msigdb.vessel$geneset)
msigdb.vessel <- msigdb.vessel %>%
dplyr::group_split(geneset, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.vessel$geneset))
#### 血管生成相关基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.vessel,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
### gmt file ###
# download gmt file
gmt_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/msigdb.v2023.2.Hs.symbols.gmt"
local_gmt <- "./msigdb.v2023.2.Hs.symbols.gmt"
download.file(gmt_url , local_gmt)
msigdb <- clusterProfiler::read.gmt("./msigdb.v2023.2.Hs.symbols.gmt")
# convert to list[hallmarker] required by irGSEA package
msigdb.h <- msigdb %>%
dplyr::filter(str_detect(term, pattern = regex("HALLMARK_", ignore_case=TRUE)))
msigdb.h$term <- factor(msigdb.h$term)
msigdb.h <- msigdb.h %>%
dplyr::group_split(term, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.h$term))
#### Hallmark基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.h,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# convert to list[go bp] required by irGSEA package
msigdb.go.bp <- msigdb %>%
dplyr::filter(str_detect(term, pattern = regex("GOBP_", ignore_case=TRUE)))
msigdb.go.bp$term <- factor(msigdb.go.bp$term)
msigdb.go.bp <- msigdb.go.bp %>%
dplyr::group_split(term, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.go.bp$term))
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.go.bp,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# convert to list[KEGG] required by irGSEA package
msigdb.kegg <- msigdb %>%
dplyr::filter(str_detect(term, pattern = regex("KEGG_", ignore_case=TRUE)))
msigdb.kegg$term <- factor(msigdb.kegg$term)
msigdb.kegg <- msigdb.kegg %>%
dplyr::group_split(term, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.kegg$term))
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.kegg,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
③.irGSEA使用clusterProfiler包同款基因集进行打分
#### work with clusterProfiler package ####
# load library
library(clusterProfiler)
library(tidyverse)
### kegg ###
# download kegg pathway (human) and write as gson file
kk <- clusterProfiler::gson_KEGG(species = "hsa")
gson::write.gson(kk, file = "./KEGG_20231128.gson")
# read gson file
kk2 <- gson::read.gson("./KEGG_20231123.gson")
# Convert to a data frame
kegg.list <- dplyr::left_join(kk2@gsid2name,
kk2@gsid2gene,
by = "gsid")
head(kegg.list)
# gsid name gene
# 1 hsa01100 Metabolic pathways 10
# 2 hsa01100 Metabolic pathways 100
# 3 hsa01100 Metabolic pathways 10005
# 4 hsa01100 Metabolic pathways 10007
# 5 hsa01100 Metabolic pathways 100137049
# 6 hsa01100 Metabolic pathways 10020
# Convert gene ID to gene symbol
gene_name <- clusterProfiler::bitr(kegg.list$gene,
fromType = "ENTREZID",
toType = "SYMBOL",
OrgDb = "org.Hs.eg.db")
kegg.list <- dplyr::full_join(kegg.list,
gene_name,
by = c("gene"="ENTREZID"))
# remove NA value if exist
kegg.list <- kegg.list[complete.cases(kegg.list[, c("gene", "SYMBOL")]), ]
head(kegg.list)
# gsid name gene SYMBOL
# 1 hsa01100 Metabolic pathways 10 NAT2
# 2 hsa01100 Metabolic pathways 100 ADA
# 3 hsa01100 Metabolic pathways 10005 ACOT8
# 4 hsa01100 Metabolic pathways 10007 GNPDA1
# 5 hsa01100 Metabolic pathways 100137049 PLA2G4B
# 6 hsa01100 Metabolic pathways 10020 GNE
# convert to list required by irGSEA package
kegg.list$name <- factor(kegg.list$name)
kegg.list <- kegg.list %>%
dplyr::group_split(name, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(SYMBOL) %>% unique(.)) %>%
purrr::set_names(levels(kegg.list$name))
head(kegg.list)
### 整理好之后就可以进行KEGG打分
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = kegg.list,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
### go bp ###
# download go bp (human) and write as gson file
go <- clusterProfiler::gson_GO(OrgDb = "org.Hs.eg.db", ont = "BP")
gson::write.gson(go, file = "./go_20231128.gson")
# read gson file
go2 <- gson::read.gson("./go_20231128.gson")
# Convert to a data frame
go.list <- dplyr::left_join(go2@gsid2name,
go2@gsid2gene,
by = "gsid")
head(go.list)
# gsid name gene
# 1 GO:0000001 mitochondrion inheritance <NA>
# 2 GO:0000002 mitochondrial genome maintenance 142
# 3 GO:0000002 mitochondrial genome maintenance 291
# 4 GO:0000002 mitochondrial genome maintenance 1763
# 5 GO:0000002 mitochondrial genome maintenance 1890
# 6 GO:0000002 mitochondrial genome maintenance 2021
# Convert gene ID to gene symbol
go.list <- dplyr::full_join(go.list,
go2@gene2name,
by = c("gene"="ENTREZID"))
# remove NA value if exist
go.list <- go.list[complete.cases(go.list[, c("gene", "SYMBOL")]), ]
head(go.list)
# gsid name gene SYMBOL
# 2 GO:0000002 mitochondrial genome maintenance 142 PARP1
# 3 GO:0000002 mitochondrial genome maintenance 291 SLC25A4
# 4 GO:0000002 mitochondrial genome maintenance 1763 DNA2
# 5 GO:0000002 mitochondrial genome maintenance 1890 TYMP
# 6 GO:0000002 mitochondrial genome maintenance 2021 ENDOG
# 7 GO:0000002 mitochondrial genome maintenance 3980 LIG3
# convert to list required by irGSEA package
go.list$name <- factor(go.list$name)
go.list <- go.list %>%
dplyr::group_split(name, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(SYMBOL) %>% unique(.)) %>%
purrr::set_names(levels(go.list$name))
head(go.list)
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = go.list,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
3.irGSEA的完整流程
library(Seurat)
library(SeuratData)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")
# 基因集打分
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
# 计算差异基因集,并进行RRA
# 如果报错,考虑加句代码options(future.globals.maxSize = 100000 * 1024^5)
result.dge <- irGSEA.integrate(object = pbmc3k.final,
group.by = "seurat_annotations",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"))
# 查看RRA识别的在多种打分方法中都普遍认可的差异基因集
geneset.show <- result.dge$RRA %>%
dplyr::filter(pvalue <= 0.05) %>%
dplyr::pull(Name) %>% unique(.)
4.可视化展示
1). 全局展示
①.热图
你还可以把method从'RRA"换成“ssgsea”,展示特定基因集富集分析方法中差异上调或差异下调的基因集;
irGSEA.heatmap.plot <- irGSEA.heatmap(object = result.dge,
method = "RRA",
top = 50,
show.geneset = NULL)
irGSEA.heatmap.plot
image.png
默认展示前50,你也可以展示你想展示的基因集。
例如,我想展示RRA识别的差异基因集。
irGSEA.heatmap.plot1 <- irGSEA.heatmap(object = result.dge,
method = "RRA",
show.geneset = geneset.show)
irGSEA.heatmap.plot1
image.png
②.气泡图
irGSEA.bubble.plot <- irGSEA.bubble(object = result.dge,
method = "RRA",
show.geneset = geneset.show)
irGSEA.bubble.plot
image.png
③.upset plot
upset图展示了综合评估中每个细胞亚群具有统计学意义差异的基因集的数目,以及不同细胞亚群之间具有交集的差异基因集数目;左边不同颜色的条形图代表不同的细胞亚群;上方的条形图代表具有交集的差异基因集的数目;中间的气泡图单个点代表单个细胞亚群,多个点连线代表多个细胞亚群取交集()这里只展示两两取交集;
irGSEA.upset.plot <- irGSEA.upset(object = result.dge,
method = "RRA",
mode = "intersect",
upset.width = 20,
upset.height = 10,
set.degree = 2,
pt_size = grid::unit(2, "mm"))
irGSEA.upset.plot
image.png
④.堆叠条形图
堆叠柱状图具体展示每种基因集富集分析方法中每种细胞亚群中上调、下调和没有统计学差异的基因集数目;上方的条形代表每个亚群中不同方法中差异的基因数目,红色代表上调的差异基因集,蓝色代表下调的差异基因集;中间的柱形图代表每个亚群中不同方法中上调、下调和没有统计学意义的基因集的比例;
irGSEA.barplot.plot <- irGSEA.barplot(object = result.dge,
method = c("AUCell", "UCell", "singscore",
"ssgsea", "JASMINE", "viper", "RRA"))
irGSEA.barplot.plot
image.png
2). 局部展示
①.密度散点图
密度散点图将基因集的富集分数和细胞亚群在低维空间的投影结合起来,展示了特定基因集在空间上的表达水平。其中,颜色越黄,密度分数越高,代表富集分数越高;
scatterplot <- irGSEA.density.scatterplot(object = pbmc3k.final,
method = "UCell",
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE",
reduction = "umap")
scatterplot
image.png
②.半小提琴图
半小提琴图同时以小提琴图(左边)和箱线图(右边)进行展示。不同颜色代表不同的细胞亚群;
halfvlnplot <- irGSEA.halfvlnplot(object = pbmc3k.final,
method = "UCell",
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
halfvlnplot
image.png
除此之外,还可以
vlnplot <- irGSEA.vlnplot(object = pbmc3k.final,
method = c("AUCell", "UCell", "singscore", "ssgsea",
"JASMINE", "viper"),
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
vlnplot
image.png
③.山峦图
山峦图中上方的核密度曲线展示了数据的主要分布,下方的条形编码图展示了细胞亚群具体的数量。不同颜色代表不同的细胞亚群,而横坐标代表不同的表达水平;
ridgeplot <- irGSEA.ridgeplot(object = pbmc3k.final,
method = "UCell",
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
ridgeplot
image.png
④.密度热图
密度热图展示了具体差异基因在不同细胞亚群中的表达和分布水平。颜色越红,代表富集分数越高;
densityheatmap <- irGSEA.densityheatmap(object = pbmc3k.final,
method = "UCell",
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
densityheatmap
image.png
六、参考文献
'AUCell (https://doi.org/10.1038/nmeth.4463)';
'UCell (https://doi.org/10.1016/j.csbj.2021.06.043)':;
'singscore (https://doi.org/10.1093/nar/gkaa802)';
'ssgsea (https://doi.org/10.1038/nature08460)';
'JASMINE (https://doi.org/10.7554/eLife.71994)'
'VAM (https://doi.org/10.1093/nar/gkaa582)';
'scSE (https://doi.org/10.1093/nar/gkz601)';
'VISION (https://doi.org/10.1038/s41467-019-12235-0)';
'wsum (https://doi.org/10.1093/bioadv/vbac016)';
'wmean (https://doi.org/10.1093/bioadv/vbac016)';
'mdt (https://doi.org/10.1093/bioadv/vbac016)';
'viper (https://doi.org/10.1038/ng.3593)';
'GSVApy (https://doi.org/10.1038/ng.3593)': ;
'gficf (https://doi.org/10.1093/nargab/lqad024)';
'GSVA (https://doi.org/10.1186/1471-2105-14-7)';
'zscore (https://doi.org/10.1371/journal.pcbi.1000217)';
'plage (https://doi.org/10.1186/1471-2105-6-225)';
'ssGSEApy (https://doi.org/10.1093/bioinformatics/btac757)';
'viperpy (https://doi.org/10.1093/bioinformatics/btac757)';
'AddModuleScore (https://doi.org/10.1126/science.aad0501)';
'pagoda2 (https://doi.org/10.1038/nbt.4038)':;
'Sargent (https://doi.org/10.1016/j.mex.2023.102196)';