muscat代码实操（一）：模拟复杂的 scRNA-seq 数据

2023-09-04 本文已影响0人生信宝库

前言

与传统的单细胞分析工具不同，muscat更加关注于揭示样本间的变异性，从而得出更具普遍性的结论。在上一期的推文中muscat：专注于多样本多分组的单细胞差异分析，我们介绍了muscat的功能框架。那么从本期推文开始，我们将通过代码实操的方式来介绍muscat的使用技巧。那么，今天让我们一起来学习一下如何使用muscat模拟复杂的 scRNA-seq 数据以及评估不同的差异表达分析方法的性能。

代码流程

0. 加载R包

library(cowplot)
library(dplyr)
library(reshape2)
#BiocManager::install("muscat")
library(muscat)
library(purrr)
library(scater)
library(SingleCellExperiment)

1. prepSim：准备模拟数据

数据来源：GSE96583；包含8名狼疮患者在IFN-β治疗前后获得的scRNA-seq PBCM数据

#estimate simulation parameters
data(example_sce)
table(example_sce$group_id)
# ctrl stim 
# 750  806 
ref <- prepSim(example_sce, verbose = FALSE)
# only samples from `ctrl` group are retained
table(ref$sample_id)
# ctrl101 ctrl107 
# 200     200 
# 用法prepSim(x, min_count = 1, min_cells = 10, min_genes = 100, 
#           min_size = 100, group_keep = NULL, verbose = TRUE)
# 参数x是一个SingleCellExperiment。
# min_count，min_cells用于过滤基因；只保留在>=min_cells中具有>min_count计数的基因。
# min_genes用于过滤细胞；只保留在>=min_genes中具有>0计数的细胞。
# min_size用于过滤亚群-样本组合；只保留具有>=min_size个细胞的实例。指定min_size=NULL跳过此步骤。
# group_keep字符字符串；默认值NULL保留来自levels(x$group_id)[1]的样本；

prepSim通过以下步骤为muscat的simData函数准备输入SCE进行模拟：

基本过滤基因和细胞
(可选)过滤亚群-样本实例
分别估计细胞(库大小)和基因参数(离散度和样本特定均值)。

# cell parameters: library sizes
sub <- assay(example_sce[rownames(ref), colnames(ref)])
all.equal(exp(ref$offset), as.numeric(colSums(sub)))
## [1] "names for target but not for current"
## [2] "Mean relative difference: 0.4099568"
#模拟数据与原始数据平均相对差异为0.4099568

平均相对差异的值范围是0到1（或0%到100%）。值越接近0，表示两组数据越相似；值越接近1，表示两组数据越不相似。

# gene parameters: dispersions & sample-specific means
head(rowData(ref))
# DataFrame with 6 rows and 4 columns
# ENSEMBL      SYMBOL                           beta      disp
# <character> <character>                    <DataFrame> <numeric>
# ISG15    ENSG00000187608       ISG15 -7.84574:-0.24711310:-1.039480 4.6360532
# AURKAIP1 ENSG00000175756    AURKAIP1 -7.84859: 0.00768812:-1.171896 0.1784345
# MRPL20   ENSG00000242485      MRPL20 -8.31434:-0.58684477:-0.304827 0.6435324
# SSU72    ENSG00000160075       SSU72 -8.05160:-0.17119703:-0.793222 0.0363892
# RER1     ENSG00000157916        RER1 -7.75327: 0.10731331:-1.261821 0.5046166
# RPL22    ENSG00000116251       RPL22 -8.03553:-0.03357193: 0.143506 0.2023632

2. simData: Simulating complex designs

# simulated 10% EE genes
sim <- simData(ref, p_dd = diag(6)[1, ],
               nk = 3, ns = 3, nc = 2e3,
               ng = 1e3, force = TRUE)
让我们看一下muscat包设定的差异模式
Non-differential ：EE、EP
Differential ：DE、DP、DM、DB
according to a probability vector p_dd =(pEE,pEP,pDE,pDP,pDM,pDB)
nk = 3：定义了3个子群。
ns = 3：定义了3个样本。
nc = 2e3：定义了2000个细胞。
ng = 1e3：定义了1000个基因。
table(sim$sample_id, sim$cluster_id)
#            cluster1 cluster2 cluster3
# sample1.A      123      110       91
# sample2.A      116      102      111
# sample3.A      112      114      107
# sample1.B      120      101      129
# sample2.B      109      110      128
# sample3.B       99       97      121
metadata(sim)$ref_sids
#             A         B        
# sample1 "ctrl101" "ctrl107"
# sample2 "ctrl101" "ctrl101"
# sample3 "ctrl101" "ctrl101"

# simulated paired design
sim <- simData(ref, paired = TRUE, 
               nk = 3, ns = 3, nc = 2e3,
               ng = 1e3, force = TRUE)
#使用了paired = TRUE参数，表示模拟的是配对设计。
# same set of reference samples for both groups
ref_sids <- metadata(sim)$ref_sids
#             A         B        
# sample1 "ctrl107" "ctrl107"
# sample2 "ctrl101" "ctrl101"
# sample3 "ctrl107" "ctrl107"
all(ref_sids[, 1] == ref_sids[, 2])
## [1] TRUE
#表示两列完全相同，即两组使用了相同的参考样本。

在这里，我们介绍3个参数

p_dd: Simulating differential distributions: 参数p_dd指定要为每个DD类别模拟的单元格的比例。它的值应该在[0,1]中，和为1。

# simulate genes from all DD categories
sim <- simData(ref, p_dd = c(0.5, rep(0.1, 5)),
               nc = 2e3, ng = 1e3, force = TRUE)
gi <- metadata(sim)$gene_info
table(gi$category)
# ee  ep  de  dp  dm  db 
# 976 202 228 202 188 204

image.png

rel_lfc: Simulating cluster-specific state changes

rel_lfc=c(1,1,1)所有子种群的变化幅度相等
rel_lfc=c(1,1,3) 单个集群的变化更大
rel_lfc=c(0,1,2) 单个集群的特定FC因素没有变化

image.png

p_type: Simulating type features

差异状态(DS)分析用于测试不同实验条件下亚群特异性表达变化:

使用稳定的分子特征(即类型特征)将细胞分组为有意义的亚群;
对状态特征进行统计测试，这些状态特征是更短暂的表达，可能会在例如治疗或疾病期间的表达发生变化。

image.png

引入每个子种群的类型特征的比例通过参数p_type指定，p_type值的增加导致按簇ID上色时细胞分离越来越多，但状态变化的缺乏导致按group ID上色时细胞混合均匀。

3. Simulation a hierarchical cluster structure

simData包含三个参数，控制亚群的相似和差异:

p_type determines the percentage of type genes exclusive to each cluster
phylo_tree represents a phylogenetic tree specifying of clusters relate to one another
phylo_pars controls how branch distances are to be interpreted

3.1 p_type: Introducing type features

# simulate 5% of type genes; one group only
sim <- simData(ref, p_type = 0.1, 
    nc = 2e3, ng = 1e3, force = TRUE,
    probs = list(NULL, NULL, c(1, 0)))
# do log-library size normalization
sim <- logNormCounts(sim)

image.png

3.2 phylo_tree: Introducing a cluster phylogeny.

上面描述的场景可以说是不太现实的，在生物学环境中，亚种群之间并不存在特定的基因子集差异，但可能共享一些决定其生物学作用的基因。也就是说，集合类型特征对每个给定的子种群来说都不是排他性的，并且一些子种群之间比其他子种群更相似。为了引入更实际的亚种群结构，可以为simData提供一个系统发育树phylo_tree，它指定集群之间的关系和距离。

# specify cluster phylogeny 
tree <- "(('cluster1':0.4,'cluster2':0.4):0.4,('cluster3':
    0.5,('cluster4':0.2,'cluster5':0.2,'cluster6':0.2):0.4):0.4);"
# visualize cluster tree
library(phylogram)
plot(read.dendrogram(text = tree))

image.png

# simulate 5% of type genes; one group only
sim <- simData(ref, 
    phylo_tree = tree, phylo_pars = c(0.1, 1),
    nc = 800, ng = 1e3, dd = FALSE, force = TRUE)
# do log-library size normalization
sim <- logNormCounts(sim)
# extract gene metadata & number of clusters
rd <- rowData(sim)
nk <- nlevels(sim$cluster_id)
# filter for type & shared genes with high expression mean
is_type <- rd$class != "state"
is_high <- rowMeans(assay(sim, "logcounts")) > 1
gs <- rownames(sim)[is_type & is_high]
# sample 100 cells per cluster for plotting
cs <- lapply(split(seq_len(ncol(sim)), sim$cluster_id), sample, 50)
plotHeatmap(sim[, unlist(cs)], features = gs, 
    center = TRUE, show_rownames = FALSE,
    colour_columns_by = "cluster_id")

image.png

4. 质量控制

但模拟数据的质量因参考数据集而异，并可能受到过于极端的模拟参数的影响。因此，作者建议生成countsimQC报告(Soneson and Robinson 2018)查看数据质量。

#BiocManager::install("countsimQC")
library(countsimQC)
library(DESeq2)
# simulate data
sim <- simData(ref, 
               ng = nrow(ref), 
               nc = ncol(ref),
               dd = FALSE)
# construct 'DESeqDataSet's for both, 
# simulated and reference dataset
dds_sim <- DESeqDataSetFromMatrix(
  countData = counts(sim),
  colData = colData(sim),
  design = ~ sample_id)
dds_ref <- DESeqDataSetFromMatrix(
  countData = counts(ref),
  colData = colData(ref),
  design = ~ sample_id)
dds_list <- list(sim = dds_sim, data = dds_ref)
# generate 'countsimQC' report
countsimQCReport(
  ddsList = dds_list,
  outputFile = "QC.html",
  outputDir = "./",
  outputFormat = "html_document",
  maxNForCorr = 200, 
  maxNForDisp = 500)
完整的html报告将会输出在你指定的文件下。

5. 各种方法性能评估

iCOBRA包中提供了各种用于计算和可视化评估排名和二元分类(分配)方法的性能指标的功能。作者首先定义了一个包装器，它将传递给pbDS的方法作为输入，并将结果重新格式化为整齐格式的数据帧，然后将其与模拟基因metadata右连接。

# 'm' is a character string specifying a valid `pbDS` method
.run_method <- function(m) {
    res <- pbDS(pb, method = m, verbose = FALSE)
    tbl <- resDS(sim, res)
    left_join(gi, tbl, by = c("gene", "cluster_id"))
}

计算了一组方法的结果data.frames之后，作者接下来定义了一个包装器，该包装器使用COBRAData构造函数为iCOBRA的评估准备数据，并使用calculate_performance计算感兴趣的任何性能度量(通过方面指定):

# 'x' is a list of result 'data.frame's
.calc_perf <- function(x, facet = NULL) {
    cd <- COBRAData(truth = gi,
        pval = data.frame(bind_cols(map(x, "p_val"))),
        padj = data.frame(bind_cols(map(x, "p_adj.loc"))))
    perf <- calculate_performance(cd, 
        binary_truth = "is_de", maxsplit = 1e6,
        splv = ifelse(is.null(facet), "none", facet),
        aspects = c("fdrtpr", "fdrtprcurve", "curve"))
}

运行一组DS分析方法，计算它们的性能，并根据.calc_perf计算的方面绘制各种性能指标:

# simulation with all DD types
sim <- simData(ref, 
    p_dd = c(rep(0.3, 2), rep(0.1, 4)),
ng = 1e3, nc = 2e3, ns = 3, force = TRUE)
#使用simData函数模拟数据，其中p_dd参数定义了不同类型的差异表达基因的比例。
# aggregate to pseudobulks
pb <- aggregateData(sim)
#将单细胞数据聚合为伪批量数据。
# extract gene metadata
gi <- metadata(sim)$gene_info
# add truth column (must be numeric!)
gi$is_de <- !gi$category %in% c("ee", "ep")
gi$is_de <- as.numeric(gi$is_de) 
#为基因metadata添加了一个真实值列is_de，表示基因是否是差异表达的。

# specify methods for comparison & run them
# (must set names for methods to show in visualizations!)
names(mids) <- mids <- c("edgeR", "DESeq2", "limma-trend", "limma-voom")
res <- lapply(mids, .run_method)

# calculate performance measures 
# and prep. for plotting with 'iCOBRA'
library(iCOBRA)
perf <- .calc_perf(res, "cluster_id")
pd <- prepare_data_for_plot(perf)

使用前面定义的包装器.calc_perf计算性能度量，并使用prepare_data_for_plot函数准备数据以供绘图。

# plot FDR-TPR curves by cluster
plot_fdrtprcurve(pd) +
    theme(aspect.ratio = 1) +
    scale_x_continuous(trans = "sqrt") +
    facet_wrap(~ splitval, nrow = 1)

image.png

总的来说，作者在这部分使用iCOBRA包来评估和可视化不同的差异表达分析方法的性能。它首先模拟数据，然后运行不同的方法，计算性能度量，并绘制FDR-TPR曲线。

小结

在本期推文中，我们首先介绍了如何使用muscat模拟复杂的scRNA-seq数据。模拟数据是评估分析方法性能的关键，因为它可以为我们提供一个已知的“真实”答案，从而帮助我们更好地理解各种方法的优缺点。接下来，我们探讨了如何使用muscat评估不同的差异表达分析方法。差异表达分析是scRNA-seq数据分析的核心部分，选择合适的方法对于得出准确的结论是至关重要的。在下一期的推文中，我们将详细介绍如何使用muscat进行差异状态分析。

好啦，本期的分享到这里就结束了，我们下期再会~

muscat代码实操（一）：模拟复杂的 scRNA-seq 数据

前言

代码流程

0. 加载R包

1. prepSim：准备模拟数据

2. simData: Simulating complex designs

3. Simulation a hierarchical cluster structure

3.1 p_type: Introducing type features

3.2 phylo_tree: Introducing a cluster phylogeny.

4. 质量控制

5. 各种方法性能评估

小结

猜你喜欢

热点阅读