Seurat—Integrating scRNA-seq and

2022-04-21 本文已影响0人重拾生活信心

单细胞转录组学已经改变了我们表征细胞状态的能力，但深入的生物学理解需要的不仅仅是一个clusters的分类列表。随着测量不同细胞modalitie的新方法出现，一个关键的分析挑战是整合这些数据集，以更好地理解细胞的身份和功能。例如，用户可以在同一生物系统上进行scRNA-seq和scATAC-seq实验，并使用相同的细胞类型标签对两个数据集进行注释。
这种联合分析尤其具有挑战性，因为scATAC-seq数据集很难注释，这是因为在单细胞分辨率下收集的基因组数据稀少，并且在scRNA-seq数据中缺乏可解释的基因标记。
在 Stuart, Butler et al, 2019,中，我们介绍了整合从同一生物系统收集的scRNA-seq和scATAC-seq数据集的方法，并在此vignette展示了这些方法。
我们展示了以下分析：

如何使用带注释的scRNA-seq数据集标记scATAC-seq实验中的细胞
如何 co-visualize (co-embed)来自scRNA-seq和scATAC-seq的细胞
如何将scATAC-seq细胞投射到来源于scRNA-seq实验的UMAP上

Signac package:for analysis of chromatin datasets collected at single-cell resolution

我们使用10x Genomics中公开的约12000个人PBMC“multiome”数据集演示了这些方法。在该数据集中，在同一细胞中同时收集了scRNA-seq和scATAC-seq图谱。
出于这个vignettes的目的，我们将数据集视为来自两个不同实验的数据集，并将它们整合在一起。由于它们最初是在同一个细胞中测量的，这提供了一个基本事实，我们可以用来评估整合的准确性。我们强调，我们在这里使用多组数据集是为了演示和评估目的，用户应将这些方法应用于单独收集的scRNA-seq和scATAC-seq数据集。
我们提供了一个单独的 weighted nearest neighbors vignette (WNN)
，它描述了多组单细胞数据的分析策略。

Load in data and process each modality individually

载入数据并分别处理scRNA-seq和scATAC-seq。数据在SeuratData包里，里面包括了WNN vignette 产生的注释数据。
提前下载好需要的包，再载入。

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Signac")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("EnsDb.Hsapiens.v86")

library(SeuratData)
# install the dataset and load requirements
InstallData("pbmcMultiome")

library(Seurat)
library(Signac)
library(EnsDb.Hsapiens.v86)
library(ggplot2)
library(cowplot)

# load both modalities
pbmc.rna <- LoadData("pbmcMultiome", "pbmc.rna")
pbmc.atac <- LoadData("pbmcMultiome", "pbmc.atac")

# repeat QC steps performed in the WNN vignette
pbmc.rna <- subset(pbmc.rna, seurat_annotations != "filtered")
pbmc.atac <- subset(pbmc.atac, seurat_annotations != "filtered")

# Perform standard analysis of each modality independently RNA analysis
pbmc.rna <- NormalizeData(pbmc.rna)
pbmc.rna <- FindVariableFeatures(pbmc.rna)
pbmc.rna <- ScaleData(pbmc.rna)
pbmc.rna <- RunPCA(pbmc.rna)
pbmc.rna <- RunUMAP(pbmc.rna, dims = 1:30)

scATAC-seq analysis

# ATAC analysis add gene annotation information
#pbmc.atac@assays$ATAC@key <- "atac_"
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- "UCSC"
genome(annotations) <- "hg38"
Annotation(pbmc.atac) <- annotations

# We exclude the first dimension as this is typically correlated with sequencing depth
pbmc.atac <- RunTFIDF(pbmc.atac)
#使用RunTFIDF函数进行数据归一化
pbmc.atac <- FindTopFeatures(pbmc.atac, min.cutoff = "q0")
# 使用FindTopFeatures函数选择可变的features
pbmc.atac <- RunSVD(pbmc.atac)
# 使用RunSVD函数进行线性降维
pbmc.atac <- RunUMAP(pbmc.atac, 
reduction = "lsi", dims = 2:30, 
reduction.name = "umap.atac", 
reduction.key = "atacUMAP_")

p1 <- DimPlot(pbmc.rna, group.by = "seurat_annotations", label = TRUE) + NoLegend() + ggtitle("RNA")
p2 <- DimPlot(pbmc.atac, group.by = "orig.ident", label = FALSE) + NoLegend() + ggtitle("ATAC")
p1 + p2

image.png

Identifying anchors between scRNA-seq and scATAC-seq datasets

为了确定scRNA-seq和scATAC-seq实验之间的“anchor”，我们首先使用Signac包中的 GeneActivity()函数，通过quantifying ATAC-seq counts in the 2 kb-upstream region and gene body，对每个基因的转录活性进行粗略估计。来自scATAC-seq数据的基因活性分数与来自scRNA-seq的基因表达定量一起用作典型相关分析(CCA)的input。我们对从scRNA-seq数据集中确定为高度可变的所有基因进行量化。

# quantify gene activity
gene.activities <- GeneActivity(pbmc.atac, features = VariableFeatures(pbmc.rna))

# add gene activities as a new assay
pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = gene.activities)

# normalize gene activities
DefaultAssay(pbmc.atac) <- "ACTIVITY"
pbmc.atac <- NormalizeData(pbmc.atac)
pbmc.atac <- ScaleData(pbmc.atac, features = rownames(pbmc.atac))

# Identify anchors
transfer.anchors <- FindTransferAnchors(reference = pbmc.rna, query = pbmc.atac, features = VariableFeatures(object = pbmc.rna),
    reference.assay = "RNA", query.assay = "ACTIVITY", reduction = "cca")

Annotate scATAC-seq cells via label transfer

在确定anchor之后，我们可以将注释从scRNA-seq数据集转移到scATAC-seq单元。注释作为输入存储在seurat_annotations中。输出将包含一个矩阵，其中包含每个ATAC celltype.predictions和置信度得分。

celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = pbmc.rna$seurat_annotations,
    weight.reduction = pbmc.atac[["lsi"]], dims = 2:30)

pbmc.atac <- AddMetaData(pbmc.atac, metadata = celltype.predictions)

Why do you choose different (non-default) values for reduction and weight.reduction?

在FindTransferAnchors()中,我们通常在scRNA seq数据集之间transfer时，将PCA结构从reference投影到query上。
然而，当跨模式（across modalities）transfer时，我们发现CCA更好地捕获shared feature correlation structure，因此在这里设置reduction='CCA'。
此外，默认情况下，在TransferData（）中，我们使用相同的projected PCA结构来计算影响每个细胞的预测的锚的局部邻域的权重。在scRNA-seq传输到scATAC-seq时，我们使用通过计算ATAC-seq数据上的LSI而获得的低维空间来计算这些权重，因为这样可以更好地捕捉ATAC-seq数据的内部结构。

在执行transfer后，ATAC-seq细胞已将预测的注释（来自scRNA-seq）存储 predicted.id field。
由于这些细胞是用multiome试剂盒测量的，我们也有一个可用于评估的实际注释。您可以看到，预测注释和实际注释非常相似。

pbmc.atac$annotation_correct <- pbmc.atac$predicted.id == pbmc.atac$seurat_annotations
p1 <- DimPlot(pbmc.atac, group.by = "predicted.id", label = TRUE) + NoLegend() + ggtitle("Predicted annotation")
p2 <- DimPlot(pbmc.atac, group.by = "seurat_annotations", label = TRUE) + NoLegend() + ggtitle("Ground-truth annotation")
p1 | p2

predicter vs ground-truth

在本例中，scATAC-seq配置文件的注释在90%的时间内通过scRNA-seq整合正确预测。除此之外，prediction.score.max量化了与预测注释相关的不确定性。我们可以看到，正确注释的单元格通常与较高的预测分数（>90%）相关，而错误注释的单元格与较低的预测分数（<50%）相关。错误的分配也往往反映出密切相关的细胞类型（i.e.中间型和幼稚型B细胞）。

predictions <- table(pbmc.atac$seurat_annotations, pbmc.atac$predicted.id)
predictions <- predictions/rowSums(predictions)  # normalize for number of cells in each cell type
predictions <- as.data.frame(predictions)
p1 <- ggplot(predictions, aes(Var1, Var2, fill = Freq)) + geom_tile() + scale_fill_gradient(name = "Fraction of cells",
    low = "#ffffc8", high = "#7d0025") + xlab("Cell type annotation (RNA)") + ylab("Predicted cell type label (ATAC)") +
    theme_cowplot() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

correct <- length(which(pbmc.atac$seurat_annotations == pbmc.atac$predicted.id))
incorrect <- length(which(pbmc.atac$seurat_annotations != pbmc.atac$predicted.id))
data <- FetchData(pbmc.atac, vars = c("prediction.score.max", "annotation_correct"))
p2 <- ggplot(data, aes(prediction.score.max, fill = annotation_correct, colour = annotation_correct)) +
    geom_density(alpha = 0.5) + theme_cowplot() + scale_fill_discrete(name = "Annotation Correct",
    labels = c(paste0("FALSE (n = ", incorrect, ")"), paste0("TRUE (n = ", correct, ")"))) + scale_color_discrete(name = "Annotation Correct",
    labels = c(paste0("FALSE (n = ", incorrect, ")"), paste0("TRUE (n = ", correct, ")"))) + xlab("Prediction Score")
p1 + p2

image.png

Co-embedding scRNA-seq and scATAC-seq datasets

  # note that we restrict the imputation to variable genes from scRNA-seq, but could impute the
# full transcriptome if we wanted to
genes.use <- VariableFeatures(pbmc.rna)
refdata <- GetAssayData(pbmc.rna, assay = "RNA", slot = "data")[genes.use, ]

# refdata (input) contains a scRNA-seq expression matrix for the scRNA-seq cells.  imputation
# (output) will contain an imputed scRNA-seq matrix for each of the ATAC cells
imputation <- TransferData(anchorset = transfer.anchors, refdata = refdata, weight.reduction = pbmc.atac[["lsi"]],
    dims = 2:30)
pbmc.atac[["RNA"]] <- imputation

coembed <- merge(x = pbmc.rna, y = pbmc.atac)

# Finally, we run PCA and UMAP on this combined object, to visualize the co-embedding of both
# datasets
coembed <- ScaleData(coembed, features = genes.use, do.scale = FALSE)
coembed <- RunPCA(coembed, features = genes.use, verbose = FALSE)
coembed <- RunUMAP(coembed, dims = 1:30)

DimPlot(coembed, group.by = c("orig.ident", "seurat_annotations"))

image.png

Seurat—Integrating scRNA-seq and

Load in data and process each modality individually

Identifying anchors between scRNA-seq and scATAC-seq datasets

Annotate scATAC-seq cells via label transfer

Co-embedding scRNA-seq and scATAC-seq datasets

猜你喜欢

热点阅读