Seurat - Guided Clustering Tutor
Seurat - Guided Clustering Tutorial • Seurat (satijalab.org)
1Setup the Seurat Object
Dataset : Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. (There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.)
The Read10X()function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI)count matrix.(row:feature/gene;colum:cell)
Object:(container of data and analysis)
e.g. Count matrix is stored in pbmc[["RNA"]]@counts.
library(dplyr)
library(Seurat)
library(patchwork)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
An object of class Seurat
13714 features across 2700 samples within 1 assay
Active assay: RNA (13714 features, 0 variable features)
Data in cout matrix
# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
3 x 30 sparse Matrix of class "dgCMatrix"
CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
这个矩阵中的"."表示0(未检测到分子)。由于scRNA序列矩阵中的大多数值为0,Seurat尽可能使用稀疏矩阵表示。这为Drop seq/inDrop/10x数据节省了大量内存和速度。
dense.size <- object.size(as.matrix(pbmc.data))
dense.size
# 709591472 bytes
sparse.size <- object.size(pbmc.data)
sparse.size
## 29905192 bytes
dense.size/sparse.size
## 23.7 bytes
2 Standard pre-processing workflow
These represent the selection and filtration of cells based on QC metrics, data,normalization and scaling, and the detection of highly variable features.
2.1 QC and selecting cells for further analysis
指标:
1 The number of unique genes detected in each cell.
- Low-quality cells or empty droplets will often have very few genes
- Cell doublets or multiplets may exhibit an aberrantly high gene count
2 The total number of molecules detected within a cell (correlates strongly with unique genes)
3 The percentage of reads that map to the mitochondrial genome
- Low-quality / dying cells often exhibit extensive mitochondrial contamination
- We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features
- We use the set of all genes starting with
MT-
as a set of mitochondrial genes
指标相关数据储存在meta data
Where are QC metrics stored in Seurat?
The number of unique genes and total molecules are automatically calculated during CreateSeuratObject()
- object meta data
# Show QC metrics for the first 5 cells
head(pbmc@meta.data, 5)
orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
Visualize QC metrics, and use these to filter cells.
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
image.png
image.png
- Filter cells that have unique feature counts over 2,500 or less than 200 ,and have >5% mitochondrial counts
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
2.2 Normalizing the data
默认用 global-scaling normalization method “LogNormalize”,对所有expression进行标准化。
Normalized values are stored in pbmc[["RNA"]]@data.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
#="pbmc <- NormalizeData(pbmc)"
2.3 Identification of highly variable features (feature selection)
找出数据集中表现出高细胞间变异的特征基因(HVG)集(在某些细胞中高表达,而在其他细胞中低表达)。 在下游分析中关注这些基因有助于突出单细胞数据集中的生物信号。
detail详细描述了在Seurat中的步骤。通过直接建模单细胞数据中固有的均值-方差关系,并在FindVariableFeatures()函数中实现。默认情况下,每个数据集返回2000个特征,可用于下游分析,如PCA。
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot
image.png
(这里+运行报错 改为|)
2.4 Scaling the data
linear transformation (‘scaling’) 是降维分析前的标准预处理流程。(均值为0,方差为1)(使不同细胞的同一基因的表达水平具有可比性)
TheScaleData() function:
- Shifts the expression of each gene, so that the mean expression across cells is 0
- Scales the expression of each gene, so that the variance across cells is 1
- This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
- The results of this are stored in
pbmc[["RNA"]]@scale.data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
#vs
#pbmc <- ScaleData(pbmc)
#默认对于识别的2000 variable features scaling
#PCA不受影响,但热图里的基因需要被scaled)
- remove unwanted sources of variation
(e.g. cell cycle stage or mitochondrial contamination:As with ScaleData(), the function SCTransform() also includes avars.to.regress
parameter.
3 Perform linear dimensional reduction
对scaled data 进行PCA,默认input是variable features,用参数features进行调整。
- Visualization:VizDimReduction(), DimPlot(), and DimHeatmap()
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
PC_ 1
Positive: CST3, TYROBP, LST1, AIF1, FTL
Negative: MALAT1, LTB, IL32, IL7R, CD2
PC_ 2
Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
Negative: NKG7, PRF1, CST7, GZMB, GZMA
PC_ 3
Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
Negative: PPBP, PF4, SDPR, SPARC, GNG11
PC_ 4
Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
Negative: VIM, IL7R, S100A6, IL32, S100A8
PC_ 5
Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
Negative: LTB, IL7R, CKB, VIM, MS4A7
image.png
image.png
热图:DimHeatmap()允许轻松探索数据集中异质性的主要来源,并且在尝试决定要用哪些PC以进行进一步的下游分析时非常有用。细胞和features都是根据其PCA score排序的。将cells设置为一个数字,展现极端细胞,这将大大加快绘制大型数据集的速度。虽然这显然是一个有监督的分析,但我们发现这是探索相关特征集的一个有价值的工具。
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
image.png
image.png
3.1 Determine the ‘dimensionality’ of the dataset
为了克服scRNA-seq数据的任一feature中的广泛技术噪声,Seurat根据它们的PCA分数对细胞进行聚类,每个PC基本上代表一个“metafeature”(correlated feature set中的信息组合)。因此,top主成分代表数据集的 robust compression 。
然而,我们应该选择包含多少PC?10? 20? 100?
In Macosko et al,,我们实施了一项受JackStraw程序启发的重采样测试。我们随机排列数据子集(默认情况下为1%),重新运行PCA,构建特征分数的“零分布”,并重复此过程。我们认为“重要”PC是指那些具有富集的的低p值的PC。
JackStrawPlot()函数提供了一个可视化工具,用于比较每个PC的p值分布和均匀分布(虚线)。'“significant”PC将显示出strong enrichment of features with low p-values(虚线上方的实线)。在这种情况下,在前10-12个PC之后,其重要性似乎急剧下降。
# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
image.png
另一种启发式方法生成“肘部图”:根据the percentage of variance explained by each one 对主成分进行排序(ElbowPlot())。在这个例子中,我们可以观察到PC9-10周围有一个“肘部”,这表明大多数真实信号是在前10个PC中捕获的。
ElbowPlot(pbmc)
image.png
4 Cluster the cells
Seurat v3 applies a graph-based clustering approach, building upon initial strategies in (Macosko et al)
驱动聚类分析的距离度量(基于之前确定的PC)保持不变。然而,我们将细胞距离矩阵划分为簇的方法有了显著的改进。
简单地说,这些方法将单元嵌入到一个图结构中——例如一个K-nearest neighbor (KNN) graph,在具有相似特征表达模式的单元之间绘制边,然后尝试将该图划分为高度互联的“准集团”或“社区”。
与PhenoGraph一样,我们首先基于PCA空间中的欧几里德距离构造KNN图,并基于其局部邻域中的共享重叠(Jaccard相似性)来细化任意两个单元之间的边缘权重。此步骤使用FindNeighbors()函数执行,并将之前定义的数据集维度(前10个)作为输入。
为了对x细胞进行聚类,我们接下来应用模块化优化技术,如Louvain算法(默认)或SLM[SLM,Blondel等人,统计力学杂志],以迭代方式将细胞分组在一起,目的是优化标准模块化函数。FindClusters()函数实现了这个过程,并包含一个resolution参数,用于设置下游集群的“粒度”,增加值会导致更多集群。我们发现,将该参数设置在0.4-1.2之间,对于约3K个单元的单单元数据集,通常会返回良好的结果。对于较大的数据集,最佳分辨率通常会增加。可以使用Idents()函数找到cluster。
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
2 3 2 1 6
Levels: 0 1 2 3 4 5 6 7 8
Run non-linear dimensional reduction (UMAP/tSNE)
suggest using the same PCs as input to the clustering analysis.
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")
saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")
image.png
Finding differentially expressed features (cluster biomarkers)
Seurat可以帮助你找到通过差异表达定义clusters的markers。默认情况下,与所有其他单元格相比,它可以识别单个簇(在ident.1中指定)的正标记和负标记。
FindAllMarkers()自动对所有集群执行此过程,但您也可以对集群组进行相互测试,或针对所有单元进行测试。
- min.pct参数要求在两组细胞中的任一个中以最小百分比检测到一个特征。
- thresh.test参数要求一个feature在两组之间以一定的成都进行差异表达。您可以将这两个值都设置为0,但时间的会急剧增加——因为这将测试大量不太具有高度筛选的功能。
- max.cells.per.ident:作为加速这些计算的另一个选择,可以设置max.cells.per.ident参数这将对每个identity class 进行降采样,使其细胞数数不超过设置的值。虽然通常会出现a loss in power,但速度的提高可能会非常显著,而且表达差异最大的功能可能仍会上升到top。
# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
p_val avg_log2FC pct.1 pct.2 p_val_adj
IL32 2.892340e-90 1.2013522 0.947 0.465 3.966555e-86
LTB 1.060121e-86 1.2695776 0.981 0.643 1.453850e-82
CD3D 8.794641e-71 0.9389621 0.922 0.432 1.206097e-66
IL7R 3.516098e-68 1.1873213 0.750 0.326 4.821977e-64
LDHB 1.642480e-67 0.8969774 0.954 0.614 2.252497e-63
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
p_val avg_log2FC pct.1 pct.2 p_val_adj
FCGR3A 8.246578e-205 4.261495 0.975 0.040 1.130936e-200
IFITM3 1.677613e-195 3.879339 0.975 0.049 2.300678e-191
CFD 2.401156e-193 3.405492 0.938 0.038 3.292945e-189
CD68 2.900384e-191 3.020484 0.926 0.035 3.977587e-187
RP11-290F20.3 2.513244e-186 2.720057 0.840 0.017 3.446663e-182
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
image.png
- test.use: DE vignette for details
e.g.:cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
he ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).
Visualizing marker expression
选取感兴趣的基因(数据库、文献),查看他们在不同cluster中的表达情况
VlnPlot()(shows expression probability distributions across clusters), and FeaturePlot()(visualizes feature expression on a tSNE or PCA plot) are our most commonly used visualizations. We also suggest exploring RidgePlot(), CellScatter(), and DotPlot()as additional methods to view your dataset.
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
![image.png](https://img.haomeiwen.com/i27913461/7351e44ab5734d13.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
image.png
···
pbmc.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
image.pngAssigning cell type identity to clusters
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")
image.png