Seurat-Tutorial part3学习
Part2 中我们对进行归一化的数据进行了降维处理,接下来就是对数据进行以下的操作了
确定数据集的维数
我们要确定有多少个Principal Components(PCs),鉴定significant PCs方法是----那些富集了低 p-value的Features
这里使用了2个函数来帮助我们挑选合适的PCs数
- 1.
JackStrawPlot()
We randomly permute a subset of the data (1% by default) and rerun PCA, constructing a ‘null distribution’ of feature scores, and repeat this procedure. We identify ‘significant’ PCs as those who have a strong enrichment of low p-value features.
我们随机置换数据的一部分(默认为1%),然后重新运行PCA,以构建feature Score的“零分布”,然后重复此过程。 我们将“重要”的PC识别为具有丰富的低p-value特征基因的PC。特征重要性是给数据中每个特征打一个分数,分数越高说明特征越重要。通过使用模型的要素重要性属性,可以获取数据集每个要素的要素重要性。
# NOTE: This process can take a long time for big datasets, comment out for expediency.
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce computation time
#推测这个方法计算耗时,相对来说要更准确些,但是可以使用ElbowPlot做近似计算
pbmc <- JackStraw(pbmc, num.replicate = 100)
#|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05m 39s
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)#
https://satijalab.org/seurat/reference/jackstraw
JackStraw(
object,#Seurat object
reduction = "pca",
assay = NULL,
dims = 20,#Number of PCs to compute significance for
num.replicate = 100,#Number of replicate samplings to perform
prop.freq = 0.01,
verbose = TRUE,
maxit = 1000
)
对上面一步得到的pbmc可视化和结果分析
JackStrawPlot(pbmc, dims = 1:15)
JackStrawPlot-dim选择15
JackStrawPlot()
函数提供了一个可视化工具,用于比较挑选了的PC和均匀分布的PC之间的差异。dim 15的时候,发现是PC在前10~12之后,重要性显著下降。 对于dim 20的情况下,增加到15之后,基本上是低于这个均匀分别虚线值的(原因不明白)
虚线:PC的p-values是均匀分布的,没有进行挑选
JackStrawPlot(pbmc, dims = 1:20)#这里的dims参数必须是小于JackStraw函数计算的PCs数,默认是20,最多是20,可以自行设置
JackStrawPlot-dim选择20
- 2.
ElbowPlot
使用这个函数可以替代上面一个函数的功能,得到一个大概的主成分数量值
ElbowPlot(pbmc)#pbmc数据是来自之前RunPCA函数运行后的数据
#a ranking of principle components based on the percentage of variance explained by each one
#we can observe an ‘elbow’ around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs
ElbowPlot
鉴定数据集的真实维度是很不确定的,Seurat的作者是建议考虑3种方法。第一个受到更多监督,探索PC以确定异质性的相关来源,例如可以与GSEA结合使用。 第二种实现基于随机空模型的统计测试,但是对于大型数据集而言非常耗时,并且可能无法返回清晰的PC截止值。 第三种是常用的启发式方法,可以立即进行计算。
这里作者是选择了10个 ,但是建议考虑下面几个因素
- 树突状细胞和NK爱好者或许能认知到一点:与这两个维度(PCs12和13)的主成分密切相关的基因 或许能定义罕见的免疫亚群(如MZB1这个基因就是浆细胞样DC的Marker),但是这些都特别少于,在没有先验知识的情况下,很难从背景噪音中区分
- 建议使用不同数量的PC(10~50)重复进行下游分析,结果通常没有太大的区别
- 建议使用更多的dim进行分析。对本例子,如果只选择10PCs进行下游分析,则会对结果更不利
Cluster the cells
- we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the
FindNeighbors()
function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs).
pbmc <- FindNeighbors(pbmc, dims = 1:10) #这个数值是之前所选择的PCs数量
##Computing nearest neighbor graph
##Computing SNN
- To cluster the cells, we next apply modularity optimization techniques such as the Louvain algorithm (default) or SLM http://dx.doi.org/10.1088/1742-5468/2008/10/P10008
to iteratively group cells together, with the goal of optimizing the standard modularity function
pbmc <- FindClusters(pbmc, resolution = 0.5)#resolution值越大,会得到更多的Clusters;对于数目是3K个细胞的数据集来说设置为0.4~1.2之间,均能返回良好结果
##Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##Number of nodes: 2638
##Number of edges: 95965
##Running Louvain algorithm...
##0% 10 20 30 40 50 60 70 80 90 100%
##[----|----|----|----|----|----|----|----|----|----|
#**************************************************|
##Maximum modularity in 10 random starts: 0.8723
##Number of communities: 9
##Elapsed time: 0 seconds
The clusters can be found using the Idents() function.
head(Idents(pbmc), 5)
AAACATACAACCAC-1
2
AAACATTGAGCTAC-1
3
AAACATTGATCAGC-1
2
AAACCGTGCTTCCG-1
1
AAACCGTGTATGCG-1
6
Levels: 0 1 2 3 4 5 6 7 8
非线性降维 (UMAP/tSNE)
The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.
把上一步基于图的聚类中的细胞共位于这些降维图中。作为UMAP 和 tSNE的输入数据,是建议使用 Cluster the cells 那一步得到的PCs来分析聚类;图上每个点代表一个细胞。
pbmc <- RunUMAP(pbmc, dims = 1:10, label = "T")
21:34:50 UMAP embedding parameters a = 0.9922 b = 1.112
21:34:50 Read 2638 rows and found 10 numeric columns
21:34:50 Using Annoy for neighbor search, n_neighbors = 30
21:34:50 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:34:50 Writing NN index file to temp file C:\Users\ADMINI~1\AppData\Local\Temp\RtmpKYasTJ\file260382e791d
21:34:50 Searching Annoy index using 1 thread, search_k = 3000
21:34:52 Annoy recall = 100%
21:34:52 Commencing smooth kNN distance calibration using 1 thread
21:34:52 Initializing from normalized Laplacian + noise
21:34:53 Commencing optimization for 500 epochs, with 105124 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:35:03 Optimization finished
UMAP
https://satijalab.org/seurat/reference/runumap
RunUMAP
函数使用
RunUMAP(object, ...)
# S3 method for default
RunUMAP(
object,
reduction.key = "UMAP_",
assay = NULL,
reduction.model = NULL,
return.model = FALSE,
umap.method = "uwot",
n.neighbors = 30L,
n.components = 2L,
metric = "cosine",
n.epochs = NULL,
learning.rate = 1,
min.dist = 0.3,
spread = 1,
set.op.mix.ratio = 1,
local.connectivity = 1L,
repulsion.strength = 1,
negative.sample.rate = 5,
a = NULL,
b = NULL,
uwot.sgd = FALSE,
seed.use = 42,###这个可以默认下,防止运行后图片有点差异,但不影响整体结果展示
metric.kwds = NULL,
angular.rp.forest = FALSE,
verbose = TRUE,
...
)
另外一种方法非线性降维
pbmc <- RunTSNE(pbmc, dims = 1:10,seed.use = 42, label = "T")
DimPlot(pbmc, reduction = "tsne")
RunTSNE
视觉上给人的感觉是UMAP更加紧凑点,细胞的点更聚集一点
Finding differentially expressed features (cluster biomarkers)
怎么给上面的图-聚类的clustering进行标注,也就是怎么找到对应的Bio-Marker
Seurat can help you find markers that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
使用函数FindMarkers()
https://satijalab.org/seurat/reference/findmarkers
Finds markers (differentially expressed genes) for identity classes
FindMarkers(object, ...)
# S3 method for default
FindMarkers(
object,
slot = "data",
counts = numeric(),
cells.1 = NULL,
cells.2 = NULL,
features = NULL,
logfc.threshold = 0.25,
test.use = "wilcox",
min.pct = 0.1,
min.diff.pct = -Inf,
verbose = TRUE,
only.pos = FALSE,
max.cells.per.ident = Inf,
random.seed = 1,
latent.vars = NULL,
min.cells.feature = 3,
min.cells.group = 3,
pseudocount.use = 1,
fc.results = NULL,
...
)
counts Count matrix if using scale.data for DE tests. This is used for computing pct.1 and pct.2 and for filtering features based on fraction expressing
logfc.threshold Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. Default is 0.25 Increasing logfc.threshold speeds up the function, but can miss weaker signals.
ident.1 Identity class to define markers for; pass an object of class phylo or 'clustertree' to find markers for a node in a cluster tree; passing 'clustertree' requires BuildClusterTree to have been run
min.pct only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. Meant to speed up the function by not testing genes that are very infrequently expressed. Default is 0.1
only.pos Only return positive markers (FALSE by default)
对于本例子,使用这个来找Marker
- 比如可以找下cluster2 的marker
# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
##IL32 2.593535e-91 1.2154360 0.949 0.466 3.556774e-87
##LTB 7.994465e-87 1.2828597 0.981 0.644 1.096361e-82
##CD3D 3.922451e-70 0.9359210 0.922 0.433 5.379250e-66
##IL7R 1.130870e-66 1.1776027 0.748 0.327 1.550876e-62
##LDHB 4.082189e-65 0.8837324 0.953 0.614 5.598314e-61
- 找到cluster5和cluster0~3之间的差异基因
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
##|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=27s
head(cluster5.markers, n = 5)
head(cluster5.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
##FCGR3A 2.150929e-209 4.267579 0.975 0.039 2.949784e-205
##IFITM3 6.103366e-199 3.877105 0.975 0.048 8.370156e-195
##CFD 8.891428e-198 3.411039 0.938 0.037 1.219370e-193
##CD68 2.374425e-194 3.014535 0.926 0.035 3.256286e-190
##RP11-290F20.3 9.308287e-191 2.722684 0.840 0.016 1.276538e-186
- 找所有的Clusters 的 Marker
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)# Only return positive markers
##Calculating cluster 0
##For a more efficient implementation of the Wilcoxon Rank Sum Test,
##(default method for FindMarkers) please install the limma package
--------------------------------------------
install.packages('BiocManager')
BiocManager::install('limma')
--------------------------------------------
##After installation of limma, Seurat will automatically use the more efficient implementation (no further action necessary).
##This message will be shown once per session
##|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=10s
##Calculating cluster 1
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=21s
##Calculating cluster 2
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=12s
##Calculating cluster 3
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=09s
##Calculating cluster 4
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=12s
##Calculating cluster 5
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=30s
##Calculating cluster 6
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=24s
##Calculating cluster 7
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=32s
##Calculating cluster 8
##|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=19s
对得出的pbmc.markers数据进行查看
pbmc.markerstable(pbmc.markers$cluster)
# 0 1 2 3 4 5 6 7 8
#163 391 171 147 171 608 363 633 242
%>%
数据传递
top_n(x, n, wt)
#x A data frame
#n Number of rows to return for top_n(), fraction of rows to return for top_frac().
#If n is positive, selects the top rows. If negative, selects the bottom rows.
#If x is grouped, this is the number (or fraction) of rows per group. Will include more rows if there are ties.
#wt (Optional). The variable to use for ordering. If not specified, defaults to the last variable in the tbl.
首先pbmc.markers按照avg_log2FC对Features进行排名,并提取前2行
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
得到的结果是这个
挑选出来的top2基因,每个cluster有2行,总共18行
对某些基因进行查看
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
image.png
# you can plot raw counts as well
VlnPlot(pbmc, features = c("MS4A1", "CD79A"), slot = "counts", log = TRUE)
对其counts值画图
对于这个图,"MS4A1"和 "CD79A"这两个都聚在cluster3中,都能代表cluster3
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
随意取出的基因Features
对于NKG7这个基因主要存在cluster4,6,中;对于PF4这个feature是聚在cluster9中的,从上述的表格中18行基因第17个PF4中可以看出
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
对这9个基因进行cluster映射
把每个cluster的top基因进行扩大到10个,而非上述的2个
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)#和上述的比较类似,n=10
dim(top10)
#[1] 90 7
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
top10-cluster0-CCR7
top10-cluster1-CD14,LYZ
top10-cluster2-IL7R
top10-cluster3-MS4A1
top10-cluster4-CD8A
top10-cluster5-FCGR3A
top10-cluster6-GNLY
top10-cluster7-FCER1A
top10-cluster8-PPBP
Cluster ID | Markers | Cell Type |
---|---|---|
0 | IL7R, CCR7 | Naive CD4+ T |
1 | CD14, LYZ | CD14+ Mono |
2 | IL7R, S100A4 | Memory CD4+ |
3 | MS4A1 | B |
4 | CD8A | CD8+ T |
5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
6 | GNLY, NKG7 | NK |
7 | FCER1A, CST3 | DC |
8 | PPBP | Platelet |
特别幸运的是,上述9个cluster的top10 gene都能找到对应的Cell Type
疑问是有同一个gene对应2个cluster,但是仅仅代表一个 Cell type该怎么选择呢
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
对top10挑选出的90个基因进行热图绘制
最后一步,将对应的Cluster和Cell Type对应在图上
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
没有对Cluster ID和细胞类型对映的图
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
Cluster ID和细胞类型对映的图
https://www.cell.com/fulltext/S0092-8674(15)00549-8
https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
https://blog.csdn.net/weixin_39549312/article/details/111391539