单细胞数据挖掘3-clustering

2020-05-19 本文已影响0人大吉岭猹

1. 写在前面

参考教程：https://hbctraining.github.io/scRNA-seq/

2. Normalization, variance stabilization, and regression of unwanted variation for each sample

使用的是 Seurat 中的方法 sctransform

2.1. Cell cycle scoring

sctransform 会默认抹平测序深度差异，但是细胞周期差异也需要抹除，因此在 sctransform 之前先检查一下细胞周期的情况

# 先进行一个粗略的归一化（除以每个细胞的总count再取自然对数）
seurat_phase <- NormalizeData(filtered_seurat)

## 拿到细胞周期相关基因list
## for HS
# # Load cell cycle markers
# load("data/cycle.rda")

## for MM
# cc_file <-
#   getURL("https://raw.githubusercontent.com/hbc/tinyatlas/master/cell_cycle/Mus_musculus.csv")
# cell_cycle_genes <- read.csv(text = cc_file)
# 可以采用上面的方式直接从网页拿，实际操作的时候服务器的网出了点问题，所以就在PC上直接下好了，很小的文件
cell_cycle_genes = read.csv("data/Mus_musculus.csv")
g2m_genes = cell_cycle_genes[cell_cycle_genes[,1]=="G2/M",2]
s_genes = cell_cycle_genes[cell_cycle_genes[,1]=="S",2]

# ID转换
library(clusterProfiler)
library(org.Mm.eg.db)
keytypes(org.Mm.eg.db)
g2m_symbol = bitr(g2m_genes, fromType="ENSEMBL", toType="SYMBOL", OrgDb="org.Mm.eg.db")
duplicated(g2m_symbol[,2])
s_symbol = bitr(s_genes, fromType="ENSEMBL", toType="SYMBOL", OrgDb="org.Mm.eg.db")

# Score cells for cell cycle
seurat_phase <- CellCycleScoring(seurat_phase,
                                 g2m.features = g2m_symbol[,2],
                                 s.features = s_symbol[,2])

# View cell cycle scores and phases assigned to cells
View(seurat_phase@meta.data)

在对每个细胞进行一个细胞周期评分之后，需要判断细胞周期是否是基因表达量差异的主要因素，因此需要进行 PCA
在 PCA 之前，需先选取 most variable features，然后对数据进行标准化，使得任一基因在所有细胞中的表达量均值为 0，方差为 1
然后对三种细胞周期的细胞分别进行 PCA

# Identify the most variable genes
seurat_phase <- FindVariableFeatures(seurat_phase,
                                     selection.method = "vst",
                                     nfeatures = 2000,
                                     verbose = FALSE)

# Scale the counts
seurat_phase <- ScaleData(seurat_phase)

# Perform PCA
seurat_phase <- RunPCA(seurat_phase)

# Plot the PCA colored by cell cycle phase
DimPlot(seurat_phase,
        reduction = "pca",
        group.by= "Phase",
        split.by = "Phase")

可以看到 G1 期和非 G1 期有差异，说明细胞周期基因是高变成分，后面的 SCTransform 中需要把细胞周期差异去掉

2.2. SCTransform

首先将 Seurat 对象按 sample 分开，分别对每个 sample 的细胞进行 NormalizeData, CellCycleScoring, SCTransform。

options(future.globals.maxSize = 4000 * 1024^2)
# Split seurat object by condition to perform cell cycle scoring and SCT on all samples
split_seurat <- SplitObject(filtered_seurat, split.by = "sample")

split_seurat <- split_seurat[c("sigaf1", "sigag1", "sigah1")]

for (i in 1:length(split_seurat)) {
  split_seurat[[i]] <- NormalizeData(split_seurat[[i]], verbose = TRUE)
  split_seurat[[i]] <- CellCycleScoring(split_seurat[[i]],
                                        g2m.features = g2m_symbol[,2],
                                        s.features = s_symbol[,2])
  split_seurat[[i]] <- SCTransform(split_seurat[[i]],
                                   vars.to.regress = c("mitoRatio","S.Score","G2M.Score"))
}

该流程中注意：若细胞数较大（如远大于15000），则可以将 SCTransform() 函数中的 variable.features.n 参数设置得大一些（默认3000）
可以看到差别还是很大的，需要整合。

3. Integration of the samples using shared highly variable genes (optional, but recommended)

# Select the most variable features to use for integration
integ_features <- SelectIntegrationFeatures(object.list = split_seurat,
                                            nfeatures = 3000)
# Prepare the SCT list object for integration
split_seurat <- PrepSCTIntegration(object.list = split_seurat,
                                   anchor.features = integ_features)
# Find best buddies - can take a while to run
integ_anchors <- FindIntegrationAnchors(object.list = split_seurat,
                                        normalization.method = "SCT",
                                        anchor.features = integ_features)
seurat_integrated <- IntegrateData(anchorset = integ_anchors,
                                   normalization.method = "SCT")
# Save integrated seurat object
save(seurat_integrated, file = "data/seurat_integrated.Rdata")

# 可视化看一下效果
# Run PCA
seurat_integrated <- RunPCA(object = seurat_integrated)

# Plot PCA
PCAPlot(seurat_integrated,
        split.by = "sample")

# Run UMAP
seurat_integrated <- RunUMAP(seurat_integrated,
                             dims = 1:40,
                             reduction = "pca")

# Plot UMAP
DimPlot(seurat_integrated)
DimPlot(seurat_integrated,
        split.by = "sample")

4. Clustering cells based on top PCs (metagenes)

4.1. 选择高变 PC

常规两种方法
一是以最高变的 PC 和细胞作为行和列画热图，看从哪个 PC 开始界线变得模糊，这个方法的缺点是如果要看大量 PC 的话很不方便

# Explore heatmap of PCs
DimHeatmap(seurat_integrated,
           dims = 1:9,
           cells = 500,
           balanced = TRUE)

二是画 elbow plot，看拐点出现的位置，这个方法的缺点是没有明确的拐点标准，非常主观化

# Plot the elbow plot
ElbowPlot(object = seurat_integrated,
          ndims = 40)

但是其实挑选 PC 这一步只对于一些老的归一化方法来说是必须的，由于我们使用 SCTransform 进行 normalization，已经排除了技术原因（非生物学原因）产生的 PC，理论上我们选择越多的 PC，分群聚类时就会有越多有意义的差异被考虑在内，只是计算时间会更长，所以我们选择前 40 个 PC

4.2. 分群聚类

# Determine the K-nearest neighbor graph
seurat_integrated <- FindNeighbors(object = seurat_integrated,
                                   dims = 1:40)

# Determine the clusters for various resolutions
seurat_integrated <- FindClusters(object = seurat_integrated,
                                  resolution = c(0.4, 0.6, 0.8, 1.0, 1.4))

# Explore resolutions
seurat_integrated@meta.data %>%
  View()

关于 resolustion 的选择，对于细胞数在 3000-5000 之间的数据集，一般可以采用 0.4-1.4 之间的分辨率
然后进行可视化看聚类效果，两种常用的降维方法：t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) techniques

# Assign identity of clusters
Idents(object = seurat_integrated) <- "integrated_snn_res.0.8"

# Plot the UMAP
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)

res = 0.8
res = 0.4

5. Exploration of quality control metrics

determine whether clusters are unbalanced wrt UMIs, genes, cell cycle, mitochondrial content, samples, etc.

5.1. 按样本分离聚类

# Extract identity and sample information from seurat object to determine the number of cells per cluster per sample
n_cells <- FetchData(seurat_integrated,
                     vars = c("ident", "orig.ident")) %>%
  dplyr::count(ident, orig.ident) %>%
  tidyr::spread(ident, n)

# View table
View(n_cells)

# UMAP of cells in each cluster by sample
DimPlot(seurat_integrated,
        label = TRUE,
        split.by = "sample")  + NoLegend()

5.2. 按细胞周期分离聚类

# Explore whether clusters segregate by cell cycle phase
DimPlot(seurat_integrated,
        label = TRUE,
        split.by = "Phase")  + NoLegend()

结果不是非常理想

5.3. 其他想剔除的因素

# Determine metrics to plot present in seurat_integrated@meta.data
metrics <-  c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")

FeaturePlot(seurat_integrated,
            reduction = "umap",
            features = metrics,
            pt.size = 0.4,
            sort.cell = TRUE,
            min.cutoff = 'q10',
            label = TRUE)

可以看到分布不那么均匀

5.4. 尝试不做 SCTransform 进行对比

从 filtered_seurat 开始

## NO SCT, NO intergration
# 先进行一个粗略的归一化（除以每个细胞的总count再取自然对数）
seurat_phase <- NormalizeData(filtered_seurat)

cell_cycle_genes = read.csv("data/Mus_musculus.csv")
g2m_genes = cell_cycle_genes[cell_cycle_genes[,1]=="G2/M",2]
s_genes = cell_cycle_genes[cell_cycle_genes[,1]=="S",2]

# ID转换
library(clusterProfiler)
library(org.Mm.eg.db)
keytypes(org.Mm.eg.db)
g2m_symbol = bitr(g2m_genes, fromType="ENSEMBL", toType="SYMBOL", OrgDb="org.Mm.eg.db")
duplicated(g2m_symbol[,2])
s_symbol = bitr(s_genes, fromType="ENSEMBL", toType="SYMBOL", OrgDb="org.Mm.eg.db")

# Score cells for cell cycle
seurat_phase <- CellCycleScoring(seurat_phase,
                                 g2m.features = g2m_symbol[,2],
                                 s.features = s_symbol[,2])

# Identify the most variable genes
seurat_phase <- FindVariableFeatures(seurat_phase,
                                     selection.method = "vst",
                                     nfeatures = 2000,
                                     verbose = FALSE)

# Scale the counts
seurat_phase <- ScaleData(seurat_phase)

# Perform PCA
seurat_phase <- RunPCA(seurat_phase)

# Run UMAP
seurat_phase <- RunUMAP(seurat_phase,
                             dims = 1:40,
                             reduction = "pca")
# UMAP of cells in each cluster by sample
DimPlot(seurat_phase,
        label = TRUE,
        split.by = "sample")  + NoLegend()

DimPlot(seurat_phase,
        label = TRUE,
        split.by = "Phase")  + NoLegend()

# Determine metrics to plot present in seurat_integrated@meta.data
metrics <-  c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")

FeaturePlot(seurat_phase,
            reduction = "umap",
            features = metrics,
            pt.size = 0.4,
            sort.cell = TRUE,
            min.cutoff = 'q10',
            label = TRUE)

可以看到 SCTransform 是有效果的，但效果并不好

5.5. 查看各 PC 在各亚群中的表达情况

# Defining the information in the seurat object of interest
columns <- c(paste0("PC_", 1:16),
             "ident",
             "UMAP_1", "UMAP_2")

# Extracting this data from the seurat object
pc_data <- FetchData(seurat_integrated,
                     vars = columns)

# Adding cluster label to center of cluster on UMAP
umap_label <- FetchData(seurat_integrated,
                        vars = c("ident", "UMAP_1", "UMAP_2"))  %>%
  group_by(ident) %>%
  summarise(x=mean(UMAP_1), y=mean(UMAP_2))

# Plotting a UMAP plot for each of the PCs
map(paste0("PC_", 1:16), function(pc){
  ggplot(pc_data,
         aes(UMAP_1, UMAP_2)) +
    geom_point(aes_string(color=pc),
               alpha = 0.7) +
    scale_color_gradient(guide = FALSE,
                         low = "grey90",
                         high = "blue")  +
    geom_text(data=umap_label,
              aes(label=ident, x, y)) +
    ggtitle(pc)
}) %>%
  plot_grid(plotlist = .)

PC1 在 4 群
PC2 在 5/8 群
PC3 在 6 群
PC4 在除了 5/7 外的群
PC5 在 6/11/13/15/20 群
PC6 在除了 12/14/15 外的群
PC9 在 4/8 群
PC15 在 14 群

6. Searching for expected cell types using known cell type-specific gene markers

如果一种细胞类型有多个 marker，则需要找到表达所有 marker 的亚群
如果一个亚群包含多种已知细胞类型，则需要提高分辨率或提高 PC 数来将不同细胞类型区分开

友情宣传

生信爆款入门-全球听（买一得五）（第5期）（可能是最后一期）你的生物信息学入门课
（必看！）数据挖掘第3期（两天变三周，实力加量），医学生/临床医师首选技能提高课
生信技能树的2019年终总结，你的生物信息学成长宝藏
2020学习主旋律，B站74小时免费教学视频为你领路，还等什么，看啊！！！

单细胞数据挖掘3-clustering

1. 写在前面

2. Normalization, variance stabilization, and regression of unwanted variation for each sample

2.1. Cell cycle scoring

2.2. SCTransform

3. Integration of the samples using shared highly variable genes (optional, but recommended)

4. Clustering cells based on top PCs (metagenes)

4.1. 选择高变 PC

4.2. 分群聚类

5. Exploration of quality control metrics

5.1. 按样本分离聚类

5.2. 按细胞周期分离聚类

5.3. 其他想剔除的因素

5.4. 尝试不做 SCTransform 进行对比

5.5. 查看各 PC 在各亚群中的表达情况

6. Searching for expected cell types using known cell type-specific gene markers

友情宣传

猜你喜欢

热点阅读