单细胞分析踩坑实录：DotPlot()之'RNA' assay还

2022-01-16 本文已影响0人 Bio_Infor

前段时间在生信技能树单细胞学习小分队分享单细胞文章复现，一不小心就踩了一个坑……记录在这儿和大家一起分享。

背景

主要是想复现2020年发表在 PNAS 上的一篇题为 Single-nucleus transcriptome analysis reveals dysregulation of angiogenic endothelial cells and neuroprotective glia in Alzheimer’s disease 的阿尔兹海默症单细胞研究文章。

这篇文章提供的数据包括12个阿尔兹海默症患者和9个正常人的脑组织单细胞数据，已经全部上传到GEO数据库中，大家可以自行“食用”。
因为原始数据有接近 170,000 个细胞，数据量太大，所以我每个样本随机抽取了1000个细胞进行了我的分析（总共 21,000 个细胞）。

从创建SeuratObject到降维聚类

这部分不赘述了，直接贴我的代码：

library(Seurat)
library(clustree)
library(dplyr)
library(tidyverse)

dir.create(path = 'Figure')
sample = list.files("../sample")
sample
path = "../sample"

sce_list <- lapply(sample, function(sub_sample){
  folder = file.path(path, sub_sample)
  print(Sys.time())
  print(folder)
  sce <- CreateSeuratObject(counts = Read10X(folder), project = sub_sample)
  return(sce)
})

names(sce_list) = sample

#save as sce_list.rds
saveRDS(sce_list, file = 'sce_list.rds')
#sce_list <- readRDS("~/Single_cell/SCS_learning/biotrain_01/biotrain/sce_list.rds")


#randomly sample some cells
set.seed(1234)
sce = lapply(sce_list, function(x){
  sub = x[ ,sample(ncol(x), 1000)]
  return(sub)
})


#calculate the percentage of mt genes
for (i in 1:length(sample)){
  sce[[i]][['percent.mt']] <- PercentageFeatureSet(object = sce[[i]], pattern = "^MT-")
}

plotfeatures <- c('nFeature_RNA', 'nCount_RNA', 'percent.mt')
group = "orig.ident"


#quality control (save as sce_qc)
sce_qc <- lapply(sce, function(sce_sample){
  sub = subset(sce_sample, subset = nFeature_RNA>200 & nCount_RNA<20000 & percent.mt<20)
  return(sub)
})

#check final results after quality control
num_cells_after = 0
for (i in 1:length(sample)){
  num_cells_after = num_cells_after + ncol(sce_qc[[i]])
}
num_cells_after
#18950 cells

#integrate sample (batch effect) with CCA+MNN 
sce_obj <- lapply(sce_qc, function(x){
  x = NormalizeData(x)
  x = FindVariableFeatures(x, selection.method = "vst", nfeatures = 1000)
})
features <- SelectIntegrationFeatures(object.list = sce_obj)
anchors <- FindIntegrationAnchors(object.list = sce_obj, anchor.features = features, dims = 1:20)
sce_combine <- IntegrateData(anchorset = anchors, dims = 1:20)


#Run the standard workflow for visualization and clustering
DefaultAssay(sce_combine) <- 'integrated'

sce_combine <- sce_combine %>% 
  ScaleData() %>% 
  RunPCA(npcs = 50)

#evaluate pcs to use
sce_combine <- JackStraw(sce_combine, reduction = 'pca', dims = 30) %>% 
  ScoreJackStraw(dims = 1:30)
JackStrawPlot(sce_combine, dims = 1:30, reduction = 'pca')

sce_combine <- FindNeighbors(sce_combine, dims = 1:20)
for (res in seq(0.1, 1.2, by = 0.1)){
  sce_combine <- FindClusters(sce_combine, resolution = res)
}
sce_combine <- RunUMAP(sce_combine, dims = 1:20)

DimPlot(sce_combine, reduction = 'umap', group.by = "integrated_snn_res.0.6")

clustree(sce_combine@meta.data, prefix = 'integrated_snn_res.')

#resolution=0.6
sel_res = 'integrated_snn_res.0.6'
sce_combine_res0.6 = SetIdent(sce_combine, value = sel_res)

这里的质控标准等参数都和原文保持一致。

细胞类型注释

根据原文的细胞基因marker，我想利用 气泡图 来可视化不同的 cluster 中不同基因marker的表达情况，这是我第一次的代码：

#cell type annotation
genes_to_check = c('AQP4', 'ADGRV1', 'GPC5', 'RYR3', #astrocytes
                   'CLDN5', 'ABCB1', 'EBF1', #endothelial 
                   'CAMK2A', 'CBLN2', 'LDB2', #excitatory
                   'GAD1', 'LHFPL3', 'PCDH15', #inhibitory
                   'C3', 'LRMDA', 'DOCK8', #microglia
                   'MBP', 'PLP1', 'ST18' #oligodendrocytes
                   )
DotPlot(sce_combine_res0.6, 
        features = genes_to_check, 
        assay = 'integrated', 
        cols = c('lightgrey', 'red')) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 0.5, vjust = 0.5))

出图：

这个图……就有种一言难尽的感觉，给人第一感觉就是很脏，有的cluster你很难给出一个确切的答案它到底高表达哪种细胞的marker。
好了，问题出在哪儿呢？注意我在 DotPlot() 中用的 assay 是 'integrated' ，使用的是整合后的数据来绘制的气泡图，经过Jimmy老师指点，再用没有整合的数据来画这个图是什么效果呢？

DotPlot(sce_combine_res0.6, 
        features = genes_to_check, 
        assay = 'RNA', #注意，差别在这儿
        cols = c('lightgrey', 'red')) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 0.5, vjust = 0.5))

嗯？这不是干净多了？哪些细胞高表达哪些基因marker一目了然！那问题来了，该信哪个，为什么会有不同？

其实对于这个问题，Seurat 已经给出了答案，链接为：https://github.com/satijalab/seurat/issues/1717，总结起来就是：

当我们在利用marker基因对细胞类型进行探索性注释的时候，用 'RNA' assay，也就是没有经过整合的数据。

当我们在进行除细胞类型鉴定以外的其它操作，诸如聚类和聚类结果细胞的可视化等，就使用'integrated' assay。

感觉就是，和基因有关的操作都建议在 'RNA' assay 上完成（可能有点激进~~），如果你想具体了解一下怎么做，可以看看这个链接：https://satijalab.org/seurat/archive/v3.0/immune_alignment.html，毕竟 satijalab 将其描述为：Our best example！！！。

最后，为了让种种原因不能上GitHub的小伙伴了也看到 satijalab 对于这个问题的解答，这里也贴上截图（可下载原图查看哦）~

satijalab..jpg

当然最后也贴上我聚类注释出来的结果和原文的比较啦：

我做的

文章做的

感觉效果还可以哈？感谢 Jimmy 老师和大家的指点咯~~~

快过年了，就祝大家新年快乐吧~