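A One-Article Guide to the Single-Cell Sequencing Analysis Workflow

2019-04-16, by 小光amateur

Abstract

This article walks through a complete bioinformatics workflow for single-cell sequencing data. It may be the most recent and most complete walkthrough available.

Basic workflow (cellranger)

Single-cell workflow

cellranger: demultiplexing

cellranger mkfastq demultiplexes the BCL files produced by the sequencer into FASTQ files that downstream tools can read.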

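cellranger mkfastq --run=[ ] --csv=[sample.csv] --jobmode=local --localcores=20 --localmem=80

--run: path to the directory holding the sequencer's BCL output;
--csv: sample list with three columns (lane id, sample name, index name); an Illumina Experiment Manager sample sheet is passed with --samplesheet instead;
Take care to set the number of cores (--localcores) and the amount of memory (--localmem, in GB) to suit your machine.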
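The three-column sample list described above can be generated with a few lines of Python. This is only a minimal sketch: the header names Lane, Sample and Index follow cellranger's simple CSV layout, while the sample names and index names below are made-up placeholders, not values from a real run.

```python
import csv
import io

# Hypothetical samples for illustration: lane id, sample name, index name
samples = [
    {"Lane": "1", "Sample": "sample345", "Index": "SI-3A-A1"},
    {"Lane": "2", "Sample": "sample346", "Index": "SI-3A-B1"},
]


def build_sample_sheet(rows):
    """Return the three-column CSV text (Lane, Sample, Index)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Lane", "Sample", "Index"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sheet_text = build_sample_sheet(samples)
print(sheet_text)
```

Writing `sheet_text` to a file such as sample.csv gives the file that is handed to cellranger mkfastq.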

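The results are written to the outs directory.

cellranger: quantification

cellranger count is the main and most important cellranger command: it quantifies cells and genes, producing the gene expression matrix that all later analyses use.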

cellranger count \
--id=sample345 \
--transcriptome=/opt/refdata-cellranger-GRCh38-1.2.0/GRCh38 \
--fastqs=/home/jdoe/runs/HAWT7ADXX/outs/fastq_path \
--indices=SI-3A-A1 \
--cells=1000

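id: name of the output folder that will hold all results; the sample name works well (e.g. sample345);

fastqs: path to the fastq folder produced by cellranger mkfastq;

indices: the sample index, here SI-3A-A1;

transcriptome: path to the reference transcriptome;

cells: expected number of recovered cells;

Downstream analysis

The output of cellranger count only supports a rough first look. To cluster the cells in more detail, downstream analysis is needed; here we use the officially recommended R package, Seurat 3.0.

The workflow follows the official peripheral blood (PBMC) tutorial (https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html).

Installation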

install.packages('devtools')
devtools::install_github(repo = 'satijalab/seurat', ref = 'release/3.0')
library(Seurat)

Creating the Seurat object

library(dplyr)
library(Seurat)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc

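The directory read here is the matrix directory from the cellranger count output;
a first round of filtering happens while the object is created:
keep every gene expressed in >= 3 cells (min.cells = 3);
keep every cell in which >= 200 genes are detected (min.features = 200);
this removes some low-quality cells.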
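To make the two thresholds concrete, here is a small plain-Python sketch of the same idea on a toy matrix. It does not call Seurat; the gene names, counts and the order of the two passes (cells first, then genes) are assumptions made for illustration, and the thresholds are shrunk so the toy data shows an effect.

```python
# Toy count matrix: rows are genes, columns are cells (hypothetical values).
counts = {
    "GeneA": [5, 0, 2, 1],
    "GeneB": [0, 0, 3, 0],
    "GeneC": [1, 0, 4, 2],
}
cell_names = ["cell1", "cell2", "cell3", "cell4"]


def filter_counts(counts, cell_names, min_cells, min_features):
    """Drop cells with too few detected genes, then genes seen in too few cells."""
    # Pass 1: keep cells that express at least `min_features` genes.
    keep_cols = [
        j
        for j in range(len(cell_names))
        if sum(1 for row in counts.values() if row[j] > 0) >= min_features
    ]
    # Pass 2: keep genes detected in at least `min_cells` of the remaining cells.
    filtered = {}
    for gene, row in counts.items():
        kept = [row[j] for j in keep_cols]
        if sum(1 for value in kept if value > 0) >= min_cells:
            filtered[gene] = kept
    return filtered, [cell_names[j] for j in keep_cols]


filtered, kept_cells = filter_counts(counts, cell_names, min_cells=3, min_features=2)
print(kept_cells)
print(filtered)
```

Here cell2 is dropped because it expresses no genes, and GeneB is dropped because only one of the remaining cells expresses it.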

Standard preprocessing workflow

# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(object = pbmc, pattern = "^MT-")

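Genes whose names start with "MT-" are mitochondrial genes. This step flags them and computes, for each cell, the percentage of counts that come from them.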
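The quantity stored in percent.mt is simple arithmetic, sketched below in plain Python for a single cell. The function name and the counts are hypothetical; only the "MT-" prefix rule comes from the step above.

```python
def percent_mt(gene_counts):
    """Percentage of one cell's counts that come from genes named MT-*."""
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    mt_counts = sum(
        count for gene, count in gene_counts.items() if gene.startswith("MT-")
    )
    return 100.0 * mt_counts / total


# Hypothetical cell: 5 of its 100 counts come from mitochondrial genes.
cell = {"MT-CO1": 3, "MT-ND1": 2, "CD3E": 40, "LYZ": 55}
print(percent_mt(cell))
```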

# Visualize QC metrics as a violin plot
VlnPlot(object = pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

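Draw violin plots for the pbmc object showing, per cell, the number of detected genes (nFeature_RNA), the number of detected molecules (nCount_RNA), and the mitochondrial percentage (percent.mt).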

image

pbmc <- subset(x = pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

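Next, set the filter thresholds from the distributions of gene number and mitochondrial percentage in the plot: here cells with between 200 and 2,500 detected genes and less than 5% mitochondrial counts are kept.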
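The filter is a conjunction of three conditions per cell. A minimal Python sketch with made-up cells shows which ones survive; the dictionary layout and the function name are assumptions for illustration, while the thresholds are the ones used above.

```python
# Hypothetical per-cell QC metrics.
cells = [
    {"name": "cell1", "nFeature_RNA": 150, "percent_mt": 1.2},   # too few genes
    {"name": "cell2", "nFeature_RNA": 900, "percent_mt": 2.5},   # passes
    {"name": "cell3", "nFeature_RNA": 3100, "percent_mt": 3.0},  # too many genes
    {"name": "cell4", "nFeature_RNA": 1200, "percent_mt": 7.8},  # too much MT
]


def passes_qc(cell, min_features=200, max_features=2500, max_percent_mt=5.0):
    """True when the cell lies strictly inside all three thresholds."""
    return (
        min_features < cell["nFeature_RNA"] < max_features
        and cell["percent_mt"] < max_percent_mt
    )


kept = [cell["name"] for cell in cells if passes_qc(cell)]
print(kept)
```

Cells with very many detected genes are often doublets, and a high mitochondrial percentage often marks dying cells, which is why both ends are cut.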

Normalization

pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
# The values above are the defaults, so this shorter call is equivalent; run only one of the two
pbmc <- NormalizeData(object = pbmc)
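What LogNormalize does to one cell can be written out by hand: each count is divided by the cell's total, multiplied by the scale factor, and passed through log1p. The sketch below is plain Python on hypothetical counts, not a call into Seurat.

```python
import math


def log_normalize(cell_counts, scale_factor=10000):
    """log1p(count / total counts of the cell * scale_factor) for every gene."""
    total = sum(cell_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in cell_counts}
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in cell_counts.items()
    }


# Hypothetical cell with 4 counts in total.
cell = {"GeneA": 1, "GeneB": 3, "GeneC": 0}
normalized = log_normalize(cell)
print(normalized)
```

Dividing by the per-cell total removes differences in sequencing depth between cells, and the logarithm compresses the range of highly expressed genes.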

Identifying highly variable genes

pbmc <- FindVariableFeatures(object = pbmc, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(x = VariableFeatures(object = pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(object = pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))

image

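These are the tutorial's fixed parameters and need no further explanation. For me, however, the repel argument always raised an error; dropping it avoids the error, but the gene labels in the right-hand panel then all overlap. I have not worked out why.

Scaling the data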

all.genes <- rownames(x = pbmc)
pbmc <- ScaleData(object = pbmc, features = all.genes)
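Scaling shifts each gene to mean 0 and rescales it to variance 1 across cells, so that highly expressed genes do not dominate the PCA. The following Python sketch shows that z-score for one gene; the sample standard deviation and the upper clip at 10 are assumptions of this sketch rather than a statement of what ScaleData does internally.

```python
import statistics


def scale_gene(values, scale_max=10.0):
    """Z-score one gene across cells, clipping large values at `scale_max`."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        return [0.0 for _ in values]
    return [min((value - mean) / sd, scale_max) for value in values]


# Hypothetical normalized expression of one gene in three cells.
scaled = scale_gene([1.0, 2.0, 3.0])
print(scaled)
```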

Linear dimensionality reduction

pbmc <- RunPCA(object = pbmc, features = VariableFeatures(object = pbmc))

There are several ways to display the PCA result; this article uses the simplest one.

DimPlot(object = pbmc, reduction = "pca")

image

Determining the usable dimensions of the dataset

pbmc <- JackStraw(object = pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(object = pbmc, dims = 1:20)
JackStrawPlot(object = pbmc, dims = 1:15)

image

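Components whose curve lies above the dashed line are usable. You can also adjust the dims argument to plot every principal component.

Clustering the cells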

pbmc <- FindNeighbors(object = pbmc, dims = 1:10)
pbmc <- FindClusters(object = pbmc, resolution = 0.5)

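Here dims is the set of dimensions chosen in the previous step, and resolution controls the number of clusters (a higher value gives more clusters); the tutorial value of 0.5 is fine here.

Non-linear dimensionality reduction

Note that there are two methods for this step (UMAP and tSNE). Both lay out the clusters found above in two dimensions rather than recompute them. Either can be used, but pick one and use it throughout so that later plots stay comparable.
This article uses UMAP; the clustering itself is graph-based.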

pbmc <- RunUMAP(object = pbmc, dims = 1:10)
DimPlot(object = pbmc, reduction = "umap")

image

Once clustering is done, be sure to save the object; recomputing everything is a headache.

saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")

Finding marker genes for each cluster

cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 1, min.pct = 0.25)
head(x = cluster1.markers, n = 5)

This finds the marker genes of a single cluster (cluster 1 against all other cells).

cluster5.markers <- FindMarkers(object = pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(x = cluster5.markers, n = 5)

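This compares cluster 5 against clusters 0 and 3. Finding markers for every cluster at once (FindAllMarkers) is slow, so expect to wait.

There are also several ways to examine marker genes, for example by plotting their expression: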

FeaturePlot(object = pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ",
    "PPBP", "CD8A"))

image

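# Markers for every cluster (slow); defines the pbmc.markers table used below
pbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(object = pbmc, features = top10$gene) + NoLegend()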
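The group_by / top_n step picks, within each cluster, the genes with the largest avg_logFC. A rough Python equivalent on a hypothetical marker table is sketched below; unlike dplyr's top_n it does not keep extra rows on ties, and the gene values are invented.

```python
from collections import defaultdict

# Hypothetical marker table rows (cluster, gene, average log fold change).
markers = [
    {"cluster": "0", "gene": "LDHB", "avg_logFC": 0.9},
    {"cluster": "0", "gene": "CCR7", "avg_logFC": 1.1},
    {"cluster": "0", "gene": "CD3E", "avg_logFC": 0.7},
    {"cluster": "1", "gene": "LYZ", "avg_logFC": 2.0},
    {"cluster": "1", "gene": "CD14", "avg_logFC": 1.8},
]


def top_n_per_cluster(rows, n):
    """Return the n genes with the highest avg_logFC for each cluster."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)
    return {
        cluster: [
            row["gene"]
            for row in sorted(group, key=lambda r: r["avg_logFC"], reverse=True)[:n]
        ]
        for cluster, group in by_cluster.items()
    }


top2 = top_n_per_cluster(markers, n=2)
print(top2)
```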

image

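What remains is to look up marker genes and annotate the cell types.

Fully automated cell type annotation

SingleR is an R package for fully automated cell annotation, and it is simple to use.

Installation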

devtools::install_github('dviraran/SingleR')
# this might take long, though mostly because of the installation of Seurat.

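On a Mac you also need XQuartz, which can be downloaded from its official site.

Creating the SingleR object

The official documentation describes several ways to create this object; see "SingleR - create object".
Since we already have a Seurat object, we can convert it directly.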

library(SingleR)

# Seurat v3 objects are S4, so pull the inputs out of their slots by hand
counts <- seurat.object@assays$RNA@counts
clusters <- seurat.object@meta.data$seurat_clusters

singler = CreateSinglerObject(counts, annot = NULL, project.name, min.genes = 0,
  technology = "10X", species = "Human", citation = "",
  ref.list = list(), normalize.gene.length = F, variable.genes = "de",
  fine.tune = T, do.signatures = T, clusters = NULL, do.main.types = T,
  reduce.file.size = T, numCores = SingleR.numCores)

singler$seurat = seurat.object # (optional)
singler$meta.data$orig.ident = seurat.object@meta.data$orig.ident # the original identities, if not supplied in 'annot'

## if using Seurat v3.0 and over use:
singler$meta.data$xy = seurat.object@reductions$tsne@cell.embeddings # the tSNE coordinates (needs RunTSNE; use reductions$umap if only UMAP was run)
singler$meta.data$clusters = seurat.object@active.ident # the Seurat clusters (if 'clusters' not provided)

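fine.tune: setting this to T takes a long time; it refines the calls for small differences between similar cell types and can be skipped.
do.signatures: this is also slow; it runs single-cell gene set enrichment analysis and can be set to F at first.

Once the object is built, save it and upload it to the official website for visual analysis.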

singler.new = convertSingleR2Browser(singler)
saveRDS(singler.new, file = paste0(singler.new@project.name, '.rds'))

image

Pseudotime analysis

For pseudotime analysis, monocle 3.0 is recommended.

Installation

source("http://bioconductor.org/biocLite.R")
biocLite("monocle")
devtools::install_github("cole-trapnell-lab/DDRTree", ref="simple-ppt-like")
devtools::install_github("cole-trapnell-lab/L1-graph")
# Skip this step if reticulate was already installed together with Seurat 3.0
install.packages("reticulate")
library(reticulate)
py_install('umap-learn', pip = T, pip_ignore_installed = T) # Ensure the latest version of UMAP is installed
py_install("louvain")
devtools::install_github("cole-trapnell-lab/monocle-release", ref="monocle3_alpha")

Pseudotime analysis

library(Seurat)
library(monocle)
Seurat.obj<-readRDS("**.rds")
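# With Seurat 2.4 the object can be imported directly with monocle's importCDS; with 3.0 the data must be imported by hand as below
# This builds the three inputs the official tutorial requires: the expression matrix, the cell annotation table and the gene annotation table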
data <- as(as.matrix(Seurat.obj@assays$RNA@data), 'sparseMatrix')
pd<-new("AnnotatedDataFrame", data = Seurat.obj@meta.data)
fd<-new("AnnotatedDataFrame", data = data.frame(gene_short_name = row.names(data), row.names = row.names(data)))
cds <- newCellDataSet(data, phenoData = pd, featureData = fd)
# Rename one metadata column
names(pData(cds))[names(pData(cds))=="RNA_snn_res.0.5"]="Cluster"
# Add cell type labels for the clusters
pData(cds)$cell_type2 <- plyr::revalue(as.character(pData(cds)$Cluster),c("0" = 'Fibroblasts',"1" = 'Fibroblasts',"2" = 'Fibroblasts',"3" = 'Fibroblasts',"4" = 'Fibroblasts',"5" = 'NK',"6" = 'Fibroblasts',"7" = 'Macrophage',"8" = 'NK',"9" = 'Macrophage',"10" = 'EC',"11" = 'Fibroblasts',"12" = 'EC'))
cell_type_color <- c("Fibroblasts" = "#E088B8","NK" = "#46C7EF","Macrophage" = "#EFAD1E","EC" = "#8CB3DF")
# Pseudotime workflow
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds)
cds <- preprocessCDS(cds, num_dim = 20)
cds <- reduceDimension(cds, reduction_method = 'UMAP')
cds <- partitionCells(cds)
cds <- learnGraph(cds,  RGE_method = 'SimplePPT')
plot_cell_trajectory(cds,color_by = "cell_type2") + scale_color_manual(values = cell_type_color)
# Order cells in pseudotime starting from a chosen cell type
get_correct_root_state <- function(cds, cell_phenotype, root_type){
  cell_ids <- which(pData(cds)[, cell_phenotype] == root_type)
  closest_vertex <-cds@auxOrderingData[[cds@rge_method]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
  root_pr_nodes <- igraph::V(cds@minSpanningTree)$name[as.numeric(names(which.max(table(closest_vertex[cell_ids,]))))]
  root_pr_nodes # return the principal-graph node closest to most cells of root_type
}

MPP_node_ids = get_correct_root_state(cds,cell_phenotype ='cell_type2', 'Fibroblasts')
cds <- orderCells(cds, root_pr_nodes = MPP_node_ids)
plot_cell_trajectory(cds)
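The plyr::revalue call in the code above is a lookup from cluster id to cell type label. The same lookup is sketched here in Python with the mapping copied from that call; the function name is hypothetical, and ids missing from the mapping are left unchanged, which is an assumption about the intended behaviour.

```python
# Cluster id -> cell type, copied from the revalue call above.
cluster_to_type = {
    "0": "Fibroblasts", "1": "Fibroblasts", "2": "Fibroblasts",
    "3": "Fibroblasts", "4": "Fibroblasts", "5": "NK",
    "6": "Fibroblasts", "7": "Macrophage", "8": "NK",
    "9": "Macrophage", "10": "EC", "11": "Fibroblasts", "12": "EC",
}


def annotate(cluster_ids, mapping):
    """Replace each cluster id by its label; unknown ids pass through as-is."""
    return [mapping.get(cluster_id, cluster_id) for cluster_id in cluster_ids]


labels = annotate(["0", "5", "7", "10", "99"], cluster_to_type)
print(labels)
```

The labels are specific to the author's dataset and have to be replaced with the annotation of your own clusters.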

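Pseudotime analysis

Analysis of selected cells

This article is entirely original, with some data taken from the official tutorials. Please credit the source when reposting.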
