scATAC-seq数据分析练习(成年小鼠大脑组织)
教程官网以及练习数据下载:https://satijalab.org/signac/articles/
在这个练习里使用了成年小鼠大脑细胞的scATAC-seq数据来作为例子,这个scATAC-seq是由10x Genomics平台测得的。练习数据是count矩阵。这篇笔记就是按照官网的教程走一遍代码,熟悉单细胞ATAC-seq的分析流程。
> library(Signac)
> library(Seurat)
> library(GenomeInfoDb)
> library(EnsDb.Mmusculus.v79)
> library(ggplot2)
> library(patchwork)
> set.seed(1234)
Pre-processing workflow数据预处理
> counts <- Read10X_h5("./mouse_brain/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5")
> metadata <- read.csv(
file = "./mouse_brain/atac_v1_adult_brain_fresh_5k_singlecell.csv",
header = TRUE,
row.names = 1
)
#从count矩阵构建一个ChromatinAssay对象,
#输入的矩阵应该是以features x 细胞的格式
> brain_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
genome = "mm10",
fragments = './mouse_brain/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz',
min.cells = 1
)
Computing hash
Checking for 5337 cell barcodes
上面的CreateChromatinAssay()
函数是创建ChromatinAssay对象。参数分别是:
counts:count是非标准化的,原始的count值
sep:用于分隔基因组坐标的字符,这里是:和-。
genome:说明你用的是哪一个物种的基因组。
fragments:输入矩阵中包含的数据的tabix索引片段文件的路径。
min.cells:至少在一个细胞里检测到features,这一步实际上过滤细胞。把什么reads都没有的细胞除去。
这个函数还有一些其他的参数,具体请见:Create ChromatinAssay object
构建Seurat对象:
> brain <- CreateSeuratObject(
counts = brain_assay, #可以是一个矩阵,也可以是assay-derived的对象
assay = 'peaks', #Name of the initial assay
project = 'ATAC', #Project name for the Seurat object
meta.data = metadata
)
#会弹出:(不要紧,官网上的教程里也出现了这一行warning信息)
Warning message:
In CreateSeuratObject.Assay(counts = brain_assay, assay = "peaks", :
Some cells in meta.data not present in provided counts matrix.
这里我们可以给这个brain对象添加基因注释信息,这会方便之后从这个seurat对象里直接拉取基因注释信息:
#从EnsDb对象中提取所有染色体的转录信息
> annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)
#会弹出:
Fetching data...OK
Parsing exons...OK
Defining introns...OK
Defining UTRs...OK
Defining CDS...OK
aggregating...
Done
......(此处省略N行)
There were 21 warnings (use warnings() to see them)
修改UCSC style:
> seqlevelsStyle(annotations) <- 'UCSC'
> genome(annotations) <- "mm10"
给seurat对象加基因信息:
> Annotation(brain) <- annotations
Computing QC Metrics计算质量控制指标
首先计算每一个细胞里的核小体信号强度。使用的是NucleosomeSignal()
函数,计算147bp-294bp(单个核小体)之间的片段与小于147bp片段(非核小体)的比例。
计算后会返回一个Seurat对象,其中添加了关于每个细胞单核小体片段与无核小体片段比例的metadata,以及每个比例的百分数rank。
> brain <- NucleosomeSignal(object = brain)
Found 5337 cell barcodes
Done Processing 53 million lines
我们可以查看所有细胞的片段长度周期性,并且把核小体高信号强度和低信号强度的细胞分群。这样可以看单核小体/非核小体的比例,从而检查离群细胞。剩下的细胞就是可以用来进行后续分析的细胞。
#在brain对象里加一个metadata,名为nucleosome_group,如果nucleosome_signal里的数值大于4,则在nucleosome_group里标记NS>4,反之则标记NS<4.
> brain$nucleosome_group <- ifelse(brain$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
#绘制不同长度的片段在不同细胞群中出现的频率
#这里我们绘制上面区分的NS>4和NS<4的细胞群里不同片段在chr1的内出现的频率。
> FragmentHistogram(object = brain, group.by = 'nucleosome_group', region = 'chr1-1-10000000')
Tn5整合事件在转录起始位点是富集的,所以也可以用来作为一个重要的质量控制参数来评价Tn5在ATAC-seq实验中的靶向效率。ENCODE定义了TSS富集分数(TSS周围的Tn5整合位点数量标准化为侧翼区域的Tn5整合位点数量)。
转录起始点(TSS)富集分数: TSS富集计算是一个信号到噪声的计算。落在一系列TSSs周围的reads被收集起来,形成以TSSs为中心的reads的聚合分布,并在两个方向上扩展到1000 bp(总共为2000bp)。然后,通过取在每个末端100 bps中的平均reads深度(总共200bp的平均数据),计算在每个位置的read深度的倍数变化,对该分布进行标准化。这意味着侧翼(flanks)应该从1开始,如果在转录起始位点(基因组的高度开放区域)有高的reads信号,信号应该增加形成中间的peak。我们将分布的中心的信号值进行标准化,作为TSS的富集参数。用于评估ATAC-seq。
具体内容可见: (https://www.encodeproject.org/data-standards/terms/). 我们可以使用Signac包里的TSSEnrichment()
功能来计算每一个细胞TSS富集分数:
> brain <- TSSEnrichment(brain, fast = FALSE)
Extracting TSS positions
Finding + strand cut sites
Finding - strand cut sites
|--------------------------------------------------|
|==================================================|
Computing mean insertion frequency in flanking regions
Normalizing TSS score
#将上面计算好的数值加到brain对象中,并区分>2的是高,反之则是低
> brain$high.tss <- ifelse(brain$TSS.enrichment > 2, 'High', 'Low')
#绘制每个位置相对于TSS的标准化TSS富集评分,这里分别绘制high和low两个组
> TSSPlot(brain, group.by = 'high.tss') + NoLegend()
然后我们要把所有细胞里落在peaks上的reads的比例算出来,并把落在“黑名单”区域上的Reads比例算出来:
# add blacklist ratio and fraction of reads in peaks
> brain$pct_reads_in_peaks <- brain$peak_region_fragments / brain$passed_filters * 100
> brain$blacklist_ratio <- brain$blacklist_region_fragments / brain$peak_region_fragments
> VlnPlot(
object = brain,
features = c('pct_reads_in_peaks', 'peak_region_fragments',
'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal'),
pt.size = 0.1,
ncol = 5
)
fraction of reads in peaks:
reads在峰值的部分(FRiP)——所有比对到peak区域的reads的部分,即在显著富集的peak上的可用的reads除以所有可用的reads。一般来说,FRiP分数与区域的数量呈正相关。(Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)
然后把离群值去掉:
> brain <- subset(
x = brain,
subset = peak_region_fragments > 3000 &
peak_region_fragments < 100000 &
pct_reads_in_peaks > 40 &
blacklist_ratio < 0.025 &
nucleosome_signal < 4 &
TSS.enrichment > 2
)
> brain
An object of class Seurat
156815 features across 3512 samples within 1 assay
Active assay: peaks (156815 features, 0 variable features)
Normalization and linear dimensional reduction标准化和线性降维
> brain <- RunTFIDF(brain)
Performing TF-IDF normalization
> brain <- FindTopFeatures(brain, min.cutoff = 'q0')
# A dimensional reduction object with key LSI_
> brain <- RunSVD(object = brain)
Running SVD
Scaling cell embeddings
第一个LSI成分通常捕获的是测序深度(技术variation)而不是生物学variation。如果是这种情况,则应该从下游分析中删除该成分。我们可以使用DepthCor()函数来评估每个LSI成分与序列深度的相关性:
#计算总counts数与每一个降维成分之间的相关性
> DepthCor(brain)
在上图里,我们可以看到第一个LSI成分与细胞总count数有强烈的相关性,所以我们在后续的分析中要去掉这个成分。
Non-linear dimension reduction and clustering非线性降维和聚类
由于细胞被嵌入到低维空间中,我们就可以使用通常用于分析scRNA-seq数据的方法来执行graph-based聚类,并使用非线性降维来进行可视化。函数RunUMAP()、FindNeighbors()和FindClusters()都来自Seurat包:
> brain <- RunUMAP(
object = brain,
reduction = 'lsi',
dims = 2:30
)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
21:47:12 UMAP embedding parameters a = 0.9922 b = 1.112
21:47:13 Read 3512 rows and found 29 numeric columns
21:47:13 Using Annoy for neighbor search, n_neighbors = 30
21:47:13 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:47:13 Writing NN index file to temp file C:\Users\YANFAN~1\AppData\Local\Temp\RtmpW4OpBN\file97830d16bc1
21:47:13 Searching Annoy index using 1 thread, search_k = 3000
21:47:15 Annoy recall = 100%
21:47:16 Commencing smooth kNN distance calibration using 1 thread
21:47:19 Initializing from normalized Laplacian + noise
21:47:19 Commencing optimization for 500 epochs, with 130124 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:47:32 Optimization finished
> brain <- FindNeighbors(
object = brain,
reduction = 'lsi',
dims = 2:30
)
Computing nearest neighbor graph
Computing SNN
> brain <- FindClusters( #根据shared nearest neighbor (SNN) 来鉴定clusters
object = brain,
algorithm = 3,
resolution = 1.2,
verbose = FALSE
)
> DimPlot(object = brain, label = TRUE) + NoLegend()
Create a gene activity matrix创建基因活性矩阵
#计算基因活性(计算每一个细胞里,落在gene body上和启动子上的counts数)
> gene.activities <- GeneActivity(brain)
Extracting gene coordinates
Extracting reads overlapping genomic regions
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=09m 20s
# 把基因活性矩阵作为一个新的assay添加到Seurat对象里
> brain[['RNA']] <- CreateAssayObject(counts = gene.activities)
> brain <- NormalizeData(
object = brain,
assay = 'RNA',
normalization.method = 'LogNormalize',
scale.factor = median(brain$nCount_RNA)
)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> DefaultAssay(brain) <- 'RNA'
> FeaturePlot(
object = brain,
features = c('Sst','Pvalb',"Gad2","Neurod6","Rorb","Syt6"),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3
)
Integrating with scRNA-seq data与scRNA-seq数据整合
为了帮助解释scATAC-seq数据,我们可以根据同一个生物系统(成年小鼠大脑)的scRNA-seq实验对细胞进行分类。这里我们使用描述的跨模态集成和标签转移的方法(here),这里有一个更详细的教程(here)。
你可以从Allen Institute网站下载此实验的原始数据(website),并在GitHub上查看用于构造此对象的代码(GitHub)。或者,你可以在这里下载预处理的Seurat对象(here)。
# 这里我们使用预处理过的scRNA-seq Seurat对象
> allen_rna <- readRDS("./mouse_brain/allen_brain.rds")
> allen_rna <- FindVariableFeatures(
object = allen_rna,
nfeatures = 5000
)
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
#把ATAC-seq和RNA-seq的数据对应起来,这里需要输入你的参考对象、和要进行query(查询)的对象,这里reference是RNA-seq的对象,我们需要用ATAC-seq结果和它对应起来
> transfer.anchors <- FindTransferAnchors(
reference = allen_rna,
query = brain,
reduction = 'cca', #这里降维的方法是CCA。如果你两个对象都是scRNA-seq的话,则用PCA,代码是reduction='pcaproject'
dims = 1:40
)
Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 7875 anchors
Filtering anchors
Retained 4854 anchors
> predicted.labels <- TransferData(
anchorset = transfer.anchors, #这里anchorset应该是你上一步生成的对象
refdata = allen_rna$subclass, #data to transfer
weight.reduction = brain[['lsi']],
dims = 2:30
)
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Predicting cell labels
> brain <- AddMetaData(object = brain, metadata = predicted.labels)
> plot1 <- DimPlot(allen_rna, group.by = 'subclass', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')
> plot2 <- DimPlot(brain, group.by = 'predicted.id', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')
> plot1 + plot2
Why did we change default parameters?为什么我们要修改默认参数?
我们更改了FindIntegrationAnchors()和FindVariableFeatures()的默认参数(包括更多的features和维度)。你可以以两种方式运行分析,这两种方法的结果是非常相似的。然而,当使用默认参数时,我们误将cluster 11细胞标记为Vip间神经元,而实际上它们是最近我们和其他人描述的表达Meis2的CGE衍生的间神经元群体。原因是这个细胞群在scRNA-seq数据中非常罕见(0.3%),所以定义这个细胞群的基因(例如Meis2)表达过低,无法在最初的一组变量特征中被选择。因此,我们需要更多的基因和维度来进行跨模比对。有趣的是,与scRNA-seq数据相比,scATAC-seq数据中的这个细胞群有10倍的富集(请看这篇文章以理解这种现象的发生的可能原因 this paper )。
你可以看到,基于RNA的分类与仅在ATAC-seq数据上计算的UMAP可视化完全一致。我们现在可以注释我们的scATAC-seq的细胞群(或者,我们可以使用RNA分类)。我们注意到三个小的细胞群(13,20,21),它们代表了scRNA-seq标签的细分。尝试从allen scRNA-seq数据集转移聚类标签(它显示了更细微的区别),对它们进行注释!
# replace each label with its most likely prediction
> for(i in levels(brain)) {
cells_to_reid <- WhichCells(brain, idents = i)
newid <- names(sort(table(brain$predicted.id[cells_to_reid]),decreasing=TRUE))[1]
Idents(brain, cells = cells_to_reid) <- newid
}
Find differentially accessible peaks between clusters不同细胞群之间差异可接近性peak
#switch back to working with peaks instead of gene activities
> DefaultAssay(brain) <- 'peaks'
> da_peaks <- FindMarkers( #Finds markers (differentially expressed genes) for identity classes
object = brain,
ident.1 = c("L2/3 IT"),
ident.2 = c("L4", "L5 IT", "L6 IT"),
min.pct = 0.4,
test.use = 'LR',
latent.vars = 'peak_region_fragments'
)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s
> head(da_peaks) #这里avg_logFC的值如果是正的,说明peaks在上一步里ident.1设置的细胞群里比较多;如果是负的,说明在ident.2细胞群里比较多
p_val avg_logFC pct.1 pct.2 p_val_adj
chr4-86523678-86525285 6.291218e-67 0.5613693 0.401 0.036 9.865574e-62
chr15-87605281-87607659 4.653646e-57 0.4891020 0.484 0.091 7.297614e-52
chr2-118700082-118704897 1.747014e-56 0.3044948 0.624 0.181 2.739581e-51
chr3-137056475-137058371 1.246008e-52 0.3551127 0.524 0.103 1.953928e-47
chr13-69329933-69331707 3.007140e-52 -0.4234536 0.144 0.444 4.715647e-47
chr10-107751762-107753240 1.533317e-51 0.3950263 0.612 0.184 2.404472e-46
#对差异peak列表里的第一个peak(chr4的一个区域)进行不同细胞群之间的比较,并作图
> plot1 <- VlnPlot(
object = brain,
features = rownames(da_peaks)[1],
pt.size = 0.1,
idents = c("L4","L5 IT","L2/3 IT","L6 IT")
)
> plot2 <- FeaturePlot(
object = brain,
features = rownames(da_peaks)[1],
pt.size = 0.1,
max.cutoff = 'q95'
)
> plot1 | plot2
#标记显著差异的peaks。logFC>0.25说明在L2/3里peak更加容易接近,如果<-0.25说明在L4/5/6细胞群里peak更容易接近。
> open_l23 <- rownames(da_peaks[da_peaks$avg_logFC > 0.25, ])
> open_l456 <- rownames(da_peaks[da_peaks$avg_logFC < -0.25, ])
> closest_l23 <- ClosestFeature(brain, open_l23) #closestFeature:寻找距离一组给定的基因组区域最近的features
> closest_l456 <- ClosestFeature(brain, open_l456)
> head(closest_l23)
> head(closest_l23)
tx_id gene_name gene_id gene_biotype type closest_region
ENSMUST00000151481 ENSMUST00000151481 Fam154a ENSMUSG00000028492 protein_coding gap chr4-86487920-86538964
ENSMUSE00000647021 ENSMUST00000068088 Fam19a5 ENSMUSG00000054863 protein_coding exon chr15-87625230-87625486
ENSMUST00000104937 ENSMUST00000104937 Ankrd63 ENSMUSG00000078137 protein_coding cds chr2-118702266-118703438
ENSMUST00000070198 ENSMUST00000070198 Ppp3ca ENSMUSG00000028161 protein_coding utr chr3-136935226-136937727
ENSMUST00000165341 ENSMUST00000165341 Otogl ENSMUSG00000091455 protein_coding utr chr10-107762223-107762309
ENSMUST00000179893 ENSMUST00000179893 Ryr1 ENSMUSG00000030592 protein_coding cds chr7-29073019-29073139
query_region distance
ENSMUST00000151481 chr4-86523678-86525285 0
ENSMUSE00000647021 chr15-87605281-87607659 17570
ENSMUST00000104937 chr2-118700082-118704897 0
ENSMUST00000070198 chr3-137056475-137058371 118747
ENSMUST00000165341 chr10-107751762-107753240 8982
ENSMUST00000179893 chr7-29070299-29073172 0
> head(closest_l456)
tx_id gene_name gene_id gene_biotype type closest_region
ENSMUST00000044081 ENSMUST00000044081 Papd7 ENSMUSG00000034575 protein_coding utr chr13-69497959-69499915
ENSMUST00000084628 ENSMUST00000084628 Hs3st2 ENSMUSG00000046321 protein_coding cds chr7-121392730-121393214
ENSMUST00000161979 ENSMUST00000161979 Kcnab1 ENSMUSG00000027827 protein_coding gap chr3-65261847-65266490
ENSMUST00000111627 ENSMUST00000111627 Mobp ENSMUSG00000032517 protein_coding utr chr9-120173191-120176091
ENSMUST00000034339 ENSMUST00000034339 Cdh5 ENSMUSG00000031871 protein_coding gap chr8-104126660-104128051
ENSMUST00000134253 ENSMUST00000134253 Nmnat3 ENSMUSG00000032456 protein_coding gap chr9-98354165-98399455
query_region distance
ENSMUST00000044081 chr13-69329933-69331707 166251
ENSMUST00000084628 chr7-121391215-121395519 0
ENSMUST00000161979 chr3-65262154-65263636 0
ENSMUST00000111627 chr9-120174034-120175458 0
ENSMUST00000034339 chr8-104126858-104127953 0
ENSMUST00000134253 chr9-98387463-98389145 0
Plotting genomic regions画基因组区域
我们还可以使用CoveragePlot()函数为任何基因组区域创建按聚类、细胞类型或对象中存储的任何其他元数据分组的覆盖率图。这些代表了pseudo-bulk可接近性track,其中信号是一个组内所有细胞的平均,以可视化DNA可接近性。(实际上和IGV里的可视化是一样的)
> levels(brain) <- c("L2/3 IT","L4","L5 IT","L5 PT","L6 CT", "L6 IT","NP","Sst","Pvalb","Vip","Lamp5","Meis2","Oligo","Astro","Endo","VLMC","Macrophage")
#查看Neurod6基因附近的peak
> CoveragePlot(
object = brain,
region = c("Neurod6"),
extend.upstream = 1000,
extend.downstream = 1000
)
#查看Gad2基因附近的peak
> CoveragePlot(
object = brain,
region = c("Gad2"),
extend.upstream = 1000,
extend.downstream = 1000
)
你也可以同时画两个基因的peak情况:
> CoveragePlot(
object = brain,
region = c("Neurod6", "Gad2"),
extend.upstream = 1000,
extend.downstream = 1000,
ncol = 1
)