Cicero R软件包

2020-06-30 本文已影响0人嗷嗷叫的多多

Cicero将数据存储在Cell DataSet类的对象中，该类继承自Bioconductor的ExpressionSet类。使用以下三个函数来操作该对象：

fData: 获取feature的元信息
pData: 获取cell,sample的元信息
exprs: 获取cell-by-peak的count矩阵

loading data

Cicero使用peak作为其特征数据fData，而不是基因或转录本。许多Cicero函数需要形式为chr1_10390134_10391134的峰值信息,例如:

                            site_name               chromosome    bp1          bp2          
chr10_100002625_100002940   chr10_100002625_100002940   10  100002625   100002940   
chr10_100006458_100007593   chr10_100006458_100007593   10  100006458   100007593
chr10_100011280_100011780   chr10_100011280_100011780   10  100011280   100011780

1. 创建CDS class作为Cicero的输入文件

1.1. 从简单的稀疏矩阵格式加载数据

Cicero包含一个名为make_atac_cds的函数，这个函数的输入数据第一列是峰坐标，格式为“ chr10_100013372_100013596”，第二列是cell名称，第三列是整数，表示该细胞与该峰重叠的read数。该文件不应包含标题行。如下：
```
chr10_100002625_100002940   cell1   1
chr10_100006458_100007593   cell2   2
chr10_100006458_100007593   cell3   1
chr10_100013372_100013596   cell2   1
chr10_100015079_100015428   cell4   3
```

将此文件使用make_atac_cds函数转存为cds对象，以便接下来的处理（cicero软件包需要的数据大致都为cds形式）

# read in the data
cicero_data <- read.table("D:/biowork/cicero/kidney_data.txt")
input_cds <- make_atac_cds(cicero_data, binarize = TRUE)

1.2. 加载 10X scATAC-seq data

Cicero还支持利用cellranger-ATAC处理scATAC-seq数据，在输出的结果中有名为filtered_peak_bc_matrix（过滤peak-barcode矩阵）的文件夹。filtered_peak_bc_matrix文件夹中包括：

matrix.mtx

%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2, "software_version": "1.2.0"}
125987 427 1849286
125903 1 2
125834 1 2

barcodes.tsv

AAACTGCAGAGAGTTT-1
AAAGATGAGGCTAAAT-1
AAAGGATAGAGTTCGG-1
AAAGGATTCTACTTTG-1

peaks.bed

chr1    10035   10358
chr1    629447  630122
chr1    633794  634270
chr1    775088  775154

处理以上三个文件，同样存储为cds对象：

# read in matrix data using the Matrix package
indata <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx") 
# binarize the matrix
indata@x[indata@x > 0] <- 1

# format cell info
cellinfo <- read.table("filtered_peak_bc_matrix/barcodes.tsv")
row.names(cellinfo) <- cellinfo$V1
names(cellinfo) <- "cells"

# format peak info
peakinfo <- read.table("filtered_peak_bc_matrix/peaks.bed")
names(peakinfo) <- c("chr", "bp1", "bp2")
peakinfo$site_name <- paste(peakinfo$chr, peakinfo$bp1, peakinfo$bp2, sep="_")
row.names(peakinfo) <- peakinfo$site_name

row.names(indata) <- row.names(peakinfo)
colnames(indata) <- row.names(cellinfo)

# make CDS
input_cds <-  suppressWarnings(new_cell_data_set(indata,
cell_metadata = cellinfo,
gene_metadata = peakinfo))
#对于cell_data_set对象中的每个基因，detect_genes计算有多少细胞的表达超过了最小阈值,此外，对于每个细胞，detect_genes会统计超过该阈值的可检测基因的数量。结果分别作为列num_cells_expressed和num_genes_expressed添加到rowData表和colData表中。
input_cds <- monocle3::detect_genes(input_cds)

#Ensure there are no peaks included with zero reads
input_cds <- input_cds[Matrix::rowSums(exprs(input_cds)) != 0,]

处理后三个文件中的内容如下：
matrix.mtx 二进制化

6 x 427 sparse Matrix of class "dgTMatrix"
[1,] . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[2,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . ......
[3,] . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . ......

barcodes.tsv format cell info

                                cells
AAACTGCAGAGAGTTT-1 AAACTGCAGAGAGTTT-1
AAAGATGAGGCTAAAT-1 AAAGATGAGGCTAAAT-1
AAAGGATAGAGTTCGG-1 AAAGGATAGAGTTCGG-1

peaks.bed format peak info

                    chr    bp1    bp2          site_name
chr1_10035_10358   chr1  10035  10358   chr1_10035_10358
chr1_629447_630122 chr1 629447 630122 chr1_629447_630122
chr1_633794_634270 chr1 633794 634270 chr1_633794_634270

2.Constructing cis-regulatory networks

2.1Running Cicero

创建一个Cicero CDS：

单细胞染色质可及性数据极为稀疏，因此对可及性分数的准确估计需要汇总相似的细胞以创建更密集的计数数据。Cicero使用k最近邻方法来做到这一点，该方法会创建重叠的细胞集。Cicero基于细胞相似度的降维坐标图（例如UMAP或t-sne）构造这些集合。

使用Monocle 3的功能，我们首先找到input_cds的UMAP坐标

#随机产生种子
set.seed(2017)
#检测超过最小阈值的基因。并将超过阈值的基因数据存放到表中
input_cds <- detect_genes(input_cds)
#用于计算单细胞RNA-seq数据的大小因子
input_cds <- estimate_size_factors(input_cds)
#对cds进行预处理，为轨迹分析做准备
input_cds <- preprocess_cds(input_cds, method = "LSI")
#进行降维处理--UMAP
input_cds <- reduce_dimension(input_cds, reduction_method = 'UMAP', 
                              preprocess_method = "LSI")
#使用Monocle的绘图功能来可视化缩小尺寸图：
plot_cells(input_cds)

使用函数make_cicero_cds创建聚合的CDS对象。make_cicero_cds的输入是input_CDS对象，以及缩小的尺寸坐标图。缩小的尺寸图reduce_coordinates应该采用data.frame或矩阵的形式，其中行名称与CDS的pData表中的cell ID匹配。

读取input_cds的UMAP坐标
reduce_coordinates的列应为降维对象的坐标

#input_CDS对象访问UMAP坐标
umap_coords <- reducedDims(input_cds)$UMAP

#这是把细胞降维了，本来应该是行是peak，列是细胞的矩阵，他想对细胞进行降维，然后看细胞之间的相似程度，因为原本的矩阵纬度太高了，所以他把所有的peak变成了两个特征值，就把矩阵降成了二维--得到umap1,umap2。这样便可以确定细胞的位置（在图中），做数据之前会先对细胞进行预设颜色，聚类之后看图中细胞聚类效果是否好。
        umap_coord1 umap_coord2
cell1   -0.7084047  -0.7232994
cell2   -4.4767964  0.8237284
cell3   1.4870098   -0.4723493

使用input_cds和input_cds的降维坐标创建cicero_cds

#创建Cicero_cds数据
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = umap_coords)

Run cicero：Cicero软件包的主要功能是估计基因组中位点的共可及性，以预测顺式调节相互作用。有两种获取此信息的方法：

第一种方法：直接运行run_cicero

要运行run_cicero，需要一个cicero_CDS对象（在上面创建）和一个基因组坐标文件，其中包含生物体中每个染色体的长度。
导入的cicero包中包括人类hg19坐标和鼠mm9坐标，可以通过data("human.hg19.genome") and data("mouse.mm9.genome")进行访问。

运行cicero

data("mouse.mm9.genome")
# use only a small part of the genome for speed
sample_genome <- subset(mouse.mm9.genome, V1 == "chr2")
sample_genome$V2[1] <- 10000000

## Usually use the whole mouse.mm9.genome ##
## Usually run with sample_num = 100 ##
conns <- run_cicero(cicero_cds, sample_genome, sample_num = 2) 
head(conns)

run_cicero后得到如下数据格式：

#这类细胞中这两段peak的共可及性
Peak1                   Peak2                   coaccess
chr2_3005180_3006128    chr2_3006405_3006928    0.35327965
chr2_3005180_3006128    chr2_3019616_3020066    -0.02461107
chr2_3005180_3006128    chr2_3021952_3022152    0.00000000

第二种方法Call functions separately：分别调用函数，逐步进行
1. estimate_distance_parameter：此函数基于基因组的随机小窗口计算距离惩罚参数。
2. generate_cicero_models：此函数使用上面确定的距离参数，并使用图形化LASSO使用基于距离的惩罚来计算基因组重叠窗口的共可及性得分。
3. assemble_connections：此函数将generate_cicero_models的输出作为输入，并协调重叠的模型以创建最终可访问性分数列表。

2.2 Visualizing Cicero Connections

plot_connections：Cicero程序包包含的一个通用的绘图功能，用于可视化。plot_connections有很多选项，在他们的“Advanced Visualization”部分中进行了详细介绍，若是从可访问性表中获取基本图非常简单。
需要先从ensembl下载与此数据(mm9)关联的GTF并加载它，用于画图

# Download the GTF associated with this data (mm9) from ensembl and load it
# using rtracklayer

# download and unzip
temp <- tempfile()
download.file("ftp://ftp.ensembl.org/pub/release-65/gtf/mus_musculus/Mus_musculus.NCBIM37.65.gtf.gz", temp)
gene_anno <- rtracklayer::readGFF(temp)
# gene_anno <- rtracklayer::readGFF("Mus_musculus.NCBIM37.65.gtf.gz")
unlink(temp)

# rename some columns to match requirements
gene_anno$chromosome <- paste0("chr", gene_anno$seqid)
gene_anno$gene <- gene_anno$gene_id
gene_anno$transcript <- gene_anno$transcript_id
gene_anno$symbol <- gene_anno$gene_name
#以“chr2”形式绘制的区域的染色体。
plot_connections(conns, "chr2", 9773451, 9848598,
                 gene_model = gene_anno, 
                 coaccess_cutoff = .25, 
                 connection_width = .5, 
                 collapseTranscripts = "longest" )

2.3 比较Cicero与其他数据集的联系

compare_connections：将Cicero连接与具有类似连接类型的其他数据集进行比较。此函数将连接对的两个数据帧conns1和conns2作为输入，并从conns2中找到的conns1返回连接的逻辑向量。

#虚构一组ChIA-PET的connections
chia_conns <-  data.frame(Peak1 = c("chr2_3005100_3005200", "chr2_3004400_3004600", 
                                    "chr2_3004900_3005100"), 
                          Peak2 = c("chr2_3006400_3006600", "chr2_3006400_3006600", 
                                    "chr2_3035100_3035200"))
head(chia_conns)

#                  Peak1                Peak2
# 1 chr2_3005100_3005200 chr2_3006400_3006600
# 2 chr2_3004400_3004600 chr2_3006400_3006600
# 3 chr2_3004900_3005100 chr2_3035100_3035200

conns$in_chia <- compare_connections(conns, chia_conns)

对两个联系进行比较，得到如下格式数据，true代表在conns2中也预测到了peak1与peak2存在互作关系：

#                  Peak1                Peak2    coaccess in_chia
# 2 chr2_3005180_3006128 chr2_3006405_3006928  0.35327965    TRUE
# 3 chr2_3005180_3006128 chr2_3019616_3020066 -0.02461107   FALSE
# 4 chr2_3005180_3006128 chr2_3021952_3022152  0.00000000   FALSE
# 5 chr2_3005180_3006128 chr2_3024576_3025188  0.05050716   FALSE
# 6 chr2_3005180_3006128 chr2_3026145_3026392 -0.03521344   FALSE
# 7 chr2_3005180_3006128 chr2_3035075_3037296  0.01305855   FALSE

如果觉得比对结果过于紧密，也可以使用max_gap 参数放松比对松紧度。

comparison_track：Cicero的绘图功能可以直观地比较数据集，比较数据帧必须包括前两个峰值列之外的第三列，称为“ coaccess”，此列用于绘制连接线条高度。

# Add a column of 1s called "coaccess"
chia_conns <-  data.frame(Peak1 = c("chr2_3005100_3005200", "chr2_3004400_3004600", 
                                    "chr2_3004900_3005100"), 
                          Peak2 = c("chr2_3006400_3006600", "chr2_3006400_3006600", 
                                    "chr2_3035100_3035200"),
                          coaccess = c(1, 1, 1))

plot_connections(conns, "chr2", 3004000, 3040000, 
                 gene_model = gene_anno, 
                 coaccess_cutoff = 0,
                 connection_width = .5,
                 comparison_track = chia_conns,
                 comparison_connection_width = .5,
                 nclude_axis_track = FALSE,
                 collapseTranscripts = "longest")

comparison_track_m3.png

2.4 Finding cis-Co-accessibility Networks (CCANS)（寻找顺式可访问性网络）

generate_ccans：Cicero还具有查找Cis-Co-accessibility网络（CCAN）的功能，将“connection data frame”作为输入，并为每个输入峰值输出具有CCAN分配的数据帧（data frame）。未包含在输出数据帧中的site未分配CCAN。
函数generate_ccans具有一个可选输入，称为coaccess_cutoff_override。当coaccess_cutoff_override为NULL时，该函数将根据不同截止点的总CCAN数量来确定并报告CCAN生成的合适的可访问性得分截止值。还可以将coaccess_cutoff_override设置为介于0和1之间的数字，以覆盖该函数的临界值查找部分。

CCAN_assigns <- generate_ccans(conns)

# [1] "Coaccessibility cutoff used: 0.14"

generate_ccans的输出数据格式为：

#                                      Peak CCAN
# chr2_3005180_3006128 chr2_3005180_3006128    6
# chr2_3006405_3006928 chr2_3006405_3006928    6
# chr2_3019616_3020066 chr2_3019616_3020066    1
# chr2_3024576_3025188 chr2_3024576_3025188    6
# chr2_3026145_3026392 chr2_3026145_3026392    1
# chr2_3045478_3046610 chr2_3045478_3046610    6

2.5 Cicero gene activity scores

区域可及性的综合得分与基因表达有更好的一致性，我们将此分数称为Cicero基因活性分数，它是使用两个函数计算得出的。

build_gene_activity_matrix：此函数需要“input CDS ”和“ Cicero connection list”，并输出基因活性得分的未标准化表格。重要说明：输入CDS必须在fData表中的一列中称为“基因”，如果该峰是启动子，则指示该基因；如果该峰是末端，则指示NA。

#### Add a column for the pData table indicating the gene if a peak is a promoter ####
# Create a gene annotation set that only marks the transcription start sites of 
# the genes. We use this as a proxy for promoters.
# To do this we need the first exon of each transcript
pos <- subset(gene_anno, strand == "+")
pos <- pos[order(pos$start),] 
# remove all but the first exons per transcript
pos <- pos[!duplicated(pos$transcript),] 
# make a 1 base pair marker of the TSS
pos$end <- pos$start + 1 

neg <- subset(gene_anno, strand == "-")
neg <- neg[order(neg$start, decreasing = TRUE),] 
# remove all but the first exons per transcript
neg <- neg[!duplicated(neg$transcript),] 
neg$start <- neg$end - 1

gene_annotation_sub <- rbind(pos, neg)

# Make a subset of the TSS annotation columns containing just the coordinates 
# and the gene name
gene_annotation_sub <- gene_annotation_sub[,c("chromosome", "start", "end", "symbol")]

# Rename the gene symbol column to "gene"
names(gene_annotation_sub)[4] <- "gene"
#使用基于坐标重叠的特征数据注释cds的site。
input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub)

tail(fData(input_cds))
# DataFrame with 6 rows and 7 columns
#                                 site_name         chr       bp1       bp2
#                                  <factor> <character> <numeric> <numeric>
# chrY_590469_590895     chrY_590469_590895           Y    590469    590895
# chrY_609312_609797     chrY_609312_609797           Y    609312    609797
# chrY_621772_623366     chrY_621772_623366           Y    621772    623366
# chrY_631222_631480     chrY_631222_631480           Y    631222    631480
# chrY_795887_796426     chrY_795887_796426           Y    795887    796426
# chrY_2397419_2397628 chrY_2397419_2397628           Y   2397419   2397628
#                      num_cells_expressed   overlap        gene
#                                <integer> <integer> <character>
# chrY_590469_590895                     5        NA          NA
# chrY_609312_609797                     7        NA          NA
# chrY_621772_623366                   106         2       Ddx3y
# chrY_631222_631480                     2        NA          NA
# chrY_795887_796426                     1         2       Usp9y
# chrY_2397419_2397628                   4        NA          NA

#### Generate gene activity scores ####
# generate unnormalized gene activity matrix
unnorm_ga <- build_gene_activity_matrix(input_cds, conns)

# remove any rows/columns with all zeroes
unnorm_ga <- unnorm_ga[!Matrix::rowSums(unnorm_ga) == 0, 
                       !Matrix::colSums(unnorm_ga) == 0]

# make a list of num_genes_expressed
num_genes <- pData(input_cds)$num_genes_expressed
names(num_genes) <- row.names(pData(input_cds))

# normalize
cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)

# if you had two datasets to normalize, you would pass both:
# num_genes should then include all cells from both sets
unnorm_ga2 <- unnorm_ga

normalize_gene_activities：对上一个未标准化的结果进行标准化。normalize_gene_activities还需要每个单元总共可访问站点的命名向量。这可以在CDS的pData表中轻松找到，该表称为“ num_genes_expressed”，标准化的基因活性得分范围是0到1。

cicero_gene_activities <- normalize_gene_activities(list(unnorm_ga, unnorm_ga2), 
                                                    num_genes)

2103836.png

2.6 Advanced visualizaton

“plot_connections”函数的Some useful parameters

Viewpoints：可让您仅查看来自基因组中特定位置的连接。这在将数据与4C-seq数据进行比较时可能很有用。
alpha_by_coaccess：使您在担心过度绘图时很有用。此参数使连接曲线的Alpha（透明度）基于协同访问的大小进行缩放
Colors: 有几个与颜色有关的参数：peak_color，comparison_peak_color，connection_color，comparison_connection_color，gene_model_color，viewpoint_color，viewpoint_fill

使用return_as_list自定义所有内容

3. 单细胞可及性轨迹

Cicero软件包的第二个主要功能是扩展Monocle 3，以用于单细胞可访问性数据。染色质可访问性数据要克服的主要障碍是稀疏性，因此大多数扩展和方法都旨在解决这一问题。

3.1 使用可访问性数据构造轨迹

简而言之，Monocle通过三个步骤推断伪时间轨迹：
- Preprocess the data
- Reduce the dimensionality of the data
- Cluster the cells
- Learn the trajectory graph
- Order the cells in pseudotime

仅需进行少量修改就可以对可访问性数据运行以下步骤：
2.1 首先，我们下载并加载数据（与上面相同）：

 # Code to download (54M) and unzip the file - can take a couple 
 minutes 
 # depending on internet connection:
 temp <- textConnection(readLines(gzcon(url("http://staff.washington.edu/hpliner/data/kidney_data.txt.gz"))))
 # read in the data
 cicero_data <- read.table(temp)
 input_cds <- make_atac_cds(cicero_data)

2.2 接下来，我们使用潜在语义索引（LSI）预处理数据，然后继续使用Monocle 3中使用的标准降维方法。

set.seed(2017)
input_cds <- estimate_size_factors(input_cds)
#1
input_cds <- preprocess_cds(input_cds, method = "LSI")
#2
input_cds <- reduce_dimension(input_cds, reduction_method = 'UMAP', preprocess_method = "LSI")
#3
input_cds <- cluster_cells(input_cds)
#4
input_cds <- learn_graph(input_cds)
#5
# cell ordering can be done interactively by leaving out "root_cells"
input_cds <- order_cells(input_cds, 
                    root_cells = "GAGATTCCAGTTGAATCACTCCATCGAGATAGAGGC")

2.3 Plot the results（绘制结果）

plot_cells(input_cds, color_cells_by = "pseudotime")

UMAP_traj_examp.png

3.2 差异可及性分析

Aggregation（聚合）：解决稀疏性以进行差异分析，Cicero软件包处理稀疏单细胞染色质可及性数据的主要方式是通过聚合。
可视化跨伪时的可访问性
使用单细胞染色质可访问性数据运行fit_models：判断位点是否在伪时更改，因此我们将聚集类似的细胞。为此，Cicero提供了函数aggregate_by_cell_bin。

可视化跨伪时的可访问性

#pseudotime:从CDS对象中提取伪时间
input_cds_lin <- input_cds[,is.finite(pseudotime(input_cds))]
#利用伪时间绘制可及性图
plot_accessibility_in_pseudotime(input_cds_lin[c("chr1_3238849_3239700", 
                                                 "chr1_3406155_3407044", 
                                                 "chr1_3397204_3397842")])

example_plot_accessibility_m3.png

plot_accessibility_in_pseudotime:
每个柱的基数为多个cell，这些细胞在当前伪时间段内的开放性（eg:10个细胞中有2个开发，纵坐标则为2）比例作为纵坐标。
黑线表示伪时间依赖的平均可达性平滑二项回归。

使用单细胞染色质可访问性数据运行fit_models

# First, assign a column in the pData table to umap pseudotime
pData(input_cds_lin)$Pseudotime <- pseudotime(input_cds_lin)
#将伪时间轨迹切成10个部分，从而将cell元分配给bin。
pData(input_cds_lin)$cell_subtype <- cut(pseudotime(input_cds_lin), 10)
binned_input_lin <- aggregate_by_cell_bin(input_cds_lin, "cell_subtype")

运行fit_models

# For speed, run fit_models on 1000 randomly chosen genes
set.seed(1000)
acc_fits <- fit_models(binned_input_lin[sample(1:nrow(fData(binned_input_lin)), 1000),], 
                       model_formula_str = "~Pseudotime + num_genes_expressed" )
fit_coefs <- coefficient_table(acc_fits)

# Subset out the differentially accessible sites with respect to Pseudotime
pseudotime_terms <- subset(fit_coefs, term == "Pseudotime" & q_value < .05)
head(pseudotime_terms)

# # A tibble: 2 x 12
#   site_name num_cells_expre… use_for_ordering status term  estimate std_err
#   <fct>                <int> <lgl>            <chr>  <chr>    <dbl>   <dbl>
# 1 chr12_32…                3 FALSE            OK     Pseu…   -0.352  0.0404
# 2 chr14_61…                2 FALSE            OK     Pseu…   -0.413  0.0280
# # … with 5 more variables: test_val <dbl>, p_value <dbl>,
# #   normalized_effect <dbl>, model_component <chr>, q_value <dbl>

4. Useful Functions

annotate_cds_by_site：向CDS对象添加关于峰值的附加注释。例如，您可能想知道哪些峰与外显子或转录起始位点重叠。输入信息有
1. CDS
2. 具有bed格式的信息（染色体，bp1，bp2，其他列）的数据帧或文件路径
find_overlapping_coordinates：只想知道哪些峰与基因组的特定区域重叠