科研信息学

Cicero R软件包

2020-06-30  本文已影响0人  嗷嗷叫的多多

Cicero将数据存储在Cell DataSet类的对象中,该类继承自Bioconductor的ExpressionSet类。使用以下三个函数来操作该对象:

loading data

Cicero使用peak作为其特征数据fData,而不是基因或转录本。许多Cicero函数需要形式为chr1_10390134_10391134的峰值信息,例如:

                            site_name               chromosome    bp1          bp2          
chr10_100002625_100002940   chr10_100002625_100002940   10  100002625   100002940   
chr10_100006458_100007593   chr10_100006458_100007593   10  100006458   100007593
chr10_100011280_100011780   chr10_100011280_100011780   10  100011280   100011780

1. 创建CDS class作为Cicero的输入文件

1.1. 从简单的稀疏矩阵格式加载数据

1.2. 加载 10X scATAC-seq data


2.Constructing cis-regulatory networks

2.1Running Cicero

  1. 创建一个Cicero CDS:
  1. Run cicero:Cicero软件包的主要功能是估计基因组中位点的共可及性,以预测顺式调节相互作用。有两种获取此信息的方法:

2.2 Visualizing Cicero Connections

plot_connections:Cicero程序包包含的一个通用的绘图功能,用于可视化。plot_connections有很多选项,在他们的“Advanced Visualization”部分中进行了详细介绍,若是从可访问性表中获取基本图非常简单。
需要先从ensembl下载与此数据(mm9)关联的GTF并加载它,用于画图

# Download the GTF associated with this data (mm9) from ensembl and load it
# using rtracklayer

# download and unzip
temp <- tempfile()
download.file("ftp://ftp.ensembl.org/pub/release-65/gtf/mus_musculus/Mus_musculus.NCBIM37.65.gtf.gz", temp)
gene_anno <- rtracklayer::readGFF(temp)
# gene_anno <- rtracklayer::readGFF("Mus_musculus.NCBIM37.65.gtf.gz")
unlink(temp)

# rename some columns to match requirements
gene_anno$chromosome <- paste0("chr", gene_anno$seqid)
gene_anno$gene <- gene_anno$gene_id
gene_anno$transcript <- gene_anno$transcript_id
gene_anno$symbol <- gene_anno$gene_name
#以“chr2”形式绘制的区域的染色体。
plot_connections(conns, "chr2", 9773451, 9848598,
                 gene_model = gene_anno, 
                 coaccess_cutoff = .25, 
                 connection_width = .5, 
                 collapseTranscripts = "longest" )

2.3 比较Cicero与其他数据集的联系

  1. compare_connections:将Cicero连接与具有类似连接类型的其他数据集进行比较。此函数将连接对的两个数据帧conns1和conns2作为输入,并从conns2中找到的conns1返回连接的逻辑向量。
#虚构一组ChIA-PET的connections
chia_conns <-  data.frame(Peak1 = c("chr2_3005100_3005200", "chr2_3004400_3004600", 
                                    "chr2_3004900_3005100"), 
                          Peak2 = c("chr2_3006400_3006600", "chr2_3006400_3006600", 
                                    "chr2_3035100_3035200"))
head(chia_conns)

#                  Peak1                Peak2
# 1 chr2_3005100_3005200 chr2_3006400_3006600
# 2 chr2_3004400_3004600 chr2_3006400_3006600
# 3 chr2_3004900_3005100 chr2_3035100_3035200

conns$in_chia <- compare_connections(conns, chia_conns)

对两个联系进行比较,得到如下格式数据,true代表在conns2中也预测到了peak1与peak2存在互作关系:

#                  Peak1                Peak2    coaccess in_chia
# 2 chr2_3005180_3006128 chr2_3006405_3006928  0.35327965    TRUE
# 3 chr2_3005180_3006128 chr2_3019616_3020066 -0.02461107   FALSE
# 4 chr2_3005180_3006128 chr2_3021952_3022152  0.00000000   FALSE
# 5 chr2_3005180_3006128 chr2_3024576_3025188  0.05050716   FALSE
# 6 chr2_3005180_3006128 chr2_3026145_3026392 -0.03521344   FALSE
# 7 chr2_3005180_3006128 chr2_3035075_3037296  0.01305855   FALSE

如果觉得比对结果过于紧密,也可以使用max_gap 参数放松比对松紧度。

  1. comparison_track:Cicero的绘图功能可以直观地比较数据集,比较数据帧必须包括前两个峰值列之外的第三列,称为“ coaccess”,此列用于绘制连接线条高度。
# Add a column of 1s called "coaccess"
chia_conns <-  data.frame(Peak1 = c("chr2_3005100_3005200", "chr2_3004400_3004600", 
                                    "chr2_3004900_3005100"), 
                          Peak2 = c("chr2_3006400_3006600", "chr2_3006400_3006600", 
                                    "chr2_3035100_3035200"),
                          coaccess = c(1, 1, 1))

plot_connections(conns, "chr2", 3004000, 3040000, 
                 gene_model = gene_anno, 
                 coaccess_cutoff = 0,
                 connection_width = .5,
                 comparison_track = chia_conns,
                 comparison_connection_width = .5,
                 nclude_axis_track = FALSE,
                 collapseTranscripts = "longest") 
comparison_track_m3.png

2.4 Finding cis-Co-accessibility Networks (CCANS)(寻找顺式可访问性网络)

  1. generate_ccans:Cicero还具有查找Cis-Co-accessibility网络(CCAN)的功能,将“connection data frame”作为输入,并为每个输入峰值输出具有CCAN分配的数据帧(data frame)。未包含在输出数据帧中的site未分配CCAN。
    函数generate_ccans具有一个可选输入,称为coaccess_cutoff_override。当coaccess_cutoff_override为NULL时,该函数将根据不同截止点的总CCAN数量来确定并报告CCAN生成的合适的可访问性得分截止值。还可以将coaccess_cutoff_override设置为介于0和1之间的数字,以覆盖该函数的临界值查找部分。
CCAN_assigns <- generate_ccans(conns)

# [1] "Coaccessibility cutoff used: 0.14"
  1. generate_ccans的输出数据格式为:
#                                      Peak CCAN
# chr2_3005180_3006128 chr2_3005180_3006128    6
# chr2_3006405_3006928 chr2_3006405_3006928    6
# chr2_3019616_3020066 chr2_3019616_3020066    1
# chr2_3024576_3025188 chr2_3024576_3025188    6
# chr2_3026145_3026392 chr2_3026145_3026392    1
# chr2_3045478_3046610 chr2_3045478_3046610    6

2.5 Cicero gene activity scores

区域可及性的综合得分与基因表达有更好的一致性,我们将此分数称为Cicero基因活性分数,它是使用两个函数计算得出的。

  1. build_gene_activity_matrix:此函数需要“input CDS ”和“ Cicero connection list”,并输出基因活性得分的未标准化表格。重要说明:输入CDS必须在fData表中的一列中称为“基因”,如果该峰是启动子,则指示该基因;如果该峰是末端,则指示NA。
#### Add a column for the pData table indicating the gene if a peak is a promoter ####
# Create a gene annotation set that only marks the transcription start sites of 
# the genes. We use this as a proxy for promoters.
# To do this we need the first exon of each transcript
pos <- subset(gene_anno, strand == "+")
pos <- pos[order(pos$start),] 
# remove all but the first exons per transcript
pos <- pos[!duplicated(pos$transcript),] 
# make a 1 base pair marker of the TSS
pos$end <- pos$start + 1 

neg <- subset(gene_anno, strand == "-")
neg <- neg[order(neg$start, decreasing = TRUE),] 
# remove all but the first exons per transcript
neg <- neg[!duplicated(neg$transcript),] 
neg$start <- neg$end - 1

gene_annotation_sub <- rbind(pos, neg)

# Make a subset of the TSS annotation columns containing just the coordinates 
# and the gene name
gene_annotation_sub <- gene_annotation_sub[,c("chromosome", "start", "end", "symbol")]

# Rename the gene symbol column to "gene"
names(gene_annotation_sub)[4] <- "gene"
#使用基于坐标重叠的特征数据注释cds的site。
input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub)

tail(fData(input_cds))
# DataFrame with 6 rows and 7 columns
#                                 site_name         chr       bp1       bp2
#                                  <factor> <character> <numeric> <numeric>
# chrY_590469_590895     chrY_590469_590895           Y    590469    590895
# chrY_609312_609797     chrY_609312_609797           Y    609312    609797
# chrY_621772_623366     chrY_621772_623366           Y    621772    623366
# chrY_631222_631480     chrY_631222_631480           Y    631222    631480
# chrY_795887_796426     chrY_795887_796426           Y    795887    796426
# chrY_2397419_2397628 chrY_2397419_2397628           Y   2397419   2397628
#                      num_cells_expressed   overlap        gene
#                                <integer> <integer> <character>
# chrY_590469_590895                     5        NA          NA
# chrY_609312_609797                     7        NA          NA
# chrY_621772_623366                   106         2       Ddx3y
# chrY_631222_631480                     2        NA          NA
# chrY_795887_796426                     1         2       Usp9y
# chrY_2397419_2397628                   4        NA          NA

#### Generate gene activity scores ####
# generate unnormalized gene activity matrix
unnorm_ga <- build_gene_activity_matrix(input_cds, conns)

# remove any rows/columns with all zeroes
unnorm_ga <- unnorm_ga[!Matrix::rowSums(unnorm_ga) == 0, 
                       !Matrix::colSums(unnorm_ga) == 0]

# make a list of num_genes_expressed
num_genes <- pData(input_cds)$num_genes_expressed
names(num_genes) <- row.names(pData(input_cds))

# normalize
cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)

# if you had two datasets to normalize, you would pass both:
# num_genes should then include all cells from both sets
unnorm_ga2 <- unnorm_ga
  1. normalize_gene_activities:对上一个未标准化的结果进行标准化。normalize_gene_activities还需要每个单元总共可访问站点的命名向量。这可以在CDS的pData表中轻松找到,该表称为“ num_genes_expressed”,标准化的基因活性得分范围是0到1。
cicero_gene_activities <- normalize_gene_activities(list(unnorm_ga, unnorm_ga2), 
                                                    num_genes)
2103836.png

2.6 Advanced visualizaton

  1. “plot_connections”函数的Some useful parameters
  1. 使用return_as_list自定义所有内容

3. 单细胞可及性轨迹

Cicero软件包的第二个主要功能是扩展Monocle 3,以用于单细胞可访问性数据。染色质可访问性数据要克服的主要障碍是稀疏性,因此大多数扩展和方法都旨在解决这一问题。

3.1 使用可访问性数据构造轨迹

  1. 简而言之,Monocle通过三个步骤推断伪时间轨迹:
    • Preprocess the data
    • Reduce the dimensionality of the data
    • Cluster the cells
    • Learn the trajectory graph
    • Order the cells in pseudotime
  2. 仅需进行少量修改就可以对可访问性数据运行以下步骤:
    2.1 首先,我们下载并加载数据(与上面相同):
     # Code to download (54M) and unzip the file - can take a couple 
     minutes 
     # depending on internet connection:
     temp <- textConnection(readLines(gzcon(url("http://staff.washington.edu/hpliner/data/kidney_data.txt.gz"))))
     # read in the data
     cicero_data <- read.table(temp)
     input_cds <- make_atac_cds(cicero_data)
    
    2.2 接下来,我们使用潜在语义索引(LSI)预处理数据,然后继续使用Monocle 3中使用的标准降维方法。
    set.seed(2017)
    input_cds <- estimate_size_factors(input_cds)
    #1
    input_cds <- preprocess_cds(input_cds, method = "LSI")
    #2
    input_cds <- reduce_dimension(input_cds, reduction_method = 'UMAP', preprocess_method = "LSI")
    #3
    input_cds <- cluster_cells(input_cds)
    #4
    input_cds <- learn_graph(input_cds)
    #5
    # cell ordering can be done interactively by leaving out "root_cells"
    input_cds <- order_cells(input_cds, 
                        root_cells = "GAGATTCCAGTTGAATCACTCCATCGAGATAGAGGC")
    
    2.3 Plot the results(绘制结果)
    plot_cells(input_cds, color_cells_by = "pseudotime")
    
UMAP_traj_examp.png

3.2 差异可及性分析

  1. Aggregation(聚合):解决稀疏性以进行差异分析,Cicero软件包处理稀疏单细胞染色质可及性数据的主要方式是通过聚合。
  2. 可视化跨伪时的可访问性
  3. 使用单细胞染色质可访问性数据运行fit_models:判断位点是否在伪时更改,因此我们将聚集类似的细胞。为此,Cicero提供了函数aggregate_by_cell_bin。
#pseudotime:从CDS对象中提取伪时间
input_cds_lin <- input_cds[,is.finite(pseudotime(input_cds))]
#利用伪时间绘制可及性图
plot_accessibility_in_pseudotime(input_cds_lin[c("chr1_3238849_3239700", 
                                                 "chr1_3406155_3407044", 
                                                 "chr1_3397204_3397842")])
example_plot_accessibility_m3.png

plot_accessibility_in_pseudotime:
每个柱的基数为多个cell,这些细胞在当前伪时间段内的开放性(eg:10个细胞中有2个开发,纵坐标则为2)比例作为纵坐标。
黑线表示伪时间依赖的平均可达性平滑二项回归。

# First, assign a column in the pData table to umap pseudotime
pData(input_cds_lin)$Pseudotime <- pseudotime(input_cds_lin)
#将伪时间轨迹切成10个部分,从而将cell元分配给bin。
pData(input_cds_lin)$cell_subtype <- cut(pseudotime(input_cds_lin), 10)
binned_input_lin <- aggregate_by_cell_bin(input_cds_lin, "cell_subtype")
# For speed, run fit_models on 1000 randomly chosen genes
set.seed(1000)
acc_fits <- fit_models(binned_input_lin[sample(1:nrow(fData(binned_input_lin)), 1000),], 
                       model_formula_str = "~Pseudotime + num_genes_expressed" )
fit_coefs <- coefficient_table(acc_fits)

# Subset out the differentially accessible sites with respect to Pseudotime
pseudotime_terms <- subset(fit_coefs, term == "Pseudotime" & q_value < .05)
head(pseudotime_terms)

# # A tibble: 2 x 12
#   site_name num_cells_expre… use_for_ordering status term  estimate std_err
#   <fct>                <int> <lgl>            <chr>  <chr>    <dbl>   <dbl>
# 1 chr12_32…                3 FALSE            OK     Pseu…   -0.352  0.0404
# 2 chr14_61…                2 FALSE            OK     Pseu…   -0.413  0.0280
# # … with 5 more variables: test_val <dbl>, p_value <dbl>,
# #   normalized_effect <dbl>, model_component <chr>, q_value <dbl>

4. Useful Functions

上一篇 下一篇

猜你喜欢

热点阅读