celaref ||单细胞细胞类型定义工具
Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
当我们拿到单细胞转录组数据的时候,不管结果怎么样,翻来覆去发现其实就是一个基因和细胞的矩阵而已!一般基于这个矩阵,会先过滤细胞均一化,降维,聚类(最重要的一步,因为它可以发现细胞的异质性),聚类结果的差异分析(声称找到了每个亚群的marker基因),最后做一下数据的可视化。
且慢,用非监督的方式聚出来的类(一般叫做为分的类数),既然是细胞异质性的体现,那么它到底是什么细胞呢,可以叫做什么,是CD4细胞呢还是microglia细胞?因为是没有意义的,只有起了像CD4这样的名字,这类细胞才有故事可以讲。
在定义细胞类型之前,需要确定就哪种聚类结果来做,是图聚类的结果还是k-means某一类的结果。如何来确定?看看分群的tsne图是一个不错的参考。
那么呢,之前你们家大师推荐过几个数据库single cell marker 基因数据库是基于某亚群细胞的marker基因,根据已经知道marker基因的细胞类型,这些数据库其实就是一个细胞类型和marker基因大表。根据marker基因列表与分析出来的差异基因列表以及基因丰度的关系(往往是做一个热图)来推断某个cluster属于哪一种细胞类型。
另外还有一种定义细胞类型的方法是通过每个cluster与已知细胞类型的表达谱的相似性来定义细胞类型,已有的方法为SingleR、celaref。均是R包,其中celaref是2019年的新品,也是本文主要介绍的一种方法。
celaref 简介
走遍示例流程
安装celaref包的时候,就把示例数据也下载了,首先载入数据,然后观察一下示例数据是怎么组织的,以便我们以后组织自己的数据。在接触一个新包的时候,最好的学习方法就是用示例数据走一遍然后看看有哪些功能(函数以及函数的参数可以用),多用help或者?来查看细节。一个R包封装的越好也就要求我们花时间去了解他的细节。
library(celaref)
# Paths to data files.
counts_filepath.query <- system.file("extdata", "sim_query_counts.tab", package = "celaref")
cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref")
counts_filepath.ref <- system.file("extdata", "sim_ref_counts.tab", package = "celaref")
cell_info_filepath.ref <- system.file("extdata", "sim_ref_cell_info.tab", package = "celaref")
# Load data
toy_ref_se <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref)
head(toy_ref_se)
toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)
可见我载入了两个数据:一个是toy_ref_se,reference;一个是toy_query_se,query.都是SummarizedExperiment对象:
对数据进行过滤,help一下这个函数,看看都有哪些过滤条件。
# Filter data
toy_ref_se <- trim_small_groups_and_low_expression_genes(toy_ref_se)
toy_query_se <- trim_small_groups_and_low_expression_genes(toy_query_se)
对参考和需要鉴定的表达谱分别做差异分析:
# Setup within-experiment differential expression
de_table.toy_ref <- contrast_each_group_to_the_rest(toy_ref_se, dataset_name="ref")
de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se, dataset_name="query")
分别得到差异基因结果:
绘制一组小提琴图,显示每个查询组的“top”基因在参考数据库中的分布。
# Plot
make_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)
根据参考数据集的相似性,在查询数据集中构造一些合理的标签或组/集群。
# And get group labels
make_ref_similarity_names(de_table.toy_query, de_table.toy_ref)
可以看到除了Group2 没有相应的similarity 类似的type之外,其余的都找到了相对应的细胞类型(基于检验的,不是生物学的)。这当然可以为我们定义细胞类型做一个统计上的参考。
准备数据
那么,如何应用我们自己的数据来做呢?官网也给出了数据准备的方法:
在准备数据时需要以下数据:
- Counts Matrix Number of gene tags per gene per cell.
- Cell information A sample-sheet table of cell-level information. Only two fields are essential:
cell_sample: A unique cell identifier
group: The cluster/group to which the cell has been assigned. - Gene information Optional. A table of extra gene-level information.
ID: A unique gene identifier
输入数据:
- tab键分割的ounts matrix
gene | Cell1 | cell2 | cell3 | cell4 | … | cell954 |
---|---|---|---|---|---|---|
GeneA | 0 | 1 | 0 | 1 | … | 0 |
GeneB | 0 | 3 | 0 | 2 | … | 2 |
GeneC | 1 | 40 | 1 | 0 | … | 0 |
.... |
- tab键分割的细胞分群信息
CellId | Sample | Cluster |
---|---|---|
cell1 | Control | cluster1 |
cell2 | Control | cluster7 |
… | … | … |
cell954 | KO | cluster8 |
From 10X pipeline output
dataset_se <- load_dataset_10Xdata('~/path/to/data/10X_mydata',
dataset_genome = "GRCh38",
clustering_set = "kmeans_7_clusters")
拿一个10X的例子试试吧
#Prepare 10X query dataset
先要把数据下载好,发现这应用的对象还是cellranger V2 pipeline的输出结果,注意为了和官网数据格式保持一致,我把文件夹命名为了,这个前后一致就可以了。
+--- 10X_pbmc4k
| +--- analysis
| | +--- clustering
| | | +--- graphclust
| | | | +--- clusters.csv
| | | +--- kmeans_10_clusters
| | | | +--- clusters.csv
| | | +--- kmeans_2_clusters
···
| | +--- diffexp
| | | +--- graphclust
| | | | +--- differential_expression.csv
| | | +--- kmeans_10_clusters
···
| | +--- pca
| | | +--- 10_components
| | | | +--- components.csv
| | | | +--- dispersion.csv
| | | | +--- genes_selected.csv
| | | | +--- projection.csv
| | | | +--- variance.csv
| | +--- tsne
| | | +--- 2_components
| | | | +--- projection.csv
| +--- filtered_gene_bc_matrices
| | +--- GRCh38
| | | +--- barcodes.tsv
| | | +--- genes.tsv
| | | +--- matrix.mtx
datasets_dir <- "D:\\Users\\Administrator\\Desktop\\RStudio\\single_cell\\celaref/celaref_extra_vignette_data/datasets"
dir(datasets_dir)
dataset_se.10X_pbmc4k_k7 <- load_dataset_10Xdata(
dataset_path = file.path(datasets_dir,'10X_pbmc4k'),
dataset_genome = "GRCh38",
clustering_set = "kmeans_7_clusters",
id_to_use = "GeneSymbol")
dataset_se.10X_pbmc4k_k7 <- trim_small_groups_and_low_expression_genes(dataset_se.10X_pbmc4k_k7)
可见我选择的是kmeans_7_clusters 的聚类结果来进行细胞定义。处理过之后:
> dataset_se.10X_pbmc4k_k7
class: SummarizedExperiment
dim: 15407 4340
metadata(0):
assays(1): ''
rownames(15407): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(4): ensembl_ID GeneSymbol ID total_count
colnames(4340): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
colData names(2): cell_sample group
下一步是对数据的转化和处理,也是one-to-others的DE分析,非常的慢和消耗内存,有可能跑断的。关键是这个函数做了是什么?其实就是把我们的矩阵文件整理成上面演示的格式。
de_table.10X_pbmc4k_k7 <- contrast_each_group_to_the_rest(dataset_se.10X_pbmc4k_k7, dataset_name="10X_pbmc4k_k7", num_cores=7)
Sorry, multithreading not supported on windows. Use linux, or set num_threads = 1 to suppress this message.Running single threaded # 这个sorry 可以忽略 ,在Linux下才能指定运行进程数。
`fData` has no primerid. I'll make something up.
`cData` has no wellKey. I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
Done!
Combining coefficients and standard errors
Calculating log-fold changes
Calculating likelihood ratio tests
Refitting on reduced model...
Done!
`fData` has no primerid. I'll make something up.
`cData` has no wellKey. I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
Completed [=======>----------------------------------------------------------------------------------------------------------------] 7% with 0 failures
windows下居然跑了一上午
准备 reference数据集
来自响应的数据库,因为我们的是血细胞,选择的是haemosphere .
A suitable PBMC reference (a ‘HaemAtlas’) has been published by . They purified populations of PBMC cell types and measured gene expression via microarray. The data used here was downloaded in a normalised table from the website (Graaf et al. 2016).
下载页面下载完之后由于是微阵列数据需要做一些特殊处理。
- Logged, normalised expression values. Any low expression or poor quality measurements should have already been removed.
- Sample information (see contrast_each_group_to_the_rest_for_norm_ma_with_limma for details)
library(readr)
this_dataset_dir <- file.path(datasets_dir, 'haemosphere_datasets','watkins')
norm_expression_file <- file.path(this_dataset_dir, "watkins_expression.txt")
samples_file <- file.path(this_dataset_dir, "watkins_samples.txt")
norm_expression_table.full <- read.table(norm_expression_file, sep="\t", header=TRUE, quote="", comment.char="", row.names=1, check.names=FALSE)
samples_table <- read_tsv(samples_file, col_types = cols())
samples_table$description <- make.names( samples_table$description) # Avoid group or extra_factor names starting with numbers, for microarrays
> samples_table$description
[1] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult"
[8] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X63.years.adult" "X63.years.adult"
[15] "X63.years.adult" "X63.years.adult" "X63.years.adult" "X63.years.adult" "X53.years.adult" "X53.years.adult" "X53.years.adult"
[22] "X53.years.adult" "X53.years.adult" "X53.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult"
[29] "X60.years.adult" "X60.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult"
[36] "X48.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult"
[43] "NA." "NA." "NA." "NA." "NA." "NA." "NA."
[50] "NA."
从样本表中可以看出,这个数据集包含了其他组织,但是作为PBMC的参考,我们只考虑外周血样本。与其他数据加载函数一样,要从分析中删除一个样本(或单元格),从样本表中删除它就ok了。
amples_table <- samples_table[samples_table$tissue == "Peripheral Blood",]
> samples_table
# A tibble: 42 x 7
sampleId celltype cell_lineage surface_markers tissue description notes
<chr> <chr> <chr> <lgl> <chr> <chr> <lgl>
1 1674120023_A B lymphocyte B Cell Lineage NA Peripheral Blood X49.years.adult NA
2 1674120023_B granulocyte Neutrophil Lineage NA Peripheral Blood X49.years.adult NA
3 1674120023_C natural killer cell NK Cell Lineage NA Peripheral Blood X49.years.adult NA
4 1674120023_D Th lymphocyte T Cell Lineage NA Peripheral Blood X49.years.adult NA
5 1674120023_E Tc lymphocyte T Cell Lineage NA Peripheral Blood X49.years.adult NA
6 1674120023_F monocyte Macrophage Lineage NA Peripheral Blood X49.years.adult NA
7 1674120053_A B lymphocyte B Cell Lineage NA Peripheral Blood X49.years.adult NA
8 1674120053_B monocyte Macrophage Lineage NA Peripheral Blood X49.years.adult NA
9 1674120053_C granulocyte Neutrophil Lineage NA Peripheral Blood X49.years.adult NA
10 1674120053_D Tc lymphocyte T Cell Lineage NA Peripheral Blood X49.years.adult NA
# ... with 32 more rows
通常情况下,最难的部分是格式化输入。应该使用与查询数据集相同的id,将微阵列表达式值作为规范化的、日志转换的数据。任何探头或样品级的过滤也应事先进行。在这种情况下,从网站获得的数据是正常的,但仍然需要匹配id。
library("tidyverse")
library("illuminaHumanv2.db")
probes_with_gene_symbol_and_with_data <- intersect(keys(illuminaHumanv2SYMBOL),rownames(norm_expression_table.full))
# Get mappings - non NA
probe_to_symbol <- select(illuminaHumanv2.db, keys=rownames(norm_expression_table.full), columns=c("SYMBOL"), keytype="PROBEID")
probe_to_symbol <- unique(probe_to_symbol[! is.na(probe_to_symbol$SYMBOL),])
# no multimapping probes
genes_per_probe <- table(probe_to_symbol$PROBEID) # How many genes a probe is annotated against?
multimap_probes <- names(genes_per_probe)[genes_per_probe > 1]
probe_to_symbol <- probe_to_symbol[!probe_to_symbol$PROBEID %in% multimap_probes, ]
convert_expression_table_ids<- function(expression_table, the_probes_table, old_id_name, new_id_name){
the_probes_table <- the_probes_table[,c(old_id_name, new_id_name)]
colnames(the_probes_table) <- c("old_id", "new_id")
# Before DE, just pick the top expresed probe to represent the gene
# Not perfect, but this is a ranking-based analysis.
# hybridisation issues aside, would expect higher epressed probes to be more relevant to Single cell data anyway.
probe_expression_levels <- rowSums(expression_table)
the_probes_table$avgexpr <- probe_expression_levels[as.character(the_probes_table$old_id)]
the_genes_table <- the_probes_table %>%
group_by(new_id) %>%
top_n(1, avgexpr)
expression_table <- expression_table[the_genes_table$old_id,]
rownames(expression_table) <- the_genes_table$new_id
return(expression_table)
}
# Just the most highly expressed probe foreach gene.
norm_expression_table.genes <- convert_expression_table_ids(norm_expression_table.full,
probe_to_symbol, old_id_name="PROBEID", new_id_name="SYMBOL")
# Go...
de_table.Watkins2009PBMCs <- contrast_each_group_to_the_rest_for_norm_ma_with_limma(
norm_expression_table = norm_expression_table.genes,
sample_sheet_table = samples_table,
dataset_name = "Watkins2009PBMCs",
extra_factor_name = 'description',
sample_name = "sampleId",
group_name = 'celltype')
> de_table.Watkins2009PBMCs
# A tibble: 113,376 x 21
ID logFC AveExpr t P.Value adj.P.Val B ci CI_upper CI_lower ci_inner ci_outer log2FC fdr group sig sig_up
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <lgl> <lgl>
1 JCHA~ 6.86 8.65 27.5 7.13e-26 2.70e-23 49.0 0.0952 6.96 6.77 6.77 6.96 6.86 2.70e-23 B ly~ TRUE TRUE
2 FCRLA 6.41 8.77 23.7 1.13e-23 3.29e-21 44.0 0.103 6.51 6.30 6.30 6.51 6.41 3.29e-21 B ly~ TRUE TRUE
3 VPRE~ 6.18 8.66 40.2 1.20e-31 8.70e-29 62.0 0.0587 6.23 6.12 6.12 6.23 6.18 8.70e-29 B ly~ TRUE TRUE
4 TCL1A 6.09 8.52 42.8 1.33e-32 1.14e-29 64.1 0.0544 6.14 6.04 6.04 6.14 6.09 1.14e-29 B ly~ TRUE TRUE
5 CD79A 6.09 9.08 22.7 5.17e-23 1.40e-20 42.4 0.102 6.20 5.99 5.99 6.20 6.09 1.40e-20 B ly~ TRUE TRUE
6 CD19 5.90 8.44 38.4 6.09e-31 4.11e-28 60.5 0.0587 5.96 5.85 5.85 5.96 5.90 4.11e-28 B ly~ TRUE TRUE
7 HLA-~ 5.61 8.35 47.9 2.30e-34 2.56e-31 68.0 0.0447 5.66 5.57 5.57 5.66 5.61 2.56e-31 B ly~ TRUE TRUE
8 OSBP~ 5.48 7.83 53.2 5.60e-36 8.82e-33 71.4 0.0394 5.52 5.45 5.45 5.52 5.48 8.82e-33 B ly~ TRUE TRUE
9 CNTN~ 5.38 7.95 49.3 8.12e-35 9.60e-32 68.9 0.0416 5.42 5.34 5.34 5.42 5.38 9.60e-32 B ly~ TRUE TRUE
10 COBL~ 5.26 7.70 56.4 6.77e-37 1.28e-33 73.3 0.0356 5.30 5.23 5.23 5.30 5.26 1.28e-33 B ly~ TRUE TRUE
# ... with 113,366 more rows, and 4 more variables: gene_count <int>, rank <int>, rescaled_rank <dbl>, dataset <chr>
细胞类型定义(Compare 10X query PBMCs to to reference)
数据准备好就和之前的示例是一样的了。
make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs)
Hmm, there’s a few clusters where different the top genes are bunched near the top for a couple of different reference cell types.
Logging the plot will be more informative at the top end for this dataset.
make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs, log10trans = TRUE)
可以打标签啦。
label_table.pbmc4k_k7_vs_Watkins2009PBMCs <- make_ref_similarity_names(de_table.10X_pbmc4k_k7, de_table.Watkins2009PBMCs)
label_table.pbmc4k_k7_vs_Watkins2009PBMCs
可以看出,七个群有六个找到了与之相似的群名称,并给出了pval.
make_ranking_violin_plot(de_table.test=de_table.Watkins2009PBMCs, de_table.ref=de_table.10X_pbmc4k_k7, log10trans = TRUE)
行啦,人pbmc细胞的定义工作就完成了。官网也提供了小鼠细胞类型识别的教程。这里就不再复述啦。
完成了细胞类型定义,单细胞的故事才刚刚开始。所以,这不是结束,甚至不是结束的开始,而可能是开始的结束。
SingleR
celaref
Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
SummarizedExperiment