单细胞测序单细胞测序专题集合

10X单细胞空间数据分析之表征细胞状态和生态型(EcoTyper

2022-01-13  本文已影响0人  单细胞空间交响乐

隔离的第四天,一个人的日子总是很难熬,孤独的生活却总是来到我的身边,这一篇我们来分享EcoTyper的示例代码,帮助我们来寻找细胞特异性转录状态和共关联模式。这篇发表于cell的软件,文献的分享在10X单细胞空间数据分析之表征细胞状态和生态型(EcoTyper),我们来看看软件的具体使用方法,当然了,我们更加关注单细胞和空间数据的部分

EcoTyper 是一个机器学习框架,用于从bulk和单细胞 (scRNA-seq) 表达数据中大规模识别细胞类型特异性转录状态及其共关联模式

图片.png

软件本身已经定义了癌症(Luca/Steen 等人,Cell 2021)和弥漫性大 B 细胞淋巴瘤(DLBCL)(Steen/Luca 等人,Cancer Cell 2021)中的细胞状态和生态型。 当前版本的 EcoTyper 允许用户在他们自己的数据中恢复这两种肿瘤类别的细胞状态和生态型。 此外,它允许用户在他们感兴趣的系统中发现和恢复细胞状态和生态型,包括直接从 scRNA-seq 数据中恢复

安装EcoTyper

wget https://github.com/digitalcytometry/ecotyper/archive/refs/heads/master.zip
unzip master.zip
cd ecotyper-master
Basic resources

下面列出的 R 包是运行 EcoTyper 所必需的。 版本号表示用于开发和测试 EcoTyper 代码的包版本。 其他 R 版本也可能工作:

这些包与预先存储在 EcoTyper 文件夹中的其他资源一起,允许

当然了,我们关注单细胞和空间数据的分析

When the input is scRNA-seq or bulk-sorted cell type-specific profiles (e.g., FACS-purified), EcoTyper performs the following major steps for discovering cell states and ecotypes:

无论用于导出细胞状态和生态型的输入类型如何,EcoTyper 都可以在外部表达数据集中执行细胞状态和生态型恢复。 可以bulk、scRNA-seq 和空间转录组数据进行恢复。

我们来看看示例,提供的 scRNA-seq 数据中恢复细胞状态和生态型

EcoTyper 预装了用户提供的 scRNA-seq 数据中先前在癌症和淋巴瘤中定义的细胞状态和生态型的参考指导恢复所需的资源。

我们先来看看主脚本EcoTyper_recovery_scRNA.R

suppressPackageStartupMessages({
library(argparse)
source("pipeline/lib/config.R") 
source("pipeline/lib/multithreading.R")
})

parser <- ArgumentParser(add_help = F)

arguments = parser$add_argument_group('Arguments')

arguments$add_argument("-d", "--discovery", type="character", default="Carcinoma", 
    help=paste0("The name of the discovery dataset used to define cell states and ecotypes. Accepted values: ",
    "'Carcinoma' will recover the cell states and ecotypes defined across carcinomas, as described in the EcoTyper carcinoma paper, ",
    "'Lymphoma' will recover the cell states and ecotypes defined in diffuse large B cell lymphoma (DLBCL), as described in the EcoTyper lymphoma paper, ",
    "'<MyDiscovery>' the value used in the field 'Discovery dataset name' of the config file used for running EcoTyper discovery ('EcoTyper_discovery.R') script. ",
    "[default: '%(default)s']"),
    metavar="<character>") 
arguments$add_argument("-m", "--matrix", type = "character", metavar="<PATH>",
    help="Path to a tab-delimited file containing the input scRNA-seq expression matrix, with gene names on the first column and cell ids as column names [required].")
arguments$add_argument("-a", "--annotation", type = "character",  metavar="<PATH>",
    help="Path to a tab-delimited annotation file containing the annotation of cells in the input matrix. This file should contain at least two columns, 'ID' with the same values as the columns of the expression matrix, and 'CellType' (case sensitive) which contains the cell type for each cell. These values are limited to the set of cell types analyzed in the discovery dataset. If the argument '-d' is set to 'Carcinoma', then the accepted values for column 'CellType' are: 'B.cells', 'CD4.T.cells', 'CD8.T.cells', 'Dendritic.cells', 'Endothelial.cells', 'Epithelial.cells', 'Fibroblasts', 'Mast.cells', 'Monocytes.and.Macrophages', 'NK.cells', 'PCs' and 'PMNs'. If the argument '-d' is set to 'Lymphoma', then the accepted values for column 'CellType' are: 'B.cells', 'Plasma.cells', 'T.cells.CD8', 'T.cells.CD4', 'T.cells.follicular.helper', 'Tregs', 'NK.cells', 'Monocytes.and.Macrophages', 'Dendritic.cells', 'Mast.cells', 'Neutrophils', 'Fibroblasts', 'Endothelial.cells'. All other values will be ignored for these two cases. Additionally, this file can contain any number of columns, that can be used for plotting color bars in the output heatmaps (see argument '-c'). [required]")
arguments$add_argument("-c", "--columns",  type = "character",  metavar="<character>", default="NULL", 
    help="A comma-spearated list of column names from the annotation file to be plotted as color bar in the output heatmaps. [default: '%(default)s']")
arguments$add_argument("-z", "--z-score",  type = "character",  metavar="<bool>", default="FALSE", 
    help="A flag indicating whether the significance quantification procedure should be run. Note that this procedure might be slow, as the NMF model is applied 30 times on the same dataset. [default: '%(default)s']")
arguments$add_argument("-s", "--subsample",  type = "integer",  metavar="<integer>", default=-1, 
    help="An integer specifying the number of cells each cell type will be downsampled to. For values <50, no downsampling will be performed. [default: '%(default)s' (no downsampling)]")
arguments$add_argument("-t", "--threads",  type = "integer",  metavar="<integer>", default=10, 
    help="Number of threads. [default: '%(default)s']")
arguments$add_argument("-o", "--output", type = "character",  metavar="<PATH>", default="RecoveryOutput",
    help="Output directory path. [default: '%(default)s']")
arguments$add_argument("-h", "--help", action='store_true', help="Print help message.")


args <- parser$parse_args()
#print(args)
if(is.null(args$matrix))
{
    parser$print_help()
    quit()
}

input_mat = normalizePath(args$matrix)
discovery = args$discovery
annotation_path = normalizePath(args$annotation)
columns = args$columns
z_flag = as.logical(args$z)
n_threads = as.numeric(args$threads)
subsample = as.integer(args$subsample)

if(!file.exists(input_mat))
{
    stop("Error: Path to the input expression matrix does not exist!")
}

if(!discovery %in% c("Carcinoma", "Lymphoma"))
{
    config_file = file.path("EcoTyper", discovery, "config_used.yml")   
    if(!file.exists(config_file))
    {
        stop("Error: Cannot read the config file used for the discovery of cell states and ecotypes. It should be in the following path: '", config_file, "'. Please make sure that the '--discovery (-d)' argument provided is correct!")
    }
    config <- config::get(file = config_file)
    fractions = config$"Input"$"Cell type fractions" 
    if(is.null(fractions))
    {
        if(config$"Pipeline settings"$"Filter non cell type specific genes")
        {
            fractions = "Cell_type_specific_genes"
        }else{
            fractions = "All_genes"
        }
    }else{
        if(!fractions %in% c("Carcinoma_Fractions", "Lymphoma_Fractions"))
        {
            fractions = "Custom"
        }
    }
}else{
    fractions = paste0(discovery, "_Fractions") 
}

if(!file.exists(annotation_path))
{
    stop("Error: Path to the input annotation file does not exist!")
}else{
    annotation = read.delim(annotation_path)    
    if(!all(c("ID", "CellType") %in% colnames(annotation)))
    {
        stop("Error: The annotation file provided, does not contain the columns 'ID' and 'CellType.")
    }

    if("Sample" %in% colnames(annotation))
    {
        recover_ecotypes = T
        
        ecotypes = read.delim(file.path("EcoTyper", discovery, fractions, "Ecotypes", "discovery", "ecotypes.txt"))
        disc_cell_types = table(ecotypes$CellType)
        rec_cell_types= table(annotation$CellType)
        if(length(rec_cell_types) * 2 >= length(disc_cell_types))
        {
            cat("The annotation file contains column 'Sample', and more than half of the cell types are present in the recovery dataset. Will perform ecotype recovery.\n")
        }else{
            cat("The annotation file contains column 'Sample', but less than half of the cell types are present in the recovery dataset. Will NOT perform ecotype recovery.\n")
        }

    }else{
        recover_ecotypes = F
        cat("The annotation file does not contain column 'Sample'. Will NOT perform ecotype recovery.\n")
    }

    if(columns != "NULL")
    {
        additional_columns = strsplit(columns, ",")[[1]]
    
        if(!all(additional_columns %in% colnames(annotation)))
        {
            stop(paste0("The following columns are missing from the annotation file provided: ", "'", 
                paste(additional_columns[!additional_columns %in% colnames(annotation)], collapse = "'"), "'"))         
        } 
    }else{
        additional_columns = c()
    }
}

recovery = gsub(".tsv$", "", gsub(".txt$", "", basename(input_mat)))

input_dir = file.path("datasets", "scRNA", recovery)
dir.create(input_dir, recursive = T, showWarning = F)

dir.create(file.path(args$output, recovery), recursive = T, showWarnings = F)
final_output = normalizePath(file.path(args$output, recovery))

PushToJobQueue(paste0("ln -sf ", input_mat, " ", file.path(input_dir, "data.txt")))
RunJobQueue()
PushToJobQueue(paste0("ln -sf ", annotation_path, " ", file.path(input_dir, "annotation.txt")))
RunJobQueue()

start = Sys.time()
cur_dir = getwd()
setwd("pipeline")

key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
key = key[key[,1] %in% annotation$CellType,]
if(nrow(key) == 0)
{
    stop("Error: The column 'CellType' of the annotation file contains invalid values!")
}
for(cell_type in key[,1])
{
    n_states = key[key[,1] == cell_type, 2]
    PushToJobQueue((paste("Rscript state_recovery_scRNA.R", discovery, fractions, cell_type, n_states, recovery, z_flag, subsample, paste(additional_columns, collapse = " ")))) 
}   
RunJobQueue()

if(recover_ecotypes)
{
    PushToJobQueue((paste("Rscript ecotypes_recovery_scRNA.R", discovery, fractions, recovery))) 
    RunJobQueue()
}

cat("\nCopying EcoTyper results to the output folder...\n")

if(file.exists(final_output) && length(list.files(final_output)) > 0)
{
    old_results_folder = paste0(final_output, format(Sys.time(), " %a %b %d %X %Y"))
    dir.create(old_results_folder, recursive = T, showWarnings = F)
    warning(paste0("The output folder contains files from a previous run. Moving those files to: '", old_results_folder, "'"))  
    system(paste0("mv -f ", final_output, "/* '", old_results_folder, "'"))
}

dir.create(final_output, recursive = T, showWarnings = F)

key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
key = key[key[,1] %in% annotation$CellType,] 
for(cell_type in key[,1])
{       
    n_clusters = key[key[,1] == cell_type, 2]   
    ct_output = file.path(final_output, cell_type)
    dir.create(ct_output, recursive = T, showWarnings = F)
        
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", "scRNA", recovery, cell_type, n_clusters, "state_assignment.txt"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", "scRNA", recovery, cell_type, n_clusters, "state_assignment_heatmap.pdf"), ct_output))      
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", "scRNA", recovery, cell_type, n_clusters, "state_assignment_heatmap.png"), ct_output))      
    if(z_flag)
    {
        system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", "scRNA", recovery, cell_type, n_clusters, "recovery_z_scores.txt"), ct_output)) 
    }
}   

if(recover_ecotypes)
{
    ct_output = file.path(final_output, "Ecotypes")
    dir.create(ct_output, recursive = T, showWarnings = F)
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "recovery", recovery, "ecotype_assignment.txt"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "recovery", recovery, "ecotype_abundance.txt"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "recovery", recovery, "heatmap_assigned_samples_viridis.pdf"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "recovery", recovery, "heatmap_assigned_samples_viridis.png"), ct_output))
}

end = Sys.time()
cat(paste0("\nEcoTyper finished succesfully! Please find the results in: '", final_output, "'.\nRun time: ", format(end - start, digits = 0), "\n"))

运行一下,

Rscript EcoTyper_recovery_scRNA.R -h
## usage: EcoTyper_recovery_scRNA.R [-d <character>] [-m <PATH>] [-a <PATH>]
##                                  [-c <character>] [-z <bool>] [-s <integer>]
##                                  [-t <integer>] [-o <PATH>] [-h]
## 
## Arguments:
##   -d <character>, --discovery <character>
##                         The name of the discovery dataset used to define cell
##                         states and ecotypes. Accepted values: 'Carcinoma' will
##                         recover the cell states and ecotypes defined across
##                         carcinomas, as described in the EcoTyper carcinoma
##                         paper, 'Lymphoma' will recover the cell states and
##                         ecotypes defined in diffuse large B cell lymphoma
##                         (DLBCL), as described in the EcoTyper lymphoma paper,
##                         '<MyDiscovery>' the value used in the field 'Discovery
##                         dataset name' of the config file used for running
##                         EcoTyper discovery ('EcoTyper_discovery.R') script.
##                         [default: 'Carcinoma']
##   -m <PATH>, --matrix <PATH>
##                         Path to a tab-delimited file containing the input
##                         scRNA-seq expression matrix, with gene names on the
##                         first column and cell ids as column names [required].
##   -a <PATH>, --annotation <PATH>
##                         Path to a tab-delimited annotation file containing the
##                         annotation of cells in the input matrix. This file
##                         should contain at least two columns, 'ID' with the
##                         same values as the columns of the expression matrix,
##                         and 'CellType' (case sensitive) which contains the
##                         cell type for each cell. These values are limited to
##                         the set of cell types analyzed in the discovery
##                         dataset. If the argument '-d' is set to 'Carcinoma',
##                         then the accepted values for column 'CellType' are:
##                         'B.cells', 'CD4.T.cells', 'CD8.T.cells',
##                         'Dendritic.cells', 'Endothelial.cells',
##                         'Epithelial.cells', 'Fibroblasts', 'Mast.cells',
##                         'Monocytes.and.Macrophages', 'NK.cells', 'PCs' and
##                         'PMNs'. If the argument '-d' is set to 'Lymphoma',
##                         then the accepted values for column 'CellType' are:
##                         'B.cells', 'Plasma.cells', 'T.cells.CD8',
##                         'T.cells.CD4', 'T.cells.follicular.helper', 'Tregs',
##                         'NK.cells', 'Monocytes.and.Macrophages',
##                         'Dendritic.cells', 'Mast.cells', 'Neutrophils',
##                         'Fibroblasts', 'Endothelial.cells'. All other values
##                         will be ignored for these two cases. Additionally,
##                         this file can contain any number of columns, that can
##                         be used for plotting color bars in the output heatmaps
##                         (see argument '-c'). [required]
##   -c <character>, --columns <character>
##                         A comma-spearated list of column names from the
##                         annotation file to be plotted as color bar in the
##                         output heatmaps. [default: 'NULL']
##   -z <bool>, --z-score <bool>
##                         A flag indicating whether the significance
##                         quantification procedure should be run. Note that this
##                         procedure might be slow, as the NMF model is applied
##                         30 times on the same dataset. [default: 'FALSE']
##   -s <integer>, --subsample <integer>
##                         An integer specifying the number of cells each cell
##                         type will be downsampled to. For values <50, no
##                         downsampling will be performed. [default: '-1' (no
##                         downsampling)]
##   -t <integer>, --threads <integer>
##                         Number of threads. [default: '10']
##   -o <PATH>, --output <PATH>
##                         Output directory path. [default: 'RecoveryOutput']
##   -h, --help            Print help message.
参数解析:
data = read.delim("example_data/scRNA_CRC_data.txt", nrow = 5)
head(data[,1:5])
##      Gene SMC01.T_AAAGATGCATGGATGG SMC01.T_AAAGTAGCAAGGACAC
## 1    A1BG                        0                        0
## 2    A1CF                        0                        0
## 3     A2M                        0                        0
## 4   A2ML1                        0                        0
## 5 A3GALT2                        0                        0
##   SMC01.T_AAATGCCAGGATCGCA SMC01.T_AACTCTTCACAACGCC
## 1                        0                        0
## 2                        0                        0
## 3                        0                        0
## 4                        0                        0
## 5                        0                        0
data = read.delim("example_data/scRNA_CRC_annotation.txt")
head(data)
##                      Index Patient Class  Sample        Cell_type Cell_subtype
## 1 SMC01-T_AAAGATGCATGGATGG   SMC01 Tumor SMC01-T Epithelial cells         CMS2
## 2 SMC01-T_AAAGTAGCAAGGACAC   SMC01 Tumor SMC01-T Epithelial cells         CMS2
## 3 SMC01-T_AAATGCCAGGATCGCA   SMC01 Tumor SMC01-T Epithelial cells         CMS2
## 4 SMC01-T_AACTCTTCACAACGCC   SMC01 Tumor SMC01-T Epithelial cells         CMS2
## 5 SMC01-T_AACTTTCGTTCGGGCT   SMC01 Tumor SMC01-T Epithelial cells         CMS2
## 6 SMC01-T_AAGGTTCTCCAATGGT   SMC01 Tumor SMC01-T Epithelial cells         CMS2
##           CellType                       ID Tissue
## 1 Epithelial.cells SMC01.T_AAAGATGCATGGATGG  Tumor
## 2 Epithelial.cells SMC01.T_AAAGTAGCAAGGACAC  Tumor
## 3 Epithelial.cells SMC01.T_AAATGCCAGGATCGCA  Tumor
## 4 Epithelial.cells SMC01.T_AACTCTTCACAACGCC  Tumor
## 5 Epithelial.cells SMC01.T_AACTTTCGTTCGGGCT  Tumor
## 6 Epithelial.cells SMC01.T_AAGGTTCTCCAATGGT  Tumor
运行一下单细胞数据
Rscript EcoTyper_recovery_scRNA.R -d Carcinoma -m example_data/scRNA_CRC_data.txt -a example_data/scRNA_CRC_annotation.txt -o RecoveryOutput

The outputs of this script include the following files, for each cell type provided:

data = read.delim("RecoveryOutput/scRNA_CRC_data/Fibroblasts/state_assignment.txt")
head(data[,c("ID", "State")])
##                         ID State
## 1 SMC01.T_TGCGCAGTCGGATGGA   S01
## 2 SMC04.T_CACAAACTCTACTATC   S01
## 3 SMC15.T_GCGCGATTCATAAAGG   S01
## 4 SMC17.T_GTACGTAGTGACTACT   S01
## 5 SMC20.T_CTAAGACCACTGTCGG   S01
## 6 SMC20.T_GTTACAGTCGCGTTTC   S01
两个热图:一个表示发现数据集中细胞状态标记基因表达的热图,一个表示 scRNA-seq 数据集中相同标记基因表达的热图,经过平滑处理以减轻 scRNA-seq 丢失的影响:
knitr::include_graphics("RecoveryOutput/scRNA_CRC_data/Fibroblasts/state_assignment_heatmap.png")
state_assignment_heatmap.png

如果应用统计显着性量化方法,则每个细胞状态的结果 z 分数将输出到同一目录中:

#data = read.delim("RecoveryOutput/scRNA_CRC_data/Epithelial.cells/recovery_z_scores.txt")
#head(data[,c("State", "Z")])

The output for ecotypes includes:

assign = read.delim("RecoveryOutput/scRNA_CRC_data/Ecotypes/ecotype_abundance.txt")
dim(assign)

## [1] 10 33
head(assign[,1:5])
##        SMC01.N     SMC01.T    SMC02.N    SMC02.T    SMC03.N
## CE1 0.03013608 0.151493627 0.10825504 0.21601181 0.02930968
## CE2 0.00000000 0.009612861 0.00000000 0.02671883 0.00000000
## CE3 0.05199290 0.077968658 0.21188075 0.07300246 0.00000000
## CE4 0.01693878 0.043794663 0.03124202 0.00000000 0.00000000
## CE5 0.06285928 0.022531963 0.01115162 0.06406829 0.03197420
## CE6 0.16508230 0.059542004 0.24436913 0.01855854 0.43445664
discrete_assignments = read.delim("RecoveryOutput/scRNA_CRC_data/Ecotypes/ecotype_abundance.txt")
dim(discrete_assignments)

## [1] 10 33
head(discrete_assignments[,1:5])
##        SMC01.N     SMC01.T    SMC02.N    SMC02.T    SMC03.N
## CE1 0.03013608 0.151493627 0.10825504 0.21601181 0.02930968
## CE2 0.00000000 0.009612861 0.00000000 0.02671883 0.00000000
## CE3 0.05199290 0.077968658 0.21188075 0.07300246 0.00000000
## CE4 0.01693878 0.043794663 0.03124202 0.00000000 0.00000000
## CE5 0.06285928 0.022531963 0.01115162 0.06406829 0.03197420
## CE6 0.16508230 0.059542004 0.24436913 0.01855854 0.43445664
knitr::include_graphics("RecoveryOutput/scRNA_CRC_data/Ecotypes/heatmap_assigned_samples_viridis.png") 
heatmap_assigned_samples_viridis.png

示例2,Recovery of Lymphoma Cell States and Ecotypes in scRNA-seq Data

Rscript EcoTyper_recovery_scRNA.R -d Lymphoma -m example_data/scRNA_lymphoma_data.txt -a example_data/scRNA_lymphoma_annotation.txt -o RecoveryOutput -c Tissue
data = read.delim("RecoveryOutput/scRNA_lymphoma_data/B.cells/state_assignment.txt")
head(data[,c("ID", "State")])
##       ID State
## 1 Cell_2   S01
## 2 Cell_3   S01
## 3 Cell_4   S01
## 4 Cell_6   S01
## 5 Cell_8   S01
## 6 Cell_9   S01

热图

knitr::include_graphics("RecoveryOutput/scRNA_lymphoma_data/B.cells/state_assignment_heatmap.png")
state_assignment_heatmap.png

Recovery of Cell States and Ecotypes in Spatial Transcriptomics data(空间转录组示例),当然了,主要是10Xgenomics的空间数据

为了使 EcoTyper 在 Visium 数据中执行细胞状态和生态类型恢复,需要提供以下资源:

如果要分析的系统中预期的主要细胞群被 EcoTyper 癌论文中分析的细胞群(B 细胞、CD4 T 细胞、CD8 T 细胞、树突细胞、内皮细胞、上皮细胞、成纤维细胞、肥大细胞、 单核细胞/巨噬细胞、NK 细胞、浆细胞、中性粒细胞)或 EcoTyper 淋巴瘤纸(B 细胞、CD4 T 细胞、CD8 T 细胞、滤泡辅助 T 细胞、Tregs、树突细胞、内皮细胞、成纤维细胞、肥大细胞、单核细胞/ 巨噬细胞、NK细胞、浆细胞、中性粒细胞),那么需要:
如果要分析的系统中预期的主要细胞群没有被 EcoTyper 癌论文中分析的细胞群(B 细胞、CD4 T 细胞、CD8 T 细胞、树突细胞、内皮细胞、上皮细胞、成纤维细胞、肥大细胞)概括 、单核细胞/巨噬细胞、NK 细胞、浆细胞、中性粒细胞)或 EcoTyper 淋巴瘤论文(B 细胞、CD4 T 细胞、CD8 T 细胞、滤泡辅助 T 细胞、Tregs、树突状细胞、内皮细胞、成纤维细胞、肥大细胞、单核细胞 /巨噬细胞、NK细胞、浆细胞、中性粒细胞),则用户需要为这些群体提供自己的细胞类型比例估计:

首先看一下主脚本

suppressPackageStartupMessages({
library(config)
library(argparse)
source("pipeline/lib/config.R") 
source("pipeline/lib/misc.R") 
source("pipeline/lib/multithreading.R")
})

parser <- ArgumentParser(add_help = F)

arguments = parser$add_argument_group('Arguments')
arguments$add_argument("-c", "--config", type = "character", metavar="<PATH>", 
    help="Path to the config files [required].")
arguments$add_argument("-h", "--help", action='store_true', help="Print help message.")

args <- parser$parse_args()

if(args$h || is.null(args$config))
{
    parser$print_help()
    quit()
}

config_file = abspath(args$config)

config <- config::get(file = config_file)

discovery = config$Input$"Discovery dataset name"
recovery = config$Input$"Recovery dataset name"
input_path = config$Input$"Input Visium directory"
fractions = config$Input$"Recovery cell type fractions"
coo = config$Input$"Malignant cell of origin"
CSx_username = config$Input$"CIBERSORTx username"
CSx_token = config$Input$"CIBERSORTx token"
n_threads = config$"Pipeline settings"$"Number of threads"

final_output = config$"Output"$"Output folder"

suppressWarnings({
    input_path = abspath(input_path)    
    final_output = abspath(file.path(final_output, recovery))   
})

suppressWarnings({
    fractions_path = abspath(fractions) 
    })

if(!discovery %in% c("Carcinoma", "Lymphoma"))
{
    discovery_config_file = file.path("EcoTyper", discovery, "config_used.yml") 
    if(!file.exists(discovery_config_file))
    {
        stop("Error: Cannot read the configuration file used for the discovery of cell states and ecotypes. It should be in the following path: '", config_file, "'. Please make sure that the '--discovery (-d)' argument provided is correct!")
    }
    config <- config::get(file = discovery_config_file)
    discovery_fractions = config$"Input"$"Cell type fractions"
    if(!is.null(discovery_fractions) && discovery_fractions %in% c("Carcinoma_Fractions", "Lymphoma_Fractions"))
    {
        fractions = discovery_fractions
    }else{
        if(!is.null(config$"Input"$"Filter non cell type specific genes"))
        {
            if(config$"Input"$"Filter non cell type specific genes")
            {
                fractions = "Cell_type_specific_genes"
            }else{
                fractions = "All_genes"
            }
        }else{
            fractions = "Custom"    
        }
    }
}else{
    fractions = paste0(discovery, "_Fractions") 
}

#Starting EcoTyper
setwd("pipeline")
start = Sys.time()

cat("\nLoading visium data...\n")
PushToJobQueue(paste("Rscript spatial_load_visium_data.R", recovery, input_path))
RunJobQueue()

if(fractions %in% c("Carcinoma_Fractions", "Lymphoma_Fractions") && !file.exists(fractions_path))
{
    cat("\nRunning CIBERSORTxFractions on the visium dataset...\n")
    PushToJobQueue(paste("Rscript csx_fractions.R", "visium", recovery, "LM22", "B_mode", CSx_username, CSx_token, TRUE))
    RunJobQueue()
    PushToJobQueue(paste("Rscript csx_fractions.R", "visium", recovery, "TR4", "B_mode", CSx_username, CSx_token, TRUE))            
    RunJobQueue()
    PushToJobQueue(paste("Rscript csx_fractions_two_tiered.R", "visium", recovery, "TR4", "B_mode", "LM22", "B_mode", fractions))
    RunJobQueue()
    coo = "Epithelial.cells"
    if(fractions %in% "Lymphoma_Fractions")
    {
        coo = "B.cells"
    }
}else{
    cat("\nLoading user-provided cell type fractions...\n")
        
    dir.create(file.path("../CIBERSORTx/fractions/visium", recovery, fractions), recursive = T, showWarnings = F)
    PushToJobQueue(paste("cp -f", fractions_path, file.path("../CIBERSORTx/fractions/visium", recovery, fractions, "CIBERSORTx_Adjusted.txt")))
    RunJobQueue()       
}

cat("\nRunning cell state recovery on the visium dataset...\n")
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
    n_states = key[key[,1] == cell_type, 2]
    PushToJobQueue((paste("Rscript state_recovery_visium.R", discovery, fractions, cell_type, n_states, recovery, "FALSE"))) 
}   
RunJobQueue()

cat("\nCalculating cell state abundances...\n")
print(paste("Rscript spatial_states.R", discovery, recovery, fractions, coo))
PushToJobQueue((paste("Rscript spatial_states.R", discovery, recovery, fractions, coo))) 
RunJobQueue()
cat("\nCalculating ecotype abundances...\n")
PushToJobQueue((paste("Rscript spatial_ecotypes.R", discovery, recovery, fractions, coo))) 
RunJobQueue()

cat("\nPlotting cell state heatmaps...\n")
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
    n_states = key[key[,1] == cell_type, 2]
    PushToJobQueue((paste("Rscript spatial_plot_states.R", discovery, recovery, fractions, coo, cell_type)))
}   
RunJobQueue()

PushToJobQueue((paste("Rscript spatial_plot_ecotypes.R", discovery, recovery, fractions, coo))) 
RunJobQueue()

cat("\nCopying EcoTyper results to the output folder!\n")

if(file.exists(final_output) && length(list.files(final_output)) > 0)
{
    old_results_folder = paste0(final_output, format(Sys.time(), " %a %b %d %X %Y"))
    dir.create(old_results_folder, recursive = T, showWarnings = F)
    warning(paste0("The output folder contains files from a previous run. Moving those files to: '", old_results_folder, "'"))  
    system(paste0("mv -f ", final_output, "/* '", old_results_folder, "'"))
}

dir.create(final_output, recursive = T, showWarnings = F)

key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{           
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, paste0(cell_type, "_spatial_heatmaps.pdf")), final_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, paste0(cell_type, "_spatial_heatmaps.png")), final_output))
}   

system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, "state_abundances.txt"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, "ecotype_abundances.txt"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, "Ecotype_spatial_heatmaps.pdf"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, "Ecotype_spatial_heatmaps.png"), final_output))

end = Sys.time()
cat(paste0("\nEcoTyper finished succesfully! Please find the results in: '", final_output, "'.\nRun time: ", format(end - start, digits = 0), "\n"))

运行一下

Rscript EcoTyper_recovery_visium.R -h
## usage: EcoTyper_recovery_visium.R [-c <PATH>] [-h]
## 
## Arguments:
##   -c <PATH>, --config <PATH>
##                         Path to the config files [required].
##   -h, --help            Print help message.

The configuration file

This script takes as input file a configuration file in YAML format. The configuration file for this tutorial is available in config_recovery_visium.yml:

default :
  Input :
    Discovery dataset name : "Carcinoma"
    Recovery dataset name : "VisiumBreast"
    Input Visium directory : "example_data/VisiumBreast"
    #Path to a file containing the precomputed cell fractions for the visium array
    Recovery cell type fractions : "NULL"
    Malignant cell of origin : "Epithelial.cells"
    CIBERSORTx username : "<Please use your username from the CIBERSORTx website>"
    CIBERSORTx token : "<Please obtain a token from the CIBERSORTx website>"

  Output :
    Output folder : "VisiumOutput"

  Pipeline settings :
    Number of threads : 10

The configuration file has three sections, Input, Pipeline settings, and Output. We next will describe the expected content in each of these sections, and instruct the user how to set the appropriate settings in their applications.

Input section

The Input section contains settings regarding the input data.

Discovery dataset name

Discovery dataset name : "Carcinoma"

Discovery dataset name should contain the name of the discovery dataset used for defining cell states. By default, the only accepted values are Carcinoma and Lymphoma (case sensitive), which will recover the cell states that we defined across carcinomas and in lymphoma, respectively. If the user defined cell states in their own data (Tutorials 4-6), the name of the discovery dataset is the value provided in the Discovery dataset name field of the configuration file used for running discovery. For this tutorial, we set the name of the discovery dataset to Carcinoma.

Recovery dataset name

Recovery dataset name : "VisiumBreast"

Recovery dataset name is the identifier used by EcoTyper to internally save and retrieve the information about the cell states/ecotypes abundances. Any value that contains alphanumeric characters and ’_’ is accepted for this field.

Input Visium directory

Input Visium directory : "example_data/VisiumBreast"

There are 4 input files needed for recovery on the visium data:

list.files("example_data/VisiumBreast")
## [1] "barcodes.tsv.gz"           "features.tsv.gz"          
## [3] "matrix.mtx.gz"             "tissue_positions_list.csv"

The filtered feature-barcode matrices barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz, in the format provided by 10x Genomics, and the tissue_positions_list.csv file produced by the run summary images pipeline, containing the spatial position of barcodes.

Recovery cell type fractions

Recovery cell type fractions : "NULL"

Recovery cell type fractions should contain the path to a file containing the cell type fraction estimations for each spot on the visium array. This field is ignored when the discovery dataset is Carcinoma or Lymphoma or when the discovery has been performed as described in Tutorial 4, using Carcinoma_Fractions or Lymphoma_Fractions. It is only used when users provided their own cell type fractions for deriving cell states and ecotypes in Tutorial 4. In this case, the user needs to provide a path to a tab-delimited file for this field. The file should contain in the first column the same sample names used as column names in the input expression matrix, and in the next columns, the cell type fractions for the same cell populations used for discovering cell states and ecotypes. These fractions should sum up to 1 for each row. An example of such a file is provided in:

data = read.delim("example_data/visium_fractions_example.txt", nrow = 5)
dim(data)
## [1]  5 13

data
##              Mixture Fibroblasts Endothelial.cells Epithelial.cells    B.cells
## 1 AAACAAGTATCTCCCA.1   0.3747796       0.016948078        0.2164860 0.04116797
## 2 AAACACCAATAACTGC.1   0.1231510       0.028426736        0.6737582 0.02209104
## 3 AAACAGAGCGACTCCT.1   0.2383718       0.085296697        0.3124031 0.03159104
## 4 AAACAGGGTCTATATT.1   0.1178922       0.053757339        0.1128586 0.11847960
## 5 AAACAGTGTTCCTGGG.1   0.3699561       0.005238928        0.5008316 0.01311040
##   CD4.T.cells CD8.T.cells Dendritic.cells  Mast.cells Monocytes.and.Macrophages
## 1  0.10447645 0.033647446     0.016196773 0.021842932                0.06183330
## 2  0.04376232 0.025219723     0.006647209 0.008375436                0.03087553
## 3  0.04581258 0.028235504     0.025698640 0.020992271                0.04386487
## 4  0.11018235 0.154411312     0.004780762 0.013200087                0.09278191
## 5  0.02826441 0.007037966     0.005238555 0.006979820                0.02851720
##      NK.cells        PCs        PMNs
## 1 0.030228865 0.06911276 0.013279833
## 2 0.006960189 0.02580987 0.004922716
## 3 0.020617429 0.13918007 0.007936014
## 4 0.000000000 0.21863726 0.003018626
## 5 0.007854087 0.02484208 0.002128834

Since in this tutorial we use the Carcinoma dataset as the discovery dataset, this field is not required. However, if it needs to be provided, it can be set as follows:

Recovery cell type fractions : "example_data/visium_fractions_example.txt"

Malignant cell of origin

Malignant cell of origin : "Epithelial.cells"

The cell of origin population for the cancer type being analyzed, amongst the cell types used for discovery. This field is used for plotting a gray background in the resulting output plot, with the intensity of gray depicting the abundance of the cell of origin population in each spot. It is not used when the discovery dataset is Carcinoma or Lymphoma or when the discovery has been performed as described in Tutorials 4-6, using Carcinoma_Fractions or Lymphoma_Fractions. In these cases, the malignant cells are automatically considered to be originating from Epithelial.cells or B.cells, respectively. Otherwise, this field needs to contain a column name in the file provided in Recovery cell type fractions field, corresponding to the appropriate cell type of origin.

CIBERSORTx username and token

CIBERSORTx username : "<Please use your username from the CIBERSORTx website>"
CIBERSORTx token : "<Please obtain a token from the CIBERSORTx website>"

The fields CIBERSORTx username and CIBERSORTx token should contain the username on the CIBERSORTx website and the token necessary to run the CIBERSORTx source code. The token can be obtained from the CIBERSORTx website.

The output section

The Output section contains a single field, Output folder, which specifies the path where the final output will be saved. This folder will be created if it does not exist.

Output folder : "VisiumOutput"

Number of threads

The last section, Pipeline settings, contains only one argument, the number of threads used for performing recovery:

Number of threads : 10

The command line

After editing the configuration file (config_recovery_visium.yml), the command line for recovering the cell states and ecotypes in Visium Spatial Gene Expression data looks as illustrated below. Please note that this script might take up to two hours to run on 10 threads. Also, since CIBERSORTx is run on each spot, the memory requirements might exceed the memory available on a typical laptop. We recommend that this tutorial is run on a server with >32GB of RAM.

Rscript EcoTyper_recovery_visium.R -c config_recovery_visium.yml

The output format

EcoTyper generates for each cell type the following outputs:

data = read.delim("VisiumOutput/VisiumBreast/state_abundances.txt")
dim(data)
## [1] 3813   76

head(data[,1:10])
##                   ID  X   Y       Sample Malignant B.cells_S01 B.cells_S02
## 1 AAACAAGTATCTCCCA.1 50 102 VisiumBreast 0.2164860           0           0
## 2 AAACACCAATAACTGC.1 59  19 VisiumBreast 0.6737582           0           0
## 3 AAACAGAGCGACTCCT.1 14  94 VisiumBreast 0.3124031           0           0
## 4 AAACAGGGTCTATATT.1 47  13 VisiumBreast 0.1128586           1           0
## 5 AAACAGTGTTCCTGGG.1 73  43 VisiumBreast 0.5008316           0           0
## 6 AAACATTTCCCGGATT.1 61  97 VisiumBreast 0.7553180           0           0
##   B.cells_S03 B.cells_S04 B.cells_S05
## 1           0   0.0000000   0.8690817
## 2           0   0.0000000   0.0000000
## 3           0   0.0000000   0.0000000
## 4           0   0.0000000   0.0000000
## 5           0   0.2767688   0.0000000
## 6           0   0.3345834   0.0000000

knitr::include_graphics("VisiumOutput/VisiumBreast/Fibroblasts_spatial_heatmaps.png")
Fibroblasts_spatial_heatmaps.png
data = read.delim("VisiumOutput/VisiumBreast/ecotype_abundances.txt")
dim(data)
## [1] 3813   15
head(data[,1:10])
##                                ID  X   Y       Sample Malignant        CE1
## VisiumBreast.1 AAACAAGTATCTCCCA.1 50 102 VisiumBreast 0.1142685 0.93179887
## VisiumBreast.2 AAACACCAATAACTGC.1 59  19 VisiumBreast 0.7334114 0.23355315
## VisiumBreast.3 AAACAGAGCGACTCCT.1 14  94 VisiumBreast 0.2441395 0.00000000
## VisiumBreast.4 AAACAGGGTCTATATT.1 47  13 VisiumBreast 0.0000000 0.09941146
## VisiumBreast.5 AAACAGTGTTCCTGGG.1 73  43 VisiumBreast 0.4992702 0.64507665
## VisiumBreast.6 AAACATTTCCCGGATT.1 61  97 VisiumBreast 0.8438427 0.02056875
##                      CE2       CE3       CE4 CE5
## VisiumBreast.1 0.0000000 0.0000000 0.7497744   0
## VisiumBreast.2 0.2357562 0.0000000 0.0000000   0
## VisiumBreast.3 0.0000000 0.0000000 0.0000000   0
## VisiumBreast.4 0.0000000 0.0000000 0.0000000   0
## VisiumBreast.5 0.0000000 0.1836142 0.0000000   0
## VisiumBreast.6 0.0000000 0.0000000 0.0000000   0

Plots illustrating the ecotype abundances. The intensity of charcoal represents the cell state abundance. The intensity of gray represents the fraction of the cancer cell of origin population:
knitr::include_graphics("VisiumOutput/VisiumBreast/Ecotype_spatial_heatmaps.png")


Ecotype_spatial_heatmaps.png

示例3,De novo Discovery of Cell States and Ecotypes in scRNA-seq Data

在教程中,将说明如何从 scRNA-seq 表达矩阵开始对细胞状态和生态型进行从头识别。 出于说明目的,我们使用来自结肠直肠癌的 scRNA-seq 的下采样版本作为发现数据集,可在 example_data/scRNA_CRC_data.txt 以及示例注释文件 example_data/scRNA_CRC_annotation.txt 中获得。

Overview of the EcoTyper workflow for discovering cell states in scRNA-seq data

EcoTyper 通过一系列步骤从 scRNA-seq 数据中获取细胞状态和生态型:

对于每种细胞类型,EcoTyper 将 NMF 应用于一系列等级(细胞状态数),默认为 2-20 个状态。对于每个等级,NMF 算法使用不同的起始种子多次应用(我们建议至少 50 次),以提高鲁棒性。

Checklist before performing cell states and ecotypes discovery in scRNA-seq data

Cell states and ecotypes discovery in scRNA-seq data,先看一下主脚本EcoTyper_discovery_scRNA.R

suppressPackageStartupMessages({
library(config)
library(argparse)
source("pipeline/lib/config.R") 
source("pipeline/lib/misc.R") 
source("pipeline/lib/multithreading.R")
})

parser <- ArgumentParser(add_help = F)

arguments = parser$add_argument_group('Arguments')

arguments$add_argument("-c", "--config", type = "character", metavar="<PATH>", 
    help="Path to the config files [required].")
arguments$add_argument("-h", "--help", action='store_true', help="Print help message.")

args <- parser$parse_args()
#print(args)
if(args$h || is.null(args$config))
{
    parser$print_help()
    quit()
}

config_file = abspath(args$config)

config <- config::get(file = config_file)
check_discovery_configuration_scRNA(config)

discovery = config$Input$"Discovery dataset name"
discovery_type = config$Input$"Expression type"
scale_column = config$Input$"Annotation file column to scale by"
additional_columns = config$Input$"Annotation file column(s) to plot"

final_output = config$"Output"$"Output folder"

n_threads = config$"Pipeline settings"$"Number of threads"

nmf_restarts = config$"Pipeline settings"$"Number of NMF restarts"
max_clusters = config$"Pipeline settings"$"Maximum number of states per cell type"
cohpenetic_cutoff = config$"Pipeline settings"$"Cophenetic coefficient cutoff"
skip_steps = config$"Pipeline settings"$"Pipeline steps to skip"
p_value_cutoff = config$"Pipeline settings"$"Jaccard matrix p-value cutoff"

suppressWarnings({
    final_output = abspath(final_output)    
})

#Starting EcoTyper
setwd("pipeline")
start = Sys.time()

if(config$"Pipeline settings"$"Filter non cell type specific genes")
{
    fractions = "Cell_type_specific_genes"
}else{
    fractions = "All_genes"
}

if(!1 %in% skip_steps & config$"Pipeline settings"$"Filter non cell type specific genes")
{
    cat("\nStep 1 (extract cell type specific genes)...\n")

    annotation = read.delim(file.path(file.path("../datasets/discovery", discovery, "annotation.txt"))) 
    cell_types = unlist(levels(as.factor(as.character(annotation$CellType))))   
    for(cell_type in cell_types)
    {
        print(cell_type)
        PushToJobQueue(paste("Rscript state_discovery_scRNA_filter_genes.R", discovery, fractions, cell_type, scale_column))    
    }
    RunJobQueue()   
    cat("Step 1 (extract cell type specific genes) finished successfully!\n")
    
}else{
    cat("Skipping step 1 (extract cell type specific genes)...\n")
}

if(!2 %in% skip_steps)
{
    cat("\nStep 2 (cell state discovery on correrlation matrices): Calculating correlation matrices...\n")

    annotation = read.delim(file.path(file.path("../datasets/discovery", discovery, "annotation.txt"))) 
    cell_types = unlist(levels(as.factor(as.character(annotation$CellType))))   

    for(cell_type in cell_types)
    {
        filter_genes = (fractions == "Cell_type_specific_genes")
        PushToJobQueue(paste("Rscript state_discovery_scRNA_distances.R", discovery, fractions, cell_type, filter_genes, scale_column)) 
    }   
    RunJobQueue() 
    
    cat("Step 2 (cell state discovery on correrlation matrices): Running NMF (Warning: This step might take a long time!)...\n")

    for(cell_type in cell_types)
    {       
        if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery_cross_cor", cell_type, "expression_top_genes_scaled.txt")))
        {           
            next
        }
        for(n_clusters in 2:max_clusters)
        {
            for(restart in 1:nmf_restarts)
            {
                if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery_cross_cor", cell_type, n_clusters, "restarts", restart, "estim.RData")))
                {
                    PushToJobQueue(paste("Rscript state_discovery_NMF.R", "discovery_cross_cor", discovery, fractions, cell_type, n_clusters, restart))
                }else{                  
                    cat(paste0("Warning: Skipping NMF on '", cell_type, "' (number of states = ", n_clusters, ", restart ", restart, "), as the output file '", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "restarts", restart, "estim.RData"), "' already exists!\n"))
                }
            } 
        }           
    } 
    RunJobQueue()
        
    cat("Step 2 (cell state discovery on correrlation matrices): Aggregating NMF results...\n")
    for(cell_type in cell_types)
    {   
        if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery_cross_cor", cell_type, "expression_top_genes_scaled.txt")))
        {           
            next
        }               
        PushToJobQueue(paste("Rscript state_discovery_combine_NMF_restarts.R", "discovery_cross_cor", discovery, fractions, cell_type, max_clusters, nmf_restarts))
    } 
    RunJobQueue()
    cat("Step 2 (cell state discovery on correrlation matrices) finished successfully!\n")
}else{
    cat("Skipping step 2 (cell state discovery on correrlation matrices)...\n")
}   

if(!3 %in% skip_steps)
{
    cat("\nStep 3 (choosing the number of cell states)...\n")
    PushToJobQueue(paste("Rscript state_discovery_rank_selection.R", "discovery_cross_cor", discovery, fractions, max_clusters, cohpenetic_cutoff))
    RunJobQueue()
    cat("Step 3 (choosing the number of cell states) finished successfully!\n")
}else{
    cat("Skipping step 3 (choosing the number of cell states)...\n")
}

if(!4 %in% skip_steps)
{
    cat("\nStep 4 (extracting cell state information)...\n")

    system(paste("cp -f ", config_file, file.path("../EcoTyper", discovery, "config_used.yml")))

    key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
    for(cell_type in key[,1])
    {   
        cat(paste("Extracting cell states information for:", cell_type, "\n"))
        n_clusters = key[key[,1] == cell_type, 2]
        PushToJobQueue(paste("Rscript state_discovery_initial_plots_scRNA.R", "discovery_cross_cor", discovery, fractions, cell_type, n_clusters, "State", paste(additional_columns, collapse = " ")))       
    }   
    RunJobQueue()
    cat("Step 4 (extracting cell state information) finished successfully!\n")
}else{
    cat("\nSkipping step 4 (extracting cell state information)...\n")
}

if(!5 %in% skip_steps)
{
    cat("\nStep 5 (cell state re-discovery in expression matrices)...\n")

    key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
    for(cell_type in key[,1])
    {   
        cat(paste("Extracting marker genes for cell states defined in:", cell_type, "\n"))
        n_clusters = key[key[,1] == cell_type, 2]
        PushToJobQueue(paste("Rscript state_discovery_extract_features_scRNA.R", discovery, fractions, cell_type, n_clusters))       
    }   
    RunJobQueue()
    
    cat("\nStep 5 (cell state re-discovery in expression matrices): Running NMF on expression matrix...\n")

    key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
    for(cell_type in key[,1])
    {
        if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, "expression_top_genes_scaled.txt")))
        {           
            next
        }
        
        n_clusters = key[key[,1] == cell_type, 2]
        for(restart in 1:nmf_restarts)
        {
            if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "restarts", restart, "estim.RData")))
            {
                PushToJobQueue(paste("Rscript state_discovery_NMF.R", "discovery", discovery, fractions, cell_type, n_clusters, restart))
            }else{                  
                cat(paste0("Warning: Skipping NMF on '", cell_type, "' (number of states = ", n_clusters, ", restart ", restart, "), as the output file '", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "restarts", restart, "estim.RData"), "' already exists!\n"))
            }
        } 
    } 
    RunJobQueue()
        
    cat("Step 5 (cell state re-discovery in expression matrices): Aggregating NMF results...\n")
    for(cell_type in key[,1])
    {   
        if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, "expression_top_genes_scaled.txt")))
        {           
            next
        }               
        PushToJobQueue(paste("Rscript state_discovery_combine_NMF_restarts.R", "discovery", discovery, fractions, cell_type, max_clusters, nmf_restarts))
    } 
    RunJobQueue()
    cat("Step 5 (cell state re-discovery in expression matrices) finished successfully!\n")
}else{
    cat("Skipping step 5 (cell state re-discovery in expression matrices)...\n")
}   

if(!6 %in% skip_steps) 
{
    cat("\nStep 6 (extracting information for re-discovered cell states)...\n")

    key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
    for(cell_type in key[,1])
    {   
        cat(paste("Extracting cell states information for:", cell_type, "\n"))
        n_clusters = key[key[,1] == cell_type, 2]
        PushToJobQueue(paste("Rscript state_discovery_initial_plots.R", "discovery", discovery, fractions, cell_type, n_clusters, "State", paste(additional_columns, collapse = " ")))       
    }   
    RunJobQueue()
    cat("Step 6 (extracting information for re-discovered cell states) finished successfully!\n")
}else{
    cat("\nSkipping step 6 (extracting information for re-discovered cell states)...\n")
}

if(!7 %in% skip_steps)
{
    cat("\nStep 7 (cell state QC filter)...\n")

    key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
    for(cell_type in key[,1])
    {   
        cat(paste("Filtering low-quality cell states for:", cell_type, "\n"))
        n_clusters = key[key[,1] == cell_type, 2]
        PushToJobQueue(paste("Rscript state_discovery_first_filter_scRNA.R", discovery, fractions, cell_type, n_clusters, "State", paste(additional_columns, collapse = " ")))       
    }   
    RunJobQueue()
    cat("Step 7 (cell state QC filter) finished successfully!\n")
}else{
    cat("\nSkipping step 7 (cell state QC filter)...\n")
}

if(!8 %in% skip_steps)
{
    cat("\nStep 8 (ecotype discovery)...\n")
    PushToJobQueue(paste("Rscript ecotypes_scRNA.R", discovery, fractions, p_value_cutoff)) 
    RunJobQueue()
    PushToJobQueue(paste("Rscript ecotypes_assign_samples_scRNA.R", discovery, fractions, "State",paste(additional_columns, collapse = " "))) 
    cat("Step 8 (ecotype discovery) finished successfully!\n")
    RunJobQueue()
}else{
    cat("Skipping step 8 (ecotype discovery)...\n")
}

cat("\nCopying EcoTyper results to the output folder!\n")

if(file.exists(final_output) && length(list.files(final_output)) > 0)
{
    old_results_folder = paste0(final_output, format(Sys.time(), " %a %b %d %X %Y"))
    dir.create(old_results_folder, recursive = T, showWarnings = F)
    warning(paste0("The output folder contains files from a previous run. Moving those files to: '", old_results_folder, "'"))  
    system(paste0("mv -f ", final_output, "/* '", old_results_folder, "'"))
}

dir.create(final_output, recursive = T, showWarnings = F)

system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_plot.pdf"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_plot.png"), final_output))

key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{       
    n_clusters = key[key[,1] == cell_type, 2]   
    ct_output = file.path(final_output, cell_type)
    dir.create(ct_output, recursive = T, showWarnings = F)
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "gene_info.txt"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "state_abundances.txt"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "state_assignment.txt"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "state_assignment_heatmap.pdf"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "state_assignment_heatmap.png"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "heatmap_data.txt"), ct_output))
    system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "heatmap_top_ann.txt"), ct_output)) 
}   

ct_output = file.path(final_output, "Ecotypes")
dir.create(ct_output, recursive = T, showWarnings = F)
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "ecotypes.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "ecotype_assignment.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "ecotype_abundance.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "heatmap_assigned_samples_viridis.pdf"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "heatmap_assigned_samples_viridis.png"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "jaccard_matrix.pdf"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "jaccard_matrix.png"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "nclusters_jaccard.png"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "nclusters_jaccard.pdf"), ct_output))

end = Sys.time()
cat(paste0("\nEcoTyper finished succesfully! Please find the results in: '", final_output, "'.\nRun time: ", format(end - start, digits = 0), "\n"))
Rscript EcoTyper_discovery_scRNA.R -h
## usage: EcoTyper_discovery_scRNA.R [-c <PATH>] [-h]
## 
## Arguments:
##   -c <PATH>, --config <PATH>
##                         Path to the config files [required].
##   -h, --help            Print help message. 

该脚本将 YAML 格式的配置文件作为输入文件。 本教程的配置文件位于 config_discovery_scRNA.yml 中:

default :
  Input :    
    Discovery dataset name : "discovery_scRNA_CRC"
    Expression matrix : "example_data/scRNA_CRC_data.txt"    
    Annotation file : "example_data/scRNA_CRC_annotation.txt" 
    Annotation file column to scale by : NULL
    Annotation file column(s) to plot : []
    
  Output :
    Output folder : "DiscoveryOutput_scRNA"

  Pipeline settings :
    #Pipeline steps:
    #   step 1 (extract cell type specific genes)
    #   step 2 (cell state discovery on correrlation matrices)
    #   step 3 (choosing the number of cell states)
    #   step 4 (extracting cell state information)
    #   step 5 (cell state re-discovery in expression matrices)
    #   step 6 (extracting information for re-discovered cell states)
    #   step 7 (cell state QC filter)
    #   step 8 (ecotype discovery)
    Pipeline steps to skip : [] 
    Filter non cell type specific genes : True
    Number of threads : 10
    Number of NMF restarts : 5
    Maximum number of states per cell type : 20
    Cophenetic coefficient cutoff : 0.95
    #The p-value cutoff used for filtering non-significant overlaps in the jaccard matrix used for discovering ecotypes in step 8. Default: 1 (no filtering).
    Jaccard matrix p-value cutoff : 1

配置文件包含三个部分,输入、输出和管道设置。 接下来,我们将分别描述这三个部分的预期内容,并指导用户如何在其应用程序中设置适当的设置。

Input section

The Input section contains settings regarding the input data.

Discovery dataset name

Discovery dataset name is the identifier used by EcoTyper to internally save and retrieve the information about the cell states/ecotypes defined on this discovery dataset. It is also the name to be provided to the -d/–discovery argument of scripts EcoTyper_recovery_scRNA.R and EcoTyper_recovery_bulk.R, when performing cell state/ecotypes recovery. Any value that contains alphanumeric characters and ’_’ is accepted for this field.

Discovery dataset name : "discovery_scRNA_CRC"

Expression matrix

Expression matrix : "example_data/scRNA_CRC_data.txt"

Expression matrix field should contain the path to a tab-delimited file containing the expression data, with genes as rows and cells as columns. The expression matrix should be in the TPM, CPM or other suitable normalized space. It should have gene symbols on the first column and gene counts for each cell on the next columns. Column (cells) names should be unique. Also, we recommend that the column names do not contain special characters that are modified by the R function make.names, e.g. having digits at the beginning of the name or containing characters such as space, tab or -:

The expected format for the expression matrix is:

data = read.delim("example_data/scRNA_CRC_data.txt", nrow = 5)
dim(data)
## [1]     5 13781

head(data[,1:5])
##      Gene SMC01.T_AAAGATGCATGGATGG SMC01.T_AAAGTAGCAAGGACAC
## 1    A1BG                        0                        0
## 2    A1CF                        0                        0
## 3     A2M                        0                        0
## 4   A2ML1                        0                        0
## 5 A3GALT2                        0                        0
##   SMC01.T_AAATGCCAGGATCGCA SMC01.T_AACTCTTCACAACGCC
## 1                        0                        0
## 2                        0                        0
## 3                        0                        0
## 4                        0                        0
## 5                        0                        0

Annotation file

Annotation file : "example_data/scRNA_CRC_annotation.txt"

A path to an annotation file should be provided in the field Annotation file. This file should contain a column called ID with the same names (e.g. cell barcodes) as the columns of the expression matrix, a column called CellType indicating cell type for each cell, and a column called Sample indicating the sample identifier for each cell. The latter is used for ecotype discovery. This file can contain any number of additional columns. The additional columns can be used for defining sample batches (see Section Annotation file column to scale by below) and for plotting color bars in the heatmaps output (see Section Annotation file column(s) to plot below). For the current example, the annotation file has the following format:

annotation = read.delim("example_data/scRNA_CRC_annotation.txt", nrow = 5)
dim(annotation)
## [1] 5 9

head(annotation)
##                      Index Patient Class  Sample        Cell_type Cell_subtype
## 1 SMC01-T_AAAGATGCATGGATGG   SMC01 Tumor SMC01-T Epithelial cells         CMS2
## 2 SMC01-T_AAAGTAGCAAGGACAC   SMC01 Tumor SMC01-T Epithelial cells         CMS2
## 3 SMC01-T_AAATGCCAGGATCGCA   SMC01 Tumor SMC01-T Epithelial cells         CMS2
## 4 SMC01-T_AACTCTTCACAACGCC   SMC01 Tumor SMC01-T Epithelial cells         CMS2
## 5 SMC01-T_AACTTTCGTTCGGGCT   SMC01 Tumor SMC01-T Epithelial cells         CMS2
##           CellType                       ID Tissue
## 1 Epithelial.cells SMC01.T_AAAGATGCATGGATGG  Tumor
## 2 Epithelial.cells SMC01.T_AAAGTAGCAAGGACAC  Tumor
## 3 Epithelial.cells SMC01.T_AAATGCCAGGATCGCA  Tumor
## 4 Epithelial.cells SMC01.T_AACTCTTCACAACGCC  Tumor
## 5 Epithelial.cells SMC01.T_AACTTTCGTTCGGGCT  Tumor

Annotation file column to scale by

Annotation file column to scale by : NULL

In order to discover pan-carcinoma cell states and ecotypes in the EcoType carcinoma paper, we standardize genes to mean 0 and unit variance within each tumor type (histology). Field Annotation file column to scale by allows users to specify a column name in the annotation file, by which the cells will be grouped when performing standardization. However, this is an analytical choice, depending on the purpose of the analysis. If the users are interested in defining cell states and ecotypes regardless of tumor type-specificity, this argument can be set to NULL. In this case, the standardization will be applied across all samples in the discovery cohort. The same will happen if the annotation file is not provided.

In the current example, this field is not used and therefore set to NULL.

Annotation file column(s) to plot

Annotation file column(s) to plot : ["Histology", "Tissue"]

Annotation file column(s) to plot field specifies which columns in the annotation file will be used as color bar in the output heatmaps, in addition to the cell state label column, plotted by default.

The output section

The Output section contains a single field, Output folder, which specifies the path where the final output will be saved. This folder will be created if it does not exist.

Output folder : "DiscoveryOutput_scRNA"

Pipeline settings

The last section, Pipeline settings, contains settings controlling how EcoTyper is run.

Pipeline steps to skip

 #Pipeline steps:
    #   step 1 (extract cell type specific genes)
    #   step 2 (cell state discovery on correlation matrices)
    #   step 3 (choosing the number of cell states)
    #   step 4 (extracting cell state information)
    #   step 5 (cell state re-discovery in expression matrices)
    #   step 6 (extracting information for re-discovered cell states)
    #   step 7 (cell state QC filter)
    #   step 8 (ecotype discovery)

The Pipeline steps to skip option allows user to skip some of the steps outlined in section Overview of the EcoTyper workflow for discovering cell states. Please note that this option is only intended for cases when the pipeline had already been run once, and small adjustments are made to the parameters. For example, if the Cophenetic coefficient cutoff used in step 3 needs adjusting, the user might want to skip steps 1-2 and re-run from step 3 onwards.

Filter non cell type specific genes

Filter non cell type specific genes : True

Flag indicated whether to apply the filter for cell type specific genes in step 1, outlined in section Overview of the EcoTyper workflow for discovering cell states. For best results, we do recommend applying this filter.

Number of threads

Number of threads : 10

The number of threads EcoTyper will be run on.

Number of NMF restarts

Number of NMF restarts : 5

The NMF approach used by EcoTyper (Brunet et al.), can give slightly different results, depending on the random initialization of the algorithm. To obtain a stable solution, NMF is generally run multiple times with different seeds, and the solution that best explains the discovery data is chosen. Additionally, the variation of NMF solutions across restarts with different seeds is quantified using Cophenetic coefficients and used in step 4 of EcoTyper for selecting the number of states. The parameter Number of NMF restarts specifies how many restarts with different seed should EcoTyper perform for each rank selection, in each cell type. Since this is a very time consuming process, in this example we only use 5. However, for publication-quality results, we recommend at least 50 restarts.

Maximum number of states per cell type

Maximum number of states per cell type : 20

Maximum number of states per cell type specifies the upper end of the range for the number of states possible for each cell type. The lower end is 2.

Cophenetic coefficient cutoff

Cophenetic coefficient cutoff : 0.975

This field indicates the Cophenetic coefficient cutoff, in the range [0, 1], used for automatically determining the number of states in step 4. Lower values generally lead to more clusters being identified. In this particular example, we set it to 0.975.

Jaccard matrix p-value cutoff

Jaccard matrix p-value cutoff : 1

Ecotype identification on step 8 is performed by clustering a jaccard matrix that quantifies the sample overlap between each pair of states. Prior to performing ecotype identification, the jaccard matrix values corresponding to pairs of states for which the sample overlap is not significant are set to 0, in order to mitigate the noise introduced by spurious overlaps. The value provided in this field specifies the p-value cutoff above which the overlaps are considered non-significant. When the number of samples in the scRNA-seq dataset is small, such as in the current example, we recommend this filter is disabled (p-value cutoff = 1), to avoid over-filtering the jaccard matrix. However, we encourage users to set this cutoff to lower values (e.g. 0.05), if the discovery scRNA-seq dataset contains a number of samples large enough to reliably evaluate the significance of overlaps.

The command line

After editing the configuration file (config_discovery_scRNA.yml), the de novo discovery cell states and ecotypes can be run as is illustrated below. Please note that this script might take 24-48 hours to run on 10 threads. Also, EcoTyper cannot be run on the example data from this tutorial using a typical laptop (16GB memory). We recommend that it is run on a server with >50-100GB of RAM or a high performance cluster.

Rscript EcoTyper_discovery_scRNA.R -c config_discovery_scRNA.yml

The output format

EcoTyper generates for each cell type the following outputs:

knitr::include_graphics("DiscoveryOutput_scRNA/rank_plot.png")
rank_plot.png

对于每种细胞类型,都会产生以下输出,此处以成纤维细胞为例:

在发现数据集中的样本中,在步骤 6 和 7(如果运行)中的 QC 过滤器之后剩余的细胞状态丰度:

data = read.delim("DiscoveryOutput_scRNA/Fibroblasts/state_abundances.txt")
dim(data)
## [1]    5 1500
head(data[,1:5])
##     SMC01.T_AAAGTAGAGTGGTAGC SMC01.T_ACACCCTGTTGGTAAA SMC01.T_ACATCAGTCGCCTGAG
## S01             2.074345e-14             4.262766e-02             1.352784e-14
## S02             2.074345e-14             3.332167e-01             3.957569e-01
## S03             2.781436e-02             1.204917e-02             1.706085e-01
## S04             2.074345e-14             9.441681e-13             1.352784e-14
## S05             2.074345e-14             6.121064e-01             7.391235e-02
##     SMC01.T_ACTATCTAGCTAGTCT SMC01.T_ACTGATGAGCACCGCT
## S01             5.529717e-02             1.022447e-14
## S02             2.081811e-14             1.805516e-01
## S03             3.519208e-02             6.787254e-01
## S04             2.081811e-14             1.022447e-14
## S05             2.081811e-14             7.112757e-02

Assignment of samples in the discovery dataset to the cell state with the highest abundance. Only samples assigned to the cell states remaining after the QC filters in steps 6 and 7 (if run) are included. The remaining ones are considered unassigned and removed from this table:

data = read.delim("DiscoveryOutput_scRNA/Fibroblasts/state_assignment.txt")
dim(data)
## [1] 899   3
head(data)
##                           ID State InitialState
## 723 SMC15.T_CATCGAAGTGACCAAG   S01         IS05
## 724 SMC18.T_CTTGGCTCAGTGACAG   S01         IS05
## 725 SMC24.T_TACTTACAGCGCCTTG   S01         IS05
## 726 SMC01.N_CACCAGGCAATAAGCA   S01         IS05
## 727 SMC02.N_AGAGCTTTCTAACCGA   S01         IS05
## 728 SMC02.N_ATAACGCCAATACGCT   S01         IS05

A heatmap illustrating the expression of genes used for cell state discovery, that have the highest fold-change in one of the cell states remaining after the QC filters in steps 6 and 7 (if run). In the current example, the heatmap includes in the top color bar two rows corresponding to Tissue and Histology, that have been provided in configuration file field Annotation file column(s) to plot, in addition to cell state labels always plotted:

knitr::include_graphics("DiscoveryOutput_scRNA/Fibroblasts/state_assignment_heatmap.png")
state_assignment_heatmap.png
The ecotype output files include:

The cell state composition of each ecotype (the set of cell states making up each ecotype):

ecotypes = read.delim("DiscoveryOutput_scRNA/Ecotypes/ecotypes.txt")
head(ecotypes[,c("CellType", "State", "Ecotype")])
##                    CellType State Ecotype
## 1                   B.cells   S02      E1
## 2               CD4.T.cells   S02      E1
## 3               CD8.T.cells   S01      E1
## 4           Dendritic.cells   S03      E1
## 5               Fibroblasts   S05      E1
## 6 Monocytes.and.Macrophages   S03      E1

The number of initial clusters obtained by clustering the Jaccard index matrix, selected using the average silhouette:

knitr::include_graphics("DiscoveryOutput_scRNA/Ecotypes/nclusters_jaccard.png")
nclusters_jaccard.png
A heatmap of the Jaccard index matrix, after filtering ecotypes with less than 3 cell states:
knitr::include_graphics("DiscoveryOutput_scRNA/Ecotypes/jaccard_matrix.png")
jaccard_matrix.png
The abundance of each ecotype in each sample in the discovery dataset::

The abundance of each ecotype in each sample in the discovery dataset:

abundances = read.delim("DiscoveryOutput_scRNA/Ecotypes/ecotype_abundance.txt")
dim(abundances)
## [1]  9 33
head(abundances[,1:5])
##       SMC01.N    SMC01.T    SMC02.N    SMC02.T    SMC03.N
## E1 0.34064095 0.07302366 0.20329837 0.02049678 0.27718758
## E2 0.06078240 0.17093342 0.02937202 0.10322721 0.05241208
## E3 0.02315383 0.34562878 0.01355632 0.36202739 0.01278497
## E4 0.13787420 0.12543986 0.14604672 0.16681631 0.06725426
## E5 0.16081886 0.10434607 0.28980392 0.11903111 0.14459666
## E6 0.00000000 0.07347385 0.03524642 0.14282270 0.00000000

The assignment of samples in the discovery dataset to ecotypes. The samples not assigned to any ecotype are filtered out from this file:

assignments = read.delim("DiscoveryOutput_scRNA/Ecotypes/ecotype_assignment.txt")
dim(assignments)
## [1] 32  6
head(assignments[,1:5])
##              ID MaxEcotype  AssignmentP  AssignmentQ AssignedToEcotypeStates
## SMC01-N SMC01-N         E1 1.938649e-04 0.0012795085                    TRUE
## SMC05-N SMC05-N         E1 5.000404e-03 0.0183348142                    TRUE
## SMC05-T SMC05-T         E1 7.568441e-02 0.1541417608                    TRUE
## SMC07-N SMC07-N         E1 2.928585e-03 0.0138061877                    TRUE
## SMC08-N SMC08-N         E1 9.015769e-05 0.0007438009                    TRUE
## SMC19-T SMC19-T         E1 5.936002e-03 0.0195888071                    TRUE

A heatmap of cell state fractions across the samples assigned to ecotypes:

knitr::include_graphics("DiscoveryOutput_scRNA/Ecotypes/heatmap_assigned_samples_viridis.png")
heatmap_assigned_samples_viridis.png

当然,方法还是很不错的,如果有配套的单细胞空间数据,用一下这个方法寻找细胞状态和生态型,还是非常给力的,参考网址在EcoTyper

生活很好,有你更好

上一篇 下一篇

猜你喜欢

热点阅读