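Absolute quantification of somatic DNA alterations in human cancer: ABSOLUTE
1. Overview
ABSOLUTE is an algorithm from the Broad Institute, a well-known US biomedical research institute. It infers tumor purity and the ploidy of the malignant cells from somatic copy-number variation (CNV) and single-nucleotide variant (SNV) data of a tumor sample. An R package is available and can be used directly. The paper was published in Nature Biotechnology in 2012 and has since been cited more than a thousand times.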
The paper first explains why inferring absolute copy number is difficult:
Inferring absolute copy number is more difficult for three reasons: (i) cancer cells are nearly always intermixed with an unknown fraction of normal cells (tumor purity); (ii) the actual DNA content of the cancer cells (ploidy), resulting from gross numerical and structural chromosomal abnormalities, is unknown; and (iii) the cancer cell population may be heterogeneous, perhaps owing to ongoing subclonal evolution.
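It then describes the inference framework of ABSOLUTE:
Consider a cancer tissue sample that is a mixture: a fraction α of cancer cells (assumed to be monogenomic, that is, all cancer cells share the same copy-number profile) and a fraction 1 − α of normal, diploid cells. For each locus x in the genome, let q(x) be the integer copy number of that locus in the cancer cells. Let τ be the mean ploidy of the cancer cells, defined as the average of q(x) over the whole genome.
In the mixed sample, the average absolute copy number at locus x is then αq(x) + 2(1 − α), and the average ploidy of the sample, D, is ατ + 2(1 − α).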
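The relative copy number at locus x is therefore:
R(x) = (αq(x) + 2(1 − α)) / D
Because q(x) is an integer, R(x) takes discrete values spaced α/D apart. Its smallest possible value is 2(1 − α)/D, which occurs at homozygously deleted loci (q(x) = 0) and corresponds to the DNA contributed by the normal cells.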
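The arithmetic above can be checked with a few lines of R. This is only an illustrative sketch: the function name relative_copy_number and the values α = 0.6 and τ = 3 are assumptions chosen for the example, not anything taken from the ABSOLUTE package.

```r
# Relative copy number R(x) = (alpha * q + 2 * (1 - alpha)) / D,
# where D = alpha * tau + 2 * (1 - alpha) is the average ploidy of the mix.
relative_copy_number <- function(q, alpha, tau) {
  D <- alpha * tau + 2 * (1 - alpha)
  (alpha * q + 2 * (1 - alpha)) / D
}

alpha <- 0.6   # assumed tumor purity (illustrative value)
tau   <- 3     # assumed mean ploidy of the cancer cells (illustrative value)
q     <- 0:5   # integer copy numbers in the cancer cells

D <- alpha * tau + 2 * (1 - alpha)
r <- relative_copy_number(q, alpha, tau)

print(round(r, 4))  # discrete levels, one per integer copy number
print(diff(r))      # constant spacing, equal to alpha / D
print(r[1])         # lowest level, equal to 2 * (1 - alpha) / D
```

A locus whose copy number equals the mean ploidy (q = τ) has a relative copy number of exactly 1, which is a quick sanity check on the formula.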
SNV data, when available, provide additional supporting information for the inference.
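To make the role of SNV data concrete, here is a minimal sketch under the same mixture model as above. It assumes that a somatic mutation sits on s of the q(x) cancer-cell copies and is absent from normal cells, so its expected allelic fraction is αs / (αq(x) + 2(1 − α)). The function name and the example values are hypothetical, and the full ABSOLUTE model contains more than this single expression.

```r
# Expected allelic fraction of a somatic mutation carried on `s` of the `q`
# copies in cancer cells, at purity `alpha`. Normal cells contribute two
# unmutated copies. Assumption: simple mixture model, no subclonality.
expected_allelic_fraction <- function(s, q, alpha) {
  stopifnot(all(s <= q), alpha >= 0, alpha <= 1)
  alpha * s / (alpha * q + 2 * (1 - alpha))
}

alpha <- 0.6  # assumed tumor purity (illustrative value)

# Mutation on one of two copies versus on both copies of a two-copy locus
f_one_copy <- expected_allelic_fraction(s = 1, q = 2, alpha = alpha)
f_two_copy <- expected_allelic_fraction(s = 2, q = 2, alpha = alpha)

print(c(one_copy = f_one_copy, two_copies = f_two_copy))
```

Because the observed fraction depends on α and q(x) together, mutation data constrain the same unknowns as the copy-number levels do.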
ABSOLUTE fits its models to optimize α and τ, that is, the tumor purity and the mean ploidy of the cancer cells.
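The following sketch is not the ABSOLUTE fitting procedure. It only inverts the formulas given earlier to show how α and τ are tied to the copy-number levels. It assumes two quantities have already been read off the data: b, the relative level of the zero-copy state, and delta, the spacing between adjacent levels. Both names, and the function name, are hypothetical.

```r
# From b = 2 * (1 - alpha) / D and delta = alpha / D it follows that
# alpha = 2 * delta / (b + 2 * delta), D = alpha / delta,
# and tau = (D - 2 * (1 - alpha)) / alpha.
purity_ploidy_from_levels <- function(b, delta) {
  alpha <- 2 * delta / (b + 2 * delta)
  D     <- alpha / delta
  tau   <- (D - 2 * (1 - alpha)) / alpha
  list(alpha = alpha, D = D, tau = tau)
}

# Round trip with assumed values alpha = 0.6 and tau = 3
alpha_true <- 0.6
tau_true   <- 3
D_true     <- alpha_true * tau_true + 2 * (1 - alpha_true)
b          <- 2 * (1 - alpha_true) / D_true
delta      <- alpha_true / D_true

fit <- purity_ploidy_from_levels(b, delta)
print(unlist(fit))

# Same spacing, but the one-copy level (b + delta) is taken as the
# zero-copy level, as could happen if no homozygous deletion is observed.
alt <- purity_ploidy_from_levels(b + delta, delta)
print(unlist(alt))
```

The second call returns a different but internally consistent purity and ploidy. That is one way to see why more than one (α, τ) pair can fit the same copy-number profile and why candidate solutions may need to be reviewed.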
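The full algorithm is complex and I do not understand all of it yet. For now I use the R package directly and will come back to the details later.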
2. Usage
The R package can be downloaded from the official site at https://software.broadinstitute.org/cancer/cga/absolute_download after registering.
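The example code is available at https://software.broadinstitute.org/cancer/cga/absolute_run and is reproduced below. It is organized around three main functions: DoAbsolute(), a wrapper defined in the script that sets the parameter values; RunAbsolute(), which performs the actual run; and CreateReviewObject(), which aggregates the results. Because the code processes one sample per call, an R parallelization package can be used to run several samples in parallel.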
DoAbsolute <- function(scan, sif) {
  ## use the sequential foreach backend inside this call
  registerDoSEQ()
  library(ABSOLUTE)

  ## sample annotation
  plate.name <- "DRAWS"
  genome <- "hg18"
  platform <- "SNP_250K_STY"
  primary.disease <- sif[scan, "PRIMARY_DISEASE"]
  sample.name <- sif[scan, "SAMPLE_NAME"]

  ## parameter values passed to RunAbsolute()
  sigma.p <- 0
  max.sigma.h <- 0.02
  min.ploidy <- 0.95
  max.ploidy <- 10
  max.as.seg.count <- 1500
  max.non.clonal <- 0
  max.neg.genome <- 0
  copy_num_type <- "allelic"

  ## input segmentation file and output directories
  seg.dat.fn <- file.path("output", scan, "hapseg",
                          paste(plate.name, "_", scan, "_segdat.RData", sep=""))
  results.dir <- file.path(".", "output", scan, "absolute")
  print(paste("Starting scan", scan, "at", results.dir))
  log.dir <- file.path(".", "output", "abs_logs")
  if (!file.exists(log.dir)) {
    dir.create(log.dir, recursive=TRUE)
  }
  if (!file.exists(results.dir)) {
    dir.create(results.dir, recursive=TRUE)
  }

  ## redirect console output to a per-sample log file during the run
  sink(file=file.path(log.dir, paste(scan, ".abs.out.txt", sep="")))
  RunAbsolute(seg.dat.fn, sigma.p, max.sigma.h, min.ploidy, max.ploidy, primary.disease,
              platform, sample.name, results.dir, max.as.seg.count, max.non.clonal,
              max.neg.genome, copy_num_type, verbose=TRUE)
  sink()
}

arrays.txt <- "./paper_example/mix250K_arrays.txt"
sif.txt <- "./paper_example/mix_250K_SIF.txt"

## read in array names
scans <- readLines(arrays.txt)[-1]
sif <- read.delim(sif.txt, as.is=TRUE)

library(foreach)
## library(doMC)
## registerDoMC(20)
foreach (scan=scans, .combine=c) %dopar% {
  DoAbsolute(scan, sif)
}

obj.name <- "DRAWS_summary"
results.dir <- file.path(".", "output", "abs_summary")
absolute.files <- file.path(".", "output",
                            scans, "absolute",
                            paste(scans, ".ABSOLUTE.RData", sep=""))
library(ABSOLUTE)
CreateReviewObject(obj.name, absolute.files, results.dir, "allelic", verbose=TRUE)

## At this point you'd perform your manual review and mark up the file
## output/abs_summary/DRAWS_summary.PP-calls_tab.txt by prepending a column with
## your desired solution calls. After that (or w/o doing that if you choose to accept
## the defaults, which is what running this code will do) run the following command:
calls.path <- file.path("output", "abs_summary", "DRAWS_summary.PP-calls_tab.txt")
modes.path <- file.path("output", "abs_summary", "DRAWS_summary.PP-modes.data.RData")
output.path <- file.path("output", "abs_extract")
ExtractReviewedResults(calls.path, "test", modes.path, output.path, "absolute", "allelic")
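
In the script above the doMC lines are commented out, so the %dopar% loop runs the samples one after another. As a rough sketch of per-sample parallelism, the example below uses the parallel package that ships with R instead of doMC. The function process_sample and the sample names are hypothetical stand-ins for DoAbsolute(scan, sif) and the real scan list, so the block runs without ABSOLUTE installed.

```r
library(parallel)

# Hypothetical stand-in for DoAbsolute(scan, sif): it only returns a label.
process_sample <- function(scan) {
  paste("finished", scan)
}

scans <- c("sample_A", "sample_B", "sample_C")  # made-up sample names

n_workers <- 2                 # number of worker processes (assumed value)
cl <- makeCluster(n_workers)   # start the workers

# Run one sample per task; always shut the workers down afterwards.
results <- tryCatch(
  parLapply(cl, scans, process_sample),
  finally = stopCluster(cl)
)
names(results) <- scans
print(results)
```

In a real run each worker would also need the ABSOLUTE package loaded and access to the sif table, for example through clusterEvalQ() and clusterExport() from the same package.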