TCGA 数据挖掘 -- DNA methylation-dri
最近打算重复一篇文章 DNA methylation-driven genes for constructing diagnostic, prognostic, and recurrence models for hepatocellular carcinoma,IF 2020 = 8.579,摘要我大概整理了一下如下:

这是一篇纯TCGA和GEO数据挖掘的文章,通过筛选肝癌的DNA甲基化驱动基因,构建预测预后和复发模型,用到了很多模型的构建和评估的手段,这些正是我目前需要学习的。
此外,由于我一直处理的数据是自己测的甲基化二代测序数据,从来没处理过公共数据,公共数据库也基本没碰过,这有点数不过去,因此也想借此机会学习公共数据的下载、清洗和分析等知识,并将学习过程记录下来,做到哪我就更新到哪,供自己和需要的朋友参考。
2021-01-29
一、原始数据下载
1. TCGA数据
先看下研究下文章和TCGA数据库:

文中介绍,分别下载TCGA的转录组(n = 421)和 DNA甲基化数据(n = 430),从附件中可以找到每个样本的TCGA 样本号

我对TCGA的数据库也不太懂,只知道下载数据是在GDC里下,我就先随便拿了一个样本号(比如第一个
TCGA-2Y-A9H4-01A
)放搜索框里搜一下 康康。
该页面展示了这个病人的各种可获得的数据,比如mapping好的bam文件(什么测序的bam啊?)、RNA-seq数据、SNV数据、CNV数据、DNA甲基化数据、临床信息、和生物切片这些内容,可以说很强了。每一种数据类型又有几个细分的文件你,这里就不细说了。
之后点击Projiect栏的 TCGA-LIHC 链接:
首先搜了一下TCGA-LIHC,https://wiki.cancerimagingarchive.net/display/Public/TCGA-LIHC,全名为:The Cancer Genome Atlas Liver Hepatocellular Carcinoma (TCGA-LIHC) data,既TCGA肝细胞癌数据,从 GDC主页 可以看到,GDC 存储了60多种癌症(可能还有非癌症?)的各种数据,而肝细胞癌就是其中之一。
我们细看TCGA-LIHC一下这个项目:

乍一看跟文章中样本个数有点对不上,因为这里展示的是病人个数而且文件没有具体分类,到后面仔细整理文件后会发现跟文中一模一样。因此在TCGA数据库中只下这部分的内容就行了。
接下来就要正式开始下载TCGA数据了。
然后就要搜如何下载TCGA的数据库了,我使用的方法是使用官方提供的gdc-client
下载工具,下载和使用方法见 TCGA 数据下载工具 -- gdc-client
gdc-client download -m samplesheet/gdc_manifest_methylation.2021-01-27.txt -d raw/methylation
gdc-client download -m samplesheet/gdc_manifest_transcriptome.2021-01-27.txt -d raw/transcriptome
gdc-client download -m samplesheet/gdc_manifest_clinical.2021-01-27.txt -d raw/clinical
然后然后就下载好了:
folders = list.files("raw", full.names = TRUE); folders
[1] "raw/clinical" "raw/methylation" "raw/transcriptome"
Nfiles = sapply(folders, function(x) length(list.files(x))); Nfiles
raw/clinical raw/methylation raw/transcriptome
423 430 1283
sum(Nfiles)
[1] 2136
转录组数据有三种,所以文件很多。
2021-01-31
3. 数据格式转换
Step1. build IDtable
需要做的就是获取每个文件ID所对应的样本的 barcode,用于后面的ID转换,参考:如何将TCGA文件ID(UUID)转换为TCGA 样本ID(Barcode)。
最后得到的样本ID对照表IDtable
如下:
IDtable[1:5,]
barcode
1 TCGA-DD-AAEH-01A-11D-A40S-05
2 TCGA-ED-A7PX-01A-51R-A352-07
3 TCGA-DD-A3A4
4 TCGA-G3-A3CG-01A-11D-A20Z-05
5 TCGA-DD-AACJ-01A-11D-A40S-05
file_name
1 jhu-usc.edu_LIHC.HumanMethylation450.19.lvl-3.TCGA-DD-AAEH-01A-11D-A40S-05.gdc_hg38.txt
2 872f4298-0808-497d-a951-61ca4d55eb58.FPKM-UQ.txt.gz
3 nationwidechildrens.org_clinical.TCGA-DD-A3A4.xml
4 jhu-usc.edu_LIHC.HumanMethylation450.5.lvl-3.TCGA-G3-A3CG-01A-11D-A20Z-05.gdc_hg38.txt
5 jhu-usc.edu_LIHC.HumanMethylation450.19.lvl-3.TCGA-DD-AACJ-01A-11D-A40S-05.gdc_hg38.txt
file_id
1 ce263e52-5f84-49e0-b026-c23a32ea4328
2 52518d6f-cbcb-4bc6-bae6-e9370c15660f
3 08131208-2f43-4b7c-a891-9acbd7eaa353
4 0a263d86-8451-40fa-89f4-fdb138ff0c56
5 ec3e433b-e217-4775-8979-1d2f14329861
dim(IDtable)
[1] 2125 3
IDtable的行数没有上面总文件个数2136
多,是因为下载的时候还有一些log文件,而它们是没有对应的ID的。
Step2: merge methylation data
接下来将合并430个样本的450K甲基化数据,对于基因组区域或位点的注释信息我喜欢保存为GenomicRanges的格式,非常便于统计和整合,当然,根本原因是我喜欢用R(我也只会用R :)
430个DNA 450K甲基化数据位于raw/methylation
路径内,我要把这些样本的完整甲基化数据全部保存为一个GenomicRanges对象S430_CpG_Batch
:
sourceFolder = "raw/methylation"
folders = list.files(sourceFolder, full.names = TRUE)
Ns = length(folders)
grList = list()
for(i in 1:Ns){
cat("Processing", i, "Remian", Ns - i, "\n")
fileIn = list.files(folders[i], "hg38.txt", full.names = TRUE)
SID = IDtable$barcode[IDtable$file_name == basename(fileIn)]
Tx = data.frame(fread(fileIn, header = TRUE, sep = "\t"))
Probe = Tx[, 1]
Beta = Tx[, 2]
Chr = Tx[, 3]
ST = Tx[, 4]
ED = Tx[, 5]
Gene_Symbol = Tx[, 6]
Gene_Type = Tx[, 7]
Transcript_ID = Tx[, 8]
Position_to_TSS = Tx[, 9]
CGI_Coordinate = Tx[, 10]
Feature_Type = Tx[, 11]
meta = data.frame(Beta, SID, Probe, Gene_Symbol, Gene_Type, Transcript_ID, Position_to_TSS, CGI_Coordinate, Feature_Type, stringsAsFactors = FALSE)
gr = buildGR(Chr, ST, ED, meta = meta)
grList[[i]] = gr
rm(gr)
}
S430_CpG_Batch = grListToGR(grList)
save(S430_CpG_Batch, file = "RData/S430_CpG_Batch.RData")
其中有两个自编的函数buildGR
和grListToGR
:
buildGR <- function(Chr, ST, ED, strand = "*", meta = NA){
suppressMessages(library("GenomicRanges"))
gr = GRanges(seqnames = Rle(Chr), ranges = IRanges(ST, ED), strand = strand)
if(class(meta) == "data.frame"){
mcols(gr) = meta
} else if(class(meta) != "logical"){
stop("'meta' must be a dataframe.")
}
return(gr)
}
grListToGR <- function(grList){
return(unlist(GRangesList(grList)))
}
汇总的RData有5G多的大小,2亿多个CpG位点信息:
> S430_CpG_batch
GRanges object with 208798110 ranges and 9 metadata columns:
seqnames ranges strand | Beta
<Rle> <IRanges> <Rle> | <numeric>
[1] chr16 53434200-53434201 * | 0.482689357131024
[2] chr3 37417715-37417716 * | <NA>
[3] chr3 172198247-172198248 * | <NA>
[4] chr1 90729117-90729118 * | 0.110930315594135
[5] chr8 42405776-42405777 * | 0.920895333617517
... ... ... ... . ...
[208798106] chr6 67522149 * | 0.941042226262915
[208798107] chr3 14617359 * | 0.959313818243286
[208798108] chr15 45707625 * | 0.588890472717803
[208798109] chr2 12008094 * | 0.917096287761782
[208798110] chr3 86613005 * | 0.904755374606857
... ...
seqinfo: 25 sequences from an unspecified genome; no seqlengths
# 注释信息
> colnames(mcols(S430_CpG_batch))
[1] "Beta" "SID" "Probe" "Gene_Symbol"
[5] "Gene_Type" "Transcript_ID" "Position_to_TSS" "CGI_Coordinate"
[9] "Feature_Type"
为了便于使用,我将另存一些关键信息:
S430_CpG_Simple = S430_CpG_batch
mcols(S430_CpG_Simple) = mcols(S430_CpG_batch)[, c("Beta", "SID", "Probe")]
save(S430_CpG_Simple, file = "RData/S430_CpG_Simple.RData")
还是有2.7G的大小!
参与差异甲基化分析的数据只需要用到CpG位点和样本ID以及对应的beta值,因此保存一个甲基化矩阵便于后续计算:
# build beta matrix
SID = as.character(unique(S430_CpG_Simple$SID))
betaM = sapply(SID, function(tag){
idx = match(tag, SID)
cat("Processing", idx, "Remian", length(SID) - idx, "\n")
xgr = S430_CpG_Simple[S430_CpG_Simple$SID == tag]
return(xgr$Beta)
})
Pos = as.character(S430_CpG_Simple[S430_CpG_Simple$SID == SID[1]])
dimnames(betaM) = list(Pos, SID)
S430_MM = betaM
save(S430_MM, file = "RData/S430_MM.RData")
S430_MM[1:5, 1:3]
TCGA-LG-A9QD-01A-11D-A383-05
chr16:53434200-53434201 0.4826894
chr3:37417715-37417716 NA
chr3:172198247-172198248 NA
chr1:90729117-90729118 0.1109303
chr8:42405776-42405777 0.9208953
TCGA-2Y-A9H5-01A-11D-A383-05
chr16:53434200-53434201 0.1711770
chr3:37417715-37417716 NA
chr3:172198247-172198248 NA
chr1:90729117-90729118 0.1104399
chr8:42405776-42405777 0.9178311
TCGA-DD-AACN-01A-11D-A40S-05
chr16:53434200-53434201 0.3024225
chr3:37417715-37417716 NA
chr3:172198247-172198248 NA
chr1:90729117-90729118 0.6696605
chr8:42405776-42405777 0.8757493
dim(S430_MM)
[1] 485577 430
这个矩阵就只有1.3G的大小了~
2021-02-02
Step3: merge transcriptome FPKM data
sourceFolder = "raw/transcriptome"
FPKM = list.files(sourceFolder, ".FPKM.txt.gz$", recursive = TRUE, full.names = TRUE)
expM = sapply(FPKM, function(file){
idx = match(file, FPKM)
cat("Processing", idx, "Remian", length(FPKM) - idx, "\n")
gunzip(file, remove = FALSE)
values = read.table(gsub(".gz", "", file))[, 2]
return(values)
})
ENSG = read.table(gsub(".gz", "", FPKM[1]))[, 1]
SID = IDtable$barcode[match(basename(FPKM), IDtable$file_name)]
dimnames(expM) = list(ENSG, SID)
S424_EM = expM
save(S424_EM, file = "RData/S424_EM.RData")
S424_EM[1:5, 1:4]
TCGA-ED-A7PZ-01A-11R-A33R-07 TCGA-BD-A3EP-11A-12R-A22L-07
ENSG00000242268.2 0.000000 0.000000
ENSG00000270112.3 0.000000 0.000000
ENSG00000167578.15 5.852997 1.332776
ENSG00000273842.1 0.000000 0.000000
ENSG00000078237.5 1.825956 1.497775
TCGA-DD-AAVX-01A-11R-A41C-07 TCGA-DD-A113-01A-11R-A131-07
ENSG00000242268.2 0.000000 0.262419
ENSG00000270112.3 0.000000 0.000000
ENSG00000167578.15 1.341712 2.278868
ENSG00000273842.1 0.000000 0.000000
ENSG00000078237.5 1.692771 3.527322
Step 4: process clinical data
clinical.cart_all.2021-02-01.json
文件是从购物车下载的。
# 1. 若days_to_death为NA, 则说明还没死亡, 用days_to_last_follow_up代替
# 如果仍为NA, 则为缺失样本. days_to_death即为总体生存时间: overall_survival
# 2. 去除overall_survival和tumor_stage为NA的样本?
sourceFolder = "samplesheet/clinical.cart_all.2021-02-01.json"
clinical = jsonlite::fromJSON(sourceFolder)
days_to_last_follow_up = sapply(clinical$diagnoses, function(x) x["days_to_last_follow_up"][[1]])
tumor_stage = sapply(clinical$diagnoses, function(x) x["tumor_stage"][[1]])
demographic = clinical$demographic[, c("gender", "days_to_death", "vital_status")]
PID = sapply(strsplit(clinical$demographic[, "submitter_id"], "_"), function(x) x[1])
clinical_trait = data.frame(PID, days_to_last_follow_up, tumor_stage, demographic, stringsAsFactors = FALSE)
idx = is.na(clinical_trait$days_to_death)
clinical_trait[idx, 'days_to_death'] <- clinical_trait[idx, 'days_to_last_follow_up']
clinical_trait = clinical_trait[, c(1,4,5,6,3)]
dimnames(clinical_trait) <- list(PID, c("PID", "gender", "overall_survival", "censoring_status", "tumor_stage"))
save(clinical_trait, file = "RData/clinical_trait.RData")
clinical_trait[1:5, ]
PID gender overall_survival censoring_status tumor_stage
TCGA-DD-AACJ TCGA-DD-AACJ male 2102 Alive stage ii
TCGA-DD-A11C TCGA-DD-A11C male 662 Alive stage i
TCGA-BC-A112 TCGA-BC-A112 male 153 Dead not reported
TCGA-XR-A8TG TCGA-XR-A8TG male 898 Alive stage i
TCGA-DD-AADG TCGA-DD-AADG male 1145 Alive stage iiia