[R recipe | TCGA] Converting the sample IDs of files downloaded from TCGA
2018-05-11
王诗翔
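If you download data directly from the TCGA website (now the GDC portal), the downloaded files have names that look like random strings. To map each file to its correct sample ID (also called the Tumor Sample Barcode), you can use the GenomicDataCommons R package, which queries the GDC API. The only input it needs is the manifest file you downloaded.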
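Install the GenomicDataCommons package with BiocInstaller::biocLite("GenomicDataCommons") (on Bioconductor 3.8 and later, use BiocManager::install("GenomicDataCommons") instead), then change the file path below and run the code: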
manifest <- read.table("/location/gdc_manifest.txt", header = TRUE, stringsAsFactors = FALSE)
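# Both packages must be loaded: files(), filter(), select(), results_all()
# and ids() come from GenomicDataCommons, and %>% comes from magrittr.
library(GenomicDataCommons)
library(magrittr)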
file_uuids <- manifest$id
head(file_uuids)
TCGAtranslateID = function(file_ids, legacy = TRUE) {
    info = files(legacy = legacy) %>%
        filter( ~ file_id %in% file_ids) %>%
        select('cases.samples.submitter_id') %>%
        results_all()
    # The mess of code below is to extract TCGA barcodes
    # id_list will contain a list (one item for each file_id)
    # of TCGA barcodes of the form 'TCGA-XX-YYYY-ZZZ'
    id_list = lapply(info$cases, function(a) {
        a[[1]][[1]][[1]]
    })
    # so we can later expand to a data.frame of the right size
    barcodes_per_file = sapply(id_list, length)
    # And build the data.frame
    return(data.frame(file_id = rep(ids(info), barcodes_per_file),
                      submitter_id = unlist(id_list)))
}
res = TCGAtranslateID(file_uuids)
head(res)
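
The result has one row per (file, barcode) pair, so a natural next step is to join it back onto the manifest and get a file name to barcode table. Below is a minimal sketch of that join in base R. It does not call the GDC API: the manifest and the translation result are small made-up stand-ins, and the UUIDs, file names, and barcodes are hypothetical placeholders. It also assumes the manifest has the usual `id` and `filename` columns.

```r
# Toy manifest (made-up values) with the two columns used here.
manifest <- data.frame(
  id = c("uuid-aaa", "uuid-bbb"),
  filename = c("a.maf.gz", "b.maf.gz"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the output of TCGAtranslateID():
# one row per (file, barcode) pair, so a file can appear more than once.
res <- data.frame(
  file_id = c("uuid-aaa", "uuid-bbb", "uuid-bbb"),
  submitter_id = c("TCGA-XX-0001-01A", "TCGA-XX-0002-01A", "TCGA-XX-0002-10A"),
  stringsAsFactors = FALSE
)

# Left join on the file UUID; all.x = TRUE keeps manifest files that
# received no barcode, so they show up with NA instead of disappearing.
annotated <- merge(manifest, res, by.x = "id", by.y = "file_id", all.x = TRUE)
print(annotated)

# Files associated with more than one sample barcode
# (for example, a file derived from a tumor/normal pair).
multi <- names(which(table(res$file_id) > 1))
print(multi)
```

With real data, replace the two toy data frames with the `manifest` and `res` objects created above. It is worth checking `multi` before treating the mapping as one barcode per file.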