[R <- Packages | Datasets | Public Databases] UCSCXenaTools
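The XenaR package provides a simple interface to UCSC Xena for retrieving the information it hosts, including thousands of datasets from databases such as GDC, TCGA, ICGC, GTEx, and CCLE. In particular, UCSC has normalized part of the TCGA data (hg19 version) very well, so it is usable right after download. Lately I have wanted to download this data with code rather than clicking through web pages every time. Since the original author of XenaR has not updated the package for 3 years, I built on it and adapted it to the Hug API that UCSC Xena currently provides, so the original package's functionality works again (see https://github.com/DataGeeker/XenaR). On top of that package, I am now building UCSCXenaTools.
Click to view the datasets currently provided by Xena.
The package can currently search datasets, download them, and import them into R. Below is a brief walkthrough of its usage. I have no time to write documentation right now, so this post is the main reference for using the package.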
Usage
Installation
Install from GitHub by running the following code:
if (!require(devtools)) {
  install.packages("devtools", dependencies = TRUE)
}
devtools::install_github("ShixiangWang/UCSCXenaTools")
Loading
library(UCSCXenaTools)
Exploring
Use XenaHub() to get all available resources. You can also restrict the result to what interests you through the arguments hosts, cohorts, and datasets.
xe <- XenaHub()
xe
## class: XenaHub
## hosts():
## https://ucscpublic.xenahubs.net
## https://tcga.xenahubs.net
## https://gdc.xenahubs.net
## https://icgc.xenahubs.net
## https://toil.xenahubs.net
## cohorts() (137 total):
## (unassigned)
## 1000_genomes
## Acute lymphoblastic leukemia (Mullighan 2008)
## ...
## TCGA Pan-Cancer (PANCAN)
## TCGA TARGET GTEx
## datasets() (1521 total):
## parsons2008cgh_public/parsons2008cgh_genomicMatrix
## parsons2008cgh_public/parsons2008cgh_public_clinicalMatrix
## vijver2002_public/vijver2002_genomicMatrix
## ...
## TCGA_survival_data
## mc3.v0.2.8.PUBLIC.toil.xena
head(cohorts(xe))
## [1] "(unassigned)"
## [2] "1000_genomes"
## [3] "Acute lymphoblastic leukemia (Mullighan 2008)"
## [4] "B cells (Basso 2005)"
## [5] "Breast Cancer (Caldas 2007)"
## [6] "Breast Cancer (Chin 2006)"
The result is a XenaHub object.
To simplify the input for hosts(), we can use hostName to specify that we want to search TCGA content, as follows:
XenaHub(hostName = "TCGA")
## class: XenaHub
## hosts():
## https://tcga.xenahubs.net
## cohorts() (39 total):
## (unassigned)
## TCGA Acute Myeloid Leukemia (LAML)
## TCGA Adrenocortical Cancer (ACC)
## ...
## TCGA Thyroid Cancer (THCA)
## TCGA Uterine Carcinosarcoma (UCS)
## datasets() (879 total):
## TCGA.OV.sampleMap/HumanMethylation27
## TCGA.OV.sampleMap/HumanMethylation450
## TCGA.OV.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes
## ...
## TCGA.MESO.sampleMap/MESO_clinicalMatrix
## TCGA.MESO.sampleMap/Pathway_Paradigm_RNASeq_And_Copy_Number
The functions hosts(), cohorts(), datasets(), and samples() retrieve the corresponding content; each takes a XenaHub object as input.
hosts(xe)
## [1] "https://ucscpublic.xenahubs.net" "https://tcga.xenahubs.net"
## [3] "https://gdc.xenahubs.net" "https://icgc.xenahubs.net"
## [5] "https://toil.xenahubs.net"
cohorts(xe)
## [1] "(unassigned)"
## [2] "1000_genomes"
## [3] "Acute lymphoblastic leukemia (Mullighan 2008)"
## [4] "B cells (Basso 2005)"
## [5] "Breast Cancer (Caldas 2007)"
## [6] "Breast Cancer (Chin 2006)"
## [7] "Breast Cancer (Haverty 2008)"
## [8] "Breast Cancer (Hess 2006)"
## [9] "Breast Cancer (Miller 2005)"
## [10] "Breast Cancer (vantVeer 2002)"
## [11] "Breast Cancer (Vijver 2002)"
## [12] "Breast Cancer (Yau 2010)"
## [13] "Breast Cancer Cell Lines (Heiser 2012)"
## [14] "Breast Cancer Cell Lines (Neve 2006)"
## [15] "Cancer Cell Line Encyclopedia (Breast)"
## [16] "Cancer Cell Line Encyclopedia (CCLE)"
## [17] "Connectivity Map"
## [18] "DIPG and Pediatric Non-Brainstem High-Grade Glioma (Wu 2014, St Jude)"
## [19] "Ewing Sarcoma Family of Tumors (Brohl 2014)"
## [20] "GBM (Parsons 2008)"
## [21] "Glioma (Kotliarov 2006)"
## [22] "Inbred mouse (Cutler 2007)"
## [23] "Lung Adenocarcinoma (Ding 2008)"
## [24] "Lung Cancer (Raponi 2006)"
## [25] "Lung Cancer CGH (Weir 2007)"
## [26] "lymph-node-negative breast cancer (Wang 2005)"
## [27] "MAGIC"
## [28] "Melanoma (Lin 2008)"
## [29] "Mouse and Human Colon Tumors (Kaiser 2007)"
## [30] "Mouse pancreatic adenocarcinoma (Bardeesy 2006)"
## [31] "Mouse Tumors (Maser 2007)"
## [32] "NCI60"
## [33] "Neuroblastoma (Khan)"
## [34] "Neuroblastoma (Sausen 2013)"
## [35] "Node-negative breast cancer (Desmedt 2007)"
## [36] "Ovarian Cancer (Etemadmoghadam 2009)"
## [37] "Pancreatic Cancer (Balagurunathan 2008)"
## [38] "Pancreatic Cancer (Harada 2008)"
## [39] "Pancreatic Cancer (Jones 2008)"
## [40] "Pediatric diffuse intrinsic pontine gliomas (Puget 2012)"
## [41] "Pediatric tumor (Khan)"
## [42] "POG TCGA TARGET_NBL"
## [43] "Single-cell RNA-seq mouse cortex (Zeisel)"
## [44] "St Jude PCGP pan-cancer"
## [45] "TARGET Acute Lymphoblastic Leukemia"
## [46] "TARGET neuroblastoma"
## [47] "(unassigned)"
## [48] "TCGA Acute Myeloid Leukemia (LAML)"
## [49] "TCGA Adrenocortical Cancer (ACC)"
## [50] "TCGA Bile Duct Cancer (CHOL)"
## [51] "TCGA Bladder Cancer (BLCA)"
## [52] "TCGA Breast Cancer (BRCA)"
## [53] "TCGA Cervical Cancer (CESC)"
## [54] "TCGA Colon and Rectal Cancer (COADREAD)"
## [55] "TCGA Colon Cancer (COAD)"
## [56] "TCGA Endometrioid Cancer (UCEC)"
## [57] "TCGA Esophageal Cancer (ESCA)"
## [58] "TCGA Formalin Fixed Paraffin-Embedded Pilot Phase II (FPPP)"
## [59] "TCGA Glioblastoma (GBM)"
## [60] "TCGA Head and Neck Cancer (HNSC)"
## [61] "TCGA Kidney Chromophobe (KICH)"
## [62] "TCGA Kidney Clear Cell Carcinoma (KIRC)"
## [63] "TCGA Kidney Papillary Cell Carcinoma (KIRP)"
## [64] "TCGA Large B-cell Lymphoma (DLBC)"
## [65] "TCGA Liver Cancer (LIHC)"
## [66] "TCGA Lower Grade Glioma (LGG)"
## [67] "TCGA lower grade glioma and glioblastoma (GBMLGG)"
## [68] "TCGA Lung Adenocarcinoma (LUAD)"
## [69] "TCGA Lung Cancer (LUNG)"
## [70] "TCGA Lung Squamous Cell Carcinoma (LUSC)"
## [71] "TCGA Melanoma (SKCM)"
## [72] "TCGA Mesothelioma (MESO)"
## [73] "TCGA Ocular melanomas (UVM)"
## [74] "TCGA Ovarian Cancer (OV)"
## [75] "TCGA Pan-Cancer (PANCAN)"
## [76] "TCGA Pancreatic Cancer (PAAD)"
## [77] "TCGA Pheochromocytoma & Paraganglioma (PCPG)"
## [78] "TCGA Prostate Cancer (PRAD)"
## [79] "TCGA Rectal Cancer (READ)"
## [80] "TCGA Sarcoma (SARC)"
## [81] "TCGA Stomach Cancer (STAD)"
## [82] "TCGA Testicular Cancer (TGCT)"
## [83] "TCGA Thymoma (THYM)"
## [84] "TCGA Thyroid Cancer (THCA)"
## [85] "TCGA Uterine Carcinosarcoma (UCS)"
## [86] "(unassigned)"
## [87] "GDC Pan-Cancer (PANCAN)"
## [88] "GDC TARGET-AML"
## [89] "GDC TARGET-CCSK"
## [90] "GDC TARGET-NBL"
## [91] "GDC TARGET-OS"
## [92] "GDC TARGET-RT"
## [93] "GDC TARGET-WT"
## [94] "GDC TCGA Acute Myeloid Leukemia (LAML)"
## [95] "GDC TCGA Adrenocortical Cancer (ACC)"
## [96] "GDC TCGA Bile Duct Cancer (CHOL)"
## [97] "GDC TCGA Bladder Cancer (BLCA)"
## [98] "GDC TCGA Breast Cancer (BRCA)"
## [99] "GDC TCGA Cervical Cancer (CESC)"
## [100] "GDC TCGA Colon Cancer (COAD)"
## [101] "GDC TCGA Endometrioid Cancer (UCEC)"
## [102] "GDC TCGA Esophageal Cancer (ESCA)"
## [103] "GDC TCGA Glioblastoma (GBM)"
## [104] "GDC TCGA Head and Neck Cancer (HNSC)"
## [105] "GDC TCGA Kidney Chromophobe (KICH)"
## [106] "GDC TCGA Kidney Clear Cell Carcinoma (KIRC)"
## [107] "GDC TCGA Kidney Papillary Cell Carcinoma (KIRP)"
## [108] "GDC TCGA Large B-cell Lymphoma (DLBC)"
## [109] "GDC TCGA Liver Cancer (LIHC)"
## [110] "GDC TCGA Lower Grade Glioma (LGG)"
## [111] "GDC TCGA Lung Adenocarcinoma (LUAD)"
## [112] "GDC TCGA Lung Squamous Cell Carcinoma (LUSC)"
## [113] "GDC TCGA Melanoma (SKCM)"
## [114] "GDC TCGA Mesothelioma (MESO)"
## [115] "GDC TCGA Ocular melanomas (UVM)"
## [116] "GDC TCGA Ovarian Cancer (OV)"
## [117] "GDC TCGA Pancreatic Cancer (PAAD)"
## [118] "GDC TCGA Pheochromocytoma & Paraganglioma (PCPG)"
## [119] "GDC TCGA Prostate Cancer (PRAD)"
## [120] "GDC TCGA Rectal Cancer (READ)"
## [121] "GDC TCGA Sarcoma (SARC)"
## [122] "GDC TCGA Stomach Cancer (STAD)"
## [123] "GDC TCGA Testicular Cancer (TGCT)"
## [124] "GDC TCGA Thymoma (THYM)"
## [125] "GDC TCGA Thyroid Cancer (THCA)"
## [126] "GDC TCGA Uterine Carcinosarcoma (UCS)"
## [127] "(unassigned)"
## [128] "ICGC (donor centric)"
## [129] "ICGC (specimen centric)"
## [130] "ICGC (US donors with both RNA and SNV data)"
## [131] "PACA-AU"
## [132] "(unassigned)"
## [133] "GTEX"
## [134] "TARGET Pan-Cancer (PANCAN)"
## [135] "TCGA and TARGET Pan-Cancer (PANCAN)"
## [136] "TCGA Pan-Cancer (PANCAN)"
## [137] "TCGA TARGET GTEx"
datasets(xe)[1:10]
## [1] "parsons2008cgh_public/parsons2008cgh_genomicMatrix"
## [2] "parsons2008cgh_public/parsons2008cgh_public_clinicalMatrix"
## [3] "vijver2002_public/vijver2002_genomicMatrix"
## [4] "vijver2002_public/vijver2002_public_clinicalMatrix"
## [5] "chin2006_public/chin2006Exp_genomicMatrix"
## [6] "chin2006_public/ucsfChinCGH2006_genomicMatrix"
## [7] "chin2006_public/chin2006_public_clinicalMatrix"
## [8] "Treehouse/Treehouse_Khan_neuroblastoma/expression"
## [9] "Treehouse/Treehouse_Khan_neuroblastoma/neuroblastoma_affy_clinicalMatrix"
## [10] "Treehouse/NBL_Sausen_et_al_2013_SNV.tsv"
# samples(xe)[1:10]
# For the usage of samples(), see <https://github.com/DataGeeker/XenaR/blob/master/inst/README.Rmd>
# The output is too long to show here, and it is not the focus of this package
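Downloading and Importing Data
To let you choose exactly which data to download, the package provides a three-step combo: XenaQuery, XenaDownload, and XenaPrepare.
The following uses downloading and importing TCGA clinical data as an example; other data works the same way.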
filter
Take a look at the datasets of interest:
xe = XenaHub(hostName = "TCGA")
xe
## class: XenaHub
## hosts():
## https://tcga.xenahubs.net
## cohorts() (39 total):
## (unassigned)
## TCGA Acute Myeloid Leukemia (LAML)
## TCGA Adrenocortical Cancer (ACC)
## ...
## TCGA Thyroid Cancer (THCA)
## TCGA Uterine Carcinosarcoma (UCS)
## datasets() (879 total):
## TCGA.OV.sampleMap/HumanMethylation27
## TCGA.OV.sampleMap/HumanMethylation450
## TCGA.OV.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes
## ...
## TCGA.MESO.sampleMap/MESO_clinicalMatrix
## TCGA.MESO.sampleMap/Pathway_Paradigm_RNASeq_And_Copy_Number
There are more than 800 datasets, which is too many. Next we filter them with the filterXena() function. You can use a full name or a regular expression.
(filterXena(xe, filterDatasets = "clinical") -> xe2)
## class: XenaHub
## hosts():
## https://tcga.xenahubs.net
## cohorts() (39 total):
## (unassigned)
## TCGA Acute Myeloid Leukemia (LAML)
## TCGA Adrenocortical Cancer (ACC)
## ...
## TCGA Thyroid Cancer (THCA)
## TCGA Uterine Carcinosarcoma (UCS)
## datasets() (37 total):
## TCGA.OV.sampleMap/OV_clinicalMatrix
## TCGA.DLBC.sampleMap/DLBC_clinicalMatrix
## TCGA.KIRC.sampleMap/KIRC_clinicalMatrix
## ...
## TCGA.READ.sampleMap/READ_clinicalMatrix
## TCGA.MESO.sampleMap/MESO_clinicalMatrix
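Not so many now, right? Note that the function's two arguments, filterCohorts and filterDatasets, are independent of each other, because the core XenaR has no mechanism for updating one when the other changes. This is why the output above still lists 39 cohorts. I will look for another way to solve this later. Here, though, since we focus on downloading and using datasets, cohorts can be ignored.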
datasets(xe2)
## [1] "TCGA.OV.sampleMap/OV_clinicalMatrix"
## [2] "TCGA.DLBC.sampleMap/DLBC_clinicalMatrix"
## [3] "TCGA.KIRC.sampleMap/KIRC_clinicalMatrix"
## [4] "TCGA.SARC.sampleMap/SARC_clinicalMatrix"
## [5] "TCGA.COAD.sampleMap/COAD_clinicalMatrix"
## [6] "TCGA.PRAD.sampleMap/PRAD_clinicalMatrix"
## [7] "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix"
## [8] "TCGA.ACC.sampleMap/ACC_clinicalMatrix"
## [9] "TCGA.KICH.sampleMap/KICH_clinicalMatrix"
## [10] "TCGA.UCS.sampleMap/UCS_clinicalMatrix"
## [11] "TCGA.COADREAD.sampleMap/COADREAD_clinicalMatrix"
## [12] "TCGA.LUNG.sampleMap/LUNG_clinicalMatrix"
## [13] "TCGA.LUAD.sampleMap/LUAD_clinicalMatrix"
## [14] "TCGA.FPPP.sampleMap/FPPP_clinicalMatrix"
## [15] "TCGA.LAML.sampleMap/LAML_clinicalMatrix"
## [16] "TCGA.GBM.sampleMap/GBM_clinicalMatrix"
## [17] "TCGA.KIRP.sampleMap/KIRP_clinicalMatrix"
## [18] "TCGA.PAAD.sampleMap/PAAD_clinicalMatrix"
## [19] "TCGA.CHOL.sampleMap/CHOL_clinicalMatrix"
## [20] "TCGA.CESC.sampleMap/CESC_clinicalMatrix"
## [21] "TCGA.SKCM.sampleMap/SKCM_clinicalMatrix"
## [22] "TCGA.LGG.sampleMap/LGG_clinicalMatrix"
## [23] "TCGA.PCPG.sampleMap/PCPG_clinicalMatrix"
## [24] "TCGA.TGCT.sampleMap/TGCT_clinicalMatrix"
## [25] "TCGA.BLCA.sampleMap/BLCA_clinicalMatrix"
## [26] "TCGA.THYM.sampleMap/THYM_clinicalMatrix"
## [27] "TCGA.BRCA.sampleMap/BRCA_clinicalMatrix"
## [28] "TCGA.UVM.sampleMap/UVM_clinicalMatrix"
## [29] "TCGA.UCEC.sampleMap/UCEC_clinicalMatrix"
## [30] "TCGA.LIHC.sampleMap/LIHC_clinicalMatrix"
## [31] "TCGA.GBMLGG.sampleMap/GBMLGG_clinicalMatrix"
## [32] "TCGA.THCA.sampleMap/THCA_clinicalMatrix"
## [33] "TCGA.HNSC.sampleMap/HNSC_clinicalMatrix"
## [34] "TCGA.ESCA.sampleMap/ESCA_clinicalMatrix"
## [35] "TCGA.STAD.sampleMap/STAD_clinicalMatrix"
## [36] "TCGA.READ.sampleMap/READ_clinicalMatrix"
## [37] "TCGA.MESO.sampleMap/MESO_clinicalMatrix"
I only want the lung cancer datasets, so let's add another condition:
(filterXena(xe2, filterDatasets = "LUAD|LUSC|LUNG")) -> xe2
If you know exactly what you want, you can chain filters with the pipe operator from dplyr; otherwise I recommend selecting step by step.
suppressMessages(require(dplyr))
## Warning: package 'dplyr' was built under R version 3.5.1
xe %>%
  filterXena(filterDatasets = "clinical") %>%
  filterXena(filterDatasets = "luad|lusc|lung")
## class: XenaHub
## hosts():
## https://tcga.xenahubs.net
## cohorts() (39 total):
## (unassigned)
## TCGA Acute Myeloid Leukemia (LAML)
## TCGA Adrenocortical Cancer (ACC)
## ...
## TCGA Thyroid Cancer (THCA)
## TCGA Uterine Carcinosarcoma (UCS)
## datasets() (3 total):
## TCGA.LUSC.sampleMap/LUSC_clinicalMatrix
## TCGA.LUNG.sampleMap/LUNG_clinicalMatrix
## TCGA.LUAD.sampleMap/LUAD_clinicalMatrix
The filtered result is still a XenaHub object.
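The two-step filtering above can be pictured as successive pattern matches on the dataset names. The following is a minimal base R sketch of that idea, not the implementation of filterXena(): the helper filter_names() is hypothetical, and the use of ignore.case = TRUE is an assumption based on the lowercase pattern "luad|lusc|lung" matching uppercase names in the output above.

```r
# A few dataset names copied from the output shown earlier.
datasets <- c(
  "TCGA.OV.sampleMap/HumanMethylation27",
  "TCGA.OV.sampleMap/OV_clinicalMatrix",
  "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix",
  "TCGA.LUNG.sampleMap/LUNG_clinicalMatrix",
  "TCGA.LUAD.sampleMap/LUAD_clinicalMatrix",
  "TCGA.MESO.sampleMap/MESO_clinicalMatrix"
)

# Hypothetical helper: keep the names that match a regular expression.
# Case-insensitive matching is an assumption (see the note above).
filter_names <- function(x, pattern) {
  x[grepl(pattern, x, ignore.case = TRUE)]
}

# Step 1: keep clinical matrices; step 2: keep the lung-related ones.
clinical <- filter_names(datasets, "clinical")
lung <- filter_names(clinical, "luad|lusc|lung")
print(lung)
```

Because each step only narrows a character vector, the steps can be applied one at a time or chained, which mirrors the step-by-step and piped styles shown above.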
query
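Next we get ready to download the 3 selected datasets.
First build a query object (not yet wrapped in a class). It is simply a data frame that stores the host address, the download url, and so on.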
xe2_query = XenaQuery(xe2)
xe2_query
## hosts datasets
## 1 https://tcga.xenahubs.net TCGA.LUSC.sampleMap/LUSC_clinicalMatrix
## 2 https://tcga.xenahubs.net TCGA.LUNG.sampleMap/LUNG_clinicalMatrix
## 3 https://tcga.xenahubs.net TCGA.LUAD.sampleMap/LUAD_clinicalMatrix
## url
## 1 https://tcga.xenahubs.net/download/TCGA.LUSC.sampleMap/LUSC_clinicalMatrix.gz
## 2 https://tcga.xenahubs.net/download/TCGA.LUNG.sampleMap/LUNG_clinicalMatrix.gz
## 3 https://tcga.xenahubs.net/download/TCGA.LUAD.sampleMap/LUAD_clinicalMatrix.gz
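Judging from the printed query, each url looks like the host, followed by /download/, the dataset name, and a .gz suffix. The sketch below rebuilds a data frame of that shape in base R. The pattern is inferred from the output only; it is an assumption, not a statement of how XenaQuery builds its urls internally.

```r
# Rebuild a query-like data frame by hand (values copied from the output above).
query <- data.frame(
  hosts = "https://tcga.xenahubs.net",
  datasets = c(
    "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix",
    "TCGA.LUNG.sampleMap/LUNG_clinicalMatrix",
    "TCGA.LUAD.sampleMap/LUAD_clinicalMatrix"
  ),
  stringsAsFactors = FALSE
)

# Assumed url pattern: <host>/download/<dataset>.gz
query$url <- paste0(query$hosts, "/download/", query$datasets, ".gz")
print(query$url)
```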
download
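By default, XenaDownload downloads data into the Xena_Data directory under the current directory. If a file has already been downloaded, the function says so and skips it; use force=TRUE to force a re-download. It also accepts some arguments that are passed on to download.file.
Note that the function returns a value, which can be used for the subsequent import.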
xe2_download = XenaDownload(xe2_query, destdir = "E:/Github/XenaData/test/")
## We will download files to directory E:/Github/XenaData/test/.
## E:/Github/XenaData/test//TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz, the file has been download!
## E:/Github/XenaData/test//TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz, the file has been download!
## E:/Github/XenaData/test//TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz, the file has been download!
## Note fileNames transfromed from datasets name and / chracter all changed to __ character.
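The messages above show two behaviors worth spelling out: every / in a dataset name becomes __ in the local file name, and files that already exist are skipped unless a re-download is forced. Here is a small base R sketch of both, without any network access. The helpers dataset_to_filename() and needs_download() are hypothetical names made up for illustration; they are not functions of the package.

```r
# Hypothetical helper: map a dataset name to a local file name,
# following the "/" -> "__" convention visible in the messages above.
dataset_to_filename <- function(dataset) {
  paste0(gsub("/", "__", dataset, fixed = TRUE), ".gz")
}

# Hypothetical helper: download only if forced or if the file is missing.
needs_download <- function(destfile, force = FALSE) {
  force || !file.exists(destfile)
}

# Use a temporary directory as a stand-in for destdir.
destdir <- tempfile("XenaData_")
dir.create(destdir)
destfile <- file.path(
  destdir,
  dataset_to_filename("TCGA.LUAD.sampleMap/LUAD_clinicalMatrix")
)

print(basename(destfile))
print(needs_download(destfile))               # TRUE: nothing on disk yet
invisible(file.create(destfile))              # stand-in for a finished download
print(needs_download(destfile))               # FALSE: file exists, so skip
print(needs_download(destfile, force = TRUE)) # TRUE: a forced re-download
```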
prepare
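Once the data is downloaded, it can be imported into R; under the hood this uses the read_tsv function from the readr package.
Four import methods are supported, and more than one file yields a list:
- Specify a local directory (every file in the directory is imported)
- Specify local files
- Specify urls: for just a few files we can import directly from the url, without downloading the data to disk first (not recommended)
- Specify the object returned by the XenaDownload function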
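To make the "one file gives a data frame, several files give a list" behavior concrete, here is a dependency-free sketch. It reads gzipped tab-separated files with base R's read.delim instead of readr's read_tsv, so the result is a plain data.frame rather than a tibble. The helpers write_gz_tsv() and import_files(), the file names, and the toy columns are all hypothetical.

```r
# Hypothetical helper: write a tiny gzipped TSV file for the demo.
write_gz_tsv <- function(path, lines) {
  con <- gzfile(path, "w")
  on.exit(close(con))
  writeLines(lines, con)
}

# Hypothetical helper: a directory expands to its files; one file returns
# a data frame; several files return a list named by file name.
import_files <- function(paths) {
  if (length(paths) == 1 && dir.exists(paths)) {
    paths <- list.files(paths, full.names = TRUE)
  }
  read_one <- function(p) read.delim(gzfile(p), stringsAsFactors = FALSE)
  if (length(paths) == 1) {
    return(read_one(paths))
  }
  res <- lapply(paths, read_one)
  names(res) <- basename(paths)
  res
}

# Build two toy files in a temporary directory.
demo_dir <- tempfile("xena_demo_")
dir.create(demo_dir)
f1 <- file.path(demo_dir, "A_clinicalMatrix.gz")
f2 <- file.path(demo_dir, "B_clinicalMatrix.gz")
write_gz_tsv(f1, c("sampleID\tage", "S1\t60", "S2\t55"))
write_gz_tsv(f2, c("sampleID\tage", "S3\t70"))

one <- import_files(f1)        # a single file
many <- import_files(demo_dir) # a whole directory
print(class(one))
print(names(many))
```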
Method 1:
# way1: directory
cli1 = XenaPrepare("E:/Github/XenaData/test/")
names(cli1)
## [1] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
## [3] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
Method 2:
# way2: local files
cli2 = XenaPrepare("E:/Github/XenaData/test/TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz")
class(cli2)
## [1] "tbl_df" "tbl" "data.frame"
cli2 = XenaPrepare(c("E:/Github/XenaData/test/TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz",
"E:/Github/XenaData/test/TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"))
class(cli2)
## [1] "list"
names(cli2)
## [1] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
Method 3:
# way3: urls
cli3 = XenaPrepare(xe2_download$url[1:2])
names(cli3)
## [1] "LUSC_clinicalMatrix.gz" "LUNG_clinicalMatrix.gz"
Method 4:
# way4: xenadownload object
cli4 = XenaPrepare(xe2_download)
names(cli4)
## [1] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
## [3] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
License
GPL-3
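Next Steps
Finding datasets of interest and downloading them is the core of this package. Besides fixing bugs, I will try to develop features that run faster and keep hosts, cohorts, and datasets in sync, and add exploration and analysis of the downloaded data.
You are welcome to use the package, follow it, star it, and ask questions.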