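Introduction to the TCGA Database
Reposted from: https://biozx.top/TCGA-introduce.html
Overview
The Cancer Genome Atlas (TCGA) is a project launched jointly in 2006 by the US National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). It covers 33 cancer types, matching the 33 TCGA-* projects in the list below.
TCGA applies genome analysis technologies, chiefly large-scale sequencing, together with broad collaboration to understand the molecular mechanisms of cancer. Its aims are to improve scientific understanding of the molecular basis of cancer and our ability to diagnose, treat, and prevent it, and ultimately to build a complete "atlas" of the genomic alterations associated with all cancers.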
library(TCGAbiolinks)
tmp<-getGDCprojects()
# The GDC lists the following 40 projects; 33 of them are TCGA projects (the rest are TARGET and FM-AD)
tmp$project_id
[1] "TCGA-READ" "TCGA-THCA" "TARGET-CCSK" "TCGA-MESO" "TCGA-SARC" "TARGET-AML" "TCGA-LGG"
[8] "TARGET-NBL" "TCGA-ACC" "TCGA-CESC" "TCGA-KIRP" "TCGA-PAAD" "TARGET-WT" "TCGA-PCPG"
[15] "TCGA-UCS" "TCGA-LUAD" "TCGA-BLCA" "TCGA-OV" "TCGA-CHOL" "TCGA-SKCM" "TCGA-GBM"
[22] "TCGA-KIRC" "TCGA-BRCA" "TCGA-UCEC" "TCGA-PRAD" "TCGA-LAML" "TCGA-STAD" "TCGA-LUSC"
[29] "TCGA-KICH" "TCGA-TGCT" "TCGA-DLBC" "TCGA-THYM" "TCGA-UVM" "FM-AD" "TARGET-OS"
[36] "TCGA-HNSC" "TCGA-ESCA" "TCGA-COAD" "TCGA-LIHC" "TARGET-RT"
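The project list mixes several programs, so it can help to group the IDs by prefix before picking a project. A minimal base-R sketch, using the IDs copied from the output above rather than a live `getGDCprojects()` call (the variable names are illustrative, not part of TCGAbiolinks):

```r
# Project IDs copied from the getGDCprojects() output shown above.
project_id <- c(
  "TCGA-READ", "TCGA-THCA", "TARGET-CCSK", "TCGA-MESO", "TCGA-SARC", "TARGET-AML", "TCGA-LGG",
  "TARGET-NBL", "TCGA-ACC", "TCGA-CESC", "TCGA-KIRP", "TCGA-PAAD", "TARGET-WT", "TCGA-PCPG",
  "TCGA-UCS", "TCGA-LUAD", "TCGA-BLCA", "TCGA-OV", "TCGA-CHOL", "TCGA-SKCM", "TCGA-GBM",
  "TCGA-KIRC", "TCGA-BRCA", "TCGA-UCEC", "TCGA-PRAD", "TCGA-LAML", "TCGA-STAD", "TCGA-LUSC",
  "TCGA-KICH", "TCGA-TGCT", "TCGA-DLBC", "TCGA-THYM", "TCGA-UVM", "FM-AD", "TARGET-OS",
  "TCGA-HNSC", "TCGA-ESCA", "TCGA-COAD", "TCGA-LIHC", "TARGET-RT"
)

# The program name is everything before the first hyphen.
program <- sub("-.*$", "", project_id)
program_counts <- table(program)
print(program_counts)

# Keep only the TCGA projects.
tcga_projects <- project_id[program == "TCGA"]
print(length(tcga_projects))
```

With a live session, the same two lines of grouping code can be applied to `tmp$project_id` directly.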
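Data types
Data type | Description |
---|---|
Clinical | Basic patient information: diagnosis, TNM stage, tumor pathology, survival status, etc. |
mRNA | mRNA expression measured with mRNA arrays or RNA-seq |
microRNA | microRNA expression measured with microRNA arrays or RNA-seq |
CopyNumber | Ratio of tumor to normal tissue signal for each chromosomal segment, measured with SNP arrays |
Mutation | Nucleotide changes in tumor sequencing data relative to the reference genome, including insertions and deletions |
Protein | Expression levels of more than 200 cancer-related proteins, measured with protein arrays |
Methylation | DNA methylation levels measured with methylation arrays |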
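I. Clinical data
TCGA clinical data come in two forms:
- XML data: the most complete source, including radiation, drug, follow-up, and biospecimen information.
- Indexed data: only the final status. For example, if a patient is first recorded as alive and later as dead, the indexed data contain only the dead record, whereas the XML contains both.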
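The difference between the two forms can be illustrated by collapsing a longitudinal table to the latest record per patient. This is only a sketch on made-up data: the patient IDs, column names, and values are hypothetical and do not reproduce the real GDC schema or its indexing logic.

```r
# Hypothetical follow-up records, one row per event (XML-like view).
followup <- data.frame(
  patient = c("P1", "P1", "P2", "P3", "P3"),
  days_since_diagnosis = c(100, 420, 250, 90, 30),
  vital_status = c("Alive", "Dead", "Alive", "Alive", "Alive"),
  stringsAsFactors = FALSE
)

# Sort by patient and time, then keep the last row of each patient
# (indexed-like view: final status only).
by_time <- followup[order(followup$patient, followup$days_since_diagnosis), ]
indexed <- by_time[!duplicated(by_time$patient, fromLast = TRUE), ]
rownames(indexed) <- NULL
print(indexed)
```

Patient P1 appears twice in `followup` (alive, then dead) but only once in `indexed`, with the dead record.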
Downloading indexed data
library(DT)  # provides datatable()
# Fetch the indexed clinical table for TCGA-LUAD
clinical <- GDCquery_clinic(project = "TCGA-LUAD", type = "clinical")
datatable(clinical, filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
          rownames = FALSE)
Downloading XML data
# Query the clinical files for two patients
query <- GDCquery(project = "TCGA-COAD",
                  data.category = "Clinical",
                  barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
# Parse the patient section of the downloaded XML
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
datatable(clinical, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
II. mRNA expression data
Three file types are available: HTSeq counts, FPKM, and FPKM-UQ.
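FPKM and FPKM-UQ are both derived from the raw counts and differ only in the normalization denominator. A minimal base-R sketch, assuming the formulas FPKM = count × 10^9 / (total protein-coding reads × gene length) and FPKM-UQ with the 75th-percentile count in place of the total, as described for the GDC HTSeq workflow. The exact definition may differ between GDC data releases, so treat this as an illustration. Gene names, counts, and lengths are made up, and all four toy genes are treated as protein-coding.

```r
# FPKM: normalize by the total count over (assumed) protein-coding genes.
fpkm <- function(counts, gene_length_bp) {
  counts * 1e9 / (sum(counts) * gene_length_bp)
}

# FPKM-UQ: normalize by the 75th-percentile count instead of the total.
fpkm_uq <- function(counts, gene_length_bp) {
  upper_quartile <- unname(quantile(counts, 0.75))
  counts * 1e9 / (upper_quartile * gene_length_bp)
}

# Hypothetical raw counts and gene lengths (bp) for one sample.
counts <- c(geneA = 100, geneB = 200, geneC = 300, geneD = 400)
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 1000, geneD = 4000)

fpkm_values <- fpkm(counts, gene_length_bp)
fpkm_uq_values <- fpkm_uq(counts, gene_length_bp)
print(round(fpkm_values))
print(round(fpkm_uq_values))
```

The upper-quartile variant is less sensitive to a handful of very highly expressed genes dominating the total count.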
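# Query expression data; call GDCdownload(query.exp.hg19) to fetch the files.
# Note: this example targets the legacy (hg19) archive and its normalized_results
# files, not the HTSeq files listed above. Recent TCGAbiolinks releases may no
# longer accept the legacy argument.
query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                           data.category = "Gene expression",
                           data.type = "Gene expression quantification",
                           platform = "Illumina HiSeq",
                           file.type = "normalized_results",
                           experimental.strategy = "RNA-Seq",
                           barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                           legacy = TRUE)
# Show the files matched by the query
datatable(getResults(query.exp.hg19),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
          rownames = FALSE)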
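III. microRNA data
The miRNA data record the output of miRNA quantification. Reads are first aligned with BWA and then annotated against miRBase v21 and UCSC. Because only miRNAs already present in miRBase can be annotated, this pipeline cannot be used to discover novel miRNAs.
miRNA Expression Quantification
Raw read counts are recorded in the ==mirnas.quantification.txt== file. Reads that map to more than one miRNA are flagged in the cross-mapped column. The file associates each miRNA ID with a read count and a normalized count in reads per million miRNA-mapped reads.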
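The normalized column is a simple rescaling of the raw counts. A minimal base-R sketch on made-up counts; the column names are modeled on the quantification file but should be treated as assumptions, and the three miRNAs stand in for the full table.

```r
# Hypothetical raw read counts for three miRNAs in one sample.
mirna <- data.frame(
  miRNA_ID = c("hsa-let-7a-1", "hsa-mir-21", "hsa-mir-155"),
  read_count = c(2500, 7000, 500),
  stringsAsFactors = FALSE
)

# Reads per million miRNA-mapped reads: scale each count by the
# total number of miRNA-mapped reads in the sample.
total_mapped <- sum(mirna$read_count)
mirna$reads_per_million_miRNA_mapped <- mirna$read_count / total_mapped * 1e6
print(mirna)
```

By construction the normalized column sums to one million within a sample, which makes samples with different sequencing depth comparable.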
Isoform Expression Quantification
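RPM counts are recorded in the ==isoforms.quantification.txt== file. It contains all the columns of the miRNA expression quantification file, plus the genomic coordinates of each isoform and the miRNA region it belongs to (precursor or mature, with accession).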
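IV. Copy number data
These data are generated with the Affymetrix SNP 6.0 array from TCGA level 2 data. The final output is a txt file with 5 columns: segment name, chromosome, genomic position, number of probes on the array, and segment_mean.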
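The segment mean is easier to read once converted back to an approximate copy number. A minimal base-R sketch, assuming the segment mean is log2(copy number / 2), which is how the GDC documentation describes it. The segment coordinates, probe counts, and the ±0.3 gain/loss threshold are hypothetical choices for illustration only.

```r
# Hypothetical segments for one sample.
segments <- data.frame(
  Chromosome = c("1", "8", "17"),
  Start = c(1000000, 2000000, 3000000),
  End = c(1500000, 2600000, 3400000),
  Num_Probes = c(120, 340, 95),
  Segment_Mean = c(0, 1, -1)
)

# Assumed relation: Segment_Mean = log2(copy_number / 2).
segments$copy_number <- 2 * 2^segments$Segment_Mean

# Illustrative threshold; real analyses tune this per study.
threshold <- 0.3
segments$call <- ifelse(segments$Segment_Mean > threshold, "gain",
                 ifelse(segments$Segment_Mean < -threshold, "loss", "neutral"))
print(segments)
```

A segment mean of 0 corresponds to the diploid state (two copies), 1 to four copies, and -1 to one copy.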
library(TCGAbiolinks)
library(DT)
# Download copy number segment data
query <- GDCquery(project = "TCGA-ACC",
data.category = "Copy Number Variation",
data.type = "Copy Number Segment",
barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01"))
GDCdownload(query)
data <- GDCprepare(query)
datatable(data)
V. Methylation data
The following platforms are included:
- Illumina Human Methylation 450
- Illumina Human Methylation 27
- Illumina DNA Methylation OMA003 CPI
- Illumina DNA Methylation OMA002 CPI
- Illumina Hi Seq
The files contain the following columns:
Column | Description |
---|---|
Composite Element | A unique ID for the array probe associated with a CpG site |
Beta Value | Represents the ratio between the methylated array intensity and total array intensity, falls between 0 (lower levels of methylation) and 1 (higher levels of methylation) |
Chromosome | The chromosome in which the probe binding site is located |
Start | The start of the CpG site on the chromosome |
End | The end of the CpG site on the chromosome |
Gene Symbol | The symbol for genes associated with the CpG site. Genes that fall within 1,500 bp upstream of the transcription start site (TSS) to the end of the gene body are used. |
Gene Type | A general classification for each gene (e.g. protein coding, miRNA, pseudogene) |
Transcript ID | Ensembl transcript IDs for each transcript associated with the genes detailed above |
Position to TSS | Distance in base pairs from the CpG site to each associated transcript's start site |
CGI Coordinate | The start and end coordinates of the CpG island associated with the CpG site |
Feature Type | The position of the CpG site in reference to the island: Island, N_Shore or S_Shore (0-2 kb upstream or downstream from CGI), or N_Shelf or S_Shelf (2-4 kb upstream or downstream from CGI) |
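The Beta Value definition in the table can be written out directly. A minimal base-R sketch on made-up probe intensities; the probe names and values are hypothetical, and it follows the plain ratio given in the table (some array pipelines add a small offset to the denominator, which is not modeled here).

```r
# Beta value: methylated intensity divided by total intensity.
beta_value <- function(methylated, unmethylated) {
  methylated / (methylated + unmethylated)
}

# Hypothetical probe intensities.
probes <- data.frame(
  probe = c("probe_1", "probe_2", "probe_3"),
  methylated = c(900, 500, 100),
  unmethylated = c(100, 500, 900),
  stringsAsFactors = FALSE
)

probes$beta <- beta_value(probes$methylated, probes$unmethylated)
print(probes)
```

Values near 1 indicate a mostly methylated site and values near 0 a mostly unmethylated one.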
# Download methylation data
query_met.hg38 <- GDCquery(project= "TCGA-LGG",
data.category = "DNA Methylation",
platform = "Illumina Human Methylation 450",
barcode = c("TCGA-HT-8111-01A-11D-2399-05","TCGA-HT-A5R5-01A-11D-A28N-05"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38)
library(SummarizedExperiment)
datatable(as.data.frame(colData(data.hg38)))
datatable(assay(data.hg38)[1:10,])
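Data levels
Data level | Level type | Description |
---|---|---|
1 | Raw data (BAM files) | Low-level, unnormalized data for a single sample |
2 | Processed data | Normalized data for a single sample |
3 | Segmented or interpreted data | Processed data from a single sample aggregated so that the probed loci form larger contiguous regions |
4 | Regions of interest or summary | Quantified associations across samples, based on two or more data types: molecular abnormalities, sample characteristics, and clinical variables |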
Sample type codes
Code | Short letter code | Definition |
---|---|---|
01 | TP | Primary Solid Tumor |
02 | TR | Recurrent Solid Tumor |
03 | TB | Primary Blood Derived Cancer - Peripheral Blood |
04 | TRBM | Recurrent Blood Derived Cancer - Bone Marrow |
05 | TAP | Additional - New Primary |
06 | TM | Metastatic |
07 | TAM | Additional Metastatic |
08 | THOC | Human Tumor Original Cells |
09 | TBM | Primary Blood Derived Cancer - Bone Marrow |
10 | NB | Blood Derived Normal |
11 | NT | Solid Tissue Normal |
12 | NBC | Buccal Cell Normal |
13 | NEBV | EBV Immortalized Normal |
14 | NBM | Bone Marrow Normal |
20 | CELLC | Control Analyte |
40 | TRB | Recurrent Blood Derived Cancer - Peripheral Blood |
50 | CELL | Cell Lines |
60 | XP | Primary Xenograft Tissue |
61 | XCL | Cell Line Derived Xenograft Tissue |
Filtering samples
library(TCGAbiolinks)
bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",
"TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
"TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
"TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
"TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
"TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
"TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
"TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
"TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
"TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
"TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
"TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
"TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
"TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
"TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
"TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
"TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")
# Select TP (Primary Solid Tumor) samples
TCGAquery_SampleTypes(bar,"TP")
[1] "TCGA-G9-6336-01A-11R-1789-07" "TCGA-AO-A0J5-01A-11R-1789-07" "TCGA-G9-6380-01A-11R-1789-07"
[4] "TCGA-G9-6340-01A-11R-1789-07"
# Select NB (Blood Derived Normal) samples
TCGAquery_SampleTypes(bar,"NB")
[1] "TCGA-G9-7036-10A-11R-1789-07" "TCGA-BH-A1ES-10A-11R-1789-07" "TCGA-BH-A1F0-10A-11R-1789-07"
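The filter works because the sample type code sits at a fixed position in the barcode (characters 14 and 15). A minimal base-R sketch of the same idea, not the actual TCGAbiolinks implementation: the function name `filter_by_sample_type` and the partial lookup table are illustrative, and the barcodes are a subset of `bar` above.

```r
# Partial lookup from short letter code to two-digit sample type code
# (taken from the sample type table above).
sample_type_codes <- c(TP = "01", TR = "02", TB = "03", NB = "10", NT = "11")

# Keep the barcodes whose characters 14-15 match the requested type.
filter_by_sample_type <- function(barcodes, short_code) {
  code <- sample_type_codes[[short_code]]
  barcodes[substr(barcodes, 14, 15) == code]
}

bar_small <- c("TCGA-G9-6336-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
               "TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07")

tp_samples <- filter_by_sample_type(bar_small, "TP")
nb_samples <- filter_by_sample_type(bar_small, "NB")
print(tp_samples)
print(nb_samples)
```

Extending `sample_type_codes` with the remaining rows of the table covers the other sample types in the same way.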