[R] TCGAbiolinks package: data preparation -- query, dow

2021-08-19 小贝学生信

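TCGAbiolinks is an R package that provides a one-stop workflow for TCGA data: it integrates downloading, analysis, and visualization. This series of notes follows the TCGAbiolinks documentation to relearn the TCGA data-mining workflow.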

Official documentation: https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Reference: TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data https://pubmed.ncbi.nlm.nih.gov/26704973/

I. Finding the TCGA data of interest

GDCquery()

GDCquery(
  project,
  data.category,
  data.type,
  workflow.type,
  legacy = FALSE,
  access,
  platform,
  file.type,
  barcode,
  data.format,
  experimental.strategy,
  sample.type
)

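1. Configurable parameters

1.1 By tumor type

The project parameter specifies one or more TCGA project names of interest.
As the code below shows, there are 33 TCGA cancer types in total.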

# getGDCprojects() is exported, so `::` is enough
projects <- TCGAbiolinks::getGDCprojects()$project_id
TCGAs <- grep("TCGA", projects, value = TRUE)
sort(TCGAs)
# [1] "TCGA-ACC"  "TCGA-BLCA" "TCGA-BRCA" "TCGA-CESC" "TCGA-CHOL" "TCGA-COAD"
# [7] "TCGA-DLBC" "TCGA-ESCA" "TCGA-GBM"  "TCGA-HNSC" "TCGA-KICH" "TCGA-KIRC"
# [13] "TCGA-KIRP" "TCGA-LAML" "TCGA-LGG"  "TCGA-LIHC" "TCGA-LUAD" "TCGA-LUSC"
# [19] "TCGA-MESO" "TCGA-OV"   "TCGA-PAAD" "TCGA-PCPG" "TCGA-PRAD" "TCGA-READ"
# [25] "TCGA-SARC" "TCGA-SKCM" "TCGA-STAD" "TCGA-TGCT" "TCGA-THCA" "TCGA-THYM"
# [31] "TCGA-UCEC" "TCGA-UCS"  "TCGA-UVM"

Study Abbreviation	Study Name	Chinese name
ACC	Adrenocortical carcinoma	肾上腺皮质癌
BLCA	Bladder Urothelial Carcinoma	膀胱尿路上皮癌
BRCA	Breast invasive carcinoma	浸润性乳腺癌
CESC	Cervical squamous cell carcinoma and endocervical adenocarcinoma	宫颈鳞状细胞癌和宫颈内腺癌
CHOL	Cholangiocarcinoma	胆管癌
COAD	Colon adenocarcinoma	结肠腺癌
DLBC	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma	淋巴样肿瘤弥漫大B细胞淋巴瘤
ESCA	Esophageal carcinoma	食管癌
GBM	Glioblastoma multiforme	多形性成胶质细胞瘤
HNSC	Head and Neck squamous cell carcinoma	头颈部鳞状细胞癌
KICH	Kidney Chromophobe	肾嫌色细胞癌
KIRC	Kidney renal clear cell carcinoma	肾透明细胞癌
KIRP	Kidney renal papillary cell carcinoma	肾乳头状细胞癌
LAML	Acute Myeloid Leukemia	急性髓系白血病
LGG	Brain Lower Grade Glioma	脑低级别胶质瘤
LIHC	Liver hepatocellular carcinoma	肝脏肝细胞癌
LUAD	Lung adenocarcinoma	肺腺癌
LUSC	Lung squamous cell carcinoma	肺鳞癌
MESO	Mesothelioma	间皮瘤
OV	Ovarian serous cystadenocarcinoma	卵巢浆液性囊腺癌
PAAD	Pancreatic adenocarcinoma	胰腺腺癌
PCPG	Pheochromocytoma and Paraganglioma	嗜铬细胞瘤和副神经节瘤
PRAD	Prostate adenocarcinoma	前列腺腺癌
READ	Rectum adenocarcinoma	直肠腺癌
SARC	Sarcoma	肉瘤
SKCM	Skin Cutaneous Melanoma	皮肤黑色素瘤
STAD	Stomach adenocarcinoma	胃腺癌
TGCT	Testicular Germ Cell Tumors	睾丸生殖细胞肿瘤
THCA	Thyroid carcinoma	甲状腺癌
THYM	Thymoma	胸腺瘤
UCEC	Uterine Corpus Endometrial Carcinoma	子宫内膜癌
UCS	Uterine Carcinosarcoma	子宫癌肉瘤
UVM	Uveal Melanoma	葡萄膜黑色素瘤

1.2 hg19/hg38

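Depending on the reference genome, there are two sets of data: the GDC Legacy Archive [mainly GRCh37 (hg19)] and the GDC harmonized database [GRCh38 (hg38)].
The legacy parameter selects between them: the default, FALSE, queries the harmonized database (hg38), while TRUE queries the Legacy Archive (hg19).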

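1.3 Data types to download

On top of the parameters above, the following parameters describe the target data type.

data.category = which category of data to download, e.g. omics data, clinical data, ...

# List the data categories available for a given tumor type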
TCGAbiolinks:::getProjectSummary("TCGA-BRCA")$data_categories
#   file_count case_count               data_category
# 1       4679       1098            Sequencing Reads
# 2       1183       1098                    Clinical
# 3       6627       1098       Copy Number Variation
# 4       5315       1098                 Biospecimen
# 5       1234       1095             DNA Methylation
# 6       6080       1097     Transcriptome Profiling
# 7       8648       1044 Simple Nucleotide Variation

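data.type = a finer-grained data type (optional)
workflow.type = the processing pipeline; the same sequencing data may have been processed by different pipelines (optional, harmonized database only)
platform = the sequencing platform (optional)
file.type = the specific kind of data file (optional, Legacy Archive only)
If you do not know these values for your target data, refer to the overview tables below.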

GDC harmonized database

Data.category	Data.type	Workflow.Type	Platform
Transcriptome Profiling	Gene Expression Quantification	HTSeq - Counts
Transcriptome Profiling	Gene Expression Quantification	HTSeq - FPKM
Transcriptome Profiling	Gene Expression Quantification	HTSeq - FPKM-UQ
Transcriptome Profiling	Gene Expression Quantification	STAR - Counts
Transcriptome Profiling	Isoform Expression Quantification	-
Transcriptome Profiling	miRNA Expression Quantification	-
Transcriptome Profiling	Splice Junction Quantification
Copy Number Variation	Copy Number Segment
Copy Number Variation	Masked Copy Number Segment
Copy Number Variation	Gene Level Copy Number Scores
Simple Nucleotide Variation	Masked Somatic Mutation	MuSE Variant Aggregation and Masking
Simple Nucleotide Variation	Masked Somatic Mutation	MuTect2 Variant Aggregation and Masking
Simple Nucleotide Variation	Masked Somatic Mutation	SomaticSniper Variant Aggregation and Masking
Simple Nucleotide Variation	Masked Somatic Mutation	VarScan2 Variant Aggregation and Masking
Raw Sequencing Data	-
Biospecimen	Slide Image
Biospecimen	Biospecimen Supplement
Clinical	-
DNA Methylation	Methylation Beta Value		Illumina Human Methylation 450
DNA Methylation	Methylation Beta Value		Illumina Human Methylation 27

GDC Legacy Archive

Data.category	Data.type	Platform	file.type
Copy number variation	-	Affymetrix SNP Array 6.0	nocnv_hg18.seg
Copy number variation	-	Affymetrix SNP Array 6.0	hg18.seg
Copy number variation	-	Affymetrix SNP Array 6.0	nocnv_hg19.seg
Copy number variation	-	Affymetrix SNP Array 6.0	hg19.seg
Copy number variation	-	Illumina HiSeq	-
Simple nucleotide variation	Simple somatic mutation
Raw sequencing data
Biospecimen
Clinical
Protein expression		MDA RPPA Core	-
Gene expression	Gene expression quantification	Illumina HiSeq	normalized_results
Gene expression	Gene expression quantification	Illumina HiSeq	results
Gene expression	Gene expression quantification	HT_HG-U133A	-
Gene expression	Gene expression quantification	AgilentG4502A_07_2	-
Gene expression	Gene expression quantification	AgilentG4502A_07_1	-
Gene expression	Gene expression quantification	HuEx-1_0-st-v2	FIRMA.txt
Gene expression	Gene expression quantification		gene.txt
Gene expression	Isoform expression quantification	-	-
Gene expression	miRNA gene quantification	-	hg19.mirna
Gene expression	miRNA gene quantification		hg19.mirbase20
Gene expression	miRNA gene quantification		mirna
Gene expression	Exon junction quantification	-	-
Gene expression	Exon quantification	-	-
Gene expression	miRNA isoform quantification	-	hg19.isoform
Gene expression	miRNA isoform quantification	-	isoform
DNA methylation		Illumina Human Methylation 450	Not used
DNA methylation		Illumina Human Methylation 27	Not used
DNA methylation		Illumina DNA Methylation OMA003 CPI	Not used
DNA methylation		Illumina DNA Methylation OMA002 CPI	Not used
DNA methylation		Illumina Hi Seq
DNA methylation	Bisulfite sequence alignment
DNA methylation	Methylation percentage
DNA methylation	Aligned reads
Raw microarray data	Raw intensities	Illumina Human Methylation 450	idat
Raw microarray data	Raw intensities	Illumina Human Methylation 27	idat
Structural Rearrangement
Other

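1.4 Sample barcode

A full barcode looks like TCGA-G4-6317-02A-11D-2064-05. It encodes everything from the patient of origin through the sequencing and analysis steps. In the barcode layout, the most important parts are Participant, Sample, and Portion, which give the patient ID, the sample type, and the analyte type (for example D for DNA, R for RNA), respectively.
Patient ID: e.g. TCGA-G4-6317
Sample ID: e.g. TCGA-G4-6317-02

The most important part is the two-digit Sample code, which gives the sample type and is the basis for grouping in later differential analysis. For example, 01 denotes the patient's primary solid tumor and 11 denotes solid normal tissue from the patient.
Building on this, the sample.type parameter restricts a query to the sample types of interest, e.g. sample.type = "Primary Tumor".
Given a set of TCGA barcodes, TCGAquery_SampleTypes() extracts the samples of a target group, and TCGAquery_MatchedCoupledSampleTypes() extracts paired samples from the same patient.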

query <- GDCquery(project = c("TCGA-BRCA"),
                  legacy = FALSE, #default(GDC harmonized database)
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")
dim(getResults(query))
#[1] 1222   29
query_info = getResults(query)
TP = TCGAquery_SampleTypes(query_info$sample.submitter_id,"TP")
NT = TCGAquery_SampleTypes(query_info$sample.submitter_id,"NT")
query <- GDCquery(project = c("TCGA-BRCA"),
                  legacy = FALSE, #default(GDC harmonized database)
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  barcode = c(TP, NT))
dim(getResults(query))
#[1] 1215   29

Pair_sample = TCGAquery_MatchedCoupledSampleTypes(query_info$sample.submitter_id,c("NT","TP"))
query <- GDCquery(project = c("TCGA-BRCA"),
                  legacy = FALSE, #default(GDC harmonized database)
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  barcode = Pair_sample)
dim(getResults(query))
#[1] 229  29
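
To make the barcode fields concrete, here is a minimal base R sketch that splits a barcode and maps the two-digit Sample code to a short letter code. The helper names parse_barcode() and short_letter_code() are hypothetical and not part of TCGAbiolinks, and the lookup table is deliberately partial (it covers only 01, 02, and 11).

```r
# Split a full TCGA barcode into its dash-separated fields (base R only).
parse_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  list(
    patient     = paste(parts[1:3], collapse = "-"),  # e.g. TCGA-G4-6317
    sample_code = substr(parts[4], 1, 2),             # two-digit sample type
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3)              # D = DNA, R = RNA
  )
}

# Partial lookup table (assumption: only the codes needed for this post).
sample_type_codes <- c("01" = "TP", "02" = "TR", "11" = "NT")

# Map a vector of barcodes to short letter codes such as TP / NT.
short_letter_code <- function(barcodes) {
  codes <- vapply(barcodes, function(b) parse_barcode(b)$sample_code, character(1))
  unname(sample_type_codes[codes])
}

info <- parse_barcode("TCGA-G4-6317-02A-11D-2064-05")
info$patient
info$sample_code

short_letter_code(c("TCGA-3X-AAV9-01A-72R-A41I-07",
                    "TCGA-W5-AA2R-11A-11R-A41I-07"))
```

In practice TCGAquery_SampleTypes() does this selection for you; the sketch only shows where the grouping information sits inside the barcode.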

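These are the most common criteria for querying TCGA data. A few parameters were not covered; see the function's help page for them. Set the parameters above flexibly according to your own goal.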

2. Query examples

2.1 Cholangiocarcinoma transcriptome data | hg19 | all samples

TCGAbiolinks:::getProjectSummary("TCGA-CHOL",legacy = TRUE)$data_categories
#   file_count case_count               data_category
# 1         30         30          Protein expression
# 2        680         36       Copy number variation
# 3         51         51                 Biospecimen
# 4        444         36 Simple nucleotide variation
# 5        450         36             Gene expression
# 6        686         36         Raw microarray data
# 7         45         36             DNA methylation
# 8        193         51                    Clinical
# 9        365         51         Raw sequencing data
query <- GDCquery(project = "TCGA-CHOL",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results")
dim(getResults(query))
#[1] 45 32
t(getResults(query)[1,])
#                       1                                                                                   
# id                    "34216957-50e3-434c-8c38-72f0f2ddcf16"                                              
# data_format           "TXT"                                                                               
# access                "open"                                                                              
# cases                 "TCGA-3X-AAV9-01A-72R-A41I-07"                                                      
# file_name             "unc.edu.59012a78-0e8f-4b99-af97-0dbb1d3d0513.2538862.rsem.genes.normalized_results"
# submitter_id          NA                                                                                  
# data_category         "Gene expression"                                                                   
# type                  "file"                                                                              
# file_size             437196                                                                              
# platform              "Illumina HiSeq"                                                                    
# state_comment         NA                                                                                  
# tags                  character,3                                                                         
# updated_datetime      "2017-03-05T10:11:44.298823-06:00"                                                  
# md5sum                "23836c9f9bdb053c567d91a67b62159d"                                                  
# file_id               "34216957-50e3-434c-8c38-72f0f2ddcf16"                                              
# data_type             "Gene expression quantification"                                                    
# state                 "live"                                                                              
# experimental_strategy "RNA-Seq"                                                                           
# file_state            "submitted"                                                                         
# version               "1"                                                                                 
# data_release          "0.0 - 29.0"                                                                        
# project               "TCGA-CHOL"                                                                         
# center_id             "ee7a85b3-8177-5d60-a10c-51180eb9009c"                                              
# center_center_type    "CGCC"                                                                              
# center_code           "07"                                                                                
# center_name           "University of North Carolina"                                                      
# center_namespace      "unc.edu"                                                                           
# center_short_name     "UNC"                                                                               
# sample_type           "Primary Tumor"                                                                     
# is_ffpe               FALSE                                                                               
# cases.submitter_id    "TCGA-3X-AAV9"                                                                      
# sample.submitter_id   "TCGA-3X-AAV9-01A"

2.2 Lung adenocarcinoma transcriptome data | hg38 | primary tumor + normal tissue

TCGAbiolinks:::getProjectSummary("TCGA-LUAD",legacy = FALSE)$data_categories
# 4       2916        519     Transcriptome Profiling
query <- GDCquery(project = "TCGA-LUAD",
                  legacy = FALSE,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")
dim(getResults(query))
#[1] 594  29

2.3 Breast cancer methylation data | hg19 | Illumina Human Methylation 450 platform

TCGAbiolinks:::getProjectSummary("TCGA-BRCA",legacy = TRUE)$data_categories
#7       1250       1097             DNA methylation
query <- GDCquery(project = "TCGA-BRCA",
                  legacy = TRUE,
                  data.category = "DNA methylation",
                  platform = "Illumina Human Methylation 450")
dim(getResults(query))
#[1] 895  32
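
The hg19 and hg38 expression queries above differ only in a handful of arguments. The sketch below collects them in one place; expression_query_args() is a hypothetical helper, and it assumes a TCGAbiolinks version that still accepts the legacy argument, as the one used in this post does. It only builds the argument list, so it runs without network access.

```r
# Hypothetical helper: the GDCquery() arguments used in this post for
# RNA-seq gene expression, chosen by reference genome build.
expression_query_args <- function(project, build = c("hg38", "hg19")) {
  build <- match.arg(build)
  if (build == "hg38") {
    list(project = project,
         legacy = FALSE,                      # harmonized database
         data.category = "Transcriptome Profiling",
         data.type = "Gene Expression Quantification",
         workflow.type = "HTSeq - Counts")
  } else {
    list(project = project,
         legacy = TRUE,                       # Legacy Archive
         data.category = "Gene expression",
         data.type = "Gene expression quantification",
         platform = "Illumina HiSeq",
         file.type = "normalized_results")
  }
}

args_hg19 <- expression_query_args("TCGA-CHOL", build = "hg19")
args_hg38 <- expression_query_args("TCGA-LUAD")
str(args_hg19)

# With TCGAbiolinks installed and network access, the query would be run as:
# query <- do.call(TCGAbiolinks::GDCquery, args_hg19)
```

Note that the category and type names differ in both wording and capitalization between the two databases, which is an easy source of empty query results.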

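II. Downloading the data for the chosen query

GDCdownload() is simple to use: pass it the query obtained in the previous step.
It offers two download methods, api and client. The former is faster but sometimes unstable; the latter is slower. The api method (the default) is recommended. For large downloads, set files.per.chunk = n to download in batches of n files, so that an error midway does not waste everything already downloaded.
directory is the folder to download into; by default a GDCdata folder is created and used.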

GDCdownload(
  query,
  token.file,
  method = "api",
  directory = "GDCdata",
  files.per.chunk = NULL
)

Example

query <- GDCquery(project = "TCGA-CHOL",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
# Downloading data for project TCGA-CHOL
# GDCdownload will download 45 files. A total of 19.580796 MB
# Downloading chunk 1 of 5 (10 files, size = 4.351703 MB) as Wed_Aug_18_21_52_08_2021_0.tar.gz
# Downloading: 1.9 MB     Downloading chunk 2 of 5 (10 files, size = 4.350318 MB) as Wed_Aug_18_21_52_08_2021_1.tar.gz
# Downloading: 1.8 MB     Downloading chunk 3 of 5 (10 files, size = 4.351067 MB) as Wed_Aug_18_21_52_08_2021_2.tar.gz
# Downloading: 1.8 MB     Downloading chunk 4 of 5 (10 files, size = 4.353528 MB) as Wed_Aug_18_21_52_08_2021_3.tar.gz
# Downloading: 1.9 MB     Downloading chunk 5 of 5 (5 files, size = 2.17418 MB) as Wed_Aug_18_21_52_08_2021_4.tar.gz
# Downloading: 900 kB
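
The batch sizes in the log above follow directly from the file count and files.per.chunk. The following base R sketch reproduces that arithmetic; chunk_index() is a hypothetical helper for illustration, not the code TCGAbiolinks runs internally.

```r
# Assign file indices 1..n_files to consecutive batches of at most
# files_per_chunk files each.
chunk_index <- function(n_files, files_per_chunk) {
  split(seq_len(n_files), ceiling(seq_len(n_files) / files_per_chunk))
}

chunks <- chunk_index(45, 10)
length(chunks)    # number of batches
lengths(chunks)   # files in each batch
```

For 45 files and a batch size of 10 this gives four full batches and a final batch of 5, matching the "chunk 5 of 5 (5 files ...)" line in the log.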

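III. Reading the downloaded files into the current session

GDCprepare() takes the query object and the directory where the data were downloaded (GDCdata by default) and reads the data in, returning a SummarizedExperiment object.
You can also set save = TRUE and save.filename = <file name> so that the SummarizedExperiment object is saved as an RData file after reading, for convenient reuse later (save defaults to FALSE).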

query <- GDCquery(project = "TCGA-CHOL",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
data <- GDCprepare(query, save = TRUE, save.filename = "CHOL_RNAseq.rda")
# -------------------
#   oo Reading 45 files
# -------------------
#   |=================================================|100%                      Completed after 0 s 
# -------------------
#   oo Merging 45 files
# -------------------
#   Starting to add information to samples
# => Add clinical information to samples
# => Adding TCGA molecular information from marker papers
# => Information will have prefix 'paper_' 
# chol subtype information from:doi:10.1016/j.celrep.2017.02.033
# => Saving file: CHOL_RNAseq.rda
# => File saved

While reading the data, GDCprepare() automatically annotates the samples and the genes. This is not yet supported for every data type, however.

library(SummarizedExperiment)
# Expression matrix
dim(assay(data))
#[1] 19947    45
assays(data)
# List of length 1
# names(1): normalized_count
assay(data, "normalized_count")[1:4,1:4]
#       TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R-A41I-07
# A1BG                      70.9581                      29.9768                  108409.2249                    1485.0630
# A2M                    23986.2548                    8129.6961                   98095.2358                    7119.1570
# NAT1                      72.4007                      52.8682                     160.2275                      76.5504
# NAT2                       8.7099                       0.0000                    1472.3868                      23.2558

# Sample (clinical) information
dim(colData(data))
#[1]  45 205
colData(data)[1:4,1:4]
# DataFrame with 4 rows and 4 columns
#                                         barcode      patient           sample shortLetterCode
#                                         <character>  <character>      <character>     <character>
# TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAV9-01A-72R.. TCGA-3X-AAV9 TCGA-3X-AAV9-01A              TP
# TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-3X-AAVC-01A-21R.. TCGA-3X-AAVC TCGA-3X-AAVC-01A              TP
# TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-W5-AA2R-11A-11R.. TCGA-W5-AA2R TCGA-W5-AA2R-11A              NT
# TCGA-ZH-A8Y4-01A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R.. TCGA-ZH-A8Y4 TCGA-ZH-A8Y4-01A              TP

# Gene ID types
dim(rowData(data))
#[1] 19947     3
rowData(data)[1:6,1:3]
# DataFrame with 6 rows and 3 columns
#                   gene_id entrezgene ensembl_gene_id
#                   <character>  <integer>     <character>
# A1BG                 A1BG          1 ENSG00000121410
# A2M                   A2M          2 ENSG00000175899
# NAT1                 NAT1          9 ENSG00000171428
# NAT2                 NAT2         10 ENSG00000156006
# RP11-986E7.7 RP11-986E7.7         12 ENSG00000273259
# AADAC               AADAC         13 ENSG00000114771


# Genomic coordinates of the genes
rowRanges(data)
# GRanges object with 19947 ranges and 3 metadata columns:
#           seqnames              ranges strand |      gene_id entrezgene ensembl_gene_id
#         <Rle>           <IRanges>  <Rle> |  <character>  <integer>     <character>
# A1BG    chr19   58856544-58864865      - |         A1BG          1 ENSG00000121410
# A2M    chr12     9220260-9268825      - |          A2M          2 ENSG00000175899
# NAT1     chr8   18027986-18081198      + |         NAT1          9 ENSG00000171428
# NAT2     chr8   18248755-18258728      + |         NAT2         10 ENSG00000156006
# RP11-986E7.7    chr14   95058395-95090983      + | RP11-986E7.7         12 ENSG00000273259
# ...      ...                 ...    ... .          ...        ...             ...
# RASAL2-AS1     chr1 178060643-178063119      - |   RASAL2-AS1  100302401 ENSG00000224687
# LINC00882     chr3 106555658-106959488      - |    LINC00882  100302640 ENSG00000242759
# FTX     chrX   73183790-73513409      - |          FTX  100302692 ENSG00000230590
# TICAM2     chr5 114914339-114961876      - |       TICAM2  100302736 ENSG00000243414
# SLC25A5-AS1     chrX 118599997-118603061      - |  SLC25A5-AS1  100303728 ENSG00000224281
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
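
A common next step is to split the expression matrix into tumor and normal groups using the shortLetterCode column. The sketch below uses small toy stand-ins for assay(data) and colData(data), with values copied from the output above, so that it runs without the downloaded files; the variable names expr and samples are assumptions.

```r
# Toy stand-ins for assay(data, "normalized_count") and colData(data).
expr <- matrix(
  c(70.9581, 23986.2548, 29.9768, 8129.6961, 108409.2249, 98095.2358),
  nrow = 2,
  dimnames = list(c("A1BG", "A2M"),
                  c("TCGA-3X-AAV9-01A-72R-A41I-07",
                    "TCGA-3X-AAVC-01A-21R-A41I-07",
                    "TCGA-W5-AA2R-11A-11R-A41I-07"))
)
samples <- data.frame(barcode = colnames(expr),
                      shortLetterCode = c("TP", "TP", "NT"),
                      stringsAsFactors = FALSE)

# Make sure sample metadata and matrix columns are in the same order.
stopifnot(identical(samples$barcode, colnames(expr)))

# drop = FALSE keeps a matrix even when only one sample matches.
tumor  <- expr[, samples$shortLetterCode == "TP", drop = FALSE]
normal <- expr[, samples$shortLetterCode == "NT", drop = FALSE]
dim(tumor)
dim(normal)

# With the real object this would start from:
# expr    <- assay(data, "normalized_count")
# samples <- as.data.frame(colData(data))
```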

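That covers the whole workflow of finding, downloading, and reading the data. Next comes the analysis itself.

Supplement: patients' clinical data and tumor subtypes

1. Getting patients' clinical data

As noted above, GDCprepare() automatically annotates the samples with the patients' clinical information.
The clinical data for each patient can also be downloaded separately in advance for reference.

Method 1: the GDCquery() pipeline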

query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query, files.per.chunk = 20)
clinical.BCRtab.all <- GDCprepare(query)


grep("clinical_", names(clinical.BCRtab.all), value = TRUE)
# [1] "clinical_drug_brca"               "clinical_omf_v4.0_brca"          
# [3] "clinical_follow_up_v4.0_brca"     "clinical_follow_up_v1.5_brca"    
# [5] "clinical_follow_up_v4.0_nte_brca" "clinical_patient_brca"           
# [7] "clinical_radiation_brca"          "clinical_nte_brca"               
# [9] "clinical_follow_up_v2.1_brca" 
clinical_patient_brca = as.data.frame(clinical.BCRtab.all$clinical_patient_brca)
clinical_patient_brca[1:4,1:4]
#                       bcr_patient_uuid bcr_patient_barcode form_completion_date                  prospective_collection
# 1                     bcr_patient_uuid bcr_patient_barcode form_completion_date tissue_prospective_collection_indicator
# 2                              CDE_ID:      CDE_ID:2003301              CDE_ID:                          CDE_ID:3088492
# 3 6E7D5EC6-A469-467C-B748-237353C23416        TCGA-3C-AAAU            2014-1-13                                      NO
# 4 55262FCB-1B01-4480-B322-36570430C917        TCGA-3C-AALI            2014-7-28                                      NO
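
As the output above shows, the first two rows of a BCR Biotab table are annotation rows (a repeated header and the CDE_ID line) rather than patients. A minimal sketch for dropping them, using a toy data frame shaped like the output above (the name raw is an assumption):

```r
# Toy table with the same leading rows as clinical_patient_brca above.
raw <- data.frame(
  bcr_patient_barcode  = c("bcr_patient_barcode", "CDE_ID:2003301",
                           "TCGA-3C-AAAU", "TCGA-3C-AALI"),
  form_completion_date = c("form_completion_date", "CDE_ID:",
                           "2014-1-13", "2014-7-28"),
  stringsAsFactors = FALSE
)

# Keep only rows whose barcode column holds a real TCGA patient barcode.
clean <- raw[grepl("^TCGA-", raw$bcr_patient_barcode), , drop = FALSE]
rownames(clean) <- NULL
clean
```

Filtering on the barcode pattern is used here instead of dropping rows 1 and 2 by position, so the step still behaves if a table happens to carry a different number of annotation rows.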

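Method 2: GDCquery_clinic()

According to the official documentation, this function downloads the indexed clinical data: "a refined clinical data that is created using the XML files" (the source that Method 1 draws on).
This method downloads faster, so try it first. If it lacks the information you need, fall back to Method 1.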

clinical <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
clinical[1:4,1:4]
#   submitter_id synchronous_malignancy ajcc_pathologic_stage tumor_stage
# 1 TCGA-E2-A14U                     No               Stage I     stage i
# 2 TCGA-E9-A1RC                     No            Stage IIIC  stage iiic
# 3 TCGA-D8-A1J9                     No              Stage IA    stage ia
# 4 TCGA-E2-A14P                     No            Stage IIIC  stage iiic

2. Getting patients' tumor subtypes

PanCancerAtlas_subtypes()
The column "Subtype_Selected" holds the most prominent subtype classification (selected from the other columns).

subtypes <- PanCancerAtlas_subtypes()
dim(subtypes)
#[1] 7734   10
table(subtypes$cancer.type)
# ACC  AML BLCA BRCA COAD ESCA  GBM HNSC KICH KIRC KIRP  LGG LIHC LUAD LUSC OVCA PCPG 
# 91  187  129 1218  341  169  606  279   66  442  161  516  196  230  178  489  178 
# PRAD READ SKCM STAD THCA UCEC  UCS 
# 333  118  333  383  496  538   57
head(as.data.frame(subtypes))
#   pan.samplesID cancer.type                         Subtype_mRNA   Subtype_DNAmeth Subtype_protein Subtype_miRNA Subtype_CNA Subtype_Integrative Subtype_other      Subtype_Selected
# 1  TCGA-OR-A5J1         ACC steroid-phenotype-high+proliferation         CIMP-high              NA       miRNA_1       Quiet                COC3           C1A         ACC.CIMP-high
# 2  TCGA-OR-A5J2         ACC steroid-phenotype-high+proliferation          CIMP-low               1       miRNA_1       Noisy                COC3           C1A          ACC.CIMP-low
# 3  TCGA-OR-A5J3         ACC               steroid-phenotype-high CIMP-intermediate               3       miRNA_6 Chromosomal                COC2           C1A ACC.CIMP-intermediate
# 4  TCGA-OR-A5J4         ACC                                 <NA>         CIMP-high              NA       miRNA_6 Chromosomal                <NA>          <NA>         ACC.CIMP-high
# 5  TCGA-OR-A5J5         ACC               steroid-phenotype-high CIMP-intermediate              NA       miRNA_2 Chromosomal                COC2           C1A ACC.CIMP-intermediate
# 6  TCGA-OR-A5J6         ACC                steroid-phenotype-low          CIMP-low               2       miRNA_1       Noisy                COC1           C1B          ACC.CIMP-low
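
To combine subtypes with a clinical table, the two need a shared patient-level key. The sketch below uses toy data frames with the column names shown in the outputs above. It assumes that pan.samplesID may hold either a 12-character patient barcode or a longer sample barcode, so it truncates to the first 12 characters before merging. The long barcode and the "BRCA.LumA" label are invented for illustration only.

```r
# Toy clinical table (column names as returned by GDCquery_clinic above).
clinical_toy <- data.frame(
  submitter_id = c("TCGA-E2-A14U", "TCGA-E9-A1RC"),
  ajcc_pathologic_stage = c("Stage I", "Stage IIIC"),
  stringsAsFactors = FALSE
)

# Toy subtype table; the first ID and its subtype label are made up.
subtypes_toy <- data.frame(
  pan.samplesID = c("TCGA-E2-A14U-01A-11R-A115-07", "TCGA-OR-A5J1"),
  cancer.type = c("BRCA", "ACC"),
  Subtype_Selected = c("BRCA.LumA", "ACC.CIMP-high"),
  stringsAsFactors = FALSE
)

# Patient-level key: the first 12 characters of any TCGA barcode.
subtypes_toy$patient <- substr(subtypes_toy$pan.samplesID, 1, 12)

# Inner join: only patients present in both tables are kept.
merged <- merge(clinical_toy, subtypes_toy,
                by.x = "submitter_id", by.y = "patient")
merged[, c("submitter_id", "ajcc_pathologic_stage", "Subtype_Selected")]
```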

TCGAquery_subtype()
These subtypes will be automatically added to the SummarizedExperiment object by GDCprepare(), but you can also use the TCGAquery_subtype() function to retrieve this information.

brca.subtype <- TCGAquery_subtype(tumor = "brca")
t(brca.subtype[1,])
#                                     [,1]          
# patient                             "TCGA-3C-AAAU"
# Tumor.Type                          "BRCA"        
# Included_in_previous_marker_papers  "NO"          
# vital_status                        "Alive"       
# days_to_birth                       "-20211"      
# days_to_death                       "NA"          
# days_to_last_followup               "4047"        
# age_at_initial_pathologic_diagnosis "55"          
# pathologic_stage                    "NA"          
# Tumor_Grade                         "NA"          
# BRCA_Pathology                      "NA"          
# BRCA_Subtype_PAM50                  "LumA"        
# MSI_status                          "NA"          
# HPV_Status                          "NA"          
# tobacco_smoking_history             "NA"          
# CNV Clusters                        "C6"          
# Mutation Clusters                   "C7"          
# DNA.Methylation Clusters            "C1"          
# mRNA Clusters                       "C1"          
# miRNA Clusters                      "C3"          
# lncRNA Clusters                     "NA"          
# Protein Clusters                    "NA"          
# PARADIGM Clusters                   "C5"          
# Pan-Gyn Clusters                    "NA"

GDCquery_Maf() can download mutation data. It is not covered here for now and is left for a later look.