TCGA数据下载分析(1-1):RTCGA包gene获取表达值并
2018-08-06 本文已影响122人
Y大宽
写在前面:
---------------------------------------------------------------------------
1 安装并加载包
# Load the bioconductor installer.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA")
# Install the clinical and mRNA gene expression data packages
biocLite("RTCGA.clinical") ## 14Mb
biocLite('RTCGA.rnaseq') ## (612.6 MB)
biocLite("RTCGA.mRNA") ## (85.0 MB)
biocLite('RTCGA.mutations') ## (103.8 MB)
library(RTCGA.clinical)
library(RTCGA.mRNA)
library(RTCGA.rnaseq)
library(RTCGA.mutations)
> library(RTCGA)
Welcome to the RTCGA (version: 1.8.0).
#查看BRCA的内容
> checkTCGA('DataSets', 'LIHC')
Size Name
1 61K LIHC.Clinical_Pick_Tier1.Level_4.2016012800.0.0.tar.gz
2 932K LIHC.Merge_Clinical.Level_1.2016012800.0.0.tar.gz
3 1.6G LIHC.Merge_methylation__humanmethylation450__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data.Level_3.2016012800.0.0.tar.gz
4 1.5M LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2016012800.0.0.tar.gz
5 22M LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data.Level_3.2016012800.0.0.tar.gz
6 255K LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz
7 78K LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz.bak.20160128
8 83M LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__exon_expression__data.Level_3.2016012800.0.0.tar.gz
9 8.7M LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2016012800.0.0.tar.gz
10 8.0M LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__splice_junction_expression__data.Level_3.2016012800.0.0.tar.gz
11 102M LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.Level_3.2016012800.0.0.tar.gz
12 32M LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz
13 281M LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms__data.Level_3.2016012800.0.0.tar.gz
14 80M LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data.Level_3.2016012800.0.0.tar.gz
15 905M LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__exon_quantification__data.Level_3.2016012800.0.0.tar.gz
16 70M LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__junction_quantification__data.Level_3.2016012800.0.0.tar.gz
17 5.8M LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg18__seg.Level_3.2016012800.0.0.tar.gz
18 5.8M LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
19 1.5M LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg18__seg.Level_3.2016012800.0.0.tar.gz
20 1.5M LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
21 210M LIHC.Methylation_Preprocess.Level_3.2016012800.0.0.tar.gz
22 2.0M LIHC.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz
23 416M LIHC.Mutation_Packager_Coverage.Level_3.2016012800.0.0.tar.gz
24 25M LIHC.Mutation_Packager_Oncotated_Calls.Level_3.2016012800.0.0.tar.gz
25 47M LIHC.Mutation_Packager_Oncotated_Raw_Calls.Level_3.2016012800.0.0.tar.gz
26 3.8M LIHC.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0.tar.gz
27 765M LIHC.Mutation_Packager_Raw_Coverage.Level_3.2016012800.0.0.tar.gz
28 585K LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz
29 275K LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz.bak.20160128
30 312M LIHC.mRNAseq_Preprocess.Level_3.2016012800.0.0.tar.gz
31 3.4M LIHC.miRseq_Mature_Preprocess.Level_3.2016012800.0.0.tar.gz
32 2.8M LIHC.miRseq_Preprocess.Level_3.2016012800.0.0.tar.gz
关于这些数据:
- miRSeq is micro-RNA seq. Micro RNAs are a class of non protein coding RNAs that have regulatory functions (they typically bind to the 3`UTR of coding mRNAs and regulate that way)
- mRNA, in this situation, refers to a cDNA microarray, where pre-designed probes on the microarray surface will bind to known target mRNAs, which may be coding or non-coding
- mRNASeq is what most will loosely refer to as 'RNA-seq'. Most protocols will capture all classes of RNA species that have poly-adenylated (poly(A)) tails. RNAs that don't have these tails include ribsomal RNAs and many enhancer RNAs, but there are always exceptions to these rules.
推荐(建议参考)
miRSeq
- illuminahiseq_mirnaseq-miR_gene_expression - normalised micro-RNAseq counts over each micro-RNA
- illuminahiseq_mirnaseq-miR_isoform_expression - nomalised micro-RNAseq counts over each splice isoform of each micro-RNA
mRNA- agilent4502a_07_3-unc_lowess_normalization_gene_level - LOWESS-normalised cDNA expression values over each gene (data from University of North Carolina)
mRNASeq- illuminahiseq_rnaseq-gene_expression - normalised RNAseq counts over each gene
- illuminahiseq_rnaseq-exon_expression - normalised RNAseq counts over each exon of each gene
> checkTCGA('Dates')
[1] "2011-10-26" "2011-11-15" "2011-11-28" "2011-12-06" "2011-12-30" "2012-01-10" "2012-01-24" "2012-02-17" "2012-03-06" "2012-03-21"
[11] "2012-04-12" "2012-04-25" "2012-05-15" "2012-05-25" "2012-06-06" "2012-06-23" "2012-07-07" "2012-07-25" "2012-08-04" "2012-08-25"
[21] "2012-09-13" "2012-10-04" "2012-10-18" "2012-10-20" "2012-10-24" "2012-11-02" "2012-11-14" "2012-12-06" "2012-12-21" "2013-01-16"
[31] "2013-02-03" "2013-02-22" "2013-03-09" "2013-03-26" "2013-04-06" "2013-04-21" "2013-05-08" "2013-05-23" "2013-06-06" "2013-06-23"
[41] "2013-07-15" "2013-08-09" "2013-09-23" "2013-10-10" "2013-11-14" "2013-12-10" "2014-01-15" "2014-02-15" "2014-03-16" "2014-04-16"
[51] "2014-05-18" "2014-06-14" "2014-07-15" "2014-09-02" "2014-10-17" "2014-12-06" "2015-02-02" "2015-02-04" "2015-04-02" "2015-06-01"
[61] "2015-08-21" "2015-11-01" "2016-01-28"
2 下载某肿瘤中Datasets中的某类数据
可以用前述的checkTCGA查看类型,然后针对性下载,用法如下:
downloadTCGA(cancerTypes, dataSet = "Merge_Clinical.Level_1", destDir, date = NULL, untarFile = TRUE, removeTar = TRUE, allDataSets = FALSE)
- dataSet在
checkTCGA
中查看,可以看出数据类型名字很长,所以这里只要部分(连续)匹配即可 - destDir在当前工作目录下创建一个新文件
setwd("E:/00TCGA/data/LIHC")
#downloading the miRNA DATA
dir.create('miRNA')
downloadTCGA(cancerTypes = 'LIHC',
dataSet = 'miR_gene_expression',
destDir = 'rnaseq',
date = tail(checkTCGA('Dates'), 2)[1])
#downloading the ranseq data
dir.create('rnaseq')
downloadTCGA(cancerTypes = 'LIHC',
dataSet = 'Level_3__gene_expression',
destDir = 'rnaseq',
date = tail(checkTCGA('Dates'), 2)[1])
3 获取某基因在任意癌症中的mRNA表达数据并可视化(ggplot2,ggpubr和boxplotTCGA)
3.0 获取mRNA表达数据
获取EZH2,PTEN,HGFAC,FYN,TIGD1等5个gene在BRCA,OV中的mRNA表达数据(LIHC没有mRNA)
expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))
> expr
# A tibble: 1,305 x 7
bcr_patient_barcode dataset EZH2 PTEN HGFAC FYN TIGD1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.mRNA -1.79 1.36 -2.31 -0.798 -0.86
2 TCGA-A1-A0SE-01A-11R-A084-07 BRCA.mRNA -1.57 0.428 -2.49 -0.532 -1.06
3 TCGA-A1-A0SH-01A-11R-A084-07 BRCA.mRNA -2.49 1.31 -2.65 0.007 -1.06
4 TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.mRNA -2.14 0.810 -2.58 -0.466 -0.714
5 TCGA-A1-A0SK-01A-12R-A084-07 BRCA.mRNA 0.529 0.251 -2.95 0.960 -0.664
6 TCGA-A1-A0SM-01A-11R-A084-07 BRCA.mRNA -1.44 1.31 -3.26 -0.622 -0.490
7 TCGA-A1-A0SO-01A-22R-A084-07 BRCA.mRNA -0.426 -0.237 -2.71 0.208 0.759
8 TCGA-A1-A0SP-01A-11R-A084-07 BRCA.mRNA -0.579 -1.24 -2.45 0.563 0.022
9 TCGA-A2-A04N-01A-11R-A115-07 BRCA.mRNA -1.16 1.21 -2.04 -0.616 -0.287
10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.mRNA -0.894 0.288 -2.31 -0.314 1.33
# ... with 1,295 more rows
3.1 ggplot2绘制指定基因在不同肿瘤中的表达boxplot
expr<-expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))
expr1<-expr[,-1]
expr2<-gather(expr1, key ="mRNA", value="value", -dataset)
ggplot(expr2, aes(y=value,
x=reorder(dataset, value, mean),
fill= dataset))+
geom_boxplot()+
theme_RTCGA()+
scale_fill_brewer(palette = "Set3")+
facet_grid(mRNA~.)+
theme(legend.position = "top")
boxplot如下
5genes in three kinds of cancers.jpeg
3.2 ggpubr绘制指定基因在不同肿瘤中的表达boxplot并进行统计学分析作图
这部分参考jimmy
查看样本量
table(expr$dataset)
> nb_samples
BRCA.mRNA LUSC.mRNA OV.mRNA
590 154 561
把bcr_patient_barcode这列改名,以便下一步可视化作图
expr$dataset <- gsub(pattern = ".mRNA", replacement = "", expr$dataset)
expr$bcr_patient_barcode <- paste0(expr$dataset, c(1:590, 1:561, 1:154))
expr
> expr
# A tibble: 1,305 x 7
bcr_patient_barcode dataset EZH2 PTEN HGFAC FYN TIGD1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BRCA1 BRCA -1.79 1.36 -2.31 -0.798 -0.86
2 BRCA2 BRCA -1.57 0.428 -2.49 -0.532 -1.06
3 BRCA3 BRCA -2.49 1.31 -2.65 0.007 -1.06
4 BRCA4 BRCA -2.14 0.810 -2.58 -0.466 -0.714
5 BRCA5 BRCA 0.529 0.251 -2.95 0.960 -0.664
6 BRCA6 BRCA -1.44 1.31 -3.26 -0.622 -0.490
7 BRCA7 BRCA -0.426 -0.237 -2.71 0.208 0.759
8 BRCA8 BRCA -0.579 -1.24 -2.45 0.563 0.022
9 BRCA9 BRCA -1.16 1.21 -2.04 -0.616 -0.287
10 BRCA10 BRCA -0.894 0.288 -2.31 -0.314 1.33
绘制EZH2表达值的boxplot图
library(ggpubr)
ggboxplot(expr, x = "dataset", y = "EZH2",
title = "EZH2", ylab = "Expression",
color = "dataset", palette = "jco")
ezh2.jpeg
4 获取某基因在任意癌症中的rnaseq表达数据并可视化(ggplot2,ggpubr和boxplotTCGA)
-
ggplot2和ggpubr的用法与第3部分几乎一致,而boxplotTCGA无法直接获取mRNA表达数据,只有rnaseq,另外还有mutation等数据。
-
不同的是,不仅需要gene symbol还要entrez的ID,如MET|4233
4.0 获取rnaseq表达数据
- 获取EZH2,PTEN,HGFAC,FYN等4个gene在BRCA,OV和LUSC中的rnaseq表达数据
expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))
> expr
# A tibble: 2,071 x 6
bcr_patient_barcode dataset `EZH2|2146` `PTEN|5728` `HGFAC|3083` `FYN|2534`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 TCGA-3C-AAAU-01A-11R-A41B-07 BRCA.rnaseq 487. 1724. 4.83 247.
2 TCGA-3C-AALI-01A-11R-A41B-07 BRCA.rnaseq 941. 1107. 9.79 523.
3 TCGA-3C-AALJ-01A-31R-A41B-07 BRCA.rnaseq 492. 1479. 7.25 814.
4 TCGA-3C-AALK-01A-11R-A41B-07 BRCA.rnaseq 334. 1877. 9.10 562.
5 TCGA-4H-AAAK-01A-12R-A41B-07 BRCA.rnaseq 297. 1740. 1.70 549.
6 TCGA-5L-AAT0-01A-12R-A41B-07 BRCA.rnaseq 238. 1597. 7.63 689.
7 TCGA-5L-AAT1-01A-12R-A41B-07 BRCA.rnaseq 258. 1374. 0.815 853.
8 TCGA-5T-A9QA-01A-11R-A41B-07 BRCA.rnaseq 253. 2181. 3.14 84.7
9 TCGA-A1-A0SB-01A-11R-A144-07 BRCA.rnaseq 392. 2529. 0.901 740.
10 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.rnaseq 191. 1876. 0 522.
# ... with 2,061 more rows
4.1 ggplot2绘制指定基因在不同肿瘤中的表达boxplot(rnaseq)
#rnaseq ggplot2
expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))
expr1<-expr[,-1]
expr2<-gather(expr1, key ="rnaseq", value="value", -dataset)
ggplot(expr2, aes(y=value,
x=reorder(dataset, value, mean),
fill= dataset))+
geom_boxplot()+
theme_RTCGA()+
scale_fill_brewer(palette = "Set3")+
facet_grid(rnaseq~.)+
theme(legend.position = "top")
rnaseq ggplot2.jpeg
4.2 ggpubr
查看样本量
nb_samples <- table(expr$dataset)
nb_samples
> nb_samples
BRCA.rnaseq LUSC.rnaseq OV.rnaseq
1212 552 307
ggboxplot(expr, x = "dataset", y = "`PTEN|5728`",
title = "ESR1|2099", ylab = "Expression",
color = "dataset", palette = "jco")
PTEN_5728.jpeg
4.3 boxplotTCGA
具体参考RTCGA的文档
expressionsTCGA(LIHC.rnaseq, BLCA.rnaseq, BRCA.rnaseq, OV.rnaseq,extra
extract.cols = "MET|4233") %>%
rename(cohort = dataset,
MET = `MET|4233`) %>%
#cancer samples
filter(substr(bcr_patient_barcode, 14, 15) == "01") ->
ACC_BLCA_BRCA_OV.rnaseq
boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "MET")
MET RNASEQ.jpeg
boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "log1p(MET)")
Rplot.jpeg
后记-----------------------------------------
这部分只是用RTCGA演示了如何下载数据;mRNA和rnaseq表达值的plot。