TCGA数据库copy number variation数据分析
2020-10-10 本文已影响0人
生信start_site
上一篇文章,我们从TCGA数据库下载了cnv数据,进行了数据的预处理,并使用GISTIC在线软件生成了19个文件。这篇笔记是用我们得到的这些文件进行CNV可视化。还是使用maftools包来画图。
从GISTIC软件生成的19个文件里,有4个文件是maftools包需要的输入文件:
(一)构建GISTIC对象
> library(maftools)
> hnsc.gistic <- readGistic(gisticAllLesionsFile = "all_lesions.conf_90.txt",
gisticAmpGenesFile = "amp_genes.conf_90.txt",
gisticDelGenesFile = "del_genes.conf_90.txt",
gisticScoresFile = "scores.gistic",
isTCGA = TRUE)
-Processing Gistic files..
--Processing amp_genes.conf_90.txt
--Processing del_genes.conf_90.txt
--Processing scores.gistic
--Summarizing by samples
> hnsc.gistic
An object of class GISTIC
ID summary
1: Samples 522
2: nGenes 2716
3: cytoBands 118
4: Amp 168191
5: Del 258478
6: total 426669
构建好gistic对象后,和maf对象一样,也可以查看这个对象的情况总结:
> getSampleSummary(hnsc.gistic)
Tumor_Sample_Barcode Amp Del total
1: TCGA-CV-7254 897 1278 2175
2: TCGA-CN-6024 610 1371 1981
3: TCGA-IQ-A6SH 708 1229 1937
4: TCGA-D6-A6EP 688 1182 1870
5: TCGA-CV-7424 809 1034 1843
---
518: TCGA-CV-7411 0 2 2
519: TCGA-CX-7085 0 2 2
520: TCGA-CR-7391 0 0 0
521: TCGA-CR-7393 0 0 0
522: TCGA-CV-A45O 0 0 0
> getGeneSummary(hnsc.gistic)
Hugo_Symbol Amp Del total AlteredSamples
1: MIR4444-1 115 376 491 389
2: CASC21 407 0 407 406
3: CASC8 407 0 407 406
4: TP63 401 0 401 400
5: LINC01206 392 0 392 392
---
2712: TFB2M 0 56 56 55
2713: LOC101927926 0 41 41 41
2714: LOC101927948 0 41 41 41
2715: LOC101927967 0 41 41 41
2716: SNAR-H 0 41 41 41
可以使用write.GisticSummary
函数保存上面的summary数据:
> write.GisticSummary(hnsc.gistic,"hnsc.gistic.summary")
(二)利用maftools进行可视化
(1)genome plot
> gisticChromPlot(gistic = hnsc.gistic, markBands = "all")
如果去掉上面代码里的“markBands”参数,则默认只展示q值最低的top5区域:
> gisticChromPlot(gistic = hnsc.gistic)
(2)Bubble plot
> gisticBubblePlot(gistic = hnsc.gistic)
(3)oncoplot
在所有的样品里,拷贝数变异top10的区域
> gisticOncoPlot(gistic = hnsc.gistic, sortByAnnotation = TRUE, top = 10)
你也可以结合临床特征来画oncoplot:
> gisticOncoPlot(gistic = hnsc.gistic, clinicalData = getClinicalData(x = hnsc.maf), clinicalFeatures = 'ajcc_clinical_stage', sortByAnnotation = TRUE, top = 10)
(4)Visualizing CBS segments
segment的可视化可以用所有样品的seg文件:
> plotCBSsegments(cbsFile = "tumor_segments.txt", tsb = 'ALL') #这时R会把所有样品的图依次画出来
也可以只画某一个样品的segment,但是需要把某个样品的数据从总的样品中提取出来:
> all_seg <- read.table("tumor_segments.txt",sep = "\t",header = T)
> all_seg$Sample <- substr(all_seg$Sample,1,12)
> tcga.4p.aa8j.seg <- all_seg[all_seg$Sample =="TCGA-4P-AA8J",]
> write.table(tcga.4p.aa8j,"tcga.4p.aa8j.txt",sep = "\t",col.names = T,row.names = F)
> plotCBSsegments(cbsFile = "tcga.4p.aa8j.txt")
(5) Ignoring variants in copy number altered regions
我们可以使用拷贝数信息来忽略那些落在拷贝数变异区域的variants。拷贝数的变异导致了过高/过低的variant等位基因频率,从而影响clustering。去掉这些variant会提高估算的clustering和密度,保留具有生物学意义的结果。
#Ignoring variants in copy number altered regions
> tcga.4p.aa8j.het <- inferHeterogeneity(maf = hnsc.maf, tsb = 'TCGA-4P-AA8J', segFile = "tcga.4p.aa8j.txt", vafCol = NULL)
#Visualizing results. Highlighting those variants on copynumber altered variants.
> plotClusters(clusters = tcga.4p.aa8j.het, genes = 'CN_altered', showCNvars = TRUE)
在上图中,凡是落在峰的外侧的基因,都是落在拷贝数变异区域的variants。在我的这个结果里,没有这类基因,官网上贴出的图是个很好的例子:
NF1和SUZ12两个基因就是落在拷贝数变异区域的基因,在分析中被忽略的。