甲可

TCGA数据库copy number variation数据分析

2020-10-10  本文已影响0人  生信start_site

上一篇文章,我们从TCGA数据库下载了cnv数据,进行了数据的预处理,并使用GISTIC在线软件生成了19个文件。这篇笔记是用我们得到的这些文件进行CNV可视化。还是使用maftools包来画图。

从GISTIC软件生成的19个文件里,有4个文件是maftools包需要的输入文件:

(一)构建GISTIC对象

> library(maftools)
> hnsc.gistic <- readGistic(gisticAllLesionsFile = "all_lesions.conf_90.txt", 
                          gisticAmpGenesFile = "amp_genes.conf_90.txt", 
                          gisticDelGenesFile = "del_genes.conf_90.txt", 
                          gisticScoresFile = "scores.gistic", 
                          isTCGA = TRUE)

-Processing Gistic files..
--Processing amp_genes.conf_90.txt
--Processing del_genes.conf_90.txt
--Processing scores.gistic
--Summarizing by samples

> hnsc.gistic
An object of class  GISTIC 
          ID summary
1:   Samples     522
2:    nGenes    2716
3: cytoBands     118
4:       Amp  168191
5:       Del  258478
6:     total  426669

构建好gistic对象后,和maf对象一样,也可以查看这个对象的情况总结:

> getSampleSummary(hnsc.gistic)
     Tumor_Sample_Barcode Amp  Del total
  1:         TCGA-CV-7254 897 1278  2175
  2:         TCGA-CN-6024 610 1371  1981
  3:         TCGA-IQ-A6SH 708 1229  1937
  4:         TCGA-D6-A6EP 688 1182  1870
  5:         TCGA-CV-7424 809 1034  1843
 ---                                    
518:         TCGA-CV-7411   0    2     2
519:         TCGA-CX-7085   0    2     2
520:         TCGA-CR-7391   0    0     0
521:         TCGA-CR-7393   0    0     0
522:         TCGA-CV-A45O   0    0     0
> getGeneSummary(hnsc.gistic)
       Hugo_Symbol Amp Del total AlteredSamples
   1:    MIR4444-1 115 376   491            389
   2:       CASC21 407   0   407            406
   3:        CASC8 407   0   407            406
   4:         TP63 401   0   401            400
   5:    LINC01206 392   0   392            392
  ---                                          
2712:        TFB2M   0  56    56             55
2713: LOC101927926   0  41    41             41
2714: LOC101927948   0  41    41             41
2715: LOC101927967   0  41    41             41
2716:       SNAR-H   0  41    41             41

可以使用write.GisticSummary函数保存上面的summary数据:

> write.GisticSummary(hnsc.gistic,"hnsc.gistic.summary")

(二)利用maftools进行可视化

(1)genome plot
> gisticChromPlot(gistic = hnsc.gistic, markBands = "all")

如果去掉上面代码里的“markBands”参数,则默认只展示q值最低的top5区域:

> gisticChromPlot(gistic = hnsc.gistic)
(2)Bubble plot
> gisticBubblePlot(gistic = hnsc.gistic)
(3)oncoplot

在所有的样品里,拷贝数变异top10的区域

> gisticOncoPlot(gistic = hnsc.gistic, sortByAnnotation = TRUE, top = 10)

你也可以结合临床特征来画oncoplot:

> gisticOncoPlot(gistic = hnsc.gistic, clinicalData = getClinicalData(x = hnsc.maf), clinicalFeatures = 'ajcc_clinical_stage', sortByAnnotation = TRUE, top = 10)
(4)Visualizing CBS segments

segment的可视化可以用所有样品的seg文件:

> plotCBSsegments(cbsFile = "tumor_segments.txt", tsb = 'ALL') #这时R会把所有样品的图依次画出来

也可以只画某一个样品的segment,但是需要把某个样品的数据从总的样品中提取出来:

> all_seg <- read.table("tumor_segments.txt",sep = "\t",header = T)
> all_seg$Sample <- substr(all_seg$Sample,1,12)
> tcga.4p.aa8j.seg <- all_seg[all_seg$Sample =="TCGA-4P-AA8J",]
> write.table(tcga.4p.aa8j,"tcga.4p.aa8j.txt",sep = "\t",col.names = T,row.names = F)
> plotCBSsegments(cbsFile = "tcga.4p.aa8j.txt")
(5) Ignoring variants in copy number altered regions

我们可以使用拷贝数信息来忽略那些落在拷贝数变异区域的variants。拷贝数的变异导致了过高/过低的variant等位基因频率,从而影响clustering。去掉这些variant会提高估算的clustering和密度,保留具有生物学意义的结果。

#Ignoring variants in copy number altered regions
> tcga.4p.aa8j.het <- inferHeterogeneity(maf = hnsc.maf, tsb = 'TCGA-4P-AA8J', segFile = "tcga.4p.aa8j.txt", vafCol = NULL)
#Visualizing results. Highlighting those variants on copynumber altered variants.
> plotClusters(clusters = tcga.4p.aa8j.het, genes = 'CN_altered', showCNvars = TRUE)

在上图中,凡是落在峰的外侧的基因,都是落在拷贝数变异区域的variants。在我的这个结果里,没有这类基因,官网上贴出的图是个很好的例子:

NF1和SUZ12两个基因就是落在拷贝数变异区域的基因,在分析中被忽略的。
上一篇下一篇

猜你喜欢

热点阅读