DNA 2. SCI 文章中基因组变异分析神器之 maftool
点击关注,桓峰基因
桓峰基因 生物信息分析,SCI文章撰写及生物信息基础知识学习:R语言学习,perl基础编程,linux系统命令,Python遇见更好的你 80篇原创内容 --> 公众号
前 言
随着癌症基因组学的进步,突变注释格式(MAF)正在被广泛接受并用于存储检测到的体细胞变异。癌症基因组图谱项目已经对30多种不同的癌症进行了测序,每种癌症的样本量都超过了200个。由体细胞变体组成的结果数据以突变注释格式的形式存储。只要数据是MAF格式,本包试图从TCGA来源或任何内部研究中以有效的方式总结、分析、注释和可视化MAF文件。
实例解析
1. 软件安装
在安装这个软件maftools时,需要先安装BioManager,然后再安装maftools,如下:
if (!require("BiocManager")) {
install.packages("BiocManager")
}
if (!require(maftools)) {
BiocManager::install("maftools")
}
library(maftools)
2. 数据读取
maftools工具需要读入两个文件,如下:
1.MAF文件-可以是gz压缩。必需的;
2.与MAF中每个样本/肿瘤样本条码相关的可选推荐的临床数据;
3.一个可选的拷贝数数据:可以是GISTIC输出或自定义表。
1.maf文件格式
MAF文件包含许多字段,从染色体名称到cosmic注释。然而,maftools中的大多数分析使用以下列如下:1.Hugo_Symbol;
2.Chromosome;
3.Start_Position;
4.End_Position;
5.Reference_Allele;
6.Tumor_Seq_Allele2;
7.Variant_Classification;
8.Variant_Type;
9.Tumor_Sample_Barcode.
同时读取maf文件和临床信息文件,看下结果,如下:
# path to TCGA LAML MAF file
laml.maf = system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
# clinical information containing survival information and histology. This is
# optional
laml.clin = system.file("extdata", "tcga_laml_annot.tsv", package = "maftools")
laml = read.maf(maf = laml.maf, clinicalData = laml.clin)
## -Reading
## -Validating
## -Silent variants: 475
## -Summarizing
## -Processing clinical data
## -Finished in 3.790s elapsed (0.550s cpu)
print(laml@data[1:5, ])
## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome
## 1: ABCA10 10349 genome.wustl.edu 37 17
## 2: ABCA4 24 genome.wustl.edu 37 1
## 3: ABCB11 8647 genome.wustl.edu 37 2
## 4: ABCC3 8714 genome.wustl.edu 37 17
## 5: ABCF1 23 genome.wustl.edu 37 6
## Start_Position End_Position Strand Variant_Classification Variant_Type
## 1: 67170917 67170917 + Splice_Site SNP
## 2: 94490594 94490594 + Missense_Mutation SNP
## 3: 169780250 169780250 + Missense_Mutation SNP
## 4: 48760974 48760974 + Missense_Mutation SNP
## 5: 30554429 30554429 + Missense_Mutation SNP
## Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
## 1: T T C TCGA-AB-2988
## 2: C C T TCGA-AB-2869
## 3: G G A TCGA-AB-3009
## 4: C C T TCGA-AB-2887
## 5: G G A TCGA-AB-2920
## Protein_Change i_TumorVAF_WU i_transcript_name
## 1: p.K960R 45.66000 NM_080282.3
## 2: p.R1517H 38.12000 NM_000350.2
## 3: p.A1283V 46.97218 NM_003742.2
## 4: p.P1271S 56.41000 NM_003786.1
## 5: p.G658S 40.95000 NM_001025091.1
2.临床信息格式
临床数据格式包括:每个样本/肿瘤样本条码,和对应的临床数据,如下:
print(laml@clinical.data[1:5, ])
## Tumor_Sample_Barcode FAB_classification days_to_last_followup
## 1: TCGA-AB-2802 M4 365
## 2: TCGA-AB-2803 M3 792
## 3: TCGA-AB-2804 M3 2557
## 4: TCGA-AB-2805 M0 577
## 5: TCGA-AB-2806 M1 945
## Overall_Survival_Status
## 1: 1
## 2: 1
## 3: 0
## 4: 1
## 5: 1
3.拷贝数变异格式
拷贝数数据包含样本名称,基因名称和拷贝数状态(Amp或Del)。
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")
laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes,
gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
laml.gistic@data[1:5, ]
## Hugo_Symbol Tumor_Sample_Barcode Variant_Classification Variant_Type
## 1: KMT2A TCGA-AB-2805 Amp CNV
## 2: LINC00689 TCGA-AB-2805 Del CNV
## 3: MIR595 TCGA-AB-2805 Del CNV
## 4: RN7SL142P TCGA-AB-2805 Del CNV
## 5: SHH TCGA-AB-2805 Del CNV
## Cytoband
## 1: AP_2:11q23.3
## 2: DP_4:7q32.3
## 3: DP_4:7q32.3
## 4: DP_4:7q32.3
## 5: DP_4:7q32.3
3.数据统计
对整体数据情况进行汇总,统计样本量,突变基因量,以及突变类型,最后整理到表格中,如下:
# Shows sample summry.
getSampleSummary(laml)
## Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1: TCGA-AB-3009 0 5 0
## 2: TCGA-AB-2807 1 0 1
## 3: TCGA-AB-2959 0 0 0
## 4: TCGA-AB-3002 0 0 0
## 5: TCGA-AB-2849 0 1 0
## ---
## 189: TCGA-AB-2942 0 0 0
## 190: TCGA-AB-2946 0 0 0
## 191: TCGA-AB-2954 0 0 0
## 192: TCGA-AB-2982 0 0 0
## 193: TCGA-AB-2903 0 0 0
## In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1: 1 25 2 1 34
## 2: 0 16 3 4 25
## 3: 0 22 0 1 23
## 4: 0 15 1 5 21
## 5: 0 16 1 2 20
## ---
## 189: 1 0 0 0 1
## 190: 0 1 0 0 1
## 191: 0 1 0 0 1
## 192: 0 1 0 0 1
## 193: 0 0 0 0 0
# Shows gene summary.
getGeneSummary(laml)
## Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
## 1: FLT3 0 0 1 33
## 2: DNMT3A 4 0 0 0
## 3: NPM1 0 33 0 0
## 4: IDH2 0 0 0 0
## 5: IDH1 0 0 0 0
## ---
## 1237: ZNF689 0 0 0 0
## 1238: ZNF75D 0 0 0 0
## 1239: ZNF827 1 0 0 0
## 1240: ZNF99 0 0 0 0
## 1241: ZPBP 0 0 0 0
## Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
## 1: 15 0 3 52 52
## 2: 39 5 6 54 48
## 3: 1 0 0 34 33
## 4: 20 0 0 20 20
## 5: 18 0 0 18 18
## ---
## 1237: 1 0 0 1 1
## 1238: 1 0 0 1 1
## 1239: 0 0 0 1 1
## 1240: 1 0 0 1 1
## 1241: 1 0 0 1 1
## AlteredSamples
## 1: 52
## 2: 48
## 3: 33
## 4: 20
## 5: 18
## ---
## 1237: 1
## 1238: 1
## 1239: 1
## 1240: 1
## 1241: 1
# shows clinical data associated with samples
getClinicalData(laml)
## Tumor_Sample_Barcode FAB_classification days_to_last_followup
## 1: TCGA-AB-2802 M4 365
## 2: TCGA-AB-2803 M3 792
## 3: TCGA-AB-2804 M3 2557
## 4: TCGA-AB-2805 M0 577
## 5: TCGA-AB-2806 M1 945
## ---
## 189: TCGA-AB-3007 M3 1581
## 190: TCGA-AB-3008 M1 822
## 191: TCGA-AB-3009 M4 577
## 192: TCGA-AB-3011 M1 1885
## 193: TCGA-AB-3012 M3 1887
## Overall_Survival_Status
## 1: 1
## 2: 1
## 3: 0
## 4: 1
## 5: 1
## ---
## 189: 0
## 190: 1
## 191: 1
## 192: 0
## 193: 0
# Shows all fields in MAF
getFields(laml)
## [1] "Hugo_Symbol" "Entrez_Gene_Id" "Center"
## [4] "NCBI_Build" "Chromosome" "Start_Position"
## [7] "End_Position" "Strand" "Variant_Classification"
## [10] "Variant_Type" "Reference_Allele" "Tumor_Seq_Allele1"
## [13] "Tumor_Seq_Allele2" "Tumor_Sample_Barcode" "Protein_Change"
## [16] "i_TumorVAF_WU" "i_transcript_name"
# Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = "laml")
数据汇总统计绘图,将每个样本中的变量数量显示为堆叠的barplot,将变量类型显示为Variant_Classification汇总的箱线图。如下:
plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = "median", dashboard = TRUE,
titvRaw = FALSE)
4.突变数据可视化
瀑布图
# oncoplot for top ten mutated genes.
oncoplot(maf = laml, top = 10)
顺反式图
laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
# plot titv summary
plotTiTv(res = laml.titv)
棒棒糖图(氨基酸)
# lollipop plot for DNMT3A, which is one of the most frequent mutated gene in
# Leukemia.
lollipopPlot(maf = laml, gene = "DNMT3A", AACol = "Protein_Change", showMutationRate = TRUE,
labelPos = 882)
## HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_022552 NP_072046 912
## 2: DNMT3A NM_153759 NP_715640 723
## 3: DNMT3A NM_175629 NP_783328 912
棒棒糖图(蛋白)
plotProtein(gene = "TP53", refSeqID = "NM_000546")
降雨图
brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)
rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.4)
## Chromosome Start_Position End_Position nMuts Avg_intermutation_dist Size
## 1: 8 98129348 98133560 7 702.0000 4212
## 2: 8 98398549 98403536 9 623.3750 4987
## 3: 8 98453076 98456466 9 423.7500 3390
## 4: 8 124090377 124096810 22 306.3333 6433
## 5: 12 97436055 97439705 7 608.3333 3650
## 6: 17 29332072 29336153 8 583.0000 4081
## Tumor_Sample_Barcode C>G C>T
## 1: TCGA-A8-A08B 4 3
## 2: TCGA-A8-A08B 1 8
## 3: TCGA-A8-A08B 1 8
## 4: TCGA-A8-A08B 1 21
## 5: TCGA-A8-A08B 4 3
## 6: TCGA-A8-A08B 4 4
突变负荷与TCGA比较
laml.mutload = tcgaCompare(maf = laml, cohortName = "Example-LAML", logscale = TRUE,
capture_size = 50)
VAF分布图
plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')
CNV可视化
基因组分布
gisticChromPlot(gistic = laml.gistic, markBands = "all")
气泡图
gisticBubblePlot(gistic = laml.gistic)
瀑布图
gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = "FAB_classification",
sortByAnnotation = TRUE, top = 10)
CBS 可视化
tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg)
## NULL
这期主要说说单纯的可视化数据,下期说说利用maftools分析一些实质性的操作,敬请期待,想学习生信的可以到B站,每天上传教程!
我们在也推出克隆进化的系列文章,从组织到单细胞的克隆进化分析都有介绍,供大家参考:
有做肿瘤的转移,癌症的耐药性亚克隆,靶向用药等课题的老师都可以直接联系我,保证最好的服务,最顶级的科研方法,助力高分!
References:
本文使用 文章同步助手 同步