Importing MAF-format somatic mutation data into ANNOVAR to filter out population-frequency variants

2018-12-24  因地制宜的生信达人


First, convert the MAF-format somatic mutation data into ANNOVAR's input format:

# MAF fields: 1 = Hugo_Symbol, 5-7 = Chromosome/Start/End,
# 12,13 = tumor alleles, 16 = Tumor_Sample_Barcode
cut -f 5-7,12,13,1,16 ping_organoids_all.maf | cut -f 2-7 > 1   # coordinate/allele/barcode columns
cut -f 5-7,12,13,1,16 ping_organoids_all.maf | cut -f 1 > 2     # gene symbol column
paste 1 2 > for_annovar.input   ### 13,027 sites in total
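A subtlety worth noting: `cut` ignores the order of the fields given to `-f` and always emits them in file order, so a single `cut` cannot move the gene symbol to the last column; the two-step cut/paste above works around this. A minimal sketch on a hypothetical toy table:

```shell
# cut always emits fields in file order, regardless of the order given to -f:
printf 'g1\tc1\tp1\n' > toy.tsv
cut -f 3,1 toy.tsv                       # prints g1, then p1 (not p1, g1)

# The two-step workaround used above: extract once, then paste in the order wanted.
cut -f 1,3 toy.tsv | cut -f 2 > col_p    # position column
cut -f 1,3 toy.tsv | cut -f 1 > col_g    # gene column
paste col_p col_g > reordered.tsv        # now p1 first, g1 second
```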

Then run ANNOVAR's batch annotation:

#!/bin/bash
#SBATCH --job-name        Annovar
#SBATCH --partition       FHS_NORMAL
#SBATCH --nodes           1
#SBATCH --tasks-per-node  1
#SBATCH --mem             10G
#SBATCH --output          Annovar.%j.out
#SBATCH --error           Annovar.%j.err
#SBATCH --mail-type       FAIL
#SBATCH --mail-user       yb77613@umac.mo
bin=/home/haitaowang/Database/hg38/Annovar/
db=/home/haitaowang/Database/hg38/Annovar/
perl $bin/table_annovar.pl for_annovar.input $db -buildver hg38 -out tmp \
-protocol refGene,cytoBand,esp6500siv2_all,exac03,gnomad_genome,cosmic70,1000g2015aug_all,clinvar_20170905 \
-operation g,r,f,f,f,f,f,f -nastring NA
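`table_annovar.pl` expects the first five columns of its input to be chromosome, start, end, reference allele and alternate allele; any extra columns (here the sample barcode and gene symbol) are carried through to the output. A quick format sanity check, sketched on a hypothetical toy line rather than the real `for_annovar.input`:

```shell
# Toy line in ANNOVAR's expected layout: chr, start, end, ref, alt, then extras.
printf 'chr1\t100\t100\tA\tG\tsample1\tTP53\n' > toy.avinput
# Count lines with at least five columns and numeric start/end positions.
awk -F'\t' 'NF >= 5 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {ok++} END {print ok, "valid lines"}' toy.avinput
# prints: 1 valid lines
```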

Because many databases are queried, annotation takes quite a while. The output files are listed below (note: this run also included several extra databases, such as avsnp150, dbnsfp33a, genomicSuperDups, intervar_20180118, mcap and revel, beyond those in the command shown above):

551K Aug 17 16:52 for_annovar.input
 99K Aug 17 17:10 tmp.hg38_ALL.sites.2015_08_dropped
423K Aug 17 17:10 tmp.hg38_ALL.sites.2015_08_filtered
459K Aug 17 16:57 tmp.hg38_avsnp150_dropped
174K Aug 17 16:57 tmp.hg38_avsnp150_filtered
 56K Aug 17 17:11 tmp.hg38_clinvar_20170905_dropped
477K Aug 17 17:11 tmp.hg38_clinvar_20170905_filtered
 35K Aug 17 17:10 tmp.hg38_cosmic70_dropped
470K Aug 17 17:10 tmp.hg38_cosmic70_filtered
667K Aug 17 16:53 tmp.hg38_cytoBand
366K Aug 17 17:10 tmp.hg38_dbnsfp33a_dropped
445K Aug 17 17:10 tmp.hg38_dbnsfp33a_filtered
115K Aug 17 16:53 tmp.hg38_esp6500siv2_all_dropped
410K Aug 17 16:53 tmp.hg38_esp6500siv2_all_filtered
423K Aug 17 16:53 tmp.hg38_exac03_dropped
307K Aug 17 16:53 tmp.hg38_exac03_filtered
286K Aug 17 16:53 tmp.hg38_genomicSuperDups
731K Aug 17 16:55 tmp.hg38_gnomad_genome_dropped
178K Aug 17 16:55 tmp.hg38_gnomad_genome_filtered
153K Aug 17 17:11 tmp.hg38_intervar_20180118_dropped
436K Aug 17 17:11 tmp.hg38_intervar_20180118_filtered
 40K Aug 17 17:12 tmp.hg38_mcap_dropped
457K Aug 17 17:12 tmp.hg38_mcap_filtered
6.3M Aug 17 17:12 tmp.hg38_multianno.txt
 48K Aug 17 17:12 tmp.hg38_revel_dropped
447K Aug 17 17:12 tmp.hg38_revel_filtered
 68K Aug 17 17:12 tmp.invalid_input
 956 Aug 17 17:12 tmp.log
208K Aug 17 16:53 tmp.refGene.exonic_variant_function
 68K Aug 17 16:53 tmp.refGene.invalid_input
1.2K Aug 17 16:53 tmp.refGene.log
761K Aug 17 16:53 tmp.refGene.variant_function

A quick line count of the dropped variants per database:

   1454 tmp.hg38_ALL.sites.2015_08_dropped
   7400 tmp.hg38_avsnp150_dropped
    170 tmp.hg38_clinvar_20170905_dropped
    326 tmp.hg38_cosmic70_dropped
    937 tmp.hg38_dbnsfp33a_dropped
   1800 tmp.hg38_esp6500siv2_all_dropped
   4262 tmp.hg38_exac03_dropped
   7305 tmp.hg38_gnomad_genome_dropped
   1167 tmp.hg38_intervar_20180118_dropped
    648 tmp.hg38_mcap_dropped
    890 tmp.hg38_revel_dropped

Each database's results then need to be interpreted one by one.

First, pull out the coordinates removed by the 1000 Genomes population-frequency filter at the 0.05 cutoff:

perl -alne '{print if $F[1]>0.05}' tmp.hg38_ALL.sites.2015_08_dropped > filter_by_1000g.pos

With these coordinates in hand, we can filter the original MAF file:

cat filter_by_1000g.pos  ping_organoids_all.maf   |perl -alne '{if(/^1000/){$h{"$F[2]\t$F[3]"}=1}else{print unless exists $h{"$F[4]\t$F[5]"}}}' > filter_by_1000g.maf

The same operation applies to the other databases:

perl -alne '{print if (split(",",$F[1]))[0]>0.05}' tmp.hg38_gnomad_genome_dropped > filter_by_gnomad.pos
perl -alne '{print if (split(",",$F[1]))[0]>0.05}' tmp.hg38_exac03_dropped  > filter_by_exac03.pos
cat filter_by_exac03.pos filter_by_1000g.maf |perl -alne '{if(/^exac03/){$h{"$F[2]\t$F[3]"}=1}else{print unless exists $h{"$F[4]\t$F[5]"}}}' >  filter_by_1000g_exac03.maf
cat filter_by_gnomad.pos   filter_by_1000g_exac03.maf|perl -alne '{if(/^gnomad_genome/){$h{"$F[2]\t$F[3]"}=1}else{print unless exists $h{"$F[4]\t$F[5]"}}}' >  filter_by_1000g_exac03_gnomeAD.maf

The filtering effect:

wc -l *.maf
    9955 filter_by_1000g_exac03_gnomeAD.maf
   10588 filter_by_1000g_exac03.maf
   12141 filter_by_1000g.maf

The final filter_by_1000g_exac03_gnomeAD.maf is now ready for downstream analysis.
