2018-12-24
因地制宜的生信达人
Filtering population-frequency variants out of MAF-format somatic mutation data with ANNOVAR
First, turn the MAF-format somatic mutation file into ANNOVAR's input format:
# cut emits fields in their original file order no matter how -f lists them,
# so the reorder is done in two passes plus a paste
cut -f 5-7,12,13,1,16 ping_organoids_all.maf |cut -f 2-7 > 1   # chr, start, end, Tumor_Seq_Allele1/2, sample
cut -f 5-7,12,13,1,16 ping_organoids_all.maf |cut -f 1 > 2     # Hugo_Symbol
paste 1 2 > for_annovar.input ### 13,027 sites in total
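The two-pass cut trick above is needed because cut always emits fields in their original file order. With awk the same reorder can be done in one pass; here is a sketch on a one-line toy MAF (the file name and all values are invented for illustration):

```shell
# Toy tab-separated "MAF" line; only the columns used here are meaningful:
# 1=Hugo_Symbol, 5=Chromosome, 6=Start, 7=End, 12/13=Tumor_Seq_Allele1/2, 16=sample
printf 'TP53\tc2\tc3\tc4\tchr17\t7674220\t7674220\tc8\tc9\tc10\tc11\tC\tT\tc14\tc15\tS1\n' > toy.maf

# One-pass equivalent of: cut|cut > 1 ; cut|cut > 2 ; paste 1 2
awk -F'\t' -v OFS='\t' '{print $5,$6,$7,$12,$13,$16,$1}' toy.maf > toy_annovar.input
cat toy_annovar.input   # chr17  7674220  7674220  C  T  S1  TP53
```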
Then run ANNOVAR's batch annotation:
#!/bin/bash
#SBATCH --job-name Annovar
#SBATCH --partition FHS_NORMAL
#SBATCH --nodes 1
#SBATCH --tasks-per-node 1
#SBATCH --mem 10G
#SBATCH --output Annovar.%j.out
#SBATCH --error Annovar.%j.err
#SBATCH --mail-type FAIL
#SBATCH --mail-user yb77613@umac.mo
bin=/home/haitaowang/Database/hg38/Annovar/
db=/home/haitaowang/Database/hg38/Annovar/
perl $bin/table_annovar.pl for_annovar.input $db -buildver hg38 -out tmp \
-protocol refGene,cytoBand,esp6500siv2_all,exac03,gnomad_genome,cosmic70,1000g2015aug_all,clinvar_20170905 \
-operation g,r,r,f,f,f,f,f -nastring NA
With this many databases the annotation takes quite a while. The output files are listed below (note that the listing also contains databases such as avsnp150, dbnsfp33a, genomicSuperDups, intervar_20180118, mcap and revel, so the actual run used a longer -protocol list than the command shown above):
551K Aug 17 16:52 for_annovar.input
99K Aug 17 17:10 tmp.hg38_ALL.sites.2015_08_dropped
423K Aug 17 17:10 tmp.hg38_ALL.sites.2015_08_filtered
459K Aug 17 16:57 tmp.hg38_avsnp150_dropped
174K Aug 17 16:57 tmp.hg38_avsnp150_filtered
56K Aug 17 17:11 tmp.hg38_clinvar_20170905_dropped
477K Aug 17 17:11 tmp.hg38_clinvar_20170905_filtered
35K Aug 17 17:10 tmp.hg38_cosmic70_dropped
470K Aug 17 17:10 tmp.hg38_cosmic70_filtered
667K Aug 17 16:53 tmp.hg38_cytoBand
366K Aug 17 17:10 tmp.hg38_dbnsfp33a_dropped
445K Aug 17 17:10 tmp.hg38_dbnsfp33a_filtered
115K Aug 17 16:53 tmp.hg38_esp6500siv2_all_dropped
410K Aug 17 16:53 tmp.hg38_esp6500siv2_all_filtered
423K Aug 17 16:53 tmp.hg38_exac03_dropped
307K Aug 17 16:53 tmp.hg38_exac03_filtered
286K Aug 17 16:53 tmp.hg38_genomicSuperDups
731K Aug 17 16:55 tmp.hg38_gnomad_genome_dropped
178K Aug 17 16:55 tmp.hg38_gnomad_genome_filtered
153K Aug 17 17:11 tmp.hg38_intervar_20180118_dropped
436K Aug 17 17:11 tmp.hg38_intervar_20180118_filtered
40K Aug 17 17:12 tmp.hg38_mcap_dropped
457K Aug 17 17:12 tmp.hg38_mcap_filtered
6.3M Aug 17 17:12 tmp.hg38_multianno.txt
48K Aug 17 17:12 tmp.hg38_revel_dropped
447K Aug 17 17:12 tmp.hg38_revel_filtered
68K Aug 17 17:12 tmp.invalid_input
956 Aug 17 17:12 tmp.log
208K Aug 17 16:53 tmp.refGene.exonic_variant_function
68K Aug 17 16:53 tmp.refGene.invalid_input
1.2K Aug 17 16:53 tmp.refGene.log
761K Aug 17 16:53 tmp.refGene.variant_function
A quick count of the lines in each _dropped file:
1454 tmp.hg38_ALL.sites.2015_08_dropped
7400 tmp.hg38_avsnp150_dropped
170 tmp.hg38_clinvar_20170905_dropped
326 tmp.hg38_cosmic70_dropped
937 tmp.hg38_dbnsfp33a_dropped
1800 tmp.hg38_esp6500siv2_all_dropped
4262 tmp.hg38_exac03_dropped
7305 tmp.hg38_gnomad_genome_dropped
1167 tmp.hg38_intervar_20180118_dropped
648 tmp.hg38_mcap_dropped
890 tmp.hg38_revel_dropped
Each database then has to be interpreted in turn.
First, extract the coordinates dropped by the 1000 Genomes population frequency at the 0.05 cutoff:
perl -alne '{print if $F[1]>0.05}' tmp.hg38_ALL.sites.2015_08_dropped > filter_by_1000g.pos   # column 2 ($F[1]) is the population frequency
With these coordinates in hand, we can filter the original MAF file:
cat filter_by_1000g.pos ping_organoids_all.maf |perl -alne '{if(/^1000/){$h{"$F[2]\t$F[3]"}=1}else{print unless exists $h{"$F[4]\t$F[5]"}}}' > filter_by_1000g.maf
The same operation works for the other databases:
# the frequency column of these _dropped files can hold comma-separated values;
# take the first one before comparing against the cutoff
perl -alne '{print if (split(",",$F[1]))[0]>0.05}' tmp.hg38_gnomad_genome_dropped > filter_by_gnomad.pos
perl -alne '{print if (split(",",$F[1]))[0]>0.05}' tmp.hg38_exac03_dropped > filter_by_exac03.pos
cat filter_by_exac03.pos filter_by_1000g.maf |perl -alne '{if(/^exac03/){$h{"$F[2]\t$F[3]"}=1}else{print unless exists $h{"$F[4]\t$F[5]"}}}' > filter_by_1000g_exac03.maf
cat filter_by_gnomad.pos filter_by_1000g_exac03.maf|perl -alne '{if(/^gnomad_genome/){$h{"$F[2]\t$F[3]"}=1}else{print unless exists $h{"$F[4]\t$F[5]"}}}' > filter_by_1000g_exac03_gnomeAD.maf
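Chaining these cat|perl filters by hand scales poorly: each database needs its own /^db/ prefix regex in the one-liner. One way to generalize is awk's two-file NR==FNR idiom in a loop; the sketch below runs entirely on toy files, so every filename and value is illustrative:

```shell
# Toy stand-ins for the real filter_by_*.pos files and the MAF
printf 'db\t0.12\tchr1\t100\t100\tA\tG\n' > demo_1000g.pos
printf 'db\t0.30\tchr2\t200\t200\tC\tT\n' > demo_exac03.pos
printf 'G1\tx\tx\tx\tchr1\t100\nG2\tx\tx\tx\tchr2\t200\nG3\tx\tx\tx\tchr3\t300\n' > demo.maf

cur=demo.maf
for pos in demo_1000g.pos demo_exac03.pos; do
  # .pos columns 3/4 are chr/start; MAF columns 5/6 are chr/start
  awk -F'\t' 'NR==FNR{h[$3"\t"$4]=1; next} !(($5"\t"$6) in h)' "$pos" "$cur" > "${pos%.pos}.maf"
  cur="${pos%.pos}.maf"
done
cat "$cur"   # only the G3 line survives both filters
```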
The effect of each filtering step:
wc -l *.maf
9955 filter_by_1000g_exac03_gnomeAD.maf
10588 filter_by_1000g_exac03.maf
12141 filter_by_1000g.maf
The final filter_by_1000g_exac03_gnomeAD.maf can now go into downstream analysis.