全基因组/外显子组测序分析生信小白

全外显子组数据分析笔记(四):变异检测

2018-11-11  本文已影响51人  TOP生物信息

变异检测

得到原始vcf的命令很简单

#snp=/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/dbsnp_146.hg38.vcf.gz
${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" HaplotypeCaller \
               -R ${ref}  \
               -I ${info2}.smrb.bam \
               --dbsnp ${snp} \
               -O ${info2}.raw.vcf \
               1>${info2}.HaplotypeCaller.log 2>&1

如果只想对特定的区域进行变异检测呢?可以这样:

samtools view -h CL100072545.44.smrb.bam chr19:1205798-1228434 | samtools sort -o STK11.smrb.bam -
samtools index STK11.smrb.bam

这里举一个错误的例子:

samtools view -h CL100072545.44.smrb.bam chr19:1205798-1228434 > STK11.smrb.bam

这是不对的,重定向(>)之前文件是以sam格式打开的,即使重新定义为bam,它本身还是sam文件(可以直接用less查看),所以可以samtools sort转换一下(虽然排序的作用是多余的,但生成了真的bam文件)。
接着可以简单看一下指定区域的比对情况,用samtools tview命令,类似于IGV,但远没有IGV清晰。

samtools tview -p chr19:1205798-1228434 STK11.smrb.bam \
--reference ~/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta
#如下
#此界面下输入问号?显示帮助信息,区域小还能用一下,区域大了简直难受
   1206801   1206811   1206821   1206831   1206841   1206851   1206861   1206871   1206881   1206891   1206901   1206911   1206921
TTTTTTTTTTCTTTTTTCTTTGTAAAATTTTGGAGAAGGGAAGTCGGAACACAAGGAAGGACCGCTCACCCGCGGACTCAGGGCTGGCGGCGGGACTCCAGGACCCTGGGTCCAGCATGGAGGTGGTGGACCC
.....................................................................................................................................
......... ..................................................    .................................................. ,,,,,,,,,,,,,,,,,,
................ ..................................................   ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  ,,,,,,,,,,,
..................    ..................................................   ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,
..................     ..................................................   ........................T.........................      .
..................     ..................................................   ..................................................
........................... ..................................................              ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...................................  ..................................................        ......................................
......................................   ..................................................    ......................................
........................................... .................................................. ......................................
........................................... .................................................. ......................................
................................................   ..................................................  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
     ..................................................   ............................................G.....   ......................
      ..................................................        .................................................. ,,,,,,,,,,,,,,,,,,
      ..................................................        .................................................. ,,,,,,,,,,,,,,,,,,
       ...........................A.........G............   .................................................. ,,,,,,,,,,,,,,,,,,
       ..................................................   ..................................................  .................
       ..................................................   ..................................................   ................
       ..................................................   ........................N.........................   ................

针对特定基因找变异

这里找变异的流程主要参考的是samtools官网提供的流程,链接http://www.htslib.org/workflow/#mapping_to_variant
在官网主页有这样一段话:

Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:
Samtools
Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format
BCFtools
Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants
HTSlib
A C library for reading/writing high-throughput sequencing data

第一步
bcftools mpileup --output-type u -o STK11.bcf \
-f ~/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
STK11.smrb.bam
这里--output-type u表示输出不压缩的bcf文件

这一步之后会生成bcf文件,我们知道vcf(variant call format)是一种用于存储基因序列突变信息的文本格式,而bcf格式文件是vcf格式的二进制文件,所以bcf不能直接查看。注意这里强调的是格式而不是内容,上面的bcf文件直接由bam得到,没有经过任何筛选,所以不可能每一个位置都是变异点。

可以这样查看
bcftools view ./STK11.bcf | lsx
如下,ALT是<*>当然就是没有变异咯
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CL100072545
chr19   1206383 .       C       <*>     0       .       DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,0,0,0,0;QS=1,0;MQ0F=0   PL      0,3,30
chr19   1206384 .       G       <*>     0       .       DP=1;I16=1,0,0,0,27,729,0,0,60,3600,0,0,1,1,0,0;QS=1,0;MQ0F=0   PL      0,3,27
chr19   1206385 .       G       <*>     0       .       DP=1;I16=1,0,0,0,31,961,0,0,60,3600,0,0,2,4,0,0;QS=1,0;MQ0F=0   PL      0,3,31
再来看看vcf文件(事先准备好的)
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CL100072545
chr19   1207142 .       GT      G       222     .       INDEL;IDV=29;IMF=0.432836;DP=67;VDB=0.74912;SGB=-0.693079;MQSB=1;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=15,23,15,14;MQ=60   GT:PL   0/1:255,0,255
chr19   1207239 .       G       T       225     .       DP=68;VDB=0.150142;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,11,51;MQ=60  GT:PL   1/1:255,184,0
chr19   1207281 .       C       T       131     .       DP=34;VDB=0.232203;SGB=-0.69168;RPB=0.572388;MQB=1;BQB=0.010939;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,14,0,19;MQ=60      GT:PL   0/1:164,0,183
第二步
bcftools call -v -m -O v -o STK11.vcf STK11.bcf
-v  output variant sites only
-O v  output type: 'v' uncompressed VCF

变异校正(VQSR, Variant Quality Score Recalibration)

第一步
${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" VariantRecalibrator \
        -R ${ref}  \
        -V ${info2}.raw.vcf \
        --resource 1000G,known=false,training=true,truth=false,prior=10.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
        --resource hapmap,known=false,training=true,truth=true,prior=15.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/hapmap_3.3.hg38.vcf.gz \
        --resource dbsnp,known=true,training=false,truth=false,prior=2.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/dbsnp_146.hg38.vcf.gz \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
        -mode SNP \
        -O ${info2}.snps.recal \
        --tranches-file ${info2}.snps.tranches \
        --rscript-file ${info2}.snps.plots.R
第二步
${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" ApplyVQSR \
        -R ${ref}  \
        -V ${info2}.raw.vcf \
        -mode SNP \
        --recal-file ${info2}.snps.recal \
        --tranches-file ${info2}.snps.tranches \
        -O ${info2}.snps.VQSR.vcf
        --truth-sensitivity-filter-level 99.0

再对indel来一遍

${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" VariantRecalibrator \
        -R ${ref}  \
        -V ${info2}.snps.VQSR.vcf \
        -resource mills,known=true,training=true,truth=true,prior=12.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
        -mode INDEL \
        -O  ${info2}.snps.indels.recal \
        --tranches-file ${info2}.snps.indels.tranches \
        --rscript-file ${info2}.snps.indels.plots.R
${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" ApplyVQSR \
        -R ${ref}  \
        -V ${info2}.snps.VQSR.vcf \
        -mode INDEL \
        --recal-file ${info2}.snps.indels.recal \
        --tranches-file ${info2}.snps.indels.tranches \
        -O ${info2}.snps.indels.VQSR.vcf
        --truth-sensitivity-filter-level 99.0

这一步其实就是变异过滤,通常来讲有两种策略:

硬指标过滤

给一些指标设定临界值,各项指标大于临界值的位点才会被保留。

变异校正(机器学习的方法)

就是上面的VQSR,更强大,但需要非常完备的数据集。


Hard-filtering is a very blunt instrument
Variant recalibration is far more discriminating

接下来就可以按照snp和indel把vcf分开,到这儿才算结束了这一步。

java -jar ~/GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
-R /ifs1/Grp3/huangsiyuan/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
-V CL100072545.44.snps.indels.VQSR.vcf \
-selectType SNP -o snp.vcf 
java -jar ~/GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
-R /ifs1/Grp3/huangsiyuan/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
-V CL100072545.44.snps.indels.VQSR.vcf \
-selectType INDEL -o indel.vcf

reference

(howto) Apply hard filters to a call set: https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set

上一篇 下一篇

猜你喜欢

热点阅读