2020-06-24 学习WES数据分析流程3
- 群体遗传专题-重测序技术介绍
- 群体遗传专题:系统发育基础知识介绍
- 群体遗传专题-structure图的构建与理解
- 群体遗传专题-学习运用Fst
- 群体遗传专题:读取数据和数据的基本格式
- 群体遗传专题-PCA分析
- 群体遗传专题:基础知识概念理解
- 群体遗传专题:关于SNP的过滤(2):如何使用vcftools进行SNP过滤
- 群体遗传专题:关于SNP的过滤(1):如何使用vcftools进行SNP过滤
- 群体遗传专题:关于SNP的过滤(3):如何从NGS生成高质量的SNP数据集?
今天的学习之前需要复习一下基础如上所示:
使用HaplotypeCaller对上述的bam文件进行call variant,就是寻找变异的过程,生成一个VCF文件用于后续位点的质控和注释,从一个bam到一个vcf文件
time ~/wes_cancer/biosoft/gatk-4.1.7.0/gatk HaplotypeCaller -R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta -I 9Y1640_bqsr.bam -O 9Y1640WES.HC.vcf
这个过程比昨天的还要久运行结果如下 :
6761758 total reads filtered
09:19:47.687 INFO ProgressMeter - HLA-DRB1*16:02:01:10700 464.1 11090871 23898.8
09:19:47.687 INFO ProgressMeter - Traversal complete. Processed 11090871 total regions in 464.1 minutes.
09:19:48.095 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 17613.102690375003
09:19:48.095 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 1732.41 sec
09:19:48.095 INFO HaplotypeCaller - Shutting down engine
[June 24, 2020 at 9:19:48 AM UTC] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 464.12 minutes.
Runtime.totalMemory()=2202009600
real 464m13.962s
user 474m2.054s
sys 3m4.715s
共耗时464.1分钟,之后就可以用不同的方法过滤掉置信度不高的位点,然后进行注释,注释过程见下文。
先来比较一下和昨天获得的raw.vcf(也就是gvcf) .
(base) root@1100150:~/wes_cancer/project/5.gatk# less -S 9Y1640_raw.vcf |grep -v "#"|less -S |head
chr1 69091 . A <NON_REF> . . END=69161 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,11
chr1 69162 . T <NON_REF> . . END=69175 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,73
chr1 69176 . G <NON_REF> . . END=69191 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,93
chr1 69192 . A <NON_REF> . . END=69232 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,54
chr1 69233 . T <NON_REF> . . END=69254 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,95
chr1 69255 . G <NON_REF> . . END=69263 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,151
chr1 69264 . C <NON_REF> . . END=69269 GT:DP:GQ:MIN_DP:PL 0/0:5:15:5:0,15,195
chr1 69270 rs201219564 A G,<NON_REF> 128.96 . DB;DP=5;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=3125,5 GT:AD:DP:GQ:PL:SB 1/1:0,5,0:5:15:143,15,0,143,15,143:0,0,5,0
chr1 69271 . C <NON_REF> . . END=69276 GT:DP:GQ:MIN_DP:PL 0/0:5:15:5:0,15,189
chr1 69277 . G <NON_REF> . . END=69285 GT:DP:GQ:MIN_DP:PL 0/0:6:18:6:0,18,222
(base) root@1100150:~/wes_cancer/project/5.gatk# less -S 9Y1640WES.HC.vcf |grep -v "#"|less -S |head
chr1 14574 . A G 159.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.653;DP=37;ExcessHet=3.0103;FS=2.663;MLEAC=1;MLEAF=0.500;MQ=30.00;MQRankSum=-4.861;QD=4.56;ReadPosRankSum=-0.128;SOR=1.539 GT:AD:DP:GQ:PL 0/1:25,10:35:99:167,0,647
chr1 14590 . G A 298.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=4.998;DP=48;ExcessHet=3.0103;FS=7.638;MLEAC=1;MLEAF=0.500;MQ=32.48;MQRankSum=-5.175;QD=6.22;ReadPosRankSum=1.525;SOR=2.670 GT:AD:DP:GQ:PL 0/1:38,10:48:99:306,0,1566
chr1 14599 . T A 291.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.291;DP=50;ExcessHet=3.0103;FS=10.372;MLEAC=1;MLEAF=0.500;MQ=32.67;MQRankSum=-5.213;QD=5.83;ReadPosRankSum=1.614;SOR=3.014 GT:AD:DP:GQ:PL 0/1:40,10:50:99:299,0,1649
chr1 14604 . A G 399.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.537;DP=59;ExcessHet=3.0103;FS=12.817;MLEAC=1;MLEAF=0.500;MQ=33.70;MQRankSum=-4.818;QD=6.77;ReadPosRankSum=0.247;SOR=3.533 GT:AD:DP:GQ:PL 0/1:46,13:59:99:407,0,1872
chr1 14610 . T C 411.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=5.871;DP=66;ExcessHet=3.0103;FS=21.573;MLEAC=1;MLEAF=0.500;MQ=33.76;MQRankSum=-4.854;QD=6.33;ReadPosRankSum=0.367;SOR=4.266 GT:AD:DP:GQ:PL 0/1:51,14:65:99:419,0,1958
chr1 14653 . C T 1074.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.334;DP=149;ExcessHet=3.0103;FS=6.839;MLEAC=1;MLEAF=0.500;MQ=37.03;MQRankSum=-2.288;QD=7.57;ReadPosRankSum=-3.040;SOR=1.266 GT:AD:DP:GQ:PL 0/1:89,53:142:99:1082,0,1965
chr1 14932 . G T 67.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.524;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=27.83;MQRankSum=-1.645;QD=13.53;ReadPosRankSum=1.036;SOR=1.609 GT:AD:DP:GQ:PL 0/1:3,2:5:75:75,0,120
chr1 14937 . T C 67.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.645;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.53;MQRankSum=-1.645;QD=13.53;ReadPosRankSum=1.036;SOR=1.609 GT:AD:DP:GQ:PL 0/1:3,2:5:75:75,0,120
chr1 15903 . G GC 585 . AC=2;AF=1.00;AN=2;DP=19;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.15;QD=32.50;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,18:18:54:599,54,0
chr1 16378 . T C 51.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.204;DP=10;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=23.26;MQRankSum=-2.189;QD=7.38;ReadPosRankSum=-1.718;SOR=0.495 GT:AD:DP:GQ:PL 0/1:4,3:7:59:59,0,98
GATK找变异
参考:https://zhuanlan.zhihu.com/p/69726572用GATK找变异
#SNP calling
# rescource 文件需要先index,参考文件尽量全,最后三行的文件都是输出
# 步骤1-4
time ~/wes_cancer/biosoft/gatk-4.1.7.0/gatk VariantRecalibrator -R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta -V 9Y1640WES.HC.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~/wes_cancer/data/hapmap_3.3.hg38.vcf.gz \
-resource:omini,known=false,training=true,truth=false,prior=12.0 ~/wes_cancer/data/1000G_omni2.5.hg38.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 ~/wes_cancer/data/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ~/wes_cancer/data/dbsnp_146.hg38.vcf.gz \
-an DP -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode SNP -tranche 100.0 \
-tranche 99.9 -tranche 99.0 -tranche 95.0 -tranche 90.0 \
-O 9Y1640WES.snp.recal \
--tranches-file 9Y1640WES.snp.tranches \
--rscript-file 9Y1640WES.snp.plots.R
# 步骤5
time ~/wes_cancer/biosoft/gatk-4.1.7.0/gatk ApplyVQSR -R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta -V 9Y1640WES.HC.vcf \
--ts-filter-level 99.0 --tranches-file 9Y1640WES.snp.tranches \
--recal-file 9Y1640WES.snp.recal \
-mode SNP \
--output data/9Y1640WES.snps.VQSR.vcf.gz
## 查看数据文件
$ cat 9Y1640WES.snp.tranches
-tranche
默认是输出[100,99.9,99.0,90.0]
4个tranche阈值的统计结果,如果想看其他阈值的结果,需要自行加上;结果就是看9Y1640WES.snp.tranches
,而9Y1640WES.snp.recal
文件则是用于变异质控ApplyVQSR
的.
#INDEL calling
#加了--max-gaussians 6用于设定Gaussians(clusters of variants that have similar properties)的数目,即减少聚类的组数,从而使得每个组的变异位点数目达到要求
time ~/wes_cancer/biosoft/gatk-4.1.7.0/gatk VariantRecalibrator -R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta -V data/9Y1640WES.snps.VQSR.vcf.gz \
-resource:mills,known=true,training=true,truth=true,prior=12.0 ~/wes_cancer/data/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-an DP -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
-mode INDEL --max-gaussians 6 \
--rscript-file 9Y1640WES_L1.snp.indel.plots.R \
--tranches-file 9Y1640WES.snp.indel.tranches \
--output data/9Y1640WES.snp.indel.recal
~/wes_cancer/biosoft/gatk-4.1.7.0/gatk ApplyVQSR \
-R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta \
-V data/9Y1640WES.snps.VQSR.vcf.gz \
-O 9Y1640WES.indel.VQSR.vcf \
--truth-sensitivity-filter-level 99.0 \
--tranches-file 9Y1640WES.snp.indel.tranches \
--recal-file data/9Y1640WES.snp.indel.recal \
-mode INDEL
#(上面这个9Y1640WES.indel.VQSR.vcf就是我们千辛万苦想得到的文件!)
VQSR大概步骤:
-
GATK认为VQSR比根据各种annotations进行hard-filtering过滤要好,减少了人为阈值的局限性,避免了一刀切的弊端,从而在sensitivity和specificity之间达到一定的平衡
-
VQSR根据机器学习算法从highly validated变异位点数据集(每个位点的annotation profile,一般使用5-8个annotation)中获取到good variants/bad variants
-
根据上述的位点从我们自己数据集中挑选出一个变异子集(probably true positives)来建模训练,获得一个可识别good variants的模型;bad variants的模型也是如此获得
-
然后根据上述获得的模型,对自己数- 据集的每个变异位点进行一个总的打分
-
最后根据设定的sensitivity阈值对变异位点进行过滤
按照官方教程:
SNP的VQSR过滤,选用的resource datasets为:
- HapMap,hapmap_3.3.hg38.vcf.gz,
truth=true
表示VQSR将这个数据集中的变异位点作为真位点true sites,training=true
表示VQSR将true sites用于训练recalibration model,并赋予这些变异位点prior likelihood值为Q15 (96.84%)
- Omni,1000G_omni2.5.hg38.vcf.gz,
truth=true
,training=false
(文档中写着是true,参数建议中写着的是false。。。我就按照参数上的来了),Q12 (93.69%)
- 1000G,1000G_phase1.snps.high_confidence.hg38.vcf.gz,
truth=false
表示VQSR考虑到在1000G数据集中的不仅包含了true variants还有false positives,training=true
,Q10 (90%)
- dbSNP,dbsnp_146.hg38.vcf.gz/dbsnp_138.hg38.vcf.gz,
truth=false
表示VQSR未将dbSNP数据集中的位点作为可信数据集,training=false
表示不用于训练数据集,known=true
表示stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not,Q2 (36.90%)
INDEL的VQSR过滤,选用的resource datasets为:
- Mills,Mills_and_1000G_gold_standard.indels.hg38.vcf.gz,
truth=true
,training=true
,Q12 (93.69%)
- dbSNP,dbsnp_146.hg38.vcf.gz/dbsnp_138.hg38.vcf.gz,
truth=false
,training=false
,known=true
,Q2 (36.90%)
查看最后的vcf文件:
(wes) root@1100150:~/wes_cancer/project/5.gatk# tail 9Y1640WES.indel.VQSR.vcf
chrUn_JTFH01001981v1_decoy 1912 . T C 841.06 PASS AC=2;AF=1.00;AN=2;DP=22;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=42.43;QD=28.54;SOR=1.931;VQSLOD=0.691;culprit=SOR GT:AD:DP:GQ:PL 1/1:0,22:22:57:855,57,0
chrUn_JTFH01001981v1_decoy 1916 . A C 841.06 PASS AC=2;AF=1.00;AN=2;DP=24;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=41.21;QD=33.27;SOR=2.124;VQSLOD=2.39;culprit=SORGT:AD:DP:GQ:PL 1/1:0,24:24:57:855,57,0
chrUn_JTFH01001981v1_decoy 1923 . T C 886.06 VQSRTrancheSNP99.00to99.90 AC=2;AF=1.00;AN=2;DP=25;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=41.46;QD=29.92;SOR=1.893;VQSLOD=0.124;culprit=SOR GT:AD:DP:GQ:PL 1/1:0,25:25:60:900,60,0
chrUn_JTFH01001981v1_decoy 1945 . T C 1246.06 PASS AC=2;AF=1.00;AN=2;DP=33;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=46.33;QD=30.47;SOR=1.751;VQSLOD=1.78;culprit=SORGT:AD:DP:GQ:PL 1/1:0,33:33:84:1260,84,0
chrUn_JTFH01001981v1_decoy 1958 . A T 1370.06 PASS AC=2;AF=1.00;AN=2;DP=39;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=45.74;QD=29.10;SOR=1.022;VQSLOD=2.74;culprit=FSGT:AD:DP:GQ:PL 1/1:0,33:33:99:1384,99,0
chrUn_JTFH01001981v1_decoy 1995 . T A 906.06 VQSRTrancheSNP99.00to99.90 AC=2;AF=1.00;AN=2;DP=37;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=44.95;QD=24.49;SOR=0.746;VQSLOD=0.283;culprit=QD GT:AD:DP:GQ:PL 1/1:0,37:37:99:920,99,0
chrUn_JTFH01001981v1_decoy 2025 . A C 651.06 PASS AC=2;AF=1.00;AN=2;DP=24;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=45.83;QD=27.13;SOR=0.859;VQSLOD=0.875;culprit=QDGT:AD:DP:GQ:PL 1/1:0,24:24:66:665,66,0
chrUn_JTFH01001981v1_decoy 2069 . A C 447.06 VQSRTrancheSNP99.00to99.90 AC=2;AF=1.00;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=49.36;QD=26.30;SOR=1.739;VQSLOD=0.408;culprit=QD GT:AD:DP:GQ:PL 1/1:0,17:17:48:461,48,0
HLA-DRB1*16:02:01 5879 . TGA T 40.60 VQSRTrancheINDEL99.00to99.90 AC=1;AF=0.500;AN=2;BaseQRankSum=-1.012;DP=15;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=51.08;MQRankSum=-1.571;NEGATIVE_TRAIN_SITE;QD=2.90;ReadPosRankSum=-1.184;SOR=0.223;VQSLOD=-2.913e+00;culprit=MQRankSum GT:AD:DP:GQ:PL 0/1:12,2:14:48:48,0,428
HLA-DRB1*16:02:01 5883 . A T 40.63 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.500;AN=2;BaseQRankSum=-1.097;DP=15;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=51.08;MQRankSum=-1.571;NEGATIVE_TRAIN_SITE;QD=2.90;ReadPosRankSum=-1.602;SOR=0.223;VQSLOD=-4.746e+00;culprit=QD GT:AD:DP:GQ:PL 0/1:12,2:14:48:48,0,428