10-GATK的最佳实践-bqsr检查
2022-05-08 本文已影响0人
jiarf
前面sed跑完之后就可以跑bqsr了,sed还是很快的,
在跑一下
GATK=/data1/jiarongf/learning/cancer-WES/run/gatk-4.2.6.1/gatk
snp=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz
indel=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
ref=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta
cd /data1/jiarongf/learning/cancer-WES/5.GATK
cat ../data/case | while read id
do
if [ ! -f ${id}_bqsr.bam ]
then
echo "start BQSR for ${id}" `date`
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp/" BaseRecalibrator \
-R $ref \
-I ${id}_marked.chr.bam \
--known-sites $snp \
--known-sites $indel \
-O ${id}_recal.table \
1>../log/${id}_log.recal 2>&1
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" ApplyBQSR \
-R $ref \
-I ${id}_marked.chr.bam \
-bqsr ${id}_recal.table \
-O ${id}_bqsr.bam \
1>../log/${id}_log.ApplyBQSR 2>&1
echo "end BQSR for ${id}" `date`
fi
done
# gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" BaseRecalibrator \
# -R /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta \
# -I case1_biorep_A_techrep_WES_marked.chr.sed.bam \
# --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz \
# --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
# -O case1_biorep_A_techrep_WES_recal.table \
看着一点第一个出来的log。recal,那个chrMT还有没有问题了
(KG) 11:37:50 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/log
$
tail case5_germline_WES_log.recal
11:39:04.846 INFO ProgressMeter - chr11:8872371 6.8 17145000 2505382.4
11:39:14.853 INFO ProgressMeter - chr11:18291832 7.0 17583000 2508262.0
11:39:24.871 INFO ProgressMeter - chr11:31370706 7.2 17928000 2497973.9
11:39:34.896 INFO ProgressMeter - chr11:44600085 7.3 18257000 2485941.1
11:39:44.903 INFO ProgressMeter - chr11:55913813 7.5 18663000 2484794.3
11:39:54.904 INFO ProgressMeter - chr11:62690975 7.7 19155000 2494931.1
11:40:04.908 INFO ProgressMeter - chr11:67441227 7.8 19681000 2508960.9
11:40:14.943 INFO ProgressMeter - chr11:75071241 8.0 20112000 2510375.6
11:40:25.006 INFO ProgressMeter - chr11:86807547 8.2 20467000 2502307.7
11:40:35.009 INFO ProgressMeter - chr11:100089766 8.3 20804000 2492696.1
才跑到11号染色体,在等的
感觉不对,发现sed那个还是没有把chrMT改成chrM
samtools view -h case2_techrep_2_WES_marked.chr.bam | sed 's/chrMT/chrM/g' | samtools view -b > case2_techrep_2_WES_marked.chr.sed.bam
先试一个,跑的挺慢的
(KG) 14:09:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
$
samtools view -h case2_techrep_2_WES_marked.chr.bam | sed 's/chrMT/chrM/g' | samtools view -b > case2_techrep_2_WES_marked.chr.sed.bam
(KG) 14:32:22 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
$
samtools view -h case2_techrep_2_WES_marked.chr.sed.bam | grep chrMT
(KG) 14:41:19 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
可以看到已经没有这chrMT了,在循环跑一下chrMT的代替
(KG) 14:43:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
$
bash sed_chrMT.sh > ../log/sed_chrMT.sh.log 2>&1
这个5.GATK下面已经六百多G了,删掉一些吧,目前这个文件下面
删掉marked.bam就可以了
(KG) 14:43:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
$
bash sed_chrMT.sh > ../log/sed_chrMT.sh.log 2>&1
^C
因为跑的时间太长了,把每个都分开跑吧,多开几个screen
只跑了case1的
做bqsr
GATK=/data1/jiarongf/learning/cancer-WES/run/gatk-4.2.6.1/gatk
snp=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz
indel=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
ref=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta
cd /data1/jiarongf/learning/cancer-WES/5.GATK
#cat ../data/case | while read id
ls *chr.sed.bam | while read id
do
if [ ! -f ${id}_bqsr.bam ]
then
echo "start BQSR for ${id}" `date`
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp/" BaseRecalibrator \
-R $ref \
-I ${id}_marked.chr.sed.bam \
--known-sites $snp \
--known-sites $indel \
-O ${id}_recal.table \
1>../log/${id}_log.recal 2>&1
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" ApplyBQSR \
-R $ref \
-I ${id}_marked.chr.sed.bam \
-bqsr ${id}_recal.table \
-O ${id}_bqsr.bam \
1>../log/${id}_log.ApplyBQSR 2>&1
echo "end BQSR for ${id}" `date`
fi
done
# gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" BaseRecalibrator \
# -R /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta \
# -I case1_biorep_A_techrep_WES_marked.chr.sed.bam \
# --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz \
# --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
# -O case1_biorep_A_techrep_WES_recal.table \
又有问题了,我猜后面的染色体应该都有问题,所以这里告诉我们不应该乱加自己的chr,还有就是之前bwa比对的时候,那个基因组应该是加chr的染色体,不然比对的bam文件到后面的GATK的fasta基因组的时候会有很大的问题,所以比对的那个也要找找怎么回事
好了这个我猜后面还有错误的,就不继续往下了,就是要记住自己比对的bwa基因组要和GTAK的基因组的染色体要注意一致。