10-GATK的最佳实践-bqsr检查

2022-05-08  本文已影响0人  jiarf

前面sed跑完之后就可以跑bqsr了,sed还是很快的,
在跑一下

GATK=/data1/jiarongf/learning/cancer-WES/run/gatk-4.2.6.1/gatk
snp=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz
indel=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
ref=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta
cd /data1/jiarongf/learning/cancer-WES/5.GATK
cat ../data/case  | while read id
do
    if [ ! -f ${id}_bqsr.bam ]
    then
    echo "start BQSR for ${id}" `date`
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp/"  BaseRecalibrator \
    -R $ref  \
    -I ${id}_marked.chr.bam  \
    --known-sites $snp \
    --known-sites $indel \
    -O ${id}_recal.table \
    1>../log/${id}_log.recal 2>&1 

    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp"  ApplyBQSR \
    -R $ref  \
    -I ${id}_marked.chr.bam  \
    -bqsr ${id}_recal.table \
    -O ${id}_bqsr.bam \
    1>../log/${id}_log.ApplyBQSR  2>&1 

    echo "end BQSR for ${id}" `date`
    fi
done

# gatk --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
    # -R /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta  \
    # -I case1_biorep_A_techrep_WES_marked.chr.sed.bam  \
    # --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz \
    # --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    # -O case1_biorep_A_techrep_WES_recal.table \

看着一点第一个出来的log。recal,那个chrMT还有没有问题了

(KG) 11:37:50 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/log
$
tail case5_germline_WES_log.recal
11:39:04.846 INFO  ProgressMeter -        chr11:8872371              6.8              17145000        2505382.4
11:39:14.853 INFO  ProgressMeter -       chr11:18291832              7.0              17583000        2508262.0
11:39:24.871 INFO  ProgressMeter -       chr11:31370706              7.2              17928000        2497973.9
11:39:34.896 INFO  ProgressMeter -       chr11:44600085              7.3              18257000        2485941.1
11:39:44.903 INFO  ProgressMeter -       chr11:55913813              7.5              18663000        2484794.3
11:39:54.904 INFO  ProgressMeter -       chr11:62690975              7.7              19155000        2494931.1
11:40:04.908 INFO  ProgressMeter -       chr11:67441227              7.8              19681000        2508960.9
11:40:14.943 INFO  ProgressMeter -       chr11:75071241              8.0              20112000        2510375.6
11:40:25.006 INFO  ProgressMeter -       chr11:86807547              8.2              20467000        2502307.7
11:40:35.009 INFO  ProgressMeter -      chr11:100089766              8.3              20804000        2492696.1

才跑到11号染色体,在等的
感觉不对,发现sed那个还是没有把chrMT改成chrM

samtools view -h case2_techrep_2_WES_marked.chr.bam | sed 's/chrMT/chrM/g' | samtools view -b > case2_techrep_2_WES_marked.chr.sed.bam

先试一个,跑的挺慢的

(KG) 14:09:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
$
samtools view -h case2_techrep_2_WES_marked.chr.bam | sed 's/chrMT/chrM/g' | samtools view -b > case2_techrep_2_WES_marked.chr.sed.bam
(KG) 14:32:22 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
$
samtools view -h case2_techrep_2_WES_marked.chr.sed.bam | grep chrMT
(KG) 14:41:19 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK

可以看到已经没有这chrMT了,在循环跑一下chrMT的代替

(KG) 14:43:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
$
bash sed_chrMT.sh > ../log/sed_chrMT.sh.log 2>&1

这个5.GATK下面已经六百多G了,删掉一些吧,目前这个文件下面



删掉marked.bam就可以了

(KG) 14:43:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
$
bash sed_chrMT.sh > ../log/sed_chrMT.sh.log 2>&1
^C

因为跑的时间太长了,把每个都分开跑吧,多开几个screen
只跑了case1的
做bqsr

GATK=/data1/jiarongf/learning/cancer-WES/run/gatk-4.2.6.1/gatk
snp=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz
indel=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
ref=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta
cd /data1/jiarongf/learning/cancer-WES/5.GATK
#cat ../data/case  | while read id
ls *chr.sed.bam | while read id
do
    if [ ! -f ${id}_bqsr.bam ]
    then
    echo "start BQSR for ${id}" `date`
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp/"  BaseRecalibrator \
    -R $ref  \
    -I ${id}_marked.chr.sed.bam  \
    --known-sites $snp \
    --known-sites $indel \
    -O ${id}_recal.table \
    1>../log/${id}_log.recal 2>&1 

    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp"  ApplyBQSR \
    -R $ref  \
    -I ${id}_marked.chr.sed.bam  \
    -bqsr ${id}_recal.table \
    -O ${id}_bqsr.bam \
    1>../log/${id}_log.ApplyBQSR  2>&1 

    echo "end BQSR for ${id}" `date`
    fi
done

# gatk --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
    # -R /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta  \
    # -I case1_biorep_A_techrep_WES_marked.chr.sed.bam  \
    # --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz \
    # --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    # -O case1_biorep_A_techrep_WES_recal.table \
又有问题了,我猜后面的染色体应该都有问题,所以这里告诉我们不应该乱加自己的chr,还有就是之前bwa比对的时候,那个基因组应该是加chr的染色体,不然比对的bam文件到后面的GATK的fasta基因组的时候会有很大的问题,所以比对的那个也要找找怎么回事

好了这个我猜后面还有错误的,就不继续往下了,就是要记住自己比对的bwa基因组要和GTAK的基因组的染色体要注意一致。

上一篇下一篇

猜你喜欢

热点阅读