生信生信分析流程

WGS分析笔记(7)—— 新的mapping工具

2020-03-15  本文已影响0人  十三而舍

数据预处理

$ fastp -i /your/path/of/data/male_data/141008_D00360_0058_AHB675ADXX/NA24694_GCCAAT_L001_R1_001.fastq.gz -I /your/path/of/data/male_data/141008_D00360_0058_AHB675ADXX/NA24694_GCCAAT_L001_R2_001.fastq.gz -o /bio-analysis1/mapper_test/data/N1/NA24694_GCCAAT_L001_R1_001.fastq.gz -O /bio-analysis1/mapper_test/data/N1/NA24694_GCCAAT_L001_R2_001.fastq.gz -h male.008.html -j male.008.json -c -q 20 -u 50 -n 15 -5 20 -3 20 -w 12

bwa mem2

$ curl -L https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.0pre2/bwa-mem2-2.0pre2_x64-linux.tar.bz2 | tar jxf -
$ bwa-mem2-2.0pre2_x64-linux/bwa-mem2 index ref.fa
$ bwa-mem2-2.0pre2_x64-linux/bwa-mem2 mem ref.fa read1.fq read2.fq > out.sam
# 两个下载该参考基因组的地址,一个是 EBI 网站上的,一个是 GATK 上的,没啥区别,除了文件名,下载后请自行解压。
$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
$ wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/human_g1k_v37_decoy.fasta.gz
$ time bwa-mem2 mem -t 12 -R "@RG\tID:NA24694\tPL:ILLUMINA\tLB:L001\tSM:NA24694" /your/path/of/b37/human_g1k_v37_decoy.fasta /bio-analysis1/mapper_test/data/N1/NA24694_GCCAAT_L001_R1_001.fastq.gz /bio-analysis1/mapper_test/data/N1/NA24694_GCCAAT_L001_R2_001.fastq.gz | samtools view -Shb - > male008.bam
# -t 是线程数,三个软件我都用12线程进行运行
# -R 是头文件,这里可以不设置
# time命令用来统计时间

vg

$ wget https://github.com/vgteam/vg/releases/download/v1.22.0/vg
$ chmod +x vg
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
# 建立索引,tabix是bcftools下载完后自带的
$ tabix ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
#1
(seq 1 22; echo X; echo Y) | parallel -j 24 "time vg construct -C -R {} -r /your/path/of/b37/human_g1k_v37_decoy.fasta -v ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz -t 1 -m 32 >{}.vg"
#2
vg ids -j $(for i in $(seq 1 22; echo X; echo Y); do echo $i.vg; done)
#3
vg index -x wg.xg $(for i in $(seq 22; echo X; echo Y); do echo $i.vg; done)
#4
for chr in $(seq 1 22; echo X; echo Y);
do
    vg prune -r $chr.vg > $chr.pruned.vg
done
#5
vg index -b ./tmp -t 24 -g wg.gcsa $(for i in $(seq 22; echo X; echo Y); do echo $i.pruned.vg; done)
$ time vg map -t 12 -x wg.xg -g wg.gcsa -f /bio-analysis1/mapper_test/data/N1/NA24694_GCCAAT_L001_R1_001.fastq.gz -f /bio-analysis1/mapper_test/data/N1/NA24694_GCCAAT_L001_R2_001.fastq.gz > male_008.gam
$ vg surject -x wg.xg -b in.gam > out.bam

Graph Genome

/biosoft/graph_genome
├── docker-bpa-0.9.1.tar
├── docker-rasm-0.5.20.tar
├── example_human_Illumina.pe_1.fastq
├── example_human_Illumina.pe_2.fastq
├── LICENSE.pdf
├── manifest.json
├── opensource-software-terms-and-copyrights.md
├── repositories
├── SAMPLE.sort.vcf
└── SBG.Graph.B37.V6.rc6.vcf.gz
$ docker load --input docker-bpa-0.9.1.tar
$ docker load --input docker-rasm-0.5.20.tar
# 查看安装完的镜像信息
$ docker images
$ time docker run -v "/your/path/of/fastq/":"/input" -v "/your/path/of/vcf/":"/file" -v "/your/path/of/reference/":"/ref" gral-bpa:"0.9.1" /usr/local/bin/aligner --vcf /file/SBG.Graph.B37.V6.rc6.vcf.gz --reference /ref/human_g1k_v37_decoy.fasta -q /input/NA24694_GCCAAT_L001_R1_001.fastq.gz -Q /input/NA24694_GCCAAT_L001_R2_001.fastq.gz -o /file/out/male_008.bam --threads 12

结果比较

# bwa-mem2
#==================================================
#All numbers are READ count
#==================================================

Total records:                          8024912

QC failed:                              0
Optical/PCR duplicate:                  0
Non primary hits                        0
Unmapped reads:                         57103
mapq < mapq_cut (non-unique):           585944

mapq >= mapq_cut (unique):              7381865
Read-1:                                 3726141
Read-2:                                 3655724
Reads map to '+':                       3691175
Reads map to '-':                       3690690
Non-splice reads:                       7381865
Splice reads:                           0
Reads mapped in proper pairs:           7272853
Proper-paired reads map to different chrom:1110

# graph genome
#==================================================
#All numbers are READ count
#==================================================

Total records:                          8000000

QC failed:                              0
Optical/PCR duplicate:                  0
Non primary hits                        0
Unmapped reads:                         165747
mapq < mapq_cut (non-unique):           513535

mapq >= mapq_cut (unique):              7320718
Read-1:                                 3728036
Read-2:                                 3592682
Reads map to '+':                       3660744
Reads map to '-':                       3659974
Non-splice reads:                       7320718
Splice reads:                           0
Reads mapped in proper pairs:           7137762
Proper-paired reads map to different chrom:0

# vg
#==================================================
#All numbers are READ count
#==================================================

Total records:                          8000000

QC failed:                              0
Optical/PCR duplicate:                  0
Non primary hits                        0
Unmapped reads:                         170272
mapq < mapq_cut (non-unique):           738039

mapq >= mapq_cut (unique):              7091689
Read-1:                                 0
Read-2:                                 0
Reads map to '+':                       3545844
Reads map to '-':                       3545845
Non-splice reads:                       7091689
Splice reads:                           0
Reads mapped in proper pairs:           0
Proper-paired reads map to different chrom:0
# bwa-mem2
8024912 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
24912 + 0 supplementary
0 + 0 duplicates
7967809 + 0 mapped (99.29% : N/A)
8000000 + 0 paired in sequencing
4000000 + 0 read1
4000000 + 0 read2
7746532 + 0 properly paired (96.83% : N/A)
7886064 + 0 with itself and mate mapped
56833 + 0 singletons (0.71% : N/A)
64324 + 0 with mate mapped to a different chr
45865 + 0 with mate mapped to a different chr (mapQ>=5)

# graph genome
8000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
7834253 + 0 mapped (97.93% : N/A)
8000000 + 0 paired in sequencing
4000000 + 0 read1
4000000 + 0 read2
7593524 + 0 properly paired (94.92% : N/A)
7671608 + 0 with itself and mate mapped
162645 + 0 singletons (2.03% : N/A)
23810 + 0 with mate mapped to a different chr
17071 + 0 with mate mapped to a different chr (mapQ>=5)

# vg
8000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
7829728 + 0 mapped (97.87% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

  水平有限,要是存在什么错误请指出,可发送邮件至shiyuant@outlook.com!请大家多多批评指正,相互交流,共同成长,谢谢!!!

上一篇 下一篇

猜你喜欢

热点阅读