Polishing a third-generation (long-read) genome assembly
2019-10-15
多啦A梦的时光机_648d
Part 1: Polishing a long-read assembly with Illumina short reads using Pilon
1. Download the latest Pilon jar
$wget https://github.com/broadinstitute/pilon/releases/download/v1.23/pilon-1.23.jar
$java -Xmx10G -jar pilon-1.23.jar
Running the jar immediately threw an error:
$java -Xmx10G -jar pilon-1.23.jar
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/pilon/Pilon : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
The UnsupportedClassVersionError means the Java runtime here is older than the version the jar was compiled for (major version 52.0 corresponds to Java 8). Rather than upgrade Java, I downloaded the older pilon-1.22.jar, which ran fine:
$java -Xmx10G -jar pilon-1.22.jar
Pilon version 1.22 Wed Mar 15 16:38:30 2017 -0400
Usage: pilon --genome genome.fasta [--frags frags.bam] [--jumps jumps.bam] [--unpaired unpaired.bam]
[...other options...]
pilon --help for option details
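If you would rather keep Pilon 1.23, upgrade Java to 8 or newer instead. First check what the system is actually running; the conda line below is just one possible way to get a newer JRE (the openjdk package on conda-forge is an assumption, not part of the original workflow):
$java -version
$conda install -c conda-forge openjdk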
2. Prepare the data
- The draft genome assembled from the long reads: draft.fa
- Quality-controlled Illumina paired-end reads: read1_fq.gz read2_fq.gz
3. Align the reads with BWA
- Build the index
$bwa index -p index/draft draft.fa
- Align and sort
$bwa mem -t 16 index/draft read1_fq.gz read2_fq.gz | samtools sort -@ 10 -O bam -o align.bam
- Index the sorted BAM
$samtools index -@ 10 align.bam
4. Mark duplicates
$sambamba markdup -t 10 align.bam align_markdup.bam
5. Keep only high-quality alignments
$samtools view -@ 10 -b -q 30 align_markdup.bam > align_filter.bam
$samtools index -@ 10 align_filter.bam
6. Polish with Pilon
$java -Xmx10G -jar pilon-1.22.jar --genome draft.fa --frags align_filter.bam --fix snps,indels --output pilon_polished --vcf > pilon.log 2>&1 &
- Pilon options
--frags: a paired-end library with inserts under 1 kb; --jumps: a mate-pair library with inserts over 1 kb; --bam: let Pilon guess the library type
--vcf: write a VCF file with information for every base
--fix: what Pilon should correct; snps and indels are usually enough
--variant: heuristic variant calling, equivalent to --vcf --fix all,breaks; do not use this when polishing
--minmq: minimum mapping quality for reads used in Pilon's pileups (default 0)
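Putting steps 3 to 6 together, here is a minimal sketch of one short-read polishing round as a single script (pilon_round.sh is a hypothetical name; the file names, thread counts, and options are the ones used above):
#!/bin/bash
set -euo pipefail
DRAFT=$1   # draft assembly, e.g. draft.fa
R1=$2      # Illumina read 1, e.g. read1_fq.gz
R2=$3      # Illumina read 2, e.g. read2_fq.gz
# map the short reads against the draft
mkdir -p index
bwa index -p index/draft $DRAFT
bwa mem -t 16 index/draft $R1 $R2 | samtools sort -@ 10 -O bam -o align.bam
samtools index -@ 10 align.bam
# mark duplicates and keep confident alignments only
sambamba markdup -t 10 align.bam align_markdup.bam
samtools view -@ 10 -b -q 30 align_markdup.bam > align_filter.bam
samtools index -@ 10 align_filter.bam
# polish; the corrected assembly is written to pilon_polished.fasta
java -Xmx10G -jar pilon-1.22.jar --genome $DRAFT --frags align_filter.bam \
    --fix snps,indels --output pilon_polished --vcf > pilon.log 2>&1
Pilon is commonly run for two or three such rounds, feeding pilon_polished.fasta back in as the new draft each time.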
Part 2: Polishing with the PacBio data itself
1. Prepare the software
- samtools
- arrow
- pbmm2
- pbindex
The last two are easiest to get by installing PacBio's whole tool suite through conda:
# Install
conda create -n pb-assembly -c bioconda pb-assembly
# Activate
conda activate pb-assembly
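After activating the environment, it is worth a quick sanity check that the tools resolve (assuming pb-assembly pulled them all in; conda install any missing one into the same environment):
$pbmm2 --version
$samtools --version | head -n 1
$gcpp --version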
2. Prepare the data
- The assembled genome raw_assembly.fa (produced by FALCON, Canu, MECAT2, Flye, or a similar assembler)
- The raw BAM file(s) from the sequencing provider (named like XXX.subreads.bam)
3. Run
$samtools faidx raw_assembly.fa
$pbmm2 align raw_assembly.fa pacbio.subreads.bam | samtools sort -@ 16 > map.pacbio.bam
$pbindex map.pacbio.bam
$arrow -j 16 -r raw_assembly.fa -o variants.vcf -o consensus.fasta map.pacbio.bam
GenomicConsensus is a Python-2-only package and has been superseded by gcpp, a C++ reimplementation, so the arrow step can be run with gcpp instead. Its usage and options mirror GenomicConsensus, so the last step becomes:
$gcpp -j 16 -r raw_assembly.fa -o variants.vcf -o consensus.fasta map.pacbio.bam
For reference, gcpp's full usage and options:
gcpp - Compute genomic consensus from alignments and call variants relative to the reference.
Usage:
gcpp [options] <input.bam>
input.bam STR The input BAM file.
Required input/output files:
-r,--reference FILE The filename of the reference FASTA file.
-o,--output STR The output filename(s), as a comma-separated list. Valid output formats are
.fa/.fasta, .fq/.fastq, .gff, .vcf
Output filtering:
-q,--min-confidence INT The minimum confidence for a variant call to be output to variants.{gff,vcf} [40]
-x,--min-coverage INT The minimum site coverage that must be achieved for variant calls and consensus
to be calculated for a site. [5]
--no-evidence-call STR The consensus base that will be output for sites with no effective coverage.
Valid choices: (nocall, reference, lowercasereference). [lowercasereference]
Read selection/filtering:
-X,--coverage INT A designation of the maximum coverage level to be used for analysis. Exact
interpretation is algorithm-specific. The meaningful range of this argument is
[1, 1000]. [100]
--min-accuracy FLOAT The minimum acceptable window-global alignment accuracy for reads that will be
used for the analysis (arrow-only). [0.82]
-m,--min-map-qv INT The minimum MapQV for reads that will be used for analysis. [10]
--min-read-score FLOAT The minimum ReadScore for reads that will be used for analysis (arrow-only).
[0.65]
--min-snr FLOAT The minimum acceptable signal-to-noise over all channels for reads that will be
used for analysis (arrow-only). [2.5]
-w,--windows STR The window (or multiple comma-delimited windows) of the reference to be
processed, in the format refGroup:refStart-refEnd (default: entire reference).
Algorithm and parameter settings:
--algorithm STR The consensus algorithm used. Valid choices: (arrow, plurality, poa). [arrow]
--mask-radius INT Radius of window to use when excluding local regions for exceeding
maskMinErrorRate, where 0 disables any filtering (arrow-only). [3]
--mask-error-rate FLOAT Maximum local error rate before the local region defined by maskRadius is excluded
from polishing (arrow-only). [0.7]
-P,--parameters-file STR Path to a model file or directory containing model files.
-p,--parameters-spec STR Name of chemistry or model to use, overriding default selection.
--max-iterations INT Maximum number of iterations to polish the template. [40]
--max-poa-coverage INT Maximum number of sequences to use for consensus calling. [11]
--mutation-separation INT Find the best mutations within a separation window for iterative polishing. [10]
--mutation-neighborhood INT Find nearby mutations within neighborhood for iterative polishing. [20]
--read-stumpiness-threshold FLOAT Filter out reads whose aligned length along a subread is lower than a percentage
of its corresponding reference length. [0.1]
Verbosity and debugging:
-d,--dump-evidence STR Dump evidence data. Valid choices: (variants, all, outliers, none). [none]
--evidence-directory DIR Directory to dump evidence into.
--annotate-gff Augment GFF variant records with additional information
--report-effective-coverage Additionally record the *post-filtering* coverage at variant sites.
Advanced configuration options:
-C,--reference-chunk-size INT Size of reference chunks. [500]
--reference-chunk-overlap INT Size of reference chunk overlaps. [5]
--simple-chunking Disable adaptive reference chunking.
--sort-strategy STR Read sorting strategy. Valid choices: (longest_and_strand_balanced, longest,
spanning, file_order). [longest_and_strand_balanced]
--min-poa-coverage INT Minimum number of reads required within a window to call consensus and variants
using arrow or poa. [3]
-h,--help Show this help and exit.
--version Show application version and exit.
-j,--num-threads INT Number of threads to use, 0 means autodetection. [0]
--log-level STR Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
--log-file FILE Log to a file, instead of stderr.
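Polishing is often run for more than one round: a second pass simply remaps the subreads to the new consensus and repeats the same commands (a sketch reusing the file names above):
$samtools faidx consensus.fasta
$pbmm2 align consensus.fasta pacbio.subreads.bam | samtools sort -@ 16 > map.round2.bam
$pbindex map.round2.bam
$gcpp -j 16 -r consensus.fasta -o variants.round2.vcf -o consensus.round2.fasta map.round2.bam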
If all of this feels like too much work, you can instead run hoptop's arrow_polish.sh. Unlike the steps above, it expects every raw BAM file to be listed in a file of filenames:
- input.fofn: one line per file, each looking like xxx.subreads.bam (the subread data from the provider).
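The list can be generated with a simple ls (the directory is a placeholder):
$ls /path/to/subreads/*.subreads.bam > input.fofn
The script itself: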
#!/bin/bash
set -e
set -o pipefail
set -u
REF=$1       # the assembly to polish
BAM=$2       # file of filenames listing the subreads.bam files
THREADS=100
source activate pb-assembly
# index the reference unless an index already exists
if [ ! -f $REF.fai ]; then
    samtools faidx $REF
fi
# align the subreads with pbalign/blasr unless an alignment already exists
if [ ! -f aln.bam ]; then
    pbalign \
        --tmpDir=./ --nproc=${THREADS} \
        --minAccuracy=0.75 --minLength=50 \
        --minAnchorSize=12 --maxDivergence=30 --concordant --algorithm=blasr \
        --algorithmOptions=--useQuality --maxHits=1 --hitPolicy=random --seed=1 \
        $BAM ${REF} aln.bam
fi
# arrow consensus calling
variantCaller --algorithm=arrow \
    -x 5 -X 120 -q 20 -j 24 \
    -r $REF aln.bam \
    -o cns.fasta -o cns.fastq || echo arrow failed
Finally, run it:
$bash arrow_polish.sh raw_assembly.fa input.fofn
Part 3: Merging the assemblies produced by different assemblers to improve overall assembly quality.
1. Install quickmerge
$unzip quickmerge-master.zip
$cd quickmerge-master
$bash make_merger.sh
$export PATH=/data1/spider/ytbiosoft/soft/quickmerge-master:/data1/spider/ytbiosoft/soft/quickmerge-master/MUMmer3.23:$PATH
or
$source /data1/spider/ytbiosoft/soft/quickmerge-master/.quickmerge
Remember to put quickmerge (and its bundled MUMmer) on your PATH after installing!
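A quick check that everything resolves:
$which quickmerge nucmer delta-filter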
- Check the help:
Usage: quickmerge -d out.delta -q query.fasta -r reference.fasta -hco (default=5.0) -c (default=1.5) -l seed_length_cutoff -ml merging_length_cutoff -p prefix
=========================================================
quickmerge version 0.3
Options:
-d : delta alignment file from nucmer
-q : fasta used as query in nucmer
-r : fasta used as reference in nucmer
-hco : seed alignment HCO cutoff (default=5.0)
-c : high confidence overlap cutoff (default=1.5)
-l : seed alignment length cutoff (long integer)
-ml : merging length cutoff (integer)
-p : output prefix
-h/--help : prints this help
- Options
-d: the delta alignment file produced by nucmer
-q: the FASTA used as the query in nucmer
-r: the FASTA used as the reference in nucmer
-hco: seed alignment HCO cutoff (default 5.0)
-c: high-confidence overlap cutoff (default 1.5)
-l: seed alignment length cutoff (long integer)
-ml: merging length cutoff (integer)
-p: output file prefix
2. Run it
- Option 1: the simplest way is to run the bundled Python wrapper script:
$merge_wrapper.py hybrid_assembly.fasta self_assembly.fasta
Run merge_wrapper.py -h yourself for the detailed options.
- Option 2: run the steps manually
$nucmer -l 100 -p out -t 8 reference.fa query.fa
$delta-filter -i 95 -r -q out.delta > out.rq.delta
$quickmerge -d out.rq.delta -q query.fa -r reference.fa -hco 5.0 -c 1.5 -l 520000 -ml 10000
- As a starting point, set -l to the N50 of the reference (-r) assembly, and keep -ml above 5000; you can get the N50 as shown below.
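One quick way to get an assembly's N50 (assuming seqkit is installed; any assembly statistics tool works just as well):
$seqkit stats -a reference.fa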
Both nucmer and delta-filter are programs from the MUMmer package. quickmerge bundles its own copy of MUMmer, but if you want to dig deeper you can also install it yourself:
- MUMmer homepage: http://mummer.sourceforge.net
- MUMmer on GitHub: https://github.com/mummer4/mummer
- nucmer options and usage
$nucmer [options] <Reference> <Query>
-l|minmatch: minimum length of a single match (default 20)
-p|prefix: prefix of the output files (default out)
- delta-filter options and usage
$delta-filter [options] <deltafile>
-i float: minimum alignment identity [0, 100] (default 0)
-r: keep the best hit for each reference position, allowing query overlaps (many-to-many)
-q: keep the best hit for each query position, allowing reference overlaps (many-to-many)
A few tips
1. It can be used to merge two different long-molecule-only assemblies (e.g. one generated with PBcR or canu and another generated with FALCON).
2. You can run Ka-kit's finisherSC after running quickmerge to improve the contiguity even further.
3. Assembly polishing with Quiver and Pilon before and after assembly merging is strongly recommended; a sketch of that order follows.
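To make the recommended order concrete, combining the three parts of this post looks roughly like this (all file names are placeholders):
# 1. polish each input assembly first (Part 1 and/or Part 2)
# 2. merge the polished assemblies (Part 3)
$merge_wrapper.py polished_hybrid.fasta polished_self.fasta
# 3. run another Pilon/arrow polishing round on the merged output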