Polishing a third-generation (long-read) genome assembly
2019-10-15
多啦A梦的时光机_648d
Part 1: Polishing a long-read assembly with Illumina short reads using Pilon
1. Download the latest Pilon jar
$wget https://github.com/broadinstitute/pilon/releases/download/v1.23/pilon-1.23.jar
$java -Xmx10G -jar pilon-1.23.jar
Running the jar immediately threw an error:
$java -Xmx10G -jar pilon-1.23.jar
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/pilon/Pilon : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
The UnsupportedClassVersionError means the Java runtime here is older than the version the jar was compiled for (major version 52.0 corresponds to Java 8). Rather than upgrade Java, I downloaded the older pilon-1.22.jar, which ran fine:
$java -Xmx10G -jar pilon-1.22.jar
Pilon version 1.22 Wed Mar 15 16:38:30 2017 -0400
Usage: pilon --genome genome.fasta [--frags frags.bam] [--jumps jumps.bam] [--unpaired unpaired.bam]
[...other options...]
pilon --help for option details
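If you would rather keep Pilon 1.23, upgrade Java to 8 or newer instead. First check what the system is actually running; the conda line below is just one possible way to get a newer JRE (the openjdk package on conda-forge is an assumption, not part of the original workflow):
$java -version
$conda install -c conda-forge openjdk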
2. Prepare the data
- The draft genome assembled from the long reads: draft.fa
- Quality-controlled Illumina paired-end reads: read1_fq.gz read2_fq.gz
3. Align the reads with BWA
- Build the index
$bwa index -p index/draft draft.fa
- Align and sort
$bwa mem -t 16 index/draft read1_fq.gz read2_fq.gz | samtools sort -@ 10 -O bam -o align.bam
- Index the sorted BAM
$samtools index -@ 10 align.bam
4. Mark duplicates
$sambamba markdup -t 10 align.bam align_markdup.bam
5. Keep only high-quality alignments
$samtools view -@ 10 -b -q 30 align_markdup.bam > align_filter.bam
$samtools index -@ 10 align_filter.bam
6. Polish with Pilon
$java -Xmx10G -jar pilon-1.22.jar --genome draft.fa --frags align_filter.bam --fix snps,indels --output pilon_polished --vcf > pilon.log 2>&1 &
- Pilon options
--frags: a paired-end library with inserts under 1 kb; --jumps: a mate-pair library with inserts over 1 kb; --bam: let Pilon guess the library type
--vcf: write a VCF file with information for every base
--fix: what Pilon should correct; snps and indels are usually enough
--variant: heuristic variant calling, equivalent to --vcf --fix all,breaks; do not use this when polishing
--minmq: minimum mapping quality for reads used in Pilon's pileups (default 0)
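Putting steps 3 to 6 together, here is a minimal sketch of one short-read polishing round as a single script (pilon_round.sh is a hypothetical name; the file names, thread counts, and options are the ones used above):
#!/bin/bash
set -euo pipefail
DRAFT=$1   # draft assembly, e.g. draft.fa
R1=$2      # Illumina read 1, e.g. read1_fq.gz
R2=$3      # Illumina read 2, e.g. read2_fq.gz
# map the short reads against the draft
mkdir -p index
bwa index -p index/draft $DRAFT
bwa mem -t 16 index/draft $R1 $R2 | samtools sort -@ 10 -O bam -o align.bam
samtools index -@ 10 align.bam
# mark duplicates and keep confident alignments only
sambamba markdup -t 10 align.bam align_markdup.bam
samtools view -@ 10 -b -q 30 align_markdup.bam > align_filter.bam
samtools index -@ 10 align_filter.bam
# polish; the corrected assembly is written to pilon_polished.fasta
java -Xmx10G -jar pilon-1.22.jar --genome $DRAFT --frags align_filter.bam \
    --fix snps,indels --output pilon_polished --vcf > pilon.log 2>&1
Pilon is commonly run for two or three such rounds, feeding pilon_polished.fasta back in as the new draft each time.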
Part 2: Polishing with the PacBio data itself
1. Prepare the software
- samtools
- arrow
- pbmm2
- pbindex
The last two are easiest to get by installing PacBio's whole tool suite through conda:
# Install
conda create -n pb-assembly -c bioconda pb-assembly
# Activate
conda activate pb-assembly
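After activating the environment, it is worth a quick sanity check that the tools resolve (assuming pb-assembly pulled them all in; conda install any missing one into the same environment):
$pbmm2 --version
$samtools --version | head -n 1
$gcpp --version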
2. Prepare the data
- The assembled genome raw_assembly.fa (produced by FALCON, Canu, MECAT2, Flye, or a similar assembler)
- The raw BAM file(s) from the sequencing provider (named like XXX.subreads.bam)
3. Run
$samtools faidx raw_assembly.fa
$pbmm2 align raw_assembly.fa pacbio.subreads.bam | samtools sort -@ 16 > map.pacbio.bam
$pbindex map.pacbio.bam
$arrow -j 16 -r raw_assembly.fa -o variants.vcf -o consensus.fasta map.pacbio.bam
GenomicConsensus is a Python-2-only package and has been superseded by gcpp, a C++ reimplementation, so the arrow step can be run with gcpp instead. Its usage and options mirror GenomicConsensus, so the last step becomes:
$gcpp -j 16 -r raw_assembly.fa -o variants.vcf -o consensus.fasta map.pacbio.bam
For reference, gcpp's full usage and options:
gcpp - Compute genomic consensus from alignments and call variants relative to the reference.
Usage:
gcpp [options] <input.bam>
input.bam STR The input BAM file.
Required input/output files:
-r,--reference FILE The filename of the reference FASTA file.
-o,--output STR The output filename(s), as a comma-separated list. Valid output formats are
.fa/.fasta, .fq/.fastq, .gff, .vcf
Output filtering:
-q,--min-confidence INT The minimum confidence for a variant call to be output to variants.{gff,vcf} [40]
-x,--min-coverage INT The minimum site coverage that must be achieved for variant calls and consensus
to be calculated for a site. [5]
--no-evidence-call STR The consensus base that will be output for sites with no effective coverage.
Valid choices: (nocall, reference, lowercasereference). [lowercasereference]
Read selection/filtering:
-X,--coverage INT A designation of the maximum coverage level to be used for analysis. Exact
interpretation is algorithm-specific. The meaningful range of this argument is
[1, 1000]. [100]
--min-accuracy FLOAT The minimum acceptable window-global alignment accuracy for reads that will be
used for the analysis (arrow-only). [0.82]
-m,--min-map-qv INT The minimum MapQV for reads that will be used for analysis. [10]
--min-read-score FLOAT The minimum ReadScore for reads that will be used for analysis (arrow-only).
[0.65]
--min-snr FLOAT The minimum acceptable signal-to-noise over all channels for reads that will be
used for analysis (arrow-only). [2.5]
-w,--windows STR The window (or multiple comma-delimited windows) of the reference to be
processed, in the format refGroup:refStart-refEnd (default: entire reference).
Algorithm and parameter settings:
--algorithm STR The consensus algorithm used. Valid choices: (arrow, plurality, poa). [arrow]
--mask-radius INT Radius of window to use when excluding local regions for exceeding
maskMinErrorRate, where 0 disables any filtering (arrow-only). [3]
--mask-error-rate FLOAT Maximum local error rate before the local region defined by maskRadius is excluded
from polishing (arrow-only). [0.7]
-P,--parameters-file STR Path to a model file or directory containing model files.
-p,--parameters-spec STR Name of chemistry or model to use, overriding default selection.
--max-iterations INT Maximum number of iterations to polish the template. [40]
--max-poa-coverage INT Maximum number of sequences to use for consensus calling. [11]
--mutation-separation INT Find the best mutations within a separation window for iterative polishing. [10]
--mutation-neighborhood INT Find nearby mutations within neighborhood for iterative polishing. [20]
--read-stumpiness-threshold FLOAT Filter out reads whose aligned length along a subread is lower than a percentage
of its corresponding reference length. [0.1]
Verbosity and debugging:
-d,--dump-evidence STR Dump evidence data. Valid choices: (variants, all, outliers, none). [none]
--evidence-directory DIR Directory to dump evidence into.
--annotate-gff Augment GFF variant records with additional information
--report-effective-coverage Additionally record the *post-filtering* coverage at variant sites.
Advanced configuration options:
-C,--reference-chunk-size INT Size of reference chunks. [500]
--reference-chunk-overlap INT Size of reference chunk overlaps. [5]
--simple-chunking Disable adaptive reference chunking.
--sort-strategy STR Read sorting strategy. Valid choices: (longest_and_strand_balanced, longest,
spanning, file_order). [longest_and_strand_balanced]
--min-poa-coverage INT Minimum number of reads required within a window to call consensus and variants
using arrow or poa. [3]
-h,--help Show this help and exit.
--version Show application version and exit.
-j,--num-threads INT Number of threads to use, 0 means autodetection. [0]
--log-level STR Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
--log-file FILE Log to a file, instead of stderr.
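Polishing is often run for more than one round: a second pass simply remaps the subreads to the new consensus and repeats the same commands (a sketch reusing the file names above):
$samtools faidx consensus.fasta
$pbmm2 align consensus.fasta pacbio.subreads.bam | samtools sort -@ 16 > map.round2.bam
$pbindex map.round2.bam
$gcpp -j 16 -r consensus.fasta -o variants.round2.vcf -o consensus.round2.fasta map.round2.bam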
If all of this feels like too much work, you can instead run hoptop's arrow_polish.sh. Unlike the steps above, it expects every raw BAM file to be listed in a file of filenames:
- input.fofn: one line per file, each looking like xxx.subreads.bam (the subread data from the provider).
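The list can be generated with a simple ls (the directory is a placeholder):
$ls /path/to/subreads/*.subreads.bam > input.fofn
The script itself: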
#!/bin/bash
set -e
set -o pipefail
set -u
REF=$1       # the assembly to polish
BAM=$2       # file of filenames listing the subreads.bam files
THREADS=100
source activate pb-assembly
# index the reference unless an index already exists
if [ ! -f $REF.fai ]; then
    samtools faidx $REF
fi
# align the subreads with pbalign/blasr unless an alignment already exists
if [ ! -f aln.bam ]; then
    pbalign \
        --tmpDir=./ --nproc=${THREADS} \
        --minAccuracy=0.75 --minLength=50 \
        --minAnchorSize=12 --maxDivergence=30 --concordant --algorithm=blasr \
        --algorithmOptions=--useQuality --maxHits=1 --hitPolicy=random --seed=1 \
        $BAM ${REF} aln.bam
fi
# arrow consensus calling
variantCaller --algorithm=arrow \
    -x 5 -X 120 -q 20 -j 24 \
    -r $REF aln.bam \
    -o cns.fasta -o cns.fastq || echo arrow failed
Finally, run it:
$bash arrow_polish.sh raw_assembly.fa input.fofn
Part 3: Merging the assemblies produced by different assemblers to improve overall assembly quality.
1. Install quickmerge
$unzip quickmerge-master.zip
$cd quickmerge-master
$bash make_merger.sh
$export PATH=/data1/spider/ytbiosoft/soft/quickmerge-master:/data1/spider/ytbiosoft/soft/quickmerge-master/MUMmer3.23:$PATH
or
$source /data1/spider/ytbiosoft/soft/quickmerge-master/.quickmerge
Remember to put quickmerge (and its bundled MUMmer) on your PATH after installing!
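A quick check that everything resolves:
$which quickmerge nucmer delta-filter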
- Check the help:
Usage: quickmerge -d out.delta -q query.fasta -r reference.fasta -hco (default=5.0) -c (default=1.5) -l seed_length_cutoff -ml merging_length_cutoff -p prefix
=========================================================
quickmerge version 0.3
Options:
-d : delta alignment file from nucmer
-q : fasta used as query in nucmer
-r : fasta used as reference in nucmer
-hco : seed alignment HCO cutoff (default=5.0)
-c : high confidence overlap cutoff (default=1.5)
-l : seed alignment length cutoff (long integer)
-ml : merging length cutoff (integer)
-p : output prefix
-h/--help : prints this help
- Options
-d: the delta alignment file produced by nucmer
-q: the FASTA used as the query in nucmer
-r: the FASTA used as the reference in nucmer
-hco: seed alignment HCO cutoff (default 5.0)
-c: high-confidence overlap cutoff (default 1.5)
-l: seed alignment length cutoff (long integer)
-ml: merging length cutoff (integer)
-p: output file prefix
2. Run it
- Option 1: the simplest way is to run the bundled Python wrapper script:
$merge_wrapper.py hybrid_assembly.fasta self_assembly.fasta
Run merge_wrapper.py -h yourself for the detailed options.
- Option 2: run the steps manually
$nucmer -l 100 -p out -t 8 reference.fa query.fa
$delta-filter -i 95 -r -q out.delta > out.rq.delta
$quickmerge -d out.rq.delta -q query.fa -r reference.fa -hco 5.0 -c 1.5 -l 520000 -ml 10000
- As a starting point, set -l to the N50 of the reference (-r) assembly, and keep -ml above 5000; you can get the N50 as shown below.
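One quick way to get an assembly's N50 (assuming seqkit is installed; any assembly statistics tool works just as well):
$seqkit stats -a reference.fa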
Both nucmer and delta-filter are programs from the MUMmer package. quickmerge bundles its own copy of MUMmer, but if you want to dig deeper you can also install it yourself:
- MUMmer homepage: http://mummer.sourceforge.net
- MUMmer on GitHub: https://github.com/mummer4/mummer
- nucmer options and usage
$nucmer [options] <Reference> <Query>
-l|minmatch: minimum length of a single match (default 20)
-p|prefix: prefix of the output files (default out)
- delta-filter options and usage
$delta-filter [options] <deltafile>
-i float: minimum alignment identity [0, 100] (default 0)
-r: keep the best hit for each reference position, allowing query overlaps (many-to-many)
-q: keep the best hit for each query position, allowing reference overlaps (many-to-many)
A few tips
1. It can be used to merge two different long-molecule-only assemblies (e.g. one generated with PBcR or canu and another generated with FALCON).
2. You can run Ka-kit's finisherSC after running quickmerge to improve the contiguity even further.
3. Assembly polishing with Quiver and Pilon before and after assembly merging is strongly recommended; a sketch of that order follows.
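To make the recommended order concrete, combining the three parts of this post looks roughly like this (all file names are placeholders):
# 1. polish each input assembly first (Part 1 and/or Part 2)
# 2. merge the polished assemblies (Part 3)
$merge_wrapper.py polished_hybrid.fasta polished_self.fasta
# 3. run another Pilon/arrow polishing round on the merged output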