生信分析|MUMmer4+SyRi

2021-08-25 本文已影响0人行走的脱发机double

前言

Comparing SVs between two genome：use whole-genome alignment

1、MUMmer4

MUMmer is a system for rapidly aligning DNA and protein sequences.

## install 
conda create -n mummer4 -c bioconda mummer4
## use nucmer 对齐(out out.delta)
nucmer --maxmatch --noextend -t 24 Q1ref.fa Q8ref.fa
## delta-filter 整理
delta-filter -1 -l 1000 out.delta > out.filter.delta
## show-coords
show-coords -THrd out.filter.delta > out.filter.coords

2.SyRi

安装

conda create -n syri python=3.5
conda activate syri
conda install cython numpy scipy pandas=0.23.4 biopython psutil matplotlib=3.0.0
conda install -c conda-forge python-igraph
conda install -c bioconda pysam
#use chroder
conda install -c bioconda longestrunsubsequence 
git clone --single-branch --branch V1.4.1 https://github.com.cnpmjs.org/schneebergerlab/syri.git
cd syri
python3 setup.py install                    
chmod +x syri/bin/syri syri/bin/chroder syri/bin/plotsr

## syri
python3 syri_path --nosnp -c out.filtered.coords -d out.delta -r  ref.fa -q query.fa
## syri plot
python3 syri_plotsr_plot syri.out ref.fa query.fa -H 16 -W 12 -o pdf 
## chroder (非必要步骤，query genome没有挂载到染色体时，可用chrorder生成一个)
syri/bin/chroder out.filter.coords Q1ref.fa Q8ref.fa -o out.filter.chroder

参数说明

## syri
--nosnp  Set to skip SNP/Indel (within alignment) identification (default: False)
## syri plot
-H 16   height of the plot
-W 12  width of the plot
-o pdf   output file format(pdf,png,svg)

3、参数说明

3.1 nucmer参数

Usage: nucmer [options] ref:path qry:path+

nucmer generates nucleotide alignments between two mutli-FASTA input
files. The out.delta output file lists the distance between insertions
and deletions that produce maximal scoring alignments between each
sequence. The show-* utilities know how to read this format.

By default, nucmer uses anchor matches that are unique in in the
reference but not necessarily unique in the query. See --mum and
--maxmatch for different bevahiors.

Options (default value in (), *required):
     --mum                                Use anchor matches that are unique in both the reference and query (false)
     --maxmatch                           Use all anchor matches regardless of their uniqueness (false)（可能包含重复序列的结果）
 -b, --breaklen=uint32                    Set the distance an alignment extension will attempt to extend poor scoring regions before giving up (200)
 -c, --mincluster=uint32                  Sets the minimum length of a cluster of matches (65)
 -D, --diagdiff=uint32                    Set the maximum diagonal difference between two adjacent anchors in a cluster (5)
 -d, --diagfactor=double                  Set the maximum diagonal difference between two adjacent anchors in a cluster as a differential fraction of the gap length (0.12)
     --noextend                           Do not perform cluster extension step (false)，设置是否对clustering两端进行延伸。默认设置--extend
 -f, --forward                            Use only the forward strand of the Query sequences (false)
 -g, --maxgap=uint32                      Set the maximum gap between two adjacent matches in a cluster (90)
 -l, --minmatch=uint32                    Set the minimum length of a single exact match (20)
 -L, --minalign=uint32                    Minimum length of an alignment, after clustering and extension (0)
     --nooptimize                         No alignment score optimization, i.e. if an alignment extension reaches the end of a sequence, it will not backtrack to optimize the alignment score and instead terminate the alignment at the end of the sequence (false)
 -r, --reverse                            Use only the reverse complement of the Query sequences (false)
     --nosimplify                         Don't simplify alignments by removing shadowed clusters. Use this option when aligning a sequence to itself to look for repeats (false)
 -p, --prefix=PREFIX                      Write output to PREFIX.delta (out)
     --delta=PATH                         Output delta file to PATH (instead of PREFIX.delta)
     --sam-short=PATH                     Output SAM file to PATH, short format
     --sam-long=PATH                      Output SAM file to PATH, long format
     --save=PREFIX                        Save suffix array to files starting with PREFIX
     --load=PREFIX                        Load suffix array from file starting with PREFIX
     --batch=BASES                        Proceed by batch of chunks of BASES from the reference
 -t, --threads=NUM                        Use NUM threads (# of cores)
 -U, --usage                              Usage
 -h, --help                               This message
     --full-help                          Detailed help
 -V, --version                            Version

nucmer is designed for DNA sequence alignment and proteing is designed for protein or translated sequence alignment

3.2 delta-filter 参数

USAGE: delta-filter  [options]  <deltafile>

-1            1-to-1 alignment allowing for rearrangements
              (intersection of -r and -q alignments)
-g            1-to-1 global alignment not allowing rearrangements
-h            Display help information
-i float      Set the minimum alignment identity [0, 100], default 0
-l int        Set the minimum alignment length, default 0
-m            Many-to-many alignment allowing for rearrangements
              (union of -r and -q alignments)
-q            Maps each position of each query to its best hit in
              the reference, allowing for reference overlaps
-r            Maps each position of each reference to its best hit
              in the query, allowing for query overlaps
-u float      Set the minimum alignment uniqueness, i.e. percent of
              the alignment matching to unique reference AND query
              sequence [0, 100], default 0
-o float      Set the maximum alignment overlap for -r and -q options
              as a percent of the alignment length [0, 100], default 100

  Reads a delta alignment file from either nucmer or promer and
filters the alignments based on the command-line switches, leaving
only the desired alignments which are output to stdout in the same
delta format as the input. For multiple switches, order of operations
is as follows: -i -l -u -q -r -g -m -1. If an alignment is excluded
by a preceding operation, it will be ignored by the succeeding
operations.
  An important distinction between the -g option and the -1 and -m
options is that -g requires the alignments to be mutually consistent
in their order, while the -1 and -m options are not required to be
mutually consistent and therefore tolerate translocations,
inversions, etc. In general cases, the -m option is the best choice,
however -1 can be handy for applications such as SNP finding which
require a 1-to-1 mapping. Finally, for mapping query contigs, or
sequencing reads, to a reference genome, use -q.

3.3 show-coords 参数

Convert alignment information to a .TSV format as required by SyRI

USAGE: show-coords  [options]  <deltafile>

-b          Merges overlapping alignments regardless of match dir
            or frame and does not display any idenitity information.
-B          Switch output to btab format
-c          Include percent coverage information in the output
-d          Display the alignment direction in the additional
            FRM columns (default for promer)
-g          Deprecated option. Please use 'delta-filter' instead
-h          Display help information
-H          Do not print the output header
-I float    Set minimum percent identity to display
-k          Knockout (do not display) alignments that overlap
            another alignment in a different frame by more than 50%
            of their length, AND have a smaller percent similarity
            or are less than 75% of the size of the other alignment
            (promer only)
-l          Include the sequence length information in the output
-L long     Set minimum alignment length to display
-o          Annotate maximal alignments between two sequences, i.e.
            overlaps between reference and query sequences
-q          Sort output lines by query IDs and coordinates
-r          Sort output lines by reference IDs and coordinates
-T          Switch output to tab-delimited format

  Input is the .delta output of either the "nucmer" or the
"promer" program passed on the command line.
  Output is to stdout, and consists of a list of coordinates,
percent identity, and other useful information regarding the
alignment data contained in the .delta file used as input.
  NOTE: No sorting is done by default, therefore the alignments
will be ordered as found in the <deltafile> input.

3.4 syri参数说明：

usage

usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-F {T,S,B}] [-k] [--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--dir DIR] [--prefix PREFIX] [--seed SEED]
            [--nc NCORES] [--novcf] [-f] [--nosr] [--tdgaplen TDGL] [--tdmaxolp TDOLP] [-b BRUTERUNTIME] [--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT]
            [--inc INCREASEBY] [--no-chrmatch] [--nosv] [--nosnp] [--all] [--allow-offset OFFSET] [--cigar] [-s SSPATH]

Input Files:
  -c INFILE             File containing alignment coordinates (default: None)
  -r REF                Genome A (which is considered as reference for the alignments). Required for local variation (large indels, CNVs) identification. (default: None)
  -q QRY                Genome B (which is considered as query for the alignments). Required for local variation (large indels, CNVs) identification. (default: None)
  -d DELTA              .delta file from mummer. Required for short variation (SNPs/indels) identification when CIGAR string is not available (default: None)

optional arguments:
  -h, --help            show this help message and exit
  -F {T,S,B}            Input file type. T: Table, S: SAM, B: BAM (default: T)
  -k                    Keep intermediate output files (default: False)
  --log {DEBUG,INFO,WARN}
                        log level (default: INFO)
  --lf LOG_FIN          Name of log file (default: syri.log)
  --dir DIR             path to working directory (if not current directory). All files must be in this directory. (default: None)
  --prefix PREFIX       Prefix to add before the output file Names (default: )
  --seed SEED           seed for generating random numbers (default: 1)
  --nc NCORES           number of cores to use in parallel (max is number of chromosomes) (default: 1)
  --novcf               Do not combine all files into one output file (default: False)
  -f                    Filter out low quality alignments (default: True)

SR identification:
  --nosr                Set to skip structural rearrangement identification (default: False)
  --tdgaplen TDGL       Maximum allowed gap-length between two alignments of a multi-alignment translocation or duplication (TD). Larger values increases TD
                        identification sensitivity but also runtime. (default: 500000)
  --tdmaxolp TDOLP      Maximum allowed overlap between two translocations. Value should be in range (0,1]. (default: 0.8)
  -b BRUTERUNTIME       Cutoff to restrict brute force methods to take too much time (in seconds). Smaller values would make algorithm faster, but could have marginal
                        effects on accuracy. In general case, would not be required. (default: 60)
  --unic TRANSUNICOUNT  Number of uniques bps for selecting translocation. Smaller values would select smaller TLs better, but may increase time and decrease accuracy.
                        (default: 1000)
  --unip TRANSUNIPERCENT
                        Percent of unique region requried to select translocation. Value should be in range (0,1]. Smaller values would allow selection of TDs which are
                        more overlapped with other regions. (default: 0.5)
  --inc INCREASEBY      Minimum score increase required to add another alignment to translocation cluster solution (default: 1000)
  --no-chrmatch         Do not allow SyRI to automatically match chromosome ids between the two genomes if they are not equal (default: False)

ShV identification:
  --nosv                Set to skip structural variation identification (default: False)
  --nosnp               Set to skip SNP/Indel (within alignment) identification (default: False)
  --all                 Use duplications too for variant identification (default: False)
  --allow-offset OFFSET
                        BPs allowed to overlap (default: 5)
  --cigar               Find SNPs/indels using CIGAR string. Necessary for alignment generated using aligners other than nucmers (default: False)
  -s SSPATH             path to show-snps from mummer (default: show-snps)

chroder 参数

usage: chroder [-h] [-n NCOUNT] [-o OUT] [-noref] [-F {T,S,B}] coords ref qry

positional arguments:
  coords      Alignment coordinates in a tsv format
  ref         Assembly of genome A in multi-fasta format
  qry         Assembly of genome B in multi-fasta format

optional arguments:
  -h, --help  show this help message and exit
  -n NCOUNT   number of N's to be inserted (default: 500)
  -o OUT      output file prefix (default: out)
  -noref      Use this parameter when no assembly is at chromosome level
              (default: False)
  -F {T,S,B}  Input coords type. T: Table, S: SAM, B: BAM (default: T)

参考：

http://mummer.sourceforge.net/manual/
syri：https://www.jianshu.com/p/3571d7019fb7
syri fileformat：https://schneebergerlab.github.io/syri/fileformat.html
MUMmer：https://www.jianshu.com/p/c12f2a117892
nucmer：http://www.chenlianfu.com/?p=2559