bwa 使用指南
bwa - Burrows-Wheeler Alignment Tool
#1. 介绍
BWA 是一个能将差异较小的测序数据比对到参考基因组的工具。它包含三种算法:BWA-backtrack, BWA-SW 和 BWA-MEM.
- BWA-backtrack:适用于Illumina 数据,最长100bp; aln/samse/sampe
- BWA-SW:长序列,70bp-1Mbp;bwasw
- BWA-MEM:长序列,70bp-1Mbp;mem
注:但是对于高质量的数据,通常推荐使用最新的BWA-MEM,因为它更快、更准确。对于70-100bp Illumina读数,BWA-MEM也比BWA-backtrack有更好的性能。
对于所有的算法,BWA首先需要为参考基因组构建fm-index(index 命令)。使用不同的子命令调用比对算法: BWA-backtrack可调用aln/samse/sampe,BWA-SW可调用bwasw,对于BWA-MEM算法调用mem。
#2. 安装
bunzip2 bwa-0.5.9.tar.bz2
tar xvf bwa-0.5.9.tar
cd bwa-0.5.9
make
- 添加到环境路径
export PATH=$PATH:/path/to/bwa-0.5.9
source ~/.bashrc
- 测试
$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.5a-r405
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'. There are
three alignment algorithms in BWA: `mem', `bwasw' and `aln/samse/sampe'. If
you are not sure which to use, try `bwa mem' first. Please `man ./bwa.1' for
for the manual.
#3. 使用示例
- 创建参考索引
bwa index -p hg19bwaidx -a bwtsw wg.fa
- 比对:使用4个CPU,reads在s_3_sequence.txt.gz文件
bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
注:BWA 可以读取压缩文件;当超过10个cpu比对时,观察到SAM输出有问题。
- 将格式转换为sam
bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
- Mapping short reads to RefSeq mRNAs
bwa samse RefSeqbwaidx s_3_sequence.txt.bwa s_3_sequence.txt > s_3_sequence.txt.sam
- Mapping long reads (454)
bwa bwasw hg19bwaidx 454seqs.txt > 454seqs.sam
#4. 常用命令行
bwa index ref.fa
bwa mem ref.fa reads.fq > aln-se.sam
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
bwa aln ref.fa short_read.fq > aln_sa.sai
bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
bwa bwasw ref.fa long_read.fq > aln.sam
#5. 命令及参数
##5.1 index
-
FASTA格式的数据建立索引.
-
bwa index [-p prefix] [-a algoType] <in.db.fasta>
参数
-p STR 输出前缀
-a STR BWT索引构建算法 (is和bwtsw)
- is 需要5.37N的内存,其中N是数据库的大小。IS的速度比较快,但是不能用于大于2GB的数据库。由于其简单性,IS是默认算法。较小的参考基因组。
- bwtsw 算法内置于BWT-SW。这种方法适用于整个人类基因组。较大的参考基因组
##5.2 mem
- bwa mem [-aCHMpP] [-t nThreads] [-k minSeedLen] [-w bandWidth] [-d zDropoff] [-r seedSplitRatio] [-c maxOcc] [-A matchScore] [-B mmPenalty] [-O gapOpenPen] [-E gapExtPen] [-L clipPen] [-U unpairPen] [-R RGline] [-v verboseLevel] db.prefix reads.fq [mates.fq]
使用BWA-MEM算法比对70bp-1Mbp的测序数据;简而言之,此算法是基于最大精确匹配(maximal exact matches,MEMs)的种子比对(seeding alignments),然后通过the affine-gap Smith-Waterman algorithm (SW)延伸种子长度。
当没有mates.fq文件以及 没有指定参数-p时,测序会被认为是单端测序。
- mates.fq存在,reads.fq与mates.fq中数据组成pair read。
- 使用-p,reads.fq文件中的2i和(2i+1)read 组成一对read.
在paired-end模式下,mem命令将从部分reads推断read 方向和插入大小分布。
BWA-MEM 进行local alignment,因此对于一个read可能产生多个匹配结果。
参数
-t INT: 线程数; 默认1
-k INT:种子最短长度,小于种子长度的比对结果会被丢弃。默认19
-w INT: gap 最大长度。值得一提,最大gap长度也受评分矩阵和命中长度的影响,而不仅仅由这个参数决定。默认100
-d INT:Off-diagonal X-dropoff (Z-dropoff). [100]
-r FLOAT:a MEM longer than minSeedLen*FLOAT时,重新生成种子。这是调优性能的一个探索式参数。值越大,得到的种子越少,校准速度越快,精度越低。[1.5]
-c INT:当一个MEM在参考基因组多余INT次结果,丢弃。[10000]
-P: In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.
-A INT:匹配分数[1]
-B INT:Mismatch 罚分;序列错误率约为{.75 * exp[-log(4) * B/A]}[4]
-O INT:gap罚分。[6]
-E INT:Gap延伸罚分,A gap of length k costs O + k*E (i.e. -O is for opening a zero-length gap). [1]
-L INT:read剪切罚分;进行SW 延伸时。碱基剪切后,匹配分数减去罚分结果更差,则不进行[5]
-U INT:Penalty for an unpaired read pair。[9]
-p:假定第一个文件就包含双端测序的所有数据,查看另一个参数-P。
-R STR:Complete read group header line. ’@RG\tID:foo\tSM:bar’. [null]
-T INT:匹配打分的最低阈值。 [30]
-a:输出比对的所有结果。
-C:Append append FASTA/Q comment to SAM output.
-H:Use hard clipping ’H’ in the SAM output.
-M:Mark shorter split hits as secondary (for Picard compatibility).
-v INT:Control the verbose level of the output.
##5.3 aln
- bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> > <out.sai>
使用aln查找read SA坐标;在第一个种子区域的最大差异设置maxSeedDiff ,整个read序列匹配的差异阈值设置:maxDiff
-n NUM:最大的编辑距离,或者百分比;[0.04]
-o INT:gap最大数量;[1]
-e INT:Maximum number of gap extensions, -1 for k-difference mode (disallowing long gaps) [-1]
-d INT:3端最大删除碱基数目;[16]
-i INT:末端插入最大碱基数目;[5]
-l INT:Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35 for ‘-k 2’. [inf]
-k INT:种子最大编辑距离;[2]
-t INT:线程数
-M INT: 错误罚分[3]
-O INT: Gap open penalty [11]
-E INT: Gap extension penalty [4]
-R INT: Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).
-c: Reverse query but not complement it, which is required for alignment in the color space. (Disabled since 0.6.x)
-N: 禁用迭代搜索。所有没有超过maxDiff 的匹配都被找到
-q INT: Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]
-I:The input is in the Illumina 1.3+ read format (quality equals ASCII-64).
-B INT:Length of barcode starting from the 5’-end. When INT is positive, the barcode of each read will be trimmed before mapping and will be written at the BC SAM tag. For paired-end reads, the barcode from both ends are concatenated. [0]
-b: Specify the input read sequence file is the BAM format. For paired-end data, two ends in a pair must be grouped together and options -1 or -2 are usually applied to specify which end should be mapped. Typical command lines for mapping pair-end data in the BAM format are:
bwa aln ref.fa -b1 reads.bam > 1.sai
bwa aln ref.fa -b2 reads.bam > 2.sai
bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam
-0: When -b is specified, only use single-end reads in mapping.
-1: When -b is specified, only use the first read in a read pair in mapping (skip single-end reads and the second reads).
-2:When -b is specified, only use the second read in a read pair in mapping.
##5.4 samse
-
bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
-
单端测序,产生SAM文件,重复序列随机选择。
-n INT:Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
-r STR:Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]
##5.5 sampe双端测序
- bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
- 双端测序,产生SAM文件,重复序列随机选择。
-a INT:最大插入片段大小. [500]
-o INT:Maximum occurrences of a read for pairing. A read with more occurrneces will be treated as a single-end read. Reducing this parameter helps faster pairing. [100000]
-P: Load the entire FM-index into memory to reduce disk operations (base-space reads only). With this option, at least 1.25N bytes of memory are required, where N is the length of the genome.
-n INT: Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
-N INT:Maximum number of alignments to output in the XA tag for disconcordant read pairs (excluding singletons). If a read has more than INT hits, the XA tag will not be written. [10]
-r STR:指定read group ‘@RG\tID:foo\tSM:bar’. [null]
##5.6 bwasw
bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N nHspRev] [-c thresCoef] <in.db.fasta> <in.fq> [mate.fq]
Align query sequences in the in.fq file. When mate.fq is present, perform paired-end alignment. The paired-end mode only works for reads Illumina short-insert libraries. In the paired-end mode, BWA-SW may still output split alignments but they are all marked as not properly paired; the mate positions will not be written if the mate has multiple local hits.
OPTIONS:
-a INT 比对分数 [1]
-b INT 错配罚分 [3]
-q INT Gap 罚分 [5]
-r INT Gap 延伸罚分. The penalty for a contiguous gap of size k is q+kr. [2]
-t INT 线程数 [1]
-w INT Band width in the banded alignment [33]
-T INT Minimum score threshold divided by a [37]
-c FLOAT Coefficient for threshold adjustment according to query length. Given an l-long query, the threshold for a hit to be retained is amax{T,c*log(l)}. [5.5]
-z INT Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1]
-s INT Maximum SA interval size for initiating a seed. Higher -s increases accuracy at the cost of speed. [3]
-N INT Minimum number of seeds supporting the resultant alignment to skip reverse alignment. [5]
##5.7 SAM比对格式
aln的结果是二进制,只适用于BWA 。BWA 将其转换为SAM (Sequence Alignment/Map) 格式:
Col | Field | Description |
---|---|---|
1 | QNAME | Query (pair) NAME |
2 | FLAG | bitwise FLAG |
3 | RNAME | Reference sequence NAME |
4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
5 | MAPQ | MAPping Quality (Phred-scaled) |
6 | CIAGR | extended CIGAR string |
7 | MRNM | Mate Reference sequence NaMe (‘=’ if same as RNAME) |
8 | MPOS | 1-based Mate POSistion |
9 | ISIZE | Inferred insert SIZE |
10 | SEQ | query SEQuence on the same strand as the reference |
11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
FLAG 内容:
Chr | Flag | Description |
---|---|---|
p | 0x0001 | the read is paired in sequencing |
P | 0x0002 | the read is mapped in a proper pair |
u | 0x0004 | the query sequence itself is unmapped |
U | 0x0008 | the mate is unmapped |
r | 0x0010 | strand of the query (1 for reverse) |
R | 0x0020 | strand of the mate |
1 | 0x0040 | the read is the first read in a pair |
2 | 0x0080 | the read is the second read in a pair |
s | 0x0100 | the alignment is not primary |
f | 0x0200 | QC failure |
d | 0x0400 | optical or PCR duplicate |
BWA 会产生以下内容;X开始是BWA特有的内容。
Tag | Meaning |
---|---|
NM | Edit distance |
MD | Mismatching positions/bases |
AS | Alignment score |
BC | Barcode sequence |
X0 | Number of best hits |
X1 | Number of suboptimal hits found by BWA |
XN | Number of ambiguous bases in the referenece |
XM | Number of mismatches in the alignment |
XO | Number of gap opens |
XG | Number of gap extentions |
XT | Type: Unique/Repeat/N/Mate-sw |
XA | Alternative hits; format: (chr,pos,CIGAR,NM;)* |
XS | Suboptimal alignment score |
XF | Support from forward/reverse alignment |
XE | Number of supporting seeds |
Note that XO and XG are generated by BWT search while the CIGAR string by Smith-Waterman alignment. These two tags may be inconsistent with the CIGAR string. This is not a bug.
##5.8 比对精度
禁用种子比对时,BWA 会找到匹配(最大差异(maxDiff 设置),最大gap数量(maxGapO), read任何一端都剪切超过n bp (nIndelEnd ));如果maxGapE 是正数u,Longer gaps也会被寻找到,但是不保证找到所有的匹配。如果使用种子匹配,
BWA 报账第一个种子长度的序列与reference 的距离不大于maxSeedDiff 。
当不使用gapped alignment时,BWA 将‘N’改为随机核苷酸,并且与之的匹配也会被保留。