生物信息学习生物信息学生物信息学与算法

生信入门:序列比对之diamond

2019-05-14  本文已影响1人  基因的生物信息学分析

宏基因组分析当中得到基因集之后,需要知道这些基因的功能,才能分析样品表型与基因之间的关系,这就需要对基因进行功能注释;无参转录组分析软件得到unigene文件,也需要对转录本进行功能注释。

基因功能注释是将基因序列比对回数据库,根据序列相似性预测基因功能。

为什么不用blast?

太慢。。。

2015年nature methods上发布了一款的比对软件DIAMOND,更快更灵敏,官方测试其性能为blast+的500~20000倍。

https://github.com/bbuchfink/diamond/

1. 安装diamond程序

在diamond下载界面获得下载链接

<pre class="prettyprint linenums prettyprinted" style="box-sizing: border-box; margin: 0px; padding: 8px 0px 6px; font-family: consolas, menlo, courier, monospace, "Microsoft Yahei" !important; background: rgb(241, 239, 238); border: 1px solid rgb(226, 226, 226) !important; display: block; border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 300; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial; font-size: 10px; line-height: 12px;">

  1. wget http://github.com/bbuchfink/diamond/releases/download/v0.9.24/diamond-linux64.tar.gz
  2. tar xzf diamond-linux64.tar.gz

</pre>

解压结果为一个二进制可执行文件 diamond, 直接运行就可以

下面是对diamond简短介绍:

  1. 有四个主要的程序:
  1. -e 指定E-value,默认0.001,比blast的默认值10更加严格 -f 输出格式,我比较喜欢6,和blast+一样,可以选择输出哪些fields -o 输出到哪个文件中 -p 指定使用的核心数目

2. 基本用法

<pre class="prettyprint linenums prettyprinted" style="box-sizing: border-box; margin: 0px; padding: 8px 0px 6px; font-family: consolas, menlo, courier, monospace, "Microsoft Yahei" !important; background: rgb(241, 239, 238); border: 1px solid rgb(226, 226, 226) !important; display: block; border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 300; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial; font-size: 10px; line-height: 12px;">

  1. diamond help

</pre>

diamond v0.8.22.84 | by Benjamin Buchfink buchfink@gmail.com Check http://github.com/bbuchfink/diamond for updates.

Syntax: diamond COMMAND [OPTIONS]

Commands: makedb Build DIAMOND database from a FASTA file blastp Align amino acid query sequences against a protein reference database blastx Align DNA query sequences against a protein reference database view View DIAMOND alignment archive (DAA) formatted file help Produce help message version Display version information getseq Retrieve sequences from a DIAMOND database file

General options: --threads (-p) number of CPU threads --db (-d) database file --out (-o) output file --outfmt (-f) output format 5 = BLAST XML 6 = BLAST tabular 100 = DIAMOND alignment archive (DAA) 101 = SAM

<pre class="prettyprint linenums prettyprinted" style="box-sizing: border-box; margin: 0px; padding: 8px 0px 6px; font-family: consolas, menlo, courier, monospace, "Microsoft Yahei" !important; background: rgb(241, 239, 238); border: 1px solid rgb(226, 226, 226) !important; display: block; border-radius: 0px; overflow-y: auto; font-size: 10px; line-height: 12px;">

  1. Value 6 may be followed by a space-separated list of these keywords:

  2. qseqid means Query Seq - id

  3. qlen means Query sequence length

  4. sseqid means Subject Seq - id

  5. sallseqid means All subject Seq - id(s), separated by a ';'

  6. slen means Subject sequence length

  7. qstart means Start of alignment in query

  8. qend means End of alignment in query

  9. sstart means Start of alignment in subject

  10. send means End of alignment in subject

  11. qseq means Aligned part of query sequence

  12. sseq means Aligned part of subject sequence

  13. evalue means Expect value

  14. bitscore means Bit score

  15. score means Raw score

  16. length means Alignment length

  17. pident means Percentage of identical matches

  18. nident means Number of identical matches

  19. mismatch means Number of mismatches

  20. positive means Number of positive - scoring matches

  21. gapopen means Number of gap openings

  22. gaps means Total number of gaps

  23. ppos means Percentage of positive - scoring matches

  24. qframe means Query frame

  25. stitle means Subject Title

  26. salltitles means All Subject Title(s), separated by a '<>'

  27. qcovhsp means Query Coverage Per HSP

  28. Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

</pre>

--verbose (-v) verbose console output --log enable debug log --quiet disable console output

Makedb options: --in input reference file in FASTA format

Aligner options: --query (-q) input query file --max-target-seqs (-k) maximum number of target sequences to report alignments for --top report alignments within this percentage range of top alignment score (overrides --max-target-seqs) --compress compression for output files (0=none, 1=gzip) --evalue (-e) maximum e-value to report alignments --min-score minimum bit score to report alignments (overrides e-value setting) --id minimum identity% to report an alignment --query-cover minimum query cover% to report an alignment --sensitive enable sensitive mode (default: fast) --more-sensitive enable more sensitive mode (default: fast) --block-size (-b) sequence block size in billions of letters (default=2.0) --index-chunks (-c) number of chunks for index processing --tmpdir (-t) directory for temporary files --gapopen gap open penalty (default=11 for protein) --gapextend gap extension penalty (default=1 for protein) --matrix score matrix for protein alignment (default=BLOSUM62) --custom-matrix file containing custom scoring matrix --lambda lambda parameter for custom matrix --K K parameter for custom matrix --seg enable SEG masking of queries (yes/no) --salltitles print full subject titles in output files

Advanced options: --run-len (-l) mask runs between stop codons shorter than this length --freq-sd number of standard deviations for ignoring frequent seeds --id2 minimum number of identities for stage 1 hit --window (-w) window size for local hit search --xdrop (-x) xdrop for ungapped alignment --ungapped-score minimum alignment score to continue local extension --hit-band band for hit verification --hit-score minimum score to keep a tentative alignment --gapped-xdrop (-X) xdrop for gapped alignment in bits --band band for dynamic programming computation --shapes (-s) number of seed shapes (0 = all available) --index-mode index mode (0=4x12, 1=16x9) --fetch-size trace point fetch size --rank-factor include subjects within this range of max-target-seqs --rank-ratio include subjects within this ratio of last hit --single-domain Discard secondary domains within one target sequence --dbsize effective database size (in letters) --no-auto-append disable auto appending of DAA and DMND file extensions --target-fetch-size number of target sequences to fetch for seed extension

View options: --daa (-a) DIAMOND alignment archive (DAA) file --forwardonly only show alignments of forward strand

Getseq options: --seq Sequence numbers to display.

1) 建库

<pre class="prettyprint linenums prettyprinted" style="box-sizing: border-box; margin: 0px; padding: 8px 0px 6px; font-family: consolas, menlo, courier, monospace, "Microsoft Yahei" !important; background: rgb(241, 239, 238); border: 1px solid rgb(226, 226, 226) !important; display: block; border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 300; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial; font-size: 10px; line-height: 12px;">

  1. diamond makedb --in nr.faa -d nr

</pre>

2) 序列比对

上面建库之后会生成一个 nr.dmnd 文件

<pre class="prettyprint linenums prettyprinted" style="box-sizing: border-box; margin: 0px; padding: 8px 0px 6px; font-family: consolas, menlo, courier, monospace, "Microsoft Yahei" !important; background: rgb(241, 239, 238); border: 1px solid rgb(226, 226, 226) !important; display: block; border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 300; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial; font-size: 10px; line-height: 12px;">

  1. diamond blastx -d nr -q gene.fasta -o matches.m8

</pre>

matches.m8文件内容如下:

image-20190407132541383

每一列是什么呢? BLASTn output format 6

Column headers: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

1. qseqid query sequence id
2. sseqid subject (e.g., reference genome) sequence id
3. pident percentage of identical matches
4. length alignment length
5. mismatch number of mismatches
6. gapopen number of gap openings
7. qstart start of alignment in query
8. qend end of alignment in query
9. sstart start of alignment in subject
10. send end of alignment in subject
11. evalue expect value
12. bitscore bit score

3)可以自定义输出结果

加上参数-outfmt,例如:

<pre class="prettyprint linenums prettyprinted" style="box-sizing: border-box; margin: 0px; padding: 8px 0px 6px; font-family: consolas, menlo, courier, monospace, "Microsoft Yahei" !important; background: rgb(241, 239, 238); border: 1px solid rgb(226, 226, 226) !important; display: block; border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 300; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial; font-size: 10px; line-height: 12px;">

  1. -outfmt "6 qseqid sseqid pident qlen length mismatch gapope evalue bitscore"

</pre>

4)对结果进行筛选

<pre class="prettyprint linenums prettyprinted" style="box-sizing: border-box; margin: 0px; padding: 8px 0px 6px; font-family: consolas, menlo, courier, monospace, "Microsoft Yahei" !important; background: rgb(241, 239, 238); border: 1px solid rgb(226, 226, 226) !important; display: block; border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 300; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial; font-size: 10px; line-height: 12px;">

  1. #选取identity>80的结果
  2. awk -F"\t" '$3>80' matches.m8 > matches.m8.identity80

</pre>

image-20190407133045033

<pre class="prettyprint linenums prettyprinted" style="box-sizing: border-box; margin: 0px; padding: 8px 0px 6px; font-family: consolas, menlo, courier, monospace, "Microsoft Yahei" !important; background: rgb(241, 239, 238); border: 1px solid rgb(226, 226, 226) !important; display: block; border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 300; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial; font-size: 10px; line-height: 12px;">

  1. #选取比对最好的一条结果
  2. sort -k1,1 -u matches.m8 > matches.m8.one

</pre>

image-20190407134259166

<center style="box-sizing: border-box; color: rgb(80, 97, 109); font-family: sans-serif, Avenir, -apple-system-font, 微软雅黑; font-size: 17px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 300; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">感谢您的支持,欢迎点击“在看"和转发!!</center>

<center style="box-sizing: border-box; color: rgb(80, 97, 109); font-family: sans-serif, Avenir, -apple-system-font, 微软雅黑; font-size: 17px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 300; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">长按下方二维码,即可关注 “基因的生物信息学分析”。</center>

image

长按下方二维码,即可加入“基因的生物信息学分析”讨论群。

image
上一篇下一篇

猜你喜欢

热点阅读