生物信息学生物信息学与算法生物信息学习

肠道菌群:序列分类之kraken

2019-08-02  本文已影响13人  基因的生物信息学分析

细菌基因组测序完,想看看样本有没有被其他的菌污染?
人的转录组测序完,想快速看看人、微生物的序列的比例?
元/宏基因组测序完,想快速获得样本中物种的丰度信息?

REFERENCE

Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments.Genome Biology 2014, 15:R46.

Kraken 1

Kraken 1是2014年Wood DE在Genome Biology发表的宏基因组序列分类软件,能够快速对宏基因样品中的reads进行分类。Kraken在序列比对环节基于精确k-mer匹配和精简数据库的方法,采取精确匹配,其核心是Kraken有一种特殊数据库,用以预先计算序列中包含的特殊的Kmer序列。 image 下面是来自kraken官网关于各分类器的测评结果: image.png

Kraken速度很快,精度较低,适用于做微生物检测的预处理。通过一些实际的数据测试发现:与Metaphlan2相比,Kraken速度较快,获得的物种数目较多,相对应的假阳性率也较高。

Usage: kraken [options] <filename(s)>

Options:
  --db NAME               Name for Kraken DB
                          (default: none)
  --threads NUM           Number of threads (default: 1)
  --fasta-input           Input is FASTA format
  --fastq-input           Input is FASTQ format
  --fastq-output          Output in FASTQ format
  --gzip-compressed       Input is gzip compressed
  --bzip2-compressed      Input is bzip2 compressed
  --quick                 Quick operation (use first hit or hits)
  --min-hits NUM          In quick op., number of hits req'd for classification
                          NOTE: this is ignored if --quick is not specified
  --unclassified-out FILENAME
                          Print unclassified sequences to filename
  --classified-out FILENAME
                          Print classified sequences to filename
  --out-fmt FORMAT        Format for [un]classified sequence output. supported
                          options are: {legacy, paired, interleaved}
  --output FILENAME       Print output to filename (default: stdout); "-" will
                          suppress normal output
  --only-classified-output
                          Print no Kraken output for unclassified sequences
  --preload               Loads DB into memory before classification
  --paired                The two filenames provided are paired-end reads
  --check-names           Ensure each pair of reads have names that agree
                          with each other; ignored if --paired is not specified
  --help                  Print this message
  --version               Print version information

If none of the *-input or *-compressed flags are specified, and the
file is a regular file, automatic format detection is attempted.
$ kraken --threads 40 --db minikraken_20171013_4GB --preload --
paired --fastq-input --gzip-compressed  ${B}_1.fastq.gz ${B}_2.fastq.gz | kraken-report --db minikraken_20171013_4GB > "$B"_kraken.tab
$ less -S  "$B"_kraken.tab
50.30  27375076        27375076        U       0       unclassified
 49.70  27047187        13633   -       1       root
 49.35  26857009        348     -       131567    cellular organisms
 49.35  26855582        105532  D       2           Bacteria
 33.57  18270135        0       -       1783270       FCB group
 33.57  18269977        5       -       68336           Bacteroidetes/Chlorobi group
 33.57  18269761        107058  P       976               Bacteroidetes
 33.34  18144208        69      C       200643              Bacteroidia
 33.34  18144105        961382  O       171549                Bacteroidales
 30.47  16580771        0       F       815                     Bacteroidaceae
 30.47  16580771        2380340 G       816                       Bacteroides
 22.89  12454849        9623614 S       821                         Bacteroides vulgatus
  5.20  2831235 2831235 -       435590                        Bacteroides vulgatus ATCC 8482
  0.73  396849  0       S       357276                      Bacteroides dorei
  0.73  396849  396849  -       997877                        Bacteroides dorei CL03T12C01
  0.60  326487  326471  S       28116                       Bacteroides ovatus
  0.00  16      16      -       1379690                       Bacteroides ovatus V975
  0.42  227354  141795  S       818                         Bacteroides thetaiotaomicron
  0.16  85559   85559   -       226186                        Bacteroides thetaiotaomicron VPI-5482
  0.38  206113  206113  S       1796613                     Bacteroides caecimuris
  0.33  180286  180286  S       47678                       Bacteroides caccae
  0.29  156976  68554   S       817                         Bacteroides fragilis
  0.10  55847   55847   -       862962                        Bacteroides fragilis 638R
  0.03  17859   17859   -       295405                        Bacteroides fragilis YCH46
  0.03  14716   14716   -       272559                        Bacteroides fragilis NCTC 9343

如上是Kraken的结果,可以看出它没有估算出物种的丰度。

Bracken

这时可以使用另一款软件Bracken (Bayesian Reestimation of Abundance with KrakEN),它是一种从元基因组数据中计算物种丰度的高度准确的统计方法。

$ bracken -h
Usage: bracken -d MY_DB -i INPUT -o OUTPUT -r READ_LEN -l LEVEL -t THRESHOLD
  MY_DB          location of Kraken database
  INPUT          Kraken REPORT file to use for abundance estimation
  OUTPUT         file name for Bracken default output
  READ_LEN       read length to get all classifications for (default: 100)
  LEVEL          level to estimate abundance at [options: D,P,C,O,F,G,S] (default: S)
  THRESHOLD      number of reads required PRIOR to abundance estimation to perform reestimation (default: 0)
$ bracken -d minikraken_20171013_4GB/ -i $B\_kraken.tab -t 10 -o $B.out
#获得如下结果:
name    taxonomy_id     taxonomy_lvl    kraken_assigned_reads   added_reads     new_est_reads   fraction_total_reads
Uncultured phage WW-nAnB strain 2       1449896 S       388     113     501     0.00007
Uncultured phage WW-nAnB strain 3       1449897 S       349     131     480     0.00007
Aureimonas sp. AU20     1349819 S       108     39      147     0.00002
Phaeobacter piscinae    1580596 S       21      4       25      0.00000
Sinorhizobium sp. CCBAU 05631   794846  S       25      12      37      0.00001
Mucilaginibacter sp. BJC16-A31  1234841 S       31      0       31      0.00000
Arcanobacterium phocae  131112  S       34      0       34      0.00000
Kineococcus radiotolerans       131568  S       82      3       85      0.00001
Actinomyces radingae    131110  S       314     5       319     0.00005
Sediminicola sp. YIK13  1453352 S       19      0       19      0.00000
Methylotenera mobilis   359408  S       42      1       43      0.00001
Stenotrophomonas rhizophila     216778  S       76      52      128     0.00002
Acholeplasma oculi      35623   S       12      0       12      0.00000
Dictyoglomus turgidum   513050  S       12      0       12      0.00000
Chelatococcus sp.

NOTE: Kraken 2 is the newest version of Kraken (See Kraken 2's Webpage for details). Kraken 1 will continue to be available via the Kraken 1 Github page, but it is no longer being supported.

Kraken 2

与Kraken 1相比,Kraken 2有了很大的改进:
1.更快速的构建数据库
2.数据库的占用存储空间更少

  1. 更快的分类速度
    还能支持 16S Databases包括Greengenes, SILVA,和 RDP.

$ kraken2
Need to specify input filenames!
Usage: kraken2 [options] <filename(s)>

Options:
  --db NAME               Name for Kraken 2 DB
                          (default: none)
  --threads NUM           Number of threads (default: 1)
  --quick                 Quick operation (use first hit or hits)
  --unclassified-out FILENAME
                          Print unclassified sequences to filename
  --classified-out FILENAME
                          Print classified sequences to filename
  --output FILENAME       Print output to filename (default: stdout); "-" will
                          suppress normal output
  --confidence FLOAT      Confidence score threshold (default: 0.0); must be
                          in [0, 1].
  --minimum-base-quality NUM
                          Minimum base quality used in classification (def: 0,
                          only effective with FASTQ input).
  --report FILENAME       Print a report with aggregrate counts/clade to file
  --use-mpa-style         With --report, format report output like Kraken 1's
                          kraken-mpa-report
  --report-zero-counts    With --report, report counts for ALL taxa, even if
                          counts are zero
  --memory-mapping        Avoids loading database into RAM
  --paired                The filenames provided have paired-end reads
  --use-names             Print scientific names instead of just taxids
  --gzip-compressed       Input files are compressed with gzip
  --bzip2-compressed      Input files are compressed with bzip2
  --help                  Print this message
  --version               Print version information

If none of the *-compressed flags are specified, and the filename provided
is a regular file, automatic format detection is attempted.
$ kraken2\
    --db minikraken2_v2_8GB_201904_UPDATE/ \
    --threads 20 \
    --report report \
 --gzip-compressed   --paired \
${B}_1.fastq.gz ${B}_2.fastq.gz

感谢您的阅读,欢迎点赞、评论和转发!!

扫描或长按下方二维码,即可关注公众号: 基因的生物信息学分析

image

相关阅读

细菌基因组:结核杆菌测序耐药位点分析

一文搞定细菌基因组De Novo测序分析

肠道菌群:16S测序分析流程解读

肠道菌群:宏基因组测序分析流程解读(上)

上一篇下一篇

猜你喜欢

热点阅读