BUSCO (Benchmarking Universal Single-Copy Orthologs): a handy tool for assessing genome and transcriptome assembly quality, with E. coli genomes as a worked example
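I. Introduction to BUSCO
BUSCO was published in the journal Bioinformatics in 2015. Google Scholar showed 1,192 citations for the paper at the time of writing, which speaks to how widely the tool is accepted. BUSCO uses the conserved single-copy orthologs provided by the OrthoDB database as a benchmark for assessing the completeness of an assembled genome or transcriptome.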
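II. Installing BUSCO
1. Manual installation
A headache! If you really need it, see the Jianshu tutorial "BUSCO使用笔记" (BUSCO usage notes).
2. Installing the BUSCO virtual machine
For users' convenience, the authors packaged a preconfigured BUSCO installation as a virtual machine image. Download the image, BUSCO_UBUNTU_VM.ova, and import it into VirtualBox or VMware. Instructions for importing an image are on the official sites of those two programs, and a quick Baidu search turns up plenty of tutorials. I may write up the procedure later; it is not the focus here.
3. Installing BUSCO with Miniconda
Miniconda is a must-have for anyone starting out in bioinformatics! For how to use it, see "Miniconda的安装与使用(以转录组分析软件为例)" (Installing and using Miniconda, with transcriptome analysis software as the example).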
# Search the package repository for busco
czh@ubuntu:~$ conda search busco
Loading channels: done
# Name Version Build Channel
busco 1.2 py27_0 bioconda
busco 1.2 py27_0 anaconda/cloud/bioconda
busco 1.2 py27_1 bioconda
busco 1.2 py27_1 anaconda/cloud/bioconda
busco 1.2 py34_0 bioconda
busco 1.2 py34_0 anaconda/cloud/bioconda
busco 1.2 py34_1 bioconda
busco 1.2 py34_1 anaconda/cloud/bioconda
busco 1.2 py35_0 bioconda
busco 1.2 py35_0 anaconda/cloud/bioconda
busco 1.2 py35_1 bioconda
busco 1.2 py35_1 anaconda/cloud/bioconda
busco 2.0 py27_0 bioconda
busco 2.0 py27_0 anaconda/cloud/bioconda
busco 2.0 py34_0 bioconda
busco 2.0 py34_0 anaconda/cloud/bioconda
busco 2.0 py35_0 bioconda
busco 2.0 py35_0 anaconda/cloud/bioconda
busco 2.0 py36_0 bioconda
busco 2.0 py36_0 anaconda/cloud/bioconda
busco 2.0.1 py27_0 bioconda
busco 2.0.1 py27_0 anaconda/cloud/bioconda
busco 2.0.1 py34_0 bioconda
busco 2.0.1 py34_0 anaconda/cloud/bioconda
busco 2.0.1 py35_0 bioconda
busco 2.0.1 py35_0 anaconda/cloud/bioconda
busco 2.0.1 py36_0 bioconda
busco 2.0.1 py36_0 anaconda/cloud/bioconda
busco 3.0.1 py35_0 bioconda
busco 3.0.1 py35_0 anaconda/cloud/bioconda
busco 3.0.1 py36_0 bioconda
busco 3.0.1 py36_0 anaconda/cloud/bioconda
busco 3.0.2 py35_4 bioconda
busco 3.0.2 py35_4 anaconda/cloud/bioconda
busco 3.0.2 py35_5 bioconda
busco 3.0.2 py35_5 anaconda/cloud/bioconda
busco 3.0.2 py35_6 bioconda
busco 3.0.2 py35_6 anaconda/cloud/bioconda
busco 3.0.2 py35_7 bioconda
busco 3.0.2 py35_7 anaconda/cloud/bioconda
busco 3.0.2 py36_4 bioconda
busco 3.0.2 py36_4 anaconda/cloud/bioconda
busco 3.0.2 py36_5 bioconda
busco 3.0.2 py36_5 anaconda/cloud/bioconda
busco 3.0.2 py36_6 bioconda
busco 3.0.2 py36_6 anaconda/cloud/bioconda
busco 3.0.2 py36_7 bioconda
busco 3.0.2 py36_7 anaconda/cloud/bioconda
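The search finds several versions of BUSCO. You can install a specific version as needed; if no version is specified, the latest one is installed by default.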
czh@ubuntu:~$ conda create -n busco busco
Once the installation succeeds, activate the busco environment.
czh@ubuntu:~$ source activate busco
# Typing the program name followed by -h or --help normally prints the help text, which confirms that the program is installed
(busco) czh@ubuntu:~$ busco -h
No command 'busco' found, did you mean:
Command 'btsco' from package 'bluez-btsco' (universe)
busco: command not found
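# Error! It looks as if busco is not installed
Look through the BUSCO user guide, BUSCO_userguide.pdf, for the basic commands.
BUSCO is written in Python, and this version is run by calling the script run_BUSCO.py rather than a command named busco.
# Locate run_BUSCO.py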
(busco) czh@ubuntu:~$ which run_BUSCO.py
/home/czh/miniconda3/envs/busco/bin/run_BUSCO.py
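The name of the entry point depends on the BUSCO version. The 3.0.2 package installed here ships run_BUSCO.py, while BUSCO 4 and later install a command named busco (that last point comes from later releases, not from this session). A minimal sketch for checking which name is on the PATH; the helper name first_available is made up for illustration:

```shell
# Print the first of the given command names that exists on the PATH.
# Returns 1 if none of them is found.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example: prefer "busco", fall back to "run_BUSCO.py".
first_available busco run_BUSCO.py || echo "no BUSCO entry point found" >&2
```

Checking with command -v avoids relying on the "command not found" message, which differs between shells and distributions.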
(busco) czh@ubuntu:~$ run_BUSCO.py -h
usage: python BUSCO.py -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
Welcome to BUSCO 3.0.2: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide.
optional arguments:
-i FASTA FILE, --in FASTA FILE
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set.
-c N, --cpu N Specify the number (N=integer) of threads/cores to use.
-o OUTPUT, --out OUTPUT
Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. WARNING: do not provide a path
-e N, --evalue N E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
-m MODE, --mode MODE Specify which BUSCO analysis mode to run.
There are three valid modes:
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
-l LINEAGE, --lineage_path LINEAGE
Specify location of the BUSCO lineage data to be used.
Visit http://busco.ezlab.org for available lineages.
-f, --force Force rewriting of existing files. Must be used when output files with the provided name already exist.
-r, --restart Restart an uncompleted run. Not available for the protein mode
-sp SPECIES, --species SPECIES
Name of existing Augustus species gene finding parameters. See Augustus documentation for available options.
--augustus_parameters AUGUSTUS_PARAMETERS
Additional parameters for the fine-tuning of Augustus run. For the species, do not use this option.
Use single quotes as follow: '--param1=1 --param2=2', see Augustus documentation for available options.
-t PATH, --tmp_path PATH
Where to store temporary files (Default: ./tmp/)
--limit REGION_LIMIT How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
--long Optimization mode Augustus self-training (Default: Off) adds considerably to the run time, but can improve results for some non-model organisms
-q, --quiet Disable the info logs, displays only errors
-z, --tarzip Tarzip the output folders likely to contain thousands of files
--blast_single_core Force tblastn to run on a single core and ignore the --cpu argument for this step only. Useful if inconsistencies when using multiple threads are noticed
-v, --version Show this version and exit
-h, --help Show this help message and exit
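III. Assessing the assembly quality of four E. coli genomes with BUSCO
1. Download the E. coli genome data
I downloaded the genome nucleotide sequences of four strains from the NCBI database, one for each assembly level: Escherichia coli IAI39 (Complete), E. coli UMN026 (Chromosome), E. coli IMT2125 (Scaffold), and E. coli TW10722 (Contig).
# The downloaded files are in the test folder; enter it and list its contents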
(busco) czh@ubuntu:~$ cd '/home/czh/Desktop/test'
(busco) czh@ubuntu:~/Desktop/test$ ls
Escherichia_coli_IAI39.gz Escherichia_coli_TW10722.gz
Escherichia_coli_IMT2125.gz Escherichia_coli_UMN026.gz
# The downloaded genome files are compressed, so decompress them
(busco) czh@ubuntu:~/Desktop/test$ gzip -d *.gz
(busco) czh@ubuntu:~/Desktop/test$ ls
Escherichia_coli_IAI39 Escherichia_coli_TW10722
Escherichia_coli_IMT2125 Escherichia_coli_UMN026
# busco expects FASTA input. The files contain FASTA records but lack the .fasta suffix, and busco did not recognize them, so add the suffix.
(busco) czh@ubuntu:~/Desktop/test$ rename 's/$/\.fasta/' *
(busco) czh@ubuntu:~/Desktop/test$ ls
Escherichia_coli_IAI39.fasta Escherichia_coli_TW10722.fasta
Escherichia_coli_IMT2125.fasta Escherichia_coli_UMN026.fasta
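The rename command used above is the Perl-based one shipped with Ubuntu; other systems may provide a different rename, or none at all. A portable sketch of the same step using only mv; the function name add_fasta_suffix is hypothetical:

```shell
# Append ".fasta" to every regular file in a directory that lacks the suffix.
add_fasta_suffix() {
    dir=${1:-.}
    for path in "$dir"/*; do
        [ -f "$path" ] || continue          # skip directories and unmatched globs
        case "$path" in
            *.fasta) ;;                      # already has the suffix
            *) mv -- "$path" "$path.fasta" ;;
        esac
    done
}

# Usage: add_fasta_suffix /home/czh/Desktop/test
```

Unlike the one-line rename, this version leaves files that already end in .fasta untouched, so it is safe to run twice.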
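2. Download the appropriate OrthoDB dataset
Next, download a suitable lineage dataset, built from OrthoDB, from the BUSCO website.
Taxonomic position of E. coli: kingdom Bacteria, phylum Proteobacteria, class Gammaproteobacteria, order Enterobacteriales, family Enterobacteriaceae, genus Escherichia.
[Figure: conserved single-copy ortholog datasets for bacteria]
Suitable choices are therefore Enterobacteriales odb9, Gammaproteobacteria odb9, and Proteobacteria odb9.
I had already downloaded all the bacterial odb9 datasets into a folder I created, busco_database, and unpacked the ones needed here.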
(busco) czh@ubuntu:~/Desktop/test$ cd '/home/czh/Desktop/busco_database'
(busco) czh@ubuntu:~/Desktop/busco_database$ ls
actinobacteria_odb9.tar.gz enterobacteriales_odb9.tar.gz
bacillales_odb9.tar.gz firmicutes_odb9.tar.gz
bacteria_odb9 gammaproteobacteria_odb9
bacteria_odb9.tar.gz gammaproteobacteria_odb9.tar.gz
bacteroidetes_odb9.tar.gz lactobacillales_odb9.tar.gz
betaproteobacteria_odb9.tar.gz proteobacteria_odb9
clostridia_odb9.tar.gz proteobacteria_odb9.tar.gz
cyanobacteria_odb9.tar.gz rhizobiales_odb9.tar.gz
deltaepsilonsub_odb9.tar.gz spirochaetes_odb9.tar.gz
enterobacteriales_odb9 tenericutes_odb9.tar.gz
# Check the help text for the basic command syntax
(busco) czh@ubuntu:~/Desktop/test$ run_BUSCO.py -h
usage: python BUSCO.py -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
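# -i: the input file; -l: the odb9 lineage dataset to search against; -o: the name used to label the output folder (a name, not a path); -m: the type of data being assessed, -m geno for a genome and -m tran for a transcriptome.
3. Assess the assembly quality of the four E. coli genomes
There are four genomes to assess, so I run them in a for...do...done loop, using the dataset for the order Enterobacteriales, enterobacteriales_odb9.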
(busco) czh@ubuntu:~/Desktop/test$ for i in Escherichia_coli_TW10722.fasta Escherichia_coli_IAI39.fasta Escherichia_coli_UMN026.fasta Escherichia_coli_IMT2125.fasta; do /home/czh/miniconda3/envs/busco/bin/run_BUSCO.py -i $i -l '/home/czh/Desktop/busco_database/enterobacteriales_odb9' -o $i -m geno -c 2; done
INFO ****************** Start a BUSCO 3.0.2 analysis, current time: 10/12/2018 23:54:48 ******************
INFO Configuration loaded from /home/czh/miniconda3/envs/busco/bin/../config/config.ini
INFO Init tools...
INFO Check dependencies...
INFO Check input file...
INFO To reproduce this run: python /home/czh/miniconda3/envs/busco/bin/run_BUSCO.py -i Escherichia_coli_TW10722.fasta -o Escherichia_coli_TW10722.fasta -l /home/czh/Desktop/busco_database/enterobacteriales_odb9/ -m genome -c 2 -sp E_coli_K12
INFO Mode is: genome
INFO The lineage dataset is: enterobacteriales_odb9 (prokaryota)
INFO Temp directory is ./tmp/
INFO ****** Phase 1 of 2, initial predictions ******
INFO ****** Step 1/3, current time: 10/12/2018 23:54:48 ******
INFO Create blast database...
INFO [makeblastdb] Building a new DB, current time: 10/12/2018 23:54:48
INFO [makeblastdb] New DB name: /home/czh/Desktop/test/tmp/Escherichia_coli_TW10722.fasta_3135827084
INFO [makeblastdb] New DB title: Escherichia_coli_TW10722.fasta
INFO [makeblastdb] Sequence type: Nucleotide
INFO [makeblastdb] Keep MBits: T
INFO [makeblastdb] Maximum file size: 1000000000B
INFO [makeblastdb] Adding sequences from FASTA; added 405 sequences in 0.165066 seconds.
INFO [makeblastdb] 1 of 1 task(s) completed at 10/12/2018 23:54:48
INFO Running tblastn, writing output to /home/czh/Desktop/test/run_Escherichia_coli_TW10722.fasta/blast_output/tblastn_Escherichia_coli_TW10722.fasta.tsv...
INFO [tblastn] 1 of 1 task(s) completed at 10/12/2018 23:55:10
INFO ****** Step 2/3, current time: 10/12/2018 23:55:10 ******
INFO Maximum number of candidate contig per BUSCO limited to: 3
INFO Getting coordinates for candidate regions...
INFO Pre-Augustus scaffold extraction...
INFO Running Augustus prediction using E_coli_K12 as species:
INFO [augustus] Please find all logs related to Augustus errors here: /home/czh/Desktop/test/run_Escherichia_coli_TW10722.fasta/augustus_output/augustus.log
INFO [augustus] 101 of 1008 task(s) completed at 10/12/2018 23:57:03
INFO [augustus] 202 of 1008 task(s) completed at 10/12/2018 23:58:47
INFO [augustus] 303 of 1008 task(s) completed at 10/13/2018 00:00:36
INFO [augustus] 404 of 1008 task(s) completed at 10/13/2018 00:02:02
INFO [augustus] 504 of 1008 task(s) completed at 10/13/2018 00:03:48
INFO [augustus] 605 of 1008 task(s) completed at 10/13/2018 00:05:14
INFO [augustus] 706 of 1008 task(s) completed at 10/13/2018 00:06:57
INFO [augustus] 807 of 1008 task(s) completed at 10/13/2018 00:08:41
INFO [augustus] 908 of 1008 task(s) completed at 10/13/2018 00:10:14
INFO [augustus] 1008 of 1008 task(s) completed at 10/13/2018 00:11:48
INFO Extracting predicted proteins...
INFO ****** Step 3/3, current time: 10/13/2018 00:12:00 ******
INFO Running HMMER to confirm orthology of predicted proteins:
INFO [hmmsearch] 101 of 1002 task(s) completed at 10/13/2018 00:12:03
INFO [hmmsearch] 201 of 1002 task(s) completed at 10/13/2018 00:12:07
INFO [hmmsearch] 301 of 1002 task(s) completed at 10/13/2018 00:12:12
INFO [hmmsearch] 401 of 1002 task(s) completed at 10/13/2018 00:12:15
INFO [hmmsearch] 502 of 1002 task(s) completed at 10/13/2018 00:12:20
INFO [hmmsearch] 602 of 1002 task(s) completed at 10/13/2018 00:12:24
INFO [hmmsearch] 702 of 1002 task(s) completed at 10/13/2018 00:12:29
INFO [hmmsearch] 802 of 1002 task(s) completed at 10/13/2018 00:12:34
INFO [hmmsearch] 902 of 1002 task(s) completed at 10/13/2018 00:12:38
INFO [hmmsearch] 1002 of 1002 task(s) completed at 10/13/2018 00:12:42
INFO Results:
INFO C:98.0%[S:97.6%,D:0.4%],F:1.5%,M:0.5%,n:781
INFO 765 Complete BUSCOs (C)
INFO 762 Complete and single-copy BUSCOs (S)
INFO 3 Complete and duplicated BUSCOs (D)
INFO 12 Fragmented BUSCOs (F)
INFO 4 Missing BUSCOs (M)
INFO 781 Total BUSCO groups searched
..........
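Because the loop passes the file name itself as -o, the output folders keep the extension (run_Escherichia_coli_TW10722.fasta in the log). The sketch below derives a shorter run name by stripping .fasta. It prints one command per genome instead of running it, so the commands can be checked first. busco_commands is a hypothetical helper, and only flags listed in the help text are used:

```shell
# Print one run_BUSCO.py command per FASTA file.
# $1 is the lineage directory; the remaining arguments are FASTA files.
busco_commands() {
    lineage=$1
    shift
    for fasta in "$@"; do
        run_name=$(basename "$fasta" .fasta)   # -o takes a name, not a path
        printf 'run_BUSCO.py -i %s -l %s -o %s -m geno -c 2\n' \
            "$fasta" "$lineage" "$run_name"
    done
}

busco_commands /home/czh/Desktop/busco_database/enterobacteriales_odb9 \
    Escherichia_coli_IAI39.fasta Escherichia_coli_UMN026.fasta
```

Once the printed commands look right, piping them to sh runs them one after another. The sketch assumes paths without spaces.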
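4. Results and interpretation
Assessing the four E. coli genome assemblies against the conserved single-copy orthologs of the order Enterobacteriales gives the following results:
Escherichia_coli_IAI39.fasta (Complete)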
C:99.9%[S:99.9%,D:0.0%],F:0.1%,M:-0.0%,n:781
780 Complete BUSCOs (C)
780 Complete and single-copy BUSCOs (S)
0 Complete and duplicated BUSCOs (D)
1 Fragmented BUSCOs (F)
0 Missing BUSCOs (M)
781 Total BUSCO groups searched
Escherichia_coli_UMN026.fasta(Chromosome)
C:99.9%[S:99.9%,D:0.0%],F:0.1%,M:-0.0%,n:781
780 Complete BUSCOs (C)
780 Complete and single-copy BUSCOs (S)
0 Complete and duplicated BUSCOs (D)
1 Fragmented BUSCOs (F)
0 Missing BUSCOs (M)
781 Total BUSCO groups searched
Escherichia_coli_IMT2125.fasta(Scaffold)
C:97.8%[S:97.8%,D:0.0%],F:2.2%,M:0.0%,n:781
764 Complete BUSCOs (C)
764 Complete and single-copy BUSCOs (S)
0 Complete and duplicated BUSCOs (D)
17 Fragmented BUSCOs (F)
0 Missing BUSCOs (M)
781 Total BUSCO groups searched
Escherichia_coli_TW10722.fasta(Contig)
C:98.0%[S:97.6%,D:0.4%],F:1.5%,M:0.5%,n:781
765 Complete BUSCOs (C)
762 Complete and single-copy BUSCOs (S)
3 Complete and duplicated BUSCOs (D)
12 Fragmented BUSCOs (F)
4 Missing BUSCOs (M)
781 Total BUSCO groups searched
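Each percentage in a summary line is the corresponding count divided by n = 781, rounded to one decimal place. A small check of that arithmetic with awk; busco_percent is a name chosen for this sketch:

```shell
# Percentage of BUSCO groups in a category, to one decimal place.
busco_percent() {
    awk -v count="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * count / total }'
}

busco_percent 765 781   # Complete BUSCOs in the TW10722 assembly
busco_percent 17 781    # Fragmented BUSCOs in the IMT2125 assembly
```

The two calls print 98.0 and 2.2, matching the C value for TW10722 and the F value for IMT2125 reported above.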
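Interpreting the results
The Jianshu tutorial "BUSCO组装质量评估软件" (BUSCO assembly quality assessment software) gives a good explanation of the BUSCO results in Chinese.
In outline:
- C (Complete): the number of BUSCO genes recovered in full; C = S + D.
- S (Single-copy): the number of complete genes found as a single copy.
- D (Duplicated): the number of complete genes found in more than one copy.
- F (Fragmented): the number of genes only partially recovered, matching over part of their length.
- M (Missing): the number of genes with no match.
- Total (n): the total number of BUSCO groups searched.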
Genome | C (Complete) | S (Single-copy) | D (Duplicated) | F (Fragmented) | M (Missing) | Total |
---|---|---|---|---|---|---|
Escherichia_coli_IAI39.fasta (Complete) | 99.9% | 99.9% | 0.0% | 0.1% | 0.0% | 781 |
Escherichia_coli_UMN026.fasta (Chromosome) | 99.9% | 99.9% | 0.0% | 0.1% | 0.0% | 781 |
Escherichia_coli_IMT2125.fasta (Scaffold) | 97.8% | 97.8% | 0.0% | 2.2% | 0.0% | 781 |
Escherichia_coli_TW10722.fasta (Contig) | 98.0% | 97.6% | 0.4% | 1.5% | 0.5% | 781 |
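A table like this one can be assembled from the one-line summaries. The sketch below splits a summary line into tab-separated C, S, D, F, M and n fields with awk. It assumes the line has the format shown in the results above; parse_busco_summary is a hypothetical name:

```shell
# Turn "C:98.0%[S:97.6%,D:0.4%],F:1.5%,M:0.5%,n:781" into
# tab-separated values in the order C, S, D, F, M, n.
parse_busco_summary() {
    printf '%s\n' "$1" | awk '{
        # Split on runs of anything that is not part of a "key:value" token.
        n = split($0, tok, /[^A-Za-z0-9.:-]+/)
        for (i = 1; i <= n; i++) {
            if (split(tok[i], kv, ":") == 2) val[kv[1]] = kv[2]
        }
        printf "%s\t%s\t%s\t%s\t%s\t%s\n", \
            val["C"], val["S"], val["D"], val["F"], val["M"], val["n"]
    }'
}

parse_busco_summary 'C:98.0%[S:97.6%,D:0.4%],F:1.5%,M:0.5%,n:781'
```

The tab-separated output pastes straight into a spreadsheet, and running the function on each genome's summary line gives one table row per assembly.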
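The Complete and Chromosome assemblies are 99.9% complete, while the Scaffold and Contig assemblies are 97.8% and 98.0% complete, respectively. In the contig-level genome, however, some genes that should in theory be single-copy were detected in multiple copies, possibly because of assembly errors, so the contig assembly is not of higher quality than the scaffold assembly.
"In general, and in my personal view, a larger S and a smaller M are better, since they indicate a more complete assembly. Larger D and F values are not necessarily good, though, because assembly errors can inflate both. I don't think a single program is enough to make the call. Other criteria can be brought in as well, such as QUAST and routine metrics like N50, total sequence length, and dot plots." (quoted from "BUSCO - 组装质量评估")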
References