基因注释：MAKER全记录（参照徐洲更老师）

2022-01-06 本文已影响0人橙子_orange

MAKER 目前支持

🔹SNAP (效果好、易于训练，但在长内含子的基因组上表现不如其他软件）
🔹Augustus (效果好、训练麻烦但提升效果好)
🔹GeneMark (自我训练，在碎片化基因组或长内含子上表现不佳）
🔹FGENESH (效果好，但是训练需要花钱)

1.AUGUSTUS

将PASA获得的database.sqlite.assemblies.fasta.transdecoder.genome.gff3转为genbank格式

augustus/scripts/gff2gbSmallDNA.pl database.sqlite.assemblies.fasta.transdecoder.genome.gff3 genome.fasta 1000 trans.gene.raw.gb

过滤可能错误的基因结构

#创建物种初始化hmm文件
augustus/scripts/new_species.pl  --species=new 
#训练，捕捉错误
augustus/bin/etraining --species=new --stopCodonExcludedFromCDS=false trans.gene.raw.gb 2> train.err 
#过滤错误
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes_list.txt
augustus/scripts/filterGenes.pl badgenes_list.txt trans.gene.raw.gb > trans.genes.gb

初步训练

将过滤后的trans.genes.gb随机分为训练集trans.genes.gb.train和测试集trans.genes.gb.test ，为了使测试准确性具有统计学意义，测试集应该足够大（包括100~200个loci），剩余loci包含在训练集中。

augustus/scripts/randomSplit.pl trans.genes.gb 100 #100是测试集的loci数

创建初始化HMM参数文件，并进行训练，从头预测trans.genes.gb.train中的基因

augustus/scripts/new_species.pl --species=new
augustus/bin/etraining --species=new trans.genes.gb.train | tee firsttest.out
grep -A 22 Evaluation firsttest.out #查看报告
*******      Evaluation of gene prediction     *******

---------------------------------------------\
                 | sensitivity | specificity |
---------------------------------------------|
nucleotide level |       0.873 |       0.626 |
---------------------------------------------/

----------------------------------------------------------------------------------------------------------\
           |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
           | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
           | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
----------------------------------------------------------------------------------------------------------|
           |        |        |      |                253 |                101 |             |             |
exon level |    484 |    332 |  231 | ------------------ | ------------------ |       0.696 |       0.477 |
           |    484 |    332 |      |   35 |    0 |  218 |   36 |    0 |   65 |             |             |
----------------------------------------------------------------------------------------------------------/

----------------------------------------------------------------------------\
transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level |   156 |   100 |   47 |  109 |   53 |        0.47 |       0.301 |
----------------------------------------------------------------------------/

100个基因中有47个被准确预测

69.6%的外显子被准确预测

47.7%预测的外显子在测试集中确切存在

使用optimize_augustus.pl可以将预测准确度提高几个百分点，但极其耗时

augustus/scripts/optimize_augustus.pl --species=new trans.genes.gb.train

在optimize_augustus.pl完成后，重新使用它设置的参数进行训练

augustus/bin/etraining --species=new trans.genes.gb.train

用测试集检查预测准确度

augustus/bin/augustus --species=new trans.genes.test.gb

如果sensitivity低于20%，可能是训练集不够大、基因组质量不高或者物种在某种程度上很“特殊”

2.SNAP模型训练

①SNAP是从基因组中从头预测基因的软件，首先训练一个包含转录组序列与基因组比对结果的SNAP模型

maker -CTL 
vi maker_opts.ctl
 genome=genome.fasta #基因组序列
 est=trinity.fasta #从头组装的转录组序列
 protein=protein.fasta #t同源蛋白序列
 est2genome=1
 protein2genome=1
 cpus=12
maker &> maker.log &

②将上一步生成的结果转换为构建SNAP模型的输入文件

mkdir SNAP && cd SNAP
maker/bin/gff3_merge -d ../genome.fasta.maker.output/genome.fasta_master_datastore_index.log
maker/bin/maker2zff -c 0.8 -e 0.8 -o 0.8 -x 0.2 genome.fasta.all.gff

🔹maker2zff生成一个ZFF格式文件（genome.ann）和一个FASTA格式文件（genome.dna），过滤用于再次训练的高置信度基因，共有7个选项：
-c 由EST/mRNA-Seq比对确定的剪接位点的比例，默认0.5
-e 与EST/mRNA-Seq比对重叠的外显子的比例，默认0.5
-o 和任何证据（EST或者蛋白）重叠的外显子的比例，默认0.5
-a 从头预测证实的剪接位点的比例，默认0
-t 和从头预测结果重叠的外显子的比例，默认0
-l mRNA翻译的蛋白质序列的最短长度
-x 最大AED值，默认0.5
-n 不过滤
🔸AED值：maker2使用注释编辑距离（AED）来评估基因组注释的准确性，AED是一个介于 0 和 1 之间的数字，衡量注释与支持它的evidence的拟合优度，0 表示与可用证据完全一致，1 表示缺乏对注释基因模型的支持

③接着构建模型，生成输入maker2的hmm文件

SNAP/fathom -categorize 1000 genome.ann genome.dna
SNAP/fathom -export 1000 -plus uni.ann uni.dna
SNAP/forge export.ann export.dna
SNAP/hmm-assembler.pl snap . > snap.hmm

④回到maker2的文件夹，编辑maker_opts.ctl，重新运行maker2

vi maker_opts.ctl
 genome=genome.fasta
 est=trinity.fasta
 protein=protein.fasta
 snap=snap.hmm
 est2genome=0 #直接从EST中预测基因，1=yes，0=no
 protein2genome=0
maker &> maker.log &

②-④重复2~3次

ERROR: Could not determine if RepBase is installed

参考ERROR: Could not determine if RepBase is installed · Issue #16501 · bioconda/bioconda-recipes · GitHub
将maker_opts.ctl文件中的model_org设为空选项，不会检查RepBase有没有安装

ERROR：Comparison failed. Retrying with larger minmatch (10)

根据custom de novo library failing - batch failures and "Comparison failed. Retrying with larger minmatch (10)" · Issue #124 · rmhubley/RepeatMasker · GitHub，暂无解决办法
参考文章：
MAKER Tutorial for WGS Assembly and Annotation Winter School 2018 - MAKER Wiki (utah.edu)
http://www.yandell-lab.org/publications/pdf/maker_current_protocols.pdf
https://vcru.wisc.edu/simonlab/bioinformatics/programs/augustus/docs/tutorial2015/training.html
使用MAKER进行基因注释(高级篇之AUGUSTUS模型训练） - 简书 (jianshu.com)
https://biohpc.cornell.edu/doc/annotation_2018_exercises1.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3280279/
https://www.nature.com/articles/s41598-018-26416-2
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/The_MAKER_control_files_explained