EVM 对预测结果进行整合

2020-05-07 本文已影响0人斩毛毛

从头预测，同源注释和转录组整合都会得到一个预测结果，EVidenceModeler(EVM) 可以对上述结果进整合

软件安装

wget -4 https://github.com/EVidenceModeler/EVidenceModeler/archive/v1.1.1.tar.gz
tar xf v1.1.1.tar.gz
# 添加环境变量

使用流程

所需数据

gene_prediction.gff3
标准的gff3格式，必须要有gene, mRNA, exon, CDS这些特征
protein_alignments.gff3: 标准的GFF3格式，第9列要有ID信和和target信息, 标明是比对结果
transcript_alignments.gff3:标准的GFF3格式，第9列要有ID信和和target信息，标明是比对结果

$\color{red}{若有多个同类型注释结果，可以cat在一起}$

运行EVM

创建权重文件

# copy /EVidenceModeler-1.1.1/simple_example 下的weights.txt进行修改
cp ~ /EVidenceModeler-1.1.1/simple_example/weights.txt ./
vi weights.txt
ABINITIO_PREDICTION      augustus       4
TRANSCRIPT      assembler-database.sqlite      7
OTHER_PREDICTION  transdecoder  8
## 第一列为来源类型；分为：ABINITIO_PREDICTION, PROTEIN, TRANSCRIPT
## 第二列对应着gff3文件第二列
## 第三列为权重

分割原始数据, 用于后续并行

/EVidenceModeler-1.1.1/EvmUtils/partition_EVM_inputs.pl \
           --genome ref.fa \
           --gene_predictions gene_predictions.gff3 \
           --transcript_alignments transcript_alignments.gff3 \
           --segmentSize 100000 --overlapSize 10000 \
           --partition_listing partitions_list.out
# 参数
--genome： fasta file containing all genome sequences
--gene_predictions：* file containing gene predictions
--protein_alignments: file containing protein alignments
--transcript_alignments：file containing transcript alignments
--segmentSize：* :length of a single sequence for running EVM
--overlapSize： * :length of sequence overlap between segmented sequences
--partition_listing     * :name of output file to be created that contains the list of partitions

--segmentsSize设置的大小需要少于1Mb(这里是100k)， --overlapSize的不能太小，如果数学好，可用设置成基因平均长度加上2个标准差，数学不好，就设置成10K吧

创建并行运算命令并且执行

~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/write_EVM_commands.pl --genome ref.fa --weights `pwd`/weights.txt \
      --gene_predictions gene_predictions.gff3 \
      --transcript_alignments transcript_alignments.gff3 \
      --output_file_name evm.out  --partitions partitions_list.out >  commands.list
~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/execute_EVM_commands.pl commands.list


#参数
 --weights          | -w    weights for evidence types file

合并运行结果

~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out

结果转换成GFF3

~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions partitions_list.out --output evm.out  --genome ref.fa
find . -regex ".*evm.out.gff3" -exec cat {} \; | bedtools sort -i - > EVM.all.gff

过滤gff文件

注释过滤：对于初步预测得到的基因，还可以稍微优化一下，例如剔除编码少于50个AA的预测结果，将转座子单独放到一个文件中(软件有TransposonPSI)。

这里基于gffread先根据注释信息提取所有的CDS序列，过滤出长度不足50AA的序列，基于这些序列过滤原来的的注释

gffread EVM.all.gff -g input/genome.fa -y tr_cds.fa
bioawk -c fastx '$seq < 50 {print $comment}' tr_cds.fa | cut -d '=' -f 2 > short_aa_gene_list.txt
grep -v -w -f short_aa_gene_list.txt EvM.all.gff > filter.gff

参考

如何对基因组序列进行注释

EVM 对预测结果进行整合

软件安装

使用流程

所需数据

运行EVM

创建权重文件

分割原始数据, 用于后续并行

创建并行运算命令并且执行

合并运行结果

结果转换成GFF3

过滤gff文件

参考

猜你喜欢

热点阅读