结构注释

EVM 对预测结果进行整合

2020-05-07  本文已影响0人  斩毛毛

从头预测,同源注释和转录组整合都会得到一个预测结果,EVidenceModeler(EVM) 可以对上述结果进整合

软件安装

wget -4 https://github.com/EVidenceModeler/EVidenceModeler/archive/v1.1.1.tar.gz
tar xf v1.1.1.tar.gz
# 添加环境变量

使用流程

  1. 所需数据

\color{red}{若有多个同类型注释结果,可以cat在一起}

  1. 运行EVM

# copy /EVidenceModeler-1.1.1/simple_example 下的weights.txt进行修改
cp ~ /EVidenceModeler-1.1.1/simple_example/weights.txt ./
vi weights.txt
ABINITIO_PREDICTION      augustus       4
TRANSCRIPT      assembler-database.sqlite      7
OTHER_PREDICTION  transdecoder  8
## 第一列为来源类型;分为:ABINITIO_PREDICTION, PROTEIN, TRANSCRIPT
## 第二列对应着gff3文件第二列
## 第三列为权重
/EVidenceModeler-1.1.1/EvmUtils/partition_EVM_inputs.pl \
           --genome ref.fa \
           --gene_predictions gene_predictions.gff3 \
           --transcript_alignments transcript_alignments.gff3 \
           --segmentSize 100000 --overlapSize 10000 \
           --partition_listing partitions_list.out
# 参数
--genome: fasta file containing all genome sequences
--gene_predictions:* file containing gene predictions
--protein_alignments: file containing protein alignments
--transcript_alignments:file containing transcript alignments
--segmentSize:* :length of a single sequence for running EVM
--overlapSize: * :length of sequence overlap between segmented sequences
--partition_listing     * :name of output file to be created that contains the list of partitions

--segmentsSize设置的大小需要少于1Mb(这里是100k), --overlapSize的不能太小,如果数学好,可用设置成基因平均长度加上2个标准差,数学不好,就设置成10K吧

~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/write_EVM_commands.pl --genome ref.fa --weights `pwd`/weights.txt \
      --gene_predictions gene_predictions.gff3 \
      --transcript_alignments transcript_alignments.gff3 \
      --output_file_name evm.out  --partitions partitions_list.out >  commands.list
~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/execute_EVM_commands.pl commands.list


#参数
 --weights          | -w    weights for evidence types file
~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out
~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions partitions_list.out --output evm.out  --genome ref.fa
find . -regex ".*evm.out.gff3" -exec cat {} \; | bedtools sort -i - > EVM.all.gff

过滤gff文件

注释过滤:对于初步预测得到的基因,还可以稍微优化一下,例如剔除编码少于50个AA的预测结果,将转座子单独放到一个文件中(软件有TransposonPSI)。

这里基于gffread先根据注释信息提取所有的CDS序列,过滤出长度不足50AA的序列,基于这些序列过滤原来的的注释

gffread EVM.all.gff -g input/genome.fa -y tr_cds.fa
bioawk -c fastx '$seq < 50 {print $comment}' tr_cds.fa | cut -d '=' -f 2 > short_aa_gene_list.txt
grep -v -w -f short_aa_gene_list.txt EvM.all.gff > filter.gff

参考

上一篇 下一篇

猜你喜欢

热点阅读