【基因组注释】ncRNA注释
1. ncRNA
非编码RNA(Non-coding RNA, ncRNA) 包括rRNA,tRNA,snRNA,snoRNA 和microRNA 等不编码蛋白质的RNA,它们转录后直接在RNA 水平上就能行使各自的生物学功能,并不需要翻译成蛋白质。
2. 软件
tRNA注释
一般用tRNAscan-SE,老牌软件。
rRNA注释
可用RNAMMER注释。但需要edu邮箱下载。
或者用barrnap,开源软件,也非常好用。
其他ncRNA注释
Infernal软件+Rfam数据库是注释其他复杂ncRNA的常用组合,它的结果同时包含了tRNA,rRNA,snRNA,snoRNA和miRNA等。
对比以上两款软件的tRNA和rRNA结果,Infernal得到的结果差不太多。所以如果不重点研究ncRNA,用这个组合也可。
3. 注释
tRNA
以tRNAscan为例:
tRNAscan-SE genome.fa -o tRNA.out -f tRNA.ss -m tRNA.stats
tRNA.stats文件中包含了tRNA预测的结果统计。
image.png
可通过脚本将tRNA.out整理为gff3。
perl convert_tRNAScanSE_to_gff3.pl -i tRNA.out >tRNAscan.gff3
PS:convert_tRNAScanSE_to_gff3.pl 可参考:github: https://github.com/jorvis/biocode/blob/master/gff/convert_tRNAScanSE_to_gff3.pl
rRNA
以barrnap为例:
barrnap-master/bin/barrnap \
--kingdom euk \ #euk bac arc mito
--threads 10 \
--outseq rRNA.fa \ #rRNA结果序列
genome.fa >rRNA.gff3 #rRNA gff结果
PS:不建议用conda安装使用,会报错。如
[barrnap] ERROR: nhmmer failed to run - Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.
得到的结果可进一步统计各核糖体亚基数目。
- 细菌bacteria (5S,23S,16S)
- 古菌archaea (5S,5.8S,23S,16S)
- 多细胞生物线粒体metazoan mitochondria (12S,16S)
- 真核生物eukaryotes (5S,5.8S,28S,18S)
snRNA、miRNA等
首次下载安装infernal和Rfam数据库。infernal使用conda方便,Rfam数据库重点是Rfam.cm和Rfam.clanin的下载:
# install the package with conda
conda install -c bioconda infernal
# download Rfam data
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
gunzip Rfam.cm.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin
# cmpress 索引
cmpress Rfam.cm
然后使用cmscan注释ncRNA:
genome_total=400000000 #基因组大小
CMnumber=`grep "ACC" /database/Rfam/Rfam.cm|wc -l`
Z=`echo $genome_total*2*$CMnumber/1000000 |bc`
#echo $Z
cmscan -Z $Z --cut_ga --rfam --nohmmonly --tblout genome.tblout \
--fmt 2 --cpu 30 --clanin /database/Rfam/Rfam.clanin \
/database/Rfam/Rfam.cm genome.fa > genome.cmscan
主要有两个结果文件genome.cmscan(比对结果)和genome.tblout(table格式)。要想得到进一步分类统计结果,需要做一些处理。
4. snRNA、miRNA等结果的统计
首先,提取必须的列和非重叠区域或重叠区域中得分高的区域。
awk 'BEGIN{OFS="\t";}{if(FNR==1) print "target_name\taccession\tquery_name\tquery_start\tquery_end\tstrand\tscore\tEvalue"; if(FNR>2 && $20!="=" && $0!~/^#/) print $2,$3,$4,$10,$11,$12,$17,$18; }' my-genome.tblout >genome.tblout.final.xls
image.png
然后,需要根据Rfam数据库的注释来对结果进行分类。Rfam的注释可从官网拷贝(命名tmp):https://rfam.xfam.org/search/type
image.pngimage.png image.png
# 拆分第三列
less tmp | awk 'BEGIN {FS=OFS="\t"}{split($3,x,";");class=x[2];print $1,$2,$3,$4,class}' > rfam_anno.txt
最后统计各类ncRNA注释结果。
awk 'BEGIN{OFS=FS="\t"}ARGIND==1{a[$2]=$5;}ARGIND==2{type=a[$1]; if(type=="") type="Others"; count[type]+=1;}END{for(type in count) print type, count[type];}' rfam_anno.txt genome.tblout.final.xls
统计结果如下:
riboswitch 1
tRNA 813
ribozyme 1
Others 217
miRNA 1948
antisense 7
rRNA 1451
sRNA 1
snRNA 667
也可将注释结果整理为gff3文件:
perl infernal-tblout2gff.pl --cmscan --fmt2 genome.tblout >genome.infernal.ncRNA.gff3
附infernal-tblout2gff.pl
#!/usr/bin/env perl
# infernal-tblout2gff.pl: convert cmsearch or cmscan tblout files to GFF format
# an important point of above ref:
# "Start is always less than or equal to end"
# EPN, Fri Jun 7 11:07:38 2019
#
#
use strict;
use warnings;
use Getopt::Long;
my $in_tblout = ""; # name of input tblout file
my $usage;
$usage = "infernal-tblout2gff.pl\n\n";
$usage .= "Usage:\n\n";
$usage .= "infernal-tblout2gff.pl [OPTIONS] <cmsearch tblout file>\n\tOR\n";
$usage .= "infernal-tblout2gff.pl --cmscan [OPTIONS] <cmscan tblout file>\n\tOR\n";
$usage .= "infernal-tblout2gff.pl --cmscan --fmt2 [OPTIONS] <cmscan --fmt 2 tblout file>\n\n";
$usage .= "\tOPTIONS:\n";
$usage .= "\t\t-T <n> : minimum bit score to include is <n>\n";
$usage .= "\t\t-E <x> : maximum E-value to include is <x>\n";
$usage .= "\t\t--cmscan : tblout file was created by cmscan\n";
$usage .= "\t\t--source <s> : specify 'source' field should be <s> (e.g. tRNAscan-SE)\n";
$usage .= "\t\t--fmt2 : tblout file was created with cmscan --fmt 2 option\n";
$usage .= "\t\t--all : output all info in 'attributes' column [default: E-value]\n";
$usage .= "\t\t--none : output no info in 'attributes' column [default: E-value]\n";
$usage .= "\t\t--desc : output desc field in 'attributes' column [default: E-value]\n";
$usage .= "\t\t--version <s>: append \"-<s>\" to 'source' column\n";
$usage .= "\t\t--extra <s> : append \"<s>;\" to 'attributes' column\n";
$usage .= "\t\t--hidedesc : do not includ \"desc\:\" prior to desc value in 'attributes' column\n";
my $do_minscore = 0; # set to '1' if -T used
my $do_maxevalue = 0; # set to '1' if -E used
my $minscore = undef; # defined if if -T used
my $maxevalue = undef; # defined if -E used
my $do_cmscan = 0; # set to '1' if --cmscan used
my $do_fmt2 = 0; # set to '1' if --fmt
my $do_all_attributes = 0; # set to '1' if --all
my $do_no_attributes = 0; # set to '1' if --none
my $do_de_attributes = 0; # set to '1' if --desc
my $version = undef; # defined if --version used
my $extra = undef; # defined if --extra used
my $do_hidedesc = 0; # set to 1 if --hidedesc used
my $opt_source = undef; # set to <s> if --source <s> used
&GetOptions( "T=s" => \$minscore,
"E=s" => \$maxevalue,
"cmscan" => \$do_cmscan,
"source=s" => \$opt_source,
"fmt2" => \$do_fmt2,
"all" => \$do_all_attributes,
"none" => \$do_no_attributes,
"desc" => \$do_de_attributes,
"version=s" => \$version,
"extra=s" => \$extra,
"hidedesc" => \$do_hidedesc);
if(scalar(@ARGV) != 1) { die $usage; }
my ($tblout_file) = @ARGV;
if(defined $minscore) { $do_minscore = 1; }
if(defined $maxevalue) { $do_maxevalue = 1; }
if($do_minscore && $do_maxevalue) {
die "ERROR, -T and -E cannot be used in combination. Pick one.";
}
if(($do_all_attributes) && ($do_no_attributes)) {
die "ERROR, --all and --none cannot be used in combination. Pick one.";
}
if(($do_all_attributes) && ($do_de_attributes)) {
die "ERROR, --all and --desc cannot be used in combination. Pick one.";
}
if(($do_no_attributes) && ($do_de_attributes)) {
die "ERROR, --none and --desc cannot be used in combination. Pick one.";
}
if(($do_fmt2) && (! $do_cmscan)) {
die "ERROR, --fmt2 only makes sense in combination with --cmscan";
}
if((defined $opt_source) && (defined $version)) {
die "ERROR, --source and --version are incompatible";
}
if(! -e $tblout_file) { die "ERROR tblout file $tblout_file does not exist"; }
if(! -s $tblout_file) { die "ERROR tblout file $tblout_file is empty"; }
my $source = ($do_cmscan) ? "cmscan" : "cmsearch";
if(defined $version) { $source .= "-" . $version; }
if(defined $opt_source) { $source = $opt_source; }
open(IN, $tblout_file) || die "ERROR unable to open $tblout_file for reading";
my $line;
my $i;
while($line = <IN>) {
if($line !~ m/^\#/) {
chomp $line;
my @el_A = split(/\s+/, $line);
my $nfields = scalar(@el_A);
if((! $do_fmt2) && ($nfields < 18)) {
die "ERROR expected at least 18 space delimited fields in tblout line (fmt 1, default) but got $nfields on line:\n$line\n";
}
if(($do_fmt2) && ($nfields < 27)) {
die "ERROR expected at least 27 space delimited fields in tblout line (fmt 2, --fmt2) but got $nfields on line:\n$line\n";
}
# ref Infernal 1.1.2 user guide, pages 59-61
my $idx = undef;
my $seqname = undef;
my $seqaccn = undef;
my $mdlname = undef;
my $mdlaccn = undef;
my $clan = undef;
my $mdl = undef;
my $mdlfrom = undef;
my $mdlto = undef;
my $seqfrom = undef;
my $seqto = undef;
my $strand = undef;
my $trunc = undef;
my $pass = undef;
my $gc = undef;
my $bias = undef;
my $score = undef;
my $evalue = undef;
my $inc = undef;
my $olp = undef;
my $anyidx = undef;
my $anyfrct1= undef;
my $anyfrct2= undef;
my $winidx = undef;
my $winfrct1= undef;
my $winfrct2= undef;
my $desc = undef;
if($do_fmt2) { # 27 fields
$idx = $el_A[0];
$seqname = ($do_cmscan) ? $el_A[3] : $el_A[1];
$seqaccn = ($do_cmscan) ? $el_A[4] : $el_A[2];
$mdlname = ($do_cmscan) ? $el_A[1] : $el_A[3];
$mdlaccn = ($do_cmscan) ? $el_A[2] : $el_A[4];
$clan = $el_A[5];
$mdl = $el_A[6];
$mdlfrom = $el_A[7];
$mdlto = $el_A[8];
$seqfrom = $el_A[9];
$seqto = $el_A[10];
$strand = $el_A[11];
$trunc = $el_A[12];
$pass = $el_A[13];
$gc = $el_A[14];
$bias = $el_A[15];
$score = $el_A[16];
$evalue = $el_A[17];
$inc = $el_A[18];
$olp = $el_A[19];
$anyidx = $el_A[20];
$anyfrct1= $el_A[21];
$anyfrct2= $el_A[22];
$winidx = $el_A[23];
$winfrct1= $el_A[24];
$winfrct2= $el_A[25];
$desc = $el_A[26];
for($i = 27; $i < $nfields; $i++) { $desc .= "_" . $el_A[$i]; }
}
else { # fmt 1, default
$seqname = ($do_cmscan) ? $el_A[2] : $el_A[0];
$seqaccn = ($do_cmscan) ? $el_A[3] : $el_A[1];
$mdlname = ($do_cmscan) ? $el_A[0] : $el_A[2];
$mdlaccn = ($do_cmscan) ? $el_A[1] : $el_A[3];
$mdl = $el_A[4];
$mdlfrom = $el_A[5];
$mdlto = $el_A[6];
$seqfrom = $el_A[7];
$seqto = $el_A[8];
$strand = $el_A[9];
$trunc = $el_A[10];
$pass = $el_A[11];
$gc = $el_A[12];
$bias = $el_A[13];
$score = $el_A[14];
$evalue = $el_A[15];
$inc = $el_A[16];
$desc = $el_A[17];
for($i = 18; $i < $nfields; $i++) { $desc .= "_" . $el_A[$i]; }
}
# one sanity check, strand should make sense
if(($strand ne "+") && ($strand ne "-")) {
if(($do_fmt2) && (($seqfrom eq "+") || ($seqfrom eq "-"))) {
die "ERROR problem parsing, you specified --fmt2 but tblout file appears to have NOT been created with --fmt 2, retry without --fmt2\nproblematic line:\n$line\n";
}
if((! $do_fmt2) && (($pass eq "+") || ($pass eq "-"))) {
die "ERROR problem parsing, you did not specify --fmt2 but tblout file appears to have been created with --fmt 2, retry with --fmt2\nproblematic line:\n$line\n";
}
die "ERROR unable to parse, problematic line:\n$line\n";
}
if(($do_minscore) && ($score < $minscore)) {
; # skip
}
elsif(($do_maxevalue) && ($evalue > $maxevalue)) {
; # skip
}
else {
my $attributes = "evalue=" . $evalue; # default to just evalue
if($do_all_attributes) {
if($do_fmt2) {
$attributes .= sprintf(";idx=$idx;seqaccn=$seqaccn;mdlaccn=$mdlaccn;clan=$clan;mdl=$mdl;mdlfrom=$mdlfrom;mdlto=$mdlto;trunc=$trunc;pass=$pass;gc=$gc;bias=$bias;inc=$inc;olp=$olp;anyidx=$anyidx;anyfrct1=$anyfrct1;anyfrct2=$anyfrct2;winidx=$winidx;winfrct1=$winfrct1;winfrct2=$winfrct2;%s$desc", ($do_hidedesc ? "" : "desc="));
}
else {
$attributes .= sprintf(";seqaccn=$seqaccn;mdlaccn=$mdlaccn;mdl=$mdl;mdlfrom=$mdlfrom;mdlto=$mdlto;trunc=$trunc;pass=$pass;gc=$gc;bias=$bias;inc=$inc;%s$desc", ($do_hidedesc ? "" : "desc="));
}
}
elsif($do_no_attributes) {
$attributes = "-";
}
elsif($do_de_attributes) {
$attributes = $desc;
}
if(defined $extra) {
if($attributes eq "-") { $attributes = ""; }
elsif($attributes !~ m/\;$/) { $attributes .= ";"; }
$attributes .= $extra . ";";
}
printf("%s\t%s\t%s\t%d\t%d\t%.1f\t%s\t%s\t%s\n",
$seqname, # token 1: 'sequence' (sequence name)
$source, # token 2: 'source'
$mdlname, # token 3: 'feature' (model name) you may want to change this to 'ncRNA'
($strand eq "+") ? $seqfrom : $seqto, # token 4: 'start' in coordinate space [1..seqlen], must be <= 'end'
($strand eq "+") ? $seqto : $seqfrom, # token 5: 'end' in coordinate space [1..seqlen], must be >= 'start'
$score, # token 6: 'score' bit score
$strand, # token 7: 'strand' ('+' or '-')
".", # token 8: 'phase' irrelevant for noncoding RNAs
$attributes); # token 9: attributes, currently only E-value, unless --all, --none or --desc
}
}
}
附convert_tRNAScanSE_to_gff3.pl
#!/usr/bin/env perl
=head1 NAME
github:jorvis/biocode
tRNAScan_SE_to_gff3.pl - convert raw output of tRNAScan-SE to gff3
=head1 SYNOPSIS
USAGE: convert_tRNAScanSE_to_gff3.pl
--input=/path/to/some_file.out
=head1 OPTIONS
B<--input,-i>
The raw output from tRNAScan-SE:
Sequence tRNA Bounds tRNA Anti Intron Bounds Cove
Name tRNA # Begin End Type Codon Begin End Score
-------- ------ ---- ------ ---- ----- ----- ---- ------
tp.assembly.567468735.1 1 91820 91902 Tyr GTA 91857 91866 66.58
tp.assembly.567468735.1 2 171777 171849 Phe GAA 0 0 70.28
tp.assembly.567468735.1 3 172144 172215 His GTG 0 0 64.04
tp.assembly.567468735.1 4 852847 852919 Thr AGT 0 0 75.69
tp.assembly.567468735.1 5 877291 877362 Trp CCA 0 0 68.97
tp.assembly.567468735.1 6 1468229 1468300 Cys GCA 0 0 72.10
tp.assembly.567468735.1 7 2507459 2507530 Pro AGG 0 0 62.33
tp.assembly.567468735.1 8 2507198 2507127 Pro CGG 0 0 65.73
tp.assembly.567468735.1 9 2506317 2506246 Pro TGG 0 0 66.60
tp.assembly.567468735.1 10 2463785 2463713 Lys TTT 0 0 79.47
tp.assembly.567468735.1 11 2191149 2191069 Leu CAG 0 0 57.47
tp.assembly.567468735.1 12 1633307 1633237 Gly CCC 0 0 65.52
tp.assembly.567468735.1 13 1255051 1254968 Leu CAA 0 0 60.46
tp.assembly.567468735.1 14 251108 251037 Asp GTC 0 0 59.48
tp.assembly.567468735.1 15 250520 250449 Asp GTC 0 0 59.48
B<--log,-l>
Log file
B<--ignoreIntrons,-g>
Flag to ignore introns. tRNAs split by introns are usually flagged by NCBI's table2asn as too short.
B<--help,-h>
This help message
=head1 DESCRIPTION
File converter
=head1 INPUT
Input above.
=head1 OUTPUT
GFF3 to STDOUT
=head1 CONTACT
Kyle Tretina
kyletretina@gmail.com
=cut
use warnings;
use strict;
use Getopt::Long qw(:config no_ignore_case no_auto_abbrev pass_through);
use Pod::Usage;
my %options = ();
my $results = GetOptions (\%options,
'input|i=s',
'log|l=s',
'ignoreIntrons|g',
'help|h') || pod2usage();
## display documentation
if( $options{'help'} ){
pod2usage( {-exitval => 0, -verbose => 2, -output => \*STDERR} );
}
## make sure everything passed was peachy
&check_parameters(\%options);
## open the log if requested
my $logfh;
if (defined $options{log}) {
open($logfh, ">$options{log}") || die "can't create log file: $!";
}
## set ignoreIntrons flag
my $ignoreIntrons = '';
if (defined $options{ignoreIntrons}) {
$ignoreIntrons = 1;
}
## open the input file
my $ifh;
open($ifh, "<$options{input}") || die "can't open input file: $!";
# all output needs the gff header
print "##gff-version 3\n";
## globals
my $tRNAGenNum=1;
my $tRNAExonNum=1;
## parse the file
foreach my $line (<$ifh>){
my @cols = split /[\t]/, $line;
chomp @cols;
my $contig = $cols[0];
if ($contig =~ /^(.+?)\s+$/) {
$contig = $1;
}
## skip the header lines
next if $contig eq 'Sequence' || $contig eq 'Name' || $contig eq '--------';
my $start = $cols[2] + 0;
my $stop = $cols[3] + 0;
my $trna = $cols[4];
my $score = $cols[8];
my $pseudo = "";
my $anticodon = $cols[5];
my $intron_start = $cols[6]-1; ## offset intron boundaries
my $intron_end = $cols[7]+1;
$pseudo = ";pseudogene=unknown" if $cols[9] =~ m/pseudo/;
# Suppressor (sup) tRNAs are not recognized by NCBI's table2asn
$trna = "Xxx" if $trna =~ m/Sup/ or $trna =~ m/Undet/;
## Check if tRNAScan-SE predicted a possible intron (value in column 6 > 0)
if ($cols[6] > 0 and not $ignoreIntrons) {
if ($start > $stop){
my $intron_end = $cols[6]+1; ## offset intron boundaries
my $intron_start = $cols[7]-1;
($start, $stop) = ($stop, $start);
}
print "$contig\ttRNAScan-SE\tgene\t$start\t$stop\t$score\t-\t.\tID=tRNA-$trna$tRNAGenNum\_gene$pseudo\n";
print "$contig\ttRNAScan-SE\ttRNA\t$start\t$intron_start\t$score\t-\t.\tID=tRNA-$trna$tRNAExonNum\_tRNA;Parent=tRNA-$trna$tRNAGenNum\_gene;product=tRNA-$trna;anticodon=$anticodon;\n";
$tRNAExonNum++;
print "$contig\ttRNAScan-SE\ttRNA\t$intron_end\t$stop\t$score\t-\t.\tID=tRNA-$trna$tRNAExonNum\_tRNA;Parent=tRNA-$trna$tRNAGenNum\_gene;product=tRNA-$trna;anticodon=$anticodon;\n";
$tRNAGenNum++;
$tRNAExonNum++;
} else{
($start, $stop) = ($stop, $start) if ($start > $stop);
print "$contig\ttRNAScan-SE\tgene\t$start\t$stop\t$score\t-\t.\tID=tRNA-$trna$tRNAGenNum\_gene$pseudo\n";
print "$contig\ttRNAScan-SE\ttRNA\t$start\t$stop\t$score\t-\t.\tID=tRNA-$trna$tRNAExonNum\_tRNA;product=tRNA-$trna;anticodon=$anticodon;Parent=tRNA-$trna$tRNAGenNum\_gene\n";
$tRNAGenNum++;
$tRNAExonNum++;
}
}
exit(0);
sub _log {
my $msg = shift;
print $logfh "$msg\n" if $logfh;
}
sub check_parameters {
my $options = shift;
## make sure required arguments were passed
my @required = qw( input );
for my $option ( @required ) {
unless ( defined $$options{$option} ) {
die "--$option is a required option";
}
}
## handle some defaults
$options{optional_argument2} = 'foo' unless ($options{optional_argument2});
}
https://www.jianshu.com/p/8f4061c38508
http://blog.genesino.com/2017/06/Rfam/