Using RIsearch2 to predict RNA-RNA interactions (sRNA target

2018-09-21  Y大宽

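Background

Non-coding RNAs often act by forming duplexes with other RNAs. These RNA-RNA interactions rest on base pairing, so a high degree of complementarity between two RNA sequences is a strong predictor of an interaction. RIsearch2 is an RNA-RNA interaction prediction tool that locates complementary regions between given query and target sequences. Using a seed-and-extend framework built on suffix arrays, it can discover RNA-RNA interactions at genome or transcriptome scale. Like its predecessor RIsearch, RIsearch2 uses a modified Smith-Waterman-Gotoh algorithm based on dinucleotides to approximate nearest-neighbor energy parameters. However, instead of aligning the full sequences, RIsearch2 looks for perfect complementarity in a seed region and extends the seed at both ends. User-defined seed and extension constraints make RIsearch2 applicable to all types of RNA-RNA interaction prediction.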
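To make the notion of complementarity concrete, here is a minimal shell sketch that computes the reverse complement of an RNA sequence, that is, the strand that would pair perfectly with it under Watson-Crick rules. The function name `revcomp` is made up for illustration. The sketch ignores G-U wobble pairs and energies, both of which RIsearch2 takes into account, so it only illustrates the idea and is not how the tool works internally.

```shell
# Reverse-complement an RNA sequence (Watson-Crick pairs only).
# 'revcomp' is an illustrative helper, not part of RIsearch2.
revcomp() {
    printf '%s\n' "$1" \
        | tr 'ACGUacgu' 'UGCAugca' \
        | awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }'
}

# A target region equal to this string would be perfectly complementary to the input
revcomp GGGAGAGAAGG
```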


1 Download, install, and set the environment variable

cd biosoft
wget https://rth.dk/resources/risearch/RIsearch-2.1.tar.gz
tar -xzvf RIsearch-2.1.tar.gz
cd RIsearch-2.1
less README
less INSTALL

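Make the binary available on the PATH (here by copying it into a bin directory that is already on the PATH)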

cp /home/kelly/biosoft/RIsearch-2.1/bin/risearch2.x /home/kelly/bin/.
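As an alternative to copying the binary, the RIsearch2 bin directory itself can be added to PATH. The sketch below assumes the tarball was unpacked under `$HOME/biosoft` (a hypothetical location mirroring the path used above; adjust as needed) and then reports whether the shell can find `risearch2.x`.

```shell
# Hypothetical install location; change this to wherever RIsearch-2.1 was unpacked.
RISEARCH_BIN="$HOME/biosoft/RIsearch-2.1/bin"

# Prepend the directory to PATH only if it is not already listed.
case ":$PATH:" in
    *":$RISEARCH_BIN:"*) ;;
    *) PATH="$RISEARCH_BIN:$PATH"; export PATH ;;
esac

# Report whether the binary can now be resolved.
if command -v risearch2.x >/dev/null 2>&1; then
    echo "risearch2.x found at: $(command -v risearch2.x)"
else
    echo "risearch2.x not found; check RISEARCH_BIN"
fi
```

To make the change permanent, the same lines could go into a shell startup file such as `~/.bashrc`.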

View the usage

risearch2.x
================================ RIsearch v2.1 ===============================
================ Energy based RNA-RNA interaction predictions ================

Usage: risearch2.x [options]

  -h,         --help
                 show this message
------------------------------------------------------------------------------
--------------------------- SUFFIX ARRAY CREATION ----------------------------
  -c <FILE>,  --create=FILE (.fa or .fa.gz)
                 create suffix array for target sequence(s) together with
                 their reverse complements, FASTA format, use '-' for stdin
  -o <FILE>,  --output=FILE
                 save created suffix array to given index file path
------------------------------------------------------------------------------
--------------------------- INTERACTION PREDICTION ---------------------------
  -q <FILE>,  --query=FILE (.fa or .fa.gz)
                 FASTA file for query sequence(s), use '-' for stdin
  -i <FILE>,  --index=FILE
                 pregenerated suffix array file for target sequence(s)
  -s n:m/l,   --seed=n:m/l
                 set seed length (-s l = length only; -s n:m = full interval;
                 -s n:m/l = length in interval; default -s 6)
  -l <int>,   --extension=L
                 max extension length(L) on the seed (do DP for max this length
                 up- and downstream of seed) (default L=20)
  -e <float>, --energy=dG
                 set deltaG energy threshold (in kcal/mol) to filter predictions
                 (default=-20)
  -z mat,     --matrix=mat
                 set energy matrix to t99 or t04 (default) for RNA-RNA duplexes
  -d <int>,   --penalty=dP
                 per-nucleotide extension penalty given in dacal/mol
                 (recommended: 30, default: 0)
  -t <int>,   --threads=N
                 set maximum number of threads to use (default=1)
  -p,         --report_alignment
                 report predictions in detailed format
  -p2,        --report_alignment=2
                 report predictions in a simple format together with CIGAR-like
                 string for interaction structure
  -p3,        --report_alignment=3
                 report predictions in a simple format together with
                 binding site (3'->5'), flanking 5'end (3'->5') and
                 flanking 3'end (5'->3') sequences of the target
                 (required for post-processing of CRISPR off-target predictions)
  --noGUseed     consider G-U wobble pairs as mismatch within the seed
                 (only for locating seeds, energy model is not affected)
  --verbose      verbose output
------------------------------------------------------------------------------
---------------------------- EXPERIMENTAL OPTIONS ----------------------------
  -m c:p,     --mismatch=c:p
                 introduce mismatched seeds
                 Set the max num of mismatches (c) allowed in the seed and
                 min num of consecutive matches required at seed start/end (p)
                 ! These seeds will not overlap with perfect complementary seeds
                 (default -m 0:0  (no mismatch);
                 if you set c>0, please also set p>0 to avoid overlaps)
  -x <float>, --seed_energy=F
                 set energy per length threshold that filters seeds (default=0)

Installation is now complete and the tool is ready to use.

2 How to use it: the official example files

Like other alignment tools, RIsearch2 needs a prebuilt index of the target sequences, so we first look at how RIsearch2 creates the index file.

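2.1 Build the index structure for the target sequences

Target sequences are accepted only in FASTA format (optionally gzip-compressed) and are always read in the 5'-3' direction. The index that is built also includes their reverse complements. The following command builds the index for the target sequences, using the example file target.fa: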

risearch2.x -c target.fa -o target.suf

This creates the file target.suf.
Multiple input files are also supported:

cat target*.fa.gz | risearch2.x -c - -o target.suf
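Before building an index from several files, it can be useful to check how many sequences the inputs actually contain. The following is a small sketch, not part of RIsearch2: the helper name `count_fasta_records` and the demo sequences are made up, and it simply counts header lines starting with `>` in plain or gzipped FASTA files.

```shell
# Count FASTA records (header lines starting with '>') in plain or gzipped files.
# 'count_fasta_records' is an illustrative helper, not part of RIsearch2.
count_fasta_records() {
    # gzip -cdf decompresses .gz input and passes plain text through unchanged
    gzip -cdf "$@" | grep -c '^>'
}

# Demo on a throwaway two-record FASTA file (made-up sequences)
demo_dir=$(mktemp -d)
printf '>seq1\nACGUACGU\n>seq2\nGGCCUUAA\n' > "$demo_dir/demo.fa"
gzip -c "$demo_dir/demo.fa" > "$demo_dir/demo.fa.gz"

count_fasta_records "$demo_dir/demo.fa"
count_fasta_records "$demo_dir/demo.fa.gz"
```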

2.2 Predict interactions with RIsearch2

risearch2.x -q query.fa -i target.suf

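This produces the file risearch_query1.out.gz.

The next example sets the seed length to 7 (-s 7), the energy threshold to -13 kcal/mol (-e -13), and the maximum extension length to 5 (-l 5):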

risearch2.x -c target.fa -o target.suf
risearch2.x -q query.fa -i target.suf -s 7 -e -13 -l 5
zcat risearch_query1.out.gz
zcat risearch_query2.out.gz

The output of zcat risearch_query1.out.gz is:

query1  9       21      ENST00000436685 451     463     +       -20.19

The output of zcat risearch_query2.out.gz is:

query2  9       22      ENST00000534717 281     294     -       -14.54
query2  9       22      ENST00000436685 156     169     -       -14.54
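Result tables like these can be post-processed with standard shell tools. The sketch below recreates the two example files in a temporary directory, then keeps only hits at or below an energy cutoff and sorts them with the strongest (most negative) first. It assumes column 8 holds the interaction energy and columns 1, 4 and 7 hold the query ID, target ID and strand; the helper name `filter_hits` is hypothetical.

```shell
demo_dir=$(mktemp -d)

# Recreate the two result files shown above (rows copied from the example output).
printf 'query1\t9\t21\tENST00000436685\t451\t463\t+\t-20.19\n' \
    | gzip -c > "$demo_dir/risearch_query1.out.gz"
printf 'query2\t9\t22\tENST00000534717\t281\t294\t-\t-14.54\nquery2\t9\t22\tENST00000436685\t156\t169\t-\t-14.54\n' \
    | gzip -c > "$demo_dir/risearch_query2.out.gz"

# Keep hits whose energy (column 8, assumed) is at or below the cutoff,
# print query, target, strand, energy, and sort by energy ascending.
filter_hits() {
    cutoff=$1
    shift
    gzip -cd "$@" \
        | awk -v cutoff="$cutoff" '$8 + 0 <= cutoff + 0 { print $1, $4, $7, $8 }' \
        | sort -k4,4n
}

filter_hits -15 "$demo_dir"/risearch_query*.out.gz
```

With a cutoff of -15 only the query1 hit remains; a cutoff of -13 keeps all three rows.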

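In the table above, the columns are, from left to right: query ID, query start, query end, target ID, target start, target end, strand, and interaction energy (kcal/mol).

Adding -p reports each prediction in the detailed format: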

risearch2.x -q query.fa -i target.suf -s 7 -e -13 -l 5 -p
zcat risearch_query1.out.gz

The output looks like this:

UUGAGAGGG
::||:||::
ggcuuucuu
query1  11      19      ENST00000534717 371     379     -       -5.31
UUGAGAGGG
::||:||::
ggcuuucuu
query1  11      19      ENST00000436685 246     254     -       -5.31
UUGAGAGGGCGA
:|||||:|  ||
gacucuucaccu
query1  11      22      ENST00000534717 110     121     -       -8.89
UUGAGAGGGCGA
:|||||:|  ||
gacucuucaccu
query1  11      22      ENST00000436685 10      21      -       -8.89
UGGGUUGAGAGGG
::|:::||: |||
gucuggcuugccc
query1  7       19      ENST00000534717 4       16      +       -8.46
UGGGUUG
:|::::|
gcuuggc
query1  7       13      ENST00000534717 35      41      -       0.54
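In the detailed output, the middle line of each block describes the pairing position by position. Judging from the examples above (G-c and A-u are marked `|`, U-g and G-u are marked `:`, and G-a or C-c are left blank), `|` appears to denote a Watson-Crick pair, `:` a G-U wobble pair, and a space an unpaired position. This reading is inferred from the example, not quoted from the manual. Under that assumption, a short sketch for summarising a pairing line (the helper name `count_symbol` is made up):

```shell
# Count how often a single character occurs in a string.
# 'count_symbol' is an illustrative helper, not part of RIsearch2.
count_symbol() {
    printf '%s' "$1" | tr -cd "$2" | wc -c | tr -d ' '
}

# Pairing line of the query1 hit on ENST00000534717 (target 110-121) shown above
pairing=':|||||:|  ||'

echo "Watson-Crick pairs: $(count_symbol "$pairing" '|')"
echo "G-U wobble pairs:   $(count_symbol "$pairing" ':')"
echo "unpaired positions: $(count_symbol "$pairing" ' ')"
```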

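For a format that is easier to process (smaller files, simpler parsing) but still carries information about the binding site itself, use -p2, which adds an extra column with the compressed interaction structure to the default output table. This column is essentially a record of the second line of the detailed format, with the gap information encoded as letters. For example: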

risearch2.x -c target.fa -o target.suf
risearch2.x -q query.fa -i target.suf -s 7 -e -13 -l 5 -p2
zcat risearch_query1.out.gz
zcat risearch_query2.out.gz

The output of zcat risearch_query1.out.gz is:

query1  9       21      ENST00000436685 451     463     +       -20.19  PPUUPPPPPPPPP

The output of zcat risearch_query2.out.gz is:

query2  9       22      ENST00000534717 281     294     -       -14.54  PPUUPPPPWPPPWP
query2  9       22      ENST00000436685 156     169     -       -14.54  PPUUPPPPWPPPWP
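The interaction string in the last column is easy to summarise with standard tools. The sketch below tallies how often each letter occurs. In the query1 example, `P` lines up with Watson-Crick pairs and `U` with unpaired positions; reading `W` as a G-U wobble pair is an assumption here, so consult the manual for the authoritative letter codes. The helper name `tally_structure` is hypothetical.

```shell
# Tally the letters of a -p2 interaction string.
# 'tally_structure' is an illustrative helper, not part of RIsearch2.
tally_structure() {
    printf '%s\n' "$1" | fold -w 1 | sort | uniq -c \
        | awk '{ printf "%s%s=%d", sep, $2, $1; sep = " " } END { print "" }'
}

# Interaction strings copied from the example output above
tally_structure PPUUPPPPPPPPP
tally_structure PPUUPPPPWPPPWP
```

To apply it to a whole result file, the ninth column can be extracted first, for example with `awk '{ print $9 }'`.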

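The more detailed -p3 format additionally includes the sequence of the target interaction site and the sequences of its upstream and downstream regions. With -p3, the program outputs three more columns than the -p2 format: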

risearch2.x -c target.fa -o target.suf
risearch2.x -q query.fa -i target.suf -s 7 -e -13 -l 5 -p3
zcat risearch_query1.out.gz
zcat risearch_query2.out.gz

The output of zcat risearch_query1.out.gz is:

query1  9       21      ENST00000436685 451     463     +       -20.19  PPUUPPPPPPPPP   ccuucucucccgc   ccgagaccucgacucuacuu    ugcagccugggaacuucagc

The output of zcat risearch_query2.out.gz is:

query2  9       22      ENST00000534717 281     294     -       -14.54  PPUUPPPPWPPPWP  acgccggagagagg  augggccugugacuacagua    gucgaucauagucuuccugc
query2  9       22      ENST00000436685 156     169     -       -14.54  PPUUPPPPWPPPWP  acgccggagagagg  augggccugugacuacagua    gucgaucauagucuuccugc
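A quick consistency check on -p3 rows is to compare the length of the binding-site sequence with the length of the target interval. The sketch below assumes column 10 is the binding site and columns 5 and 6 are inclusive target start and end coordinates; the helper name `check_site_length` is made up, and the rows are copied from the example output above.

```shell
# Rows copied from the -p3 example above (whitespace-separated)
p3_rows='query1 9 21 ENST00000436685 451 463 + -20.19 PPUUPPPPPPPPP ccuucucucccgc ccgagaccucgacucuacuu ugcagccugggaacuucagc
query2 9 22 ENST00000534717 281 294 - -14.54 PPUUPPPPWPPPWP acgccggagagagg augggccugugacuacagua gucgaucauagucuuccugc'

# For each hit, compare the binding-site length (column 10, assumed) with the
# target interval length (columns 5-6, assumed inclusive coordinates).
# Prints: query, target, interval length, site length, status.
check_site_length() {
    awk '{
        span = $6 - $5 + 1
        status = (length($10) == span) ? "ok" : "MISMATCH"
        print $1, $4, span, length($10), status
    }'
}

printf '%s\n' "$p3_rows" | check_site_length
```

On real data the same function can read from `zcat risearch_query1.out.gz`.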

For more details, see the official manual, RIsearch2: User manual.
