Motif Analysis with MEME

2020-03-29 bioYIYI

1 What is motif analysis

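Within homologous DNA or protein sequences, positions differ in how well conserved they are. In general, positions that strongly affect the function or structure of the DNA or protein are highly conserved, while other positions are less so. These conserved positions are called a "motif". Motifs were first discovered experimentally. The word motif describes a recurring pattern, and a sequence motif is typically a recurring pattern in DNA that is presumed to have a biological function. Motifs are often binding sites for sequence-specific proteins (such as transcription factors) or are involved in important biological processes (such as RNA initiation, RNA termination, and RNA splicing). More and more motifs are being identified; the TRANSFAC and JASPAR databases, for example, each hold a large collection of transcription-factor motifs.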
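To make the idea of "conserved positions" concrete, here is a minimal Python sketch that counts residues column by column over a set of aligned sites and reads off the consensus. The site strings are made up for illustration and do not come from any database; the function names are likewise hypothetical.

```python
from collections import Counter

def position_counts(sites):
    """Count residues column by column for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(width)]

def consensus(sites):
    """Most frequent residue per column (ties broken alphabetically)."""
    columns = position_counts(sites)
    return "".join(min(col, key=lambda r: (-col[r], r)) for col in columns)

# Hypothetical aligned sites, invented for this example only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
print(consensus(sites))           # consensus string
print(position_counts(sites)[2])  # residue counts at the third column
```

Columns where a single residue dominates are the conserved positions; columns with mixed counts are the variable ones.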

2 Software for motif analysis

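There are many tools for motif analysis; common ones include motif-x, MochiView, and CisGenome. Most of them are web-based, however, which makes batch analysis impossible and automation difficult. MEME is a classic motif-analysis package. Besides the online version it also has a Linux command-line version, and it works on DNA, RNA, and protein sequences. It offers several functions, including motif discovery, motif enrichment analysis, and motif comparison.
MEME website: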

2.1 How MEME works

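MEME is a toolkit that contains several programs. Within it, MEME itself is the program that discovers motifs; it does not allow gaps in a motif. MAST takes a motif found by MEME and searches for it in other sequences. It is a follow-up analysis to MEME and can be started either through the hyperlink shown when a MEME run finishes or from a saved MEME text-format output file. GLAM2 is similar to MEME but allows gaps in a motif. GLAM2SCAN is similar to MAST; MAST does not allow gaps in a motif, whereas GLAM2SCAN does. MEME is available as a web version and a Linux version. Web version address: . The overall design of the toolkit is shown below: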

[Figure: design of the MEME toolkit (image.png)]
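The "discover a motif, then look for it elsewhere" workflow can be sketched in a few lines of Python. This is not MAST's actual scoring method; it is a simplified illustration that assumes a uniform background, a fixed pseudocount, and made-up site strings, and it slides a log-odds matrix along a sequence without gaps.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Log-odds matrix from aligned sites, assuming a uniform background."""
    width = len(sites[0])
    background = 1.0 / len(alphabet)
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for i in range(width):
        column = [s[i] for s in sites]
        pwm.append({
            a: math.log2(((column.count(a) + pseudocount) / total) / background)
            for a in alphabet
        })
    return pwm

def best_hit(pwm, sequence):
    """Score every ungapped window; return (score, start) of the best one."""
    width = len(pwm)
    hits = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(hits)

# Hypothetical sites and target sequence, invented for this example only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
score, start = best_hit(pwm, "GGCGTATAATGCC")
print(start, round(score, 2))
```

Because every window has the same width as the matrix, this sketch shares MEME's and MAST's restriction that a motif contains no gaps.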

2.2 Running MEME

2.2.1 Usage example

meme test.fa -protein -oc result -nostatus -time 1800000 -mod zoops -nmotifs 3 -minw 6 -maxw 13 -objfun classic -markov_order 0
(These are the same parameters the web version uses.)
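Since the main reason to use the Linux version is batch processing, the command can be wrapped in a small Python driver. The sketch below assumes that `meme` is on the PATH and that the input files end in `.fa`; the function names and the one-output-directory-per-file layout are illustrative choices, and only flags from the command above are used.

```python
import shutil
import subprocess
from pathlib import Path

def build_meme_command(fasta, outdir, nmotifs=3, minw=6, maxw=13):
    """Assemble the same argument list as the example command line."""
    return [
        "meme", str(fasta), "-protein", "-oc", str(outdir), "-nostatus",
        "-time", "1800000", "-mod", "zoops", "-nmotifs", str(nmotifs),
        "-minw", str(minw), "-maxw", str(maxw),
        "-objfun", "classic", "-markov_order", "0",
    ]

def run_batch(fasta_dir, result_root):
    """Run MEME once per *.fa file; does nothing if meme is not installed."""
    if shutil.which("meme") is None:
        print("meme executable not found on PATH; nothing was run")
        return []
    Path(result_root).mkdir(parents=True, exist_ok=True)
    finished = []
    for fasta in sorted(Path(fasta_dir).glob("*.fa")):
        outdir = Path(result_root) / fasta.stem
        # check=True raises CalledProcessError if meme exits with an error.
        subprocess.run(build_meme_command(fasta, outdir), check=True)
        finished.append(outdir)
    return finished

# Dry run: show the command that would be executed for one file.
print(" ".join(build_meme_command("test.fa", "result")))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with unusual file names.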

2.2.2 Explanation of the options

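-protein the input sequences are protein sequences
-oc result output directory (an existing directory is replaced)
-nostatus do not print progress reports to the terminal
-time 1800000 stop before this many CPU seconds have been consumed
-mod zoops how motif sites are distributed among the sequences
· oops each motif occurs exactly once in every sequence. This is the fastest and most sensitive mode, but it can give incorrect results if not every sequence contains the motif.
· zoops each motif occurs at most once in every sequence and may be absent. This mode is fairly fast and slightly less sensitive.
· anr each motif may occur any number of times in a sequence. This is the slowest mode and may take more than ten times as long, but it can help when nothing is known about how the motif is distributed.
-nmotifs 3 maximum number of motifs to find
-minw 6 minimum motif width
-maxw 13 maximum motif width
-objfun classic objective function used for motif discovery
-markov_order 0 order of the background Markov model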
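The three -mod settings differ only in how many sites per sequence they permit. As a rough illustration (not part of MEME), the hypothetical helper below takes the number of motif sites observed in each sequence and reports which settings that pattern is consistent with.

```python
def compatible_models(sites_per_sequence):
    """Return the -mod values consistent with the given per-sequence site counts."""
    counts = list(sites_per_sequence)
    models = ["anr"]  # any number of repetitions fits every pattern
    if all(c <= 1 for c in counts):
        models.append("zoops")  # zero or one occurrence per sequence
    if all(c == 1 for c in counts):
        models.append("oops")   # exactly one occurrence per sequence
    return sorted(models)

print(compatible_models([1, 1, 1]))  # every sequence has exactly one site
print(compatible_models([1, 0, 1]))  # one sequence lacks the motif
print(compatible_models([2, 0, 1]))  # one sequence has a repeated site
```

When in doubt about whether every sequence carries the motif, zoops is the less risky default than oops, which is what the example command uses.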

2.2.3 Full parameter reference

Usage: meme <dataset> [optional arguments]
<dataset> file containing sequences in FASTA format
[-h] print this message
[-o <output dir>] name of directory for output files; will not replace existing directory
[-oc <output dir>] name of directory for output files; will replace existing directory
[-text] output in text format (default is HTML)
[-objfun classic|de|se|cd|ce] objective function (default: classic)
[-test mhg|mbn|mrs] statistical test type (default: mhg)
[-use_llr] use LLR in search for starts in Classic mode
[-neg <negdataset>] file containing control sequences
[-shuf <kmer>] preserve frequencies of k-mers of size <kmer> when shuffling (default: 2)
[-hsfrac <hsfrac>] fraction of primary sequences in holdout set (default: 0.5)
[-cefrac <cefrac>] fraction sequence length for CE region (default: 0.25)
[-searchsize <ssize>] maximum portion of primary dataset to use for motif search (in characters)
[-maxsize <maxsize>] maximum dataset size in characters
[-norand] do not randomize the order of the input sequences with -searchsize
[-csites <csites>] maximum number of sites for EM in Classic mode
[-seed <seed>] random seed for shuffling and sampling
[-dna] sequences use DNA alphabet
[-rna] sequences use RNA alphabet
[-protein] sequences use protein alphabet
[-alph <alph file>] sequences use custom alphabet
[-revcomp] allow sites on + or - DNA strands
[-pal] force palindromes (requires -dna)
[-mod oops|zoops|anr] distribution of motifs
[-nmotifs <nmotifs>] maximum number of motifs to find
[-evt <ev>] stop if motif E-value greater than <evt>
[-time <t>] quit before <t> CPU seconds consumed
[-nsites <sites>] number of sites for each motif
[-minsites <minsites>] minimum number of sites for each motif
[-maxsites <maxsites>] maximum number of sites for each motif
[-wnsites <wnsites>] weight on expected number of sites
[-w <w>] motif width
[-minw <minw>] minimum motif width
[-maxw <maxw>] maximum motif width
[-allw] test starts of all widths from minw to maxw
[-nomatrim] do not adjust motif width using multiple alignment
[-wg <wg>] gap opening cost for multiple alignments
[-ws <ws>] gap extension cost for multiple alignments
[-noendgaps] do not count end gaps in multiple alignments
[-bfile <bfile>] name of background Markov model file
[-markov_order <order>] (maximum) order of Markov model to use or create
[-psp <pspfile>] name of positional priors file
[-maxiter <maxiter>] maximum EM iterations to run
[-distance <distance>] EM convergence criterion
[-prior dirichlet|dmix|mega|megap|addone] type of prior to use
[-b <b>] strength of the prior
[-plib <plib>] name of Dirichlet prior file
[-spfuzz <spfuzz>] fuzziness of sequence to theta mapping
[-spmap uni|pam] starting point seq to theta mapping type
[-cons <cons>] consensus sequence to start EM from
[-brief <n>] omit sites and sequence tables in output if more than <n> primary sequences
[-nostatus] do not print progress reports to terminal
[-p <np>] use parallel version with <np> processors
[-sf <sf>] print <sf> as name of sequence file
[-V] verbose mode
[-version] display the version number and exit

2.2.4 Output files

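meme.html - interactive, human-readable results in HTML format
meme.txt - plain-text results, compatible with earlier MEME versions
meme.xml - results in XML format, designed for machine processing
logoN.png, logoN.eps - motif logo files in PNG and EPS format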

[Figure: example motif logo (image.png)]

Note: the size of each amino-acid letter reflects how frequently that amino acid occurs at that position.
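The relationship between frequency and letter size can be sketched as follows. This assumes the usual sequence-logo convention, in which the total height of a column is its information content in bits and each letter takes a share proportional to its frequency. It ignores any small-sample correction a real logo generator may apply, and the example columns are made up.

```python
import math
from collections import Counter

def letter_heights(column, alphabet_size=20):
    """Height in bits of each residue in one logo column (no small-sample correction)."""
    counts = Counter(column)
    total = len(column)
    freqs = {r: n / total for r, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    information = math.log2(alphabet_size) - entropy
    return {r: p * information for r, p in freqs.items()}

# Hypothetical columns taken across aligned protein sites, invented for illustration.
conserved = "CCCCCCCC"
mixed = "CCCCSSAT"
print(letter_heights(conserved))
print(letter_heights(mixed))
```

A fully conserved protein column reaches the maximum of log2(20), about 4.32 bits, while a mixed column is shorter overall and its most frequent residue is drawn tallest.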

2.3 Notes

a) MEME does not support gaps within a motif.
b) Motif discovery under Linux uses the same parameters as the web version of MEME.

2.4 Citation

Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.

Original content. If you found it helpful, please leave a like~