n-gram (n元语法)

2017-04-18 奔向超级开发者xAI

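If you are new to NLP, or have not even started yet, n-gram is a concept you will keep running into when searching for material, and the bigram is the most common case of all. Understanding it up front saves a lot of trouble~
Wikipedia's definition: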

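An n-gram (n元语法) is a sequence of n consecutive words in a text. An n-gram model is a probabilistic language model based on an (n-1)-order Markov chain; it infers the structure of a sentence from the probabilities of n-word sequences.
For n = 1, 2 and 3 these are called the unigram, the bigram and the trigram respectively.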
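To spell out what "(n-1)-order Markov chain" means here, the following is a short sketch in my own notation (w_i for the i-th word and m for the sentence length are assumptions, not part of the quoted definition). The chain rule factorizes the sentence probability exactly; the n-gram model then approximates each factor by keeping only the previous n-1 words as history:

```latex
P(w_1, \dots, w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

For a bigram model (n = 2) each factor reduces to P(w_i | w_{i-1}). Positions before the first word are usually filled with a padding symbol.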

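So the concept itself is very simple: list every run of n consecutive words in the text.
Example:
Text: 我是一个好人 ("I am a good person")
Segment it into words first: 我 / 是 / 一个 / 好人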
unigram:
我
是
一个
好人

bigram:
我 是
是 一个
一个 好人

trigram:
我 是 一个
是 一个 好人
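The enumeration above can be reproduced in a few lines of Python. This is only a minimal sketch: `simple_ngrams` is a hypothetical helper name of mine, not a library function, and it assumes the text has already been segmented into a list of tokens.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (no padding)."""
    # zip over n staggered views of the list; zip stops at the shortest view,
    # so incomplete windows at the end are dropped automatically.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["我", "是", "一个", "好人"]  # the segmented example sentence
example_unigrams = simple_ngrams(tokens, 1)
example_bigrams = simple_ngrams(tokens, 2)
example_trigrams = simple_ngrams(tokens, 3)

print(example_bigrams)
# [('我', '是'), ('是', '一个'), ('一个', '好人')]
```

A sentence of 4 tokens yields 4 unigrams, 3 bigrams and 2 trigrams; in general a sequence of length m yields m - n + 1 n-grams when no padding is used.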

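You may ask: what about the end of the text, where fewer than n words remain? Without padding, extraction simply stops at the last complete window. If you want n-grams that cover the boundaries as well, you have to decide whether to pad the sequence on the left, on the right, or on both sides.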
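To make the effect of padding concrete, here is a small sketch on the example sentence. `padded_ngrams` and the `<s>` / `</s>` symbols are illustrative choices of mine; the sketch assumes that n-1 padding symbols are added on each padded side.

```python
from itertools import chain


def padded_ngrams(tokens, n, pad_left=False, pad_right=False,
                  left_symbol="<s>", right_symbol="</s>"):
    """Hypothetical helper: n-grams over a sequence padded with n-1 symbols."""
    padded = list(chain(
        [left_symbol] * (n - 1) if pad_left else [],
        tokens,
        [right_symbol] * (n - 1) if pad_right else [],
    ))
    # Slide a window of width n over the padded list.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


tokens = ["我", "是", "一个", "好人"]
right_padded = padded_ngrams(tokens, 3, pad_right=True)
both_padded = padded_ngrams(tokens, 2, pad_left=True, pad_right=True)

print(right_padded[-1])  # ('好人', '</s>', '</s>')
print(both_padded[0])    # ('<s>', '我')
```

Right padding adds the trigrams that start at the last two words, and left padding adds the n-grams that end at the first words, so every word appears in every position of some n-gram.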
NLTK's implementation is quite good and worth reading, so the relevant code is excerpted here.

from itertools import chain


# Pads the sequence before n-gram extraction
def pad_sequence(sequence, n, pad_left=False, pad_right=False,
                 left_pad_symbol=None, right_pad_symbol=None):
    """
    Returns a padded sequence of items before ngram extraction.
        >>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
        ['<s>', 1, 2, 3, 4, 5, '</s>']
        >>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
        ['<s>', 1, 2, 3, 4, 5]
        >>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
        [1, 2, 3, 4, 5, '</s>']

    :param sequence: the source data to be padded
    :type sequence: sequence or iter
    :param n: the degree of the ngrams
    :type n: int
    :param pad_left: whether the ngrams should be left-padded
    :type pad_left: bool
    :param pad_right: whether the ngrams should be right-padded
    :type pad_right: bool
    :param left_pad_symbol: the symbol to use for left padding (default is None)
    :type left_pad_symbol: any
    :param right_pad_symbol: the symbol to use for right padding (default is None)
    :type right_pad_symbol: any
    :rtype: sequence or iter
    """
    sequence = iter(sequence)
    if pad_left:
        # n-1 padding symbols are prepended, so the first item can end an n-gram
        sequence = chain((left_pad_symbol,) * (n - 1), sequence)
    if pad_right:
        # n-1 padding symbols are appended, so the last item can start an n-gram
        sequence = chain(sequence, (right_pad_symbol,) * (n - 1))
    return sequence



def ngrams(sequence, n, pad_left=False, pad_right=False,
       left_pad_symbol=None, right_pad_symbol=None):
    """
    Return the ngrams generated from a sequence of items, as an iterator.
    For example:
        >>> from nltk.util import ngrams
        >>> list(ngrams([1,2,3,4,5], 3))
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    Wrap with list for a list version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:
        >>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
        >>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
        >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
        [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
        >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
        [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
    :param sequence: the source data to be converted into ngrams
    :type sequence: sequence or iter
    :param n: the degree of the ngrams
    :type n: int
    :param pad_left: whether the ngrams should be left-padded
    :type pad_left: bool
    :param pad_right: whether the ngrams should be right-padded
    :type pad_right: bool
    :param left_pad_symbol: the symbol to use for left padding (default is None)
    :type left_pad_symbol: any
    :param right_pad_symbol: the symbol to use for right padding (default is None)
    :type right_pad_symbol: any
    :rtype: sequence or iter
    """
    sequence = pad_sequence(sequence, n, pad_left, pad_right,
                        left_pad_symbol, right_pad_symbol)

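    history = []
    while n > 1:
        # PEP 479: a StopIteration escaping a generator becomes a RuntimeError,
        # so end the generator cleanly if fewer than n-1 items are available.
        try:
            next_item = next(sequence)
        except StopIteration:
            return
        history.append(next_item)
        n -= 1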
    for item in sequence:
        history.append(item)
        yield tuple(history)
        del history[0]


def bigrams(sequence, **kwargs):
    """
    Return the bigrams generated from a sequence of items, as an iterator.
    For example:
        >>> from nltk.util import bigrams
        >>> list(bigrams([1,2,3,4,5]))
        [(1, 2), (2, 3), (3, 4), (4, 5)]
    Wrap with list for a list version of this function.
    :param sequence: the source data to be converted into bigrams
    :type sequence: sequence or iter
    :rtype: iter(tuple)
    """

    for item in ngrams(sequence, 2, **kwargs):
        yield item

def trigrams(sequence, **kwargs):
    """
    Return the trigrams generated from a sequence of items, as an iterator.
    For example:
        >>> from nltk.util import trigrams
        >>> list(trigrams([1,2,3,4,5]))
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    Wrap with list for a list version of this function.
    :param sequence: the source data to be converted into trigrams
    :type sequence: sequence or iter
    :rtype: iter(tuple)
    """

    for item in ngrams(sequence, 3, **kwargs):
        yield item
