Shallow Parsing of Text (Chunking)

2021-08-14  ltochange

Shallow parsing, also known as chunking, is a shallow analysis technique that sits between POS tagging and constituency parsing. It identifies the smallest phrase chunks in text, such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP).

Introduction

[Figure: NP-chunking of the sentence "We saw the yellow dog"]

For example, in the figure above, the noun phrase chunks (NP-chunks) are extracted from the text "We saw the yellow dog", yielding the shallow syntactic structure shown below.

[Figure: the resulting shallow parse structure]

In terms of how it is solved, chunking resembles named entity recognition (NER): both are sequence labeling problems. Common tagging schemes include BMES, BIO, and BIOE; each tag is combined with a chunk type X, so that, for example, B-NP marks the beginning of a noun phrase chunk.

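As an illustrative sketch (the input tree here is built by hand, not taken from the original post), NLTK's tree2conlltags converts a chunk tree into BIO-tagged (word, POS, chunk-tag) triples:

from nltk import Tree
from nltk.chunk import tree2conlltags

# build the chunk tree for "We saw the yellow dog" by hand
tree = Tree('S', [
    Tree('NP', [('We', 'PRP')]),
    ('saw', 'VBD'),
    Tree('NP', [('the', 'DT'), ('yellow', 'JJ'), ('dog', 'NN')]),
])
print(tree2conlltags(tree))
# [('We', 'PRP', 'B-NP'), ('saw', 'VBD', 'O'),
#  ('the', 'DT', 'B-NP'), ('yellow', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP')]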

Phrase chunks in a sentence generally fall into a few common types, such as NP (noun phrase), VP (verb phrase), PP (prepositional phrase), ADJP (adjective phrase), and ADVP (adverb phrase).

However, existing tools (spaCy, TextBlob, etc.) generally address only the NP-chunking task, extracting just the noun phrase chunks from a text sequence. The CoNLL-2000 chunking task covers NP, VP, and PP chunks, and a corresponding dataset is available.
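As a minimal sketch of NP-chunking with spaCy (assuming the en_core_web_sm model has been installed via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We saw the yellow dog")
for chunk in doc.noun_chunks:  # spaCy exposes noun chunks only
    print(chunk.text)
# We
# the yellow dog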

Practice

Either a rule-based approach or a machine-learning approach can be used.

Rule-based approach

The rule-based approach requires a hand-written chunking grammar, and nested structures need special care (see the sketch after the output below).

import nltk


def preprocess(doc):
    # sentence-split, tokenize, then POS-tag the raw text
    sentences = nltk.sent_tokenize(doc)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences


sentence = "The blogger taught the reader to chunk"
sentence = preprocess(sentence)
print(sentence)

grammar = "NP: {<DT>?<JJ>*<NN>}" 
# 匹配模式,限定词(0或1个) + 形容词(0个以上) + 名词
NPChunker = nltk.RegexpParser(grammar)
result = NPChunker.parse(sentence[0])
print(result)

Output:

[[('The', 'DT'), ('blogger', 'NN'), ('taught', 'VBD'), ('the', 'DT'), ('reader', 'NN'), ('to', 'TO'), ('chunk', 'VB')]]
(S
  (NP The/DT blogger/NN)
  taught/VBD
  (NP the/DT reader/NN)
  to/TO
  chunk/VB)
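As noted above, nesting needs special handling: RegexpParser applies its grammar stages in order, and a loop parameter reruns the cascade so that chunks produced by later stages can feed earlier ones. A sketch adapted from the NLTK book (the grammar and sentence here are illustrative):

grammar = r"""
  NP: {<DT|JJ|NN.*>+}           # chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}                # chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$}  # chunk verbs and their arguments
  CLAUSE: {<NP><VP>}            # chunk NP + VP into a clause
"""
cascade = nltk.RegexpParser(grammar, loop=2)
tagged = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
          ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cascade.parse(tagged))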

Machine-learning approach (maximum entropy classifier)

输入有两种形式,一是原始的文本,二是原始文本+词性标注(准确率相比前者会高很多)

Here we use the conll2000 corpus bundled with NLTK, downloadable with the command below, to train a maximum entropy classifier that automatically extracts the noun phrase (NP), verb phrase (VP), and prepositional phrase (PP) chunks from text:

import nltk
nltk.download("conll2000")

The code is as follows:

def tags_since_dt(sentence, i):
    # the set of POS tags seen since the most recent determiner (DT)
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))


def npchunk_features(sentence, i, history):
    # feature dict for token i: current, previous, and next word/POS
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
    if i == len(sentence) - 1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i + 1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevword": prevword,
            "nextword": nextword,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "prevpos+pos+nextpos": "%s+%s+%s" % (prevpos, pos, nextpos),
            "prevword+word+nextword": "%s+%s+%s" % (prevword, word, nextword),
            "tags-since-dt": tags_since_dt(sentence, i)}


class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        # build one (featureset, chunk-tag) training instance per token
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='IIS', trace=0)

    def tag(self, sentence):
        # greedy left-to-right tagging; history holds the chunk tags
        # predicted so far
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

# Chunker: converts between chunk trees and CoNLL-style (word, POS, chunk) triples
class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        # (word, POS) -> chunk tag
        # iob_tagged = tree2conlltags(chunked_sentence)
        # chunk_tree = conlltags2tree(iob_tagged)
        # len(conll2000.chunked_sents())  # 10948
        # len(conll2000.chunked_words())  # 166433
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        # tag a POS-tagged sentence, then turn the triples back into a tree
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)


from nltk.corpus import conll2000

# load the training and test data
train_sents = conll2000.chunked_sents('train.txt')
test_sents = conll2000.chunked_sents('test.txt')
# train the model
chunker = ConsecutiveNPChunker(train_sents)
# evaluate on the test set
print(chunker.evaluate(test_sents))


# save the model
import pickle
with open("chunker.bin", "wb") as f:
    pickle.dump(chunker, f)

# load the model
with open("chunker.bin", "rb") as f:
    chunker = pickle.load(f)
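# NOTE: pickle stores only a reference to the class, so ConsecutiveNPChunker
# (and the feature functions it uses) must be defined or importable in the
# process that loads chunker.bin.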

# sample test
sentence = 'It is the 2019 novel coronavirus that has breaks out worldwide.'
test_sent_words = nltk.word_tokenize(sentence)
test_sent_pos = nltk.pos_tag(test_sent_words)  # list of (word, POS) pairs
print(chunker.parse(test_sent_pos))

Output:

ChunkParse score:
    IOB Accuracy:  93.9%
    Precision:     89.0%
    Recall:        92.1%
    F-Measure:     90.5%
(S
  (NP It/PRP)
  (VP is/VBZ)
  (NP the/DT 2019/CD novel/NN coronavirus/NN)
  (NP that/WDT)
  (VP has/VBZ breaks/VBN)
  out/RP
  (NP worldwide/NN)
  ./.)
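To consume the result programmatically, you can walk the returned tree and pull out the chunks of a given type (an illustrative sketch):

tree = chunker.parse(test_sent_pos)
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(word for word, pos in subtree.leaves()))
# It
# the 2019 novel coronavirus
# that
# worldwide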