2020-10-08

By 小小程序员一枚

Chinese word segmentation library jieba: https://github.com/fxsjy/jieba

CoNLL2003 corpus homepage: https://www.clips.uantwerpen.be/conll2003/ner/

CoNLL2003 is the most widely used public dataset for named entity recognition; see the site above for details.
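Each non-empty line of the CoNLL2003 files holds a token followed by its POS, chunk, and NER tags, with blank lines separating sentences. A minimal parser sketch (the function name is my own; the sample sentence is the well-known first sentence of eng.train):

```python
sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O"""

def parse_conll(text):
    # Each non-empty line: token POS chunk NER (space-separated).
    sentence = []
    for line in text.splitlines():
        if line.strip():
            token, pos, chunk, ner = line.split()
            sentence.append((token, ner))
    return sentence

print(parse_conll(sample))
# [('EU', 'B-ORG'), ('rejects', 'O'), ('German', 'B-MISC'), ('call', 'O'), ('.', 'O')]
```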

Basic usage of the flair framework:

from flair.data import Sentence
from flair.models import SequenceTagger

sentence = Sentence('I love Berlin,the capital of Germany.')

tagger = SequenceTagger.load('ner')
tagger.predict(sentence)

print(sentence)
print('The following NER tags are found:')

for entity in sentence.get_spans('ner'):
    print(entity)

Reflection: why was 'Berlin,the' recognized as a single LOCATION entity????
I see now: because there is no space after the comma, the tokenizer cannot split 'Berlin,the' into two words, so it is treated as one token and tagged as a location.
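The tokenization issue above can be reproduced with a plain whitespace tokenizer (a minimal illustration with my own function names, not flair's actual tokenizer): splitting on whitespace alone leaves 'Berlin,the' as one token, while a punctuation-aware split separates the comma.

```python
import re

def whitespace_tokenize(text):
    # Split on whitespace only -- punctuation stays glued to words.
    return text.split()

def punct_aware_tokenize(text):
    # Words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

s = 'I love Berlin,the capital of Germany.'
print(whitespace_tokenize(s))   # 'Berlin,the' is a single token
print(punct_aware_tokenize(s))  # 'Berlin', ',', 'the' are separate
```

Adding a space after the comma in the input sentence avoids the problem as well.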

Trained and tested on the CoNLL2003 dataset, the flair framework reaches an F1 score of 93.16. The training code follows.
(Note: the CoNLL2003 corpus files must be downloaded beforehand and placed under the resources/tasks folder:)

resources/tasks/conll_03/eng.testa
resources/tasks/conll_03/eng.testb
resources/tasks/conll_03/eng.train

from flair.data import Corpus
from flair.datasets import CONLL_03
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, PooledFlairEmbeddings
from typing import List

# 1. get the corpus
corpus: Corpus = CONLL_03(base_path='resources/tasks')

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [

    # GloVe embeddings
    WordEmbeddings('glove'),

    # contextual string embeddings, forward
    PooledFlairEmbeddings('news-forward', pooling='min'),

    # contextual string embeddings, backward
    PooledFlairEmbeddings('news-backward', pooling='min'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)

# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-ner',
              train_with_dev=True,  
              max_epochs=150)

Paper reading

1.Pooled Contextualized Embeddings for Named Entity Recognition

This paper proposes pooled contextualized embeddings. As shown in Figure 1 above, Indra is a rare word, so the contextual string embedding mistakenly tagged it as ORG when it should be PER. To address this, the authors propose dynamically aggregating contextualized embeddings: for each rare word encountered (e.g. Indra), a pooling operation extracts a global word representation from all of that word's contextual instances; the process is shown in Figure 2 above.
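The aggregation idea can be sketched in a few lines (an illustrative toy with my own names and 2-dimensional vectors, not the paper's implementation): keep a memory of every contextual embedding seen for a word, pool element-wise over that memory, and concatenate the pooled vector with the current one. This matches the `pooling='min'` option used in the training code above.

```python
def pooled_embedding(memory, word, vec, pooling=min):
    # Remember this contextual embedding for the word.
    memory.setdefault(word, []).append(vec)
    # Element-wise pooling (min/max) over all stored instances of the word.
    pooled = [pooling(dims) for dims in zip(*memory[word])]
    # Final representation: current embedding concatenated with pooled memory.
    return vec + pooled

memory = {}
v1 = pooled_embedding(memory, "Indra", [0.2, 0.8])
v2 = pooled_embedding(memory, "Indra", [0.5, 0.1])
print(v2)  # [0.5, 0.1, 0.2, 0.1] -- element-wise min over both instances
```

On the second occurrence the pooled part [0.2, 0.1] carries information from the earlier context, which is what lets the model recover the correct PER tag for a rare word.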

To be continued...
