2020-10-08

By 小小程序员一枚

Chinese word segmentation library jieba: https://github.com/fxsjy/jieba

CoNLL2003 corpus homepage: https://www.clips.uantwerpen.be/conll2003/ner/

CoNLL2003 is the most widely used public dataset for named entity recognition; see the site above for details.
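Each non-empty line of the CoNLL2003 files holds a token followed by its POS, chunk, and NER tags, with blank lines separating sentences. A minimal parser sketch (the function name is my own; the sample sentence is the well-known first sentence of eng.train):

```python
sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O"""

def parse_conll(text):
    # Each non-empty line: token POS chunk NER (space-separated).
    sentence = []
    for line in text.splitlines():
        if line.strip():
            token, pos, chunk, ner = line.split()
            sentence.append((token, ner))
    return sentence

print(parse_conll(sample))
# [('EU', 'B-ORG'), ('rejects', 'O'), ('German', 'B-MISC'), ('call', 'O'), ('.', 'O')]
```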

Basic usage of the flair framework:

from flair.data import Sentence
from flair.models import SequenceTagger

sentence = Sentence('I love Berlin,the capital of Germany.')

tagger = SequenceTagger.load('ner')
tagger.predict(sentence)

print(sentence)
print('The following NER tags are found:')

for entity in sentence.get_spans('ner'):
    print(entity)

Reflection: why was 'Berlin,the' recognized as a single LOCATION entity????
I see now: because there is no space after the comma, the tokenizer cannot split 'Berlin,the' into two words, so it is treated as one token and tagged as a location.
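The tokenization issue above can be reproduced with a plain whitespace tokenizer (a minimal illustration with my own function names, not flair's actual tokenizer): splitting on whitespace alone leaves 'Berlin,the' as one token, while a punctuation-aware split separates the comma.

```python
import re

def whitespace_tokenize(text):
    # Split on whitespace only -- punctuation stays glued to words.
    return text.split()

def punct_aware_tokenize(text):
    # Words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

s = 'I love Berlin,the capital of Germany.'
print(whitespace_tokenize(s))   # 'Berlin,the' is a single token
print(punct_aware_tokenize(s))  # 'Berlin', ',', 'the' are separate
```

Adding a space after the comma in the input sentence avoids the problem as well.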

Trained and tested on the CoNLL2003 dataset, the flair framework reaches an F1 score of 93.16. The training code follows.
(Note: the CoNLL2003 corpus files must be downloaded beforehand and placed under the resources/tasks folder:)

resources/tasks/conll_03/eng.testa
resources/tasks/conll_03/eng.testb
resources/tasks/conll_03/eng.train

from flair.data import Corpus
from flair.datasets import CONLL_03
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, PooledFlairEmbeddings
from typing import List

# 1. get the corpus
corpus: Corpus = CONLL_03(base_path='resources/tasks')

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [

    # GloVe embeddings
    WordEmbeddings('glove'),

    # contextual string embeddings, forward
    PooledFlairEmbeddings('news-forward', pooling='min'),

    # contextual string embeddings, backward
    PooledFlairEmbeddings('news-backward', pooling='min'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)

# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-ner',
              train_with_dev=True,  
              max_epochs=150)

Paper reading

1.Pooled Contextualized Embeddings for Named Entity Recognition

This paper proposes pooled contextualized embeddings. As shown in Figure 1 above, Indra is a rare word, so the contextual string embedding mistakenly tagged it as ORG when it should be PER. To address this, the authors propose dynamically aggregating contextualized embeddings: for each rare word encountered (e.g. Indra), a pooling operation extracts a global word representation from all of that word's contextual instances; the process is shown in Figure 2 above.
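The aggregation idea can be sketched in a few lines (an illustrative toy with my own names and 2-dimensional vectors, not the paper's implementation): keep a memory of every contextual embedding seen for a word, pool element-wise over that memory, and concatenate the pooled vector with the current one. This matches the `pooling='min'` option used in the training code above.

```python
def pooled_embedding(memory, word, vec, pooling=min):
    # Remember this contextual embedding for the word.
    memory.setdefault(word, []).append(vec)
    # Element-wise pooling (min/max) over all stored instances of the word.
    pooled = [pooling(dims) for dims in zip(*memory[word])]
    # Final representation: current embedding concatenated with pooled memory.
    return vec + pooled

memory = {}
v1 = pooled_embedding(memory, "Indra", [0.2, 0.8])
v2 = pooled_embedding(memory, "Indra", [0.5, 0.1])
print(v2)  # [0.5, 0.1, 0.2, 0.1] -- element-wise min over both instances
```

On the second occurrence the pooled part [0.2, 0.1] carries information from the earlier context, which is what lets the model recover the correct PER tag for a rare word.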

To be continued...
