NER: NER Datasets and SOTA Models
CoNLL2003
CoNLL++
This is a cleaner version of the CoNLL 2003 NER task, where about 5% of instances in the test set are corrected due to mislabelling. The training set is left untouched. Models are evaluated based on span-based F1 on the test set.
Model | F1 | Paper / Source | Code |
---|---|---|---|
CrossWeigh + Flair (Wang et al., 2019) | 94.28 | CrossWeigh: Training Named Entity Tagger from Imperfect Annotations | Official |
Flair embeddings (Akbik et al., 2018) | 93.89 | Contextual String Embeddings for Sequence Labeling | Flair framework |
BiLSTM-CRF+ELMo (Peters et al., 2018) | 93.42 | Deep contextualized word representations | AllenNLP Project, AllenNLP GitHub |
Ma and Hovy (2016) | 91.87 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | |
LSTM-CRF (Lample et al., 2016) | 91.47 | Neural Architectures for Named Entity Recognition | |
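The span-based F1 used to score the models above can be sketched as follows. This is a minimal illustration, not the official CoNLL evaluation script: it assumes predictions and gold labels are given as BIO tag sequences, and the helper names `bio_to_spans` and `span_f1` are hypothetical. A predicted entity counts as correct only if both its boundaries and its type match a gold entity exactly.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, type) spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        # An open span ends at "O", at any "B-", or at an "I-" of another type.
        if start is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != etype
        ):
            spans.add((start, i, etype))
            start, etype = None, None
        # Assumption: a stray "I-" with no open span starts a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match entity spans."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(span_f1(gold, pred))  # one of two spans matches on both sides
```

Because matching is done on whole spans, a prediction with the right boundaries but the wrong type (as with `B-ORG` versus `B-LOC` here) earns no partial credit.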
Ontonotes v5 (English)
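OntoNotes 5.0 consists of 1,745k of English, 900k of Chinese, and 300k of Arabic text data. Its sources are varied: telephone conversations, newswire, broadcast news, broadcast conversations, and weblogs. Entities are annotated with 18 categories, such as PERSON, ORGANIZATION, and LOCATION.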
OntoNotes annotates entities with inline markup. The idea is simple, XML-style tagging, for example:
<ENAMEX TYPE="ORG">Disney</ENAMEX> is a global brand .
It wraps each named entity in a tag and sets the corresponding entity type in the TYPE attribute.
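To make the markup concrete, here is a minimal sketch of converting one ENAMEX-annotated line into token-level BIO tags. It assumes straight double quotes around the TYPE value, non-nested tags, and whitespace-separated tokens; the function name `enamex_to_bio` is hypothetical and is not part of any OntoNotes tooling.

```python
import re
from typing import List, Tuple

# Matches one ENAMEX element; extra attributes after TYPE are tolerated.
ENAMEX = re.compile(r'<ENAMEX TYPE="([^"]+)"[^>]*>(.*?)</ENAMEX>')


def enamex_to_bio(line: str) -> List[Tuple[str, str]]:
    """Turn an ENAMEX-tagged line into (token, BIO tag) pairs."""
    pairs: List[Tuple[str, str]] = []
    pos = 0
    for m in ENAMEX.finditer(line):
        # Text before the entity is outside any span.
        pairs += [(tok, "O") for tok in line[pos:m.start()].split()]
        etype = m.group(1)
        # First token of the entity gets "B-", the rest get "I-".
        for k, tok in enumerate(m.group(2).split()):
            pairs.append((tok, ("B-" if k == 0 else "I-") + etype))
        pos = m.end()
    # Remaining text after the last entity.
    pairs += [(tok, "O") for tok in line[pos:].split()]
    return pairs


example = '<ENAMEX TYPE="ORG">Disney</ENAMEX> is a global brand .'
for token, tag in enamex_to_bio(example):
    print(token, tag)
```

For the example line, `Disney` is tagged `B-ORG` and every other token is tagged `O`.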
The OntoNotes corpus v5 is a richly annotated corpus with several layers of annotation, including named entities, coreference, part of speech, word sense, propositions, and syntactic parse trees. These annotations cover a large number of tokens, a broad cross-section of domains, and 3 languages (English, Arabic, and Chinese). The NER dataset (of interest here) includes 18 tags, consisting of 11 types (PERSON, ORGANIZATION, etc.) and 7 values (DATE, PERCENT, etc.), and contains 2 million tokens. The common data split used in NER is defined in Pradhan et al. (2013) and can be found here.
Reposted from: https://github.com/sebastianruder/NLP-progress/blob/master/english/named_entity_recognition.md