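ACE 2005 Corpus: Event Preprocessing (English)
ACE 2005 Corpus
Note: the ACE 2005 corpus is not available as a free download; it has to be purchased.
Events (English)
Event extraction mainly depends on: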
- tokenizer
- entity
- event
Preprocessing the English event samples therefore means extracting the three kinds of data listed above.
sample.json
[
{
"sentence": "He visited all his friends.",
"tokens": ["He", "visited", "all", "his", "friends", "."],
"pos-tag": ["PRP", "VBD", "PDT", "PRP$", "NNS", "."],
"golden-entity-mentions": [
{
"text": "He",
"entity-type": "PER:Individual",
"start": 0,
"end": 0
},
{
"text": "his",
"entity-type": "PER:Group",
"start": 3,
"end": 3
},
{
"text": "all his friends",
"entity-type": "PER:Group",
"start": 2,
"end": 5
}
],
"golden-event-mentions": [
{
"trigger": {
"text": "visited",
"start": 1,
"end": 1
},
"arguments": [
{
"role": "Entity",
"entity-type": "PER:Individual",
"text": "He",
"start": 0,
"end": 0
},
{
"role": "Entity",
"entity-type": "PER:Group",
"text": "all his friends",
"start": 2,
"end": 5
}
],
"event_type": "Contact:Meet"
}
],
"parse": "(ROOT\n (S\n (NP (PRP He))\n (VP (VBD visited)\n (NP (PDT all) (PRP$ his) (NNS friends)))\n (. .)))"
}
]
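To make the record layout concrete, here is a minimal sketch of reading such a file and listing each event's type, trigger and arguments. The sample is embedded as a string (with the entity mentions and parse omitted) so the snippet runs on its own; `summarize_events` is a hypothetical helper name, not part of the repository. It reads only the "text" fields and does not interpret the start/end offsets.

```python
import json

# A trimmed copy of the sample.json record shown above.
SAMPLE = """
[
  {
    "sentence": "He visited all his friends.",
    "tokens": ["He", "visited", "all", "his", "friends", "."],
    "pos-tag": ["PRP", "VBD", "PDT", "PRP$", "NNS", "."],
    "golden-event-mentions": [
      {
        "trigger": {"text": "visited", "start": 1, "end": 1},
        "arguments": [
          {"role": "Entity", "entity-type": "PER:Individual",
           "text": "He", "start": 0, "end": 0},
          {"role": "Entity", "entity-type": "PER:Group",
           "text": "all his friends", "start": 2, "end": 5}
        ],
        "event_type": "Contact:Meet"
      }
    ]
  }
]
"""


def summarize_events(records):
    """Return one (event_type, trigger_text, [(role, argument_text), ...]) tuple per event mention."""
    summaries = []
    for record in records:
        for event in record["golden-event-mentions"]:
            arguments = [(arg["role"], arg["text"]) for arg in event["arguments"]]
            summaries.append((event["event_type"], event["trigger"]["text"], arguments))
    return summaries


records = json.loads(SAMPLE)
for event_type, trigger, arguments in summarize_events(records):
    print(event_type, trigger, arguments)
```

With a real output file, `json.loads(SAMPLE)` would be replaced by `json.load` on the opened file.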
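Parsing code
github: https://github.com/nlpcl-lab/ace2005-preprocessing
For how to run it and which dependencies it needs, see the "README.md" in that repository. In practice, however, the following problems still come up.
Environment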
- python3 >= 3.7
- nltk
- Stanford CoreNLP
nltk
pip install nltk
At run time, however, NLTK reports the error "Resource punkt not found.".
Automatic installation:
import nltk
nltk.download('punkt')
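If the download should happen only when the resource is actually missing, the check can be sketched as below. `ensure_nltk_resource` is a hypothetical helper; it relies on `nltk.data.find` raising `LookupError` for a missing resource, and the `find` and `download` parameters exist only so the logic can be tried without NLTK or network access.

```python
def ensure_nltk_resource(resource_path, package, find=None, download=None):
    """Download `package` only if `resource_path` cannot be found locally.

    `find` and `download` default to nltk.data.find and nltk.download.
    Returns True if a download was triggered, False if the resource was present.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper has no hard dependency

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)
        return False
    except LookupError:
        # The resource is not on any NLTK search path, so fetch it.
        download(package)
        return True
```

For the error above, the call would be `ensure_nltk_resource("tokenizers/punkt", "punkt")`.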
Manual installation:
The NLTK documentation describes it as follows:
Create a folder nltk_data, e.g. C:\nltk_data, or /usr/local/share/nltk_data, and subfolders chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, tokenizers.
Download individual packages from http://nltk.org/nltk_data/ (see the “download” links). Unzip them to the appropriate subfolder. For example, the Brown Corpus, found at: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip is to be unzipped to nltk_data/corpora/brown.
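Concrete steps:
- Download punkt from http://www.nltk.org/nltk_data/
- Under C:\nltk_data or /usr/local/share/nltk_data, create a tokenizers folder, then unzip the punkt package downloaded in the previous step and place it in tokenizers. The resulting directory layout is: your-path/nltk_data/tokenizers/punkt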
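The manual steps can also be scripted with the standard library. This is a minimal sketch, assuming the downloaded archive contains a top-level punkt/ folder; `install_punkt_from_zip` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def install_punkt_from_zip(zip_path, nltk_data_root):
    """Unzip a downloaded punkt archive into <nltk_data_root>/tokenizers.

    Returns the resulting <nltk_data_root>/tokenizers/punkt folder.
    """
    tokenizers_dir = Path(nltk_data_root) / "tokenizers"
    tokenizers_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(tokenizers_dir)
    punkt_dir = tokenizers_dir / "punkt"
    if not punkt_dir.is_dir():
        # Assumption violated: the archive did not contain a punkt/ folder.
        raise FileNotFoundError(f"expected {punkt_dir} after unzipping {zip_path}")
    return punkt_dir
```

If the chosen root is not one of the locations NLTK searches by default, appending it to the `nltk.data.path` list before tokenizing should make the resource visible.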
Stanford CoreNLP installation
pip install stanfordcorenlp
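Then download the resource package from http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip and unzip it:
unzip stanford-corenlp-full-2018-10-05.zip
Put the unzipped package in a suitable directory, then pass that directory's path to the code at the point of use:
with StanfordCoreNLP('your-path/stanford-corenlp-full-2018-10-05', memory='8g', timeout=60000) as nlp: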
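As an illustration of what happens inside that `with` block, the sketch below collects the tokens, POS tags and constituency parse that appear in sample.json. It assumes the `stanfordcorenlp` wrapper's `word_tokenize`, `pos_tag` and `parse` methods; `annotate_sentence` and `annotate_with_corenlp` are hypothetical helpers and not necessarily how the repository structures its code.

```python
def annotate_sentence(nlp, sentence):
    """Collect the tokenizer, POS and parse fields used in sample.json.

    `nlp` is expected to be a stanfordcorenlp.StanfordCoreNLP instance, or any
    object exposing word_tokenize, pos_tag and parse with the same behaviour.
    """
    tagged = nlp.pos_tag(sentence)  # list of (token, tag) pairs
    return {
        "sentence": sentence,
        "tokens": nlp.word_tokenize(sentence),
        "pos-tag": [tag for _, tag in tagged],
        "parse": nlp.parse(sentence),
    }


def annotate_with_corenlp(corenlp_dir, sentence):
    """Start CoreNLP from the unzipped directory and annotate one sentence."""
    # Needs Java and the unzipped resource package on disk.
    from stanfordcorenlp import StanfordCoreNLP

    with StanfordCoreNLP(corenlp_dir, memory="8g", timeout=60000) as nlp:
        return annotate_sentence(nlp, sentence)
```

A call such as `annotate_with_corenlp('your-path/stanford-corenlp-full-2018-10-05', 'He visited all his friends.')` would start the Java server, so it is left out of the snippet itself.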