Neural Dependency Parsing
Preface
- There are currently two mainstream approaches to parsing: Constituency Parsing and Dependency Parsing. In this post we focus on Dependency Parsing
- For background, see 阿衡学姐's notes: Dependency Parsing and Treebank
- Stanford CS224N lecture notes: Dependency Parsing
- Basic paper: A Fast and Accurate Dependency Parser using Neural Networks, which is based on greedy transition-based parsing and combines word embeddings with neural networks
- Advanced papers:
  Globally Normalized Transition-Based Neural Networks
  Universal Dependencies: A cross-linguistic typology
  Incrementality in Deterministic Dependency Parsing
  The main improvements are: deeper neural networks, handling non-projectivity, and graph-based parsing
- Base project: Neural Dependency Parsing
Neural transition-based Dependency Parsing
Conventional transition-based Dependency Parsing
- Structure: a stack s, a buffer b, and a set A that records the dependency arcs found so far
- At each step, a discriminative classifier (such as an SVM) decides which transition to apply next (see the sketch below)
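A minimal sketch of this transition loop under the arc-standard system, assuming an unspecified classifier is supplied; the names `ParserState` and `greedy_parse` are illustrative, not from the paper.

```python
class ParserState:
    """Stack s, buffer b, and arc set A of a transition-based parser."""

    def __init__(self, words):
        self.stack = [0]                                # word indices; 0 is the ROOT token
        self.buffer = list(range(1, len(words) + 1))    # remaining input words
        self.arcs = []                                  # (head, dependent, label) triples

    def apply(self, transition, label=None):
        if transition == "SHIFT":
            # move the next buffer word onto the stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LEFT-ARC":
            # second-from-top becomes a dependent of the stack top and is removed
            dep = self.stack.pop(-2)
            self.arcs.append((self.stack[-1], dep, label))
        elif transition == "RIGHT-ARC":
            # stack top becomes a dependent of the word below it and is removed
            dep = self.stack.pop()
            self.arcs.append((self.stack[-1], dep, label))

    def is_final(self):
        return not self.buffer and len(self.stack) == 1


def greedy_parse(words, classifier):
    # classifier(state) -> (transition, label): an SVM in the conventional setup,
    # a neural network in the basic paper; here it is just a callable we assume exists
    state = ParserState(words)
    while not state.is_final():
        transition, label = classifier(state)
        state.apply(transition, label)
    return state.arcs
```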

Feature Extraction
- In the conventional approach, words, POS tags, and arc labels are represented as one-hot vectors, which are sparse and expensive to work with
- In the neural transition-based dependency parsing model, words, POS tags, and arc labels are represented as dense embeddings, with word embeddings initialized from pre-trained vectors (see the sketch below)
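A small sketch of what that representation change means in code: three dense embedding tables replace one-hot vectors, and the selected feature tokens are concatenated into one input vector. The sizes and names below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical vocabulary sizes and embedding dimension, for illustration only
vocab_size, tag_size, label_size, d = 10000, 45, 40, 50

rng = np.random.default_rng(0)
E_word = rng.normal(scale=0.01, size=(vocab_size, d))   # could be initialized from pre-trained vectors
E_tag = rng.normal(scale=0.01, size=(tag_size, d))      # POS tag embeddings, learned from scratch
E_label = rng.normal(scale=0.01, size=(label_size, d))  # arc label embeddings, learned from scratch

def embed(word_ids, tag_ids, label_ids):
    # look up every selected word / tag / label feature and concatenate
    # the dense rows into a single input vector for the network
    return np.concatenate([E_word[word_ids].ravel(),
                           E_tag[tag_ids].ravel(),
                           E_label[label_ids].ravel()])
```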

Choices of features
The choice of Sw, St, Sl
Following (Zhang and Nivre, 2011), we pick a rich set of elements for our final parser. In detail, Sw contains nw = 18 elements: (1) The top 3 words on the stack and buffer: s1, s2, s3, b1, b2, b3; (2) The first and second leftmost / rightmost children of the top two words on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2. (3) The leftmost of leftmost / rightmost of rightmost children of the top two words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2.
We use the corresponding POS tags for St (nt = 18), and the corresponding arc labels of words excluding those 6 words on the stack/buffer for Sl (nl = 12). A good advantage of our parser is that we can add a rich set of elements cheaply, instead of hand-crafting many more indicator features.
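A sketch of how the 18 word positions of Sw can be gathered from a parser state. It assumes the `ParserState` sketch above and two caller-supplied helpers, `leftmost_child(w, k)` and `rightmost_child(w, k)` (hypothetical names), which return the k-th leftmost / rightmost child of word w or `NULL` when it does not exist.

```python
NULL = -1  # placeholder index for a missing position

def nth_from_top(stack, i):
    """i-th item from the top of the stack, or NULL if the stack is too short."""
    return stack[-i] if len(stack) >= i else NULL

def word_features(state, leftmost_child, rightmost_child):
    # (1) top 3 words on the stack and buffer
    s1, s2, s3 = (nth_from_top(state.stack, i) for i in (1, 2, 3))
    b1, b2, b3 = (state.buffer + [NULL, NULL, NULL])[:3]
    feats = [s1, s2, s3, b1, b2, b3]
    for si in (s1, s2):
        # (2) first and second leftmost / rightmost children of s1 and s2
        lc1, lc2 = leftmost_child(si, 1), leftmost_child(si, 2)
        rc1, rc2 = rightmost_child(si, 1), rightmost_child(si, 2)
        # (3) leftmost of leftmost / rightmost of rightmost children
        feats += [lc1, rc1, lc2, rc2,
                  leftmost_child(lc1, 1), rightmost_child(rc1, 1)]
    return feats  # 6 + 2 * 6 = 18 word indices
```

The corresponding POS-tag features reuse the same 18 positions, and the 12 arc-label features drop the 6 stack/buffer words, as described above.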
Neural Networks
After the embedding lookup, we feed the concatenated input vector to a hidden layer with a novel cubic activation function f(x) = x^3, then use a softmax layer to predict the next transition.
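A minimal sketch of that forward pass, assuming the concatenated feature vector `x` comes from the embedding sketch above; the dimensions (48 features of 50-dim embeddings, 200 hidden units, 3 transitions for the unlabeled case) are illustrative.

```python
import numpy as np

d_input, d_hidden, n_transitions = 48 * 50, 200, 3  # illustrative shapes

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(d_hidden, d_input))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.01, size=(n_transitions, d_hidden))

def predict_transition(x):
    h = (W1 @ x + b1) ** 3                   # the novel cubic activation f(x) = x^3
    scores = W2 @ h
    probs = np.exp(scores - scores.max())    # numerically stable softmax
    return probs / probs.sum()               # distribution over the next transition
```

At parsing time, the highest-probability legal transition is applied greedily, which is what makes the parser both fast and accurate enough in practice.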
