Pointer Networks

2019-07-12  MasterXiong

Pointer Networks

Oriol Vinyals, Meire Fortunato, Navdeep Jaitly
Google, Berkeley
NIPS 2015

Introduction

The key motivation and contribution of this work is the pointer network, a framework that can solve discrete mapping problems whose output dictionary size varies across instances.

Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention.

The method is to

use attention as a pointer to select a member of the input sequence as the output.

This work provides a new approach to solving discrete optimization problems with sequential models.

Problem Setup

This paper focuses on a specific type of sequence-to-sequence task in a supervised learning setting. In the training data, the inputs are planar point sets \mathcal{P} = \{ P_1, \cdots , P_n \} with n elements each, where P_i = (x_i, y_i) are the Cartesian coordinates of the points. The training instances are sampled from a uniform distribution over [0, 1] \times [0, 1]. The outputs \mathcal{C}^{\mathcal{P}} = \{ C_1, \cdots , C_{m(\mathcal{P})} \} are sequences of point indices representing the solution associated with the point set \mathcal{P}.
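For concreteness, here is a minimal sketch of how such a training instance could be generated for the convex-hull task; `sample_instance` and the use of `scipy.spatial.ConvexHull` as a labelling oracle are my own assumptions, not the paper's data pipeline:

```python
import numpy as np
from scipy.spatial import ConvexHull  # assumed labelling helper, not the paper's code

def sample_instance(n, seed=None):
    """Sample n points uniformly in the unit square and label them with
    the indices of their convex hull (counter-clockwise order)."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(0.0, 1.0, size=(n, 2))    # P_i = (x_i, y_i)
    hull_indices = ConvexHull(points).vertices     # target sequence C^P as point indices
    return points, hull_indices.tolist()

points, target = sample_instance(10)
print(points.shape, target)   # (10, 2) and e.g. [2, 7, 0, 4, 9]
```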

Models

Sequence-to-Sequence Model

This model learns the parameters \theta of an encoder-decoder to maximize the conditional probability of the output sequence on the training samples:
\theta^* = \mathop{\arg\max}_{\theta} \sum_{\mathcal{P}, \mathcal{C}^{\mathcal{P}}} \log{p(\mathcal{C}^{\mathcal{P}} | \mathcal{P}; \theta)} ,
where
p(\mathcal{C}^{\mathcal{P}} | \mathcal{P}; \theta) = \prod_{i = 1}^{m(\mathcal{P})} p(C_i | C_1, \cdots , C_{i - 1}, \mathcal{P}; \theta) .
Note that this model makes no statistical independence assumptions: each output symbol is conditioned on the entire prefix, so it is not a Markov chain.
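In practice this objective is just teacher-forced maximum likelihood: run the decoder on the ground-truth prefix and sum the per-step log-probabilities. A minimal PyTorch sketch of that loss (the shapes and names here are my own, not the paper's code):

```python
import torch
import torch.nn.functional as F

def sequence_nll(step_logits, target_indices):
    """Negative log-likelihood of one output sequence under the factorized model.

    step_logits:    (m, n) unnormalized scores over the n candidate symbols at each of m steps
    target_indices: (m,)   ground-truth symbol C_i at each step
    """
    log_probs = F.log_softmax(step_logits, dim=-1)   # log p(C_i | C_<i, P)
    picked = log_probs[torch.arange(target_indices.numel()), target_indices]
    return -picked.sum()                             # minimizing this maximizes the log-likelihood
```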

During inference, since the number of possible output sequences grows exponentially with the sequence length, beam search is used to find the most probable sequence.
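A rough sketch of such a beam search over a generic scoring function (everything here, including the `step` callback that returns next-symbol log-probabilities, is a placeholder of my own, not the paper's decoder):

```python
import numpy as np

def beam_search(step, n_symbols, max_len, beam_size=3):
    """step(prefix) -> length-n_symbols array of log p(next symbol | prefix, P).
    Keeps only the beam_size highest-scoring prefixes at each length."""
    beams = [([], 0.0)]                                   # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step(prefix)
            for sym in range(n_symbols):
                candidates.append((prefix + [sym], score + log_p[sym]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]                                       # best sequence found and its log-prob
```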

Notice that in the standard sequence-to-sequence model, the output dictionary size for all symbols C_i is fixed and equal to n, so a model trained for a particular input length cannot generalize to sequences of a different length.

Content Based Input Attention

The vanilla sequence-to-sequence model only uses the final state of the encoder to represent the whole input sequence, which constrains the amount of information and computation that can flow through to the decoder. The attention model augments the decoder RNNs with an attention module over the encoder states to provide further information:
u_j^i = v^T \tanh(W_1 e_j + W_2 d_i) ,
a^i = softmax(u^i) ,
d'_i = \sum_{j = 1}^n a_j^i e_j ,
where e_j and d_i are the encoder and decoder hidden states respectively. d'_i and d_i are concatenated and used as the hidden state for prediction and as the input to the next time step.
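A minimal PyTorch sketch of this additive attention step; the module layout and `hidden_dim` are my own choices (the paper uses LSTM encoder/decoder states, but any states of matching size fit here):

```python
import torch
import torch.nn as nn

class ContentAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)   # applied to encoder states e_j
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)   # applied to decoder state d_i
        self.v = nn.Linear(hidden_dim, 1, bias=False)              # v^T

    def forward(self, enc_states, dec_state):
        # enc_states: (n, hidden_dim), dec_state: (hidden_dim,)
        u = self.v(torch.tanh(self.W1(enc_states) + self.W2(dec_state))).squeeze(-1)  # u^i, shape (n,)
        a = torch.softmax(u, dim=-1)                           # attention weights a^i
        context = (a.unsqueeze(-1) * enc_states).sum(dim=0)    # d'_i = sum_j a^i_j e_j
        return context, u
```

In the full attention decoder, `context` plays the role of d'_i and would be concatenated with `dec_state` for prediction, as described above.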

Pointer Network

The pointer network simplifies the attention mechanism by normalizing the vector u^i directly into an output distribution over the input positions, which guarantees that the output dictionary size always matches the number of inputs:
p(C_i | C_1, \cdots , C_{i - 1}, \mathcal{P}) = softmax(u^i) .
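Reusing the `ContentAttention` sketch above, a pointer decoding step then just normalizes the scores u^i over the input positions instead of blending them into a context vector; a minimal sketch:

```python
import torch

def pointer_step(attention, enc_states, dec_state):
    """One pointer-network decoding step: the attention scores u^i are
    normalized directly into p(C_i | C_1, ..., C_{i-1}, P)."""
    _, u = attention(enc_states, dec_state)   # scores over the n input positions
    return torch.softmax(u, dim=-1)           # output distribution = pointer over inputs

# probs = pointer_step(attn, enc_states, dec_state); next_index = probs.argmax().item()
```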

Experiments

The paper experiments on three different problems, i.e. convex hull, Delaunay triangulation and the travelling salesman problem (TSP), all of which require producing a discrete solution over an input point set.

(The output is actually a cycle or a set in these problems, so any point in the solution could serve as the start of the decoded sequence. An RNN decoder cannot capture this invariance, and the authors had to artificially fix a start point for the output sequence in the experimental setup.)

Convex Hull

Delaunay Triangulation

TSP

Generally speaking, the pointer network can work across different sequence lengths and performs relatively well on small instances. However, since it uses 1M training instances for each task, all uniformly sampled in a unit square, I suspect it is fitting the training distribution rather than actually learning the underlying algorithm.
