Transformer

2018-10-16 本文已影响81人一梦换须臾_

写在前面

这一篇文章主要是介绍 transformer 模型
论文参考:
Attention is All You Need
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
知识点参考:
Attention原理和源码解析
 Transformer详解
 语言模型和迁移学习
 Google BERT.
项目参考:
Transformer in Pytorch

RNN + Attention

Recall

Another formation of Attention

Attention = A(Q, K, V)=softmax(sim(Q, K)) • V

Advantages & Disadvantages

Advantages

Takes positional information into account

Disadvantages

Parallel Computation
Only decoder - encoder attention, has no concern on encoder itself and decoder itself

Transformer

Attention

In transformer model, we represent attention in this way:

Scaled Dot-Product Attention
When |K| becomes large,the dot products grow large in magnitude, softmax(qKt) may be similar to 0 or 1

Multi-head Attention
It is beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively

Self Attention
Besides decoder-encoder attention, we can also discover self-attention in encoder itself or decoder itself.

Encoder self-attention: Q=K=V = output of previous layer

Decoder self-attention: Q=K=V = output of previous layer, mask out all the right-part attention, only permit current state's paying attention to the previous state, not future state

Encoder

Embedding-Layer: token embedding & positional embedding(later)
SubLayer_1: Multi-Head Attention: encoder self-attention
SubLayer_2: FeedForward Networks: a simple, position-wise fully connected feed forward network

Decoder

Embedding-Layer: token embedding & positional embedding
SubLayer_1: Masked Multi-Head Attention: decoder masked self-attention
SubLayer_2: Multi-Head Attention:
Q: The previous decoder layer
K, V: Output of the encoder
SubLayer_2: FeedForward Networks: a simple, position-wise fully connected feed forward network
Linear & Softmax: Softmax to classify