Learning Notes on the Transformer Model
The Transformer comes from the Google team's 2017 paper Attention Is All You Need.
The paper's goal: reduce the amount of computation and improve parallelism without weakening the final results.
The innovation: the Transformer relies entirely on the attention mechanism, unlike traditional encoder-decoder models that combine an RNN or CNN. The key new components are Scaled Dot-Product Attention and Multi-Head Attention.
In my view the clearest explanation of the Transformer is The Illustrated Transformer; a lot of things click after reading that post. Harvard has also published a detailed PyTorch implementation, The Annotated Transformer, as a thoroughly annotated Jupyter notebook, which is well worth reading too.
I. How the model works
1. Model architecture
The architecture diagram given in Attention Is All You Need shows the details of one encoder and one decoder block: each encoder block contains 2 sub-layers and each decoder block contains 3 sub-layers; this structure is described in detail later.
Viewed globally, the encoder and decoder stacks each consist of 6 such blocks. The input to the first encoder block is the sum of the token embeddings and the positional embeddings. After passing through all 6 encoder blocks, the output is fed into every block on the decoder side.
2. The encoder
The diagrams of the Transformer's two novel components are shown below; once these two parts are understood, everything that follows becomes much clearer.
Q, K and V above are abstract vectors whose main purpose is to carry out and support the attention computation. From the paper, attention is computed as Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V.
Note that in the TensorFlow code the dimensions become (batch_size, max_len, vector_dimension): a batch_size axis (the number of input sentences) is added, max_len is the fixed sentence length, and the last axis is the size of the word vectors.
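As a quick illustration of the formula above, here is a minimal NumPy sketch (not the repo's code; the shapes follow the (batch_size, max_len, vector_dimension) convention just mentioned, and no masking is applied):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, batched over axis 0."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (batch, T_q, T_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                         # (batch, T_q, d_v)

# toy shapes: batch_size=2, max_len=4, vector_dimension=8
Q = np.random.randn(2, 4, 8)
K = np.random.randn(2, 4, 8)
V = np.random.randn(2, 4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4, 8)
```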
The figure shows x1 passing through self-attention to become z1. The tensor coming out of self-attention then goes through a residual connection and LayerNorm before entering the position-wise feed-forward network, which gets the same residual-and-normalize treatment. Only then does the output tensor move on to the next encoder block; this repeats 6 times, after which the result is handed to the decoder side.
As the figure shows, before the vectors enter the self-attention layer, the word embeddings and the positional encodings are added together. Because the model contains no RNN or CNN, the paper uses positional encodings to capture the order information of the sequence. The positional encoding deserves a brief explanation:
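As defined in the paper, position pos and dimension index i are encoded with sine and cosine functions of different frequencies (num_units in the code plays the role of d_model):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Since PE(pos + k) can be written as a linear function of PE(pos), the model can easily learn to attend by relative position.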
The vector produced by the last decoder block passes through a Linear layer and a softmax layer. The Linear layer maps the decoder output to a logits vector; the softmax layer turns the logits into probabilities, and the position with the highest probability gives the decoded output.
II. Code walkthrough
The code analyzed here comes from https://github.com/EternalFeather/Transformer-in-generating-dialogue. The files in that repo and their roles are:
File | Purpose |
---|---|
params.py | Defines the hyperparameters used by the model, e.g. the learning rate and the number of hidden units. |
make_dic.py | Data preprocessing: generates the vocabulary files for the source and target languages. |
data_load.py | Contains all the functions for loading and batching the data. |
modules.py | The core of the code: embedding and positional embedding, multi-head attention, normalization, and other building blocks. |
train.py | Training code: defines the model, the loss function, and the training/checkpointing procedure. |
eval.py | Evaluates the model's performance after training. |
1. The model's hyperparameters (tunable)
# -*- coding: utf-8 -*-
class Params:
'''
Parameters of our model
'''
src_train = "data/src-train.txt"
tgt_train = "data/tgt-train.txt"
src_test = "data/src-val.txt"
tgt_test = "data/tgt-val.txt"
num_identical = 6
maxlen = 10
hidden_units = 512
num_heads = 8
logdir = 'logdir'
batch_size = 32
num_epochs = 250
dropout = 0.1
learning_rate = 0.0001
word_limit_size = 20
word_limit_lower = 3
checkpoint = 'checkpoint'
Parameter | Value | Symbol in the code | Meaning |
---|---|---|---|
batch_size | 32 | N | Batch size |
learning_rate | 0.0001 | lr | Learning rate |
maxlen | 10 | T, T_q, T_k | Maximum number of tokens per sentence |
word_limit_size | 20 | | Words appearing fewer than 20 times are treated as <UNK> |
hidden_units | 512 | num_units, S | Number of hidden units / model dimension |
num_identical | 6 | | Number of stacked encoder and decoder blocks |
num_epochs | 250 | | Total number of training epochs |
num_heads | 8 | | Number of heads in multi-head attention |
dropout | 0.1 | dropout_rate | Dropout rate |
2. Data preprocessing
- make_dic.py
from __future__ import print_function
from params import Params as pm
import codecs
import os
from collections import Counter
def make_dic(path, fname):
'''
Constructs vocabulary as a dictionary
Args:
path: [String], Input file path
fname: [String], Output file name
Build vocabulary line by line to dictionary/ path
'''
text = codecs.open(path, 'r', 'utf-8').read() # codecs.open() returns a file object; read() turns it into a single string
words = text.split()
wordCount = Counter(words)
if not os.path.exists('dictionary'):
os.mkdir('dictionary')
with codecs.open('dictionary/{}'.format(fname), 'w', 'utf-8') as f:
f.write("{}\t1000000000\n{}\t1000000000\n{}\t1000000000\n{}\t1000000000\n".format("<PAD>","<UNK>","<STR>","<EOS>"))
for word, count in wordCount.most_common(len(wordCount)):
f.write(u"{}\t{}\n".format(word, count))
if __name__ == '__main__':
make_dic(pm.src_train, "en.vocab.tsv")
make_dic(pm.tgt_train, "de.vocab.tsv")
print("MSG : Constructing Dictionary Finished!")
Running this file produces the two vocabulary files en.vocab.tsv and de.vocab.tsv. Part of en.vocab.tsv looks like this:
<PAD> 1000000000
<UNK> 1000000000
<STR> 1000000000
<EOS> 1000000000
有 17300
的 15767
` 12757
- 10831
卦 8461
八 7865
麼 7771
沒 7324
嗎 6024
是 5940
......
ASCII 1
SAISONduSOLEIL 1
豌 1
迺 1
ThuDec2223 1
snis 1
Ya 1
2100 1
雇 1
Its main job is to count how often each word appears and sort the words by frequency; the result is used by the data_load module below.
- data_load.py
# -*- coding: utf-8 -*-
from __future__ import print_function
from params import Params as pm
import codecs
import sys
import numpy as np
import tensorflow as tf
def load_vocab(vocab): # 'en.vocab.tsv' 'de.vocab.tsv'
'''
Load word token from encoding dictionary
Args:
vocab: [String], vocabulary files
'''
vocab = [line.split()[0] for line in codecs.open('dictionary/{}'.format(vocab), 'r', 'utf-8').read().splitlines() if int(line.split()[1]) >= pm.word_limit_size]
word2idx_dic = {word: idx for idx, word in enumerate(vocab)}
idx2word_dic = {idx: word for idx, word in enumerate(vocab)}
return word2idx_dic, idx2word_dic
load_vocab processes the frequency-sorted vocabulary files generated above; after processing:
#en.vocab:
word2idx ={'<PAD>': 0, '<UNK>': 1, '<STR>': 2, '<EOS>': 3, '有': 4,
'的': 5, '`': 6, '-': 7, '卦': 8, '八': 9, ..., '爬': 1642, 'U': 1643}
idx2word = {0: '<PAD>', 1: '<UNK>', 2: '<STR>', 3: '<EOS>', 4: '有',
5: '的', 6: '`', 7: '-', 8: '卦', 9: '八', ..., 1642: '爬', 1643: 'U'}
#de.vocab
word2idx={'<PAD>': 0, '<UNK>': 1, '<STR>': 2, '<EOS>': 3, '-': 4,
'的': 5, '`': 6, '不': 7, '人': 8, '好': 9, ..., '遺': 1586, '搜': 1587}
idx2word={0: '<PAD>', 1: '<UNK>', 2: '<STR>',3:'<EOS>', 4: '-',
5: '的', 6: '`', 7: '不', 8: '人', 9: '好', ..., 1586: '遺', 1587: '搜'}
Next comes generate_dataset, which builds the dataset, i.e. turns sentences into NumPy arrays. It takes the source sentences and the target sentences; out-of-vocabulary words get id 1 (<UNK>). The first half of the function does the index lookup, the second half does the padding: since <PAD> sits at position 0 of the vocabulary, sentences shorter than 10 tokens are padded with 0s so that every sentence has the same length.
The returned X and Y both have shape (number of sentences, maximum sentence length).
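For intuition, here is what the index-then-pad step produces for one short sentence (the ids are made up for illustration; real ids come from en.vocab.tsv):

```python
import numpy as np

maxlen = 10
# hypothetical ids: '你' -> 15, '好' -> 9, '<EOS>' -> 3 (looked up via en2idx.get(word, 1))
inpt = [15, 9, 3]
# pad with <PAD> (id 0) up to maxlen, exactly as in generate_dataset
padded = np.lib.pad(np.array(inpt), (0, maxlen - len(inpt)),
                    'constant', constant_values=(0, 0))
print(padded)  # [15  9  3  0  0  0  0  0  0  0]
```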
def generate_dataset(source_sents, target_sents):
'''
Parse source sentences and target sentences from corpus with some formats
Parse word token of each sentences
Args:
source_sents: [List], encoding sentences from src-train file
target_sents: [List], decoding sentences from tgt-train file
Padding for word token sentence list
'''
en2idx, idx2en = load_vocab('en.vocab.tsv')
de2idx, idx2de = load_vocab('de.vocab.tsv')
in_list, out_list, Sources, Targets = [], [], [], []
for source_sent, target_sent in zip(source_sents, target_sents):
# 1 means <UNK>
inpt = [en2idx.get(word, 1) for word in (source_sent + u" <EOS>").split()]
outpt = [de2idx.get(word, 1) for word in (target_sent + u" <EOS>").split()]
if max(len(inpt), len(outpt)) <= pm.maxlen:
# sentence token list
in_list.append(np.array(inpt))
out_list.append(np.array(outpt))
# sentence list
Sources.append(source_sent)
Targets.append(target_sent)
X = np.zeros([len(in_list), pm.maxlen], np.int32)
Y = np.zeros([len(out_list), pm.maxlen], np.int32)
for i, (x, y) in enumerate(zip(in_list, out_list)):
X[i] = np.lib.pad(x, (0, pm.maxlen - len(x)), 'constant', constant_values = (0, 0))
Y[i] = np.lib.pad(y, (0, pm.maxlen - len(y)), 'constant', constant_values = (0, 0))
return X, Y, Sources, Targets
load_data simply prepares the arguments passed to generate_dataset.
def load_data(l_data):
'''
Read train-data from input datasets
Args:
l_data: [String], the file name of datasets which used to generate tokens
'''
if l_data == 'train':
en_sents = [line for line in codecs.open(pm.src_train, 'r', 'utf-8').read().split('\n') if line]
de_sents = [line for line in codecs.open(pm.tgt_train, 'r', 'utf-8').read().split('\n') if line]
if len(en_sents) == len(de_sents):
inpt, outpt, Sources, Targets = generate_dataset(en_sents, de_sents)
else:
print("MSG : Source length is different from Target length.")
sys.exit(0)
return inpt, outpt
elif l_data == 'test':
en_sents = [line for line in codecs.open(pm.src_test, 'r', 'utf-8').read().split('\n') if line]
de_sents = [line for line in codecs.open(pm.tgt_test, 'r', 'utf-8').read().split('\n') if line]
if len(en_sents) == len(de_sents):
inpt, outpt, Sources, Targets = generate_dataset(en_sents, de_sents)
else:
print("MSG : Source length is different from Target length.")
sys.exit(0)
return inpt, Sources, Targets
else:
print("MSG : Error when load data.")
sys.exit(0)
The next function, get_batch_data, uses load_data defined above (whose argument can be 'train' or 'test').
Here inpt and outpt have shape (total number of sentences, maxlen), and batch_num is the total number of batches to train on.
The outputs x and y have shape (N, T) == (batch_size, maxlen).
def get_batch_data():
'''
A batch dataset generator
'''
inpt, outpt = load_data("train")
batch_num = len(inpt) // pm.batch_size
inpt = tf.convert_to_tensor(inpt, tf.int32)
outpt = tf.convert_to_tensor(outpt, tf.int32)
# parsing data into queue used for pipeline operations as a generator.
input_queues = tf.train.slice_input_producer([inpt, outpt])
# multi-thread processing using batch
x, y = tf.train.shuffle_batch(input_queues,
num_threads = 8,
batch_size = pm.batch_size,
capacity = pm.batch_size * 64,
min_after_dequeue = pm.batch_size * 32,
allow_smaller_final_batch = False)
return x, y, batch_num
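Note that tf.train.slice_input_producer and tf.train.shuffle_batch belong to the old queue-based input pipeline (removed from the main namespace in TensorFlow 2.x). As a rough sketch, not part of the repo, the same shuffle-and-batch logic with the tf.data API would look something like this:

```python
import tensorflow as tf

def get_batch_data_tfdata(inpt, outpt, batch_size=32):
    '''Sketch of an equivalent pipeline with tf.data: slice, shuffle, batch, repeat.'''
    batch_num = len(inpt) // batch_size
    dataset = (tf.data.Dataset.from_tensor_slices((inpt, outpt))
               .shuffle(buffer_size=batch_size * 64)
               .batch(batch_size, drop_remainder=True)  # like allow_smaller_final_batch=False
               .repeat())
    x, y = dataset.make_one_shot_iterator().get_next()  # TF 1.x style iterator
    return x, y, batch_num
```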
3. The core: modules.py
# -*- coding: utf-8 -*-
from __future__ import print_function
import tensorflow as tf
import numpy as np
import math
def normalize(inputs, epsilon = 1e-8, scope = "ln", reuse = None):
'''
Implement layer normalization
Args:
inputs: [Tensor], A tensor with two or more dimensions, where the first one is "batch_size"
epsilon: [Float], A small number for preventing ZeroDivision Error
scope: [String], Optional scope for "variable_scope"
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Returns:
A tensor with the same shape and data type as "inputs"
'''
with tf.variable_scope(scope, reuse = reuse):
inputs_shape = inputs.get_shape() # e.g. if a has shape (2, 3), a.get_shape().as_list() returns [2, 3]
params_shape = inputs_shape[-1 :] # params_shape is just the last dimension
# tf.nn.moments returns the mean and variance over the last axis, which are then used to normalize manually
mean, variance = tf.nn.moments(inputs, [-1], keep_dims = True)
beta = tf.Variable(tf.zeros(params_shape))
gamma = tf.Variable(tf.ones(params_shape))
normalized = (inputs - mean) / ((variance + epsilon) ** (.5))
outputs = gamma * normalized + beta
return outputs
Both the self-attention sub-layer and the feed-forward sub-layer are followed by this normalization step. The epsilon parameter keeps the denominator from becoming 0, and scope is the variable scope name. The code shows that the formula has the same form as batch normalization, except that the statistics are computed over the last (feature) dimension of each example, i.e. layer normalization. The purpose is to make the gradients oscillate less during backpropagation and stabilize training; many deep networks add normalization after their sub-layer outputs.
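Written out, what normalize() computes over the last dimension is:

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where μ and σ² are the mean and variance of x along its last axis, and γ (gamma) and β (beta) are learned scale and shift parameters initialized to ones and zeros.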
def positional_encoding(inputs,
vocab_size,
num_units,
zero_pad = True,
scale = True,
scope = "positional_embedding",
reuse = None):
'''
Positional_Encoding for a given tensor.
Args:
inputs: [Tensor], A tensor contains the ids to be search from the lookup table, shape = [batch_size, 1 + len(inpt)]
vocab_size: [Int], Vocabulary size
num_units: [Int], Hidden size of embedding
zero_pad: [Boolean], If True, all the values of the first row(id = 0) should be constant zero
scale: [Boolean], If True, the output will be multiplied by sqrt num_units(check details from paper)
scope: [String], Optional scope for 'variable_scope'
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Returns:
A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'
'''
"""
inputs (batch_size, 1+len(inputs)) 那么N就是batch_size, 然后T就是maxlen,大小为10
num_units 就是隐层单元的个数,维度的大小
"""
N, T = inputs.get_shape().as_list()
with tf.variable_scope(scope, reuse = reuse):
position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])
# First part of the PE function: sin and cos argument
position_enc = np.array([
[pos / np.power(10000, 2.*i/num_units) for i in range(num_units)]
for pos in range(T)])
# Second part, apply the cosine to even columns and sin to odds.
position_enc[:, 0::2] = np.sin(position_enc[:, 0::2]) # dim 2i
position_enc[:, 1::2] = np.cos(position_enc[:, 1::2]) # dim 2i+1
# Convert to a tensor
lookup_table = tf.convert_to_tensor(position_enc)
if zero_pad:
lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
lookup_table[1:, :]), 0)
outputs = tf.nn.embedding_lookup(lookup_table, position_ind)
if scale:
outputs = outputs * num_units**0.5
return tf.cast(outputs, tf.float32)
This is the implementation of the positional embedding block in the architecture diagram. Since no recurrent model is used, embedding the positions is what captures the order information.
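A quick NumPy check of the sin/cos table the code above builds (same construction, tiny sizes, outside TensorFlow):

```python
import numpy as np

T, num_units = 10, 8   # maxlen = 10; a small dimension just for readability
position_enc = np.array([
    [pos / np.power(10000, 2. * i / num_units) for i in range(num_units)]
    for pos in range(T)])
position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # even dimensions -> sin
position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # odd dimensions  -> cos
print(position_enc.shape)  # (10, 8): one row per position, one column per dimension
print(position_enc[0])     # position 0: the sin entries are 0, the cos entries are 1
```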
def embedding(inputs,
vocab_size,
num_units,
zero_pad = True,
scale = True,
scope = "embedding",
reuse = None):
'''
Embed a given tensor.
Args:
inputs: [Tensor], A tensor contains the ids to be search from the lookup table
vocab_size: [Int], Vocabulary size
num_units: [Int], Hidden size of embedding
zero_pad: [Boolean], If True, all the values of the first row(id = 0) should be constant zero
scale: [Boolean], If True, the output will be multiplied by sqrt num_units(check details from paper)
scope: [String], Optional scope for 'variable_scope'
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Returns:
A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'
'''
"""
inputs传进来就(batch_size, 10)
lookup_table维度(vocab_size, 512),进行了随机的初始化
"""
# lookup_table shape = [vocab_size, num_units]
with tf.variable_scope(scope, reuse = reuse):
lookup_table = tf.get_variable('lookup_table',
dtype = tf.float32,
shape = [vocab_size, num_units],
initializer = tf.contrib.layers.xavier_initializer())
if zero_pad:
''' tf.zeros has shape (1, 512)
lookup_table[1:, :] drops the <PAD> row, which is replaced by the all-zero row and concatenated back
lookup_table still has shape (vocab_size, 512)
'''
lookup_table = tf.concat((tf.zeros(shape = [1, num_units]), lookup_table[1:, :]), 0)
# outputs has shape (batch_size, 10, 512) == [N, T, S]
outputs = tf.nn.embedding_lookup(lookup_table, inputs)
if scale:
# the embedding scaling step: multiply by sqrt(num_units)
outputs = outputs * math.sqrt(num_units)
return outputs
The input has shape (batch_size, maxlen) == [N, T] and the output has shape (batch_size, maxlen, S) == [N, T, S].
Note that when lookup_table is initialized, the row for id = 0 (the first row, i.e. <PAD>) is reset to all zeros. With scale = True the embeddings are multiplied by sqrt(num_units); the paper explains in its embedding section why this scaling is done.
Next is multi-head attention, the core of this code. The comments spell out how the tensor shapes change at each step.
The final output has shape [N, T_q, S].
def multihead_attention(queries,
keys,
num_units = None,
num_heads = 8,
dropout_rate = 0,
is_training = True,
causality = False,
scope = "multihead_attention",
reuse = None):
'''
Implement multihead attention
Args:
queries: [Tensor], A 3-dimensions tensor with shape of [N, T_q, S_q]
keys: [Tensor], A 3-dimensions tensor with shape of [N, T_k, S_k]
num_units: [Int], Attention size
num_heads: [Int], Number of heads
dropout_rate: [Float], A ratio of dropout
is_training: [Boolean], If true, controller of mechanism for dropout
causality: [Boolean], If true, units that reference the future are masked
scope: [String], Optional scope for "variable_scope"
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Returns:
A 3-dimensions tensor with shape of [N, T_q, S]
'''
""" queries = self.enc (batch_size, 10 ,512)==[N, T_q, S] keys也是self.enc
num_units =512, num_heads =10
"""
with tf.variable_scope(scope, reuse = reuse):
if num_units is None:
# default to the size of the last dimension of queries (the embedding size)
num_units = queries.get_shape().as_list()[-1]
""" Linear layers in Figure 2(right) 就是Q、K、V进入scaled Dot-product Attention前的Linear的操作
# 首先是进行了全连接的线性变换
shape = [N, T_q, S] (batch_size, 10 ,512), S可以理解为512"""
Q = tf.layers.dense(queries, num_units, activation = tf.nn.relu)
# shape = [N, T_k, S]
K = tf.layers.dense(keys, num_units, activation = tf.nn.relu)
# shape = [N, T_k, S]
V = tf.layers.dense(keys, num_units, activation = tf.nn.relu)
'''
Q_, K_ and V_ are the projected Q, K and V split into num_heads heads (not the weight matrices W_Q, W_K, W_V themselves)
shape: (batch_size*8, 10, 512/8 = 64)
'''
# Split and concat
# shape = [N*h, T_q, S/h]
Q_ = tf.concat(tf.split(Q, num_heads, axis = 2), axis = 0)
# shape = [N*h, T_k, S/h]
K_ = tf.concat(tf.split(K, num_heads, axis = 2), axis = 0)
# shape = [N*h, T_k, S/h]
V_ = tf.concat(tf.split(V, num_heads, axis = 2), axis = 0)
# batched matmul: [N*h, T_q, S/h] x [N*h, S/h, T_k] -> [N*h, T_q, T_k]
# shape = [N*h, T_q, T_k]
outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))
# Scale
outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
# Masking
# shape = [N, T_k]
# tf.reduce_sum drops the last axis (3-D -> 2-D); abs + sign maps each position to 0 (all-zero padding) or 1
'''shape flow: [N, T_k, 512] -> [N, T_k] -> [N*h, T_k] -> [N*h, T_q, T_k]'''
key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis = -1)))
# shape = [N*h, T_k]
key_masks = tf.tile(key_masks, [num_heads, 1])
# shape = [N*h, T_q, T_k]; tf.expand_dims adds an axis before tiling
key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])
# where key_masks == 0 (padding positions), replace the score with a huge negative value so softmax assigns it ~0 weight
paddings = tf.ones_like(outputs) * (-math.pow(2, 32) + 1)
# shape = [N*h, T_q, T_k]
outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)
if causality: # if True, mask out units that would attend to future positions
# reduce dims : shape = [T_q, T_k]
diag_vals = tf.ones_like(outputs[0, :, :])
# shape = [T_q, T_k]
# use a lower-triangular matrix to ignore the effect of future words
# e.g. for T = 3: [[1,0,0],
#                  [1,1,0],
#                  [1,1,1]]
tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()
# shape = [N*h, T_q, T_k]
masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])
paddings = tf.ones_like(masks) * (-math.pow(2, 32) + 1)
# shape = [N*h, T_q, T_k]
outputs = tf.where(tf.equal(masks, 0), paddings, outputs)
# Output Activation
outputs = tf.nn.softmax(outputs)
# Query Masking
# shape = [N, T_q]
query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis = -1)))
# shape = [N*h, T_q]
query_masks = tf.tile(query_masks, [num_heads, 1])
# shape = [N*h, T_q, T_k]
query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])
outputs *= query_masks
# Dropouts
outputs = tf.layers.dropout(outputs, rate = dropout_rate, training = tf.convert_to_tensor(is_training))
# Weighted sum
# shape = [N*h, T_q, S/h]
outputs = tf.matmul(outputs, V_)
# Restore shape
# shape = [N, T_q, S]
outputs = tf.concat(tf.split(outputs, num_heads, axis = 0), axis = 2)
# Residual connection
outputs += queries
# Normalize
# shape = [N, T_q, S]
outputs = normalize(outputs)
return outputs
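The split-and-concat above is the whole trick behind multi-head attention: the 512-dimensional projections are cut into 8 chunks of 64 and stacked along the batch axis, so all heads are attended to independently by one batched matmul. A small NumPy sketch of the shape bookkeeping (illustrative sizes only):

```python
import numpy as np

N, T, S, h = 2, 10, 512, 8          # batch, sentence length, model dim, number of heads
Q = np.random.randn(N, T, S)

# tf.concat(tf.split(Q, h, axis=2), axis=0): (N, T, S) -> h pieces of (N, T, S/h) -> (N*h, T, S/h)
Q_ = np.concatenate(np.split(Q, h, axis=2), axis=0)
print(Q_.shape)                      # (16, 10, 64) == (N*h, T, S/h)

# attention scores per head: (N*h, T_q, S/h) @ (N*h, S/h, T_k) -> (N*h, T_q, T_k)
scores = Q_ @ Q_.transpose(0, 2, 1)
print(scores.shape)                  # (16, 10, 10)
```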
The position-wise feed-forward network consists of two layers (1×1 convolutions in the paper, implemented here as fully connected layers) with a ReLU in between. A residual connection then adds the inputs back, followed by normalize. The final output shape is still [N, T_q, S].
def feedforward(inputs,
num_units = [2048, 512],
scope = "multihead_attention",
reuse = None):
'''
Position-wise feed forward neural network
Args:
inputs: [Tensor], A 3d tensor with shape [N, T, S]
num_units: [Int], A list of convolution parameters
scope: [String], Optional scope for "variable_scope"
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Return:
A tensor converted by feedforward layers from inputs
'''
with tf.variable_scope(scope, reuse = reuse):
# params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1, \
# "activation": tf.nn.relu, "use_bias": True}
# outputs = tf.layers.conv1d(inputs = inputs, filters = num_units[0], kernel_size = 1, activation = tf.nn.relu, use_bias = True)
# outputs = tf.layers.conv1d(**params)
params = {"inputs": inputs, "num_outputs": num_units[0], \
"activation_fn": tf.nn.relu}
outputs = tf.contrib.layers.fully_connected(**params)
# params = {"inputs": inputs, "filters": num_units[1], "kernel_size": 1, \
# "activation": None, "use_bias": True}
params = {"inputs": inputs, "num_outputs": num_units[1], \
"activation_fn": None}
# outputs = tf.layers.conv1d(inputs = inputs, filters = num_units[1], kernel_size = 1, activation = None, use_bias = True)
# outputs = tf.layers.conv1d(**params)
outputs = tf.contrib.layers.fully_connected(**params)
# residual connection
outputs += inputs
outputs = normalize(outputs)
return outputs
Finally comes label smoothing: the 0s of the one-hot vectors are replaced by a small value and the 1s by a value slightly below 1.
def label_smoothing(inputs, epsilon = 0.1):
'''
Implement label smoothing
Args:
inputs: [Tensor], A 3d tensor with shape of [N, T, V]
epsilon: [Float], Smoothing rate
Return:
A tensor after smoothing
'''
''' inputs has shape (batch_size, sentence_length, last_dimension)
N is batch_size, T is the sentence length, V is the size of the last dimension (here the vocabulary size, since the inputs are one-hot vectors)
'''
K = inputs.get_shape().as_list()[-1]
return ((1 - epsilon) * inputs) + (epsilon / K)
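A quick worked example with epsilon = 0.1 and a last dimension of K = 3 (toy numbers, just to show the effect):

```python
import numpy as np

epsilon, K = 0.1, 3
one_hot = np.array([0., 1., 0.])
smoothed = (1 - epsilon) * one_hot + epsilon / K
print(smoothed)  # [0.0333 0.9333 0.0333]: each 0 becomes eps/K, each 1 becomes 1 - eps + eps/K
```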
4. Training
- train.py
self.decoder_input shifts every target sentence: a <STR> token (id 2) is prepended and the last position is dropped, so the shape remains [N, T].
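A toy example of that shift (hypothetical ids; 2 = <STR>, 3 = <EOS>, 0 = <PAD>, maxlen = 10):

```python
import numpy as np

outpt = np.array([[7, 8, 9, 3, 0, 0, 0, 0, 0, 0]])              # one target sentence
decoder_input = np.concatenate(
    [np.ones_like(outpt[:, :1]) * 2, outpt[:, :-1]], axis=-1)   # prepend <STR>, drop the last position
print(decoder_input)  # [[2 7 8 9 3 0 0 0 0 0]]
```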
# -*- coding: utf-8 -*-
from __future__ import print_function
import tensorflow as tf
from params import Params as pm
from data_load import get_batch_data, load_vocab
from modules import *
from tqdm import tqdm
import os
class Graph():
# everything is built directly in __init__
def __init__(self, is_training = True):
self.graph = tf.Graph()
with self.graph.as_default():
if is_training:
self.inpt, self.outpt, self.batch_num = get_batch_data()
else:
'''inpt(None, maxlen) outpt(None, maxlen) maxlen=10'''
self.inpt = tf.placeholder(tf.int32, shape = (None, pm.maxlen))
self.outpt = tf.placeholder(tf.int32, shape = (None, pm.maxlen))
# start with 2(<STR>) and without 3(<EOS>)
self.decoder_input = tf.concat((tf.ones_like(self.outpt[:, :1])*2, self.outpt[:, :-1]), -1)
# load the en and de vocabularies; here en has 1644 entries and de has 1588
en2idx, idx2en = load_vocab('en.vocab.tsv')
de2idx, idx2de = load_vocab('de.vocab.tsv')
# Encoder
with tf.variable_scope("encoder"):
''' self.inpt has shape (batch_size, maxlen)
self.enc has shape (batch_size, maxlen, 512)
'''
self.enc = embedding(self.inpt,
vocab_size = len(en2idx),
num_units = pm.hidden_units,
scale = True,
scope = "enc_embed")
# Position Encoding(use range from 0 to len(inpt) to represent position dim of each words)
# tf.tile(tf.expand_dims(tf.range(tf.shape(self.inpt)[1]), 0), [tf.shape(self.inpt)[0], 1]),
self.enc += positional_encoding(self.inpt,
vocab_size = pm.maxlen,
num_units = pm.hidden_units,
zero_pad = False,
scale = False,
scope = "enc_pe")
# Dropout
self.enc = tf.layers.dropout(self.enc,
rate = pm.dropout,
training = tf.convert_to_tensor(is_training))
# Identical
for i in range(pm.num_identical):
with tf.variable_scope("num_identical_{}".format(i)):
# Multi-head Attention
self.enc = multihead_attention(queries = self.enc,
keys = self.enc,
num_units = pm.hidden_units,
num_heads = pm.num_heads,
dropout_rate = pm.dropout,
is_training = is_training,
causality = False)
self.enc = feedforward(self.enc, num_units = [4 * pm.hidden_units, pm.hidden_units])
Next comes the decoder code. Referring back to the decoder structure shown earlier, each decoder block has one extra attention sub-layer: it takes the tensor produced by the encoder together with the output of the decoder's self-attention and performs vanilla (encoder-decoder) attention on them.
The decoder's final output tensor has shape [N, T, 512].
# Decoder
with tf.variable_scope("decoder"):
self.dec = embedding(self.decoder_input,
vocab_size = len(de2idx),
num_units = pm.hidden_units,
scale = True,
scope = "dec_embed")
# Position Encoding(use range from 0 to len(inpt) to represent position dim)
self.dec += positional_encoding(self.decoder_input,
vocab_size = pm.maxlen,
num_units = pm.hidden_units,
zero_pad = False,
scale = False,
scope = "dec_pe")
# Dropout
self.dec = tf.layers.dropout(self.dec,
rate = pm.dropout,
training = tf.convert_to_tensor(is_training))
# Identical
for i in range(pm.num_identical):
with tf.variable_scope("num_identical_{}".format(i)):
# Multi-head Attention(self-attention)
self.dec = multihead_attention(queries = self.dec,
keys = self.dec,
num_units = pm.hidden_units,
num_heads = pm.num_heads,
dropout_rate = pm.dropout,
is_training = is_training,
causality = True,
scope = "self_attention")
# Multi-head Attention(vanilla-attention)
self.dec = multihead_attention(queries=self.dec,
keys=self.enc,
num_units=pm.hidden_units,
num_heads=pm.num_heads,
dropout_rate=pm.dropout,
is_training=is_training,
causality=False,
scope="vanilla_attention")
self.dec = feedforward(self.dec, num_units = [4 * pm.hidden_units, pm.hidden_units])
We have now reached the decoder output:
self.logits: the result of the Linear projection, with shape [N, T, len(de2idx)]
self.preds: the index of the maximum value along the last dimension of self.logits, with shape [N, T]
self.istarget: 1.0 at every position of self.outpt whose id is not 0 (i.e. every non-<PAD> position), with shape [N, T]
self.acc: compares self.preds with self.outpt; matching positions count as 1.0, others as 0, and only the non-padding positions are averaged.
# Linear
self.logits = tf.layers.dense(self.dec, len(de2idx))
self.preds = tf.to_int32(tf.arg_max(self.logits, dimension = -1))
self.istarget = tf.to_float(tf.not_equal(self.outpt, 0))
self.acc = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.outpt)) * self.istarget) / (tf.reduce_sum(self.istarget))
tf.summary.scalar('acc', self.acc)
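To make the masked accuracy above concrete, a toy NumPy example (hypothetical values):

```python
import numpy as np

outpt = np.array([[4, 5, 6, 0]])   # 0 = <PAD>
preds = np.array([[4, 5, 1, 7]])   # whatever is predicted at the <PAD> position is ignored
istarget = (outpt != 0).astype(np.float32)
acc = np.sum((preds == outpt).astype(np.float32) * istarget) / np.sum(istarget)
print(acc)  # 2 correct out of 3 real tokens -> 0.666...
```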
When is_training is True, i.e. during training, the following steps are also needed.
The loss has shape [N, T].
if is_training:
# smooth inputs
self.y_smoothed = label_smoothing(tf.one_hot(self.outpt, depth = len(de2idx)))
# loss function
self.loss = tf.nn.softmax_cross_entropy_with_logits(logits = self.logits, labels = self.y_smoothed)
self.mean_loss = tf.reduce_sum(self.loss * self.istarget) / (tf.reduce_sum(self.istarget))
self.global_step = tf.Variable(0, name = 'global_step', trainable = False)
# optimizer
self.optimizer = tf.train.AdamOptimizer(learning_rate = pm.learning_rate, beta1 = 0.9, beta2 = 0.98, epsilon = 1e-8)
self.train_op = self.optimizer.minimize(self.mean_loss, global_step = self.global_step)
tf.summary.scalar('mean_loss', self.mean_loss)
self.merged = tf.summary.merge_all()
if __name__ == '__main__':
'''en2idx {'<PAD>': 0, ...} and idx2en {0: '<PAD>', ...} are dictionaries; the en vocabulary has 1644 entries'''
'''de2idx {'<PAD>': 0, ...} and idx2de {0: '<PAD>', ...} are dictionaries; the de vocabulary has 1588 entries'''
en2idx, idx2en = load_vocab('en.vocab.tsv')
de2idx, idx2de = load_vocab('de.vocab.tsv')
g = Graph("train")
print("MSG : Graph loaded!")
# save model and use this model to training
supvisor = tf.train.Supervisor(graph = g.graph,logdir = pm.logdir,save_model_secs = 0)
with supvisor.managed_session() as sess:
for epoch in range(1, pm.num_epochs + 1):
if supvisor.should_stop():
break
# process bar
for step in tqdm(range(g.batch_num), total = g.batch_num, ncols = 70, leave = False, unit = 'b'):
sess.run(g.train_op)
if not os.path.exists(pm.checkpoint):
os.mkdir(pm.checkpoint)
g_step = sess.run(g.global_step)
supvisor.saver.save(sess, pm.checkpoint + '/model_epoch_%02d_gs_%d' % (epoch, g_step))
print("MSG : Done!")
III. Some questions to think about
1. What is novel about the Transformer? How does it differ from the traditional encoder-decoder model, and what goal is it trying to achieve?
2. Why use self-attention and multi-head attention?
3. How should the masks used in the Transformer be understood?
Finally, corrections are welcome wherever my understanding falls short!
References:
1. Attention Is All You Need (the original paper)
2. Notes on the Attention Is All You Need model
3. The Illustrated Transformer
4. The Annotated Transformer