Learning Notes on the Transformer Model
The Transformer comes from the Google team's 2017 paper Attention Is All You Need.
The paper's goal: reduce the amount of computation and improve parallelism without weakening the final results.
The innovation: the Transformer relies entirely on the attention mechanism, unlike traditional encoder-decoder models that combine an RNN or CNN. The key new components are Scaled Dot-Product Attention and Multi-Head Attention.
In my view the clearest explanation of the Transformer is The Illustrated Transformer; a lot of things click after reading that post. Harvard has also published a detailed PyTorch implementation, The Annotated Transformer, as a thoroughly annotated Jupyter notebook, which is well worth reading too.
I. How the model works
1. Model architecture
The architecture diagram given in Attention Is All You Need shows the details of one encoder and one decoder block: each encoder block contains 2 sub-layers and each decoder block contains 3 sub-layers; this structure is described in detail later.
Viewed globally, the encoder and decoder stacks each consist of 6 such blocks. The input to the first encoder block is the sum of the token embeddings and the positional embeddings. After passing through all 6 encoder blocks, the output is fed into every block on the decoder side.
2. The encoder
The diagrams of the Transformer's two novel components are shown below; once these two parts are understood, everything that follows becomes much clearer.
Q, K and V above are abstract vectors whose main purpose is to carry out and support the attention computation. From the paper, attention is computed as Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V.
Note that in the TensorFlow code the dimensions become (batch_size, max_len, vector_dimension): a batch_size axis (the number of input sentences) is added, max_len is the fixed sentence length, and the last axis is the size of the word vectors.
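As a quick illustration of the formula above, here is a minimal NumPy sketch (not the repo's code; the shapes follow the (batch_size, max_len, vector_dimension) convention just mentioned, and no masking is applied):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, batched over axis 0."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (batch, T_q, T_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                         # (batch, T_q, d_v)

# toy shapes: batch_size=2, max_len=4, vector_dimension=8
Q = np.random.randn(2, 4, 8)
K = np.random.randn(2, 4, 8)
V = np.random.randn(2, 4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4, 8)
```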
The figure shows x1 passing through self-attention to become z1. The tensor coming out of self-attention then goes through a residual connection and LayerNorm before entering the position-wise feed-forward network, which gets the same residual-and-normalize treatment. Only then does the output tensor move on to the next encoder block; this repeats 6 times, after which the result is handed to the decoder side.
As the figure shows, before the vectors enter the self-attention layer, the word embeddings and the positional encodings are added together. Because the model contains no RNN or CNN, the paper uses positional encodings to capture the order information of the sequence. The positional encoding deserves a brief explanation:
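As defined in the paper, position pos and dimension index i are encoded with sine and cosine functions of different frequencies (num_units in the code plays the role of d_model):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Since PE(pos + k) can be written as a linear function of PE(pos), the model can easily learn to attend by relative position.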
The vector produced by the last decoder block passes through a Linear layer and a softmax layer. The Linear layer maps the decoder output to a logits vector; the softmax layer turns the logits into probabilities, and the position with the highest probability gives the decoded output.
II. Code walkthrough
The code analyzed here comes from https://github.com/EternalFeather/Transformer-in-generating-dialogue. The files in that repo and their roles are:
File | Purpose |
---|---|
params.py | Defines the hyperparameters used by the model, e.g. the learning rate and the number of hidden units. |
make_dic.py | Data preprocessing: generates the vocabulary files for the source and target languages. |
data_load.py | Contains all the functions for loading and batching the data. |
modules.py | The core of the code: embedding and positional embedding, multi-head attention, normalization, and other building blocks. |
train.py | Training code: defines the model, the loss function, and the training/checkpointing procedure. |
eval.py | Evaluates the model's performance after training. |
1. The model's hyperparameters (tunable)
# -*- coding: utf-8 -*-
class Params:
'''
Parameters of our model
'''
src_train = "data/src-train.txt"
tgt_train = "data/tgt-train.txt"
src_test = "data/src-val.txt"
tgt_test = "data/tgt-val.txt"
num_identical = 6
maxlen = 10
hidden_units = 512
num_heads = 8
logdir = 'logdir'
batch_size = 32
num_epochs = 250
dropout = 0.1
learning_rate = 0.0001
word_limit_size = 20
word_limit_lower = 3
checkpoint = 'checkpoint'
Parameter | Value | Symbol in the code | Meaning |
---|---|---|---|
batch_size | 32 | N | Batch size |
learning_rate | 0.0001 | lr | Learning rate |
maxlen | 10 | T, T_q, T_k | Maximum number of tokens per sentence |
word_limit_size | 20 | | Words appearing fewer than 20 times are treated as <UNK> |
hidden_units | 512 | num_units, S | Number of hidden units / model dimension |
num_identical | 6 | | Number of stacked encoder and decoder blocks |
num_epochs | 250 | | Total number of training epochs |
num_heads | 8 | | Number of heads in multi-head attention |
dropout | 0.1 | dropout_rate | Dropout rate |
2. Data preprocessing
- make_dic.py
from __future__ import print_function
from params import Params as pm
import codecs
import os
from collections import Counter
def make_dic(path, fname):
'''
Constructs vocabulary as a dictionary
Args:
path: [String], Input file path
fname: [String], Output file name
Build vocabulary line by line to dictionary/ path
'''
text = codecs.open(path, 'r', 'utf-8').read() # codecs.open() returns a file object; read() turns it into a single string
words = text.split()
wordCount = Counter(words)
if not os.path.exists('dictionary'):
os.mkdir('dictionary')
with codecs.open('dictionary/{}'.format(fname), 'w', 'utf-8') as f:
f.write("{}\t1000000000\n{}\t1000000000\n{}\t1000000000\n{}\t1000000000\n".format("<PAD>","<UNK>","<STR>","<EOS>"))
for word, count in wordCount.most_common(len(wordCount)):
f.write(u"{}\t{}\n".format(word, count))
if __name__ == '__main__':
make_dic(pm.src_train, "en.vocab.tsv")
make_dic(pm.tgt_train, "de.vocab.tsv")
print("MSG : Constructing Dictionary Finished!")
Running this file produces the two vocabulary files en.vocab.tsv and de.vocab.tsv. Part of en.vocab.tsv looks like this:
<PAD> 1000000000
<UNK> 1000000000
<STR> 1000000000
<EOS> 1000000000
有 17300
的 15767
` 12757
- 10831
卦 8461
八 7865
麼 7771
沒 7324
嗎 6024
是 5940
......
ASCII 1
SAISONduSOLEIL 1
豌 1
迺 1
ThuDec2223 1
snis 1
Ya 1
2100 1
雇 1
Its main job is to count how often each word appears and sort the words by frequency; the result is used by the data_load module below.
- data_load.py
# -*- coding: utf-8 -*-
from __future__ import print_function
from params import Params as pm
import codecs
import sys
import numpy as np
import tensorflow as tf
def load_vocab(vocab): # 'en.vocab.tsv' 'de.vocab.tsv'
'''
Load word token from encoding dictionary
Args:
vocab: [String], vocabulary files
'''
vocab = [line.split()[0] for line in codecs.open('dictionary/{}'.format(vocab), 'r', 'utf-8').read().splitlines() if int(line.split()[1]) >= pm.word_limit_size]
word2idx_dic = {word: idx for idx, word in enumerate(vocab)}
idx2word_dic = {idx: word for idx, word in enumerate(vocab)}
return word2idx_dic, idx2word_dic
load_vocab processes the frequency-sorted vocabulary files generated above; after processing:
#en.vocab:
word2idx ={'<PAD>': 0, '<UNK>': 1, '<STR>': 2, '<EOS>': 3, '有': 4,
'的': 5, '`': 6, '-': 7, '卦': 8, '八': 9, ..., '爬': 1642, 'U': 1643}
idx2word = {0: '<PAD>', 1: '<UNK>', 2: '<STR>', 3: '<EOS>', 4: '有',
5: '的', 6: '`', 7: '-', 8: '卦', 9: '八', ..., 1642: '爬', 1643: 'U'}
#de.vocab
word2idx={'<PAD>': 0, '<UNK>': 1, '<STR>': 2, '<EOS>': 3, '-': 4,
'的': 5, '`': 6, '不': 7, '人': 8, '好': 9, ..., '遺': 1586, '搜': 1587}
idx2word={0: '<PAD>', 1: '<UNK>', 2: '<STR>',3:'<EOS>', 4: '-',
5: '的', 6: '`', 7: '不', 8: '人', 9: '好', ..., 1586: '遺', 1587: '搜'}
Next comes generate_dataset, which builds the dataset, i.e. turns sentences into NumPy arrays. It takes the source sentences and the target sentences; out-of-vocabulary words get id 1 (<UNK>). The first half of the function does the index lookup, the second half does the padding: since <PAD> sits at position 0 of the vocabulary, sentences shorter than 10 tokens are padded with 0s so that every sentence has the same length.
The returned X and Y both have shape (number of sentences, maximum sentence length).
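For intuition, here is what the index-then-pad step produces for one short sentence (the ids are made up for illustration; real ids come from en.vocab.tsv):

```python
import numpy as np

maxlen = 10
# hypothetical ids: '你' -> 15, '好' -> 9, '<EOS>' -> 3 (looked up via en2idx.get(word, 1))
inpt = [15, 9, 3]
# pad with <PAD> (id 0) up to maxlen, exactly as in generate_dataset
padded = np.lib.pad(np.array(inpt), (0, maxlen - len(inpt)),
                    'constant', constant_values=(0, 0))
print(padded)  # [15  9  3  0  0  0  0  0  0  0]
```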
def generate_dataset(source_sents, target_sents):
'''
Parse source sentences and target sentences from corpus with some formats
Parse word token of each sentences
Args:
source_sents: [List], encoding sentences from src-train file
target_sents: [List], decoding sentences from tgt-train file
Padding for word token sentence list
'''
en2idx, idx2en = load_vocab('en.vocab.tsv')
de2idx, idx2de = load_vocab('de.vocab.tsv')
in_list, out_list, Sources, Targets = [], [], [], []
for source_sent, target_sent in zip(source_sents, target_sents):
# 1 means <UNK>
inpt = [en2idx.get(word, 1) for word in (source_sent + u" <EOS>").split()]
outpt = [de2idx.get(word, 1) for word in (target_sent + u" <EOS>").split()]
if max(len(inpt), len(outpt)) <= pm.maxlen:
# sentence token list
in_list.append(np.array(inpt))
out_list.append(np.array(outpt))
# sentence list
Sources.append(source_sent)
Targets.append(target_sent)
X = np.zeros([len(in_list), pm.maxlen], np.int32)
Y = np.zeros([len(out_list), pm.maxlen], np.int32)
for i, (x, y) in enumerate(zip(in_list, out_list)):
X[i] = np.lib.pad(x, (0, pm.maxlen - len(x)), 'constant', constant_values = (0, 0))
Y[i] = np.lib.pad(y, (0, pm.maxlen - len(y)), 'constant', constant_values = (0, 0))
return X, Y, Sources, Targets
load_data simply prepares the arguments passed to generate_dataset.
def load_data(l_data):
'''
Read train-data from input datasets
Args:
l_data: [String], the file name of datasets which used to generate tokens
'''
if l_data == 'train':
en_sents = [line for line in codecs.open(pm.src_train, 'r', 'utf-8').read().split('\n') if line]
de_sents = [line for line in codecs.open(pm.tgt_train, 'r', 'utf-8').read().split('\n') if line]
if len(en_sents) == len(de_sents):
inpt, outpt, Sources, Targets = generate_dataset(en_sents, de_sents)
else:
print("MSG : Source length is different from Target length.")
sys.exit(0)
return inpt, outpt
elif l_data == 'test':
en_sents = [line for line in codecs.open(pm.src_test, 'r', 'utf-8').read().split('\n') if line]
de_sents = [line for line in codecs.open(pm.tgt_test, 'r', 'utf-8').read().split('\n') if line]
if len(en_sents) == len(de_sents):
inpt, outpt, Sources, Targets = generate_dataset(en_sents, de_sents)
else:
print("MSG : Source length is different from Target length.")
sys.exit(0)
return inpt, Sources, Targets
else:
print("MSG : Error when load data.")
sys.exit(0)
The next function, get_batch_data, uses load_data defined above (whose argument can be 'train' or 'test').
Here inpt and outpt have shape (total number of sentences, maxlen), and batch_num is the total number of batches to train on.
The outputs x and y have shape (N, T) == (batch_size, maxlen).
def get_batch_data():
'''
A batch dataset generator
'''
inpt, outpt = load_data("train")
batch_num = len(inpt) // pm.batch_size
inpt = tf.convert_to_tensor(inpt, tf.int32)
outpt = tf.convert_to_tensor(outpt, tf.int32)
# parsing data into queue used for pipeline operations as a generator.
input_queues = tf.train.slice_input_producer([inpt, outpt])
# multi-thread processing using batch
x, y = tf.train.shuffle_batch(input_queues,
num_threads = 8,
batch_size = pm.batch_size,
capacity = pm.batch_size * 64,
min_after_dequeue = pm.batch_size * 32,
allow_smaller_final_batch = False)
return x, y, batch_num
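Note that tf.train.slice_input_producer and tf.train.shuffle_batch belong to the old queue-based input pipeline (removed from the main namespace in TensorFlow 2.x). As a rough sketch, not part of the repo, the same shuffle-and-batch logic with the tf.data API would look something like this:

```python
import tensorflow as tf

def get_batch_data_tfdata(inpt, outpt, batch_size=32):
    '''Sketch of an equivalent pipeline with tf.data: slice, shuffle, batch, repeat.'''
    batch_num = len(inpt) // batch_size
    dataset = (tf.data.Dataset.from_tensor_slices((inpt, outpt))
               .shuffle(buffer_size=batch_size * 64)
               .batch(batch_size, drop_remainder=True)  # like allow_smaller_final_batch=False
               .repeat())
    x, y = dataset.make_one_shot_iterator().get_next()  # TF 1.x style iterator
    return x, y, batch_num
```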
3. The core: modules.py
# -*- coding: utf-8 -*-
from __future__ import print_function
import tensorflow as tf
import numpy as np
import math
def normalize(inputs, epsilon = 1e-8, scope = "ln", reuse = None):
'''
Implement layer normalization
Args:
inputs: [Tensor], A tensor with two or more dimensions, where the first one is "batch_size"
epsilon: [Float], A small number for preventing ZeroDivision Error
scope: [String], Optional scope for "variable_scope"
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Returns:
A tensor with the same shape and data type as "inputs"
'''
with tf.variable_scope(scope, reuse = reuse):
inputs_shape = inputs.get_shape() # e.g. if a has shape (2, 3), a.get_shape().as_list() returns [2, 3]
params_shape = inputs_shape[-1 :] # params_shape is just the last dimension
# tf.nn.moments returns the mean and variance over the last axis, which are then used to normalize manually
mean, variance = tf.nn.moments(inputs, [-1], keep_dims = True)
beta = tf.Variable(tf.zeros(params_shape))
gamma = tf.Variable(tf.ones(params_shape))
normalized = (inputs - mean) / ((variance + epsilon) ** (.5))
outputs = gamma * normalized + beta
return outputs
Both the self-attention sub-layer and the feed-forward sub-layer are followed by this normalization step. The epsilon parameter keeps the denominator from becoming 0, and scope is the variable scope name. The code shows that the formula has the same form as batch normalization, except that the statistics are computed over the last (feature) dimension of each example, i.e. layer normalization. The purpose is to make the gradients oscillate less during backpropagation and stabilize training; many deep networks add normalization after their sub-layer outputs.
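Written out, what normalize() computes over the last dimension is:

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where μ and σ² are the mean and variance of x along its last axis, and γ (gamma) and β (beta) are learned scale and shift parameters initialized to ones and zeros.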
def positional_encoding(inputs,
vocab_size,
num_units,
zero_pad = True,
scale = True,
scope = "positional_embedding",
reuse = None):
'''
Positional_Encoding for a given tensor.
Args:
inputs: [Tensor], A tensor contains the ids to be search from the lookup table, shape = [batch_size, 1 + len(inpt)]
vocab_size: [Int], Vocabulary size
num_units: [Int], Hidden size of embedding
zero_pad: [Boolean], If True, all the values of the first row(id = 0) should be constant zero
scale: [Boolean], If True, the output will be multiplied by sqrt num_units(check details from paper)
scope: [String], Optional scope for 'variable_scope'
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Returns:
A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'
'''
"""
inputs (batch_size, 1+len(inputs)) 那么N就是batch_size, 然后T就是maxlen,大小为10
num_units 就是隐层单元的个数,维度的大小
"""
N, T = inputs.get_shape().as_list()
with tf.variable_scope(scope, reuse = reuse):
position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])
# First part of the PE function: sin and cos argument
position_enc = np.array([
[pos / np.power(10000, 2.*i/num_units) for i in range(num_units)]
for pos in range(T)])
# Second part, apply the cosine to even columns and sin to odds.
position_enc[:, 0::2] = np.sin(position_enc[:, 0::2]) # dim 2i
position_enc[:, 1::2] = np.cos(position_enc[:, 1::2]) # dim 2i+1
# Convert to a tensor
lookup_table = tf.convert_to_tensor(position_enc)
if zero_pad:
lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
lookup_table[1:, :]), 0)
outputs = tf.nn.embedding_lookup(lookup_table, position_ind)
if scale:
outputs = outputs * num_units**0.5
return tf.cast(outputs, tf.float32)
This is the implementation of the positional embedding block in the architecture diagram. Since no recurrent model is used, embedding the positions is what captures the order information.
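A quick NumPy check of the sin/cos table the code above builds (same construction, tiny sizes, outside TensorFlow):

```python
import numpy as np

T, num_units = 10, 8   # maxlen = 10; a small dimension just for readability
position_enc = np.array([
    [pos / np.power(10000, 2. * i / num_units) for i in range(num_units)]
    for pos in range(T)])
position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # even dimensions -> sin
position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # odd dimensions  -> cos
print(position_enc.shape)  # (10, 8): one row per position, one column per dimension
print(position_enc[0])     # position 0: the sin entries are 0, the cos entries are 1
```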
def embedding(inputs,
vocab_size,
num_units,
zero_pad = True,
scale = True,
scope = "embedding",
reuse = None):
'''
Embed a given tensor.
Args:
inputs: [Tensor], A tensor contains the ids to be search from the lookup table
vocab_size: [Int], Vocabulary size
num_units: [Int], Hidden size of embedding
zero_pad: [Boolean], If True, all the values of the first row(id = 0) should be constant zero
scale: [Boolean], If True, the output will be multiplied by sqrt num_units(check details from paper)
scope: [String], Optional scope for 'variable_scope'
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Returns:
A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'
'''
"""
inputs传进来就(batch_size, 10)
lookup_table维度(vocab_size, 512),进行了随机的初始化
"""
# lookup_table shape = [vocab_size, num_units]
with tf.variable_scope(scope, reuse = reuse):
lookup_table = tf.get_variable('lookup_table',
dtype = tf.float32,
shape = [vocab_size, num_units],
initializer = tf.contrib.layers.xavier_initializer())
if zero_pad:
''' tf.zeros has shape (1, 512)
lookup_table[1:, :] drops the <PAD> row, which is replaced by the all-zero row and concatenated back
lookup_table still has shape (vocab_size, 512)
'''
lookup_table = tf.concat((tf.zeros(shape = [1, num_units]), lookup_table[1:, :]), 0)
# outputs has shape (batch_size, 10, 512) == [N, T, S]
outputs = tf.nn.embedding_lookup(lookup_table, inputs)
if scale:
# the embedding scaling step: multiply by sqrt(num_units)
outputs = outputs * math.sqrt(num_units)
return outputs
The input has shape (batch_size, maxlen) == [N, T] and the output has shape (batch_size, maxlen, S) == [N, T, S].
Note that when lookup_table is initialized, the row for id = 0 (the first row, i.e. <PAD>) is reset to all zeros. With scale = True the embeddings are multiplied by sqrt(num_units); the paper explains in its embedding section why this scaling is done.
Next is multi-head attention, the core of this code. The comments spell out how the tensor shapes change at each step.
The final output has shape [N, T_q, S].
def multihead_attention(queries,
keys,
num_units = None,
num_heads = 8,
dropout_rate = 0,
is_training = True,
causality = False,
scope = "multihead_attention",
reuse = None):
'''
Implement multihead attention
Args:
queries: [Tensor], A 3-dimensions tensor with shape of [N, T_q, S_q]
keys: [Tensor], A 3-dimensions tensor with shape of [N, T_k, S_k]
num_units: [Int], Attention size
num_heads: [Int], Number of heads
dropout_rate: [Float], A ratio of dropout
is_training: [Boolean], If true, controller of mechanism for dropout
causality: [Boolean], If true, units that reference the future are masked
scope: [String], Optional scope for "variable_scope"
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Returns:
A 3-dimensions tensor with shape of [N, T_q, S]
'''
""" queries = self.enc (batch_size, 10 ,512)==[N, T_q, S] keys也是self.enc
num_units =512, num_heads =10
"""
with tf.variable_scope(scope, reuse = reuse):
if num_units is None:
# default to the size of the last dimension of queries (the embedding size)
num_units = queries.get_shape().as_list()[-1]
""" Linear layers in Figure 2(right) 就是Q、K、V进入scaled Dot-product Attention前的Linear的操作
# 首先是进行了全连接的线性变换
shape = [N, T_q, S] (batch_size, 10 ,512), S可以理解为512"""
Q = tf.layers.dense(queries, num_units, activation = tf.nn.relu)
# shape = [N, T_k, S]
K = tf.layers.dense(keys, num_units, activation = tf.nn.relu)
# shape = [N, T_k, S]
V = tf.layers.dense(keys, num_units, activation = tf.nn.relu)
'''
Q_, K_ and V_ are the projected Q, K and V split into num_heads heads (not the weight matrices W_Q, W_K, W_V themselves)
shape: (batch_size*8, 10, 512/8 = 64)
'''
# Split and concat
# shape = [N*h, T_q, S/h]
Q_ = tf.concat(tf.split(Q, num_heads, axis = 2), axis = 0)
# shape = [N*h, T_k, S/h]
K_ = tf.concat(tf.split(K, num_heads, axis = 2), axis = 0)
# shape = [N*h, T_k, S/h]
V_ = tf.concat(tf.split(V, num_heads, axis = 2), axis = 0)
# batched matmul: [N*h, T_q, S/h] x [N*h, S/h, T_k] -> [N*h, T_q, T_k]
# shape = [N*h, T_q, T_k]
outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))
# Scale
outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
# Masking
# shape = [N, T_k]
# tf.reduce_sum drops the last axis (3-D -> 2-D); abs + sign maps each position to 0 (all-zero padding) or 1
'''shape flow: [N, T_k, 512] -> [N, T_k] -> [N*h, T_k] -> [N*h, T_q, T_k]'''
key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis = -1)))
# shape = [N*h, T_k]
key_masks = tf.tile(key_masks, [num_heads, 1])
# shape = [N*h, T_q, T_k]; tf.expand_dims adds an axis before tiling
key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])
# where key_masks == 0 (padding positions), replace the score with a huge negative value so softmax assigns it ~0 weight
paddings = tf.ones_like(outputs) * (-math.pow(2, 32) + 1)
# shape = [N*h, T_q, T_k]
outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)
if causality: # if True, mask out units that would attend to future positions
# reduce dims : shape = [T_q, T_k]
diag_vals = tf.ones_like(outputs[0, :, :])
# shape = [T_q, T_k]
# use a lower-triangular matrix to ignore the effect of future words
# e.g. for T = 3: [[1,0,0],
#                  [1,1,0],
#                  [1,1,1]]
tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()
# shape = [N*h, T_q, T_k]
masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])
paddings = tf.ones_like(masks) * (-math.pow(2, 32) + 1)
# shape = [N*h, T_q, T_k]
outputs = tf.where(tf.equal(masks, 0), paddings, outputs)
# Output Activation
outputs = tf.nn.softmax(outputs)
# Query Masking
# shape = [N, T_q]
query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis = -1)))
# shape = [N*h, T_q]
query_masks = tf.tile(query_masks, [num_heads, 1])
# shape = [N*h, T_q, T_k]
query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])
outputs *= query_masks
# Dropouts
outputs = tf.layers.dropout(outputs, rate = dropout_rate, training = tf.convert_to_tensor(is_training))
# Weighted sum
# shape = [N*h, T_q, S/h]
outputs = tf.matmul(outputs, V_)
# Restore shape
# shape = [N, T_q, S]
outputs = tf.concat(tf.split(outputs, num_heads, axis = 0), axis = 2)
# Residual connection
outputs += queries
# Normalize
# shape = [N, T_q, S]
outputs = normalize(outputs)
return outputs
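The split-and-concat above is the whole trick behind multi-head attention: the 512-dimensional projections are cut into 8 chunks of 64 and stacked along the batch axis, so all heads are attended to independently by one batched matmul. A small NumPy sketch of the shape bookkeeping (illustrative sizes only):

```python
import numpy as np

N, T, S, h = 2, 10, 512, 8          # batch, sentence length, model dim, number of heads
Q = np.random.randn(N, T, S)

# tf.concat(tf.split(Q, h, axis=2), axis=0): (N, T, S) -> h pieces of (N, T, S/h) -> (N*h, T, S/h)
Q_ = np.concatenate(np.split(Q, h, axis=2), axis=0)
print(Q_.shape)                      # (16, 10, 64) == (N*h, T, S/h)

# attention scores per head: (N*h, T_q, S/h) @ (N*h, S/h, T_k) -> (N*h, T_q, T_k)
scores = Q_ @ Q_.transpose(0, 2, 1)
print(scores.shape)                  # (16, 10, 10)
```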
The position-wise feed-forward network consists of two layers (1×1 convolutions in the paper, implemented here as fully connected layers) with a ReLU in between. A residual connection then adds the inputs back, followed by normalize. The final output shape is still [N, T_q, S].
def feedforward(inputs,
num_units = [2048, 512],
scope = "multihead_attention",
reuse = None):
'''
Position-wise feed forward neural network
Args:
inputs: [Tensor], A 3d tensor with shape [N, T, S]
num_units: [Int], A list of convolution parameters
scope: [String], Optional scope for "variable_scope"
reuse: [Boolean], If to reuse the weights of a previous layer by the same name
Return:
A tensor converted by feedforward layers from inputs
'''
with tf.variable_scope(scope, reuse = reuse):
# params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1, \
# "activation": tf.nn.relu, "use_bias": True}
# outputs = tf.layers.conv1d(inputs = inputs, filters = num_units[0], kernel_size = 1, activation = tf.nn.relu, use_bias = True)
# outputs = tf.layers.conv1d(**params)
params = {"inputs": inputs, "num_outputs": num_units[0], \
"activation_fn": tf.nn.relu}
outputs = tf.contrib.layers.fully_connected(**params)
# params = {"inputs": inputs, "filters": num_units[1], "kernel_size": 1, \
# "activation": None, "use_bias": True}
params = {"inputs": inputs, "num_outputs": num_units[1], \
"activation_fn": None}
# outputs = tf.layers.conv1d(inputs = inputs, filters = num_units[1], kernel_size = 1, activation = None, use_bias = True)
# outputs = tf.layers.conv1d(**params)
outputs = tf.contrib.layers.fully_connected(**params)
# residual connection
outputs += inputs
outputs = normalize(outputs)
return outputs
Finally comes label smoothing: the 0s of the one-hot vectors are replaced by a small value and the 1s by a value slightly below 1.
def label_smoothing(inputs, epsilon = 0.1):
'''
Implement label smoothing
Args:
inputs: [Tensor], A 3d tensor with shape of [N, T, V]
epsilon: [Float], Smoothing rate
Return:
A tensor after smoothing
'''
''' inputs has shape (batch_size, sentence_length, last_dimension)
N is batch_size, T is the sentence length, V is the size of the last dimension (here the vocabulary size, since the inputs are one-hot vectors)
'''
K = inputs.get_shape().as_list()[-1]
return ((1 - epsilon) * inputs) + (epsilon / K)
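A quick worked example with epsilon = 0.1 and a last dimension of K = 3 (toy numbers, just to show the effect):

```python
import numpy as np

epsilon, K = 0.1, 3
one_hot = np.array([0., 1., 0.])
smoothed = (1 - epsilon) * one_hot + epsilon / K
print(smoothed)  # [0.0333 0.9333 0.0333]: each 0 becomes eps/K, each 1 becomes 1 - eps + eps/K
```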
4. Training
- train.py
self.decoder_input shifts every target sentence: a <STR> token (id 2) is prepended and the last position is dropped, so the shape remains [N, T].
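A toy example of that shift (hypothetical ids; 2 = <STR>, 3 = <EOS>, 0 = <PAD>, maxlen = 10):

```python
import numpy as np

outpt = np.array([[7, 8, 9, 3, 0, 0, 0, 0, 0, 0]])              # one target sentence
decoder_input = np.concatenate(
    [np.ones_like(outpt[:, :1]) * 2, outpt[:, :-1]], axis=-1)   # prepend <STR>, drop the last position
print(decoder_input)  # [[2 7 8 9 3 0 0 0 0 0]]
```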
# -*- coding: utf-8 -*-
from __future__ import print_function
import tensorflow as tf
from params import Params as pm
from data_load import get_batch_data, load_vocab
from modules import *
from tqdm import tqdm
import os
class Graph():
# everything is built directly in __init__
def __init__(self, is_training = True):
self.graph = tf.Graph()
with self.graph.as_default():
if is_training:
self.inpt, self.outpt, self.batch_num = get_batch_data()
else:
'''inpt(None, maxlen) outpt(None, maxlen) maxlen=10'''
self.inpt = tf.placeholder(tf.int32, shape = (None, pm.maxlen))
self.outpt = tf.placeholder(tf.int32, shape = (None, pm.maxlen))
# start with 2(<STR>) and without 3(<EOS>)
self.decoder_input = tf.concat((tf.ones_like(self.outpt[:, :1])*2, self.outpt[:, :-1]), -1)
# load the en and de vocabularies; here en has 1644 entries and de has 1588
en2idx, idx2en = load_vocab('en.vocab.tsv')
de2idx, idx2de = load_vocab('de.vocab.tsv')
# Encoder
with tf.variable_scope("encoder"):
''' self.inpt has shape (batch_size, maxlen)
self.enc has shape (batch_size, maxlen, 512)
'''
self.enc = embedding(self.inpt,
vocab_size = len(en2idx),
num_units = pm.hidden_units,
scale = True,
scope = "enc_embed")
# Position Encoding(use range from 0 to len(inpt) to represent position dim of each words)
# tf.tile(tf.expand_dims(tf.range(tf.shape(self.inpt)[1]), 0), [tf.shape(self.inpt)[0], 1]),
self.enc += positional_encoding(self.inpt,
vocab_size = pm.maxlen,
num_units = pm.hidden_units,
zero_pad = False,
scale = False,
scope = "enc_pe")
# Dropout
self.enc = tf.layers.dropout(self.enc,
rate = pm.dropout,
training = tf.convert_to_tensor(is_training))
# Identical
for i in range(pm.num_identical):
with tf.variable_scope("num_identical_{}".format(i)):
# Multi-head Attention
self.enc = multihead_attention(queries = self.enc,
keys = self.enc,
num_units = pm.hidden_units,
num_heads = pm.num_heads,
dropout_rate = pm.dropout,
is_training = is_training,
causality = False)
self.enc = feedforward(self.enc, num_units = [4 * pm.hidden_units, pm.hidden_units])
Next comes the decoder code. Referring back to the decoder structure shown earlier, each decoder block has one extra attention sub-layer: it takes the tensor produced by the encoder together with the output of the decoder's self-attention and performs vanilla (encoder-decoder) attention on them.
The decoder's final output tensor has shape [N, T, 512].
# Decoder
with tf.variable_scope("decoder"):
self.dec = embedding(self.decoder_input,
vocab_size = len(de2idx),
num_units = pm.hidden_units,
scale = True,
scope = "dec_embed")
# Position Encoding(use range from 0 to len(inpt) to represent position dim)
self.dec += positional_encoding(self.decoder_input,
vocab_size = pm.maxlen,
num_units = pm.hidden_units,
zero_pad = False,
scale = False,
scope = "dec_pe")
# Dropout
self.dec = tf.layers.dropout(self.dec,
rate = pm.dropout,
training = tf.convert_to_tensor(is_training))
# Identical
for i in range(pm.num_identical):
with tf.variable_scope("num_identical_{}".format(i)):
# Multi-head Attention(self-attention)
self.dec = multihead_attention(queries = self.dec,
keys = self.dec,
num_units = pm.hidden_units,
num_heads = pm.num_heads,
dropout_rate = pm.dropout,
is_training = is_training,
causality = True,
scope = "self_attention")
# Multi-head Attention(vanilla-attention)
self.dec = multihead_attention(queries=self.dec,
keys=self.enc,
num_units=pm.hidden_units,
num_heads=pm.num_heads,
dropout_rate=pm.dropout,
is_training=is_training,
causality=False,
scope="vanilla_attention")
self.dec = feedforward(self.dec, num_units = [4 * pm.hidden_units, pm.hidden_units])
We have now reached the decoder output:
self.logits: the result of the Linear projection, with shape [N, T, len(de2idx)]
self.preds: the index of the maximum value along the last dimension of self.logits, with shape [N, T]
self.istarget: 1.0 at every position of self.outpt whose id is not 0 (i.e. every non-<PAD> position), with shape [N, T]
self.acc: compares self.preds with self.outpt; matching positions count as 1.0, others as 0, and only the non-padding positions are averaged.
# Linear
self.logits = tf.layers.dense(self.dec, len(de2idx))
self.preds = tf.to_int32(tf.arg_max(self.logits, dimension = -1))
self.istarget = tf.to_float(tf.not_equal(self.outpt, 0))
self.acc = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.outpt)) * self.istarget) / (tf.reduce_sum(self.istarget))
tf.summary.scalar('acc', self.acc)
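To make the masked accuracy above concrete, a toy NumPy example (hypothetical values):

```python
import numpy as np

outpt = np.array([[4, 5, 6, 0]])   # 0 = <PAD>
preds = np.array([[4, 5, 1, 7]])   # whatever is predicted at the <PAD> position is ignored
istarget = (outpt != 0).astype(np.float32)
acc = np.sum((preds == outpt).astype(np.float32) * istarget) / np.sum(istarget)
print(acc)  # 2 correct out of 3 real tokens -> 0.666...
```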
When is_training is True, i.e. during training, the following steps are also needed.
The loss has shape [N, T].
if is_training:
# smooth inputs
self.y_smoothed = label_smoothing(tf.one_hot(self.outpt, depth = len(de2idx)))
# loss function
self.loss = tf.nn.softmax_cross_entropy_with_logits(logits = self.logits, labels = self.y_smoothed)
self.mean_loss = tf.reduce_sum(self.loss * self.istarget) / (tf.reduce_sum(self.istarget))
self.global_step = tf.Variable(0, name = 'global_step', trainable = False)
# optimizer
self.optimizer = tf.train.AdamOptimizer(learning_rate = pm.learning_rate, beta1 = 0.9, beta2 = 0.98, epsilon = 1e-8)
self.train_op = self.optimizer.minimize(self.mean_loss, global_step = self.global_step)
tf.summary.scalar('mean_loss', self.mean_loss)
self.merged = tf.summary.merge_all()
if __name__ == '__main__':
'''en2idx {'<PAD>': 0, ...} and idx2en {0: '<PAD>', ...} are dictionaries; the en vocabulary has 1644 entries'''
'''de2idx {'<PAD>': 0, ...} and idx2de {0: '<PAD>', ...} are dictionaries; the de vocabulary has 1588 entries'''
en2idx, idx2en = load_vocab('en.vocab.tsv')
de2idx, idx2de = load_vocab('de.vocab.tsv')
g = Graph("train")
print("MSG : Graph loaded!")
# save model and use this model to training
supvisor = tf.train.Supervisor(graph = g.graph,logdir = pm.logdir,save_model_secs = 0)
with supvisor.managed_session() as sess:
for epoch in range(1, pm.num_epochs + 1):
if supvisor.should_stop():
break
# process bar
for step in tqdm(range(g.batch_num), total = g.batch_num, ncols = 70, leave = False, unit = 'b'):
sess.run(g.train_op)
if not os.path.exists(pm.checkpoint):
os.mkdir(pm.checkpoint)
g_step = sess.run(g.global_step)
supvisor.saver.save(sess, pm.checkpoint + '/model_epoch_%02d_gs_%d' % (epoch, g_step))
print("MSG : Done!")
III. Some questions to think about
1. What is novel about the Transformer? How does it differ from the traditional encoder-decoder model, and what goal is it trying to achieve?
2. Why use self-attention and multi-head attention?
3. How should the masks used in the Transformer be understood?
Finally, corrections are welcome wherever my understanding falls short!
References:
1. Attention Is All You Need (the original paper)
2. Notes on the Attention Is All You Need model
3. The Illustrated Transformer
4. The Annotated Transformer