
PyTorch Learning Notes - Training a Seq2Seq Model with Attention

2019-06-06

This is the third in a close reading of the six Seq2Seq papers behind the PyTorch tutorials: Bahdanau, D., K. Cho and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate. 2014.
Published in 2014; full-text link.
NMT by jointly learning to align and translate.

Abstract

The difference between NMT and SMT is that NMT builds a single neural network that can be jointly tuned to maximize translation performance.
The paper conjectures that the fixed-length vector is a bottleneck in the encoder-decoder architecture, and proposes extending it by allowing the model to automatically (soft-)search for the parts of the source sentence that are relevant to predicting a target word.

1. Introduction

The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence.

2. Background on NMT

2.1 RNN encoder-decoder

3. Learning to Align and Translate

The new model uses a bidirectional RNN (BiRNN) as the encoder, paired with a decoder that searches over the source sentence while decoding.

3.1 Decoder

The paper introduces the decoder first.
Each conditional probability is defined as
p(y_i|y_1,...,y_{i-1},X)=g(y_{i-1},s_i,c_i)

Based on the equation above, the weight term can be derived.
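
The intermediate equations (reproduced here from the paper, since the original images did not survive) are the decoder state, the context vector, the attention weights, and the alignment scores:
s_i=f(s_{i-1},y_{i-1},c_i)
c_i=\sum_{j=1}^{T_x}\alpha_{ij}h_j
\alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}
e_{ij}=a(s_{i-1},h_j)
where a is a small feedforward alignment model trained jointly with the rest of the network.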


Combining these, we get the formula below, although the implementation still works through the individual steps above:
c_i=\sum_{j=1}^{T_x} \frac{\exp(a(s_{i-1},h_j))}{\sum_{k=1}^{T_x}\exp(a(s_{i-1},h_k))}h_j
\alpha_{ij}, and the e_{ij} associated with it, reflect the importance of the annotation h_j with respect to the previous hidden state s_{i-1} in deciding the next state s_i and generating y_i. In short, this implements an attention mechanism in the decoder: the decoder decides which parts of the source sentence to pay attention to. By giving the decoder an attention mechanism, we relieve the encoder of the burden of having to encode all the information in the source sentence into a fixed-length vector. With this approach, information can be spread throughout the sequence of annotations, and the decoder can selectively retrieve it as needed.
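
As a minimal numeric sketch of the context-vector formula above (toy tensors only; the names here are illustrative and not part of the implementation below):

import torch
import torch.nn.functional as F

T_x, enc_dim = 5, 4                        # toy source length and annotation size
h = torch.randn(T_x, enc_dim)              # annotations h_1 .. h_Tx
e = torch.randn(T_x)                       # alignment scores e_ij = a(s_{i-1}, h_j), faked here
alpha = F.softmax(e, dim=0)                # attention weights alpha_ij, sum to 1
c_i = (alpha.unsqueeze(1) * h).sum(dim=0)  # context vector c_i = sum_j alpha_ij * h_j
print(alpha.sum())                         # ~1.0
print(c_i.shape)                           # torch.Size([4])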

I didn't understand this part the last time I studied it; now it seems a bit clearer.

3.2 Encoder: a BiRNN for the annotation sequence

A BiRNN is used here. It consists of two RNNs, a forward one and a backward one: the forward RNN reads the source sentence from left to right, and the backward RNN reads it from right to left.
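
A minimal shape check of a bidirectional GRU (toy dimensions, unrelated to the model implemented below):

import torch
import torch.nn as nn

birnn = nn.GRU(input_size=8, hidden_size=16, bidirectional=True)
x = torch.randn(7, 2, 8)    # [src sent len, batch size, input size]
outputs, hidden = birnn(x)
print(outputs.shape)        # torch.Size([7, 2, 32]): forward and backward states concatenated
print(hidden.shape)         # torch.Size([2, 2, 16]): [n directions, batch size, hid dim]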


4. Experiment Settings

4.1 Dataset

The main thing to note here is the dataset: the ACL WMT'14 dataset is used, containing the English-French parallel corpora Europarl (61M words), news commentary (5.5M), UN (421M), and two crawled corpora of 90M and 272.5M words respectively, for a total of 850M words. The data selection method of Axelrod et al. is used to reduce the combined corpus to 348M words. No monolingual data beyond the parallel corpora above is used, although a much larger monolingual corpus could be used to pretrain the encoder.

4.2 Models

Two models are trained for comparison: a plain RNN encoder-decoder and an attention encoder-decoder. Each model is trained twice, with sentences of up to 30 words and up to 50 words respectively, and then validated with 20-word sentences; in the end, the attention model indeed shows a substantial improvement.

5. Model Implementation

Here the model is implemented as four modules: Encoder, Attention, Decoder, and Seq2Seq, using the Multi30k dataset.

5.1 Importing libraries and preprocessing the data

import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.softmax in the Attention module below
import torch.optim as optim
from torchtext import data
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator
import spacy
import random
import math
import time
SEED=1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic=True

spacy_de=spacy.load('de')
spacy_en=spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC=Field(tokenize=tokenize_de,init_token='<sos>',eos_token='<eos>',lower=True)
TRG=Field(tokenize=tokenize_en,init_token='<sos>',eos_token='<eos>',lower=True)
train_data,valid_data,test_data=Multi30k.splits(exts=('.de','.en'),fields=(SRC,TRG))
print(vars(train_data.examples[11]))
{'src': ['vier', 'typen', ',', 'von', 'denen', 'drei', 'hüte', 'tragen', 'und', 'einer', 'nicht', ',', 'springen', 'oben', 'in', 'einem', 'treppenhaus', '.'], 'trg': ['four', 'guys', 'three', 'wearing', 'hats', 'one', 'not', 'are', 'jumping', 'at', 'the', 'top', 'of', 'a', 'staircase', '.']}
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE=128
train_iterator, valid_iterator, test_iterator=BucketIterator.splits(
    (train_data,valid_data,test_data),
    batch_size=BATCH_SIZE,
    device=device
)
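
A quick sanity check on the splits and vocabularies (assuming the Multi30k download succeeded; the exact counts can vary with the torchtext version):

print(f'training examples: {len(train_data.examples)}')
print(f'validation examples: {len(valid_data.examples)}')
print(f'test examples: {len(test_data.examples)}')
print(f'unique tokens in SRC (de) vocab: {len(SRC.vocab)}')
print(f'unique tokens in TRG (en) vocab: {len(TRG.vocab)}')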

5.2 Building the model

5.2.1 Parameter settings

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

5.2.2 Seq2Seq

In this model, the encoder and decoder hidden dimensions may differ, because the encoder is bidirectional.
Seq2Seq is similar to the previous two versions; the only difference is that everything the encoder returns, both the final hidden state and the per-step hidden states, needs to be passed into the decoder.

for i ,batch in enumerate(train_iterator):
    if i <1:
        print(i)
        src=batch.src
        trg=batch.trg
        print(type(src))
        print(src.shape)
        print(src)
        print(src.shape[0])
        print(src.shape[1])
        max_len=trg.shape[0]
        batch_size=trg.shape[1]
        trg_vocab_size=len(TRG.vocab)
        print(max_len)
        print(batch_size)
        print(trg_vocab_size)
        
        outputs=torch.zeros(max_len,batch_size,trg_vocab_size)
        print(outputs.shape)
#         print(outputs)
    else: break
0
<class 'torch.Tensor'>
torch.Size([37, 128])
tensor([[   2,    2,    2,  ...,    2,    2,    2],
        [   5,    5,   18,  ...,    5,    5,   18],
        [  26,   13,   45,  ...,   66, 2305,  121],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]], device='cuda:0')
37
128
36
128
5893
torch.Size([36, 128, 5893])
class Seq2Seq(nn.Module):
    def __init__(self, encoder,decoder,device):
        super(Seq2Seq,self).__init__()
        self.encoder=encoder
        self.decoder=decoder
        self.device=device
    
    def forward(self, src,trg,teacher_forcing_ratio=0.5):
        batch_size=src.shape[1]
        max_len=trg.shape[0]
        trg_vocab_size=self.decoder.output_dim
        
        outputs=torch.zeros(max_len,batch_size,trg_vocab_size).to(self.device)
        # torch.Size([21, 128, 5893])
        encoder_outputs,hidden=self.encoder(src)
        
        output=trg[0,:]
        for t in range(1, max_len):
            output, hidden = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)

        return outputs

5.2.3 Encoder

The Encoder uses a single-layer GRU, made bidirectional here. With a bidirectional RNN there are two RNNs per layer.

Two context vectors are obtained: one from the forward RNN after it has seen the last word of the sentence, z^\rightarrow = h_T^\rightarrow, and one from the backward RNN after it has seen the first word of the sentence, z^\leftarrow = h_T^\leftarrow.
The RNN returns two things: outputs and hidden.

Since the Decoder is not bidirectional, it only needs a single context vector z as its initial hidden state s_0, but we currently have two, a forward one and a backward one (z^\rightarrow = h_T^\rightarrow and z^\leftarrow = h_T^\leftarrow). We solve this by concatenating the two context vectors, passing them through a linear layer g, and applying a \tanh activation function. The formula is:
z=\tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0

This differs from the original paper, where the backward RNN's hidden state alone is passed through a linear layer to obtain the context vector and the decoder's initial hidden state; here it is modified to pass the concatenation through the linear layer g with a \tanh activation, as above.
Since we want our model to look back over the whole source sentence, we return outputs, the stacked forward and backward hidden states for every token in the source sentence. We also return hidden, which acts as our initial hidden state in the decoder.

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super(Encoder, self).__init__()
        self.input_dim=input_dim
        self.emb_dim=emb_dim
        self.enc_hid_dim=enc_hid_dim
        self.dec_hid_dim=dec_hid_dim
        self.dropout=dropout
        
        self.embedding=nn.Embedding(input_dim,emb_dim)
        self.rnn=nn.GRU(emb_dim,enc_hid_dim,bidirectional=True)
        self.fc=nn.Linear(enc_hid_dim*2,dec_hid_dim)
        self.dropout=nn.Dropout(dropout)
        
    def forward(self,src):
        #src = [src sent len, batch size]
        embedded=self.dropout(self.embedding(src))
        #embedded = [src sent len, batch size, emb dim]
        outputs,hidden=self.rnn(embedded)
        #outputs = [src sent len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        #hidden[-2,:,:] gives the top-layer forward RNN hidden state after the final time step (i.e. after seeing the last word of the sentence), and hidden[-1,:,:] gives the top-layer backward RNN hidden state after its final time step (i.e. after seeing the first word of the sentence)
        hidden=torch.tanh(self.fc(torch.cat((hidden[-2,:,:],hidden[-1,:,:]),dim=1)))
        #outputs = [src sent len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        return outputs, hidden

5.2.4 Attention

Attention essentially processes the decoder's hidden state from the previous time step together with the forward and backward hidden states output by the encoder. The output is an attention vector \alpha_t whose length equals the source sentence length, with every element between 0 and 1 and the whole vector summing to 1.
\alpha_t indicates which words in the source sentence should receive the most attention, which helps the decoder predict the next word \hat y_{t+1}.
The steps are similar to those in the previous paper-reading note.

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super(Attention,self).__init__()
        self.enc_hid_dim=enc_hid_dim
        self.dec_hid_dim=dec_hid_dim
        self.attn=nn.Linear((enc_hid_dim*2)+dec_hid_dim,dec_hid_dim)
        self.v=nn.Parameter(torch.rand(dec_hid_dim))
        
    def forward(self, hidden, encoder_outputs):
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        batch_size=encoder_outputs.shape[1]
        src_len=encoder_outputs.shape[0]
        #repeat the hidden state so that its second dimension matches the encoder's source length
        hidden=hidden.unsqueeze(1).repeat(1,src_len,1)
        #permute rearranges the dimensions of a tensor; here it reorders the encoder outputs so they can be combined with hidden below
        encoder_outputs=encoder_outputs.permute(1,0,2)
        #hidden = [batch size, src sent len, dec hid dim]
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        #compute the match (energy) between hidden and encoder_outputs
        energy=torch.tanh(self.attn(torch.cat((hidden,encoder_outputs),dim=2)))
        #energy = [batch size, src sent len, dec hid dim]
        #reorder the dimensions of energy
        energy=energy.permute(0,2,1)
        #energy = [batch size, dec hid dim, src sent len]
        
        #v = [dec hid dim]
        v=self.v.repeat(batch_size,1).unsqueeze(1)
        #v = [batch size, 1, dec hid dim]; note that bmm performs a batched matrix multiplication over the matrices stored in the two batches
        attention=torch.bmm(v,energy).squeeze(1)
        #attention=[batch_size, src_len]
        return F.softmax(attention, dim=1)
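
A small sanity check of the attention layer on dummy tensors (shapes only; the real inputs come from the Encoder and Decoder, and the dummy source length of 30 is arbitrary):

attn_test = Attention(ENC_HID_DIM, DEC_HID_DIM)
dummy_hidden = torch.randn(BATCH_SIZE, DEC_HID_DIM)            # [batch size, dec hid dim]
dummy_enc_out = torch.randn(30, BATCH_SIZE, ENC_HID_DIM * 2)   # [src sent len, batch size, enc hid dim * 2]
a = attn_test(dummy_hidden, dummy_enc_out)
print(a.shape)        # torch.Size([128, 30])
print(a.sum(dim=1))   # every row sums to 1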

5.2.5 Decoder

The Decoder contains the attention layer, which takes the previous hidden state s_{t-1} and all of the Encoder hidden states H, and returns the attention vector a_t.
Next, the attention vector is used to create a weighted source vector w_t, a weighted sum of the Encoder hidden states H using a_t as the weights:
w_t = a_t H
The embedded input word y_t, the weighted source vector w_t, and the previous Decoder hidden state s_{t-1} are all passed into the Decoder:
s_t = \text{DecoderGRU}(y_t, w_t, s_{t-1})
Finally, a linear layer f takes y_t, w_t and s_t to make the prediction:
\hat{y}_{t+1} = f(y_t, w_t, s_t)

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super(Decoder,self).__init__()
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.attention = attention
        
        self.embedding=nn.Embedding(output_dim,emb_dim)
        self.rnn=nn.GRU((enc_hid_dim*2)+emb_dim,dec_hid_dim)
        self.out=nn.Linear((enc_hid_dim*2)+dec_hid_dim+emb_dim,output_dim)
        self.dropout=nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        #first unsqueeze the input to add a sequence-length dimension
        input=input.unsqueeze(0)
        #input = [1,batch size]
        
        embedded=self.dropout(self.embedding(input))
        #embedded = [1, batch size, emb dim]
        
        a=self.attention(hidden, encoder_outputs)
        #a = [batch size, src len]
        a=a.unsqueeze(1)
        #a = [batch size, 1, src len]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        
        #with the attention weights and the encoder hidden states, compute the first formula: the weighted vector w_t, via bmm
        weighted=torch.bmm(a, encoder_outputs)
        #weighted = [batch size, 1, enc hid dim * 2]
        weighted=weighted.permute(1,0,2)
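        #weighted = [1, batch size, enc hid dim * 2]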
        
        rnn_input=torch.cat((embedded,weighted),dim=2)
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
        
        output,hidden=self.rnn(rnn_input,hidden.unsqueeze(0))
        
        
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        output = self.out(torch.cat((output, weighted, embedded), dim = 1))
        
        #output = [bsz, output dim]
        
        return output, hidden.squeeze(0)
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (out): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5)
  )
)
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 20,518,917 trainable parameters
optimizer = optim.Adam(model.parameters())
PAD_IDX = TRG.vocab.stoi['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg sent len, batch size]
        #output = [trg sent len, batch size, output dim]
        
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)
        
        #trg = [(trg sent len - 1) * batch size]
        #output = [(trg sent len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg sent len, batch size]
            #output = [trg sent len, batch size, output dim]

            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            #trg = [(trg sent len - 1) * batch size]
            #output = [(trg sent len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')
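
As a follow-up sketch (not in the original post), the saved checkpoint can be loaded and scored on the held-out test set with the same evaluate function:

model.load_state_dict(torch.load('tut3-model.pt'))
test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')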

I'll add the Colab training results later.
