
PyTorch Learning Notes - Training a Seq2Seq Model with Attention

2019-06-06

This is the third in a close reading of the six Seq2Seq papers behind the PyTorch tutorials: Bahdanau, D., K. Cho and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate. 2014.
Published in 2014; full-text link.
NMT by jointly learning to align and translate.

Abstract

The difference between NMT and SMT is that NMT builds a single neural network that can be jointly tuned to maximize translation performance.
The paper conjectures that the fixed-length vector is a bottleneck in the encoder-decoder architecture, and proposes extending it by allowing the model to automatically (soft-)search for the parts of the source sentence that are relevant to predicting a target word.

1. Introduction

The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence.

2. Background on NMT

2.1 RNN encoder-decoder

3. Learning to Align and Translate

The new model uses a bidirectional RNN (BiRNN) as the encoder, paired with a decoder that searches over the source sentence while decoding.

3.1 Decoder

The paper introduces the decoder first.
Each conditional probability is defined as
p(y_i|y_1,...,y_{i-1},X)=g(y_{i-1},s_i,c_i)

Based on the equation above, the weight term can be derived.
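
The intermediate equations (reproduced here from the paper, since the original images did not survive) are the decoder state, the context vector, the attention weights, and the alignment scores:
s_i=f(s_{i-1},y_{i-1},c_i)
c_i=\sum_{j=1}^{T_x}\alpha_{ij}h_j
\alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}
e_{ij}=a(s_{i-1},h_j)
where a is a small feedforward alignment model trained jointly with the rest of the network.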


Combining these, we get the formula below, although the implementation still works through the individual steps above:
c_i=\sum_{j=1}^{T_x} \frac{\exp(a(s_{i-1},h_j))}{\sum_{k=1}^{T_x}\exp(a(s_{i-1},h_k))}h_j
\alpha_{ij}, and the e_{ij} associated with it, reflect the importance of the annotation h_j with respect to the previous hidden state s_{i-1} in deciding the next state s_i and generating y_i. In short, this implements an attention mechanism in the decoder: the decoder decides which parts of the source sentence to pay attention to. By giving the decoder an attention mechanism, we relieve the encoder of the burden of having to encode all the information in the source sentence into a fixed-length vector. With this approach, information can be spread throughout the sequence of annotations, and the decoder can selectively retrieve it as needed.
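
As a minimal numeric sketch of the context-vector formula above (toy tensors only; the names here are illustrative and not part of the implementation below):

import torch
import torch.nn.functional as F

T_x, enc_dim = 5, 4                        # toy source length and annotation size
h = torch.randn(T_x, enc_dim)              # annotations h_1 .. h_Tx
e = torch.randn(T_x)                       # alignment scores e_ij = a(s_{i-1}, h_j), faked here
alpha = F.softmax(e, dim=0)                # attention weights alpha_ij, sum to 1
c_i = (alpha.unsqueeze(1) * h).sum(dim=0)  # context vector c_i = sum_j alpha_ij * h_j
print(alpha.sum())                         # ~1.0
print(c_i.shape)                           # torch.Size([4])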

I didn't understand this part the last time I studied it; now it seems a bit clearer.

3.2 Encoder: a BiRNN for the annotation sequence

A BiRNN is used here. It consists of two RNNs, a forward one and a backward one: the forward RNN reads the source sentence from left to right, and the backward RNN reads it from right to left.
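
A minimal shape check of a bidirectional GRU (toy dimensions, unrelated to the model implemented below):

import torch
import torch.nn as nn

birnn = nn.GRU(input_size=8, hidden_size=16, bidirectional=True)
x = torch.randn(7, 2, 8)    # [src sent len, batch size, input size]
outputs, hidden = birnn(x)
print(outputs.shape)        # torch.Size([7, 2, 32]): forward and backward states concatenated
print(hidden.shape)         # torch.Size([2, 2, 16]): [n directions, batch size, hid dim]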


4. Experiment Settings

4.1 Dataset

The main thing to note here is the dataset: the ACL WMT'14 dataset is used, containing the English-French parallel corpora Europarl (61M words), news commentary (5.5M), UN (421M), and two crawled corpora of 90M and 272.5M words respectively, for a total of 850M words. The data selection method of Axelrod et al. is used to reduce the combined corpus to 348M words. No monolingual data beyond the parallel corpora above is used, although a much larger monolingual corpus could be used to pretrain the encoder.

4.2 Models

Two models are trained for comparison: a plain RNN encoder-decoder and an attention encoder-decoder. Each model is trained twice, with sentences of up to 30 words and up to 50 words respectively, and then validated with 20-word sentences; in the end, the attention model indeed shows a substantial improvement.

5. Model Implementation

Here the model is implemented as four modules: Encoder, Attention, Decoder, and Seq2Seq, using the Multi30k dataset.

5.1 Importing libraries and preprocessing the data

import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.softmax in the Attention module below
import torch.optim as optim
from torchtext import data
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator
import spacy
import random
import math
import time
SEED=1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic=True

spacy_de=spacy.load('de')
spacy_en=spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC=Field(tokenize=tokenize_de,init_token='<sos>',eos_token='<eos>',lower=True)
TRG=Field(tokenize=tokenize_en,init_token='<sos>',eos_token='<eos>',lower=True)
train_data,valid_data,test_data=Multi30k.splits(exts=('.de','.en'),fields=(SRC,TRG))
print(vars(train_data.examples[11]))
{'src': ['vier', 'typen', ',', 'von', 'denen', 'drei', 'hüte', 'tragen', 'und', 'einer', 'nicht', ',', 'springen', 'oben', 'in', 'einem', 'treppenhaus', '.'], 'trg': ['four', 'guys', 'three', 'wearing', 'hats', 'one', 'not', 'are', 'jumping', 'at', 'the', 'top', 'of', 'a', 'staircase', '.']}
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE=128
train_iterator, valid_iterator, test_iterator=BucketIterator.splits(
    (train_data,valid_data,test_data),
    batch_size=BATCH_SIZE,
    device=device
)
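
A quick sanity check on the splits and vocabularies (assuming the Multi30k download succeeded; the exact counts can vary with the torchtext version):

print(f'training examples: {len(train_data.examples)}')
print(f'validation examples: {len(valid_data.examples)}')
print(f'test examples: {len(test_data.examples)}')
print(f'unique tokens in SRC (de) vocab: {len(SRC.vocab)}')
print(f'unique tokens in TRG (en) vocab: {len(TRG.vocab)}')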

5.2 Building the model

5.2.1 Parameter settings

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

5.2.2 Seq2Seq

In this model, the encoder and decoder hidden dimensions may differ, because the encoder is bidirectional.
Seq2Seq is similar to the previous two versions; the only difference is that everything the encoder returns, both the final hidden state and the per-step hidden states, needs to be passed into the decoder.

for i ,batch in enumerate(train_iterator):
    if i <1:
        print(i)
        src=batch.src
        trg=batch.trg
        print(type(src))
        print(src.shape)
        print(src)
        print(src.shape[0])
        print(src.shape[1])
        max_len=trg.shape[0]
        batch_size=trg.shape[1]
        trg_vocab_size=len(TRG.vocab)
        print(max_len)
        print(batch_size)
        print(trg_vocab_size)
        
        outputs=torch.zeros(max_len,batch_size,trg_vocab_size)
        print(outputs.shape)
#         print(outputs)
    else: break
0
<class 'torch.Tensor'>
torch.Size([37, 128])
tensor([[   2,    2,    2,  ...,    2,    2,    2],
        [   5,    5,   18,  ...,    5,    5,   18],
        [  26,   13,   45,  ...,   66, 2305,  121],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]], device='cuda:0')
37
128
36
128
5893
torch.Size([36, 128, 5893])
class Seq2Seq(nn.Module):
    def __init__(self, encoder,decoder,device):
        super(Seq2Seq,self).__init__()
        self.encoder=encoder
        self.decoder=decoder
        self.device=device
    
    def forward(self, src,trg,teacher_forcing_ratio=0.5):
        batch_size=src.shape[1]
        max_len=trg.shape[0]
        trg_vocab_size=self.decoder.output_dim
        
        outputs=torch.zeros(max_len,batch_size,trg_vocab_size).to(self.device)
        # torch.Size([21, 128, 5893])
        encoder_outputs,hidden=self.encoder(src)
        
        output=trg[0,:]
        for t in range(1, max_len):
            output, hidden = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)

        return outputs

5.2.3 Encoder

The Encoder uses a single-layer GRU, made bidirectional here. With a bidirectional RNN there are two RNNs per layer.

Two context vectors are obtained: one from the forward RNN after it has seen the last word of the sentence, z^\rightarrow = h_T^\rightarrow, and one from the backward RNN after it has seen the first word of the sentence, z^\leftarrow = h_T^\leftarrow.
The RNN returns two things: outputs and hidden.

Since the Decoder is not bidirectional, it only needs a single context vector z as its initial hidden state s_0, but we currently have two, a forward one and a backward one (z^\rightarrow = h_T^\rightarrow and z^\leftarrow = h_T^\leftarrow). We solve this by concatenating the two context vectors, passing them through a linear layer g, and applying a \tanh activation function. The formula is:
z=\tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0

This differs from the original paper, where the backward RNN's hidden state alone is passed through a linear layer to obtain the context vector and the decoder's initial hidden state; here it is modified to pass the concatenation through the linear layer g with a \tanh activation, as above.
Since we want our model to look back over the whole source sentence, we return outputs, the stacked forward and backward hidden states for every token in the source sentence. We also return hidden, which acts as our initial hidden state in the decoder.

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super(Encoder, self).__init__()
        self.input_dim=input_dim
        self.emb_dim=emb_dim
        self.enc_hid_dim=enc_hid_dim
        self.dec_hid_dim=dec_hid_dim
        self.dropout=dropout
        
        self.embedding=nn.Embedding(input_dim,emb_dim)
        self.rnn=nn.GRU(emb_dim,enc_hid_dim,bidirectional=True)
        self.fc=nn.Linear(enc_hid_dim*2,dec_hid_dim)
        self.dropout=nn.Dropout(dropout)
        
    def forward(self,src):
        #src = [src sent len, batch size]
        embedded=self.dropout(self.embedding(src))
        #embedded = [src sent len, batch size, emb dim]
        outputs,hidden=self.rnn(embedded)
        #outputs = [src sent len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        #hidden[-2,:,:] gives the top-layer forward RNN hidden state after the final time step (i.e. after seeing the last word of the sentence), and hidden[-1,:,:] gives the top-layer backward RNN hidden state after its final time step (i.e. after seeing the first word of the sentence)
        hidden=torch.tanh(self.fc(torch.cat((hidden[-2,:,:],hidden[-1,:,:]),dim=1)))
        #outputs = [src sent len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        return outputs, hidden

5.2.4 Attention

Attention essentially processes the decoder's hidden state from the previous time step together with the forward and backward hidden states output by the encoder. The output is an attention vector \alpha_t whose length equals the source sentence length, with every element between 0 and 1 and the whole vector summing to 1.
\alpha_t indicates which words in the source sentence should receive the most attention, which helps the decoder predict the next word \hat y_{t+1}.
The steps are similar to those in the previous paper-reading note.

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super(Attention,self).__init__()
        self.enc_hid_dim=enc_hid_dim
        self.dec_hid_dim=dec_hid_dim
        self.attn=nn.Linear((enc_hid_dim*2)+dec_hid_dim,dec_hid_dim)
        self.v=nn.Parameter(torch.rand(dec_hid_dim))
        
    def forward(self, hidden, encoder_outputs):
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        batch_size=encoder_outputs.shape[1]
        src_len=encoder_outputs.shape[0]
        #repeat the hidden state so that its second dimension matches the encoder's source length
        hidden=hidden.unsqueeze(1).repeat(1,src_len,1)
        #permute rearranges the dimensions of a tensor; here it reorders the encoder outputs so they can be combined with hidden below
        encoder_outputs=encoder_outputs.permute(1,0,2)
        #hidden = [batch size, src sent len, dec hid dim]
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        #compute the match (energy) between hidden and encoder_outputs
        energy=torch.tanh(self.attn(torch.cat((hidden,encoder_outputs),dim=2)))
        #energy = [batch size, src sent len, dec hid dim]
        #reorder the dimensions of energy
        energy=energy.permute(0,2,1)
        #energy = [batch size, dec hid dim, src sent len]
        
        #v = [dec hid dim]
        v=self.v.repeat(batch_size,1).unsqueeze(1)
        #v = [batch size, 1, dec hid dim]; note that bmm performs a batched matrix multiplication over the matrices stored in the two batches
        attention=torch.bmm(v,energy).squeeze(1)
        #attention=[batch_size, src_len]
        return F.softmax(attention, dim=1)
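
A small sanity check of the attention layer on dummy tensors (shapes only; the real inputs come from the Encoder and Decoder, and the dummy source length of 30 is arbitrary):

attn_test = Attention(ENC_HID_DIM, DEC_HID_DIM)
dummy_hidden = torch.randn(BATCH_SIZE, DEC_HID_DIM)            # [batch size, dec hid dim]
dummy_enc_out = torch.randn(30, BATCH_SIZE, ENC_HID_DIM * 2)   # [src sent len, batch size, enc hid dim * 2]
a = attn_test(dummy_hidden, dummy_enc_out)
print(a.shape)        # torch.Size([128, 30])
print(a.sum(dim=1))   # every row sums to 1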

5.2.5 Decoder

The Decoder contains the attention layer, which takes the previous hidden state s_{t-1} and all of the Encoder hidden states H, and returns the attention vector a_t.
Next, the attention vector is used to create a weighted source vector w_t, a weighted sum of the Encoder hidden states H using a_t as the weights:
w_t = a_t H
The embedded input word y_t, the weighted source vector w_t, and the previous Decoder hidden state s_{t-1} are all passed into the Decoder:
s_t = \text{DecoderGRU}(y_t, w_t, s_{t-1})
Finally, a linear layer f takes y_t, w_t and s_t to make the prediction:
\hat{y}_{t+1} = f(y_t, w_t, s_t)

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super(Decoder,self).__init__()
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.attention = attention
        
        self.embedding=nn.Embedding(output_dim,emb_dim)
        self.rnn=nn.GRU((enc_hid_dim*2)+emb_dim,dec_hid_dim)
        self.out=nn.Linear((enc_hid_dim*2)+dec_hid_dim+emb_dim,output_dim)
        self.dropout=nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        #first unsqueeze the input to add a sequence-length dimension
        input=input.unsqueeze(0)
        #input = [1,batch size]
        
        embedded=self.dropout(self.embedding(input))
        #embedded = [1, batch size, emb dim]
        
        a=self.attention(hidden, encoder_outputs)
        #a = [batch size, src len]
        a=a.unsqueeze(1)
        #a = [batch size, 1, src len]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        
        #with the attention weights and the encoder hidden states, compute the first formula: the weighted vector w_t, via bmm
        weighted=torch.bmm(a, encoder_outputs)
        #weighted = [batch size, 1, enc hid dim * 2]
        weighted=weighted.permute(1,0,2)
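        #weighted = [1, batch size, enc hid dim * 2]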
        
        rnn_input=torch.cat((embedded,weighted),dim=2)
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
        
        output,hidden=self.rnn(rnn_input,hidden.unsqueeze(0))
        
        
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        output = self.out(torch.cat((output, weighted, embedded), dim = 1))
        
        #output = [bsz, output dim]
        
        return output, hidden.squeeze(0)
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (out): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5)
  )
)
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 20,518,917 trainable parameters
optimizer = optim.Adam(model.parameters())
PAD_IDX = TRG.vocab.stoi['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg sent len, batch size]
        #output = [trg sent len, batch size, output dim]
        
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)
        
        #trg = [(trg sent len - 1) * batch size]
        #output = [(trg sent len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg sent len, batch size]
            #output = [trg sent len, batch size, output dim]

            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            #trg = [(trg sent len - 1) * batch size]
            #output = [(trg sent len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')
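
As a follow-up sketch (not in the original post), the saved checkpoint can be loaded and scored on the held-out test set with the same evaluate function:

model.load_state_dict(torch.load('tut3-model.pt'))
test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')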

I'll add the Colab training results later.
