PyTorch Study Notes: A Deeper Look at TorchText, Part 02

2019-04-19 我的昵称违规了

After getting a simple torchtext pipeline working, I wanted to study torchtext in more depth. I found two tutorials.

1. Introduction to practical-torchtext

A tutorial on using torchtext effectively, in two parts:

~~Text classification~~
Word-level language modeling

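3. Word-level language modeling

In this tutorial we see how to train a language model using a dataset built into torchtext.
The tutorial also covers some of the more practical torchtext features you are likely to need when training your own models.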

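3.1 What is a language model

Language modeling is the task of building a model that takes a sequence of words as input and estimates how likely that sequence is to be real human language. For example, we want the model to judge "this is a sentence" a likely sequence and "cool his book" an unlikely one.
The usual way to train a language model is to have it predict the next word given all the preceding words in a sentence or across several sentences. So all we need for language modeling is a large amount of text (a corpus).
In this tutorial we use the well-known WikiText2 dataset.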
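To make "how likely is this sequence" concrete, here is a minimal sketch using a count-based bigram model. This is a deliberately simpler technique than the LSTM trained later in this post, and the toy corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def sequence_probability(counts, tokens):
    """Multiply P(next | previous) over the sequence; unseen pairs give 0."""
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        prob *= counts[prev][nxt] / total if total else 0.0
    return prob

# Hypothetical toy corpus, already tokenized by whitespace.
corpus = "this is a sentence . this is a book . his book is cool .".split()
counts = train_bigram(corpus)

likely = sequence_probability(counts, "this is a sentence".split())
unlikely = sequence_probability(counts, "cool his book".split())
print(likely, unlikely)
```

A neural language model does the same job, but it conditions on the whole history rather than just the previous word and never assigns a hard zero.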

USE_GPU=True
BATCH_SIZE=32

3.2 Preparing the data

Note
spaCy is used as the tokenizer here. torchtext makes this easy: all we have to do is pass in a function that wraps spaCy.

import torchtext
from torchtext import data
import spacy
from spacy.symbols import ORTH

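# Load spaCy's English pipeline. 'en' is the spaCy 2.x shortcut name;
# spaCy 3.x needs the full model name, e.g. 'en_core_web_sm'.
my_tok=spacy.load('en')
# Register the markers as special cases so spaCy keeps each one as a single token
# instead of splitting it on the angle brackets.
my_tok.tokenizer.add_special_case('<eos>',[{ORTH:'<eos>'}])
my_tok.tokenizer.add_special_case('<bos>',[{ORTH:'<bos>'}])
my_tok.tokenizer.add_special_case('<unk>',[{ORTH:'<unk>'}])
# add_special_case simply tells the tokenizer to parse a given string in a given way.
# The list after the special-case string says how that string should be tokenized.
# For example, to split don't into do and n't, we could write:
# my_tok.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't"}])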
def spacy_tok(x):
    return [tok.text for tok in my_tok.tokenizer(x)]

# With the spaCy tokenizer ready, hand it to a Field
TEXT=data.Field(lower=True, tokenize=spacy_tok)

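Next we can load the built-in dataset. There are two ways to do this:

Load it as a Dataset, split into training, validation, and test sets
Load it as an iterator

Loading as a dataset is more flexible, so the tutorial takes that route.
The dataset used here is WikiText2.
I ran into some trouble at this point. This tutorial is very similar to the language-model one I worked through earlier, but it downloads and extracts everything on the fly. I gave up on that and used the GloVe vectors I had already downloaded. Processing really is slow.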

from torchtext.datasets import WikiText2
train, valid, test=WikiText2.splits(TEXT)

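Now that we have the data, we build the vocabulary. This time, let's try using pre-computed word embeddings.
We use 200-dimensional GloVe vectors. torchtext ships a range of other pre-computed embeddings (including 100- and 300-dimensional GloVe vectors), and they can be loaded in much the same way.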

TEXT.build_vocab(train, vectors="glove.6B.200d")

3.3 Building the iterator

With that done we can build the iterator.
After two days of study, here is my summary of the overall workflow.

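Process the raw data: English goes straight through a tokenizer, and Chinese should work much the same way. The result is a TEXT object of type Field (or whatever you choose to call it).
Using this TEXT, load the dataset into training, validation, and test sets.
Build the vocabulary and embed it with pre-trained word vectors. (Hang on: does "embedding" here mean combining existing, pre-trained word vectors with our dataset/corpus so that the corpus itself ends up embedded?)
Build the iterator, which produces text and target.
Build the model.
Feed the data in to train, validate, and test.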
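On the question raised in the vocabulary step: as far as I understand it, `build_vocab(train, vectors=...)` looks up every vocabulary word in the pre-trained table and stores the results as one matrix (`TEXT.vocab.vectors`), where row i belongs to word id i and words missing from the table get zeros by default. The sketch below mimics that lookup with plain Python lists. The toy vocabulary, the toy vectors, and the name `build_embedding_matrix` are all hypothetical.

```python
def build_embedding_matrix(itos, pretrained, dim):
    """Return one vector per vocabulary entry, in vocabulary order.

    Words missing from `pretrained` get an all-zero row.
    """
    return [list(pretrained.get(word, [0.0] * dim)) for word in itos]

# Hypothetical toy data standing in for TEXT.vocab.itos and the GloVe table.
itos = ["<unk>", "<pad>", "the", "lobster"]
pretrained = {
    "the": [0.1, 0.2, 0.3],
    "lobster": [0.4, 0.5, 0.6],
    "cat": [0.7, 0.8, 0.9],  # in the table but not in our vocabulary: ignored
}

matrix = build_embedding_matrix(itos, pretrained, dim=3)
# Row i of `matrix` is the vector for word id i, so an embedding layer
# initialised from it maps each id straight to its pre-trained vector.
for word, row in zip(itos, matrix):
    print(word, row)
```

So the corpus itself is not changed. Only the lookup table that turns word ids into vectors starts from pre-trained values instead of random ones.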

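Unlike the Iterator and BucketIterator used before, this time we use a different iterator, BPTTIterator.
BPTTIterator does the following for us:

It splits the corpus into batches of sequence length bptt.
For example, suppose we have the following corpus: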

“Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.”

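This sentence is short, but a real corpus runs to thousands of words, so we cannot feed it in all at once. We want to split the corpus into shorter sequences. In the example above, splitting the corpus into batches with sequence length 5 gives the following sequences: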

["Machine", "learning", "is", "a", "field"],
["of", "computer", "science", "that", "gives"],
["computers", "the", "ability", "to", "learn"],
["without", "being", "explicitly", "programmed", EOS]

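It generates batches that are the input sequences shifted by one.
In language modeling, the supervision signal is the next word in the sequence. So we want sequences offset by 1 from the inputs (every word moved along by one position). For the example above we get the following sequences, which the model is trained to predict: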

["learning", "is", "a", "field", "of"],
["computer", "science", "that", "gives", "computers"],
["the", "ability", "to", "learn", "without"],
["being", "explicitly", "programmed", EOS, EOS]
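The two lists above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not BPTTIterator's actual code: the function name `bptt_pairs` and the padding rule (pad with EOS so the last chunk and its target are full length) are assumptions chosen to match the example.

```python
def bptt_pairs(tokens, bptt_len, eos="<eos>"):
    """Split `tokens` into (input, target) chunks of length `bptt_len`.

    Each target is its input shifted one word ahead. The stream is padded
    with `eos` so the last chunk and its target are full length.
    """
    n_chunks = -(-len(tokens) // bptt_len)  # ceiling division
    padded = tokens + [eos] * (n_chunks * bptt_len + 1 - len(tokens))
    pairs = []
    for start in range(0, n_chunks * bptt_len, bptt_len):
        inputs = padded[start:start + bptt_len]
        targets = padded[start + 1:start + 1 + bptt_len]
        pairs.append((inputs, targets))
    return pairs

corpus = ("Machine learning is a field of computer science that gives computers "
          "the ability to learn without being explicitly programmed").split()
pairs = bptt_pairs(corpus, bptt_len=5)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```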

train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test),
    batch_size=BATCH_SIZE,
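    bptt_len=30, # the sequence length discussed above
    device=-1, # integer device ids are deprecated: this triggers the warning below and falls back to CPU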
    repeat=False)

The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.

b = next(iter(train_iter))
vars(b).keys()

dict_keys(['batch_size', 'dataset', 'fields', 'text', 'target'])

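Here we can check the result by looking at the contents of text and target in b.
In the output, target is text shifted along by one position. With that done, we can start training the model.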

b.text[:, :3]

tensor([[    9,   953,     0],
        [   10,   324,  5909],
        [    9,    11, 20014],
        [   12,  5906,    27],
        [ 3872, 10434,     2],
        [ 3892,     3, 10780],
        [  886,    11,  3273],
        [   12,  9357,     0],
        [   10,  8826, 23499],
        [    9,  1228,     4],
        [   10,     7,   569],
        [    9,     2,   235],
        [20059,  2592,  5909],
        [   90,     3,    20],
        [ 3872,   141,     2],
        [   95,     8,  1450],
        [   49,  6794,   369],
        [    0,  9046,     5],
        [ 3892,  1497,     2],
        [   24,    13,  2168],
        [  786,     4,   488],
        [   49,    26,  5967],
        [28867,    25,   656],
        [    3, 18430,    14],
        [ 6213,    58,    48],
        [    4,  4886,  4364],
        [ 3872,   217,     4],
        [    5,     5,    22],
        [    2,     2,  1936],
        [ 5050,   593,    59]])

b.target[:, :3]

tensor([[   10,   324,  5909],
        [    9,    11, 20014],
        [   12,  5906,    27],
        [ 3872, 10434,     2],
        [ 3892,     3, 10780],
        [  886,    11,  3273],
        [   12,  9357,     0],
        [   10,  8826, 23499],
        [    9,  1228,     4],
        [   10,     7,   569],
        [    9,     2,   235],
        [20059,  2592,  5909],
        [   90,     3,    20],
        [ 3872,   141,     2],
        [   95,     8,  1450],
        [   49,  6794,   369],
        [    0,  9046,     5],
        [ 3892,  1497,     2],
        [   24,    13,  2168],
        [  786,     4,   488],
        [   49,    26,  5967],
        [28867,    25,   656],
        [    3, 18430,    14],
        [ 6213,    58,    48],
        [    4,  4886,  4364],
        [ 3872,   217,     4],
        [    5,     5,    22],
        [    2,     2,  1936],
        [ 5050,   593,    59],
        [   95,     7,    14]])
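Why do these tensors have 30 rows? As far as I can tell, BPTTIterator treats the whole corpus as one long stream of ids, cuts it into batch_size equal columns, and then slices off bptt_len rows at a time, so each batch has shape [bptt_len, batch_size]. The sketch below imitates that layout with plain lists and toy numbers. The helper names `batchify` and `get_batch` are hypothetical, and the real iterator works on tensors and pads the stream rather than dropping the tail.

```python
def batchify(ids, batch_size):
    """Lay a flat id stream out as rows x batch_size; column j is a contiguous slice."""
    n_rows = len(ids) // batch_size  # this sketch simply drops the ragged tail
    columns = [ids[j * n_rows:(j + 1) * n_rows] for j in range(batch_size)]
    return [[col[i] for col in columns] for i in range(n_rows)]

def get_batch(rows, start, bptt_len):
    """Return (text, target): target is text moved down by one row."""
    seq_len = min(bptt_len, len(rows) - 1 - start)
    return rows[start:start + seq_len], rows[start + 1:start + 1 + seq_len]

rows = batchify(list(range(24)), batch_size=3)  # 8 rows, 3 columns
text, target = get_batch(rows, start=0, bptt_len=4)
print(text)
print(target)
```

Reading down a column gives consecutive words of the corpus, which is why row 2 of text above matches row 1 of target.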

3.4 Training the language model

First we build the model, an RNN.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable as V

class RNNModel(nn.Module):
    def __init__(self, ntoken,ninp, nhid, nlayers, bsz, dropout=0.5, tie_weights=True):
        super(RNNModel, self).__init__()
        self.nhid, self.nlayers, self.bsz = nhid, nlayers, bsz
        self.drop=nn.Dropout(dropout)
        self.encoder=nn.Embedding(ntoken,ninp)
        self.rnn=nn.LSTM(ninp, nhid, nlayers, dropout=dropout)
        self.decoder=nn.Linear(nhid,ntoken)
        self.init_weights()
        self.hidden=self.init_hidden(bsz)
    
    def init_weights(self):
        initrange=0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.fill_(0)
        self.decoder.weight.data.uniform_(-initrange, initrange)
    
    def init_hidden(self, bsz):
        weight = next(self.parameters()).data
        return (V(weight.new(self.nlayers, bsz, self.nhid).zero_()), V(weight.new(self.nlayers, bsz, self.nhid).zero_()))
#         return (V(weight.new(self.nlayers, bsz, self.nhid).zero_().cuda()),
#                 V(weight.new(self.nlayers, bsz, self.nhid).zero_()).cuda())
        
    
    def reset_history(self):
        """Wraps hidden states in new Variables, to detach them from their history."""
        self.hidden = tuple(V(v.data) for v in self.hidden)
    
    def forward(self,input):
        emb=self.drop(self.encoder(input))
        output,self.hidden=self.rnn(emb, self.hidden)
        output=self.drop(output)
        decoded=self.decoder(output.view(output.size(0)*output.size(1),output.size(2)))
        return decoded.view(output.size(0), output.size(1), decoded.size(1))

weight_matrix=TEXT.vocab.vectors
model=RNNModel(weight_matrix.size(0),weight_matrix.size(1),200,1,BATCH_SIZE)
model.encoder.weight.data.copy_(weight_matrix)
# if USE_GPU:
#     model.cuda()
# model.to(device)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0715,  0.0935,  0.0237,  ...,  0.3362,  0.0306,  0.2558],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

criterion=nn.CrossEntropyLoss()
optimizer=optim.Adam(model.parameters(),lr=0.001,betas=(0.7,0.99))
n_epochs=2
n_tokens=weight_matrix.size(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

from tqdm import tqdm
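def train_epoch(epoch):
    epoch_loss=0
    model.train()  # re-enable dropout; model.eval() below would otherwise carry over into the next epoch
    for batch in tqdm(train_iter):
        model.reset_history()
        optimizer.zero_grad()
        text, target=batch.text, batch.target
        prediction=model(text)
        loss=criterion(prediction.view(-1,n_tokens),target.view(-1))
        loss.backward()
        optimizer.step()
        # loss is a per-token mean, so weight it by the tokens in this batch (seq_len * batch_size)
        epoch_loss += loss.item() * prediction.size(0) * prediction.size(1)

    epoch_loss /= len(train.examples[0].text)
    
    val_loss = 0
    model.eval()  # disable dropout for validation
    with torch.no_grad():  # no gradients are needed for validation
        for batch in valid_iter:
            model.reset_history()
            text, targets = batch.text, batch.target
            prediction = model(text)
            loss = criterion(prediction.view(-1, n_tokens), targets.view(-1))
            # Weight by seq_len * batch_size, as in the training loop. The run logged below
            # used text.size(0) alone, which makes the printed validation loss about
            # BATCH_SIZE times smaller than a per-token loss.
            val_loss += loss.item() * text.size(0) * text.size(1)
    val_loss /= len(valid.examples[0].text)
    
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

for epoch in range(1,n_epochs+1):
    train_epoch(epoch)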

Epoch: 1, Training Loss: 6.2106, Validation Loss: 0.1713
Epoch: 2, Training Loss: 5.2644, Validation Loss: 0.1595
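A note on reading these numbers. The validation loss looks far lower than the training loss, but that seems to be a scaling artifact: the run that produced this log weighted each validation batch by `text.size(0)` only, so the logged value is about 1/BATCH_SIZE of a per-token loss. The sketch below rescales it and converts the training loss to perplexity, the usual language-model metric. It assumes every batch had BATCH_SIZE columns and that the loss is in nats, as with `nn.CrossEntropyLoss`.

```python
import math

BATCH_SIZE = 32  # value used earlier in this post

def perplexity(per_token_loss):
    """Perplexity is exp of the average per-token cross-entropy (in nats)."""
    return math.exp(per_token_loss)

train_loss = 5.2644        # epoch 2 training loss from the log
logged_val_loss = 0.1595   # epoch 2 validation loss as logged
rescaled_val_loss = logged_val_loss * BATCH_SIZE  # approximate per-token value

print("train perplexity: {:.1f}".format(perplexity(train_loss)))
print("rescaled validation loss: {:.4f}".format(rescaled_val_loss))
```

After rescaling, the validation loss lands in the same range as the training loss, which is what one would expect after only two epochs.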


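The results are close to those in the original tutorial. One issue, though: both times I have hit a problem with GPU versus CPU. During training, calling to(device) no longer simply moves the data and the model onto the GPU the way it did before. I keep getting errors lately, so this needs more study.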
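The post does not show the error itself, so the following is a guess at two likely causes in the code as written. First, the iterator was built with `device=-1`, so every batch arrives on the CPU. Second, `self.hidden` is created in `__init__` as a plain attribute (neither a parameter nor a buffer), so `model.to(device)` and `model.cuda()` leave it on the CPU. Below is a minimal sketch of one way around both: create the hidden state lazily on the input's device, and move each batch explicitly. `TinyLM` is a hypothetical cut-down model, not the RNNModel above.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical cut-down LSTM language model, only to show device handling."""

    def __init__(self, ntoken, ninp, nhid):
        super().__init__()
        self.encoder = nn.Embedding(ntoken, ninp)
        self.rnn = nn.LSTM(ninp, nhid)
        self.decoder = nn.Linear(nhid, ntoken)
        self.hidden = None  # created lazily, on whatever device the input is on

    def reset_history(self):
        """Detach the hidden state so gradients stop at the batch boundary."""
        if self.hidden is not None:
            self.hidden = tuple(h.detach() for h in self.hidden)

    def forward(self, text):
        # text: [seq_len, batch_size] of word ids
        if self.hidden is not None and self.hidden[0].device != text.device:
            self.hidden = None  # stale state on another device: start again from zeros
        # Passing None makes nn.LSTM start from zeros on the input's device.
        output, self.hidden = self.rnn(self.encoder(text), self.hidden)
        return self.decoder(output)  # [seq_len, batch_size, ntoken]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyLM(ntoken=50, ninp=8, nhid=16).to(device)
text = torch.randint(0, 50, (5, 4)).to(device)  # move each batch explicitly
logits = model(text)
print(logits.shape)
```

With the torchtext iterator, passing `device=device` when building it should put the batches on the right device in the first place, so the explicit `.to(device)` on each batch is a belt-and-braces step.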

import numpy as np  # word_ids_to_sentence below refers to np.ndarray

b=next(iter(valid_iter))
def word_ids_to_sentence(id_tensor,vocab,join=None):
    if isinstance(id_tensor, torch.LongTensor):
        ids = id_tensor.transpose(0, 1).contiguous().view(-1)
    elif isinstance(id_tensor, np.ndarray):
        ids = id_tensor.transpose().reshape(-1)

    batch = [vocab.itos[ind] for ind in ids]  # denumericalize
    if join is None:
        return batch
    else:
        return join.join(batch)

word_ids_to_sentence(b.text.cpu().data, TEXT.vocab, join=' ')[:210]

'  <eos>   = homarus gammarus = <eos>   <eos>   homarus gammarus , known as the european lobster or common lobster , is a species of <unk> lobster from . <unk> ceo hiroshi <unk> referred to <unk> as one of his f'

arrs = model(b.text).cpu().data.numpy()

import numpy as np
word_ids_to_sentence(np.argmax(arrs, axis=2), TEXT.vocab, join=' ')[:210]

'<unk>   <eos> = = = <eos>   <eos>   <eos>   = the as " <unk> <unk> , <unk> starling " <unk> a <unk> of the , , the <eos> <unk> <unk> <unk> , to the the " of the first , , the , <eos>   <eos> years of the : the '

The results are mediocre. The tutorial goes on to train for more epochs, which I won't repeat here.