[fairseq] tutorial
2019-04-23
VanJordan
Let's start by working through the official LSTM tutorial.
- Build an encoder and a decoder model, subclassing `FairseqEncoder` and `FairseqDecoder` respectively.
- Besides the basic `__init__()` and `forward()` methods, the encoder also has to implement `reorder_encoder_out()`, which reorders the elements of the encoder output within a batch (a condensed sketch of the encoder itself follows the snippet below):
```python
def reorder_encoder_out(self, encoder_out, new_order):
    """
    Reorder encoder output according to `new_order`.

    Args:
        encoder_out: output from the ``forward()`` method
        new_order (LongTensor): desired order

    Returns:
        `encoder_out` rearranged according to `new_order`
    """
    final_hidden = encoder_out['final_hidden']
    return {
        'final_hidden': final_hidden.index_select(0, new_order),
    }
```
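The post never reproduces the encoder itself, so here is a condensed sketch in the spirit of the tutorial's `SimpleLSTMEncoder`. It omits details the full tutorial handles (for example converting left-padded source batches and packing padded sequences), so treat it as an illustration of the `FairseqEncoder` interface rather than the official implementation:

```python
import torch.nn as nn
from fairseq.models import FairseqEncoder

class SimpleLSTMEncoder(FairseqEncoder):

    def __init__(self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1):
        super().__init__(dictionary)
        self.embed_tokens = nn.Embedding(
            num_embeddings=len(dictionary),
            embedding_dim=embed_dim,
            padding_idx=dictionary.pad(),
        )
        self.dropout = nn.Dropout(p=dropout)
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, src_tokens, src_lengths):
        # src_tokens: (batch, src_len); src_lengths: (batch,)
        x = self.dropout(self.embed_tokens(src_tokens))
        _outputs, (final_hidden, _final_cell) = self.lstm(x)
        # Return a dict: this is what the decoder and reorder_encoder_out() expect.
        return {
            'final_hidden': final_hidden.squeeze(0),  # (batch, hidden_dim)
        }
```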
- Register the model with `register_model`:
```python
from fairseq.models import FairseqModel, register_model

# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.

@register_model('simple_lstm')
class SimpleLSTMModel(FairseqModel):
```
- All registered models must implement the `BaseFairseqModel` interface; for sequence-to-sequence models we implement the `FairseqModel` interface instead.
- Define the `add_args()` and `build_model()` methods:
```python
from fairseq.models import FairseqModel, register_model

# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.

@register_model('simple_lstm')
class SimpleLSTMModel(FairseqModel):

    @staticmethod
    def add_args(parser):
        # Models can override this method to add new command-line arguments.
        # Here we'll add some new command-line arguments to configure dropout
        # and the dimensionality of the embeddings and hidden states.
        parser.add_argument(
            '--encoder-embed-dim', type=int, metavar='N',
            help='dimensionality of the encoder embeddings',
        )
        parser.add_argument(
            '--encoder-hidden-dim', type=int, metavar='N',
            help='dimensionality of the encoder hidden state',
        )
        parser.add_argument(
            '--encoder-dropout', type=float, default=0.1,
            help='encoder dropout probability',
        )
        parser.add_argument(
            '--decoder-embed-dim', type=int, metavar='N',
            help='dimensionality of the decoder embeddings',
        )
        parser.add_argument(
            '--decoder-hidden-dim', type=int, metavar='N',
            help='dimensionality of the decoder hidden state',
        )
        parser.add_argument(
            '--decoder-dropout', type=float, default=0.1,
            help='decoder dropout probability',
        )

    @classmethod
    def build_model(cls, args, task):
        # Fairseq initializes models by calling the ``build_model()``
        # function. This provides more flexibility, since the returned model
        # instance can be of a different type than the one that was called.
        # In this case we'll just return a SimpleLSTMModel instance.

        # Initialize our Encoder and Decoder.
        encoder = SimpleLSTMEncoder(
            args=args,
            dictionary=task.source_dictionary,
            embed_dim=args.encoder_embed_dim,
            hidden_dim=args.encoder_hidden_dim,
            dropout=args.encoder_dropout,
        )
        decoder = SimpleLSTMDecoder(
            dictionary=task.target_dictionary,
            encoder_hidden_dim=args.encoder_hidden_dim,
            embed_dim=args.decoder_embed_dim,
            hidden_dim=args.decoder_hidden_dim,
            dropout=args.decoder_dropout,
        )
        model = SimpleLSTMModel(encoder, decoder)

        # Print the model architecture.
        print(model)

        return model

    # We could override the ``forward()`` if we wanted more control over how
    # the encoder and decoder interact, but it's not necessary for this
    # tutorial since we can inherit the default implementation provided by
    # the FairseqModel base class, which looks like:
    #
    # def forward(self, src_tokens, src_lengths, prev_output_tokens):
    #     encoder_out = self.encoder(src_tokens, src_lengths)
    #     decoder_out = self.decoder(prev_output_tokens, encoder_out)
    #     return decoder_out
```
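`build_model()` above also refers to a `SimpleLSTMDecoder` that the post never shows. The following is a condensed sketch of what such a decoder looks like in the official tutorial: it concatenates the encoder's final hidden state to every target-side embedding and projects the LSTM outputs onto the target vocabulary. Treat it as illustrative rather than the exact tutorial code:

```python
import torch
import torch.nn as nn
from fairseq.models import FairseqDecoder

class SimpleLSTMDecoder(FairseqDecoder):

    def __init__(self, dictionary, encoder_hidden_dim=128, embed_dim=128,
                 hidden_dim=128, dropout=0.1):
        super().__init__(dictionary)
        self.embed_tokens = nn.Embedding(
            num_embeddings=len(dictionary),
            embedding_dim=embed_dim,
            padding_idx=dictionary.pad(),
        )
        self.dropout = nn.Dropout(p=dropout)
        # Each decoder step sees its token embedding plus the encoder summary.
        self.lstm = nn.LSTM(
            input_size=embed_dim + encoder_hidden_dim,
            hidden_size=hidden_dim,
            batch_first=True,
        )
        self.output_projection = nn.Linear(hidden_dim, len(dictionary))

    def forward(self, prev_output_tokens, encoder_out):
        bsz, tgt_len = prev_output_tokens.size()
        final_hidden = encoder_out['final_hidden']  # (batch, encoder_hidden_dim)
        x = self.dropout(self.embed_tokens(prev_output_tokens))
        # Broadcast the encoder summary to every target position and concatenate.
        x = torch.cat(
            [x, final_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
            dim=2,
        )
        output, _ = self.lstm(x)
        logits = self.output_projection(output)  # (batch, tgt_len, vocab)
        # FairseqDecoder.forward() returns (logits, extra).
        return logits, None
```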
- Finally, register this configuration with `register_model_architecture()`. The first argument is the name of the model registered above and the second is the name of this architecture; afterwards you can select the architecture and its parameters on the command line with `-a tutorial_simple_lstm`.
- `getattr()` means: if `encoder_embed_dim` was already given a value (e.g., on the command line), use that value as the model parameter; otherwise fall back to 256 (a small check illustrating this follows the code below).
- Parameters defined here are generally ones you never need to change; anything that has to be tuned to reach a good result should instead be exposed in the model's `add_args()`, without `getattr`.
- Use the `setup_task()` function to load the data and return a task instance.
```python
from fairseq.models import register_model_architecture

# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'simple_lstm'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.

@register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
def tutorial_simple_lstm(args):
    # We use ``getattr()`` to prioritize arguments that are explicitly given
    # on the command-line, so that the defaults defined below are only used
    # when no other value has been specified.
    args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
    args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
    args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
    args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)
```
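To make the `getattr()` behavior concrete, here is a small, hypothetical check (a bare `argparse.Namespace` stands in for fairseq's parsed arguments):

```python
from argparse import Namespace

# Pretend --encoder-embed-dim 512 was passed on the command line.
args = Namespace(encoder_embed_dim=512)
tutorial_simple_lstm(args)

print(args.encoder_embed_dim)  # 512: the explicit value wins
print(args.decoder_embed_dim)  # 256: falls back to the architecture default
```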
- Train from the command line:
```bash
fairseq-train data-bin/iwslt14.tokenized.de-en \
  --arch tutorial_simple_lstm \
  --encoder-dropout 0.2 --decoder-dropout 0.2 \
  --optimizer adam --lr 0.005 --lr-shrink 0.5 \
  --max-tokens 12000
```
- One nice detail is creating tensors with `register_buffer()`: when running on a GPU, such buffers are moved automatically when `.cuda()` is called (a standalone illustration follows the snippet below).
```python
class FairseqRNNClassifier(BaseFairseqModel):

    def __init__(self, rnn, input_vocab):
        super().__init__()
        # The RNN module in the tutorial expects one-hot inputs, so we can
        # precompute the identity matrix to help convert from indices to
        # one-hot vectors. We register it as a buffer so that it is moved to
        # the GPU when ``cuda()`` is called.
        self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))
```
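A minimal, self-contained illustration of why buffers are convenient (the `OneHotter` class is made up for this example and is not part of fairseq): buffers travel with the module when it is moved between devices and are saved in the state dict, but they are not trainable parameters.

```python
import torch
import torch.nn as nn

class OneHotter(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.register_buffer('one_hot', torch.eye(vocab_size))

    def forward(self, indices):
        # Look up the one-hot row for each index.
        return self.one_hot[indices]

m = OneHotter(5)
print(m.one_hot.device)          # cpu
if torch.cuda.is_available():
    m = m.cuda()
    print(m.one_hot.device)      # cuda:0 -- the buffer moved along with the module
```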
task
- As in t2t, the most important piece is the task: the task loads the data, passes it to the model, and collects the returned results.
- In `load_dataset()`, `split` tells you which stage is being loaded: `train`, `valid`, or `test`.
- Note:
```python
import os
import torch

from fairseq.data import Dictionary, LanguagePairDataset
from fairseq.tasks import FairseqTask, register_task

@register_task('simple_classification')
class SimpleClassificationTask(FairseqTask):

    @staticmethod
    def add_args(parser):
        # Add some command-line arguments for specifying where the data is
        # located and the maximum supported input length.
        parser.add_argument('data', metavar='FILE',
                            help='file prefix for data')
        parser.add_argument('--max-positions', default=1024, type=int,
                            help='max input length')

    @classmethod
    def setup_task(cls, args, **kwargs):
        # Here we can perform any setup required for the task. This may include
        # loading Dictionaries, initializing shared Embedding layers, etc.
        # In this case we'll just load the Dictionaries.
        input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
        label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
        print('| [input] dictionary: {} types'.format(len(input_vocab)))
        print('| [label] dictionary: {} types'.format(len(label_vocab)))

        return SimpleClassificationTask(args, input_vocab, label_vocab)

    def __init__(self, args, input_vocab, label_vocab):
        super().__init__(args)
        self.input_vocab = input_vocab
        self.label_vocab = label_vocab

    def load_dataset(self, split, **kwargs):
        """Load a given dataset split (e.g., train, valid, test)."""

        prefix = os.path.join(self.args.data, '{}.input-label'.format(split))

        # Read input sentences.
        sentences, lengths = [], []
        with open(prefix + '.input', encoding='utf-8') as file:
            for line in file:
                sentence = line.strip()

                # Tokenize the sentence, splitting on spaces
                tokens = self.input_vocab.encode_line(
                    sentence, add_if_not_exist=False,
                )

                sentences.append(tokens)
                lengths.append(tokens.numel())

        # Read labels.
        labels = []
        with open(prefix + '.label', encoding='utf-8') as file:
            for line in file:
                label = line.strip()
                labels.append(
                    # Convert label to a numeric ID.
                    torch.LongTensor([self.label_vocab.add_symbol(label)])
                )

        assert len(sentences) == len(labels)
        print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))

        # We reuse LanguagePairDataset since classification can be modeled as a
        # sequence-to-sequence task where the target sequence has length 1.
        self.datasets[split] = LanguagePairDataset(
            src=sentences,
            src_sizes=lengths,
            src_dict=self.input_vocab,
            tgt=labels,
            tgt_sizes=torch.ones(len(labels)),  # targets have length 1
            tgt_dict=self.label_vocab,
            left_pad_source=False,
            max_source_positions=self.args.max_positions,
            max_target_positions=1,
            # Since our target is a single class label, there's no need for
            # input feeding. If we set this to ``True`` then our Model's
            # ``forward()`` method would receive an additional argument called
            # *prev_output_tokens* that would contain a shifted version of the
            # target sequence.
            input_feeding=False,
        )

    def max_positions(self):
        """Return the max input length allowed by the task."""
        # The source should be less than *args.max_positions* and the "target"
        # has max length 1.
        return (self.args.max_positions, 1)

    @property
    def source_dictionary(self):
        """Return the source :class:`~fairseq.data.Dictionary`."""
        return self.input_vocab

    @property
    def target_dictionary(self):
        """Return the target :class:`~fairseq.data.Dictionary`."""
        return self.label_vocab

    # We could override this method if we wanted more control over how batches
    # are constructed, but it's not necessary for this tutorial since we can
    # reuse the batching provided by LanguagePairDataset.
    #
    # def get_batch_iterator(
    #     self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
    #     ignore_invalid_inputs=False, required_batch_size_multiple=1,
    #     seed=1, num_shards=1, shard_id=0,
    # ):
    #     (...)
```
- Two properties have to be overridden, `source_dictionary` and `target_dictionary`; they simply return the `Dictionary` objects that fairseq already provides.
- Looking at the training side, the task defines how the data is loaded and stores the loaded split in `self.datasets[split]`; the chosen architecture then gets its data from there. Everything else, such as how batches are assembled, is already handled: we reuse the batching provided by `LanguagePairDataset`.
- `setup_task()` is where the vocabularies are loaded.
- Train the classifier from the command line:
```bash
fairseq-train names-bin \
  --task simple_classification \
  --arch pytorch_tutorial_rnn \
  --optimizer adam --lr 0.001 --lr-shrink 0.5 \
  --max-tokens 1000
```
- Write an evaluation script. `path` holds the model checkpoint (ckpt) file. Input is read with `sentence = input('\nInput: ')`, so you can also pipe data into the script.
- Use the `collate` function built into `fairseq.data` to assemble the data into a batch, then feed it to the model for prediction:
```python
from fairseq import data, options, tasks, utils

# Parse command-line arguments for generation
parser = options.get_generation_parser(default_task='simple_classification')
args = options.parse_args_and_arch(parser)

# Setup task
task = tasks.setup_task(args)

# Load model
print('| loading model from {}'.format(args.path))
models, _model_args = utils.load_ensemble_for_inference([args.path], task)
model = models[0]

while True:
    sentence = input('\nInput: ')

    # Tokenize into characters
    chars = ' '.join(list(sentence.strip()))
    tokens = task.source_dictionary.encode_line(
        chars, add_if_not_exist=False,
    )

    # Build mini-batch to feed to the model
    batch = data.language_pair_dataset.collate(
        samples=[{'id': -1, 'source': tokens}],  # bsz = 1
        pad_idx=task.source_dictionary.pad(),
        eos_idx=task.source_dictionary.eos(),
        left_pad_source=False,
        input_feeding=False,
    )

    # Feed batch to the model and get predictions
    preds = model(**batch['net_input'])

    # Print top 3 predictions and their log-probabilities
    top_scores, top_labels = preds[0].topk(k=3)
    for score, label_idx in zip(top_scores, top_labels):
        label_name = task.target_dictionary.string([label_idx])
        print('({:.2f})\t{}'.format(score, label_name))
```
- Command for generating predictions:
```bash
python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
```
Parsing the command line
- How the `preprocess` command line locates the various data files.
Training Flow
- `Tasks` store dictionaries and provide helpers for loading and iterating over the data during training.
- In the training flow, `task.get_batch_iterator` loads the data and returns a generator over batches, while the `lr_scheduler` adjusts the learning rate both after every update and at the end of every epoch:
```python
for epoch in range(num_epochs):
    itr = task.get_batch_iterator(task.dataset('train'))
    for num_updates, batch in enumerate(itr):
        task.train_step(batch, model, criterion, optimizer)
        average_and_clip_gradients()
        optimizer.step()
        lr_scheduler.step_update(num_updates)
    lr_scheduler.step(epoch)
```
- The default `task.train_step` is roughly the following (a sketch of a custom override follows the snippet):
```python
def train_step(self, batch, model, criterion, optimizer):
    loss = criterion(model, batch)
    optimizer.backward(loss)
```
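If you need per-step custom behavior (extra logging, loss scaling, and so on), you can override `train_step` in your own task. The sketch below follows the simplified signature used in the pseudocode above; the real `FairseqTask.train_step` takes a couple of extra arguments and returns the loss together with logging information, so check the fairseq version you are using. `MyCustomTask` is a hypothetical name for this example:

```python
from fairseq.tasks import FairseqTask, register_task

@register_task('my_custom_task')
class MyCustomTask(FairseqTask):
    # ... setup_task, load_dataset, etc. as in SimpleClassificationTask above ...

    def train_step(self, batch, model, criterion, optimizer):
        model.train()
        loss = criterion(model, batch)
        # Hook for custom behavior, e.g. scaling the loss before backprop.
        optimizer.backward(loss)
        return loss
```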
- You can also load extra modules from a custom location with `--user-dir`.
- Suppose the directory looks like:
```
/home/user/my-module/
└── __init__.py
```
- and `__init__.py` contains:
```python
from fairseq.models import register_model_architecture
from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big

@register_model_architecture('transformer', 'my_transformer')
def transformer_mmt_big(args):
    transformer_vaswani_wmt_en_de_big(args)
```
- The new architecture can then be selected in `fairseq-train`:
```bash
fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
```
task
- Typical task usage: `build_model`, `build_criterion`, `load_dataset`, `get_batch_iterator`, `get_loss`; essentially all of the work is done through the task:
```python
# setup the task (e.g., load dictionaries)
task = fairseq.tasks.setup_task(args)

# build model and criterion
model = task.build_model(args)
criterion = task.build_criterion(args)

# load datasets
task.load_dataset('train')
task.load_dataset('valid')

# iterate over mini-batches of data
batch_itr = task.get_batch_iterator(
    task.dataset('train'), max_tokens=4096,
)
for batch in batch_itr:
    # compute the loss
    loss, sample_size, logging_output = task.get_loss(
        model, criterion, batch,
    )
    loss.backward()
```