Notes on Modifying tensor2tensor

2019-04-01  VanJordan
Register a new en-zh translation problem; `approx_vocab_size` sets the approximate subword vocabulary size:

@registry.register_problem
class TranslateEnzhWmt32k(translate.TranslateProblem):
  @property
  def approx_vocab_size(self):
    return 2**15  # 32768, i.e. ~32k
datasets = train_dataset = _NC_TRAIN_DATASETS = [[
    "http://data.statmt.org/wmt18/translation-task/training-parallel-nc-v13.tgz", [
        "training-parallel-nc-v13/news-commentary-v13.zh-en.en",
        "training-parallel-nc-v13/news-commentary-v13.zh-en.zh"
    ]
]]  # only a single entry: the download URL plus the en/zh file paths
source_datasets = [[item[0], [item[1][0]]] for item in train_dataset]  # keep only the source-side file
# Result (note item[0] is the URL string itself, not a nested list):
source_datasets = [[
    "http://data.statmt.org/wmt18/translation-task/training-parallel-nc-v13.tgz",
    ["training-parallel-nc-v13/news-commentary-v13.zh-en.en"]
]]
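The split into per-language dataset specs can be checked standalone, without tensor2tensor. A minimal sketch, using the same `[url, [source_file, target_file]]` entry shape as above:

```python
# Each entry pairs a download URL with the [source, target] file paths
# inside the archive.
_NC_TRAIN_DATASETS = [[
    "http://data.statmt.org/wmt18/translation-task/training-parallel-nc-v13.tgz",
    [
        "training-parallel-nc-v13/news-commentary-v13.zh-en.en",
        "training-parallel-nc-v13/news-commentary-v13.zh-en.zh",
    ],
]]

train_dataset = _NC_TRAIN_DATASETS
# Keep the URL, but narrow the file list to one side of the pair.
source_datasets = [[item[0], [item[1][0]]] for item in train_dataset]
target_datasets = [[item[0], [item[1][1]]] for item in train_dataset]

print(source_datasets[0][1])
# → ['training-parallel-nc-v13/news-commentary-v13.zh-en.en']
```

The URL stays a bare string in each result entry; only the file list is rewrapped, so the output matches the shape `get_or_generate_vocab` expects.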

Generate the source-side vocabulary:

source_vocab = generator_utils.get_or_generate_vocab(
    data_dir,
    tmp_dir,
    self.source_vocab_name,
    self.approx_vocab_size,
    source_datasets,
    file_byte_budget=1e8,
    max_subtoken_length=self.max_subtoken_length)
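The `file_byte_budget=1e8` argument caps how much of each corpus file is read when counting tokens for the vocabulary, so huge corpora don't blow up vocab generation. A minimal pure-Python sketch of that idea (the real `get_or_generate_vocab` builds a subword vocabulary; here we only count whitespace tokens, and `sample_text` is a made-up stand-in for a corpus file):

```python
import collections
import io

def count_tokens(fileobj, file_byte_budget):
    """Count whitespace-separated tokens, reading at most ~file_byte_budget bytes."""
    counts = collections.Counter()
    budget = file_byte_budget
    for line in fileobj:
        if budget <= 0:
            break  # budget exhausted: ignore the rest of the file
        budget -= len(line.encode("utf-8"))
        counts.update(line.split())
    return counts

sample_text = "the cat sat\nthe dog ran\n"  # stand-in corpus, first line is 12 bytes
counts = count_tokens(io.StringIO(sample_text), file_byte_budget=12)
print(counts["the"])  # → 1 (only the first line fits in the budget)
```

With a budget of 1e8 this is effectively "sample the first ~100 MB of each file", which is usually plenty for estimating subword frequencies.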

Overall processing logic

Summary

