BERT

2019-01-25 瓜子小姐

BERT's improvements over ELMo/GPT
BERT's pre-training tasks and input representation
Fine-tuning and an overview of common datasets
Open questions

How does BERT improve on ELMo?

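Depth (bi-LSTM to Transformer) + simultaneous bidirectionality: ELMo models the two directions with a two-layer bidirectional RNN, but the loss for each direction is computed independently. BERT instead conditions on left and right context jointly in every layer.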

traditional language model.png

classification on concatenated bi-LSTM vectors.png

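Transformer vs. RNN
Self-attention does not depend on the output of the previous time step, so it is easy to parallelize;
Every pair of tokens attends to each other, so long-range dependencies can be captured.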
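
To make the two points above concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. The helper names (`matmul`, `softmax`, `self_attention`), the toy dimensions, and the random weights are assumptions for illustration, not code from the BERT release. Note that the score matrix for all token pairs is computed in one step, with no loop over time steps.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a whole sequence."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # scores[i][j] relates token i to token j directly, however far apart they are.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Toy setup (hypothetical sizes): 5 tokens, model width 8, head width 4.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(5, 8)
w_q, w_k, w_v = (rand_matrix(8, 4) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(len(out), len(out[0]))  # 5 4
```

Each row of `weights` sums to 1 and spans every position in the sequence, which is the "every pair of tokens" property; an RNN would have to pass the same information through each intermediate step.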

transformer.png

Bidirectional + large corpus + ...

GPT-pretrain.png

pre-train+fine-tune.png

Task 1: Masked Language Model
Task 2: Next Sentence Prediction
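
As a rough illustration of Task 2 and of the input representation, the sketch below builds one next-sentence-prediction example. The 50/50 split between the true next sentence and a random one, and the `[CLS] A [SEP] B [SEP]` layout with segment ids, follow the BERT paper. The function name `make_nsp_example`, the toy documents, and the simple "trim the longer side from the end" truncation are assumptions made for this sketch. It expects at least two documents and at least two sentences in `doc`.

```python
import random

def make_nsp_example(doc, all_docs, rng, max_len=512):
    """Build (tokens, segment_ids, is_next) from documents of tokenized sentences."""
    i = rng.randrange(len(doc) - 1)
    sent_a = list(doc[i])
    if rng.random() < 0.5:
        # Positive example: B really follows A.
        sent_b, is_next = list(doc[i + 1]), True
    else:
        # Negative example: B is a random sentence from a different document.
        other = rng.choice([d for d in all_docs if d is not doc])
        sent_b, is_next = list(rng.choice(other)), False
    # Keep the combined length within max_len, counting [CLS] and two [SEP].
    # Assumption: trim the longer sentence from the end, one token at a time.
    while len(sent_a) + len(sent_b) > max_len - 3:
        longer = sent_a if len(sent_a) >= len(sent_b) else sent_b
        longer.pop()
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP].
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, is_next

# Hypothetical toy corpus: each document is a list of tokenized sentences.
docs = [
    [["the", "man", "went", "to", "the", "store"], ["he", "bought", "milk"]],
    [["penguins", "are", "flightless"], ["they", "live", "in", "the", "south"]],
]
tokens, segment_ids, is_next = make_nsp_example(docs[0], docs, random.Random(0))
print(tokens, segment_ids, is_next)
```

The `is_next` label is what the classifier on top of the `[CLS]` position is trained to predict.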

Why is a masked LM needed? (Section 3.3.1)
15% of the input tokens are selected for prediction.
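
The 15% selection can be sketched as follows. The 80/10/10 replacement rule (replace with `[MASK]`, replace with a random token, or leave unchanged) comes from the BERT paper rather than from these notes. The function name `mask_tokens`, the toy vocabulary, and the rounding of the number of predictions are assumptions for illustration.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    # Special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    n_pred = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    positions = sorted(rng.sample(candidates, n_pred))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = "[MASK]"            # 80%: replace with the mask token
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

# Hypothetical toy input: 20 ordinary tokens between [CLS] and [SEP].
vocab = ["w%d" % i for i in range(50)]
tokens = ["[CLS]"] + vocab[:20] + ["[SEP]"]
masked, labels = mask_tokens(tokens, vocab, random.Random(0))
print(len(labels))  # 3
```

The model is trained to recover `labels` from `masked`. Because a selected position may still show the original or a random token, the model cannot assume that an unmasked token is correct.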
Why is the length limited? (Section 3.3.2)
They are sampled such that the combined length is ≤ 512 tokens?
What about full-length articles?