2019-02 BERT Pre-training
Overall, BERT pre-training relies on the following three elements:
1. Next Sentence Prediction (NSP): the two segments are concatenated in the format [CLS] [tokens_A] [SEP] [tokens_B] [SEP].
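A toy illustration of this layout, together with the segment ids that will later feed the Segment Embedding (the sentences are made up):

```python
tokens_a = ["the", "man", "went", "to", "the", "store"]
tokens_b = ["he", "bought", "a", "gallon", "of", "milk"]

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
# segment id 0 covers [CLS], sentence A and its [SEP]; 1 covers sentence B and its [SEP]
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

# 50% of the time tokens_b really follows tokens_a (label 0, "next sentence");
# 50% of the time it is a random segment from another document (label 1).
```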
2. Masked Language Model (Masked LM): 15% of the tokens in the sequence above are chosen for masking; each chosen token is replaced with [MASK] with probability 80%, replaced with a random word with probability 10%, and kept unchanged with probability 10%. The code is as follows:
```python
import collections

MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
                                          ["index", "label"])

def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
  """Creates the predictions for the masked LM objective."""
  cand_indexes = []
  for (i, token) in enumerate(tokens):
    # [CLS] and [SEP] are never masked.
    if token == "[CLS]" or token == "[SEP]":
      continue
    cand_indexes.append(i)

  rng.shuffle(cand_indexes)

  output_tokens = list(tokens)

  num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))

  masked_lms = []
  covered_indexes = set()
  # Walk the shuffled candidate positions, masking tokens until
  # num_to_predict positions have been selected.
  for index in cand_indexes:
    if len(masked_lms) >= num_to_predict:
      break
    if index in covered_indexes:
      continue
    covered_indexes.add(index)

    masked_token = None
    # 80% of the time, replace with [MASK]
    if rng.random() < 0.8:
      masked_token = "[MASK]"
    else:
      # 10% of the time, keep original
      if rng.random() < 0.5:
        masked_token = tokens[index]
      # 10% of the time, replace with random word
      else:
        masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

    output_tokens[index] = masked_token
    masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

  masked_lms = sorted(masked_lms, key=lambda x: x.index)  # size = [num_to_predict]

  masked_lm_positions = []
  masked_lm_labels = []
  for p in masked_lms:
    masked_lm_positions.append(p.index)
    masked_lm_labels.append(p.label)

  return (output_tokens, masked_lm_positions, masked_lm_labels)
```
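A minimal usage sketch (toy token list; `rng` is a plain `random.Random`, and `MaskedLmInstance` is the namedtuple defined above):

```python
import random

tokens = ["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]",
          "he", "bought", "milk", "[SEP]"]
vocab_words = ["the", "man", "went", "to", "store", "he", "bought", "milk"]
rng = random.Random(12345)

output_tokens, positions, labels = create_masked_lm_predictions(
    tokens, masked_lm_prob=0.15, max_predictions_per_seq=20,
    vocab_words=vocab_words, rng=rng)

# len(tokens) = 12, so round(12 * 0.15) = 2 positions get masked;
# `positions` holds their indices and `labels` the original tokens there.
print(output_tokens, positions, labels)
```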
3. Three embedding operations: Token Embedding + Segment Embedding + Position Embedding.
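The three embeddings are simply summed element-wise to produce the encoder input. A minimal NumPy sketch of the idea (toy ids and randomly initialized tables; the real implementation in modeling.py also applies layer normalization and dropout to the sum):

```python
import numpy as np

vocab_size, max_position, hidden = 30522, 512, 768
rng = np.random.default_rng(0)
token_table = rng.normal(0, 0.02, (vocab_size, hidden))       # Token Embedding
segment_table = rng.normal(0, 0.02, (2, hidden))              # Segment Embedding (A vs. B)
position_table = rng.normal(0, 0.02, (max_position, hidden))  # Position Embedding

input_ids = np.array([101, 1996, 2158, 102, 2002, 102])       # toy token ids
segment_ids = np.array([0, 0, 0, 0, 1, 1])

embeddings = (token_table[input_ids]
              + segment_table[segment_ids]
              + position_table[np.arange(len(input_ids))])
# embeddings.shape == (6, 768): one vector per input position
```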
1. Data preprocessing
The training data format is shown below (maximum sequence length max_seq_length=128, maximum number of masked positions per sequence max_predictions_per_seq=20):
(Figure: the prepared training-data format)
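In place of the figure, the fields of one serialized training instance look roughly like this (the feature names are those written by create_pretraining_data.py; the values are a made-up, truncated illustration):

```python
# One training instance, padded to max_seq_length=128 and
# max_predictions_per_seq=20 (only the first few values shown):
features = {
    "input_ids":        [101, 1996, 103, 2253, 102, 2002, 103, 102, 0, 0],  # ...
    "input_mask":       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],   # 1 = real token, 0 = padding
    "segment_ids":      [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],   # 0 = sentence A, 1 = sentence B
    "masked_lm_positions": [2, 6, 0, 0],                   # indices of masked tokens, 0-padded
    "masked_lm_ids":       [2158, 4149, 0, 0],             # original ids at those positions
    "masked_lm_weights":   [1.0, 1.0, 0.0, 0.0],           # 1.0 = real prediction, 0.0 = padding
    "next_sentence_labels": [0],                           # 0 = actual next, 1 = random
}
```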
In the BERT source file create_pretraining_data.py, the data processing works as follows:
1. Read all of the raw text into memory, then shuffle it (at the document level).
2. Convert that text into the format shown above, keep all resulting instances in memory, and shuffle them again at the end.
That said, I still have two open questions about this pipeline:
1. At large scale the corpus clearly cannot all be held in memory, yet sampling a random next_sentence_pair appears to require the full in-memory document list, so that a random "next sentence" can be drawn from another document. How is this handled for large corpora? Presumably by processing the data in separate shards.
2. run_pretraining.py shuffles again when it reads the tfrecords files later, so is the shuffle during data preparation actually necessary?
The overall data-processing code is as follows:
```python
def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
  """Create `TrainingInstance`s from raw text."""
  all_documents = [[]]
  for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
      while True:
        line = tokenization.convert_to_unicode(reader.readline())
        if not line:
          break
        line = line.strip()

        # Empty lines are used as document delimiters.
        if not line:
          all_documents.append([])
        tokens = tokenizer.tokenize(line)
        if tokens:
          all_documents[-1].append(tokens)

  # Remove empty documents, then shuffle at the document level.
  all_documents = [x for x in all_documents if x]
  rng.shuffle(all_documents)

  vocab_words = list(tokenizer.vocab.keys())
  instances = []
  # Each document is reused `dupe_factor` times so that different
  # masks are applied to the same text.
  for _ in range(dupe_factor):
    for document_index in range(len(all_documents)):
      instances.extend(
          create_instances_from_document(
              all_documents, document_index, max_seq_length, short_seq_prob,
              masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

  rng.shuffle(instances)
  return instances
```
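To make the "empty line = document boundary" handling concrete, here is a tiny self-contained sketch of the same logic, with a whitespace split standing in for BERT's WordPiece tokenizer:

```python
raw_text = """Document one, sentence one.
Document one, sentence two.

Document two, its only sentence.
"""

all_documents = [[]]
for line in raw_text.splitlines():
    line = line.strip()
    if not line:               # blank line: start a new document
        all_documents.append([])
        continue
    tokens = line.split()      # stand-in for tokenizer.tokenize(line)
    all_documents[-1].append(tokens)
all_documents = [d for d in all_documents if d]

print(len(all_documents))      # 2
print(all_documents[1])        # [['Document', 'two,', 'its', 'only', 'sentence.']]
```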
2. The model pre-training process
1. Computing the two losses:
1. Mask Loss
a. Input: the final-layer sequence embeddings of the BERT model plus masked_positions ==> a gather function collects the embeddings at those positions.
b. The gathered embeddings plus masked_lm_ids ==> a loss for each masked position.
c. That loss plus masked_lm_weights ==> per-position losses restricted to the non-padding positions.
d. Finally, the average loss over all masked positions in the batch.
The code is as follows:
```python
def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
  """Get loss and log probs for the masked LM."""
  # Gather the vectors at the masked positions, so
  # input_tensor = (batch_size*20, width).
  input_tensor = gather_indexes(input_tensor, positions)

  with tf.variable_scope("cls/predictions"):
    # We apply one more non-linear transformation before the output layer.
    # This matrix is not used after pre-training.
    with tf.variable_scope("transform"):
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = modeling.layer_norm(input_tensor)

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)  # (batch_size*20, vocab_size)
    logits = tf.nn.bias_add(logits, output_bias)
    # (batch_size*20, vocab_size): each row is one masked position, each
    # column the log-probability of one vocabulary token.
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    label_ids = tf.reshape(label_ids, [-1])
    label_weights = tf.reshape(label_weights, [-1])

    one_hot_labels = tf.one_hot(
        label_ids, depth=bert_config.vocab_size, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])  # (batch_size*20,)
    # `label_weights` is 1.0 for real predictions and 0.0 for padding, so the
    # quotient is the average loss over the real masked positions.
    numerator = tf.reduce_sum(label_weights * per_example_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5
    loss = numerator / denominator

  return (loss, per_example_loss, log_probs)
```
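The gather_indexes helper called at the top of get_masked_lm_output is defined elsewhere in run_pretraining.py. The idea is to flatten the batch and sequence dimensions and then index with tf.gather; a sketch of that approach (static shapes assumed for simplicity):

```python
def gather_indexes(sequence_tensor, positions):
  """Gathers the vectors at `positions` over a minibatch.

  sequence_tensor: [batch_size, seq_length, width]
  positions:       [batch_size, max_predictions_per_seq] (int32)
  returns:         [batch_size * max_predictions_per_seq, width]
  """
  batch_size, seq_length, width = sequence_tensor.shape.as_list()
  # Offset each row's positions so they index into the flattened batch.
  flat_offsets = tf.reshape(
      tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
  flat_positions = tf.reshape(positions + flat_offsets, [-1])
  flat_sequence_tensor = tf.reshape(sequence_tensor,
                                    [batch_size * seq_length, width])
  return tf.gather(flat_sequence_tensor, flat_positions)
```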
2. Classification Loss
a. Input: the pooled output of the BERT model, i.e. the embedding of the leading [CLS] token.
b. That embedding plus next_sentence_labels ==> a loss for each example in the batch.
c. Finally, the average loss over the batch.
The code is as follows:
```python
def get_next_sentence_output(bert_config, input_tensor, labels):
  """Get loss and log probs for the next sentence prediction."""
  # input_tensor = (batch_size, hidden_size): the pooled [CLS] representation.
  # Simple binary classification. Note that 0 is "next sentence" and 1 is
  # "random sentence". This weight matrix is not used after pre-training.
  with tf.variable_scope("cls/seq_relationship"):
    output_weights = tf.get_variable(
        "output_weights",
        shape=[2, bert_config.hidden_size],
        initializer=modeling.create_initializer(bert_config.initializer_range))
    output_bias = tf.get_variable(
        "output_bias", shape=[2], initializer=tf.zeros_initializer())

    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)  # (batch_size, 2)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    labels = tf.reshape(labels, [-1])
    one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)  # average loss over the batch
    return (loss, per_example_loss, log_probs)
```
3. Total Loss
Total loss = Mask Loss + Classification Loss
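Roughly how the two heads are wired together inside run_pretraining.py's model_fn (a sketch assuming the surrounding BERT repo context; the names follow the source):

```python
model = modeling.BertModel(
    config=bert_config,
    is_training=is_training,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids)

(masked_lm_loss, _, _) = get_masked_lm_output(
    bert_config, model.get_sequence_output(), model.get_embedding_table(),
    masked_lm_positions, masked_lm_ids, masked_lm_weights)

(next_sentence_loss, _, _) = get_next_sentence_output(
    bert_config, model.get_pooled_output(), next_sentence_labels)

total_loss = masked_lm_loss + next_sentence_loss  # the objective that is minimized
```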