2019-02 BERT Pre-training

2019-02-18  Hugo_Ng_7777

Overall, BERT pre-training relies on the following three elements:
一、 Next Sentence Prediction (NSP): the two sentences are concatenated in the format [CLS] [token_A] [SEP] [token_B] [SEP].
二、 Masked Language Model (Masked LM): 15% of the tokens in the sequence above are chosen for masking; for each chosen token, 80% of the time it is replaced with [MASK], 10% of the time with a random word, and 10% of the time it is left unchanged. The code, followed by a small usage example, is as follows:

import collections

# MaskedLmInstance records the index of a masked position and its original token.
MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"])


def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
    cand_indexes = []
    for (i, token) in enumerate(tokens):
        if token == "[CLS]" or token == "[SEP]":
            continue
        cand_indexes.append(i)

    rng.shuffle(cand_indexes)

    output_tokens = list(tokens)

    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))

    masked_lms = []
    covered_indexes = set()
    for index in cand_indexes:  ## walk the shuffled candidates and mask tokens until num_to_predict positions have been masked
        if len(masked_lms) >= num_to_predict:
            break
        if index in covered_indexes:
            continue
        covered_indexes.add(index)

        masked_token = None
        # 80% of the time, replace with [MASK]
        if rng.random() < 0.8:
            masked_token = "[MASK]"
        else:
            # 10% of the time, keep original
            if rng.random() < 0.5:
                masked_token = tokens[index]
            # 10% of the time, replace with random word
            else:
                masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

        output_tokens[index] = masked_token

        masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

    masked_lms = sorted(masked_lms, key=lambda x: x.index) ## size=[num_to_predict]

    masked_lm_positions = []
    masked_lm_labels = []
    for p in masked_lms:
        masked_lm_positions.append(p.index)
        masked_lm_labels.append(p.label)

    return (output_tokens, masked_lm_positions, masked_lm_labels)
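
For example, calling the function above on a short tokenized sentence pair might look like this (a hypothetical toy vocabulary; Python's random.Random serves as rng, as in the BERT source):

import random

rng = random.Random(12345)
tokens = ["[CLS]", "the", "dog", "barks", "[SEP]", "it", "is", "loud", "[SEP]"]
vocab_words = ["the", "dog", "barks", "it", "is", "loud", "cat", "runs"]

output_tokens, masked_lm_positions, masked_lm_labels = create_masked_lm_predictions(
    tokens, masked_lm_prob=0.15, max_predictions_per_seq=20,
    vocab_words=vocab_words, rng=rng)

# With 9 tokens and masked_lm_prob=0.15, round(9 * 0.15) = 1 position gets masked.
print(output_tokens)        # e.g. the input tokens with one of them replaced by "[MASK]"
print(masked_lm_positions)  # e.g. [3]
print(masked_lm_labels)     # e.g. ["barks"]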

三、 Three embedding operations: Token Embedding + Segment Embedding + Position Embedding (summed element-wise; a small sketch follows).
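
As a minimal NumPy sketch of what these three lookups amount to (not the actual modeling.py code; the sizes and ids are illustrative), the input representation is just the element-wise sum of the three embeddings:

import numpy as np

# Illustrative sizes; BERT-base uses vocab_size=30522, max_position=512, hidden_size=768.
vocab_size, max_position, hidden_size = 100, 16, 8

token_embedding    = np.random.randn(vocab_size, hidden_size)    # one row per wordpiece id
segment_embedding  = np.random.randn(2, hidden_size)             # sentence A = 0, sentence B = 1
position_embedding = np.random.randn(max_position, hidden_size)  # learned absolute positions

def embed(input_ids, segment_ids):
    """Input representation = element-wise sum of token, segment and position embeddings."""
    positions = np.arange(len(input_ids))
    return (token_embedding[input_ids]
            + segment_embedding[segment_ids]
            + position_embedding[positions])

# "[CLS] A A [SEP] B B [SEP]": segment id 0 for the first sentence, 1 for the second
x = embed(input_ids=[1, 7, 8, 2, 9, 10, 2],
          segment_ids=[0, 0, 0, 0, 1, 1, 1])
print(x.shape)  # (7, 8)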

1. Data preprocessing

 The figure below shows the data format required for training (a concrete sketch of one instance follows the figure):
(maximum sequence length max_seq_length=128, maximum number of masked positions max_predictions_per_seq=20)

[Figure: the prepared data format]
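
To make the format concrete, here is a minimal sketch of the features that create_pretraining_data.py writes into each tf.train.Example (the feature names match the BERT source; the token ids and values below are made up for illustration):

# One training instance, padded to max_seq_length=128 and max_predictions_per_seq=20.
example_features = {
    # token ids of "[CLS] sent_A [SEP] sent_B [SEP]" plus padding, length 128
    "input_ids":            [101, 2023, 103, 2003, 102, 1037, 103, 3899, 102] + [0] * 119,
    # 1 for real tokens, 0 for padding, length 128
    "input_mask":           [1] * 9 + [0] * 119,
    # 0 for sentence A (incl. [CLS] and the first [SEP]), 1 for sentence B, length 128
    "segment_ids":          [0] * 5 + [1] * 4 + [0] * 119,
    # indices of the masked tokens, padded to length 20
    "masked_lm_positions":  [2, 6] + [0] * 18,
    # original vocab ids of the masked tokens, padded to length 20
    "masked_lm_ids":        [2047, 2210] + [0] * 18,
    # 1.0 for real masked positions, 0.0 for padding, length 20
    "masked_lm_weights":    [1.0, 1.0] + [0.0] * 18,
    # 0 = actual next sentence, 1 = random sentence
    "next_sentence_labels": [0],
}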

 In the BERT source file create_pretraining_data.py,
 the data processing works as follows:
  1. First read all of the raw text into memory, then shuffle.
  2. Convert that text into the format shown in the figure above, keep every instance in memory, and shuffle once more at the end.
 Two things about this processing are still unclear to me:
  1. For a large corpus the data obviously cannot all be held in memory, yet picking a random next_sentence pair seems to require the complete in-memory document list (see the sketch after the code below). How is this handled at scale? Presumably the data has to be processed in batches.
  2. run_pretraining.py shuffles again when it reads the tfrecord files, so is the shuffle in this preprocessing step strictly necessary?

 The overall data-processing code is as follows:

def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
    """Create `TrainingInstance`s from raw text."""
    all_documents = [[]]

    for input_file in input_files:
        with tf.gfile.GFile(input_file, "r") as reader:
            while True:
                line = tokenization.convert_to_unicode(reader.readline())
                if not line:
                    break
                line = line.strip()

                # Empty lines are used as document delimiters
                if not line:
                    all_documents.append([])
                tokens = tokenizer.tokenize(line)
                if tokens:
                    all_documents[-1].append(tokens)

    # Remove empty documents
    all_documents = [x for x in all_documents if x]
    rng.shuffle(all_documents)

    vocab_words = list(tokenizer.vocab.keys())
    instances = []
    for _ in range(dupe_factor):
        for document_index in range(len(all_documents)):
            instances.extend(
                create_instances_from_document(all_documents, document_index, max_seq_length, short_seq_prob,
                                               masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

    rng.shuffle(instances)
    return instances
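
Regarding the first question above: inside create_instances_from_document, sentence B is chosen roughly as in the sketch below (heavily condensed; pick_sentence_b is a hypothetical helper name, and the length-budget/truncation logic is omitted). With 50% probability the "next sentence" is sampled from a random other document, which is why the full document list is kept in memory:

def pick_sentence_b(all_documents, document_index, following_segments, rng):
    """Return (tokens_b, is_random_next). `following_segments` are the tokenized
    segments of the current document that come right after sentence A."""
    if not following_segments or rng.random() < 0.5:
        # Random next: sample a *different* document from the full in-memory list.
        is_random_next = True
        random_document_index = rng.randint(0, len(all_documents) - 1)
        while random_document_index == document_index:
            random_document_index = rng.randint(0, len(all_documents) - 1)
        random_document = all_documents[random_document_index]
        random_start = rng.randint(0, len(random_document) - 1)
        tokens_b = []
        for segment in random_document[random_start:]:
            tokens_b.extend(segment)
    else:
        # Actual next: sentence B is simply the text that follows sentence A.
        is_random_next = False
        tokens_b = []
        for segment in following_segments:
            tokens_b.extend(segment)
    return tokens_b, is_random_next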

 

2. Model pre-training

 1. Computing the two losses:

  1. Mask Loss
    a. Input: the final-layer sequence embeddings of BERT plus masked_positions ==> a gather operation collects the embeddings at those positions (the gather_indexes helper is sketched after the code below)
    b. The gathered embeddings plus masked_lm_ids ==> the loss at each masked position
    c. That loss plus masked_lm_weights ==> keeps only the loss at the non-padding positions
    d. Finally, the average loss over all masked positions in the batch

  The code is as follows:

def get_masked_lm_output(bert_config,
                         input_tensor,
                         output_weights,
                         positions,
                         label_ids,
                         label_weights):
    """Get loss and log probs for the masked LM."""

    input_tensor = gather_indexes(input_tensor, positions)  ## gather the vectors at the masked positions, so input_tensor=(batch_size*20, width)

    with tf.variable_scope("cls/predictions"):
        # We apply one more non-linear transformation before the output layer.
        # This matrix is not used after pre-training.
        with tf.variable_scope("transform"):
            input_tensor = tf.layers.dense(input_tensor,
                                           units=bert_config.hidden_size,
                                           activation=modeling.get_activation(bert_config.hidden_act),
                                           kernel_initializer=modeling.create_initializer(
                                               bert_config.initializer_range))
            input_tensor = modeling.layer_norm(input_tensor)

        # The output weights are the same as the input embeddings
        output_bias = tf.get_variable("output_bias",
                                      shape=[bert_config.vocab_size],
                                      initializer=tf.zeros_initializer())
        logits = tf.matmul(input_tensor, output_weights, transpose_b=True)  ## (batch_size*20, vocab_size)
        logits = tf.nn.bias_add(logits, output_bias)  # an output-only bias for each token.
        log_probs = tf.nn.log_softmax(logits,
                                      axis=-1)  ## (batch_size*20, vocab_size): each row is one masked position, each column the log-probability of a vocabulary token

        label_ids = tf.reshape(label_ids, [-1])
        label_weights = tf.reshape(label_weights, [-1])

        one_hot_labels = tf.one_hot(label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

        per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])  ## (batch_size*20,)
        numerator = tf.reduce_sum(label_weights * per_example_loss)
        denominator = tf.reduce_sum(label_weights) + 1e-5
        loss = numerator / denominator  ## average loss over the batch_size*20 masked positions

    return (loss, per_example_loss, log_probs)
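
For reference, the gather_indexes helper used at the top of get_masked_lm_output is defined in run_pretraining.py. Along the lines of the original (a sketch; it assumes the same tf and modeling imports as that file), it flattens the (batch_size, seq_length, width) tensor and gathers the vectors at the masked positions:

def gather_indexes(sequence_tensor, positions):
    """Gather the vectors at `positions` over a minibatch.

    sequence_tensor: (batch_size, seq_length, width)
    positions:       (batch_size, max_predictions_per_seq)
    returns:         (batch_size * max_predictions_per_seq, width)
    """
    sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3)
    batch_size, seq_length, width = sequence_shape

    # Offset each example's positions by the row where that example starts in the flattened tensor.
    flat_offsets = tf.reshape(
        tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
    flat_positions = tf.reshape(positions + flat_offsets, [-1])
    flat_sequence_tensor = tf.reshape(sequence_tensor,
                                      [batch_size * seq_length, width])
    return tf.gather(flat_sequence_tensor, flat_positions)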

  2. Classification Loss
    a. Input: the pooled output of BERT, i.e. the embedding of the leading [CLS] token
    b. That embedding plus next_sentence_labels ==> the loss for each example in the batch
    c. Finally, the average loss over the batch

  The code is as follows:

def get_next_sentence_output(bert_config, input_tensor, labels):  ##input_tensor=(batch_size, hidden_size)
    """Get loss and log probs for the next sentence prediction."""

    # Simple binary classification. Note that 0 is "next sentence" and 1 is
    # "random sentence". This weight matrix is not used after pre-training.

    with tf.variable_scope("cls/seq_relationship"):
        output_weights = tf.get_variable("output_weights",
                                         shape=[2, bert_config.hidden_size],
                                         initializer=modeling.create_initializer(bert_config.initializer_range))
        output_bias = tf.get_variable("output_bias", shape=[2], initializer=tf.zeros_initializer())

        logits = tf.matmul(input_tensor, output_weights, transpose_b=True)  ##(batch_size, 2)
        logits = tf.nn.bias_add(logits, output_bias)
        log_probs = tf.nn.log_softmax(logits, axis=-1)

        labels = tf.reshape(labels, [-1])
        one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)  ## average loss over the batch
        return (loss, per_example_loss, log_probs)

  3. Total Loss
    Total loss = Mask Loss + Classification Loss; the two are simply added, as in the snippet below.
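
  In run_pretraining.py's model_fn, the combination is exactly this sum; trimmed to the loss-related lines it looks like:

(masked_lm_loss, masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
    bert_config, model.get_sequence_output(), model.get_embedding_table(),
    masked_lm_positions, masked_lm_ids, masked_lm_weights)

(next_sentence_loss, next_sentence_example_loss, next_sentence_log_probs) = get_next_sentence_output(
    bert_config, model.get_pooled_output(), next_sentence_labels)

total_loss = masked_lm_loss + next_sentence_loss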

