1. tokenizer

2022-04-28  yoyo9999
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')

t1 = ' Today we released the new model: multilingual generative pre-trained transformer.'
t2 = 'The checkpoints are available on Huggingface model page.'

# Encode the sentence pair: truncate only the second sentence when the pair
# exceeds max_length, pad up to max_length, and return character offsets
# plus token type ids for every token.
encoded_texts = tokenizer(
    t1,
    t2,
    truncation='only_second',
    max_length=20,
    padding='max_length',
    return_offsets_mapping=True,
    return_token_type_ids=True,
)

print(encoded_texts)
====================================
{
'input_ids': [1, 2477, 52, 703, 5, 92, 1421, 35, 7268, 41586, 20181, 3693, 1198, 12, 23830, 40878, 4, 2, 133, 2], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
'offset_mapping': [(0, 0), (0, 6), (6, 9), (9, 18), (18, 22), (22, 26), (26, 32), (32, 33), (33, 38), (38, 46), (46, 52), (52, 57), (57, 61), (61, 62), (62, 69), (69, 81), (81, 82), (0, 0), (0, 3), (0, 0)]
}
# Map the ids back to token strings to inspect the special-token layout.
converted2txt = tokenizer.convert_ids_to_tokens(encoded_texts['input_ids'])
print(converted2txt)
=================================
['[CLS]', 'ĠToday', 'Ġwe', 'Ġreleased', 'Ġthe', 'Ġnew', 'Ġmodel',
 ':', 'Ġmult', 'ilingual', 'Ġgener', 'ative', 'Ġpre', '-', 'trained', 
'Ġtransformer', '.', '[SEP]', 'The', '[SEP]']
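
The 'Ġ' prefix comes from the byte-level BPE vocabulary that DeBERTa inherits from GPT-2; it marks a token that begins with a space. A minimal sketch of restoring readable text from such tokens, reusing the tokenizer loaded above:

# convert_tokens_to_string undoes the byte-level encoding,
# turning each 'Ġ' back into an ordinary space.
tokens = ['ĠToday', 'Ġwe', 'Ġreleased', 'Ġthe', 'Ġnew', 'Ġmodel']
print(tokenizer.convert_tokens_to_string(tokens))
# expected: ' Today we released the new model'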
print(encoded_texts['offset_mapping'])
=====================================
[(0, 0), (0, 6), (6, 9), (9, 18), (18, 22), (22, 26), (26, 32), 
(32, 33), (33, 38), (38, 46), (46, 52), (52, 57), (57, 61), 
(61, 62), (62, 69), (69, 81), (81, 82), (0, 0), (0, 3), (0, 0)]
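
Each pair in offset_mapping is a character span into the sentence that token came from, with (0, 0) as a placeholder for special tokens. A minimal sketch of slicing the original string back out, reusing t1 and encoded_texts from above:

# Offsets index into the raw sentence, so slicing recovers the exact
# characters (including leading spaces) behind each token.
for start, end in encoded_texts['offset_mapping'][1:5]:
    print(repr(t1[start:end]))
# expected: ' Today', ' we', ' released', ' the'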

Input formats of various BERT-family models (a quick way to verify them follows the list):

bert: [CLS] + tokens + [SEP] + padding

roberta: [CLS] + prefix_space + tokens + [SEP] + padding

distilbert: [CLS] + tokens + [SEP] + padding

xlm: [CLS] + tokens + [SEP] + padding

xlnet: padding + tokens + [SEP] + [CLS]
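
These layouts need not be memorized: every tokenizer places its own special tokens, so you can print them directly. A minimal sketch, assuming the public checkpoints bert-base-uncased, roberta-base and xlnet-base-cased can be downloaded:

from transformers import AutoTokenizer

# Print how each model wraps a short sentence with special tokens
# (XLNet's tokenizer pads on the left by default, hence its layout above).
for name in ('bert-base-uncased', 'roberta-base', 'xlnet-base-cased'):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok('Hello world', padding='max_length', max_length=8)['input_ids']
    print(name, tok.convert_ids_to_tokens(ids))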

In addition, the sequence_ids() method distinguishes tokens of the first sentence from the second. For example,

sequence_ids = encoded_texts.sequence_ids()
print(sequence_ids)

# Output:
[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, None]

Here None marks the special tokens, while 0 and 1 indicate tokens from the first and second sentence respectively.
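
A typical use of sequence_ids() is locating the token span of the second segment, for instance to restrict answer extraction to the context in question answering. A minimal sketch on the encoding from above:

# Token indices of the first and last token belonging to sentence 2.
seq_ids = encoded_texts.sequence_ids()
ctx_start = seq_ids.index(1)
ctx_end = len(seq_ids) - 1 - seq_ids[::-1].index(1)
print(ctx_start, ctx_end)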

