Sentence-Transformer
For detailed usage, see the official documentation: https://www.sbert.net/index.html
Most of this note is taken from the official examples; it just records a few familiar usages for reference.
Introduction:
Sentence-Transformer is a Python framework for sentence and text embeddings.
The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
The framework can compute sentence or text embeddings for more than 100 languages, which can then be used for downstream tasks such as semantic textual similarity, semantic search, and paraphrase mining.
It is built on PyTorch and Transformers, ships with a large collection of pre-trained models, and is easy to fine-tune on your own data; it can be installed with pip install sentence-transformers.
Below, a few of the officially provided tasks are used as examples to briefly show how to use sentence embeddings.
1. Semantic Textual Similarity
Compute the similarity of two pieces of text; in this example, the cosine similarity is computed between each pair of corresponding sentences in the two lists:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-distilroberta-base-v1',device='cuda')
# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']
#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
print(embeddings1)
#Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)
print(cosine_scores)
#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
>>
tensor([[ 0.0111, 0.1261, 0.2388, ..., -0.0673, 0.1713, 0.0163],
[-0.1193, 0.1150, -0.1560, ..., -0.2504, -0.0789, -0.1212],
[ 0.0114, 0.1248, -0.0231, ..., -0.2252, 0.3014, 0.1654]])
tensor([[ 0.4579, 0.1059, 0.1447],
[ 0.1239, 0.1759, -0.0344],
[ 0.1696, 0.1313, 0.9283]])
The cat sits outside The dog plays in the garden Score: 0.4579
A man is playing guitar A woman watches TV Score: 0.1759
The new movie is awesome The new movie is so great Score: 0.9283
>>
The next example takes a query sentence and retrieves the most similar sentences from a piece of text. This requires computing the sentence embedding of every sentence in the text and then the cosine similarity between the vectors; the code is fairly simple:
from sentence_transformers import SentenceTransformer,util
model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens',device='cuda')
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'The new opera is nice']
sentence_embeddings = model.encode(sentences,convert_to_tensor=True)
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
query = 'The new movie is so great' # The query sentence used to search for semantically similar sentences
queries = [query]
query_embeddings = model.encode(queries,convert_to_tensor=True)
print("Semantic Search Results")
number_top_matches = 2
for query, query_embedding in zip(queries, query_embeddings):
    cosine_scores = util.pytorch_cos_sim(query_embedding, sentence_embeddings)[0]
    results = zip(range(len(cosine_scores)), cosine_scores)
    results = sorted(results, key=lambda x: x[1], reverse=True)
    for i, j in results:
        print(i, j)
    print("Query:", query)
    print("\nTop {} most similar sentences in corpus:".format(number_top_matches))
    for idx, distance in results[0:number_top_matches]:
        print(sentences[idx].strip(), "(Cosine Score: %.4f)" % distance)
>> Output
Sentence: The cat sits outside
Embedding: tensor([-0.6349, 0.3843, -0.4646, ..., -0.3325, -0.7107, -0.0827])
Sentence: A man is playing guitar
Embedding: tensor([-0.1785, 0.6163, -0.1034, ..., 1.2210, -1.2130, -0.4310])
Sentence: The new movie is awesome
Embedding: tensor([ 0.8274, 0.5452, -0.1739, ..., 0.7432, -2.1740, 1.8347])
Sentence: The new opera is nice
Embedding: tensor([ 1.4234, 0.9776, -0.4403, ..., 0.5330, -0.8313, 1.5077])
Semantic Search Results
2 tensor(0.9788)
3 tensor(0.6040)
1 tensor(0.0189)
0 tensor(-0.0109)
Query: The new movie is so great
Top 2 most similar sentences in corpus:
The new movie is awesome (Cosine Score: 0.9788)
The new opera is nice (Cosine Score: 0.6040)
>>
util.pytorch_cos_sim()
- The two examples above are similar; both use util.pytorch_cos_sim() to compute cosine similarity. Its arguments can be two 2-D tensors, or one of them can be a single (1-D) tensor, each holding the sentence embeddings of one text. Every embedding of the first argument is compared with every embedding of the second, and the result is a 2-D tensor of the pairwise similarity scores.
- For the Semantic Textual Similarity task there are many pre-trained models that work well; roberta-large-nli-stsb-mean-tokens and paraphrase-distilroberta-base-v1 used in the examples are just two of them.
- The above is a simplified way of finding similar sentences in a list of text sentences; for larger sentence collections the library provides a more efficient function, paraphrase_mining.
- There is also util.semantic_search() for finding the most similar sentences directly, with GPU acceleration and a top-k parameter, as well as approximate methods for speeding up search over larger corpora: see the API examples and the short sketch below.
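A rough sketch of those two helpers, reusing the model and sentences from the examples above (the top_k values are only illustrative):
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-distilroberta-base-v1', device='cuda')
corpus = ['The cat sits outside',
          'A man is playing guitar',
          'The new movie is awesome',
          'The new opera is nice']
queries = ['The new movie is so great']
# util.semantic_search: encode corpus and queries, then retrieve the top_k hits per query
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for hit in hits[0]:  # hits[0] holds the results for the first query
    print(corpus[hit['corpus_id']], "(Score: %.4f)" % hit['score'])
# util.paraphrase_mining: mines the most similar sentence pairs inside one sentence list,
# avoiding the full pairwise similarity matrix for large collections
pairs = util.paraphrase_mining(model, corpus)
for score, i, j in pairs[:3]:
    print(corpus[i], "<->", corpus[j], "(Score: %.4f)" % score)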
2. Clustering
Cluster a few sentences with a simple k-means:
"""
This is a simple application for sentence embeddings: clustering
Sentences are mapped to sentence embeddings and then k-mean clustering is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
embedder = SentenceTransformer('paraphrase-distilroberta-base-v1',device='cuda')
# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]
corpus_embeddings = embedder.encode(corpus)
# Perform k-means clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
# print(cluster_assignment) -> [1 1 1 0 0 3 3 4 4 2 2]: the cluster id assigned to each sentence
clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])
for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")
>>
Cluster 1
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']
Cluster 2
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']
Cluster 3
['The girl is carrying a baby.', 'The baby is carried by the woman']
Cluster 4
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']
Cluster 5
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']
>>
There are also two other clustering approaches, Agglomerative Clustering and Fast Clustering; see the official docs for the details: cluster. A minimal sketch of both follows below.
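A minimal sketch of how those two variants can be called, reusing the corpus and embedder from the k-means example above. The thresholds are illustrative, and util.community_detection (the Fast Clustering helper) is assumed to be available in the installed library version:
from sklearn.cluster import AgglomerativeClustering
from sentence_transformers import util
import numpy as np
# Agglomerative Clustering: merge clusters bottom-up until the distance threshold is reached,
# so the number of clusters does not need to be fixed in advance
corpus_embeddings = embedder.encode(corpus)
norm_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5)
clustering_model.fit(norm_embeddings)
print(clustering_model.labels_)
# Fast Clustering (community detection): intended for large corpora; returns groups of
# sentence indices whose pairwise cosine similarity exceeds the threshold
tensor_embeddings = embedder.encode(corpus, convert_to_tensor=True)
clusters = util.community_detection(tensor_embeddings, threshold=0.75, min_community_size=2)
for i, cluster in enumerate(clusters):
    print("Cluster", i + 1, [corpus[idx] for idx in cluster])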
3. Train your own embedding
Use sentence-transformers to fine-tune your own sentence / text embeddings. The most basic network structure for training embeddings:
from sentence_transformers import SentenceTransformer, models
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
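As a quick sanity check (the two sentences below are made up for illustration), the assembled model maps inputs of different lengths to vectors of the same size:
embeddings = model.encode(['A short sentence.',
                           'A much longer sentence that still maps to a vector of the same size.'])
print(embeddings.shape)  # (2, 768): one fixed-length vector per sentence for bert-base-uncased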
- A sentence is fed into a network layer such as BERT, which outputs an embedding for every token. The output then passes through a pooling layer; the simplest choice is mean pooling, which takes the mean of all token embeddings, so the result is a fixed-length sentence embedding (768-dim) that is independent of the input sentence length. The models called directly in the earlier examples are packaged pre-trained models that already consist of a BERT layer plus a pooling layer.
- To train on the Semantic Textual Similarity task, the model structure is as follows:
The input is a pair of sentences and the label is their similarity score. Each sentence is converted into an embedding (u & v), the cosine similarity of the two vectors is computed and compared against the gold similarity score supplied as input, the loss is computed, and the model parameters are fine-tuned in the next step. Call model.fit() to tune the assembled model; for the parameters of .fit() and other model details, see: training overview
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
- Use the built-in evaluators to measure the model's performance during training, and pass one to .fit():
from sentence_transformers import evaluation
sentences1 = ['This list contains the first column', 'With your sentences', 'You want your model to evaluate on']
sentences2 = ['Sentences contains the other column', 'The evaluator matches sentences1[i] with sentences2[i]', 'Compute the cosine similarity and compares it to scores[i]']
scores = [0.3, 0.6, 0.2]
evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
# ... Your other code to load training data
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100, evaluator=evaluator, evaluation_steps=500)
Example:
The data looks like this:
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1205'), ('score', '3.20'), ('sentence1', 'Israel Forces Kill 2 Palestinian Militants'), ('sentence2', 'Israeli army kills Palestinian militant in raid')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1207'), ('score', '1.20'), ('sentence1', "Death toll 'rises to 17' after typhoon strikes Japan"), ('sentence2', 'Death Toll Rises to 84 in Pakistan Floods')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1223'), ('score', '5.00'), ('sentence1', 'Protests continue in tense Ukraine capital'), ('sentence2', "Protests Continue In Ukraine's Capital")])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1229'), ('score', '4.60'), ('sentence1', 'Two French journalists killed after Mali kidnapping'), ('sentence2', 'Two French journalists abducted, killed in Northern Mali')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1231'), ('score', '2.00'), ('sentence1', 'Headlines in major Iranian newspapers on Dec 14'), ('sentence2', 'Headlines in major Iranian newspapers on July 29')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1233'), ('score', '1.80'), ('sentence1', 'Iran warns of spillover of possible war on Syria'), ('sentence2', 'Iranian Delegation Heads to Lebanon, Syria')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1242'), ('score', '3.00'), ('sentence1', 'Former Pakistan President Pervez Musharraf arrested again'), ('sentence2', 'Former Pakistan military ruler Pervez Musharraf granted bail')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1263'), ('score', '3.80'), ('sentence1', "US drone strike 'kills 4 militants in Pakistan'"), ('sentence2', 'US drone strike kills 10 in Pakistan')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1271'), ('score', '4.00'), ('sentence1', "UK's Ex-Premier Margaret Thatcher Dies At 87"), ('sentence2', 'Former British PM Margaret Thatcher dies')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1303'), ('score', '4.80'), ('sentence1', "Polar bear DNA 'may help fight obesity'"), ('sentence2', 'Polar bear study may boost fight against obesity')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1311'), ('score', '4.80'), ('sentence1', "'Fast & Furious' star Paul Walker dies in car crash"), ('sentence2', 'Paul Walker dead: Fast and Furious star, 40, killed in car crash')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1316'), ('score', '2.80'), ('sentence1', "Air strike kills one man in Syria's Hama"), ('sentence2', 'US drone strike kills eleven in Pakistan')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1321'), ('score', '2.60'), ('sentence1', 'Turkish PM, president clash over reply to protests'), ('sentence2', 'Turkish president calls for calm amid nationwide protests')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1351'), ('score', '4.40'), ('sentence1', 'Strong new quake hits shattered Pak region'), ('sentence2', '6.8 quake in shattered Pakistan region')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1383'), ('score', '4.60'), ('sentence1', 'Floods in central Europe continue to create havoc'), ('sentence2', 'Europe floods continue to create havoc')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1403'), ('score', '2.20'), ('sentence1', 'Luxembourg PM quits amid spying scandal'), ('sentence2', 'Luxembourg votes after spying row')])
OrderedDict([('split', 'test'), ('genre', 'main-news'), ('dataset', 'headlines'), ('year', '2015'), ('sid', '1425'), ('score', '0.40'), ('sentence1', '3 dead, 4 missing in central China construction accident'), ('sentence2', 'One dead, 8 missing in Vietnam boat accident')])
- First read in the data and split it into training, test, and dev sets. When fine-tuning, Sentence-Transformer requires the data to be stored in a list of InputExample() objects defined by the library's author. InputExample() takes two arguments, texts and label: texts is itself a list holding a sentence pair, and label must be a float indicating how similar the pair is.
import csv
import gzip
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

sts_dataset_path = 'datasets/stsbenchmark.tsv.gz'  # assumed local path to the STS benchmark file
train_batch_size = 16

train_samples = []
dev_samples = []
test_samples = []
with gzip.open(sts_dataset_path, 'rt', encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        score = float(row['score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
        if row['split'] == 'dev':
            dev_samples.append(inp_example)
        elif row['split'] == 'test':
            test_samples.append(inp_example)
        else:
            train_samples.append(inp_example)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
- Build the network structure:
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model_name = 'distilbert-base-uncased'
word_embedding_model = models.Transformer(model_name)
# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model], device='cuda')
train_loss = losses.CosineSimilarityLoss(model=model)
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
- Start training the model:
import math
num_epochs = 4
# 10% of the training data is used for warm-up, as in the official STS example
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)
model_save_path = 'output/training_stsbenchmark'  # assumed output directory
# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)
- Evaluate on the test set:
# Load the stored model and evaluate its performance on STS benchmark dataset
model = SentenceTransformer(model_save_path)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path=model_save_path)
The official docs provide the full training and evaluation script showing how to create a SentenceTransformer model from a pre-trained HuggingFace transformer plus a pooling layer: code
References and recommended reading:
Sentence-Transformer Semantic Textual Similarity
Sentence-Transformer usage and fine-tuning tutorial
Sentence Bert
Sentence-BERT: a Siamese network for fast sentence-similarity computation