NLP Utilities
2022-01-01
WritingHere
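A backup of the NLP scripts I use day to day.
Machine Translation
- Fast machine translation built on the Huggingface transformers interface and the pretrained models published by Helsinki-NLP;
- To make batch processing easier, the server exposes an API built with Flask, and the client sends requests with requests.
The server code, api.py, is as follows: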
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

import torch
from flask import Flask, request
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

app = Flask(__name__)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh").to(device)


@app.route("/", methods=['POST'])
def index():
    # Expects a JSON body of the form {"text": ["sentence 1", "sentence 2", ...]}
    text = request.get_json()['text']
    # Tokenize the batch: pad to the longest sentence, truncate to 512 tokens
    batch = tokenizer(text, return_tensors='pt', padding=True,
                      truncation=True, max_length=512).to(device)
    translation = model.generate(**batch)
    result = tokenizer.batch_decode(translation, skip_special_tokens=True)
    # ensure_ascii=False keeps the Chinese output readable in the response
    return json.dumps({'result': result}, ensure_ascii=False)


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9100)
The client code, main.py, is as follows:
#!/usr/bin/env python
import json

import requests as rq

text = ['Oh, god, this is great! The plane is gone, so it looks like I\'m stuck here with you guys.', 'I love you.']
headers = {'Content-Type': 'application/json', 'Accept': 'application/json'}
data = {'text': text}
# POST the sentences to the translation server started by api.py
a = rq.post('http://127.0.0.1:9100', data=json.dumps(data), headers=headers)
print(a.text)
- Run the server:
python api.py
When the console shows a line starting with "* Running on" that ends in port 9100 (the port set in api.py), the server has started successfully.
- Run the client:
python main.py
The output is: {"result": ["飞机没了 看来我跟你们困在这里了", "我爱你"]}
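Since the point of the API is batch processing, a long list of sentences is probably better sent in several smaller requests than in one huge one. The following is only a rough sketch of that idea, not part of the original scripts: the helper names `chunked`, `post_json` and `translate_all`, the `batch_size` default of 16, and the injectable `send` parameter are all assumptions made for illustration. It uses the standard-library `urllib` instead of requests so that it has no third-party dependency.

```python
import json
from urllib import request as urlrequest


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def post_json(url, payload):
    """POST `payload` as JSON and return the decoded JSON response."""
    body = json.dumps(payload).encode('utf-8')
    req = urlrequest.Request(url, data=body,
                             headers={'Content-Type': 'application/json'})
    with urlrequest.urlopen(req) as resp:
        return json.loads(resp.read().decode('utf-8'))


def translate_all(sentences, url='http://127.0.0.1:9100', batch_size=16,
                  send=post_json):
    """Translate `sentences` batch by batch, preserving the input order.

    `send` is a hypothetical hook: any callable taking (url, payload) and
    returning a dict with a 'result' list, as api.py does.
    """
    results = []
    for batch in chunked(sentences, batch_size):
        results.extend(send(url, {'text': batch})['result'])
    return results
```

With the server from api.py running, `translate_all(['I love you.', 'Good morning.'])` would send the sentences to it; passing a different `send` callable makes the batching logic easy to try without a server.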
TopK Algorithm
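- Sometimes we need the top k elements of a sequence. Sorting the whole sequence costs O(n log n), which is more work than necessary. Instead, we maintain a min-heap of size K and traverse the array from front to back: each element is pushed while the heap has fewer than K entries, and afterwards it replaces the heap's minimum only if it is larger. This yields the top K in O(n log K). The Python implementation is shown below:
- TopK code: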
import heapq


class PriorityQueueTopK:
    def __init__(self, k=10):
        """Keep the k highest-priority items using a min-heap.

        Args:
            k (int, optional): Max number of the queue. Defaults to 10.
        """
        self._queue = []
        self._index = 0
        self.k = k

    def push(self, item, priority=None):
        # heapq.heappush takes two arguments: the list backing the heap and
        # the element to store, here a (priority, index, item) tuple.
        if priority is None:
            priority = item
        if len(self._queue) < self.k:
            heapq.heappush(self._queue, (priority, self._index, item))
            self._index += 1
        elif priority > self._queue[0][0]:
            # Heap is full: evict the current minimum in favor of the larger item
            heapq.heapreplace(self._queue, (priority, self._index, item))
            self._index += 1

    def pop(self):
        # Remove and return the lowest-priority item currently kept
        return heapq.heappop(self._queue)[-1]

    def topk(self):
        # Items come back in heap order, not in sorted order
        return [w[-1] for w in self._queue]
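The class stores `(priority, index, item)` tuples rather than `(priority, item)`. One likely reason, shown in the small sketch below, is that the insertion index breaks ties: when two priorities are equal, tuple comparison moves on to the next field, and without the index it would try to compare the items themselves, which fails for types such as dicts. The helper `push_all` and the sample dicts are made up for this illustration.

```python
import heapq


def push_all(entries):
    """Push every entry onto a fresh heap and return the heap list."""
    heap = []
    for entry in entries:
        heapq.heappush(heap, entry)
    return heap


a, b = {'word': 'nlp'}, {'word': 'heap'}

# Without a tie-breaker, equal priorities force Python to compare the dicts,
# which raises TypeError because dicts do not support "<".
try:
    push_all([(3, a), (3, b)])
    tie_break_needed = False
except TypeError:
    tie_break_needed = True

# With an insertion index in the middle, the comparison stops at the index
# and the items are never compared.
heap = push_all([(3, 0, a), (3, 1, b)])
first = heapq.heappop(heap)[-1]  # the item that was inserted first
```

A side effect of this design is that, among equal priorities, the earlier insertion is treated as the smaller entry.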
- TopK test code:
import random

k = 5
items = [random.randint(1, 10) for i in range(10)]
print(items)
pq = PriorityQueueTopK(k)
for i in range(len(items)):
    pq.push(items[i])
res = pq.topk()
print(res)
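Because the result above comes back in heap order, it can be hard to eyeball whether it is right. As a sanity check, here is a compact function form of the same size-K min-heap idea, compared against the standard library's `heapq.nlargest`. This is a sketch for verification only; the name `topk_heap` and the sample list are assumptions, not part of the original script.

```python
import heapq


def topk_heap(items, k):
    """Return the k largest items, largest first, using a min-heap of size k."""
    if k <= 0:
        return []
    heap = []
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest element kept so far, so swap it in
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)


items = [5, 1, 9, 3, 7, 9, 2]
print(topk_heap(items, 3))          # [9, 9, 7]
print(heapq.nlargest(3, items))     # [9, 9, 7]
```

For one-off queries `heapq.nlargest` is usually enough; a class like `PriorityQueueTopK` is more convenient when elements arrive one at a time and the running top K is needed along the way.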