Computing Sentence Similarity with Word2vec

2017-07-05 zqh_zy


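I recently had a task that required scoring the similarity of sentences and documents, so I studied word vectors and ran some simple tests. While testing, I found few articles online that cover sentence similarity with word vectors in a way that is both complete and simple, so I decided to write one. This post skips the theory of word vectors and records only the main implementation steps.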

Step 1: Prepare the corpus

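This post uses the WebQA dataset released by Baidu, which contains many questions from Baidu Zhidao together with their answers and supporting evidence (evidences). The raw dataset is in JSON format and carries other information as well; only the questions and evidences are used here as the corpus for training word vectors. The processed corpus can be downloaded from the source code link at the end of the article.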

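After the data is extracted and before training, the text needs to be preprocessed:

Read the JSON dataset, extract each question and its evidences, segment them into words, and save them to a file one sentence per line, ready for training.
Remove stop words. Because context information must be preserved, the stop-word list consists mostly of punctuation and other special characters.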

# -*- coding: utf-8 -*-
# Python 3
import json
import os
import shutil

import jieba

data_set = "./dataset/me_train.json"
target = "./dataset/train_questions_with_evidence.txt"
stopwords_dict = "./dataset/stop_words_ch.txt"


def rm_stopwords(file_path, word_dict):
    """
        Remove stop words from {file_path}; the stop words are listed in {word_dict}.
        file_path: file of segmented text, one sentence per line,
                    words separated by whitespace.
        word_dict: file containing the stop words, one stop word per line.
        output: {file_path} is overwritten with the stop words removed.
    """

    # load the stop words into a set for fast membership tests
    with open(word_dict, encoding="utf-8") as d:
        stop_words = {line.strip() for line in d}

    # remove a stale tmp file if one exists
    tmp_path = file_path + ".tmp"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)

    print("now remove stop words in %s." % file_path)
    # copy the source file line by line, dropping stop words
    with open(file_path, encoding="utf-8") as f1, \
            open(tmp_path, "w", encoding="utf-8") as f2:
        for line in f1:
            kept = [word for word in line.split() if word not in stop_words]
            f2.write(" ".join(kept) + "\n")

    # overwrite the original file with the filtered one
    shutil.move(tmp_path, file_path)
    print("stop words in %s have been removed." % file_path)


with open(data_set, encoding="utf-8") as f, \
        open(target, "w", encoding="utf-8") as f2:
    data = json.load(f)
    count = 0  # lines written: one per question plus one per evidence
    for value in data.values():
        # write the segmented question as one line
        words = jieba.cut(value["question"], cut_all=False)
        f2.write(" ".join(words) + "\n")
        count += 1
        # write each segmented evidence of this question as its own line
        for evidence in value["evidences"].values():
            words2 = jieba.cut(evidence["evidence"], cut_all=False)
            f2.write(" ".join(words2) + "\n")
            count += 1
    print("wrote %s lines (questions and evidences)" % count)

rm_stopwords(target, stopwords_dict)

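The script first reads the corpus file dataset/me_train.json, extracts the target text, segments it, and saves it to dataset/train_questions_with_evidence.txt. Finally it removes stop words from that file.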
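To make the input format concrete, here is a minimal sketch of the same extraction loop on a toy record. The record layout (a top-level dict keyed by question id, each entry holding a `question` string and an `evidences` dict whose values carry an `evidence` string) is inferred from the field accesses in the script above. The ids, the English sample text, the stop-word set, and the `tokenize` and `corpus_lines` helpers are hypothetical, and a whitespace split stands in for `jieba.cut` so the snippet runs without extra dependencies.

```python
# Toy stand-in for the WebQA JSON; ids and texts are made up for illustration.
toy_data = {
    "Q1": {
        "question": "who wrote the Shiji ?",
        "evidences": {
            "Q1#0": {"evidence": "the Shiji was written by Sima Qian ."},
            "Q1#1": {"evidence": "Sima Qian , author of the Shiji"},
        },
    },
}

# Assumed stop-word list: punctuation only, in the spirit of the article.
stop_words = {",", ".", "?", "!"}


def tokenize(text):
    """Hypothetical stand-in for jieba.cut: split on whitespace."""
    return text.split()


def corpus_lines(data, stop_words):
    """Yield one space-joined line per question and per evidence."""
    for record in data.values():
        texts = [record["question"]]
        texts += [ev["evidence"] for ev in record["evidences"].values()]
        for text in texts:
            kept = [w for w in tokenize(text) if w not in stop_words]
            yield " ".join(kept)


lines = list(corpus_lines(toy_data, stop_words))
for line in lines:
    print(line)
```

Each question and each evidence becomes its own line, which is the one-sentence-per-line layout the training step expects.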

Step 2: Train the word vectors

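Once the principle behind word-vector training is understood, the implementation is fairly simple thanks to open-source tools. This part uses gensim to train word vectors on the text preprocessed above: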

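# -*- coding: utf-8 -*-
import logging
import os

from gensim.models.word2vec import LineSentence, Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# each line of the corpus file is one sentence of space-separated words
sentences = LineSentence("./dataset/train_questions_with_evidence.txt")
# min_count=1 keeps every word; iter=1000 makes 1000 passes over the corpus
# (gensim 4.x renamed iter to epochs)
model = Word2Vec(sentences, min_count=1, iter=1000)

# make sure the output directory exists before saving
if not os.path.exists("./model"):
    os.makedirs("./model")
model.save("./model/w2v.mod")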

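The remaining training parameters are left at their defaults. A few representative ones:

Vector dimensionality: size=100 (named vector_size in gensim 4.x)
Context window: window=5
Number of negative samples: negative=5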
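For reference, the settings discussed here can be spelled out as explicit keyword arguments. The sketch below only builds the argument dict; `w2v_kwargs` is a hypothetical helper, not part of gensim. It reflects the renaming in gensim 4.0, where `size` became `vector_size` and `iter` became `epochs`.

```python
def w2v_kwargs(gensim_major, dim=100, n_iter=1000):
    """Hypothetical helper: Word2Vec keyword arguments matching the article's settings."""
    kwargs = {"window": 5, "negative": 5, "min_count": 1}
    if gensim_major >= 4:
        # gensim 4.x parameter names
        kwargs["vector_size"] = dim
        kwargs["epochs"] = n_iter
    else:
        # gensim 3.x and earlier parameter names
        kwargs["size"] = dim
        kwargs["iter"] = n_iter
    return kwargs


old_style = w2v_kwargs(3)
new_style = w2v_kwargs(4)
print(old_style)
print(new_style)
```

With gensim installed, the major version can be read from `gensim.__version__`, and the dict would be passed as `Word2Vec(sentences, **w2v_kwargs(major))`.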

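The code first uses LineSentence to read the corpus line by line as word sequences, which serve as the training data. It then runs 1000 iterations over the corpus with min_count set to 1, so every word takes part in training. Finally, the trained model is saved. The 1000 iterations took roughly 10 hours, so I ran them overnight.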
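As a rough picture of what reading the corpus "line by line as word sequences" means, here is a simplified stand-in written in plain Python. `SimpleLineSentence` is a hypothetical class for illustration only; gensim's real `LineSentence` has more options and should be used for actual training. The two sample lines are taken from the question set shown later in this post.

```python
import os
import tempfile


class SimpleLineSentence:
    """Simplified stand-in for LineSentence: one sentence per line, split on whitespace."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be read once per training iteration.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()


# Write a tiny two-line corpus to a temporary file for demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("百度 的 董事长 是 谁\n阿里巴巴 的 总裁 是 谁\n")
    corpus_path = tmp.name

sentences = SimpleLineSentence(corpus_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a restartable iterable yields the same data again
os.remove(corpus_path)
print(first_pass)
```

The point of making it a class with `__iter__`, rather than a one-shot generator, is that training makes many passes over the data and each pass needs to start from the first line again.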

Sentence similarity baseline

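The goal is to compute sentence similarity. With word vectors in hand, the simplest approach is to sum the vectors of the words in a sentence and take the average, which turns a sentence of any length into a fixed-dimensional vector that can be compared for similarity. This ignores word order, but it works as a baseline for a quick check of both the similarity computation and the quality of the trained vectors.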
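The averaging baseline can be sketched in a few lines of plain Python. The 3-dimensional vectors below are made up for illustration, and `mean_vector`, `cosine`, and `sentence_similarity` are hypothetical helper names; real word2vec vectors would come from the trained model. Cosine similarity is assumed as the comparison measure.

```python
import math


def mean_vector(words, vectors):
    """Average the vectors of the words that have one; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sentence_similarity(words_a, words_b, vectors):
    """Compare two sentences through the cosine of their averaged word vectors."""
    return cosine(mean_vector(words_a, vectors), mean_vector(words_b, vectors))


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "baidu": [1.0, 0.0, 0.0],
    "alibaba": [0.9, 0.1, 0.0],
    "president": [0.0, 1.0, 0.0],
    "chairman": [0.0, 0.9, 0.1],
    "saw": [0.0, 0.0, 1.0],
}

close = sentence_similarity(["baidu", "president"], ["alibaba", "chairman"], toy_vectors)
far = sentence_similarity(["baidu", "president"], ["saw"], toy_vectors)
print(close, far)
```

Sentences built from words with nearby vectors score close to 1, while a sentence whose words point in an unrelated direction scores near 0.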

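For a simple test, 500 questions are drawn at random from the full question set as target sentences. A related question is then typed at the terminal and matched against the 500 questions to find the ten most similar.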
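Drawing the 500 target questions is a sampling-without-replacement problem. A minimal sketch using only the standard library follows; `sample_lines` is a hypothetical helper and the generated `questions` list is a stand-in for the real question file. Indices are 0-based and matched with `enumerate`, so every sampled index corresponds to exactly one line.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return k distinct lines chosen uniformly at random, kept in their original order."""
    lines = list(lines)
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(lines)), k))  # 0-based line indices
    return [line for i, line in enumerate(lines) if i in chosen]


questions = ["question %d" % i for i in range(1000)]  # stand-in for the question file
picked = sample_lines(questions, 500, seed=0)
print(len(picked))
```

Keeping the chosen indices in a set makes the membership test constant-time, which matters little at this size but is a cheap habit for larger files.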


# -*- coding: utf-8 -*-
# Python 3
import jieba
import numpy as np
from gensim.models import Word2Vec

questions = "./dataset/train_questions.txt"
target = "./dataset/target.txt"
model = "./model/w2v.mod"

# randomly pick 500 of the 36190 questions (0-based line numbers)
rand_i = set(np.random.choice(36190, size=500, replace=False).tolist())
with open(questions, encoding="utf-8") as f, open(target, "w", encoding="utf-8") as f2:
    for count, line in enumerate(f):
        if count in rand_i:
            f2.write(line)


class ResultInfo(object):
    def __init__(self, index, score, text):
        self.id = index
        self.score = score
        self.text = text


model_loaded = Word2Vec.load(model)

# load the 500 sampled questions as candidate word sequences
candidates = []
with open(target, encoding="utf-8") as f:
    for line in f:
        candidates.append(line.strip().split())

while True:
    text = input("input sentence: ")
    # segment the query and keep only words the model has a vector for
    words = [w for w in jieba.cut(text.strip(), cut_all=False) if w in model_loaded.wv]
    print(len(words))
    if not words:
        continue
    res = []
    for index, candidate in enumerate(candidates):
        # cosine similarity between the mean vectors of the two word lists
        score = model_loaded.wv.n_similarity(words, candidate)
        res.append(ResultInfo(index, score, " ".join(candidate)))
    # sort by score, highest first, and print the top ten
    res.sort(key=lambda x: x.score, reverse=True)
    for i in res[:10]:
        print("text %s: %s, score : %s" % (i.id, i.text, i.score))

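The file dataset/train_questions.txt contains the full question set and can be produced with a small change to the preprocessing script above. From it, 500 questions are sampled at random and saved to dataset/target.txt. The trained model is then loaded, and n_similarity computes the similarity between two word sequences. Its arguments are the word sequence of the segmented terminal input and a candidate target word sequence.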
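The ranking step itself does not depend on how the score is computed. The sketch below shows one way to pick the top matches with `heapq.nlargest` instead of sorting the whole list. `top_k` is a hypothetical helper, and `overlap` (Jaccard overlap of the word sets) is a toy stand-in for `n_similarity` so the snippet runs without a trained model; the candidate sentences are taken from the question set shown in the results.

```python
import heapq


def top_k(query_words, candidates, similarity, k=10):
    """Return the k (score, index, words) triples with the highest scores."""
    scored = ((similarity(query_words, cand), i, cand) for i, cand in enumerate(candidates))
    return heapq.nlargest(k, scored, key=lambda item: item[0])


def overlap(a, b):
    """Toy similarity: Jaccard overlap of the two word sets (stand-in for n_similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


candidates = [
    ["百度", "的", "董事长", "是", "谁"],
    ["阿里巴巴", "的", "总裁", "是", "谁"],
    ["锯子", "是", "谁", "发明", "的"],
]
best = top_k(["谁", "是", "百度", "的", "总裁"], candidates, overlap, k=2)
for score, index, words in best:
    print("text %s: %s, score : %s" % (index, " ".join(words), score))
```

For 500 candidates a full sort is perfectly adequate; `nlargest` simply avoids ordering items that will never be shown.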

A quick test with two related questions gives genuinely pleasing results:

Using TensorFlow backend.

input sentence: 编写史记的人受到了什么处罚
text 414: 司马迁 收到 了 什么 刑罚, score : 0.835533210003
text 437: 孔子 认为 可以 使人 温柔敦厚 的 儒家 经书 是 哪一部, score : 0.685654355168
text 36: 复活 是 谁 的 作品, score : 0.668847026927
text 158: 植物 人 的 神经系统 可能 没有 受到 损伤 的 部位 是, score : 0.666936575118
text 314: 中庸 是 谁 的 著作, score : 0.651872698188
text 487: 毛主席 的 战士 最 听 党 的话 这 首歌 反映 了 什么 地方 边防战士 的 生活, score : 0.643818242781
text 198: 少年 韩寒 中学 肄业 却 出 了 一本 叫做 三重门 的 书 这 本书 的 体裁 是, score : 0.640927475925
text 175: 孔子 创立 了 什么 学派, score : 0.639292126835
text 366: 锯子 是 谁 发明 的, score : 0.630026412448
text 363: 古人 对 幼年 的 儿童 的 代称 是, score : 0.627604867741

input sentence: 谁是百度的总裁
text 409: 百度 的 董事长 是 谁, score : 0.950410820636
text 337: 阿里巴巴 的 总裁 是 谁, score : 0.902061296606
text 330: 中国移动 老总 是 谁 啊, score : 0.817496001508
text 323: 火影忍者 谁 是 名人 的 爸爸, score : 0.760750518611
text 386: china 老大 是 谁, score : 0.758674075597
text 234: 不能 说 的 秘密 导演 是 谁, score : 0.757126307252
text 318: 李连杰 的 老婆 是 谁 呀, score : 0.743579774903
text 317: 姐 的 儿子 是 我 什么 呢, score : 0.726849299939
text 325: 中国 最后 一个 皇帝 是 谁 拜托 了 各位 谢谢, score : 0.726334050126
text 477: 刘备 的 爸爸 是 谁, score : 0.724550888419

Summary

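This post trains word vectors on a corpus in a simple way and represents a sentence directly as the average of its word vectors in order to compute sentence similarity. The approach is crude, but it works well on text like the short, low-ambiguity sentences used here. For long documents, or when word order and sentence order matter, this simple method is a poor fit.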
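The word-order limitation is easy to see on a toy example: two sentences made of the same words in a different order get the same averaged vector. The 2-dimensional vectors and the `mean_vector` helper below are made up for illustration.

```python
import math


def mean_vector(words, vectors):
    """Average the vectors of the given words (all words are assumed to be known)."""
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


# Made-up 2-dimensional vectors, for illustration only.
toy_vectors = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}

a = mean_vector(["dog", "bites", "man"], toy_vectors)
b = mean_vector(["man", "bites", "dog"], toy_vectors)

# Compare component-wise with a tolerance, since float sums can depend on order.
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(same)
```

Because the two averaged vectors coincide, any similarity measure built on them treats the two sentences as identical, even though they mean different things.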

Source code link


Original article; please credit the source when reposting. For more, follow the link to my blog.
