Using the xlwings library to match file names against Excel sheet contents by maximum similarity and highlight the matches

2022-04-14  不懂球的2大业

1. Background

2. Implementation

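1. Read the full name of every hotel from the hotels.xlsx worksheet into a list, hotel_names, which serves as the text collection. Read the short name of every hotel from the file names in the 住宿行业承诺书 (accommodation-industry commitment letters) folder to build the keywords list;

2. Tokenize each entry of hotel_names with jieba, so that hotel_names becomes a list of token lists;

3. Build a dictionary from the tokenized hotel_names and take its number of distinct tokens as num_features;

4. Use the dictionary to convert each tokenized hotel name into a sparse bag-of-words vector; together these vectors form the corpus;

5. Use the same dictionary to convert each keyword in keywords into a sparse vector;

6. Create a TF-IDF model and train it on the corpus;

7. Apply the TF-IDF model to the corpus and build a similarity index, sparse_matrix, from the result;

8. For each keyword, apply the TF-IDF model to its vector and query the index for its similarity to every hotel name. Take the index of the highest similarity and highlight the corresponding cell in the sheet in yellow.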
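Steps 3 to 8 boil down to TF-IDF weighting followed by cosine similarity. The snippet below is a dependency-free sketch of that computation on already-tokenized names, useful for checking the idea without Excel or gensim. The sample tokens and the helper names tfidf_vector and cosine are made up for illustration, and the weighting used here (raw count times log2(N/df), then L2 normalization) is assumed to mirror gensim's TfidfModel defaults rather than taken from this post.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized full names (stand-ins for jieba output).
docs = [
    ["green", "lake", "hotel"],
    ["sunrise", "business", "hotel"],
    ["lake", "view", "inn"],
]

n_docs = len(docs)
# Document frequency: in how many names each token appears.
df = Counter(token for doc in docs for token in set(doc))


def tfidf_vector(tokens):
    """Map tokens to L2-normalized TF-IDF weights; unknown tokens are dropped."""
    counts = Counter(t for t in tokens if t in df)
    weights = {t: c * math.log2(n_docs / df[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


doc_vectors = [tfidf_vector(doc) for doc in docs]
query = tfidf_vector(["sunrise", "hotel"])  # hypothetical short name
similarities = [cosine(query, vec) for vec in doc_vectors]
best = max(range(n_docs), key=similarities.__getitem__)
print(best, similarities)
```

The rare token "sunrise" carries more weight than the common token "hotel", which is why the second name wins even though the first one also contains "hotel".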
'''
Description: Use the xlwings library to match file names against Excel sheet contents by maximum similarity and highlight the matches
version: V2.0
Author: HK
Date: 2022-04-11 22:25:44
LastEditors: HK
LastEditTime: 2022-04-13 23:36:30
'''

import os
import xlwings as xw
from jieba import lcut
from gensim.similarities import SparseMatrixSimilarity
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
import numpy as np

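# Full hotel names read from column C of the workbook.
hotel_names = []

hotel_file_path = "./hotels.xlsx"

# Folder of commitment letters; each file is named after a hotel's short name.
hotel_names_path = "./住宿行业承诺书/"
keywords = os.listdir(hotel_names_path)

# Strip the extension from each file name to get the short name (keyword).
keywords = [os.path.splitext(name)[0] for name in keywords]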

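app = xw.App(visible=False, add_book=False)
wb = None
try:
    wb = app.books.open(hotel_file_path)
    sht = wb.sheets["Sheet1"]

    # Hotel names are in column C, starting at row 3.
    nrows = sht.used_range.last_cell.row
    for i in range(3, nrows + 1):
        value = sht.range(f"C{i}").value
        # Keep empty cells as "" so list indices stay aligned with sheet rows.
        hotel_names.append(str(value) if value is not None else "")

    # Tokenize every full hotel name with jieba.
    hotel_names = [lcut(hotel_name) for hotel_name in hotel_names]

    # Build the dictionary; its size is the number of features.
    dictionary = Dictionary(hotel_names)
    num_features = len(dictionary.token2id)

    # Bag-of-words vectors for the corpus and for each keyword.
    corpus = [dictionary.doc2bow(hotel_name) for hotel_name in hotel_names]
    kw_vectors = [dictionary.doc2bow(lcut(keyword)) for keyword in keywords]

    # Train TF-IDF on the corpus and index the weighted corpus.
    tfidf = TfidfModel(corpus)
    tf_texts = tfidf[corpus]
    sparse_matrix = SparseMatrixSimilarity(tf_texts, num_features)

    # Highlight the most similar hotel name for each keyword in yellow.
    for kw_vector in kw_vectors:
        # A keyword with no token in the dictionary cannot match anything.
        if not kw_vector:
            continue
        tf_kw = tfidf[kw_vector]
        similarities = sparse_matrix.get_similarities(tf_kw)
        index = int(np.argmax(similarities))
        sht.range(f"C{index + 3}").color = (255, 255, 0)

    # Save only after every keyword has been processed without error.
    wb.save()
finally:
    if wb is not None:
        wb.close()
    app.quit()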
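
One thing to watch with np.argmax: it always returns some index, so a short name that barely overlaps with any full name still gets a row highlighted. A possible guard, sketched here under the assumption that a fixed cutoff is acceptable, is to accept the best score only when it reaches a minimum. The helper name pick_match and the default min_score of 0.3 are hypothetical and not part of the original script; a sensible cutoff would have to be tuned on real data.

```python
def pick_match(similarities, min_score=0.3):
    """Return the index of the highest score, or None if it is below min_score.

    Works with a plain list or a 1-D NumPy array of similarity scores.
    min_score is an illustrative cutoff, not a recommended value.
    """
    if len(similarities) == 0:
        return None
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    return best if similarities[best] >= min_score else None


print(pick_match([0.11, 0.73, 0.0]))   # 1
print(pick_match([0.05, 0.0, 0.02]))   # None
```

With such a helper, the highlighting loop would color a cell only when the result is not None, and the skipped keywords could be collected for manual review.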

3. Summary

