
The bag-of-words model and TF-IDF

2022-03-21  Cache_wood


TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to one document within a document collection or corpus.

The importance of a word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to how often it appears across the corpus.

In short: the more often a word appears in one article, and the less often it appears in all other documents, the better it represents that article. That is the intuition behind TF-IDF.

TF (Term Frequency) measures how frequently a term occurs in a document:
TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}
where n_{i,j} is the number of times term i appears in document j, and the denominator is the total number of terms in document j. Some words occur frequently in every article, which actually suggests they are not very informative. So beyond TF we want a factor that rewards a word that appears many times in one particular article but rarely or never in most others; this is the role of IDF.

IDF (Inverse Document Frequency) measures how common a term is across the corpus. The fewer documents contain term i, the larger its IDF, and the better the term discriminates between categories. The IDF of a term is the logarithm of the total number of documents divided by the number of documents containing that term:
IDF_i = \log\frac{|D|}{1+|\{j : t_i \in d_j\}|}
where |D| is the total number of documents and |\{j : t_i \in d_j\}| is the number of documents containing term t_i; the +1 in the denominator prevents division by zero when no document contains the term.

A high term frequency within a particular document, combined with a low document frequency for that term across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones:
TF-IDF = TF·IDF
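As a quick sanity check of the formulas, here is one weight computed by hand. The counts are chosen to match the toy corpus used later in this post: the word occurs 2 times in a six-word document and in 1 of the corpus's 4 documents.

```python
import math

# One TF-IDF weight, computed directly from the formulas above.
# Assumed counts: term occurs 2 times in a 6-word document,
# and appears in 1 of the corpus's 4 documents.
tf = 2 / 6                      # n_{i,j} / sum_k n_{k,j}
idf = math.log(4 / (1 + 1))     # log(|D| / (1 + |{j : t_i in d_j}|))
print(round(tf * idf, 5))       # -> 0.23105
```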

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions use get_feature_names_out() instead.
print(vectorizer.get_feature_names())

X.toarray()
```

```
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
```
## Chinese text

Chinese has no spaces between words, so the text must be segmented first (here with jieba) before CountVectorizer can tokenize it:

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer

corpus_zh = [
    '我喜欢吃苹果',
    '你喜欢吃橘子'
]

# Segment each sentence and rejoin the tokens with spaces.
corpus_zh_out = [' '.join(jieba.cut(s, cut_all=False)) for s in corpus_zh]
corpus_zh_out

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus_zh_out)
vectorizer.get_feature_names()

X.toarray()
```

```
array([[1, 0, 1],
       [1, 1, 0]], dtype=int64)
```

Only three features survive here: CountVectorizer's default token_pattern drops single-character tokens such as 我, 你 and 吃, leaving just the two-character words.
The same computation can also be done by hand. First tokenize the corpus:

```python
corpus = ['this is the first document',
          'this is the second second document',
          'and the third one',
          'is this the first document']
words_list = list()
for i in range(len(corpus)):
    words_list.append(corpus[i].split(' '))
print(words_list)
```

```
[['this', 'is', 'the', 'first', 'document'], ['this', 'is', 'the', 'second', 'second', 'document'], ['and', 'the', 'third', 'one'], ['is', 'this', 'the', 'first', 'document']]
```

Then count the term frequencies per document:

```python
from collections import Counter
count_list = list()
for i in range(len(words_list)):
    count = Counter(words_list[i])
    count_list.append(count)
print(count_list)
```

```
[Counter({'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1}), Counter({'second': 2, 'this': 1, 'is': 1, 'the': 1, 'document': 1}), Counter({'and': 1, 'the': 1, 'third': 1, 'one': 1}), Counter({'is': 1, 'this': 1, 'the': 1, 'first': 1, 'document': 1})]
```
Finally, implement the three formulas directly:

```python
import math

def tf(word, count):
    return count[word] / sum(count.values())

def idf(word, count_list):
    n_contain = sum([1 for count in count_list if word in count])
    return math.log(len(count_list) / (1 + n_contain))

def tf_idf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)
```
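The per-document statistics below were presumably printed by a driver loop that the post does not show. One self-contained reconstruction (it repeats the corpus and function definitions from above so it runs on its own; the descending sort and 5-digit rounding are assumptions inferred from the output) is:

```python
import math
from collections import Counter

# Repeat the corpus and tf/idf/tf_idf definitions from above
# so this block is runnable on its own.
corpus = ['this is the first document',
          'this is the second second document',
          'and the third one',
          'is this the first document']
count_list = [Counter(doc.split()) for doc in corpus]

def tf(word, count):
    return count[word] / sum(count.values())

def idf(word, count_list):
    n_contain = sum(1 for count in count_list if word in count)
    return math.log(len(count_list) / (1 + n_contain))

def tf_idf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

for i, count in enumerate(count_list, start=1):
    print('TF-IDF statistics for document {}'.format(i))
    scores = {word: tf_idf(word, count, count_list) for word in count}
    # Assumed presentation: sort descending by weight, round to 5 digits.
    for word, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
        print('    word: {}, TF-IDF: {}'.format(word, round(score, 5)))
```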
```
TF-IDF statistics for document 1
    word: first, TF-IDF: 0.05754
    word: this, TF-IDF: 0.0
    word: is, TF-IDF: 0.0
    word: document, TF-IDF: 0.0
    word: the, TF-IDF: -0.04463
TF-IDF statistics for document 2
    word: second, TF-IDF: 0.23105
    word: this, TF-IDF: 0.0
    word: is, TF-IDF: 0.0
    word: document, TF-IDF: 0.0
    word: the, TF-IDF: -0.03719
TF-IDF statistics for document 3
    word: and, TF-IDF: 0.17329
    word: third, TF-IDF: 0.17329
    word: one, TF-IDF: 0.17329
    word: the, TF-IDF: -0.05579
TF-IDF statistics for document 4
    word: first, TF-IDF: 0.05754
    word: is, TF-IDF: 0.0
    word: this, TF-IDF: 0.0
    word: document, TF-IDF: 0.0
    word: the, TF-IDF: -0.04463
```

Note that 'the' gets a negative score: it appears in all four documents, so with the +1 smoothing in the denominator its IDF is log(4/5) < 0.