How do you count how often each element occurs in a sequence?

2019-02-07 Diolog

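Practical examples:

Given a random sequence such as [12,5,6,4,6,5,5,7,...], find the three elements that occur most often and how many times each occurs.
Given the words of an English article, count word frequencies and find the 10 most frequent words and how many times each occurs.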

Getting the three most frequent elements of a sequence

First, generate the data we need: a random list.

from random import randint
data = [randint(0,20) for _ in range(30)]

Approach 1: iterate over the list

# Start every distinct element at a count of 0
result = dict.fromkeys(data, 0)
for x in data:
    result[x] += 1  # one more occurrence of x
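The loop above only builds the frequency table; it does not yet pick out the three most frequent elements. A minimal sketch of that last step, assuming a fixed list (`sample`, a hypothetical stand-in for the random `data`) so the result is reproducible, and using `heapq.nlargest` as one possible way to select the top entries:

```python
import heapq

# Hypothetical fixed sample (the article's data is random).
sample = [12, 5, 6, 4, 6, 5, 5, 12, 6, 5]

# Build the frequency table by hand, as in approach 1.
freq = dict.fromkeys(sample, 0)
for x in sample:
    freq[x] += 1

# Select the three (element, count) pairs with the largest counts.
top3 = heapq.nlargest(3, freq.items(), key=lambda kv: kv[1])
print(top3)  # [(5, 4), (6, 3), (12, 2)]
```

Sorting all the items with `sorted(..., reverse=True)[:3]` would give the same answer; `nlargest` simply avoids sorting the whole table when only a few entries are needed.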

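Solution:

Use a collections.Counter object.
Pass the sequence to the Counter constructor; the resulting Counter object is a dictionary mapping each element to its frequency.
The Counter.most_common(n) method returns a list of the n most frequent elements together with their counts.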

from collections import Counter
result = Counter(data).most_common(3)

Output (the data is random, so the result differs from run to run):
[(12, 4), (8, 3), (17, 3)]
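Because a Counter is a dictionary of frequencies, it can also be queried like one. The sketch below illustrates this on a fixed list (`sample` is a hypothetical stand-in for the random data, chosen so the counts are known in advance):

```python
from collections import Counter

# Hypothetical fixed sample (the article's data is random).
sample = [12, 5, 6, 4, 6, 5, 5, 12, 6, 5]

counts = Counter(sample)

print(counts[5])              # 4: the element 5 occurs four times
print(counts[99])             # 0: a missing element yields 0, not a KeyError
print(counts.most_common(2))  # [(5, 4), (6, 3)]
```

Looking up a missing element returns 0 without adding it to the Counter, which is what makes the manual `dict.fromkeys` initialisation from approach 1 unnecessary.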

Word frequency count

First, read the English article from a file. Here the file is saved on the Windows desktop under the name 'testCounter.txt'.

with open(r'C:\Users\Administrator\Desktop\testCounter.txt', encoding='UTF-8') as f:
    txt = f.read()

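Split the text into words with a regular expression, using runs of non-word characters (\W+, that is, anything other than letters, digits, and the underscore) as separators, to get a list of all the words.
Once you have the list, pass list_words to Counter to get the word frequencies, then call most_common(n) to get the most frequent words.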

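import re
# A raw string avoids the invalid-escape warning for \W; the filter drops the
# empty strings re.split yields when the text starts or ends with a separator.
list_words = [w for w in re.split(r'\W+', txt) if w]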

from collections import Counter
counter_words = Counter(list_words)

max_ten = counter_words.most_common(10)

Output:

[('the', 31), ('of', 18), ('and', 14), ('in', 13), ('are', 13), ('a', 10), ('Python', 9), ('to', 9), ('you', 8), ('that', 7)]
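Note that the count above is case-sensitive: 'Python' and 'python', or 'The' and 'the', are tallied as different words. The following is a small self-contained sketch of a case-insensitive variant. The inline `text` is a hypothetical stand-in for the file contents, and lowercasing plus `re.findall` are assumptions of this sketch, not part of the original approach:

```python
import re
from collections import Counter

# Hypothetical inline text standing in for the article read from the file.
text = "The cat and the dog. The dog and the bird and more."

# Assumption: lowercase first so "The" and "the" are counted together.
# findall(r"[a-z]+") keeps only runs of letters, so no empty strings appear.
words = re.findall(r"[a-z]+", text.lower())

word_counts = Counter(words)
print(word_counts.most_common(3))  # [('the', 4), ('and', 3), ('dog', 2)]
```

Whether to fold case depends on the goal; for proper nouns such as 'Python' it may be preferable to keep the original spelling.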
