Word Frequency Counting in Python 3

I. Counting the frequency of elements in a sequence
1. Example: counting how many times each element occurs in a sequence

from random import randint
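# First use the random module to generate a sequence with repeated elements
# (the name `list` shadows the built-in type; pick another name in real code)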
list = [randint(0,10) for _ in range(1,20)]
print(list)

[10,7,10,6,10,5,2,6,1,0,9,0,3,5,2,5,5,3,10]

Method ①: create an empty dict, then loop over the sequence and check whether each element is already a key

d={}
for i in list:
    if i not in d:
        d[i]=1
    else:
        d[i]+=1

d
{0:2,1:1,2:2,3:2,5:4,6:2,7:1,9:1,10:4}  # 0:2 means that 0 occurs 2 times
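
The if/else check in method ① can also be folded away with collections.defaultdict. A minimal sketch, assuming the same kind of integer sequence as above; the names count_elements and sample are illustrative and not from the original:

```python
from collections import defaultdict


def count_elements(items):
    """Count how often each element occurs, without an explicit membership test."""
    counts = defaultdict(int)  # a missing key starts at int() == 0
    for item in items:
        counts[item] += 1
    return dict(counts)  # convert back to a plain dict


# The sample sequence printed earlier in this section
sample = [10, 7, 10, 6, 10, 5, 2, 6, 1, 0, 9, 0, 3, 5, 2, 5, 5, 3, 10]
result = count_elements(sample)
print(result[10], result[5], result[7])  # 4 4 1
```

The loop body is the same single increment as in method ②, but the keys do not have to be created up front.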

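Method ②: create a dict whose keys are the sequence elements and whose values are all 0

# Build a dict with the elements of list as keys and every value set to 0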
c=dict.fromkeys(list,0)

c
{0:0,1:0,2:0,3:0,5:0,6:0,7:0,9:0,10:0}

for i in list:
    c[i]+=1 

c
{0:2,1:1,2:2,3:2,5:4,6:2,7:1,9:1,10:4} 

sorted(c.items(),key=lambda x:x[1],reverse=True)
[(10,4),(5,4),(6,2),(2,2),(0,2),(3,2),(7,1),(9,1),(1,1)]
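
Sorting by count alone leaves the order of tied elements dependent on the order of the dict, which can differ between Python versions. A small sketch of one way to make the ranking deterministic, assuming ties should be broken by the smaller key first (that rule is an assumption, not something the original specifies):

```python
# Counts taken from the result shown above
counts = {0: 2, 1: 1, 2: 2, 3: 2, 5: 4, 6: 2, 7: 1, 9: 1, 10: 4}

# Sort by count descending (negated), then by key ascending to break ties
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
# [(5, 4), (10, 4), (0, 2), (2, 2), (3, 2), (6, 2), (1, 1), (7, 1), (9, 1)]
```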

Method ③: use the Counter class from the collections module

from collections import Counter
c1=Counter(list)

c1
Counter({0:2,1:1,2:2,3:2,5:4,6:2,7:1,9:1,10:4})

c1.most_common(5)
[(10,4),(5,4),(6,2),(2,2),(0,2)]
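
Besides most_common, a Counter has a few behaviours that a plain dict lacks. A short sketch on a small made-up sequence (the data and variable names here are illustrative only):

```python
from collections import Counter

counts = Counter([10, 7, 10, 6])

print(counts[10])  # 2
print(counts[99])  # 0: a missing key returns 0 instead of raising KeyError

# update() adds to the existing counts rather than replacing them
counts.update([10, 3])
print(counts[10])  # 3

# Total number of elements counted so far
total = sum(counts.values())
print(total)  # 6
```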

II. Counting the frequency of words in a piece of text
1. First split the text into a list of words with a regular expression

import re
# A string that spans several lines needs triple quotes
s='''The Zen of Python,by Tim Peters
   Beautiful is better than ugly
   Simple is better than complex
   Sparse is better than dense'''
data=re.split(r"\W+",s)
data
['The',
'Zen',
'of',
...
'than',
'dense']
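
Two details of this splitting step are easy to miss: re.split leaves an empty string in the list when the text starts or ends with a non-word character, and words that differ only in case are counted separately. A hedged sketch of one alternative, using re.findall on lowercased text; the sample text and variable names are made up for illustration:

```python
import re

text = "Beautiful is better than ugly.\nSimple is better than complex."

# Splitting on non-word characters leaves a trailing '' because of the final '.'
split_words = re.split(r"\W+", text)
print(split_words[-1] == "")  # True

# Matching the words directly avoids empty strings; lower() merges case variants
found_words = re.findall(r"\w+", text.lower())
print(found_words)
```

Whether to lowercase depends on the goal: it merges "Simple" with "simple", but it also changes proper nouns such as "Python".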

2. Instantiate a Counter object

c2=Counter(data)
c2
Counter({'Beautiful':1,
         'complex':1,
         ...
         'better':3,
         'ugly':1})
c2.most_common(3)
[('better',3),('than',3),('is',3)]
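
The two steps can be wrapped into one small helper. This is only a sketch: the function name top_words and its parameter n are hypothetical, and it lowercases the text before counting, which the original does not do:

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most frequent words in text as (word, count) pairs."""
    words = re.findall(r"\w+", text.lower())  # tokenize and normalize case
    return Counter(words).most_common(n)


zen = '''The Zen of Python,by Tim Peters
   Beautiful is better than ugly
   Simple is better than complex
   Sparse is better than dense'''

top = top_words(zen, 3)
print(top)  # 'is', 'better' and 'than' each occur 3 times
```

Because the three words are tied at 3, their relative order in the result can differ between Python versions.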


Reposted from blog.csdn.net/qq_32482091/article/details/81072595