Full text of Hamlet.txt: https://python123.io/resources/pye/hamlet.txt
Word-frequency counting code ① is as follows:
# CalHamlet_1.py
def getText():
    # Read the whole file; Hamlet.txt must be in the working directory
    with open("Hamlet.txt", 'r') as f:
        txt = f.read()
    txt = txt.lower()  # convert all English letters to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, ' ')  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # tally each word
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for i in range(10):
    word, count = items[i]
    print('{0:<10}{1:>5}'.format(word, count))
Output:
the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436
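The counting loop in code ① can also be written with `collections.Counter` from the standard library. The following is only a minimal sketch of that alternative, not the textbook's version: the function name `top_words`, the constant `PUNCTUATION`, and the sample sentence are hypothetical and are used here so the example runs without Hamlet.txt.

```python
from collections import Counter

# Same special characters as in code ① (note: the apostrophe is not included)
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    text = text.lower()                   # normalize case
    for ch in PUNCTUATION:
        text = text.replace(ch, ' ')      # punctuation becomes a separator
    return Counter(text.split()).most_common(n)

# Hypothetical sample text instead of the full play
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print('{0:<10}{1:>5}'.format(word, count))
```

To apply it to the play, the text returned by `getText()` could be passed in place of `sample`.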
Word-frequency counting code ② is as follows:
(It excludes most function words, such as articles, pronouns, and conjunctions.)
# CalHamlet_2.py
# Exclusion set: function words (articles, pronouns, conjunctions, etc.) to skip
excludes = {"the", "and", "of", "you", "a", "i", "my", "in"}

def getText():
    # Read the whole file; Hamlet.txt must be in the working directory
    with open("Hamlet.txt", 'r') as f:
        txt = f.read()
    txt = txt.lower()  # convert all English letters to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, ' ')  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # tally each word
for word in excludes:
    counts.pop(word, None)  # unlike del, no KeyError if the word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for i in range(10):
    word, count = items[i]
    print('{0:<10}{1:>5}'.format(word, count))
Output:
to          754
hamlet      462
it          416
that        391
is          340
not         314
lord        309
his         296
this        295
but         269
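Code ② counts every word first and then deletes the excluded ones. An alternative is to skip the excluded words while counting. The sketch below assumes the same exclusion set as code ②; the names `top_content_words`, `EXCLUDES`, and `PUNCTUATION`, as well as the sample sentence, are hypothetical and serve only as an illustration.

```python
from collections import Counter

PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
# Same exclusion set as in code ②
EXCLUDES = {"the", "and", "of", "you", "a", "i", "my", "in"}

def top_content_words(text, excludes=EXCLUDES, n=10):
    """Return the n most frequent words that are not in excludes."""
    text = text.lower()
    for ch in PUNCTUATION:
        text = text.replace(ch, ' ')
    # Filter during counting, so excluded words never enter the table
    counts = Counter(w for w in text.split() if w not in excludes)
    return counts.most_common(n)

# Hypothetical sample text instead of the full play
sample = "The king and the queen: the king of Denmark, and my lord the king."
print(top_content_words(sample, n=1))  # [('king', 3)]
```

Because the filter runs before counting, no deletion step is needed and a word that never occurs in the text cannot cause an error. Extending `EXCLUDES` (for example with "to", "it", or "that") would push more content words such as "hamlet" and "lord" toward the top of the list.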
References:
[1] Song Tian, Li Xin, Huang Tianyu. Python语言程序设计基础 (Fundamentals of Python Programming) [M]. 2nd ed. Beijing: Higher Education Press, 2019: 171-174.