Downloading Hamlet.txt and Counting Word Frequencies in the Text


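Full text of Hamlet.txt: https://python123.io/resources/pye/hamlet.txt
(The scripts below open the file as "Hamlet.txt" using a relative path, so save it under that name in the directory you run them from.)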
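If you would rather fetch the file from Python than from a browser, a minimal sketch using only the standard library could look like this. The function name `download_text` and the constant `HAMLET_URL` are hypothetical names made up for illustration; the default destination `Hamlet.txt` is chosen to match the file name the scripts below open.

```python
from urllib.request import urlretrieve

# URL taken from the download link above.
HAMLET_URL = "https://python123.io/resources/pye/hamlet.txt"

def download_text(url=HAMLET_URL, dest="Hamlet.txt"):
    """Save the resource at `url` to the local path `dest` and return that path.

    `download_text` is a hypothetical helper, not part of the original scripts.
    """
    path, _headers = urlretrieve(url, dest)
    return path
```

Calling `download_text()` with no arguments should write `Hamlet.txt` into the current working directory, assuming the URL is still reachable.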


Word frequency counting, version ①:

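# CalHamlet_1.py
def getText():
    # Read the whole file; the with-block closes it automatically
    with open("Hamlet.txt", 'r') as f:
        txt = f.read()
    txt = txt.lower()    # convert all English letters to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, ' ')  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()    # split on whitespace into a list of words
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1    # tally each word
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)    # most frequent first
for i in range(10):    # print the ten most frequent words
    word, count = items[i]
    print('{0:<10}{1:>5}'.format(word, count))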

Output:

D:\anaconda\new_launch\python.exe D:/pycharm/program/untitled/test.py
the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436

Process finished with exit code 0
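The same counting step can also be written with `collections.Counter` from the standard library. The sketch below is an alternative formulation, not the version from the listing above: `top_words`, `PUNCT`, and the sample sentence are names and data introduced here for illustration. It reuses the same punctuation string as the script, so the apostrophe is still left untouched.

```python
from collections import Counter

# Same special characters the script above replaces with spaces.
PUNCT = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    # Build a translation table that maps each special character to a space.
    table = str.maketrans(PUNCT, " " * len(PUNCT))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)

# Illustrative input, not taken from Hamlet.txt.
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print('{0:<10}{1:>5}'.format(word, count))
```

To apply it to the play, pass the contents of `Hamlet.txt` as `text`. `str.translate` handles all the replacements in one pass instead of one `replace` call per character.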

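Word frequency counting, version ②:
(excludes a handful of the most frequent function words, such as articles, pronouns, and conjunctions)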

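# CalHamlet_2.py
# Exclusion set: a few high-frequency function words (articles, pronouns, conjunctions)
excludes = {"the", "and", "of", "you", "a", "i", "my", "in"}

def getText():
    # Read the whole file; the with-block closes it automatically
    with open("Hamlet.txt", 'r') as f:
        txt = f.read()
    txt = txt.lower()    # convert all English letters to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, ' ')  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()    # split on whitespace into a list of words
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1    # tally each word
for word in excludes:
    counts.pop(word, None)    # drop excluded words; no KeyError if a word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)    # most frequent first
for i in range(10):    # print the ten most frequent remaining words
    word, count = items[i]
    print('{0:<10}{1:>5}'.format(word, count))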

Output:

D:\anaconda\new_launch\python.exe D:/pycharm/program/untitled/test.py
to          754
hamlet      462
it          416
that        391
is          340
not         314
lord        309
his         296
this        295
but         269

Process finished with exit code 0
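The output still contains function words such as "to", "it", and "that", so the exclusion set usually has to be extended by hand and the script rerun. One way to make that easier is to skip excluded words while counting rather than deleting them afterwards. The sketch below assumes a hypothetical helper `count_words` and a made-up word list; neither comes from the script above.

```python
def count_words(words, excludes=frozenset()):
    """Count occurrences of each word, skipping any word found in excludes."""
    counts = {}
    for word in words:
        if word in excludes:
            continue  # excluded words never enter the dictionary
        counts[word] = counts.get(word, 0) + 1
    return counts

# Illustrative input, not taken from Hamlet.txt.
words = "the lord and the king and the lord".split()
counts = count_words(words, excludes={"the", "and", "of"})
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for word, count in ranked:
    print('{0:<10}{1:>5}'.format(word, count))
```

Because excluded words are never inserted, a word in `excludes` that does not occur in the text (here "of") needs no special handling.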


References:

[1] Song Tian, Li Xin, Huang Tianyu. Python语言程序设计基础 (Fundamentals of Python Programming) [M]. 2nd ed. Beijing: Higher Education Press, 2019: 171-174.

Reposted from blog.csdn.net/qq_38636076/article/details/104626943