Tokenizing on punctuation and counting word frequencies

Suppose we have the following passage:

As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.

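I want to see which words in it occur most often and which occur least often.

That takes two steps:

1. Tokenize first, splitting the text on punctuation and whitespace.

2. Then count how often each word occurs.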
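To make step 1 concrete, here is a small sketch of what splitting on punctuation and whitespace does to a phrase. The sample string is made up for illustration; the pattern `\W+` treats any run of characters other than letters, digits and underscore as a separator, so apostrophes and hyphens also break words apart.

```python
import re

# Made-up sample phrase, not taken from the passage above.
sample = "Long John's well-used crutch"

# Lowercase first, then split on runs of non-word characters.
tokens = re.split(r"\W+", sample.lower())

print(tokens)
```

With this rule "john's" becomes two tokens, "john" and "s", which is usually acceptable for a rough frequency count but worth keeping in mind when reading the results.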

import re
from collections import Counter


def count_words(text):
    """Count how many times each word occurs in text."""
    # Convert to lower case so that "He" and "he" count as the same word.
    text_lower = text.lower()
    # Split on runs of non-word characters (punctuation and whitespace).
    tokens = re.split(r'\W+', text_lower)
    # Counter maps each token to its number of occurrences.
    counts = Counter(tokens)
    return counts


def test_run():
    with open("text.txt", "r") as f:
        text = f.read()
        counts = count_words(text)
        # Sort by count, highest first; the sort is stable, so ties keep their original order.
        sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

        print("10 most common words:\nWord\tCount")
        for word, count in sorted_counts[:10]:
            print("{}\t{}".format(word, count))

        print("\n10 least common words:\nWord\tCount")
        for word, count in sorted_counts[-10:]:
            print("{}\t{}".format(word, count))


if __name__ == '__main__':
    test_run()
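As an aside, `Counter` already provides `most_common`, which can stand in for the manual `sorted` call. The sketch below uses a made-up token list rather than the passage, and the variable names are only for illustration.

```python
from collections import Counter

# Made-up tokens for illustration.
counts = Counter(["a", "b", "a", "c", "a", "b"])

# The n most frequent entries, highest count first.
top_two = counts.most_common(2)

# With no argument, most_common() returns every entry in descending
# order, so slicing from the end gives the least frequent ones.
bottom_one = counts.most_common()[-1:]

print(top_two)
print(bottom_one)
```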

Running it gives the following output:

10 most common words:
Word	Count
a	9
he	6
the	6
and	5
as	4
was	4
with	3
i	2
of	2
his	2

10 least common words:
Word	Count
merry	1
word	1
or	1
slap	1
on	1
for	1
more	1
favoured	1
guests	1
	1

Process finished with exit code 0
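The last row of the output has a count but no visible word. A likely explanation is that the passage ends with a period, so `re.split` leaves an empty string after the final word, and that empty string gets counted as a token. The sketch below reproduces this on a short made-up string and shows one possible way around it: matching words with `re.findall` instead of splitting on separators. This is a suggested variant, not part of the original script.

```python
import re

# Made-up sample that ends in punctuation, like the passage does.
text = "a merry word for his guests."

# Splitting on separators leaves an empty string after the final ".".
split_tokens = re.split(r"\W+", text.lower())

# Matching runs of word characters never produces empty tokens.
found_tokens = re.findall(r"\w+", text.lower())

print(split_tokens)
print(found_tokens)
```

Filtering with `[t for t in tokens if t]` after the split would work as well, if keeping `re.split` is preferred.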

Reposted from www.cnblogs.com/chenyusheng0803/p/10794964.html