Counting tokens and their frequencies in a text

Use the Counter class from the collections module.

import collections

sentences = [
    ['the', 'cat', 'is', 'running', 'in', 'the', 'room'],
    ['I', 'love', 'you', 'very', 'much'],
    ['my', 'kids', 'are', 'smart']]

mini_frq = 1  # minimum count a token needs to be kept in the vocabulary

word_counts = collections.Counter()
for sent in sentences:
    word_counts.update(sent)
print(word_counts)
print('=======================================')
print(word_counts.most_common())  # sorted by count, descending; returns a list of (token, count) tuples
print('=========================================')
# index -> token: special tokens first, then every token that meets the frequency threshold
vocabulary_inv = ['<START>', '<UNK>', '<END>'] + \
                 [x[0] for x in word_counts.most_common() if x[1] >= mini_frq]
print(vocabulary_inv)
print('==========================================')
# token -> index: the inverse mapping of vocabulary_inv
vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
print(vocabulary)

Output:

Counter({'the': 2, 'cat': 1, 'is': 1, 'running': 1, 'in': 1, 'room': 1, 'I': 1, 'love': 1, 'you': 1, 'very': 1, 'much': 1, 'my': 1, 'kids': 1, 'are': 1, 'smart': 1})
=======================================
[('the', 2), ('cat', 1), ('is', 1), ('running', 1), ('in', 1), ('room', 1), ('I', 1), ('love', 1), ('you', 1), ('very', 1), ('much', 1), ('my', 1), ('kids', 1), ('are', 1), ('smart', 1)]
=========================================
['<START>', '<UNK>', '<END>', 'the', 'cat', 'is', 'running', 'in', 'room', 'I', 'love', 'you', 'very', 'much', 'my', 'kids', 'are', 'smart']
==========================================
{'<START>': 0, '<UNK>': 1, '<END>': 2, 'the': 3, 'cat': 4, 'is': 5, 'running': 6, 'in': 7, 'room': 8, 'I': 9, 'love': 10, 'you': 11, 'very': 12, 'much': 13, 'my': 14, 'kids': 15, 'are': 16, 'smart': 17}
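
With mini_frq = 1 every token is kept, so the threshold has no visible effect in the output above. The sketch below shows what a higher threshold does, and how the resulting vocabulary might be used to turn a tokenized sentence into ids. The helper names build_vocab and encode, and the choice to wrap each sentence in <START>/<END>, are assumptions made for illustration; they are not part of the original code.

```python
import collections

SPECIAL_TOKENS = ['<START>', '<UNK>', '<END>']

def build_vocab(sentences, min_freq=1):
    """Map the special tokens and every token seen at least min_freq times to an integer id."""
    word_counts = collections.Counter()
    for sent in sentences:
        word_counts.update(sent)
    kept = [tok for tok, n in word_counts.most_common() if n >= min_freq]
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + kept)}

def encode(sentence, vocab):
    """Wrap a tokenized sentence in <START>/<END>; tokens missing from vocab become <UNK>."""
    unk = vocab['<UNK>']
    ids = [vocab.get(tok, unk) for tok in sentence]
    return [vocab['<START>']] + ids + [vocab['<END>']]

sentences = [
    ['the', 'cat', 'is', 'running', 'in', 'the', 'room'],
    ['I', 'love', 'you', 'very', 'much'],
    ['my', 'kids', 'are', 'smart']]

# Only 'the' occurs twice, so it is the only regular token that survives min_freq=2.
vocab = build_vocab(sentences, min_freq=2)
print(vocab)   # {'<START>': 0, '<UNK>': 1, '<END>': 2, 'the': 3}
print(encode(['the', 'cat', 'is', 'running'], vocab))   # [0, 3, 1, 1, 1, 2]
```

Falling back to the <UNK> id through dict.get keeps encoding from raising KeyError on tokens that were filtered out or never seen.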

The code above uses the collections module. For a detailed walkthrough of the module, see:
http://www.pythoner.com/205.html
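
As a quick taste of Counter itself, here is a minimal sketch of a few behaviours that are handy when counting tokens. The sample tokens are made up for illustration.

```python
from collections import Counter

counts = Counter(['the', 'cat', 'the', 'room'])

# most_common(n) returns only the n highest counts.
top = counts.most_common(1)

# Looking up a token that was never counted returns 0 instead of raising KeyError.
missing = counts['dog']

# update() adds to the existing counts rather than replacing them.
counts.update(['dog', 'dog', 'dog'])

# Counters can be added together; counts for shared keys are summed.
merged = counts + Counter({'cat': 2})

print(top)            # [('the', 2)]
print(missing)        # 0
print(counts['dog'])  # 3
print(merged['cat'])  # 3
```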

Reposted from blog.csdn.net/weixin_43178406/article/details/102779756