Punctuation-Based Tokenization and Word Frequency Counting

Other · 2019-04-30 11:37:59

Suppose we have the following passage of text:

As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.

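I want to see which words in it occur most often and which occur least often.

That takes two steps:

1. Tokenize the text, splitting on punctuation and whitespace.

2. Count how often each word occurs.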
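As a quick illustration of these two steps, here is a minimal sketch on a made-up sentence. The sample string and the variable names are assumptions chosen for illustration only; they are not part of the passage above.

```python
import re
from collections import Counter

# Hypothetical sample sentence, used only for illustration.
sample = "He was tall, and he was strong."

# Step 1: lower-case the text, then split on runs of non-word characters
# (punctuation and whitespace).
tokens = re.split(r"\W+", sample.lower())
# The list ends with '' because the sentence ends with a separator (".").
print(tokens)

# Step 2: count how often each token occurs.
counts = Counter(tokens)
print(counts["he"], counts["was"], counts["tall"])
```

Lower-casing first means "He" and "he" are counted as the same word.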

import re
from collections import Counter


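def count_words(text):
    """Count how many times each word occurs in text (case-insensitive)."""
    # Convert to lower case so "The" and "the" are counted together.
    text_lower = text.lower()
    # Split on runs of non-word characters (punctuation and whitespace).
    # Note: re.split yields an empty string when the text starts or ends
    # with a separator, so '' can show up as a token.
    tokens = re.split(r'\W+', text_lower)
    counts = Counter(tokens)
    return counts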


def test_run():
    # Read the passage to analyse; text.txt must be in the working directory.
    with open("text.txt", "r") as f:
        text = f.read()

    counts = count_words(text)
    # Sort (word, count) pairs by count, highest first.
    sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

    print("10 most common words:\nWord\tCount")
    for word, count in sorted_counts[:10]:
        print("{}\t{}".format(word, count))

    print("\n10 least common words:\nWord\tCount")
    for word, count in sorted_counts[-10:]:
        print("{}\t{}".format(word, count))


if __name__ == '__main__':
    test_run()
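Since the counts are already held in a Counter, the manual sort can probably be replaced by the built-in Counter.most_common method. The sketch below is one possible alternative, not the approach used in the listing above; the token list is a hypothetical example.

```python
from collections import Counter

# Hypothetical token list, used only for illustration.
tokens = ["a", "he", "a", "the", "he", "a", "crutch"]
counts = Counter(tokens)

# The two most common (word, count) pairs, highest count first.
top = counts.most_common(2)
print(top)

# With no argument, most_common() returns every pair sorted by count,
# so the last two entries of that list are the least common words.
bottom = counts.most_common()[-2:]
print(bottom)
```

Words with equal counts keep the order in which they were first seen, so ties among the least common words depend on where those words appear in the text.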

Running it produces the following output:

10 most common words:
Word Count
a 9
he 6
the 6
and 5
as 4
was 4
with 3
i 2
of 2
his 2

10 least common words:
Word Count
merry 1
word 1
or 1
slap 1
on 1
for 1
more 1
favoured 1
guests 1
1

Process finished with exit code 0
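The last entry in the least-common list has a count of 1 but no visible word. A likely explanation (an inference from how re.split behaves, not something the original post states) is that the passage ends with a period, so re.split returns an empty string as its final token. Below is a minimal sketch of one way to avoid that, matching words with re.findall instead of splitting on separators. The function name count_words_no_empty is hypothetical.

```python
import re
from collections import Counter


def count_words_no_empty(text):
    """Count words by matching runs of word characters (hypothetical helper)."""
    # findall returns only the matched runs of word characters, so no empty
    # strings appear even if the text starts or ends with punctuation.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


counts = count_words_no_empty("A merry word, or a slap on the shoulder.")
print("" in counts)
print(counts["a"])
```

With this variant the empty token would drop out, so the ten least common words would all be real words.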

Reposted from www.cnblogs.com/chenyusheng0803/p/10794964.html