# Understanding CountVectorizer and TfidfVectorizer from sklearn.feature_extraction.text

"""
理解sklearn中的CountVectorizer和TfidfVectorizer
"""
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["there is a dog dog", "here is a cat"]

# Bag-of-words: each row is a document, each column the count of one term.
count_vec = CountVectorizer()
a = count_vec.fit_transform(sentences)
print(a.toarray())
print(count_vec.vocabulary_)
# Example vocabulary_ output:
#   {'dog': 1, 'there': 4, 'here': 2, 'cat': 0, 'is': 3}
# i.e. the column index assigned to each term.

print("=" * 10)
tf_vec = TfidfVectorizer()
b = tf_vec.fit_transform(sentences)
print(b.toarray())
print(tf_vec.vocabulary_)
print(tf_vec.idf_)  # inverse document frequency per term
# NOTE: get_feature_names() was deprecated in scikit-learn 1.0 and removed
# in 1.2; get_feature_names_out() is the supported replacement.
print(tf_vec.get_feature_names_out())


def mytf_idf(s, words=None):
    """Hand-rolled re-implementation of sklearn's default tf-idf transform.

    Computes smoothed tf-idf (matching TfidfVectorizer's defaults:
    smooth_idf=True, norm='l2') for the documents in ``s``, prints the
    resulting matrix, and returns it.

    :param s: list of whitespace-tokenizable document strings
    :param words: vocabulary defining the column order; defaults to the
        feature names of the fitted module-level ``tf_vec`` so the output
        can be compared against sklearn's
    :return: the L2-row-normalized tf-idf matrix, shape (len(s), len(words))
    """
    if words is None:
        # get_feature_names() was removed in scikit-learn 1.2;
        # get_feature_names_out() is the supported replacement.
        words = list(tf_vec.get_feature_names_out())
    tf_matrix = np.zeros((len(s), len(words)), dtype=np.float32)
    smooth = 1
    # Document frequency, pre-seeded with the smoothing term so that
    # idf = log((n_docs + smooth) / (df + smooth)) + 1, as sklearn does.
    df_matrix = np.ones(len(words), dtype=np.float32) * smooth
    for i, doc in enumerate(s):
        # Build the term counts once per document (the original rebuilt
        # the Counter inside the vocabulary loop — O(|doc|) per term).
        counts = Counter(doc.split())
        for j, word in enumerate(words):
            cnt = counts.get(word, 0)
            tf_matrix[i][j] = cnt
            if cnt > 0:
                df_matrix[j] += 1
    # With smoothing the ratio is >= 1, so every idf value is >= 1.
    idf_matrix = np.log((len(s) + smooth) / df_matrix) + 1
    matrix = tf_matrix * idf_matrix
    # L2-normalize each row (sklearn's norm='l2' default).
    matrix = matrix / np.linalg.norm(matrix, 2, axis=1).reshape(matrix.shape[0], 1)
    print(matrix)
    return matrix


# Run the hand-rolled tf-idf on the same corpus for comparison.
print("=" * 10)
mytf_idf(sentences)
# TODO:
# * IDF could be learned (e.g. via backpropagation in a neural network)
#   instead of being computed directly from document frequencies.
# * CountVectorizer sometimes only needs term presence, not term counts.

# Reposted from: www.cnblogs.com/weiyinfu/p/9558755.html