4.2 Two methods of text feature extraction: CountVectorizer and TfidfVectorizer

Introduction

Roughly speaking, both of the following methods convert text data into numeric data.

The array produced by CountVectorizer can be read as: each row records how many times each word appears in one text sample.

The array produced by TfidfVectorizer can be read as: each row records how important each word is to one text sample.

1. CountVectorizer: counting word occurrences

Its main purpose is to turn text data into feature values.

1. Introduction

See the official website for details

Purpose: turn text data into feature values (vectorization).
Class: sklearn.feature_extraction.text.CountVectorizer

CountVectorizer syntax
CountVectorizer(max_df=1.0, min_df=1, ...)
# Returns a term-frequency (word-count) matrix

# Methods
CountVectorizer.fit_transform(X)
# X: text, or an iterable containing text strings
# Returns: a sparse matrix
CountVectorizer.inverse_transform(X)
# X: array or sparse matrix
# Returns: the data in the format it had before the transformation
CountVectorizer.get_feature_names()
# Returns: list of words (the vocabulary)
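
The max_df and min_df parameters shown above filter the vocabulary by document frequency. A minimal sketch of their effect (the toy corpus and thresholds here are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "python is great and python is popular",
    "java is popular",
    "go is fast",
]
# min_df=2: keep only words that appear in at least 2 documents
# max_df=0.9: drop words that appear in more than 90% of documents (here "is")
cv = CountVectorizer(min_df=2, max_df=0.9)
X = cv.fit_transform(corpus)
print(cv.get_feature_names())   # ['popular']  (get_feature_names_out() in newer scikit-learn versions)
print(X.toarray())              # one column remains, counting 'popular' in each text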

2. Case

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

# ngram_range=(2, 2)
# ngram_range=(1, 1) means unigrams only, (1, 2) means unigrams and bigrams, and (2, 2) means bigrams only.
>>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
>>> X2 = vectorizer2.fit_transform(corpus)
>>> print(vectorizer2.get_feature_names())
['and this', 'document is', 'first document', 'is the', 'is this',
'second document', 'the first', 'the second', 'the third', 'third one',
 'this document', 'this is', 'this the']
>>> print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]
# Text feature extraction (English)
# Import the package
from sklearn.feature_extraction.text import CountVectorizer

def countvec():
    '''Turn text into feature values.
    return None'''
    # Instantiate the vectorizer
    cv = CountVectorizer()
    # Call fit_transform to fit on the data and transform it; returns a sparse matrix
    data = cv.fit_transform(["life is short,i like python","life is too long,i dislike python"])
    # Call get_feature_names() to get the list of words (the vocabulary)
    print(cv.get_feature_names())
    # CountVectorizer() has no sparse parameter; use toarray() to convert the sparse matrix into a regular array
    print(data.toarray())
    return None

if __name__ == '__main__':
    countvec()

# ['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
# [[0 1 1 1 0 1 1 0]
#  [1 1 1 0 1 1 0 1]]
'''1. Collect every word that appears in any of the texts; duplicates count once - this is the word list (vocabulary).
   2. For each text, count how many times each word in the vocabulary appears.
   3. Single characters are not counted - a single English letter gives no basis for classification.'''
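
The inverse_transform method listed in the syntax above is not used in the example; a minimal sketch of it on the same two sentences:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data = cv.fit_transform(["life is short,i like python", "life is too long,i dislike python"])
# Map each row of counts back to the words it contains (order follows the vocabulary, not the sentence)
print(cv.inverse_transform(data))
# [array(['is', 'life', 'like', 'python', 'short'], ...), array(['dislike', 'is', 'life', 'long', 'python', 'too'], ...)]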

# Text feature extraction (Chinese)
from sklearn.feature_extraction.text import CountVectorizer
import jieba

# Use the jieba package: jieba.cut performs word segmentation and returns a generator of words
# Chinese text has to be segmented into words before it can be vectorized properly
def cutword():

    con1 = jieba.cut("今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。")
    con2 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。")
    con3 = jieba.cut("如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。")

    # Convert the generators to lists
    content1 = list(con1)
    content2 = list(con2)
    content3 = list(con3)

    # Join each list into a single space-separated string
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)

    return c1, c2, c3

def hanzivec():
    """
    Vectorize Chinese text
    :return: None
    """
    c1, c2, c3 = cutword()
    print(c1, c2, c3)

    cv = CountVectorizer()

    data = cv.fit_transform([c1, c2, c3])

    print(cv.get_feature_names())

    print(data.toarray())

    return None

if __name__ == '__main__':
    hanzivec()

['一种', '不会', '不要', '之前', '了解', '事物', '今天', '光是在', '几百万年', '发出', '取决于', '只用', '后天', '含义', '大部分', '如何', '如果', '宇宙', '我们', '所以', '放弃', '方式', '明天', '星系', '晚上', '某样', '残酷', '每个', '看到', '真正', '秘密', '绝对', '美好', '联系', '过去', '这样']
[[0 0 1 0 0 0 2 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 0]
 [0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 1]
 [1 1 0 0 4 3 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0]]
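
As a side note, instead of segmenting first and re-joining with spaces, a tokenizer callable can be passed to CountVectorizer directly. This is a minimal sketch, not part of the original example (with a custom tokenizer the default token_pattern is not applied, so single characters and punctuation are filtered out by hand):

from sklearn.feature_extraction.text import CountVectorizer
import jieba

texts = [
    "今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。",
    "我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。",
]
# jieba.lcut returns a list of words; keep only tokens longer than one character
cv = CountVectorizer(tokenizer=lambda s: [w for w in jieba.lcut(s) if len(w) > 1])
data = cv.fit_transform(texts)
print(cv.get_feature_names())
print(data.toarray())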

2. TfidfVectorizer

See the official website for details

1. Introduction

The main idea of TF-IDF: if a word or phrase appears frequently in one document
but rarely in other documents, it is considered to have good discriminating power
between classes and is therefore useful for classification.

tf * idf: degree of importance
tf: term frequency, the number of times the word appears in the document
idf: inverse document frequency, log(total number of documents / number of documents containing the word)

Purpose of TF-IDF: to evaluate how important a word is to one document within a collection or corpus.
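
To make the formula concrete, here is a minimal hand-rolled sketch of the idea (scikit-learn's TfidfVectorizer additionally smooths the idf term and L2-normalizes each row by default, so its numbers will differ):

import math

docs = [
    ["life", "is", "short", "like", "python"],
    ["life", "is", "too", "long", "dislike", "python"],
]

def tfidf(word, doc, docs):
    # tf: frequency of the word within this document
    tf = doc.count(word) / len(doc)
    # idf: log(total number of documents / number of documents containing the word)
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df)
    return tf * idf

print(tfidf("python", docs[0], docs))   # 0.0 - appears in every document, so it carries no weight
print(tfidf("short", docs[0], docs))    # > 0 - appears in only one document, so it is more discriminative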

Class: sklearn.feature_extraction.text.TfidfVectorizer
TfidfVectorizer syntax

TfidfVectorizer(stop_words=None, ...)
# Returns a matrix of word weights
# Methods
TfidfVectorizer.fit_transform(X)
# X: text, or an iterable containing text strings
# Returns: a sparse matrix
TfidfVectorizer.inverse_transform(X)
# X: array or sparse matrix
# Returns: the data in the format it had before the transformation
TfidfVectorizer.get_feature_names()
# Returns: list of words (the vocabulary)

2. Case

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())
print(X.shape)


['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
(4, 9)
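
The stop_words parameter from the syntax above drops very common words before weighting; a minimal sketch on the same corpus, using scikit-learn's built-in English stop word list:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# stop_words='english' activates the built-in English stop word list,
# so words such as 'this', 'is', 'the' and 'and' are removed from the vocabulary
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())   # only the remaining content words, e.g. 'document'
print(X.toarray())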

# TF-IDF text feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba
def cutword():

    con1 = jieba.cut("今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。")

    con2 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。")

    con3 = jieba.cut("如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。")

    # Convert the generators to lists
    content1 = list(con1)
    content2 = list(con2)
    content3 = list(con3)

    # Join each list into a single space-separated string
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)

    return c1, c2, c3

def tfidfvec():
    """
    Vectorize Chinese text
    :return: None
    """
    c1, c2, c3 = cutword()

    print(c1, c2, c3)

    tf = TfidfVectorizer()

    data = tf.fit_transform([c1, c2, c3])

    print(tf.get_feature_names())

    print(data.toarray())

    return None
if __name__ == '__main__':
    tfidfvec()

# ['一种', '不会', '不要', '之前', '了解', '事物', '今天', '光是在', '几百万年', '发出', '取决于', '只用', '后天', '含义', '大部分', '如何', '如果', '宇宙', '我们', '所以', '放弃', '方式', '明天', '星系', '晚上', '某样', '残酷', '每个', '看到', '真正', '秘密', '绝对', '美好', '联系', '过去', '这样']
# [[0.         0.         0.21821789 0.         0.         0.
#   0.43643578 0.         0.         0.         0.         0.
#   0.21821789 0.         0.21821789 0.         0.         0.
#   0.         0.21821789 0.21821789 0.         0.43643578 0.
#   0.21821789 0.         0.43643578 0.21821789 0.         0.
#   0.         0.21821789 0.21821789 0.         0.         0.        ]
#  [0.         0.         0.         0.2410822  0.         0.
#   0.         0.2410822  0.2410822  0.2410822  0.         0.
#   0.         0.         0.         0.         0.         0.2410822
#   0.55004769 0.         0.         0.         0.         0.2410822
#   0.         0.         0.         0.         0.48216441 0.
#   0.         0.         0.         0.         0.2410822  0.2410822 ]
#  [0.15698297 0.15698297 0.         0.         0.62793188 0.47094891
#   0.         0.         0.         0.         0.15698297 0.15698297
#   0.         0.15698297 0.         0.15698297 0.15698297 0.
#   0.1193896  0.         0.         0.15698297 0.         0.
#   0.         0.15698297 0.         0.         0.         0.31396594
#   0.15698297 0.         0.         0.15698297 0.         0.        ]]


Origin blog.csdn.net/weixin_46649052/article/details/112546357