Feature extraction for Chinese text with CountVectorizer

The CountVectorizer feature extraction method

from sklearn.feature_extraction.text import CountVectorizer

This method builds features from word-occurrence counts and is commonly used for text classification.

Text feature extraction

Purpose: convert text into numeric feature values

sklearn.feature_extraction.text.CountVectorizer(stop_words=[])

 Returns: a word-frequency matrix

CountVectorizer.fit_transform(X)  X: text, or an iterable containing text strings

 Returns: a sparse matrix; append .toarray() to convert it to a 2-D array

CountVectorizer.inverse_transform(X)  X: an array or sparse matrix

 Returns: the data in its format before conversion

CountVectorizer.get_feature_names()

 Returns: the list of words, i.e. the feature names (in scikit-learn >= 1.0 this method is get_feature_names_out(); the old name was removed in 1.2)



Chinese feature extraction example (manual word segmentation)

from sklearn.feature_extraction.text import CountVectorizer

# Chinese text needs word segmentation first: each sentence is written as
# space-separated words. English does not need this step, because English
# words are already separated by spaces.
def chinese_text_count_demo():
    data = ["我 爱 北京 天安门", "天安门 上 太阳 升"]

    # 1. Instantiate a converter class (it is called a converter because it
    #    converts the text into numeric values)
    transfer = CountVectorizer()

    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("feature names:\n", transfer.get_feature_names())

    return None

if __name__ == '__main__':
    chinese_text_count_demo()

output:
data_new:
 [[1 1 0]
 [0 1 1]]
feature names:
 ['北京', '天安门', '太阳']

Analysis: each row of the matrix above corresponds to one sentence in data.

Each number is the count of the corresponding feature word in that sentence.
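One detail worth knowing when segmenting by hand: CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", only matches tokens of two or more characters, which is why single-character words such as 我, 爱, 上 and 升 are silently dropped from the matrix above. A sketch showing both the default behaviour and a pattern that keeps single-character tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer

# the manually segmented sentences from the demo above
data = ["我 爱 北京 天安门", "天安门 上 太阳 升"]

default = CountVectorizer()                  # token_pattern=r"(?u)\b\w\w+\b"
default.fit(data)
print(default.get_feature_names_out())       # 我 / 爱 / 上 / 升 are dropped

keep_all = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_all.fit(data)
print(keep_all.get_feature_names_out())      # single-character words are kept
</imports>```

For real Chinese work, single characters often carry little signal, so the default is frequently acceptable; the custom pattern is there when they matter.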

Chinese feature extraction example (using jieba for word segmentation)

First install jieba from the command line:

pip3 install jieba  (or: pip install jieba)

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cut_word(text):
    # Chinese word segmentation:
    # jieba.cut(text) returns a generator object, so it is converted to a
    # list before joining
    return " ".join(list(jieba.cut(text)))
    # alternatively: return " ".join(jieba.lcut(text))
    # jieba.lcut(text) returns a list directly

def auto_chinese_text_count_demo():
    data = ["you say how to do this",
            "Tang Long loudly asked how is it",
            "night to find a place to drink Jizhong how kind",
            "Old Faithful brought them to Zhulao out there standing at the grave of cypress say you look at how this kind of terrain, if our people come from the city through large or small ferry crossing thousands of miles along the dike"]
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))

    print("sentences after segmentation:\n", data_new)

    # 1. Instantiate a converter class
    transfer = CountVectorizer(stop_words=["said", "a"])
    # stop words should normally be cleaned up beforehand; this is just a demonstration

    # 2. Call fit_transform
    data_vector_value = transfer.fit_transform(data_new)
    print("data_vector_value:\n", data_vector_value.toarray())
    print("feature names:\n", transfer.get_feature_names())

    return None


if __name__ == '__main__':
    auto_chinese_text_count_demo()


output:
sentences after segmentation:
 ['you say that the how to do', 'Tang Long wondered aloud how it', 'at night to find a place to drink Jizhong how kind', 'old faithful brought them standing there next to Zhulao Grand cypress grave saying how you look at this kind of terrain if our people come from the city through large or small ferry crossing thousands of miles along the dike']
data_vector_value:
 [[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
 [1 0 1 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1]]
feature names:
 ['them', 'Jizhong', 'Trinidad', 'Tang Long', 'terrain', 'local', 'grave', 'the city', 'loud', 'big cypress', 'big ferry', 'how to do', 'how is it', 'how to', 'we', 'or', 'find', 'from', 'night', 'Zhulao out', 'down', 'crossing', 'look', 'after', 'Old Faithful collar', 'come', 'this', 'where']

 


Origin www.cnblogs.com/henabo/p/11588437.html