Feature extraction with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
This method builds features from word-count statistics and is commonly used as a preprocessing step for text classification.
Text Feature Extraction
Purpose: turn text into numeric feature values.
sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
Returns: a word-frequency matrix
CountVectorizer.fit_transform(X)  X: a text or an iterable containing text strings
Returns: a sparse matrix; append .toarray() to convert it into a two-dimensional array
CountVectorizer.inverse_transform(X)  X: an array or sparse matrix
Returns: the data in its format before the transformation
CountVectorizer.get_feature_names()
Returns: the list of words, which is to say the feature names (in scikit-learn 1.0+ this method is named get_feature_names_out(); get_feature_names() was removed in 1.2)
Chinese feature extraction example (manual word segmentation)
from sklearn.feature_extraction.text import CountVectorizer

# Chinese text must be segmented into words first (here done by hand, with spaces);
# English does not need this step, because English words are already separated by spaces.

def chinese_text_count_demo():
    data = ["我 爱 北京 天安门", "天安门 上 太阳 升"]
    # 1. Instantiate a converter class (so called because it converts text into numeric values)
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("feature names:\n", transfer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.2
    return None

if __name__ == '__main__':
    chinese_text_count_demo()

Output:
data_new:
 [[1 1 0]
 [0 1 1]]
feature names:
 ['北京', '天安门', '太阳']
Analysis: each row of the matrix above corresponds to one sentence in data,
and each number is how many times the corresponding feature word appears in that sentence.
Chinese feature extraction example (automatic segmentation with jieba)
First install jieba from the command line:
pip3 install jieba (or pip install jieba)
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cut_word(text):
    # Chinese word segmentation:
    # jieba.cut(text) returns a generator, so convert it to a list before joining
    return " ".join(list(jieba.cut(text)))
    # Equivalent: return " ".join(jieba.lcut(text)) -- jieba.lcut(text) returns a list directly

def auto_chinese_text_count_demo():
    data = ["你说这件事怎么办",
            "唐龙大声问怎么回事",
            "晚上找个地方喝几盅怎么样",
            "老忠领着他们来到朱老巩坟旁的大柏树下站着说你看这地形怎么样要是咱们的人从城里过来或者打大渡口过来沿着千里堤"]
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))
    print("sentences after segmentation:\n", data_new)
    # 1. Instantiate a converter class
    #    (stop words should really be cleaned up in advance; this list is only a demonstration)
    transfer = CountVectorizer(stop_words=["说", "一个"])
    # 2. Call fit_transform
    data_vector_value = transfer.fit_transform(data_new)
    print("data_vector_value:\n", data_vector_value.toarray())
    print("feature names:\n", transfer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.2
    return None

if __name__ == '__main__':
    auto_chinese_text_count_demo()

Output: first the segmented sentences, then a four-row count matrix (one row per sentence, one column per vocabulary word, each entry the number of times that word appears in that sentence), and finally the list of feature names, e.g. '他们', '唐龙', '地形', '大渡口', '怎么办', '怎么回事', '怎么样', '晚上', ...
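The inverse_transform method listed earlier can map each row of the count matrix back to the words it contains. A quick sketch with an invented toy English corpus (note that only the set of vocabulary words per document is recovered, not their order or original punctuation):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["python is short and sweet", "java is long and verbose"]
vec = CountVectorizer()
matrix = vec.fit_transform(corpus)

# inverse_transform returns, for each row, the array of vocabulary words present in it
for words in vec.inverse_transform(matrix):
    print(sorted(words))
```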