1. Parameters
sklearn.feature_extraction.text.CountVectorizer is one of the text feature extraction methods provided by the sklearn.feature_extraction.text module.
The sklearn.feature_extraction.text module provides four text feature extraction methods:
- CountVectorizer
- TfidfVectorizer
- TfidfTransformer
- HashingVectorizer
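The last three are closely related: TfidfVectorizer behaves like a CountVectorizer followed by a TfidfTransformer. A minimal sketch comparing the two routes (the two-document corpus here is illustrative):

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
import numpy as np

corpus = ["the cat sat", "the cat sat on the mat"]

# One-step route: TfidfVectorizer tokenizes, counts and applies TF-IDF weighting
tfidf_direct = TfidfVectorizer().fit_transform(corpus)

# Two-step route: raw counts first, then TF-IDF weighting on top
counts = CountVectorizer().fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# With default parameters both routes produce the same matrix
print(np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray()))  # True
```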
Take a look at the parameters of this function:
sklearn.feature_extraction.text.CountVectorizer(
    input='content',        # input source: 'filename', 'file', or 'content'
    encoding='utf-8',       # default encoding
    decode_error='strict',  # how to handle decoding errors: {'strict', 'ignore', 'replace'}
    strip_accents=None,     # accent removal: {'ascii', 'unicode', None}; 'ascii' is fast but only handles characters with a direct ASCII mapping, 'unicode' works for all characters but is slower
    lowercase=True,         # convert to lowercase
    preprocessor=None,      # override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps
    tokenizer=None,         # override the string tokenization step
    stop_words=None,        # stop words: very frequent but uninformative words such as 'a', 'the', 'an'
    token_pattern='(?u)\b\w\w+\b',  # regex defining what counts as a token
    ngram_range=(1, 1),     # lower and upper bound of the n-gram range
    analyzer='word',        # whether features are built from word or character n-grams
    max_df=1.0,             # ignore terms with a document frequency above this threshold
    min_df=1,               # ignore terms appearing in fewer documents than this
    max_features=None,      # maximum number of features
    vocabulary=None,        # optional predefined vocabulary
    binary=False,           # if True, all non-zero counts are set to 1
    dtype=<class 'numpy.int64'>  # type of the count matrix
)
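A quick sketch of a few of these parameters working together; the parameter values chosen here (bigrams, English stop words, a cap of 2 features) are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
]

# Count unigrams and bigrams, drop common English stop words,
# and keep only the 2 most frequent features
vectorizer = CountVectorizer(ngram_range=(1, 2),
                             stop_words="english",
                             max_features=2)
vectorizer.fit(corpus)

# 'document' and 'second' each occur twice; everything else occurs once
print(sorted(vectorizer.vocabulary_))  # ['document', 'second']
```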
2. Examples
from sklearn.feature_extraction.text import CountVectorizer

# bag-of-words model
vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
# perform the bag-of-words transformation
X = vectorizer.fit_transform(corpus)
print(X.toarray())
result:
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
# get the feature words
# (renamed to get_feature_names_out() in scikit-learn >= 1.0)
print(vectorizer.get_feature_names())
result:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
As can be seen, the feature names are sorted alphabetically.
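The alphabetical order is exactly the column order of the count matrix: the fitted vocabulary_ attribute maps each word to its column index. A small self-contained check (rebuilding the same vectorizer as above):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# vocabulary_ maps each token to its column index in the output matrix
print(vectorizer.vocabulary_['and'])       # 0, the first column
print(vectorizer.vocabulary_['document'])  # 1, the second column
```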
# by default, extracted words must be at least 2 characters long
analyzer = vectorizer.build_analyzer()
print(analyzer("This is a text document to analyze."))
result:
['this', 'is', 'text', 'document', 'to', 'analyze']
The one-character word 'a' and the punctuation mark '.' have been filtered out.
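The two-character minimum comes from the default token_pattern, (?u)\b\w\w+\b, which requires at least two word characters. A sketch that swaps in (?u)\b\w+\b to keep one-letter tokens as well:

```python
from sklearn.feature_extraction.text import CountVectorizer

# \w+ instead of the default \w\w+ also matches one-character tokens like 'a';
# punctuation is still dropped because \w only matches word characters
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
analyzer = vectorizer.build_analyzer()
print(analyzer("This is a text document to analyze."))
# ['this', 'is', 'a', 'text', 'document', 'to', 'analyze']
```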
# process new text
vectorizer_result = vectorizer.transform(['Something completely new.']).toarray()
print(vectorizer_result)
result:
[[0 0 0 0 0 0 0 0 0]]
None of the words in the new text appear in the fitted vocabulary, so every count is zero.
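To see transform() actually counting something, the new text needs to share at least one word with the fitted vocabulary. A sketch where 'document' is the only known word (the example sentence is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# transform() only counts words already in the fitted vocabulary;
# 'something', 'completely' and 'new' are unknown, so only 'document'
# (column 1) gets a non-zero count
result = vectorizer.transform(['Something completely new document.']).toarray()
print(result)  # [[0 1 0 0 0 0 0 0 0]]
```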