sklearn.feature_extraction.text.CountVectorizer

1. Parameters

CountVectorizer is one of the text feature extraction methods provided by the sklearn.feature_extraction.text module.

The module provides 4 text feature extraction methods:

  • CountVectorizer
  • TfidfVectorizer
  • TfidfTransformer
  • HashingVectorizer

Take a look at the parameters of this class:

sklearn.feature_extraction.text.CountVectorizer(
    input='content',        # input type: can be a file name, a file object, or the text content itself
    encoding='utf-8',       # default encoding
    decode_error='strict',  # how to handle decoding errors, one of {'strict', 'ignore', 'replace'}
    strip_accents=None,     # accent removal, one of {'ascii', 'unicode', None}; 'ascii' is fast but only suitable for ASCII text, 'unicode' works for all characters but is slower
    lowercase=True,         # convert all characters to lowercase
    preprocessor=None,      # override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps
    tokenizer=None,         # override the string tokenization step
    stop_words=None,        # stop words: very frequent but meaningless words, such as 'a', 'the', 'an'
    token_pattern=r'(?u)\b\w\w+\b',  # regexp defining what counts as a token; the default keeps words of 2 or more characters
    ngram_range=(1, 1),     # lower and upper bound of the n-gram range to extract
    analyzer='word',        # whether features are made of word or character n-grams
    max_df=1.0,             # ignore terms with a document frequency strictly above this threshold
    min_df=1,               # the minimum number of documents a word must appear in
    max_features=None,      # keep only the top max_features terms ordered by frequency
    vocabulary=None,        # optional fixed vocabulary
    binary=False,           # if True, all non-zero counts are set to 1
    dtype=<class 'numpy.int64'>)  # type of the returned count matrix

2. Examples

from sklearn.feature_extraction.text import CountVectorizer
#bag of words model
vectorizer = CountVectorizer()
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
 ]
# Perform word bag processing
X = vectorizer.fit_transform(corpus)
print(X.toarray())

result:

[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
#Get the feature words
print(vectorizer.get_feature_names())

result:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

As can be seen, the feature words are sorted alphabetically, and each word corresponds to one column of the matrix above.
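The column order can also be inspected through the fitted vocabulary_ attribute, which maps each feature word to its column index; a short sketch with the same corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# vocabulary_ maps each feature word to its column index in the count matrix;
# the indices follow the alphabetical order of the words
print(vectorizer.vocabulary_['and'])   # → 0 (first alphabetically)
print(vectorizer.vocabulary_['this'])  # → 8 (last alphabetically)
```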

# By default, extracted words must be at least 2 characters long
analyzer = vectorizer.build_analyzer()
print(analyzer("This is a text document to analyze."))

result:

['this', 'is', 'text', 'document', 'to', 'analyze']

The single-character word 'a' and the punctuation '.' have been filtered out by the default token_pattern.
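If single-character words should be kept, the token_pattern parameter can be overridden; a sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Replacing the default token_pattern r'(?u)\b\w\w+\b' with r'(?u)\b\w+\b'
# keeps single-character tokens such as 'a' as well
vec = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
analyzer = vec.build_analyzer()
print(analyzer("This is a text document to analyze."))
# → ['this', 'is', 'a', 'text', 'document', 'to', 'analyze']
```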
# process the new text
vectorizer_result = vectorizer.transform(['Something completely new.']).toarray()
print(vectorizer_result)

result:

[[0 0 0 0 0 0 0 0 0]]

None of the words in 'Something completely new.' appear in the fitted vocabulary, so the count vector is all zeros.
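Conversely, when a new text shares some words with the fitted vocabulary, only those shared words are counted; a sketch with the same corpus and a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# 'the', 'first', 'document', 'is' are in the vocabulary and get counted;
# the unseen word 'new' is silently ignored
print(vectorizer.transform(['the first document is new']).toarray())
# → [[0 1 1 1 0 0 1 0 0]]
```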
