Word2Vec: from parameter explanation to hands-on practice

1, Word2Vec parameter explanation

Word2Vec is a model class provided by gensim; the name gensim is short for "generate similar".

This article assumes basic familiarity with word vectors. The parameters are:

from gensim.models import Word2Vec
# The values below are the defaults (gensim 3.x signature; gensim 4.0 renamed size to vector_size and iter to epochs)
Word2Vec(sentences=None,  # an iterable of tokenized sentences (lists of words); can be streamed from disk for a large corpus
        size=100,  # dimensionality of the word vectors
        alpha=0.025,  # initial learning rate
        window=5,  # maximum distance between the current word and the predicted word within a sentence
        min_count=5,  # words with a total frequency below this are ignored
        max_vocab_size=None,  # limits RAM while building the vocabulary by pruning infrequent words; None means no limit
        sample=0.001,  # threshold for randomly downsampling high-frequency words
        seed=1,  # random number seed
        workers=3,  # number of worker threads
        min_alpha=0.0001,  # the learning rate decays linearly to this value during training
        sg=0,  # training algorithm: sg=1 uses skip-gram, sg=0 uses CBOW
        hs=0,  # hs=1 uses hierarchical softmax; hs=0 with negative > 0 uses negative sampling
        negative=5,  # if > 0, negative sampling is used and this is the number of "noise words" drawn (usually 5-20); if 0, no negative sampling
        cbow_mean=1,  # 0 uses the sum of the context word vectors, 1 uses the mean; only applies to CBOW
        iter=5,  # number of iterations (epochs) over the corpus
        null_word=0,
        trim_rule=None,  # vocabulary trimming rule; None means only min_count is applied
        sorted_vocab=1,  # 1 sorts the vocabulary by descending frequency
        batch_words=10000,  # target number of words per batch passed to the worker threads
        compute_loss=False,  # if True, the training loss is computed and stored
        callbacks=())
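
To make min_count and window more concrete, here is a minimal pure-Python sketch of how they shape the training pairs for skip-gram (sg=1). It is not gensim's code: the helper names build_vocab and skipgram_pairs and the toy corpus are made up for illustration. The real implementation does more (for example, downsampling frequent words via sample), so treat this only as a rough picture.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def skipgram_pairs(sentence, vocab, window=5):
    """Return (center, context) pairs that lie within `window` positions."""
    # Assumption: rare words are dropped before the windows are formed.
    kept = [w for w in sentence if w in vocab]
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, kept[j]))
    return pairs


# Hypothetical toy corpus: "great" and "boring" appear only once each.
toy_corpus = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "boring"],
]

vocab = build_vocab(toy_corpus, min_count=2)
pairs = skipgram_pairs(toy_corpus[0], vocab, window=1)

print(sorted(vocab))  # ['movie', 'the', 'was']
print(pairs)
# [('the', 'movie'), ('movie', 'the'), ('movie', 'was'), ('was', 'movie')]
```

A larger window produces more pairs per word and tends to capture broader, more topical context, while a higher min_count shrinks the vocabulary and removes noisy rare words.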

2, Kaggle movie review hands-on practice

  • Import required modules
import pandas as pd
import numpy as np
from gensim.models import word2vec
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import nltk.data
import re

  • Training data details
Dataset: the Kaggle movie review sentiment analysis dataset (Bag of Words Meets Bags of Popcorn).
train = pd.read_csv('../Bag of Words Meets Bags of Popcorn/labeledTrainData.tsv/labeledTrainData.tsv',header=0,delimiter='\t',quoting=3)
print(train.head())  # the first 5 rows
print(train.tail())  # the last 5 rows

Output:

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
              id  sentiment                                             review
24995   "3453_3"          0  "It seems like more consideration has gone int...
24996   "5064_1"          0  "I don't believe they made this film. Complete...
24997  "10905_3"          0  "Guy is a loser. Can't get girls, needs to bui...
24998  "10194_3"          0  "This 30 minute documentary Buñuel made in the...
24999   "8478_8"          1  "I saw this movie as a child and it broke my h...
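
The double quotes that remain around the id and review values in the output above come from quoting=3. pandas interprets this argument according to the csv.QUOTE_* constants, and 3 is csv.QUOTE_NONE, so quote characters are kept as ordinary text. This matters here because the reviews themselves contain quotation marks. The effect can be sketched with the standard csv module alone; the one-row sample below is abridged from the output above and is only for illustration.

```python
import csv
import io

# One header line plus one abridged row, tab-separated like the Kaggle file.
sample = 'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff"\n'

# quoting=3 in the read_csv call corresponds to csv.QUOTE_NONE.
print(csv.QUOTE_NONE)  # 3

# QUOTE_NONE: quote characters are not special and stay in the field values.
rows_none = list(
    csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
)

# QUOTE_MINIMAL (the csv module default): surrounding quotes are stripped.
rows_minimal = list(
    csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_MINIMAL)
)

print(rows_none[1])     # ['"5814_8"', '1', '"With all this stuff"']
print(rows_minimal[1])  # ['5814_8', '1', 'With all this stuff']
```

With QUOTE_NONE the leftover quote characters have to be removed later, during text cleaning.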

