word2vec theory and practice

Guided reading

  • This article briefly introduces word2vec, a toolkit for obtaining word vectors that Google open-sourced in 2013, and briefly introduces its two training models (Skip-gram, CBOW) and two acceleration methods (Hierarchical Softmax, Negative Sampling).

1. word2vec

  • word2vec originated in the paper Tomas Mikolov published at ICLR in 2013, Efficient Estimation of Word Representations in Vector Space [1], together with open-source code that projects all words into a K-dimensional vector space, so that each word can be represented by a K-dimensional vector. Because it is concise and efficient, it has attracted widespread attention and is used in many NLP tasks to train the corresponding word vectors.

1、 Traditional word representation --- one-hot representation

  • This approach represents each word as a very long vector. The dimension of the vector is the size of the vocabulary; most entries are 0, and only one dimension has the value 1, which identifies the current word.

  • For example, with a vocabulary of five words, the word vectors would be [1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0], and [0,0,0,0,1]. Such a representation is simple, easy to understand, and easy to implement in code: looking up the corresponding index is all that is needed, and it can solve a considerable number of NLP problems. But it has a shortcoming: the word vectors are mutually independent. We all know that there are relationships between words, yet from these vectors we cannot tell whether two words are semantically similar. Moreover, if the vocabulary is very large, each word is a single 1 in a vast sea of 0s, and this high-dimensional, sparse representation can also lead to the curse of dimensionality. To solve these problems, there is a second way to represent word vectors.
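As a quick illustration, here is a minimal sketch of one-hot encoding over a made-up five-word vocabulary (the words and code below are illustrative only, not from the original toolkit):

```python
import numpy as np

# A made-up five-word vocabulary; in practice it is built from the corpus.
vocab = ["temperature", "has", "already", "started", "rising"]
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """All zeros except a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word2idx[word]] = 1.0
    return vec

print(one_hot("temperature"))                      # [1. 0. 0. 0. 0.]
# Any two different one-hot vectors are orthogonal, so their dot product
# (and cosine similarity) is 0 -- the encoding carries no notion of relatedness.
print(one_hot("temperature") @ one_hot("rising"))  # 0.0
```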

2、 Distributed representation --- word embedding

  • word2vec uses this method to represent words as vectors: through training, each word is mapped to a real-valued vector of a fixed, limited dimension K. With such a dense representation it is easy to compute the distance between vectors (Euclidean, cosine, etc.) and thereby judge the semantic similarity between words, which solves the problem of the one-hot representation above, where any two words are independent of each other.
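For instance, with dense K-dimensional vectors, similarity can be measured directly. A minimal sketch (the vectors here are random stand-ins, not trained embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two dense word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

K = 100                          # embedding dimension
rng = np.random.default_rng(0)
# Random stand-ins; real vectors would come out of word2vec training.
v_king, v_queen = rng.normal(size=K), rng.normal(size=K)

print(cosine_similarity(v_king, v_queen))   # a value in [-1, 1]
print(np.linalg.norm(v_king - v_queen))     # Euclidean distance also works
```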

  • However, distributed representation is not unique to word2vec. It was first proposed by Hinton in his 1986 paper Learning distributed representations of concepts [2]. Although that paper did not say that words in particular should use distributed representations, the idea took root, and after 2000 people gradually began to pay attention to it. The reason word2vec has had such a big impact is that it adopts a simplified model, which greatly improves training speed and makes word embedding (that is, the distributed representation of words) practical and applicable to many tasks.

2. Skip-Gram model and CBOW model

  • Let's first look at the structure diagram of the two models.

  • As can be seen from the CBOW and Skip-Gram figures above, both models contain a three-layer structure: an input layer, a projection layer, and an output layer. The CBOW model predicts the current word w(t) given its context, similar to a cloze test in reading comprehension; conversely, the Skip-Gram model predicts the context given the current word w(t).

  • For the CBOW and Skip-Gram models, word2vec provides two frameworks for training word vectors quickly and well: Hierarchical Softmax and Negative Sampling. These two acceleration methods are introduced below.
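To make the difference between the two models concrete, here is a small illustrative sketch (not the toolkit's code) of how training examples can be generated from a tokenized sentence with window size c:

```python
def make_training_pairs(tokens, c=2):
    """CBOW uses (context -> center word); Skip-Gram uses (center word -> each context word)."""
    cbow, skipgram = [], []
    for t, center in enumerate(tokens):
        context = tokens[max(0, t - c):t] + tokens[t + 1:t + 1 + c]
        if not context:
            continue
        cbow.append((context, center))            # predict w(t) from its context
        for ctx_word in context:
            skipgram.append((center, ctx_word))   # predict each context word from w(t)
    return cbow, skipgram

cbow, skipgram = make_training_pairs("the quick brown fox jumps".split(), c=2)
print(cbow[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```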

3. Negative Sampling

1、 What is Negative Sampling?

  • For example, suppose that in a training sample the central word is w and that its surrounding context contains 2c words, denoted context(w). Since this central word w really does co-occur with context(w), the pair forms a genuine positive example. Through negative sampling we draw neg (the number of negative samples) central words w_i, i = 1, 2, ..., neg, all different from w, so that context(w) and w_i form neg negative examples that do not actually occur. Using this one positive example and the neg negative examples, we perform binary logistic regression (which can be understood as a binary classification problem) and obtain the model parameters corresponding to each word w_i as well as the word vector of each word.
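A minimal sketch of the per-sample objective just described (names and shapes are assumptions for illustration, not the reference implementation): the positive pair gets label 1, the neg sampled pairs get label 0, and each pair is scored with a sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(h, v_pos, v_negs):
    """h: projection of context(w) (CBOW) or the center word vector (Skip-Gram);
    v_pos: output vector of the true word w (the positive example);
    v_negs: output vectors of the neg sampled words w_i (the negative examples).
    Returns the binary logistic-regression loss to minimize."""
    pos_term = -np.log(sigmoid(np.dot(h, v_pos)))        # label 1
    neg_term = -np.sum(np.log(sigmoid(-v_negs @ h)))     # labels 0
    return pos_term + neg_term

rng = np.random.default_rng(0)
K, neg = 100, 5
h = rng.normal(scale=0.1, size=K)
v_pos = rng.normal(scale=0.1, size=K)
v_negs = rng.normal(scale=0.1, size=(neg, K))
print(negative_sampling_loss(h, v_pos, v_negs))
```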

2、 How to do Negative Sampling?

  • Let's look at how to perform negative sampling to obtain the neg negative examples. The sampling method in word2vec is not complicated. If the vocabulary size is V, we divide a line segment of length 1 into V parts, one part per word in the vocabulary. Of course, the segments have different lengths: high-frequency words get long segments and low-frequency words get short segments (sampling follows word frequency, so the more often a word occurs, the greater its probability of being drawn as a negative sample). The segment length for each word w is determined by:
    - len(w) = count(w)^(3/4) / Σ_{u ∈ vocab} count(u)^(3/4)

  • Before sampling, we further divide the length-1 line segment into M equal parts, where M >> V. This guarantees that every word's segment is covered by several of the small intervals, and each of the M intervals falls inside the segment of some word (as shown in the figure below). When sampling, we only need to generate neg random positions among the M intervals; the words whose segments contain those positions are the sampled negative words.
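A sketch of the table-based sampler described above, assuming word counts are available (the 3/4 power and the table of M cells follow the description; the concrete M, the toy counts, and the function names are illustrative):

```python
import random

def build_unigram_table(word_counts, M=1_000_000, power=0.75):
    """Divide [0, 1) into segments proportional to count(w)^power and fill a
    table of M cells so that cell i maps to the word whose segment covers i/M."""
    words = list(word_counts)
    weights = [word_counts[w] ** power for w in words]
    total = sum(weights)
    table, cum, j = [], weights[0] / total, 0
    for i in range(M):
        table.append(words[j])
        if (i + 1) / M > cum and j < len(words) - 1:
            j += 1
            cum += weights[j] / total
    return table

def sample_negatives(table, center, neg=5):
    """Draw neg words from the table, skipping the center word itself."""
    samples = []
    while len(samples) < neg:
        w = table[random.randrange(len(table))]
        if w != center:
            samples.append(w)
    return samples

counts = {"the": 1000, "cat": 50, "sat": 40, "mat": 30, "zygote": 2}
table = build_unigram_table(counts, M=100_000)
print(sample_negatives(table, center="cat", neg=5))
```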

4. Hierarchical Softmax

  • As shown in the figure below, the network structure is very simple: it contains only three layers, an input layer, a projection layer, and an output layer.
  • From the input layer to the projection layer, all the input word vectors are summed and averaged. For example, if the input is three 4-dimensional word vectors (1,2,3,4), (9,6,11,8), (5,10,7,12), then the projected vector is (5,6,7,8). For the CBOW model the context word vectors are averaged in this way; for the Skip-Gram model, the projection simply passes the center word's vector through unchanged (a small sketch just below illustrates this step).
  • The final output layer is constructed as a Huffman tree; how to construct a Huffman tree is standard and not repeated here. The leaf nodes of the Huffman tree are all the words in the vocabulary, and the weight of each leaf is the number of times the word appears in the corpus, that is, its word frequency.
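A tiny sketch of the projection step, using the numbers from the example above (plain NumPy; names are illustrative):

```python
import numpy as np

# The three 4-dimensional context word vectors from the example.
context_vecs = np.array([[1, 2, 3, 4],
                         [9, 6, 11, 8],
                         [5, 10, 7, 12]], dtype=np.float32)

# CBOW: the projection is the (summed and) averaged context vector.
h_cbow = context_vecs.mean(axis=0)
print(h_cbow)  # [5. 6. 7. 8.]

# Skip-Gram: the projection simply passes the center word's vector through.
center_vec = np.array([2, 4, 6, 8], dtype=np.float32)
h_skipgram = center_vec
```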

  • Generally, after the Huffman tree is obtained, we Huffman-encode the leaf nodes. Since leaf nodes with high weight are closer to the root and leaf nodes with low weight are farther from it, high-weight nodes get shorter codes and low-weight nodes get longer codes. This makes the weighted path length of the tree the shortest, which also agrees with information theory: we want the more commonly used words (those with higher word frequency) to have shorter codes. The common convention is to code the left branch as 0 and the right branch as 1, but this is only a convention; word2vec uses the opposite coding rule, and stipulates that the weight of the left subtree is not less than the weight of the right subtree.
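For a rough picture of the construction step, here is a generic heap-based Huffman build over word frequencies (a sketch, not the word2vec source; it uses the usual left=0 / right=1 convention rather than word2vec's flipped one):

```python
import heapq
import itertools

def build_huffman_codes(word_counts):
    """Build a Huffman tree over word frequencies and return word -> code string."""
    counter = itertools.count()  # unique tie-breaker so heapq never compares dicts
    heap = [(freq, next(counter), {"word": w}) for w, freq in word_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), {"left": left, "right": right}))
    codes = {}
    def walk(node, code):
        if "word" in node:
            codes[node["word"]] = code or "0"
            return
        walk(node["left"], code + "0")
        walk(node["right"], code + "1")
    walk(heap[0][2], "")
    return codes

print(build_huffman_codes({"the": 1000, "cat": 50, "sat": 40, "mat": 30, "zygote": 2}))
# High-frequency words get shorter codes, e.g. "the" ends up nearest the root.
```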

  • How to "step through the Huffman tree"?
    • In word2vec, binary logistic regression is adopted: it is stipulated that walking into the left subtree means the negative class (Huffman code 1), and walking into the right subtree means the positive class (Huffman code 0). A sketch after this list shows how the probability of a word is assembled from these binary decisions.
  • What are the benefits of using Huffman trees?
    • First, because it is a binary tree, the amount of computation, previously proportional to V, now becomes proportional to log2(V).
    • Second, because high-frequency words sit close to the root of the Huffman tree, they can be reached in fewer steps, which is in line with our greedy optimization idea.
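Putting the two points above together, here is a minimal sketch of how the probability of one word is computed as a product of sigmoid decisions along its Huffman path (variable names and shapes are assumptions for illustration; code bit 0 is treated as the positive class, as stipulated above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(h, path_vectors, code):
    """h: projected vector (averaged context for CBOW, center word for Skip-Gram);
    path_vectors: one parameter vector per internal node on the word's path;
    code: the word's Huffman code bits, one binary decision per internal node.
    The cost per word is O(log2 V) instead of O(V)."""
    prob = 1.0
    for theta, bit in zip(path_vectors, code):
        p_positive = sigmoid(np.dot(h, theta))
        prob *= p_positive if bit == 0 else (1.0 - p_positive)
    return prob

rng = np.random.default_rng(0)
K = 100
h = rng.normal(scale=0.1, size=K)
path_vectors = rng.normal(scale=0.1, size=(3, K))  # a word at depth 3
print(hierarchical_softmax_prob(h, path_vectors, code=[0, 1, 0]))
```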

5. Demo

6. Experiment Results

1、 Word Similarity Performance

  • We tested on the English corpus enwiki-20150112_text.txt (12 GB), using the method and site (http://www.wordvectors.org/index.php) provided by the paper Community Evaluation and Exchange of Word Vectors at wordvectors.org [4] to calculate the similarity between words.
  • The results are shown in the figure below. Since the pytorch-version trains slowly and the demo is not yet complete, only the Cpp-version is tested and compared against the original word2vec source code (C-version). All of the comparative experiments above were completed under the same parameter settings.

  • parameter settings
    • model: skip-gram
    • loss: Negative Sampling
    • neg: 10
    • dim: 100
    • lr: 0.025
    • window size: 5
    • minCount: 10
    • iters: 5
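For reference only, here is a hedged sketch of roughly equivalent settings in the widely used gensim toolkit (gensim was not used in these experiments; parameter names follow gensim 4.x and the mapping is approximate):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens; a real run would stream enwiki text.
sentences = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]

model = Word2Vec(
    sentences,
    sg=1,               # model: skip-gram
    hs=0, negative=10,  # loss: Negative Sampling with neg = 10
    vector_size=100,    # dim: 100
    alpha=0.025,        # lr: 0.025
    window=5,           # window size: 5
    min_count=1,        # the experiments used minCount = 10; lowered so the toy corpus is not filtered out
    epochs=5,           # iters: 5
)
print(model.wv["fox"].shape)  # (100,)
```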

  • The above results show that the word vectors trained by the Cpp-version achieve the same performance as those trained by the C-version, and are even slightly better.

2、 Training Time Performance

  • Since the above experiments were carried out on different servers with different numbers of threads, their training times are not compared directly. For convenience and speed of testing, about 1 GB of enwiki-20150112_text.txt was taken and the two word vectors were retrained to compare the training time; the figure below shows the experimental result.

  • It can be seen from the above experimental results that the training time of the Cpp-version is not much different from that of the C-version.

References

[1] Efficient Estimation of Word Representations in Vector Space
[2] Learning distributed representations of concepts
[3] Distributed Representations of Words and Phrases and their Compositionality
[4] Community Evaluation and Exchange of Word Vectors at wordvectors.org
[5] https://blog.csdn.net/itplus/article/details/37998797 (detailed explanation of the mathematics behind word2vec, in Chinese)
[6] http://www.cnblogs.com/pinard/p/7249903.html (word2vec principles, in Chinese)
