Text Representation in NLP -- BOW & Paragraph Vector

Bag of words model (BOW)

We know that one of the most direct ways of representing words is one-hot encoding, and the bag-of-words model (BOW) builds on it for text modeling. It is undeniably a very direct form of representation, but this method has two very big problems.
1. The word order in the text cannot be reflected. For example, the vectors of "I love you" and "You love me" in BOW are exactly the same, which is obviously unreasonable. Of course, the bag-of-n-grams model can moderately alleviate this problem, but at the same time it brings back high dimensionality and a feature matrix that is far too sparse (very many zeros).
2. The semantic similarity of words cannot be reflected, for example 'good' and 'nice' versus 'good' and 'bad'. The first pair is similar in meaning and the second is completely opposite, but one-hot encoding cannot reflect this at all. (The small sketch below illustrates both problems.)
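To make the two problems concrete, here is a minimal NumPy sketch (not from the original post) with a hypothetical toy vocabulary; the English sentences "dog bites man" / "man bites dog" play the same role as the "I love you" / "You love me" example above.

```python
# Minimal bag-of-words sketch; the toy vocabulary and sentences are hypothetical.
import numpy as np

vocab = ["dog", "bites", "man", "good", "nice", "bad"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def bow_vector(sentence):
    """Count how often each vocabulary word appears in the sentence."""
    vec = np.zeros(len(vocab), dtype=int)
    for word in sentence.lower().split():
        vec[word_to_idx[word]] += 1
    return vec

# Problem 1: word order is lost -- both sentences map to exactly the same vector.
print(bow_vector("dog bites man"))  # [1 1 1 0 0 0]
print(bow_vector("man bites dog"))  # [1 1 1 0 0 0]

# Problem 2: one-hot vectors are orthogonal, so "good" looks no more similar
# to "nice" than it does to "bad" -- both dot products are zero.
onehot = np.eye(len(vocab))
good, nice, bad = (onehot[word_to_idx[w]] for w in ("good", "nice", "bad"))
print(good @ nice, good @ bad)  # 0.0 0.0
```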

Paragraph Vector

Because of these drawbacks of BOW, two researchers at Google proposed the Paragraph Vector in their 2014 paper, Distributed Representations of Sentences and Documents. In essence, they use a low-dimensional dense vector to represent a paragraph or sentence. The specific method is called Doc2vec. Does it sound a bit like word2vec? It should, because word2vec was also invented by the same authors, and Doc2vec borrows from word2vec to a large extent.

Therefore, let me first review word2vec. By definition, word2vec is a family of related models used to generate word vectors; the two main ones are CBOW and skip-gram. It should be emphasized that although both models use a neural network, neither can be regarded as a deep neural network, because besides the input and output layers they use only a single hidden layer.
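For readers who just want to try the two models, here is a short, hedged sketch using gensim (assuming gensim >= 4.x is installed; the tiny corpus is made up). In gensim's API the sg flag selects the model: sg=0 for CBOW, sg=1 for skip-gram.

```python
# Quick practical sketch with gensim; the toy corpus below is hypothetical.
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["word", "vectors", "capture", "semantic", "similarity"],
    ["cbow", "and", "skip", "gram", "are", "shallow", "models"],
]

# sg=0 -> CBOW, sg=1 -> skip-gram; vector_size is the embedding dimension n.
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
sg_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["love"].shape)       # (50,) -- the learned word vector
print(sg_model.wv.most_similar("love"))  # nearest neighbours (noisy on a toy corpus)
```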

Okay, let's start with CBOW. First, the picture:
[Figure: CBOW model architecture]

In general, CBOW predicts a central word from the words surrounding it. The specific idea is as follows:
1. Suppose we use the m words before and the m words after the central word t to predict t. The first thing we need is the one-hot encoding of these 2m words.
Then we map these 2m words through a parameter matrix. This matrix is called the input word matrix; we denote it V, with shape (n, |V|), where n is the dimension of the word vector (embedding) and |V| is the size of the corpus vocabulary. Suppose that when predicting a certain central word we happen to use the i-th vocabulary word; then the (n, 1) vector obtained by mapping that word through V is its embedding. Let us think carefully about what this parameter matrix means. Our input is a one-hot encoding, say [0,0,0,1,0,...,0], with a 1 in the i-th position and 0 everywhere else. Multiplying such a vector by V essentially picks out the i-th column of V, and since this mapping is exactly how we obtain the word vector of the i-th word, we can understand V as the collection of word vectors for all |V| words in the corpus -- which is precisely what we are after! Training the neural network is the process of iteratively refining these word vectors, and what we ultimately want is this matrix after many iterations (a tiny numeric sketch of this column-selection trick follows). OK, back to the main thread...
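Here is a tiny NumPy sketch of that column-selection point (illustrative only; the sizes are hypothetical and the shapes follow the text, with V being (n, |V|)):

```python
# Multiplying V by a one-hot vector simply selects one column of V,
# i.e. that word's embedding.
import numpy as np

n, vocab_size = 4, 6                  # embedding dimension n, vocabulary size |V|
rng = np.random.default_rng(0)
V = rng.normal(size=(n, vocab_size))  # input word matrix (the embeddings we train)

i = 3                                 # pretend the context word is the i-th word
one_hot = np.zeros(vocab_size)
one_hot[i] = 1.0

embedding = V @ one_hot               # shape (n,)
assert np.allclose(embedding, V[:, i])  # identical to just reading column i
```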

2. After we have the one-hot encodings of the 2m words, we multiply each of them by V. Following the explanation of V above, this gives us the word vectors of those 2m words (of course, at the start of training these are not yet the final vectors).

3. We average the 2m word vectors to obtain t_hat as an approximation of t.
One thing worth mentioning is that this step rests on a very classic assumption: in a passage, words that appear close together tend to be related in meaning -- "You shall know a word by the company it keeps."
So we can approximate the central word t with the words near it. My personal thought here is that a weighted average might work even better, with words closer to the central word receiving more weight (a sketch of both variants follows).
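Below is one possible instantiation of that idea (purely my own illustration, not something from the paper): the plain CBOW average next to a distance-weighted variant where each context word's weight is the inverse of its offset from the central word.

```python
# Plain vs. distance-weighted averaging of the 2m context embeddings (illustrative).
import numpy as np

def context_average(context_vecs, offsets=None):
    """context_vecs: array of shape (2m, n); offsets: signed distances from the central word."""
    context_vecs = np.asarray(context_vecs, dtype=float)
    if offsets is None:
        return context_vecs.mean(axis=0)                      # plain CBOW average -> t_hat
    weights = 1.0 / np.abs(np.asarray(offsets, dtype=float))  # closer words get larger weight
    weights /= weights.sum()
    return weights @ context_vecs                             # weighted average

m, n = 2, 4
vecs = np.random.default_rng(1).normal(size=(2 * m, n))
t_hat = context_average(vecs)                                  # unweighted
t_hat_weighted = context_average(vecs, offsets=[-2, -1, 1, 2]) # distance-weighted
```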

4. Multiply the context vector t_hat obtained in this round of training by a matrix U to obtain a score vector z, z = U t_hat ∈ R^|V|.
Here U is the output word matrix, with shape (|V|, n), i.e., the same shape as the transpose of V. The point of this step is to take the dot product of t_hat with the output vector of every word in the corpus: the closer a word's semantics are to t, the higher the value in the corresponding dimension of z.

5. Applying the softmax function, we turn z into a probability for each dimension:
y_hat = softmax(z) ∈ R^|V| (a small sketch of steps 4-5 follows).
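A small NumPy sketch of steps 4-5 (sizes are hypothetical; shapes follow the text, with U being (|V|, n) and t_hat being (n,)):

```python
# Score t_hat against every vocabulary word, then softmax the scores.
import numpy as np

def softmax(z):
    z = z - z.max()       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

n, vocab_size = 4, 6
rng = np.random.default_rng(2)
U = rng.normal(size=(vocab_size, n))  # output word matrix
t_hat = rng.normal(size=n)            # averaged context vector from step 3

z = U @ t_hat                         # score vector: one dot product per vocabulary word
y_hat = softmax(z)                    # shape (|V|,), entries sum to 1
print(y_hat.round(3), y_hat.sum())
```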

6. Compare y_hat with the one-hot encoding of the true word t to see how well the model is doing.
OK, now we have a clear idea of how the model works, but to train it we of course need to define an objective function. Naturally we want to minimize the distance between y_hat and y, so which distance measure should we use? Here we use cross-entropy:
H(y_hat, y) = -Σ_{j=1}^{|V|} y_j * log(y_hat_j)
Considering the particular form of y -- only one of its dimensions equals 1 -- H(y_hat, y) can be further simplified to
H(y_hat, y) = -log(y_hat_i), where i is the index of the true central word,
and then we continuously optimize the word vector of each word through SGD (a complete one-step training sketch follows).
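Tying steps 1-6 together, here is a minimal single training step in NumPy (my own illustration, with hypothetical indices and sizes; no negative sampling or hierarchical softmax):

```python
# One CBOW training example: forward pass, cross-entropy loss, and one SGD update.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n, vocab_size, lr = 4, 6, 0.1
rng = np.random.default_rng(3)
V = rng.normal(scale=0.1, size=(n, vocab_size))   # input word matrix
U = rng.normal(scale=0.1, size=(vocab_size, n))   # output word matrix

context = [0, 1, 3, 4]   # indices of the 2m surrounding words
target = 2               # index of the central word t

# Forward pass (steps 1-5).
t_hat = V[:, context].mean(axis=1)   # average of the 2m context embeddings
y_hat = softmax(U @ t_hat)           # predicted distribution over the vocabulary
loss = -np.log(y_hat[target])        # cross-entropy with a one-hot target: -log(y_hat_i)

# Backward pass and one SGD update (step 6).
dz = y_hat.copy()
dz[target] -= 1.0                    # d(loss)/dz = y_hat - y
dU = np.outer(dz, t_hat)             # gradient for the output matrix
dt_hat = U.T @ dz                    # gradient flowing back into the averaged context
U -= lr * dU
V[:, context] -= lr * dt_hat[:, None] / len(context)  # spread evenly over the 2m words

print(f"loss on this example: {loss:.4f}")
```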

As for the skip-gram model, it is logically the opposite of CBOW: it predicts the surrounding words given the central word. The general idea is very similar to CBOW, so I won't expand on it here.

One more point on optimizing the CBOW and skip-gram models: a big problem is that normalizing the softmax requires summing over all |V| dimensions, which is a very large amount of computation. Two very famous optimization ideas address exactly this, namely negative sampling and hierarchical softmax. Interested readers can look at the articles listed in the references; I will not expand on them here (the snippet below only shows how to switch them on in practice).
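For completeness, here is how the two tricks are switched on in gensim (assuming gensim >= 4.x; this only flips flags and is not an implementation of either technique):

```python
# Negative sampling vs. hierarchical softmax via gensim flags; corpus is hypothetical.
from gensim.models import Word2Vec

sentences = [["just", "a", "tiny", "hypothetical", "corpus"]]

# Negative sampling: contrast the true word against a few randomly drawn "negatives".
w2v_neg = Word2Vec(sentences, vector_size=50, min_count=1, sg=0, negative=5, hs=0)

# Hierarchical softmax: replace the flat |V|-way softmax with a binary tree,
# so each prediction costs O(log |V|) instead of O(|V|).
w2v_hs = Word2Vec(sentences, vector_size=50, min_count=1, sg=0, negative=0, hs=1)
```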

So far I have used CBOW as an example to introduce the principle of word2vec; now let's move on to doc2vec. Like word2vec, doc2vec also comes with two training methods. One of them is called PV-DM (Distributed Memory Model of Paragraph Vectors), which is very similar to the CBOW described above. Let's take a look.
First, the figure from the original Distributed Representations of Sentences and Documents paper:
[Figure: PV-DM model architecture from the paper]
Compared with CBOW, the input layer of PV-DM contains not only word vectors but also a paragraph vector. Each unique paragraph is mapped to a paragraph vector through a matrix D, and each unique word is mapped to a word vector through a matrix W; the latter part is consistent with CBOW. Then, in the average/concatenate step, we combine not just the word vectors but also the paragraph vector, and use the resulting vector to predict our target word, as shown in the figure above.
OK, now let's think about this paragraph vector a little more carefully. It should not be understood merely as an additional word vector; it can also be understood as a memory. Specifically, in each training step we slide a window over the paragraph and only look at part of it, but the paragraph vector is shared across all of these windows and stands in for the rest of the paragraph. Over the many training steps on the same paragraph, the paragraph vector is trained again and again, and as training continues it captures the topic of the paragraph more and more accurately (a rough sketch of this input layer follows).
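A rough NumPy sketch of the PV-DM input layer (my own illustration with hypothetical sizes; here the vectors are averaged, though the paper also allows concatenation):

```python
# PV-DM input layer: combine the paragraph vector with the context word vectors.
import numpy as np

n, vocab_size, num_paragraphs = 4, 6, 3
rng = np.random.default_rng(4)
W = rng.normal(size=(n, vocab_size))      # word matrix, shared across all paragraphs
D = rng.normal(size=(n, num_paragraphs))  # paragraph matrix, one column per paragraph

paragraph_id = 1
context = [0, 2, 5]                       # word indices from the current sliding window

# CBOW would stop at W[:, context]; PV-DM also feeds in the paragraph's column of D,
# which acts as a "memory" of the paragraph's topic shared by all of its windows.
inputs = np.column_stack([D[:, paragraph_id:paragraph_id + 1], W[:, context]])
h = inputs.mean(axis=1)                   # combined hidden representation, shape (n,)
print(h.shape)
```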
Next, once training is complete (via an algorithm such as stochastic gradient descent, SGD), we can use the paragraph vector learned for each paragraph as a feature for downstream machine learning, for example to solve classification problems. The question, however, is how to obtain the paragraph vector of a paragraph that was not in the training set, such as test data. The idea here is to add a new column to the matrix D for the extra paragraph and train its value by gradient descent, while keeping the word vector matrix W and the softmax parameters at the tail of the network fixed (the gensim sketch below shows this in practice).
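In practice this whole workflow is available in gensim's Doc2Vec (assuming gensim >= 4.x; the toy documents are made up). In gensim's API dm=1 selects PV-DM and dm=0 selects PV-DBOW, and infer_vector does exactly what is described above: it freezes W and the softmax weights and runs gradient steps on a fresh paragraph vector only.

```python
# Training PV-DM with gensim and inferring a vector for an unseen paragraph.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "machine learning is fun",
    "deep learning builds on neural networks",
    "i enjoy natural language processing",
]
documents = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(raw_docs)]

model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, dm=1, epochs=40)

train_vec = model.dv[0]                                           # vector learned for doc 0
test_vec = model.infer_vector("natural language is fun".split())  # unseen paragraph
print(train_vec.shape, test_vec.shape)                            # (50,) (50,)
```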

Of course, just as word2vec contains CBOW and skip-gram, Doc2vec also contains two models. Besides the PV-DM in the example above, there is PV-DBOW (Distributed Bag of Words version of Paragraph Vector), which is similar to skip-gram; I will not expand on it in detail here. Interested readers can read the original authors' paper; the link is provided in the references.

So the above is my basic summary of text representation in NLP, from the basic bag-of-words model to word2vec and doc2vec, which introduce the idea of embeddings. Since I am currently working on NLP-related projects, this was also a chance to review and summarize the basic theory. I hope it helps anyone who needs it.

References:
1. Distributed Representations of Sentences and Documents -- https://cs.stanford.edu/~quocle/paragraph_vector.pdf
2. https://www.pianshen.com/article/9749374313/
