word2vec 2

This post continues from the previous one.

We found that such a three-layer network has a huge number of weights, which makes training very expensive. So the authors made some modifications:

1. Treat common word pairs or phrases as a single "word" in the model. For example, the phrase "I wipe" carries a different meaning from the individual words "I" and "wipe" (a sketch of merging such bigrams follows this list).
2. Subsample frequent words to reduce the number of training examples.
3. Modify the optimization objective with a technique called "negative sampling", so that each training sample updates only a small fraction of the model's weights.
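As a rough sketch of the first modification, the snippet below merges frequently co-occurring bigrams into a single token. The scoring formula (count(a,b) - delta) / (count(a) * count(b)) and the delta/threshold values are assumptions borrowed from the word2vec phrase-detection heuristic; the post itself does not specify them.

```python
from collections import Counter

def find_phrases(sentences, delta=5, threshold=1e-4):
    """Merge adjacent word pairs that score highly into one token,
    e.g. ["new", "york"] -> ["new_york"]. The score
    (count(a,b) - delta) / (count(a) * count(b)) follows the word2vec
    phrase heuristic; delta discounts very rare pairs.
    threshold is chosen here purely for illustration."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))

    merged = []
    for sent in sentences:
        out, skip = [], False
        for a, b in zip(sent, sent[1:] + [None]):
            if skip:
                skip = False
                continue
            score = 0.0
            if b is not None and unigrams[a] and unigrams[b]:
                score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
            if score > threshold:
                out.append(f"{a}_{b}")   # treat the pair as one "word"
                skip = True
            else:
                out.append(a)
        merged.append(out)
    return merged
```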

Subsampling

There are two problems with a frequent word like "the":

1. When looking at word pairs, ("fox", "the") doesn't tell us much about the meaning of "fox", because "the" appears in the context of almost every word.
2. We will have far more samples of ("the", ...) than we need to learn a good vector for "the".

For every word we encounter in the training text, there is a chance that we delete it from the text, and the probability of deletion depends on the word's frequency: the more frequent a word is, the more likely it is to be removed, while infrequent words are almost always kept.
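A minimal sketch of how such frequency-based subsampling could be implemented. The keep-probability formula and the 0.001 sampling threshold are assumptions taken from the original word2vec code, not something stated in this post:

```python
import math
import random

def keep_probability(word_fraction, sample=1e-3):
    """Probability of keeping a word, given the fraction of the corpus it
    accounts for. Frequent words get a low keep probability; rare words
    are almost always kept. Formula follows the original word2vec code
    (assumed here)."""
    return (math.sqrt(word_fraction / sample) + 1) * (sample / word_fraction)

def subsample(tokens):
    """Randomly drop frequent words from a list of tokens."""
    counts = Counter(tokens) if (Counter := __import__("collections").Counter) else None
    total = len(tokens)
    return [t for t in tokens
            if random.random() < keep_probability(counts[t] / total)]
```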

 

Negative sampling

For each training sample, only a small fraction of the weights is updated, not all of them.

When we train the network on the word pair ("fox", "fast"), the "label" or "correct output" of the network is a one-hot vector: the output neuron corresponding to "fast" should output 1, and all of the other thousands of output neurons should output 0.

With negative sampling, we instead randomly select a small number of "negative" words (say 5) and update only their weights. (Here, a "negative" word is one for which we want the network to output 0.) We also update the weights of our "positive" word (in the current example, the word "fast").

The output layer of our model has a 300 x 10,000 weight matrix. So we only update the weights for our positive word ("fast"), plus the weights for the 5 other words we want to output 0. That is 6 output neurons and 1,800 weight values in total, only 0.06% of the 3 million weights in the output layer!

In the hidden layer, only the weights for the input word are updated (this is the case whether or not you use negative sampling).
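Putting this together, here is a rough numpy sketch of one skip-gram training step with negative sampling. The matrix names, learning rate, and word ids are illustrative assumptions; the point is that only the input word's row of the hidden-layer matrix and 6 rows of the output-layer matrix are touched:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim, k = 10_000, 300, 5              # 5 negative words per sample
W_in  = rng.normal(0, 0.01, (vocab_size, dim))   # hidden-layer ("input") weights
W_out = rng.normal(0, 0.01, (vocab_size, dim))   # output-layer weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, positive, negatives, lr=0.025):
    """One skip-gram update with negative sampling.
    Only W_in[center] and the k+1 rows of W_out for
    `positive` + `negatives` are changed."""
    h = W_in[center]                          # hidden activation = input word's row
    targets = np.array([positive, *negatives])
    labels  = np.array([1.0] + [0.0] * len(negatives))
    scores  = sigmoid(W_out[targets] @ h)     # (k+1,) predicted probabilities
    errors  = scores - labels                 # gradient of the logistic loss
    grad_h  = errors @ W_out[targets]         # accumulate before touching W_out
    W_out[targets] -= lr * errors[:, None] * h
    W_in[center]   -= lr * grad_h

# e.g. center word "fox", positive context "fast", 5 sampled negative word ids
train_pair(center=42, positive=7, negatives=rng.integers(0, vocab_size, size=5))
```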
