Word Vectors Explained (1)

In NLP we want to represent each word as a vector. There are several ways to do this.

1 One-hot Vector

Represent every word as an $\mathbb{R}^{|\it{V}|*1}$ vector that is all 0s except for a single 1 at the index of that word in the sorted English vocabulary, where $V$ is the vocabulary.
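
As a minimal sketch of this encoding, the snippet below builds a one-hot vector over a tiny vocabulary. The function name `one_hot` and the five-word vocabulary are assumptions made only for illustration; a real vocabulary would be far larger.

```python
def one_hot(word, vocab):
    """Return the one-hot vector for `word` as a list of 0s with a single 1."""
    sorted_vocab = sorted(vocab)       # fix an ordering so each word has an index
    index = sorted_vocab.index(word)   # raises ValueError for unknown words
    return [1 if i == index else 0 for i in range(len(sorted_vocab))]


# Hypothetical toy vocabulary, chosen only for illustration.
vocab = {"cat", "jumped", "over", "puddle", "the"}
print(one_hot("jumped", vocab))  # [0, 1, 0, 0, 0]
```

The vector length equals the vocabulary size, which is why this representation becomes very large for a real language.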

2 SVD Based Methods

2.1 Window based Co-occurrence Matrix

We represent a word by means of its neighbors.
In this method we count the number of times each word appears inside a window of a particular size around the word of interest.

For example:
(Figure: example of a window-based co-occurrence matrix; image not available.)
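
The counting procedure can be sketched in a few lines of Python. The corpus, the window size of 1, and the helper name `cooccurrence_counts` are all assumptions for illustration, not part of the original notes.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=1):
    """Count how often each (word, neighbor) pair occurs within `window` positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own neighbor
                    counts[(word, tokens[j])] += 1
    return counts


# Hypothetical toy corpus, chosen only for illustration.
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
counts = cooccurrence_counts(corpus, window=1)

# Lay the counts out as a |V| x |V| matrix with rows and columns in sorted order.
vocab = sorted({w for s in corpus for w in s.lower().split()})
matrix = [[counts[(row, col)] for col in vocab] for row in vocab]
for word, row in zip(vocab, matrix):
    print(f"{word:>8}", row)
```

Because the window is symmetric, the resulting matrix is symmetric as well.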

The matrix is too large to use directly, so we reduce its dimensionality with SVD:

1. Generate the $|\it{V}|*|\it{V}|$ co-occurrence matrix $X$.
2. Apply SVD to $X$ to get $X = USV^T$.
3. Select the first $k$ columns of $U$ to get $k$-dimensional word vectors.

The ratio $\frac{\sum^k_{i=1}\sigma_i}{\sum^{|\it{V}|}_{i=1}\sigma_i}$ indicates the amount of variance captured by the first $k$ dimensions.
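
To make the ratio concrete, here is a small sketch that evaluates it for a given list of singular values. The values in `sigma` are made up for illustration and do not come from a real co-occurrence matrix; `captured_ratio` is a hypothetical helper name.

```python
def captured_ratio(singular_values, k):
    """Sum of the k largest singular values divided by the sum of all of them."""
    ordered = sorted(singular_values, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical singular values, chosen only for illustration.
sigma = [4.0, 3.0, 2.0, 1.0]
print(captured_ratio(sigma, 2))  # 0.7
```

One way to choose $k$ is to pick the smallest value for which this ratio exceeds a threshold you are comfortable with.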

2.2 Shortcomings

SVD-based methods do not scale well to large matrices, and it is hard to incorporate new words or documents. The computational cost for an $m*n$ matrix is $O(mn^2)$ (when $n<m$).

3 Iteration Based Methods - Word2Vec

3.1 Language Models (Unigrams, Bigrams, etc.)

We need a model that assigns a probability to a sequence of tokens.

For example:
* The cat jumped over the puddle. (high probability)
* Stock boil fish is toy. (low probability)

Unigrams:
We can take the unigram language model approach and break apart this probability by assuming the word occurrences are completely independent:

$P(w_1,w_2,...,w_n)=\prod^n_{i=1}P(w_i)$

However, we know that the next word is highly contingent upon the previous sequence of words, so this independence assumption makes for a poor model.
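
A minimal sketch of the unigram factorization, assuming word probabilities are estimated as relative frequencies in a tiny made-up corpus (the corpus and the function name `unigram_probability` are illustrative assumptions). It also shows the weakness of the model: reordering the words does not change the probability.

```python
from collections import Counter


def unigram_probability(sentence, counts, total):
    """P(w_1, ..., w_n) as the product of independent word probabilities."""
    prob = 1.0
    for word in sentence:
        prob *= counts[word] / total
    return prob


# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat jumped over the puddle".split()
counts = Counter(corpus)
total = len(corpus)

sentence = ["the", "cat", "jumped"]
shuffled = ["jumped", "the", "cat"]
print(unigram_probability(sentence, counts, total))
print(unigram_probability(shuffled, counts, total))  # same value up to rounding
```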

Bigrams:
We let the probability of the sequence depend on the pairwise probability of each word in the sequence given the word before it:

$P(w_1,w_2,...,w_n)=\prod^n_{i=2}P(w_i|w_{i-1})$
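
The bigram product can be sketched the same way. This assumes $P(w_i|w_{i-1})$ is estimated from raw counts with no smoothing, which is an assumption of this example rather than something the notes specify; the corpus and helper name are likewise illustrative.

```python
from collections import Counter


def bigram_probability(sentence, history_counts, bigram_counts):
    """Product over i = 2..n of P(w_i | w_{i-1}), estimated from raw counts."""
    prob = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= bigram_counts[(prev, word)] / history_counts[prev]
    return prob


# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat jumped over the puddle".split()
history_counts = Counter(corpus[:-1])             # words that have a successor
bigram_counts = Counter(zip(corpus, corpus[1:]))  # adjacent word pairs

print(bigram_probability(["the", "cat", "jumped"], history_counts, bigram_counts))  # 0.5
print(bigram_probability(["cat", "the", "jumped"], history_counts, bigram_counts))  # 0.0
```

Unlike the unigram model, word order now matters: the reordered sequence contains a pair that never occurs in the corpus.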

3.2 Continuous Bag of Words Model (CBOW)

Example Sequence:
“The cat jumped over the puddle.”

What is the Continuous Bag of Words Model?
We treat {“the”, “cat”, “over”, “puddle”} as the context and the word “jumped” as the center word. The context should be able to predict the center word. We call this type of model a Continuous Bag of Words Model.

Known parameters:
If the index of the center word is $c$, then the indexes of the context words are $\{c-m, ..., c-1, c+1, ..., c+m\}$.
The input of the model is the one-hot vectors of the context words, which we write as $x^{(c-m)},...,x^{(c-1)},x^{(c+1)},...,x^{(c+m)}$.
The output is the one-hot vector of the center word, which we write as $y^{(c)}$.

Parameters we need to learn:
$\cal{V} \in \mathbb{R}^{n*|\it{V}|}$ : Input word matrix
$v_i$ : i-th column of $\cal{V}$ , the input vector representation of word $w_i$
$\cal{U}\in \mathbb{R}^{|\it{V}|*n}$ : Output word matrix
$u_i$ : i-th row of $\cal{U}$ , the output vector representation of word $w_i$
Here $n$ is an arbitrary size that defines the dimension of our embedding space.

How does it work:
1. We get our embedded word vectors for the context:

$v_{c-m}=\cal{V}x^{(c-m)},v_{c-m+1}=\cal{V}x^{(c-m+1)},...,v_{c+m}=\cal{V}x^{(c+m)}$
2. Average these vectors:

$\widehat{v}=\frac{v_{c-m}+v_{c-m+1}+...+v_{c+m}}{2m}$
3. Generate a score vector:

$z=\cal{U}\widehat{v}$. Because the dot product of similar vectors is higher, training pushes similar words close to each other in order to achieve a high score.
4. Turn the scores into probabilities:

$\widehat{y}=\mathrm{softmax}(z)\in\mathbb{R}^{|\it{V}|}$
5. We want the generated probabilities $\widehat{y}$ to match the true probabilities $y^{(c)}$.
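
Steps 1 to 4 can be sketched as a forward pass in plain Python. The matrices below are small hand-picked numbers (a vocabulary of 4 words and embedding size $n=2$), and the helper names `matvec`, `softmax` and `cbow_forward` are assumptions made for this illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn scores into probabilities; subtracting the max avoids overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, V, U):
    """Look up the context vectors, average them, score, and apply softmax.

    V is n x |V| (column i is v_i); U is |V| x n (row i is u_i).
    """
    n = len(V)
    # Step 1: multiplying V by a one-hot vector just selects one column of V.
    context_vectors = [[V[r][j] for r in range(n)] for j in context_ids]
    # Step 2: average the context vectors.
    v_hat = [sum(vec[r] for vec in context_vectors) / len(context_vectors)
             for r in range(n)]
    # Step 3: score vector z = U v_hat.
    z = matvec(U, v_hat)
    # Step 4: probabilities y_hat = softmax(z).
    return softmax(z)


# Hypothetical parameters, chosen only for illustration.
V = [[0.1, 0.4, -0.2, 0.3],
     [0.0, -0.1, 0.5, 0.2]]
U = [[0.2, 0.1],
     [-0.3, 0.4],
     [0.5, -0.2],
     [0.1, 0.1]]

y_hat = cbow_forward([0, 2], V, U)  # context = words 0 and 2, so m = 1
print(y_hat)
```

The output has one probability per vocabulary word, and training tries to move the mass toward the index of the true center word (step 5).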

How to learn $\cal{U},\cal{V}$ :
We learn them with stochastic gradient descent, so we need a loss function.
We use cross-entropy to measure the distance between the two distributions:

$H(\widehat{y},y)=-\sum^{|\it{V}|}_{i=1}y_i\log(\widehat{y}_i)$
Since $y$ is a one-hot vector, this simplifies to:

$H(\widehat{y},y)=-y_i\log(\widehat{y}_i)=-\log(\widehat{y}_i)$

where $i$ is the index at which $y$ is 1, i.e. the index of the correct word.
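
A short sketch of the cross-entropy for a one-hot target, using made-up probabilities (the values in `y_hat` and the function name `cross_entropy` are illustrative assumptions):

```python
import math


def cross_entropy(y_hat, y):
    """H(y_hat, y) = -sum_i y_i * log(y_hat_i); terms with y_i = 0 contribute nothing."""
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0)


y = [0, 1, 0]            # one-hot target: the correct word has index 1
y_hat = [0.2, 0.7, 0.1]  # hypothetical predicted distribution
loss = cross_entropy(y_hat, y)
print(loss, -math.log(0.7))  # both are -log(0.7)
```

Only the probability assigned to the correct word matters: a perfect prediction gives a loss of 0, and the loss grows as that probability shrinks.
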
We formulate our optimization objective as:

$\begin{split} \rm{minimize} \ \it{J} &= -\log P(w_c|w_{c-m},...,w_{c+m}) \\ &= -\log P(u_c|\widehat{v}) \\ &= -\log \frac{\exp(u^T_c\widehat{v})}{\sum_{j=1}^{|V|}\exp(u^T_j\widehat{v})} \\ &= -u^T_c\widehat{v} + \log\sum_{j=1}^{|V|}\exp(u^T_j\widehat{v}) \end{split}$

We use stochastic gradient descent to update $\cal{V,U}$ .
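
As a rough sketch of one such update, the code below evaluates $J$ in its final form and takes a single gradient step on the rows of $\cal{U}$. It relies on the gradient $\partial J/\partial u_j=(\widehat{y}_j-y_j)\widehat{v}$, which follows from differentiating the last line of $J$ but is not derived in these notes. The numbers, the learning rate and the function names are assumptions for illustration, and the corresponding update of $\cal{V}$ is omitted for brevity.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def loss(U, v_hat, c):
    """J = -u_c^T v_hat + log sum_j exp(u_j^T v_hat), computed stably."""
    scores = [dot(u, v_hat) for u in U]
    top = max(scores)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return -scores[c] + log_sum_exp


def sgd_step_on_U(U, v_hat, c, learning_rate):
    """One gradient step on every row u_j, using dJ/du_j = (y_hat_j - y_j) * v_hat."""
    scores = [dot(u, v_hat) for u in U]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    y_hat = [e / total for e in exps]
    new_U = []
    for j, u in enumerate(U):
        error = y_hat[j] - (1.0 if j == c else 0.0)
        new_U.append([w - learning_rate * error * x for w, x in zip(u, v_hat)])
    return new_U


# Hypothetical values, chosen only for illustration.
U = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2]]
v_hat = [0.5, -1.0]
c = 1  # index of the true center word

before = loss(U, v_hat, c)
U = sgd_step_on_U(U, v_hat, c, learning_rate=0.1)
after = loss(U, v_hat, c)
print(before, after)  # the loss should go down after the step
```

In practice the same step is repeated over many (context, center word) pairs sampled from the corpus, updating both matrices each time.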