(6) How word2vec works

Word2Vec has two training models: CBOW (Continuous Bag-of-Words Model) and Skip-gram (Continuous Skip-gram Model).

1. Steps (using CBOW as the example)

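(1) Process the corpus: split the corpus into a sequence of words, deduplicate them, and build the vocabulary word_to_ix, which maps each word to an integer index, i.e. word_to_ix = {word 1: 0, word 2: 1, ..., word n: n-1}.
(2) Build X and y for the CBOW model: with a context size of CONTEXT_SIZE = 2, the current word is y and the two words on each side of it form X. For the corpus "we are about to study NLP", this yields two samples (X, y): (x = we are to study, y = about) and (x = are about study NLP, y = to). In practice, every word in a sample is then mapped to its index in word_to_ix.
(3) Feed (X, y) into the network for training.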
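
Steps (1) and (2) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the training code from section 4: the helper name build_cbow_samples is made up for this example, and the vocabulary is sorted only so that the indices are reproducible.

```python
def build_cbow_samples(tokens, context_size=2):
    """Return (context, target) pairs. Hypothetical helper, for illustration only."""
    samples = []
    # Only positions with a full window on both sides produce a sample.
    for i in range(context_size, len(tokens) - context_size):
        left = tokens[i - context_size:i]
        right = tokens[i + 1:i + 1 + context_size]
        samples.append((left + right, tokens[i]))
    return samples


tokens = "we are about to study NLP".split()

# Step (1): deduplicate the tokens and give every word an integer index.
# Sorting is an assumption made here so that the mapping is deterministic.
word_to_ix = {word: i for i, word in enumerate(sorted(set(tokens)))}

# Step (2): the two words on each side form X, the middle word is y.
samples = build_cbow_samples(tokens, context_size=2)

# Map the words of each sample to their vocabulary indices.
indexed = [([word_to_ix[w] for w in ctx], word_to_ix[tgt]) for ctx, tgt in samples]

for ctx, tgt in samples:
    print(ctx, "->", tgt)
```

For this six-word corpus the loop visits only "about" and "to", which gives the two samples listed in step (2).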

2. Network structure

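As shown in Figure 1, the preprocessed data (X, y) first passes through an nn.Embedding(vocabulary_size, embedding_size) layer. Conceptually, the input is a vector $(w_1,w_2,...,w_n)$ whose dimension $n$ is the vocabulary size, and the output is a word vector $(x_1,x_2,...,x_m)$ of dimension $m$. The input vector is the word's one-hot form: 1 at the position of that word and 0 everywhere else. (nn.Embedding itself takes the word's integer index; the result is the same as multiplying the one-hot vector by the embedding weight matrix.)
The sample (x = we are to study, y = about) introduced earlier is really four records, (x = we, y = about), (x = are, y = about), (x = to, y = about), (x = study, y = about). The nn.Embedding(vocabulary_size, embedding_size) layer processes all four in a single call, and the four resulting vectors are then averaged, so the hidden layer outputs one vector rather than four.
The vector a word produces at the embedding layer (the output of the input layer, which is the input to the hidden layer) is that word's final representation: once the network is trained, feeding a word into the Embedding layer returns its vector. Why does this work? An informal analysis follows.
Take the two sentences 【我 | 非常 | 喜欢 | 学习 | 自然语言处理】 ("I really like studying NLP") and 【我 | 非常 | 爱 | 学习 | 自然语言处理】 ("I really love studying NLP"). In both, the input is the context 【我 | 非常 | 学习 | 自然语言处理】, and the outputs are 【喜欢】 (like) and 【爱】 (love). Let $a$, $c$, $d$ denote the context, 【喜欢】, and 【爱】 respectively, and read $w_1$ and $w_2$ as the weights of the embedding layer and the output layer. Then

$$\begin{aligned} a\cdot w_1 \cdot w_2 &= c \\ a\cdot w_1 \cdot w_2 &= d \end{aligned} \tag{1}$$

The outputs $c$ and $d$ are 【喜欢】 and 【爱】, and the goal is for these two words to end up with vector representations that are as similar as possible, i.e. $c \cdot w_1= d \cdot w_1$. The reasoning: the representations of $c$ and $d$ differ, while $a\cdot w_1 \cdot w_2$ is shared by both equations, so training has to adjust $w_1,w_2$ until the shared result is as close as possible to both targets. Suppose $c=0,d=1$; tuning until $a\cdot w_1 \cdot w_2=0.5$ brings it as close as possible to both. At that point $c \cdot w_1=a\cdot w_1 \cdot w_2 \cdot w_1$ and $d \cdot w_1=a\cdot w_1 \cdot w_2 \cdot w_1$, so the two results are close and the training goal is met. (Note: the equals signs above mean "approximately equal", not exact equality.)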
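
The two mechanical facts used in this section, that a one-hot input simply selects one row of the embedding matrix and that the context vectors are averaged into a single hidden vector, can be checked with a small sketch in plain Python. The matrix values and the helper names (one_hot, row_times_matrix, mean_vectors) are made up for illustration and do not come from the article's code.

```python
def one_hot(index, size):
    """Vector with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def row_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(cols)]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy embedding matrix: 5 words (rows) x 3 dimensions (columns), arbitrary values.
embedding = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

# One-hot vector times the matrix gives the same result as reading row 2.
word_index = 2
via_one_hot = row_times_matrix(one_hot(word_index, len(embedding)), embedding)
via_lookup = embedding[word_index]

# Four context words become a single hidden vector by averaging their rows.
context_indices = [0, 1, 3, 4]
hidden = mean_vectors([embedding[i] for i in context_indices])

print(via_one_hot, via_lookup, hidden)
```

This is why an embedding layer can work from integer indices alone: the multiplication by a one-hot vector never needs to be carried out.
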

Figure 1. Network structure
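
Written out in symbols, one way to express the structure in Figure 1 is given below. The notation is introduced here and is not taken from the original text: $E$ is the embedding matrix with $E_i$ its $i$-th row, $i_1,\dots,i_{2C}$ are the vocabulary indices of the context words, $C$ is CONTEXT_SIZE, and $W$, $b$ are the weights and bias of the output layer.

$$h = \frac{1}{2C}\sum_{j=1}^{2C} E_{i_j}, \qquad z = W h + b, \qquad P(y = k \mid X) = \frac{e^{z_k}}{\sum_{l=1}^{n} e^{z_l}}$$

Here $h$ is the averaged hidden vector, $z$ holds one score per vocabulary word, and the softmax turns those scores into a distribution over the $n$ words. Training minimizes the cross-entropy $-\log P(y \mid X)$ of the true center word.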

3. Skip-gram

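Skip-gram (Continuous Skip-gram Model) takes a word as the input and the words in its surrounding context as the output: given a word, the model tries to predict the words likely to appear around it. Training on a large corpus yields the input-to-hidden-layer weights. In other words, Skip-gram is CBOW with input and output swapped. Because the input is the word itself, a single word, the output of the nn.Embedding(vocabulary_size, embedding_size) layer does not need to be averaged. The output side looks like several outputs, one per context word, but it is handled like the CBOW input: each (word, context word) pair is a separate record, so the network produces one output vector at a time rather than several.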
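
The "input and output swapped" idea can be sketched by generating Skip-gram training pairs from the same toy corpus. This is a minimal illustration in plain Python; the helper name build_skipgram_pairs is hypothetical, and truncating the window at the corpus edges is a choice made here, not something the text specifies.

```python
def build_skipgram_pairs(tokens, context_size=2):
    """Return (center, context) pairs. Hypothetical helper, for illustration only."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the corpus boundaries (an assumption of this sketch).
        lo = max(0, i - context_size)
        hi = min(len(tokens), i + context_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "we are about to study NLP".split()
pairs = build_skipgram_pairs(tokens, context_size=2)

# Each pair is one training record: the center word is X, one context word is y.
about_pairs = [p for p in pairs if p[0] == "about"]
print(about_pairs)
```

Compared with the CBOW sample (x = we are to study, y = about), the same window now produces four records with "about" as the input.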

4. Code

import torch
import torch.autograd as autograd
import torch.nn as nn
import numpy as np
torch.manual_seed(1)


class CBOW(nn.Module):
    def __init__(self, embedding_size, corpus):
        super(CBOW, self).__init__()

        vocabulary = np.unique(np.array(corpus))
        vocabulary_size = vocabulary.shape[0]

        self.v_embedding = nn.Embedding(vocabulary_size, embedding_size)
        # Output layer.
        self.linear = nn.Linear(embedding_size, vocabulary_size)
        self.vocabulary_index = dict(zip(vocabulary, range(len(vocabulary))))

    def forward(self, x):
        idx = []
        for input_words in x:
            idx.append([self.vocabulary_index[w] for w in input_words])
        idx = torch.LongTensor(idx)
        temp=self.v_embedding(autograd.Variable(idx))
        linear_in =temp.mean(dim=1)
        return self.linear(linear_in)

    def det_row(self, words):
        temp=[self.vocabulary_index[w] for w in words]
        return autograd.Variable(
            torch.LongTensor(temp))

    def train_model(self, batch_size, X, Y, epochs=100):
        iterations = X.shape[0] // batch_size
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)

        for epoch in range(epochs):

            c = 0
            for i in range(iterations):
                x = X[c: c + batch_size]
                y = self.det_row(Y[c: c + batch_size])
                c += batch_size

                y_pred = self.forward(x)

                optimizer.zero_grad()
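                # y_pred holds raw scores (logits) of shape [batch_size, vocabulary_size];
                # CrossEntropyLoss applies the softmax itself, as in multi-class classification.
                loss = criterion(y_pred, y)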
                loss.backward()
                optimizer.step()

            # Report the last batch loss every 15 epochs.
            if epoch % 15 == 0:
                print(loss.item())

    def getwords(self,x):
        idx = []
        for input_words in x:
            idx.append([self.vocabulary_index[w] for w in input_words])
        idx = torch.LongTensor(idx)
        temp = self.v_embedding(autograd.Variable(idx))
        print(temp)

    def getword(self,x):
        idx = [self.vocabulary_index[x]]
        idx = torch.LongTensor(idx)
        temp = self.v_embedding(autograd.Variable(idx))
        print(temp)


if __name__ == '__main__':
    CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
    raw_text = """We are about to study the idea of a computational process. Computational processes are abstract
    beings that inhabit computers. As they evolve, processes manipulate other abstract
    things called data. The evolution of a process is directed by a pattern of rules
    called a program. People create programs to direct processes. In effect,
    we conjure the spirits of the computer with our spells.""".lower().split()
    word_to_ix = {word: i for i, word in enumerate(set(raw_text))}
    X = []
    Y = []
    for i in range(2, len(raw_text) - 2):
        context = [raw_text[i - 2], raw_text[i - 1], raw_text[i + 1], raw_text[i + 2]]
        target = raw_text[i]
        X.append(context)
        Y.append(target)

    X = np.array(X)
    Y = np.array(Y)

    model = CBOW(embedding_size=10,
                 corpus=raw_text)

    model.train_model(batch_size=1,
                      X=X,
                      Y=Y,
                      epochs=50)

    model.getword("are")
    #model.getwords(X[:2])
    pass
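
One way to check the claim from section 2, that words sharing the same contexts end up with similar vectors, is to compare two embeddings with cosine similarity. The sketch below uses plain Python lists and made-up vectors; the function name cosine_similarity and the three example vectors are introduced here for illustration. With the model above, real vectors could be read from model.v_embedding.weight, the weight tensor that nn.Embedding exposes.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors standing in for trained embeddings.
like = [0.9, 0.1, 0.4]
love = [0.8, 0.2, 0.5]
other = [-0.7, 0.6, -0.2]

print(cosine_similarity(like, love))   # close to 1: similar direction
print(cosine_similarity(like, other))  # negative: dissimilar direction
```

On a corpus as small as the one in this listing the learned similarities are unlikely to be meaningful, so a check like this is more informative after training on a much larger corpus.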
