word2vec Study Notes

1. Introduction

word2vec is a simplified neural network model with only an input layer, a projection layer, and an output layer.
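
To make the three layers concrete, here is a minimal sketch of the naive CBOW forward pass with a full softmax at the output layer (the expensive step that the optimizations in section 3 replace). The toy sizes `V`, `d` and the matrices `W_in`, `W_out` are illustrative assumptions, not taken from the original post.

```python
import numpy as np

V, d = 10, 4                                  # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))     # input -> projection weights (the word vectors)
W_out = rng.normal(scale=0.1, size=(d, V))    # projection -> output weights

def forward(context_ids):
    """Naive CBOW forward pass: average the context vectors, then a full softmax."""
    x = W_in[context_ids].mean(axis=0)        # projection layer: mean of context word vectors
    scores = x @ W_out                        # output layer: one score per vocabulary word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                # softmax over the whole vocabulary

print(forward([1, 2, 4, 5]).shape)            # (V,) probability distribution
```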

2. Architectures

  • CBOW: predicts the current word from its context;

  • Skip-Gram: predicts the context from the current word.

   CBOW's computational cost is somewhat lower than Skip-Gram's; the difference can be seen from the objective functions of the two architectures below, and in the training-pair sketch right after this list.
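
The helper below (an illustrative sketch, not from the original post) shows where the cost difference comes from: CBOW builds one training example per position by pooling the whole context, while Skip-Gram emits one example per (center, context) pair.

```python
def make_pairs(tokens, window=2, mode="cbow"):
    """Illustrative (input, target) pair generation for a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if not context:
            continue
        if mode == "cbow":
            pairs.append((context, center))              # whole context -> one center word
        else:
            pairs.extend((center, c) for c in context)   # center word -> each context word
    return pairs

tokens = "the quick brown fox jumps".split()
print(make_pairs(tokens, mode="cbow")[:2])
print(make_pairs(tokens, mode="skipgram")[:4])
```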

3. Optimization methods

  •    Hierarchical Softmax (Huffman tree + logistic regression)

  •    Negative Sampling

3.1 Hierarchical Softmax (Huffman tree + logistic regression)

A Huffman tree is built from the word frequencies. In the formulas below, d_w^j is the Huffman code (0 or 1) of the j-th node on the path from the root to word w, l_w is the number of nodes on that path (so there are l_w - 1 binary decisions), \theta_w^{j-1} is the parameter vector of the (j-1)-th internal node on that path, x_w is the projection-layer vector obtained from Context(w), and \sigma(\cdot) is the sigmoid function.
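
As a rough sketch of where the tree and the codes d_w^j come from, the helper below builds a Huffman tree with `heapq` from toy word counts (the counts and the `huffman_codes` name are illustrative assumptions); real implementations also record the indices of the internal nodes on each path so the corresponding \theta vectors can be looked up.

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build a Huffman tree from {word: count} and return each word's 0/1 code."""
    tie = itertools.count()                   # tie-breaker so heapq never compares dicts
    heap = [(f, next(tie), {"word": w}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), {"left": left, "right": right}))
    codes = {}
    def walk(node, code):
        if "word" in node:                    # leaf: store the accumulated code d_w^j
            codes[node["word"]] = code or "0"
        else:                                 # internal node: one binary decision per level
            walk(node["left"], code + "0")
            walk(node["right"], code + "1")
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"the": 50, "cat": 10, "sat": 8, "mat": 3}))
```

Frequent words end up near the root, so their paths (and hence the number of sigmoid evaluations per prediction) are short.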

3.1.1 Objective function

\max L(\theta) \;\Rightarrow\; \max \log L(\theta) \;\Rightarrow\; \max l(\theta), \qquad l(\theta) = \log L(\theta)

3.1.2 CBOW objective function

L(\theta) = \prod_{w \in C} p(w \mid Context(w)) = \prod_{w \in C} \prod_{j=2}^{l_w} p(d_w^j \mid x_w, \theta_w^{j-1})

= \prod_{w \in C} \prod_{j=2}^{l_w} \left[ \sigma\left(x_w^\top \theta_w^{j-1}\right) \right]^{d_w^j} \cdot \left[ 1 - \sigma\left(x_w^\top \theta_w^{j-1}\right) \right]^{1 - d_w^j}

l(\theta) = \log L(\theta) = \sum_{w \in C} \sum_{j=2}^{l_w} \log p\left(d_w^j \mid x_w, \theta_w^{j-1}\right)

= \sum_{w \in C} \sum_{j=2}^{l_w} \log\left( \left[ \sigma\left(x_w^\top \theta_w^{j-1}\right) \right]^{d_w^j} \cdot \left[ 1 - \sigma\left(x_w^\top \theta_w^{j-1}\right) \right]^{1 - d_w^j} \right)

= \sum_{w \in C} \sum_{j=2}^{l_w} \left( d_w^j \cdot \log \sigma\left(x_w^\top \theta_w^{j-1}\right) + \left(1 - d_w^j\right) \cdot \log\left( 1 - \sigma\left(x_w^\top \theta_w^{j-1}\right) \right) \right)
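
The last line translates directly into code: for one training word, the log-probability is a sum of binary logistic terms along the word's Huffman path. The sketch below assumes the path's \theta vectors and the code are already known; the name `cbow_hs_logprob` and the toy values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_hs_logprob(x_w, path_thetas, code):
    """log p(w | Context(w)) for one word under hierarchical softmax.

    x_w         -- projection-layer vector (e.g. mean of context word vectors)
    path_thetas -- theta vectors of the internal nodes on w's path, shape (l_w - 1, d)
    code        -- Huffman code d_w^j of w as 0/1 integers, same length as path_thetas
    """
    logp = 0.0
    for theta, d in zip(path_thetas, code):
        p = sigmoid(x_w @ theta)
        logp += d * np.log(p) + (1 - d) * np.log(1 - p)   # one binary decision per node
    return logp

x_w = np.array([0.1, -0.2, 0.05])
thetas = np.array([[0.3, 0.1, -0.2], [0.0, 0.2, 0.1]])
print(cbow_hs_logprob(x_w, thetas, [1, 0]))
```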

3.1.3 Skip-Gram objective function

Here v_w denotes the word vector of the center word w; each context word u is predicted by walking its own Huffman path, so the per-path term is the same as in CBOW with v_w in place of x_w.

L(\theta) = \prod_{w \in C} p(Context(w) \mid w) = \prod_{w \in C} \prod_{u \in Context(w)} p(u \mid w)

= \prod_{w \in C} \prod_{u \in Context(w)} \prod_{j=2}^{l_u} \left[ \sigma\left(v_w^\top \theta_u^{j-1}\right) \right]^{d_u^j} \cdot \left[ 1 - \sigma\left(v_w^\top \theta_u^{j-1}\right) \right]^{1 - d_u^j}

l(\theta) = \log L(\theta) = \sum_{w \in C} \sum_{u \in Context(w)} \sum_{j=2}^{l_u} \log\left( \left[ \sigma\left(v_w^\top \theta_u^{j-1}\right) \right]^{d_u^j} \cdot \left[ 1 - \sigma\left(v_w^\top \theta_u^{j-1}\right) \right]^{1 - d_u^j} \right)

= \sum_{w \in C} \sum_{u \in Context(w)} \sum_{j=2}^{l_u} \left( d_u^j \cdot \log \sigma\left(v_w^\top \theta_u^{j-1}\right) + \left(1 - d_u^j\right) \cdot \log\left( 1 - \sigma\left(v_w^\top \theta_u^{j-1}\right) \right) \right)

3.2 Negative Sampling

Negative sampling rule: the probability that a word is drawn as a negative sample is positively correlated with its frequency in the corpus (uninformative high-frequency words such as "是" ("is") and "的" ("of") are removed first). In the word2vec implementation the sampling probability is proportional to the unigram frequency raised to the 3/4 power; a sketch of such a sampler is given below.
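
A minimal sketch of the sampler, assuming the common 3/4-power choice (toy counts and the `negative_sampler` name are illustrative; the original C implementation precomputes a large lookup table instead):

```python
import numpy as np

def negative_sampler(freqs, power=0.75, seed=0):
    """Return a function that draws negative samples ~ freq(w)**power."""
    words = list(freqs)
    probs = np.array([freqs[w] for w in words], dtype=float) ** power
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    def sample(k, exclude=None):
        out = []
        while len(out) < k:
            w = words[rng.choice(len(words), p=probs)]
            if w != exclude:                  # never return the positive word itself
                out.append(w)
        return out
    return sample

sample = negative_sampler({"the": 50, "cat": 10, "sat": 8, "mat": 3})
print(sample(3, exclude="cat"))
```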

3.2.1 Objective function

\max L(\theta) \;\Rightarrow\; \max \log L(\theta) \;\Rightarrow\; \max l(\theta), \qquad l(\theta) = \log L(\theta)

3.2.2 CBOW objective function

L(\theta) = \prod_{w \in C} \prod_{u \in \{w\} \cup NEG(w)} p(u \mid Context(w)) = \prod_{w \in C} \prod_{u \in \{w\} \cup NEG(w)} \left[ \sigma\left(x_w^\top \theta_u\right) \right]^{L_w^u} \cdot \left[ 1 - \sigma\left(x_w^\top \theta_u\right) \right]^{1 - L_w^u}

l(\theta) = \log L(\theta) = \sum_{w \in C} \sum_{u \in \{w\} \cup NEG(w)} \left( L_w^u \cdot \log \sigma\left(x_w^\top \theta_u\right) + \left(1 - L_w^u\right) \cdot \log\left( 1 - \sigma\left(x_w^\top \theta_u\right) \right) \right)

Here NEG(w) is the set of negative samples drawn for w, \theta_u is the output vector of word u, and the label L_w^u is 1 when u = w (the positive sample) and 0 otherwise.
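
For one training word, the objective above is the log-sigmoid of the positive score plus log(1 - sigmoid) of each negative score. A minimal sketch in the same notation (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_ns_logprob(x_w, theta_pos, theta_negs):
    """CBOW negative-sampling objective for one training word.

    theta_pos  -- output vector theta_u of the positive word u = w  (label L_w^u = 1)
    theta_negs -- output vectors of the words drawn from NEG(w)     (label L_w^u = 0)
    """
    logp = np.log(sigmoid(x_w @ theta_pos))
    for theta in theta_negs:
        logp += np.log(1.0 - sigmoid(x_w @ theta))
    return logp

x_w = np.array([0.1, -0.2, 0.05])
print(cbow_ns_logprob(x_w, np.array([0.3, 0.1, -0.2]),
                      [np.array([0.0, 0.2, 0.1]), np.array([-0.1, 0.0, 0.3])]))
```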

3.2.3 Skip-Gram objective function

Reference: https://blog.csdn.net/itplus/article/details/37969519


Reposted from blog.csdn.net/ziyue246/article/details/82222565