Factorization Machines ---- Reading Notes and Derivations for the FM Paper

Introduction

In collaborative-filtering-like settings, SVMs underperform models that factorize the interaction matrix directly, such as PARAFAC.

Why:
Because under heavily sparse data, an SVM cannot learn a reliable hyperplane in a complex kernel space.

Advantages of FM:

  1. It can estimate parameters under high-dimensional sparse data (where SVMs struggle).
  2. It can model interactions between variables.
  3. It has linear computation time and a linear number of parameters.
  4. It works with any real-valued feature vector (other factorization models impose strict requirements on the input data).

Prediction under sparsity

The most common CTR setting: given a training set
$$D=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots\}$$

estimate a function
$$y: \mathbb{R}^n \to T$$

that maps a feature vector $x \in \mathbb{R}^n$ to the target domain $T$.
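As a concrete illustration of such sparse inputs, here is a minimal sketch (the field layout below is hypothetical, not from the paper) that one-hot encodes a (user, ad) pair together with one real-valued feature into a single vector $x \in \mathbb{R}^n$:

```python
import numpy as np

# Hypothetical layout (not from the paper): 3 users, 4 ads, and one
# real-valued feature (normalized hour of day).  n = 3 + 4 + 1 = 8.
n_users, n_ads = 3, 4
n = n_users + n_ads + 1

def encode(user_id, ad_id, hour):
    """One-hot encode (user, ad) and append a real-valued feature."""
    x = np.zeros(n)
    x[user_id] = 1.0              # user one-hot block
    x[n_users + ad_id] = 1.0      # ad one-hot block
    x[-1] = hour / 24.0           # real-valued feature
    return x

print(encode(user_id=1, ad_id=2, hour=18))
# [0. 1. 0. 0. 0. 1. 0. 0.75] -- only 3 of the 8 entries are non-zero
```

Almost every entry is zero, which is exactly the sparsity regime FM is designed for.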

Factorization Machine Model

Definition

$$\hat{y}(\boldsymbol{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \hat{w}_{i,j}\, x_i x_j$$
which can be rewritten as:
$$\hat{y}(\boldsymbol{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle\, x_i x_j$$
where:

(1)
$$w_0 \in \mathbb{R}, \quad \boldsymbol{w} \in \mathbb{R}^n, \quad \boldsymbol{V} \in \mathbb{R}^{n \times k}$$
(2) $\langle \cdot,\cdot \rangle$ is the dot product of two $k$-dimensional vectors ($k$ is a hyperparameter):
$$\langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle := \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}$$
Because in practice there is usually not enough data to estimate the full $\hat{W}$, $k$ is chosen to be small.

(3)
$$\hat{w}_{i,j} := \langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle$$
models the interaction between the $i$-th and $j$-th variables. Since for any positive definite matrix $\boldsymbol{W}$ there exists a matrix $\boldsymbol{V}$ such that $\boldsymbol{W} = \boldsymbol{V}\boldsymbol{V}^{\mathrm{T}}$, the factorized $\boldsymbol{V}$ is used in place of $\boldsymbol{W}$.
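A minimal sketch of the model equation as written (the function and variable names are my own, not from the paper), evaluating the pairwise term naively in $O(kn^2)$; the derivation below will reduce this to $O(kn)$:

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """FM prediction, a literal translation of the model equation.
    x: (n,) feature vector; w0: scalar bias; w: (n,) linear weights;
    V: (n, k) factor matrix whose row i is v_i.  Cost: O(k * n^2)."""
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += V[i] @ V[j] * x[i] * x[j]   # <v_i, v_j> x_i x_j
    return y
```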

Derivation

Under very sparse data, most of the feature products $x_i x_j$ are zero, so $\hat{W}$ is hard to estimate directly. We therefore introduce auxiliary vectors $\boldsymbol{v}_i = (v_{i1}, v_{i2}, \dots, v_{ik})$:
$$V = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1k} \\ v_{21} & v_{22} & \cdots & v_{2k} \\ \vdots & \vdots & & \vdots \\ v_{n1} & v_{n2} & \cdots & v_{nk} \end{pmatrix}_{n \times k} = \begin{pmatrix} \boldsymbol{v}_{1} \\ \boldsymbol{v}_{2} \\ \vdots \\ \boldsymbol{v}_{n} \end{pmatrix}$$
so that:
$$\hat{W} = \boldsymbol{V}\boldsymbol{V}^{\mathrm{T}} = \begin{pmatrix} \boldsymbol{v}_{1} \\ \boldsymbol{v}_{2} \\ \vdots \\ \boldsymbol{v}_{n} \end{pmatrix}\begin{pmatrix} \boldsymbol{v}_1^{\mathrm{T}} & \boldsymbol{v}_2^{\mathrm{T}} & \cdots & \boldsymbol{v}_n^{\mathrm{T}} \end{pmatrix}$$
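To see the claim behind this substitution, a quick numpy check (illustrative only, random data) that $\boldsymbol{V}\boldsymbol{V}^{\mathrm{T}}$ is always symmetric and positive semidefinite, and that its $(i,j)$ entry is exactly $\langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
V = rng.normal(size=(n, k))   # factor matrix, row i is v_i
W = V @ V.T                   # W_hat = V V^T, the implied interaction matrix

assert np.allclose(W, W.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(W) >= -1e-10)  # positive semidefinite
print(W[1, 2], V[1] @ V[2])                     # W_hat[i,j] == <v_i, v_j>
```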

Computing $\sum_{i=1}^n\sum_{j=i+1}^n \langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle x_i x_j$

Consider the $n \times n$ matrix whose $(i,j)$ entry is $\langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle x_i x_j$:
$$\begin{pmatrix} \langle v_1,v_1\rangle x_1x_1 & {\color{Red} \langle v_1,v_2\rangle x_1x_2} & {\color{Red} \langle v_1,v_3\rangle x_1x_3} & \cdots & {\color{Red} \langle v_1,v_n\rangle x_1x_n} \\ \langle v_2,v_1\rangle x_2x_1 & \langle v_2,v_2\rangle x_2x_2 & {\color{Red} \langle v_2,v_3\rangle x_2x_3} & \cdots & {\color{Red} \langle v_2,v_n\rangle x_2x_n} \\ \vdots & \vdots & \vdots & & \vdots \\ \langle v_n,v_1\rangle x_nx_1 & \langle v_n,v_2\rangle x_nx_2 & \langle v_n,v_3\rangle x_nx_3 & \cdots & \langle v_n,v_n\rangle x_nx_n \end{pmatrix}_{n \times n}$$

The target sum $\sum_{i=1}^n\sum_{j=i+1}^n \langle v_i,v_j \rangle x_i x_j$ is the strict upper triangle of this real symmetric matrix (the red entries, excluding the main diagonal).

Let $A=\sum_{i=1}^n\sum_{j=i+1}^n \langle v_i,v_j \rangle x_i x_j$ denote that upper-triangular sum. By symmetry:
$$2A + \sum_{i=1}^n \langle v_i,v_i \rangle x_i x_i = \sum_{i=1}^n\sum_{j=1}^n \langle v_i,v_j \rangle x_i x_j$$
$$2A = \sum_{i=1}^n\sum_{j=1}^n \langle v_i,v_j \rangle x_i x_j - \sum_{i=1}^n \langle v_i,v_i \rangle x_i x_i$$
$$A = \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \langle v_i,v_j \rangle x_i x_j - \frac{1}{2}\sum_{i=1}^n \langle v_i,v_i \rangle x_i x_i$$

Therefore:
$$\begin{aligned} \sum_{i=1}^n\sum_{j=i+1}^n \langle v_i,v_j \rangle x_i x_j &= \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \langle v_i,v_j \rangle x_i x_j - \frac{1}{2}\sum_{i=1}^n \langle v_i,v_i \rangle x_i x_i \\ &= \frac{1}{2}\left(\sum_{i=1}^n\sum_{j=1}^n\sum_{f=1}^k v_{i,f}\,v_{j,f}\,x_i x_j - \sum_{i=1}^n\sum_{f=1}^k v_{i,f}\,v_{i,f}\,x_i x_i\right) \\ &= \frac{1}{2}\sum_{f=1}^k\left(\left(\sum_{i=1}^n v_{i,f}x_i\right)\left(\sum_{j=1}^n v_{j,f}x_j\right) - \sum_{i=1}^n v_{i,f}^2 x_i^2\right) \\ &= \frac{1}{2}\sum_{f=1}^k\left(\left(\sum_{i=1}^n v_{i,f}x_i\right)^2 - \sum_{i=1}^n v_{i,f}^2 x_i^2\right) \end{aligned}$$
which reduces the cost of the pairwise term from $O(kn^2)$ to $O(kn)$.
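A short sketch (names and random test data are my own) verifying that the $O(kn)$ reformulation agrees with the naive pairwise sum:

```python
import numpy as np

def pairwise_naive(x, V):
    """Sum over i<j of <v_i, v_j> x_i x_j, computed directly -- O(k n^2)."""
    n = len(x)
    return sum(V[i] @ V[j] * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))

def pairwise_fast(x, V):
    """Same quantity via the derivation above -- O(k n)."""
    s = V.T @ x   # per-factor sums: s[f] = sum_i v_{i,f} x_i, shape (k,)
    return 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())

rng = np.random.default_rng(1)
x, V = rng.normal(size=10), rng.normal(size=(10, 4))
assert np.isclose(pairwise_naive(x, V), pairwise_fast(x, V))
```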

The model is trained with SGD; the gradients are:
$$\frac{\partial }{\partial \theta}\hat{y}(\boldsymbol{x})=\begin{cases} 1 & \text{ if } \theta= w_0\\ x_i & \text{ if } \theta= w_i\\ x_i\sum_{j=1}^n v_{j,f}x_j - v_{i,f}x_i^2 & \text{ if } \theta= v_{i,f} \end{cases}$$
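Putting the pieces together, a minimal SGD step for squared loss (the loss choice and learning rate are my assumptions for illustration; the paper also covers other losses and L2 regularization, omitted here). The gradients follow the case analysis above:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """O(k n) FM prediction using the reformulated pairwise term."""
    s = V.T @ x
    return w0 + w @ x + 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())

def sgd_step(x, y, w0, w, V, lr=0.01):
    """One SGD update for squared loss on a single sample (x, y)."""
    err = fm_predict(x, w0, w, V) - y   # dLoss/dy_hat, up to a factor of 2
    s = V.T @ x                         # reused: s[f] = sum_j v_{j,f} x_j
    w0 -= lr * err * 1.0                # d y_hat / d w0      = 1
    w  -= lr * err * x                  # d y_hat / d w_i     = x_i
    # d y_hat / d v_{i,f} = x_i * sum_j v_{j,f} x_j - v_{i,f} x_i^2
    V  -= lr * err * (np.outer(x, s) - V * (x ** 2)[:, None])
    return w0, w, V
```

Note that $\sum_{j=1}^n v_{j,f}x_j$ does not depend on $i$, so it is computed once per sample and reused across all factor gradients, keeping each update linear in $n$.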


Reposted from blog.csdn.net/weixin_40124413/article/details/100607847