Recommender System Series (1): FM Theory and Practice

Why

In the recommendation domain, we frequently need to cross categorical features. The usual way of doing so has two problems:

  • After one-hot encoding, the features are extremely sparse and high-dimensional (the curse of dimensionality); see the small example after this list.
  • A plain linear model, \(y=w_0+\sum_{i=1}^{n}w_ix_i\), does not capture any relationship between features.
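
As a minimal illustration of the sparsity problem (hypothetical toy IDs, unrelated to the dataset used later), one-hot encoding two categorical fields already produces a mostly-zero matrix whose width grows with the number of distinct values:

import pandas as pd

# Hypothetical interaction log: after one-hot encoding, each row has only
# 2 non-zero entries, and the number of columns grows with the ID vocabulary.
df = pd.DataFrame({"user": ["u1", "u2", "u3"], "item": ["i7", "i7", "i9"]})
X = pd.get_dummies(df, columns=["user", "item"])
print(X)
# With millions of users/items each row stays ~2 non-zeros while the width
# explodes, so a dedicated cross weight w_ij is almost never observed non-zero.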

What

FM (Factorization Machine) was proposed precisely as an efficient solution to these problems in feature crossing. [1]

1. Relations between features

To address the second problem first, we establish relations between features by rewriting the plain linear model as a second-order polynomial model (only second order is discussed here, not higher-order polynomials).
\[ y = w_0+\sum_{i=1}^nw_ix_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^nw_{ij}x_ix_j \tag{1} \]

2. Vectorizing the parameters

With the formula above, the next question is how to estimate the parameters. Because one-hot encoding makes the features extremely sparse, the products \(x_ix_j\) in the formula are also almost always zero, so the weights \(w_{ij}\) receive too few effective updates to be learned. Explanation: let \(x_ix_j=X\); then \(\frac{\partial{y}}{\partial{w_{ij}}}=X\), and by the chain rule the loss gradient for \(w_{ij}\) is proportional to \(X\) as well. Since \(X=0\) for the vast majority of samples, the update \(w_{ij}^{new}=w_{ij}^{old}-\alpha\cdot 0=w_{ij}^{old}\) leaves the parameter unchanged: the gradient is zero and \(w_{ij}\) cannot be trained.

The root cause of this situation is that the features are too sparse. What we want is a way of estimating \(w_{ij}\) that is not affected by feature sparsity.

Inspired by matrix factorization, we introduce for every feature \(x_i\) an auxiliary (latent) vector \(V_i=(v_{i1},v_{i2},\cdots,v_{ik})\) and use \(V_iV_j^T\) to stand in for \(w_{ij}\). That is, we make the assumption \(w_{ij} \approx V_iV_j^T\).

Introducing latent vectors has two benefits (a small NumPy sketch follows this list):

  1. The number of second-order parameters drops from \(\frac{n(n-1)}{2}\) to \(kn\).

  2. The parameters used to be unrelated to one another, but the latent vectors now tie them together. For instance, \(w_{ij}\) and \(w_{ik}\) were previously independent, whereas now \(w_{ij}=\langle V_i,V_j\rangle\) and \(w_{ik}=\langle V_i,V_k\rangle\) share the same \(V_i\). In other words, every sample containing a non-zero cross term with \(x_i\) (i.e. there exists some \(j\neq i\) such that \(x_ix_j\neq 0\)) can be used to learn the latent vector \(V_i\), which largely removes the impact of data sparsity. [2]
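
Below is a minimal NumPy sketch of this factorization (random toy numbers, purely illustrative): all \(n(n-1)/2\) pairwise weights are read off as inner products of rows of a single \(n\times k\) matrix, so only \(kn\) numbers are stored.

import numpy as np

n, k = 6, 3                              # 6 features, latent dimension 3
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(n, k))   # one latent vector per feature

W_hat = V @ V.T                          # implied pairwise weights, w_ij ~ <V_i, V_j>
print(W_hat[1, 4], np.dot(V[1], V[4]))   # identical by construction
# Updating V[1] (from any sample where x_1 * x_j != 0) also changes every
# other w_1j, which is how information is shared across sparse crosses.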

Formula (1) can now be rewritten as:
\[ y = w_0+\sum_{i=1}^nw_ix_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^n\langle V_i,V_j\rangle x_ix_j \tag{2} \]
The focus shifts to how to compute the second-order term of formula (2) efficiently.

First, consider summing the upper triangle of a symmetric matrix \(M\):
\[ M=\left( \begin{matrix} m_{11} & m_{12} & \cdots & m_{1n} \\ m_{21} & m_{22} & \cdots & m_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ m_{n1} & m_{n2} & \cdots & m_{nn} \\ \end{matrix} \right)_{n \times n} \]
where \(m_{ij}=m_{ji}\).

Let the sum of the upper-triangular elements be \(A\), i.e. \(\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}m_{ij}=A\). Then the sum of all elements of \(M\) equals \(2*A+tr(M)\), where \(tr(M)\) is the trace of the matrix:
\[ \sum_{i=1}^n\sum_{j=1}^nm_{ij}=2*\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}m_{ij} + \sum_{i=1}^{n}m_{ii} \]
Hence,
\[ A=\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}m_{ij}=\frac{1}{2}*\left\{\sum_{i=1}^n\sum_{j=1}^nm_{ij}-\sum_{i=1}^{n}m_{ii}\right\} \]
With this identity in hand, the second-order term of formula (2) can be derived as follows:
\[ \begin{align} & \sum_{i=1}^{n-1}\sum_{j=i+1}^n\langle V_i,V_j\rangle x_ix_j \notag \\ ={} & \frac{1}{2}*\left\{\sum_{i=1}^{n}\sum_{j=1}^{n}\langle V_i,V_j\rangle x_ix_j-\sum_{i=1}^{n}\langle V_i,V_i\rangle x_ix_i\right\} \notag \\ ={} & \frac{1}{2}*\left\{\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{f=1}^{k}v_{if}v_{jf}x_ix_j-\sum_{i=1}^{n}\sum_{f=1}^{k}v_{if}v_{if}x_ix_i\right\} \notag \\ ={} & \frac{1}{2}*\sum_{f=1}^{k}\left\{\sum_{i=1}^{n}\sum_{j=1}^{n}v_{if}x_iv_{jf}x_j-\sum_{i=1}^{n}v_{if}^{2}x_{i}^2\right\} \notag \\ ={} & \frac{1}{2}*\sum_{f=1}^{k}\left\{\left(\sum_{i=1}^{n}v_{if}x_i\right)\left(\sum_{j=1}^{n}v_{jf}x_j\right)-\sum_{i=1}^{n}v_{if}^{2}x_{i}^2\right\} \notag \\ ={} & \frac{1}{2}*\sum_{f=1}^{k}\left\{\left(\sum_{i=1}^{n}v_{if}x_i\right)^{2}-\sum_{i=1}^{n}v_{if}^{2}x_{i}^2\right\} \notag \\ \end{align}\tag{3} \]
Combining (2) and (3) gives:
\[ \begin{align} y ={} & w_0+\sum_{i=1}^nw_ix_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^n\langle V_i,V_j\rangle x_ix_j \notag \\ ={} & w_0+\sum_{i=1}^nw_ix_i+\frac{1}{2}*\sum_{f=1}^{k}\left\{\left(\sum_{i=1}^{n}v_{if}x_i\right)^{2}-\sum_{i=1}^{n}v_{if}^{2}x_{i}^2\right\} \notag \\ \end{align} \tag{4} \]
This is the model expression we were after.

The reason for rewriting formula (2) as formula (4) is that, before the rewrite, computing \(y\) costs \(O(kn^2)\), whereas after the rewrite it costs \(O(kn)\), which speeds up model inference. The small NumPy check below verifies that the two forms agree.
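
A minimal NumPy check of this equivalence (toy random data, sizes chosen arbitrarily):

import numpy as np

n, k = 8, 4
rng = np.random.default_rng(1)
x = rng.random(n)
V = rng.normal(size=(n, k))

# Naive O(k*n^2): explicit sum over all pairs i < j, as in formula (2).
naive = sum(np.dot(V[i], V[j]) * x[i] * x[j]
            for i in range(n - 1) for j in range(i + 1, n))

# Reformulated O(k*n), as in formula (3).
fast = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))

print(naive, fast)   # the two values agree up to floating-point error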

How

So far we have only defined the model; how do we fit it? Plain gradient descent can be used to estimate the parameters. Below are the gradient expressions for each parameter of the model.

For the bias \(w_0\): \(\frac{\partial{y}}{\partial{w_0}}=1\).

For a linear weight \(w_i\): \(\frac{\partial{y}}{\partial{w_i}}=x_i\).

For a latent factor \(v_{if}\), only the second-order term of the model matters; when differentiating with respect to \(v_{if}\), all other parameters are treated as constants.
\[ \begin{align} \frac{\partial{y}}{\partial{v_{if}}} ={} & \partial{\frac{1}{2}\left\{\left(\sum_{i=1}^{n}v_{if}x_i\right)^{2}-\sum_{i=1}^{n}v_{if}^{2}x_{i}^2\right\}}/\partial{v_{if}} \notag \\ ={} & \frac{1}{2}* \left\{ \frac{ \partial{ \left\{ \sum_{i=1}^{n}v_{if}x_i \right\}^2 } }{\partial{v_{if}}} - \frac{ \partial{ \left\{ \sum_{i=1}^{n}v_{if}^{2}x_{i}^2 \right\} } }{\partial{v_{if}}} \right\} \notag \\ \end{align} \tag{5} \]
where:
\[ \frac{ \partial{ \left\{ \sum_{i=1}^{n}v_{if}^{2}x_{i}^2 \right\} } }{\partial{v_{if}}} = 2x_{i}^2v_{if} \tag{6} \]
\(\lambda=\sum_{i=1}^{n}v_{if}x_i\) ,则:
\[ \begin{align} \frac{ \partial{ \left\{ \sum_{i=1}^{n}v_{if}x_i \right\}^2 } }{\partial{v_{if}}} ={} & \frac{\partial{\lambda^2}}{\partial{v_{if}}} \notag \\ ={} & \frac{\partial{\lambda^2}}{\partial{\lambda}} \frac{\partial{\lambda}}{\partial{v_{if}}} \notag \\ ={} & 2\lambda*\frac{\partial{\sum_{i=1}^{n}v_{if}x_i}}{\partial{v_{if}}} \notag \\ ={} & 2\lambda*x_i \notag \\ ={} & 2*x_i*\sum_{j=1}^{n}v_{jf}x_j \notag \\ \end{align} \tag{7} \]
Combining formulas (5)–(7), we obtain:
\[ \frac{\partial{y}}{\partial{v_{if}}} = x_i\sum_{j=1}^{n}v_{jf}x_j-x_{i}^2v_{if} \tag{8} \]
In summary, the gradient expressions for all model parameters are:
\[ \begin{equation} \frac{\partial{y}}{\partial{\theta}} = \begin{cases} 1, & \text{if } \theta \text{ is } w_0; \\ x_i, & \text{if } \theta \text{ is } w_i; \\ x_i\sum_{j=1}^{n}v_{jf}x_j-x_{i}^2v_{if}, & \text{if } \theta \text{ is } v_{if}. \end{cases} \end{equation} \notag \]
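
As a minimal sketch, one SGD step on a single sample under squared loss using these gradients might look as follows (random toy data; sizes and learning rate are arbitrary assumptions):

import numpy as np

n, k, lr = 8, 4, 0.01
rng = np.random.default_rng(2)
x, target = rng.random(n), 3.5
w0, w, V = 0.0, np.zeros(n), rng.normal(scale=0.1, size=(n, k))

# Forward pass, formula (4); "shared" is sum_j v_{jf} x_j, reused by every v_{if} update.
shared = V.T @ x
y_hat = w0 + w @ x + 0.5 * np.sum(shared ** 2 - (V.T ** 2) @ (x ** 2))

# Backward pass under squared loss: dL/dtheta = 2 * (y_hat - target) * dy/dtheta.
g = 2.0 * (y_hat - target)
w0 -= lr * g * 1.0                                      # dy/dw0 = 1
w  -= lr * g * x                                        # dy/dw_i = x_i
grad_V = np.outer(x, shared) - (x ** 2)[:, None] * V    # formula (8)
V  -= lr * g * grad_V
print(y_hat)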

1. Complexity analysis

As shown in the previous section, FM inference costs \(O(kn)\). For training, note from the gradient expressions that \(\sum_{j=1}^{n}v_{jf}x_{j}\) does not depend on \(i\), so it can be computed once (per sample and factor) before the parameter updates, at a cost of \(O(kn)\); every subsequent per-parameter update then costs \(O(1)\). With \(1+n+kn\) parameters in total, training therefore also costs \(O(kn)\), where \(n\) is the number of features and \(k\) the latent dimension.

With both training and inference running in \(O(kn)\), FM is a very efficient model.

Advantages [1]:

In total, the advantages of our proposed FM are:

1) FMs allow parameter estimation under very sparse data where SVMs fail.

2) FMs have linear complexity, can be optimized in the primal and do not rely on support vectors like SVMs. We show that FMs scale to large datasets like Netflix with 100 millions of training instances.

3) FMs are a general predictor that can work with any real valued feature vector. In contrast to this, other state-of-the-art factorization models work only on very restricted input data. We will show that just by defining the feature vectors of the input data, FMs can mimic state-of-the-art models like biased MF, SVD++, PITF or FPMC.

Disadvantages [6]:

  1. FM only crosses features up to second order; higher-order cross features still require manual feature engineering.

2. From theory to practice

FM can be applied to regression as well as classification tasks. For binary classification, for example, it suffices to wrap formula (2) in a sigmoid function; the derivation above was carried out for the regression case.

The final loss function can likewise take several forms: for regression one can use MSE, for classification cross entropy, and so on. A small sketch of the classification variant follows.
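
A minimal NumPy sketch of the binary-classification variant (toy data; the forward pass is formula (4) unchanged, only the output transform and the loss differ):

import numpy as np

def fm_forward(x, w0, w, V):
    """FM score, formula (4)."""
    shared = V.T @ x
    return w0 + w @ x + 0.5 * np.sum(shared ** 2 - (V.T ** 2) @ (x ** 2))

rng = np.random.default_rng(3)
n, k = 8, 4
x, label = rng.random(n), 1.0                    # toy sample, label in {0, 1}
w0, w, V = 0.0, np.zeros(n), rng.normal(scale=0.1, size=(n, k))

p = 1.0 / (1.0 + np.exp(-fm_forward(x, w0, w, V)))          # sigmoid on the FM score
loss = -(label * np.log(p) + (1 - label) * np.log(1 - p))   # cross entropy
print(p, loss)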

We know the auxiliary vectors make the computation feasible, but how is an auxiliary vector tied to its feature \(x_i\); in other words, how do we obtain \(V_i\) from \(x_i\)? In a neural-network implementation of FM, the embedding of \(x_i\) is used as its auxiliary vector, and the resulting embedding vectors can also be viewed as low-dimensional dense representations of the corresponding features, reusable in downstream tasks.
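
A minimal sketch of this embedding view (hypothetical feature indices, plain NumPy rather than the TensorFlow code below): for a one-hot sample, the latent vectors of the active features are simply looked up by index, exactly like an embedding lookup.

import numpy as np

n_features, k = 1000, 10
rng = np.random.default_rng(4)
V = rng.normal(scale=0.01, size=(n_features, k))   # embedding table, one row per feature

active = [17, 523]        # indices of the non-zero (one-hot) features of one sample
emb = V[active]           # "embedding lookup": V_i for each active feature x_i

# Second-order FM term for this sample; with 0/1 features it depends only on
# the looked-up embeddings, never on the full sparse input vector.
pairwise = 0.5 * np.sum(np.sum(emb, axis=0) ** 2 - np.sum(emb ** 2, axis=0))
print(pairwise)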

This article uses the MovieLens 100K Dataset [3] as the experimental input. The feature groups are the user ID and the movie ID, and the user's historical rating of the movie serves as the label.

The full implementation (TensorFlow 1.x; it expects data/ua.base and data/ua.test from MovieLens 100K) is as follows:

# -*- coding:utf-8 -*-
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from itertools import count
from collections import defaultdict
import tensorflow as tf


def vectorize_dic(dic, label2index=None, hold_num=None):
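    """Turn {field_name: list of raw values} into a one-hot CSR matrix.

    Each distinct (field, value) pair becomes one column; label2index maps the
    pair to its column index and is reused for the test set so columns stay aligned.
    """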
  
    if label2index is None:
        d = count(0)
        label2index = defaultdict(lambda: next(d))  # maps "field + value" strings to column indices

    sample_num = len(list(dic.values())[0])  # number of samples
    feat_num = len(list(dic.keys()))  # number of feature fields
    total_value_num = sample_num * feat_num

    col_ix = np.empty(total_value_num, dtype=int)

    i = 0
    for k, lis in dic.items():
        col_ix[i::feat_num] = [label2index[str(k) + str(el)] for el in lis]
        i += 1

    row_ix = np.repeat(np.arange(sample_num), feat_num)
    data = np.ones(total_value_num)

    if hold_num is None:
        hold_num = len(label2index)

    left_data_index = np.where(col_ix < hold_num)  # drop test-set values that never appear in the train set

    return csr_matrix(
        (data[left_data_index], (row_ix[left_data_index], col_ix[left_data_index])),
        shape=(sample_num, hold_num)), label2index

def batcher(X_, y_, batch_size=-1):

    assert X_.shape[0] == len(y_)

    n_samples = X_.shape[0]
    if batch_size == -1:
        batch_size = n_samples
    if batch_size < 1:
        raise ValueError('Parameter batch_size={} is unsupported'.format(batch_size))

    for i in range(0, n_samples, batch_size):
        upper_bound = min(i + batch_size, n_samples)
        ret_x = X_[i:upper_bound]
        ret_y = y_[i:upper_bound]
        yield(ret_x, ret_y)

def load_dataset():
    cols = ['user', 'item', 'rating', 'timestamp']
    train = pd.read_csv('data/ua.base', delimiter='\t', names=cols)
    test = pd.read_csv('data/ua.test', delimiter='\t', names=cols)

    x_train, label2index = vectorize_dic({'users': train.user.values, 'items': train.item.values})
    x_test, label2index = vectorize_dic({'users': test.user.values, 'items': test.item.values}, label2index, x_train.shape[1])

    y_train = train.rating.values
    y_test = test.rating.values

    x_train = x_train.todense()
    x_test = x_test.todense()

    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = load_dataset()

print("x_train shape: ", x_train.shape)
print("x_test shape: ", x_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

vec_dim = 10
batch_size = 1000
epochs = 10
learning_rate = 0.001
sample_num, feat_num = x_train.shape

x = tf.placeholder(tf.float32, shape=[None, feat_num], name="input_x")
y = tf.placeholder(tf.float32, shape=[None,1], name="ground_truth")

w0 = tf.get_variable(name="bias", shape=(1), dtype=tf.float32)
W = tf.get_variable(name="linear_w", shape=(feat_num), dtype=tf.float32)
V = tf.get_variable(name="interaction_w", shape=(feat_num, vec_dim), dtype=tf.float32)
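# The forward pass below follows formula (4): y = w0 + x.W + 0.5 * sum_f[(xV)_f^2 - (x^2)(V^2)_f]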

linear_part = w0 + tf.reduce_sum(tf.multiply(x, W), axis=1, keep_dims=True)
interaction_part = 0.5 * tf.reduce_sum(tf.square(tf.matmul(x, V)) - tf.matmul(tf.square(x), tf.square(V)), axis=1, keep_dims=True)
y_hat = linear_part + interaction_part
loss = tf.reduce_mean(tf.square(y - y_hat))
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for e in range(epochs):
        step = 0
        print("epoch:{}".format(e))
        for batch_x, batch_y in batcher(x_train, y_train, batch_size):
            sess.run(train_op, feed_dict={x:batch_x, y:batch_y.reshape(-1, 1)})
            step += 1
            if step % 10 == 0:
                for val_x, val_y in batcher(x_test, y_test):
                    train_loss = sess.run(loss, feed_dict={x:batch_x, y:batch_y.reshape(-1, 1)})
                    val_loss = sess.run(loss, feed_dict={x:val_x, y:val_y.reshape(-1, 1)})
                    print("batch train_mse={}, val_mse={}".format(train_loss, val_loss))

    for val_x, val_y in batcher(x_test, y_test):
        val_loss = sess.run(loss, feed_dict={x: val_x, y: val_y.reshape(-1, 1)})
        print("test set rmse = {}".format(np.sqrt(val_loss)))

Results:

epoch:0
batch train_mse=19.54930305480957, val_mse=19.687997817993164
batch train_mse=16.957233428955078, val_mse=19.531404495239258
batch train_mse=18.544944763183594, val_mse=19.376962661743164
batch train_mse=18.870519638061523, val_mse=19.222412109375
batch train_mse=18.769777297973633, val_mse=19.070764541625977
batch train_mse=19.383392333984375, val_mse=18.915040969848633
batch train_mse=17.26403045654297, val_mse=18.75937843322754
batch train_mse=17.652183532714844, val_mse=18.6033935546875
batch train_mse=18.331804275512695, val_mse=18.447608947753906
......
epoch:9
batch train_mse=1.394300103187561, val_mse=1.4516444206237793
batch train_mse=1.2031371593475342, val_mse=1.4285767078399658
batch train_mse=1.1761484146118164, val_mse=1.4077649116516113
batch train_mse=1.134848952293396, val_mse=1.3872103691101074
batch train_mse=1.2191411256790161, val_mse=1.3692644834518433
batch train_mse=1.572729468345642, val_mse=1.3509554862976074
batch train_mse=1.3323310613632202, val_mse=1.3339732885360718
batch train_mse=1.1601723432540894, val_mse=1.3183823823928833
batch train_mse=1.2751621007919312, val_mse=1.3023829460144043
test set rmse = 1.1405380964279175

References


Reposted from www.cnblogs.com/yinzm/p/11619829.html