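Click-Through Rate (CTR) Prediction: FFM (Field-aware Factorization Machine) in Theory and Practice

FFM (Field-aware Factorization Machine)

In 2015, FFM, which builds on FM, won several CTR prediction competitions, and it has been widely used by companies such as Criteo and Meituan in areas like recommender systems and CTR prediction. In the FM model each feature has a single latent vector. FFM introduces the notion of field awareness: each feature has a separate latent vector for every field, which makes the model more expressive.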

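The FFM paper is available at: https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
Main feature of FFM: it introduces the concept of fields on top of FM.
Advantages of the FFM model:

  1. It performs well on high-dimensional sparse datasets;
  2. It is more accurate than FM: it strengthens the relationships between features, which makes the model more expressive.

Disadvantages of the FFM model:

  1. High time complexity. Compared with FM, the time complexity rises from $O(kn)$ to $O(kn^{2})$;
  2. The large number of parameters makes it prone to overfitting, so regularization and an early-stopping training strategy are needed.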
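
To make the cost difference concrete, here is a minimal sketch that counts the second-order parameters of FM and FFM and the number of feature pairs in the double sum. The function names and the example sizes (943 user IDs plus 1682 item IDs, 2 fields, k = 10) are assumptions chosen for illustration.

```python
def fm_latent_params(n, k):
    """FM keeps one k-dimensional latent vector per feature."""
    return n * k

def ffm_latent_params(n, f, k):
    """FFM keeps one k-dimensional latent vector per (feature, field) pair."""
    return n * f * k

def pairwise_terms(n):
    """Number of feature pairs (i, j) with i < j in the second-order sum."""
    return n * (n - 1) // 2

n, f, k = 943 + 1682, 2, 10  # illustrative sizes (assumed)
print(fm_latent_params(n, k))      # 26250
print(ffm_latent_params(n, f, k))  # 52500
print(pairwise_terms(n))           # 3444000
```

The pair count grows quadratically in n, which is why a naive loop over all pairs becomes expensive for one-hot encoded IDs.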

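Algorithm principle

To aid understanding, consider an example. Gender = [male] and gender = [female] become gender = [1, 0] and gender = [0, 1] after one-hot encoding. Both features represent gender, so they belong to the same field. Likewise, all features produced by one-hot encoding the same categorical attribute, such as time, age, or occupation, can be placed in the same field.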
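
The grouping of one-hot features into fields can be sketched in a few lines of plain Python. The schema below (a gender field and an age_group field) and all names are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical schema: field name -> possible categorical values.
fields = {
    "gender": ["male", "female"],
    "age_group": ["young", "adult", "senior"],
}

# Give every (field, value) pair a global feature index and record its field id.
feature_index = {}
field_of_feature = []
for field_id, (name, values) in enumerate(fields.items()):
    for value in values:
        feature_index[(name, value)] = len(field_of_feature)
        field_of_feature.append(field_id)

def one_hot(record):
    """Encode a dict such as {"gender": "male", ...} as a 0/1 list."""
    x = [0] * len(field_of_feature)
    for name, value in record.items():
        x[feature_index[(name, value)]] = 1
    return x

print(one_hot({"gender": "female", "age_group": "adult"}))  # [0, 1, 0, 1, 0]
print(field_of_feature)                                      # [0, 0, 1, 1, 1]
```

Each position of the encoded vector is a feature, and `field_of_feature` records which field it came from.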

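Suppose there are n features and f fields. The second-order term of FFM then has n*f latent vectors, whereas in FM each feature has only one latent vector. FM can be viewed as a special case of FFM in which all features are placed in a single field. The model equation is:
$y(x) = w_{0} + \sum_{i=1}^{n} w_{i}x_{i} + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_{i,f_{j}}, v_{j,f_{i}} \rangle x_{i}x_{j}$
where $f_{j}$ denotes the field that feature $j$ belongs to.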
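
As a minimal sketch of how this equation is evaluated for one sample, the NumPy function below sums the bias, the linear term, and the field-aware cross terms. The function name, argument layout, and the toy numbers are assumptions for illustration, not the author's implementation.

```python
import numpy as np

def ffm_predict(x, field_of, w0, w, V):
    """Score one sample with an FFM.

    x        : (n,) feature values
    field_of : (n,) field id of each feature
    w0, w    : bias and (n,) linear weights
    V        : (n, f, k) latent vectors; V[i, g] is feature i's vector for field g
    """
    y = w0 + np.dot(w, x)
    nz = np.flatnonzero(x)  # features with value 0 add nothing to the cross terms
    for a in range(len(nz)):
        for b in range(a + 1, len(nz)):
            i, j = nz[a], nz[b]
            # <v_{i,f_j}, v_{j,f_i}> x_i x_j
            y += np.dot(V[i, field_of[j]], V[j, field_of[i]]) * x[i] * x[j]
    return y

# Toy check with made-up numbers: every latent vector is [1, 1].
x = np.array([1.0, 0.0, 1.0])
field_of = np.array([0, 0, 1])
V = np.ones((3, 2, 2))
w = np.array([0.5, 0.5, 0.5])
print(ffm_predict(x, field_of, 0.1, w, V))  # about 3.1 = 0.1 + 1.0 + 2.0
```

Note that feature i is paired with the vector it keeps for feature j's field, and vice versa; this is the only difference from the FM cross term.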

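The following simple example shows how FFM crosses features. The user's input record is shown in the figure:
(Figure: input record)
For ease of explanation, all features and their fields are then mapped to integer IDs, as shown in the figure:
(Figure: user data)
According to the second-order term $\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_{i,f_{j}}, v_{j,f_{i}} \rangle x_{i}x_{j}$, with n = 5 features there are n(n-1)/2 = 10 FFM cross terms, as shown in the figure:
(Figure: feature crosses)
Blue text is the feature ID, red text is the field ID, and green text is the feature value.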
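
Since the figures are not reproduced here, the following sketch enumerates the cross terms for a hypothetical record with five features. The feature IDs, field IDs, and values are invented for illustration; only the pairing rule comes from the formula.

```python
from itertools import combinations

# Hypothetical record: (feature id, field id, value). Features 3 and 4 share field 3.
record = [(1, 1, 1.0), (2, 2, 1.0), (3, 3, 1.0), (4, 3, 1.0), (5, 4, 9.99)]

# One term per unordered pair: feature i uses its vector for j's field and vice versa.
terms = [
    f"<v({i},{fj}), v({j},{fi})> * {xi} * {xj}"
    for (i, fi, xi), (j, fj, xj) in combinations(record, 2)
]
for term in terms:
    print(term)
print(len(terms))  # 10
```

The first printed term is `<v(1,2), v(2,1)> * 1.0 * 1.0`: feature 1 contributes the vector it keeps for field 2, and feature 2 the vector it keeps for field 1.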

See the paper for a detailed analysis of the FFM model.

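Code implementation

The data used is MovieLens 100K. To keep things simple, and because the goal is only to show how FFM is implemented, only uid and itemId are used as input features, with rating as the label.

Dataset

u.item: movie information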

 movie id | movie title | release date | video release date |IMDb URL |unknown | Action | Adventure | Animation |Children's | Comedy | Crime |Documentary | Drama | Fantasy |Film-Noir | Horror | Musical | Mystery |Romance | Sci-Fi |Thriller | War | Western
 
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

u.user: user information

user id | age | gender | occupation | zip code

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067

ua.base: training set
ua.test: test set

user id | item id | rating | timestamp
1	1	5	874965758
1	2	3	876893171
1	3	4	878542960

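Data processing

uid and itemId are one-hot encoded and rating is used as the output label. Ratings range from 1 to 5: a rating greater than 3 is mapped to 1 (the user is interested), and a rating of 3 or lower is mapped to 0 (the user is not interested).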

# Load the data
def loadData():

    # User info (only uid is used)
    userInfo = pd.read_csv('../data/u.user', sep='|', names=['uid', 'age', 'gender', 'occupation','zip code'])
    uid_ = userInfo['uid']
    userId_dum = pd.get_dummies(userInfo['uid'], columns=['uid'], prefix='uid_')
    userId_dum['uid']=uid_

    # Item info (only item_id is used)
    header = ['item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children',
              'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
              'Thriller', 'War', 'Western']
    ItemInfo = pd.read_csv('../data/u.item', sep='|', names=header, encoding = "ISO-8859-1")
    ItemInfo = ItemInfo.drop(columns=['title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown'])
    item_id_ = ItemInfo['item_id']
    item_Id_dum = pd.get_dummies(ItemInfo['item_id'], columns=['item_id'], prefix='item_id_')
    item_Id_dum['item_id']=item_id_

    # Training data
    trainData = pd.read_csv('../data/ua.base', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    trainData = trainData.drop(columns=['time'])

    trainData['rating']=trainData.rating.apply(lambda x:1 if int(x)>3 else 0)

    Y_train=pd.get_dummies(trainData['rating'],columns=['rating'],prefix='y_')

    X_train = pd.merge(trainData, userId_dum, how='left')
    X_train = pd.merge(X_train, item_Id_dum, how='left')
    X_train=X_train.drop(columns=['uid','item_id','rating'])


    # Test data
    testData = pd.read_csv('../data/ua.test', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    testData = testData.drop(columns=['time'])

    testData['rating']=testData.rating.apply(lambda x:1 if int(x)>3 else 0)
    Y_test=pd.get_dummies(testData['rating'],columns=['rating'],prefix='y_')

    X_test = pd.merge(testData, userId_dum, how='left')
    X_test = pd.merge(X_test, item_Id_dum, how='left')
    X_test=X_test.drop(columns=['uid','item_id','rating'])

    # Field of each feature column: uid columns -> field 0, item_id columns -> field 1
    field_index={}
    userField=['uid']
    itemField=['itemId']
    field=userField+itemField

    # Number of one-hot columns in each field
    userFieldLen=[len(uid_)]
    itemFieldLen=[len(item_id_)]
    field_len = userFieldLen + itemFieldLen
    j=0
    for field_n in  range(len(field)):
        for i in range(field_len[field_n]):
            field_index[j]=field_n
            j+=1

    return X_train.values,Y_train.values,X_test.values,Y_test.values,field_index,len(field)
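
The column-to-field mapping built at the end of `loadData` can be isolated into a small helper to see what it produces. This is a sketch with a hypothetical function name; it mirrors the nested loop above on toy field sizes.

```python
def build_field_index(field_sizes):
    """Map each one-hot column index to the id of the field it came from."""
    field_index = {}
    col = 0
    for field_id, size in enumerate(field_sizes):
        for _ in range(size):
            field_index[col] = field_id
            col += 1
    return field_index

# Two fields with 2 and 3 one-hot columns respectively.
print(build_field_index([2, 3]))  # {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
```

The mapping is only valid if the feature columns appear in the same order as the fields, which holds here because the uid columns are merged before the item_id columns.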

FFM model


class FFM():
    def __init__(self,vec_dim,learning_rate,feature_length,field_index,field_len):
        self.vec_dim=vec_dim
        self.learning_rate=learning_rate
        self.feature_length=feature_length
        self.field_index=field_index
        self.field_len=field_len
    # Create the input placeholders
    def add_input(self):
        self.X = tf.placeholder(shape=[None, self.feature_length], dtype=tf.float32, name='input_X')
        self.Y = tf.placeholder(shape=[None, 2], dtype=tf.float32, name='input_y')

    # Build the forward computation
    def inference(self):
        with tf.variable_scope('linear_layer'):
            w0 = tf.get_variable(name='w0', shape=[2], dtype=tf.float32)
            self.w = tf.get_variable(name='w', shape=[self.feature_length, 2],dtype=tf.float32)
            self.linear_layer = tf.add(tf.matmul(self.X, self.w) , w0)
        with tf.variable_scope('interaction_layer'):
            self.v = tf.get_variable('v', shape=[self.feature_length, self.field_len, self.vec_dim])
            self.interaction_layer = tf.constant(0, dtype='float32')
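            # Cross term <v_{i,f_j}, v_{j,f_i}> x_i x_j: feature i uses its vector for
            # feature j's field, and feature j uses its vector for feature i's field.
            for i in range(self.feature_length):
                for j in range(i+1,self.feature_length):
                    self.interaction_layer += tf.multiply(tf.reduce_sum(tf.multiply(self.v[i,self.field_index[j]], self.v[j,self.field_index[i]])),
                                                          tf.multiply(self.X[:,i], self.X[:,j]))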

        self.y_out = tf.add(self.linear_layer, tf.transpose([self.interaction_layer]))


    # Loss function
    def add_loss(self):

        self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.Y, logits=self.y_out))

    # Compute accuracy
    def add_accuracy(self):
        # accuracy
        self.correct_prediction = tf.equal(tf.cast(tf.argmax(self.y_out,1), tf.float32), tf.cast(tf.argmax(self.Y,1), tf.float32))
        self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32))

        # self.auc_value = tf.metrics.auc(tf.argmax(self.y_out,1),tf.argmax(self.Y,1), curve='ROC')


    # Training op
    def train(self):
        optimizer =  tf.train.AdagradOptimizer(self.learning_rate)
        self.train_op = optimizer.minimize(self.loss)
    # Build the graph
    def build_graph(self):
        self.add_input()
        self.inference()
        self.add_loss()
        self.add_accuracy()
        self.train()

Training and testing

def train_model(sess, model, X_train,Y_train,batch_size, epochs=100):
    num = len(X_train) // batch_size+1
    for step in range(epochs):
        print("epochs{0}:".format(step+1))
        for i in range(num):
            index = np.random.choice(len(X_train), batch_size)
            batch_x = X_train[index]
            batch_y = Y_train[index]
            feed_dict = {model.X: batch_x,
                         model.Y: batch_y}
            sess.run(model.train_op, feed_dict=feed_dict)


            # print("Iteration {0}: with minibatch  training loss = {1}"
            #       .format(step+1, loss))

            if (i+1)%100==0:
                loss ,accuracy= sess.run([model.loss,model.accuracy], feed_dict=feed_dict)
                print("Iteration {0}: with minibatch training loss = {1} accuracy = {2}"
                      .format(step+1, loss,accuracy))

def test_model(sess,model,X_test,Y_test,batch_size):

    # num = len(X_test) // batch_size+1
    #
    # for i in range(num):
    #     index = np.random.choice(len(X_test), batch_size)
    #     batch_x = X_test[index]
    #     batch_y = np.transpose([Y_test[index]])
    #
    #     feed_dict = {model.X: batch_x,
    #                  model.Y: batch_y}
    #     y_out,loss= sess.run([model.y_out,model.loss], feed_dict=feed_dict)
    #
    #     print(loss)
    loss, y_out, accuracy = sess.run([model.loss, model.y_out, model.accuracy],
                                     feed_dict={model.X: X_test, model.Y: Y_test})

    print("loss={0} accuracy={1}".format(loss,accuracy))

Complete code

# -*- coding:utf-8 -*-
# Written against the TensorFlow 1.x API (tf.placeholder, tf.Session).
import pandas as pd
import numpy as np
import tensorflow as tf

# Load the data
def loadData():

    # User info (only uid is used)
    userInfo = pd.read_csv('../data/u.user', sep='|', names=['uid', 'age', 'gender', 'occupation','zip code'])
    uid_ = userInfo['uid']
    userId_dum = pd.get_dummies(userInfo['uid'], columns=['uid'], prefix='uid_')
    userId_dum['uid']=uid_

    # Item info (only item_id is used)
    header = ['item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children',
              'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
              'Thriller', 'War', 'Western']
    ItemInfo = pd.read_csv('../data/u.item', sep='|', names=header, encoding = "ISO-8859-1")
    ItemInfo = ItemInfo.drop(columns=['title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown'])
    item_id_ = ItemInfo['item_id']
    item_Id_dum = pd.get_dummies(ItemInfo['item_id'], columns=['item_id'], prefix='item_id_')
    item_Id_dum['item_id']=item_id_

    # Training data
    trainData = pd.read_csv('../data/ua.base', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    trainData = trainData.drop(columns=['time'])

    trainData['rating']=trainData.rating.apply(lambda x:1 if int(x)>3 else 0)

    Y_train=pd.get_dummies(trainData['rating'],columns=['rating'],prefix='y_')

    X_train = pd.merge(trainData, userId_dum, how='left')
    X_train = pd.merge(X_train, item_Id_dum, how='left')
    X_train=X_train.drop(columns=['uid','item_id','rating'])


    # Test data
    testData = pd.read_csv('../data/ua.test', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    testData = testData.drop(columns=['time'])

    testData['rating']=testData.rating.apply(lambda x:1 if int(x)>3 else 0)
    Y_test=pd.get_dummies(testData['rating'],columns=['rating'],prefix='y_')

    X_test = pd.merge(testData, userId_dum, how='left')
    X_test = pd.merge(X_test, item_Id_dum, how='left')
    X_test=X_test.drop(columns=['uid','item_id','rating'])

    # Field of each feature column: uid columns -> field 0, item_id columns -> field 1
    field_index={}
    userField=['uid']
    itemField=['itemId']
    field=userField+itemField

    # Number of one-hot columns in each field
    userFieldLen=[len(uid_)]
    itemFieldLen=[len(item_id_)]
    field_len = userFieldLen + itemFieldLen
    j=0
    for field_n in  range(len(field)):
        for i in range(field_len[field_n]):
            field_index[j]=field_n
            j+=1

    return X_train.values,Y_train.values,X_test.values,Y_test.values,field_index,len(field)

class FFM():
    def __init__(self,vec_dim,learning_rate,feature_length,field_index,field_len):
        self.vec_dim=vec_dim
        self.learning_rate=learning_rate
        self.feature_length=feature_length
        self.field_index=field_index
        self.field_len=field_len
    # Create the input placeholders
    def add_input(self):
        self.X = tf.placeholder(shape=[None, self.feature_length], dtype=tf.float32, name='input_X')
        self.Y = tf.placeholder(shape=[None, 2], dtype=tf.float32, name='input_y')

    # Build the forward computation
    def inference(self):
        with tf.variable_scope('linear_layer'):
            w0 = tf.get_variable(name='w0', shape=[2], dtype=tf.float32)
            self.w = tf.get_variable(name='w', shape=[self.feature_length, 2],dtype=tf.float32)
            self.linear_layer = tf.add(tf.matmul(self.X, self.w) , w0)
        with tf.variable_scope('interaction_layer'):
            self.v = tf.get_variable('v', shape=[self.feature_length, self.field_len, self.vec_dim])
            self.interaction_layer = tf.constant(0, dtype='float32')
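            # Cross term <v_{i,f_j}, v_{j,f_i}> x_i x_j: feature i uses its vector for
            # feature j's field, and feature j uses its vector for feature i's field.
            for i in range(self.feature_length):
                for j in range(i+1,self.feature_length):
                    self.interaction_layer += tf.multiply(tf.reduce_sum(tf.multiply(self.v[i,self.field_index[j]], self.v[j,self.field_index[i]])),
                                                          tf.multiply(self.X[:,i], self.X[:,j]))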

        self.y_out = tf.add(self.linear_layer, tf.transpose([self.interaction_layer]))


    # Loss function
    def add_loss(self):

        self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.Y, logits=self.y_out))

    # Compute accuracy
    def add_accuracy(self):
        # accuracy
        self.correct_prediction = tf.equal(tf.cast(tf.argmax(self.y_out,1), tf.float32), tf.cast(tf.argmax(self.Y,1), tf.float32))
        self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32))

        # self.auc_value = tf.metrics.auc(tf.argmax(self.y_out,1),tf.argmax(self.Y,1), curve='ROC')


    # Training op
    def train(self):
        optimizer =  tf.train.AdagradOptimizer(self.learning_rate)
        self.train_op = optimizer.minimize(self.loss)
    # Build the graph
    def build_graph(self):
        self.add_input()
        self.inference()
        self.add_loss()
        self.add_accuracy()
        self.train()
def train_model(sess, model, X_train,Y_train,batch_size, epochs=100):
    num = len(X_train) // batch_size+1
    for step in range(epochs):
        print("epochs{0}:".format(step+1))
        for i in range(num):
            index = np.random.choice(len(X_train), batch_size)
            batch_x = X_train[index]
            batch_y = Y_train[index]
            feed_dict = {model.X: batch_x,
                         model.Y: batch_y}
            sess.run(model.train_op, feed_dict=feed_dict)


            # print("Iteration {0}: with minibatch  training loss = {1}"
            #       .format(step+1, loss))

            if (i+1)%100==0:
                loss ,accuracy= sess.run([model.loss,model.accuracy], feed_dict=feed_dict)
                print("Iteration {0}: with minibatch training loss = {1} accuracy = {2}"
                      .format(step+1, loss,accuracy))

def test_model(sess,model,X_test,Y_test,batch_size):

    # num = len(X_test) // batch_size+1
    #
    # for i in range(num):
    #     index = np.random.choice(len(X_test), batch_size)
    #     batch_x = X_test[index]
    #     batch_y = np.transpose([Y_test[index]])
    #
    #     feed_dict = {model.X: batch_x,
    #                  model.Y: batch_y}
    #     y_out,loss= sess.run([model.y_out,model.loss], feed_dict=feed_dict)
    #
    #     print(loss)
    loss, y_out, accuracy = sess.run([model.loss, model.y_out, model.accuracy],
                                     feed_dict={model.X: X_test, model.Y: Y_test})

    print("loss={0} accuracy={1}".format(loss,accuracy))


if __name__ == '__main__':

    X_train,Y_train,X_test,Y_test,field_index,field_len=loadData()
    learning_rate = 0.001
    batch_size = 64
    vec_dim = 10
    feature_length = X_train.shape[1]
    model = FFM(vec_dim  ,learning_rate ,feature_length,field_index,field_len)

    model.build_graph()


    with tf.Session() as sess:

        sess.run(tf.global_variables_initializer())
        print('start training...')
        train_model(sess,model,X_train,Y_train,batch_size,epochs=1)
        print('start testing...')

        test_model(sess,model,X_test,Y_test,batch_size)

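Note: because uid and itemId are used as features, the one-hot feature dimension is very large. The FFM model therefore has many parameters and takes a long time to run. To see the final results, change the input features yourself and avoid one-hot encoding the IDs.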
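
One way to avoid the one-hot blow-up, offered here only as a sketch and not as the author's method, is to look latent vectors up by integer ID. With just a user field and an item field, each sample has exactly one active feature per field, so a single cross term is non-zero. The table names, random initialization, and k = 10 below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 943, 1682, 10  # MovieLens 100K sizes; k is assumed

# V_user[u] is user u's latent vector for the item field,
# V_item[i] is item i's latent vector for the user field.
V_user = rng.normal(scale=0.01, size=(n_users, k))
V_item = rng.normal(scale=0.01, size=(n_items, k))

def interaction(user_idx, item_idx):
    """Cross term <v_{u,item}, v_{i,user}> for a batch of 0-based index pairs."""
    return np.sum(V_user[user_idx] * V_item[item_idx], axis=1)

users = np.array([0, 1, 2])
items = np.array([0, 1, 2])
print(interaction(users, items).shape)  # (3,)
```

This replaces the loop over all feature pairs with one dot product per sample; the linear term can be handled the same way with per-ID weight lookups.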

Reposted from blog.csdn.net/weixin_41044112/article/details/107822319