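FM (Factorization Machines)
FM (Factorization Machines) is commonly used for CTR prediction. It builds on the LR (Logistic Regression) model by adding feature crosses, usually second-order (pairwise) ones.
The FM paper is available at: https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
Main characteristic of FM: feature crosses are added on top of the LR model.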
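Advantages of the FM model:
- Parameters can be estimated reasonably even when the data is very sparse;
- FM can be seen as LR plus second-order feature crosses, and the model can still be evaluated in linear time;
- FM is a general model that works with any real-valued feature vector, and it can mimic models such as MF, SVD++, PITF and FPMC.
Disadvantages of the FM model:
- Each feature has only one latent vector, so crosses between different types of features are not distinguished from one another. The FFM model takes exactly this point as its starting point for improvement.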
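Algorithm principle
LR (Logistic Regression)
A plain linear model considers each feature on its own. Its expressive power is limited: it ignores the relationships between features, so it cannot perform feature crossing, feature selection and similar operations.
The general linear model is:
$$y = w_0 + \sum_{i=1}^{n} w_i x_i$$
LR applies the sigmoid function to this score to obtain a probability.
To express the correlations between features and increase expressive power, feature crosses are introduced. The FM model takes feature crossing as its starting point for improvement.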
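As a small illustration of the linear score and the sigmoid used by LR, here is a minimal NumPy sketch. The function name `lr_predict` and the toy values are hypothetical and chosen only for this example.

```python
import numpy as np

def lr_predict(x, w0, w):
    """Logistic regression: sigmoid of the linear score w0 + sum_i w_i * x_i."""
    z = w0 + np.dot(x, w)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy values, for illustration only.
x = np.array([1.0, 0.0, 1.0])    # feature vector
w = np.array([0.5, -0.3, 0.2])   # one weight per feature
w0 = -0.7                        # bias

p = lr_predict(x, w0, w)
print(p)
```

Each feature contributes to the score independently here; no term depends on two features at once, which is the limitation that FM addresses.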
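FM (Factorization Machines)
Compared with POLY2, the main difference in the second-order part of FM is that the inner product of two vectors $(w_{j_1} \cdot w_{j_2})$ replaces the single weight coefficient $w_{h(j_1, j_2)}$.
The second-order part of FM therefore takes the following form:
$$\sum_{j_1=1}^{n-1} \sum_{j_2=j_1+1}^{n} (w_{j_1} \cdot w_{j_2})\, x_{j_1} x_{j_2}$$
In essence, FM's use of latent vectors is much the same as matrix factorization representing users and items with latent vectors. FM extends the user and item latent vectors of matrix factorization to all features.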
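The pairwise term can be written down directly as a double loop. The sketch below is a naive NumPy version, where row `V[i]` plays the role of the latent vector of feature `i`; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def fm_pairwise_naive(x, V):
    """Sum over all pairs i < j of <V[i], V[j]> * x[i] * x[j]."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.dot(V[i], V[j]) * x[i] * x[j]
    return total

# Hypothetical toy input: features 0 and 2 are active (e.g. one user, one item).
x = np.array([1.0, 0.0, 1.0, 0.0])
V = np.array([[1.0, 2.0],
              [0.5, 0.5],
              [3.0, -1.0],
              [0.0, 1.0]])

score = fm_pairwise_naive(x, V)
print(score)
```

With one-hot inputs, only the pair of active features contributes, so the score is just the inner product of the two corresponding latent vectors, exactly as in matrix factorization.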
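Rewriting the formula
Writing the latent vector of feature $i$ as $v_i = (v_{i,1}, \dots, v_{i,k})$, the full FM model is:
$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$
The second-order term can be rewritten so that it costs $O(kn)$ instead of $O(kn^2)$:
$$\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2 \right]$$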
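The second-order term of FM can be computed in time linear in the number of features as half of "square of the sum minus sum of the squares", taken per latent dimension. The NumPy sketch below checks this form numerically against the naive double loop on random data. The function names are hypothetical, and the random inputs are only a sanity check, not a proof.

```python
import numpy as np

def fm_pairwise_naive(X, V):
    """O(k * n^2): explicit sum over all feature pairs i < j, for a batch X."""
    n = X.shape[1]
    out = np.zeros(X.shape[0])
    for i in range(n):
        for j in range(i + 1, n):
            out += np.dot(V[i], V[j]) * X[:, i] * X[:, j]
    return out

def fm_pairwise_fast(X, V):
    """O(k * n): 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]."""
    square_of_sum = (X @ V) ** 2            # shape (batch, k)
    sum_of_square = (X ** 2) @ (V ** 2)     # shape (batch, k)
    return 0.5 * np.sum(square_of_sum - sum_of_square, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))   # 5 samples, 6 features
V = rng.normal(size=(6, 3))   # 6 features, k = 3 latent factors

naive = fm_pairwise_naive(X, V)
fast = fm_pairwise_fast(X, V)
print(np.allclose(naive, fast))
```

For 0/1 one-hot inputs, `X ** 2` equals `X`, so the square on the inputs makes no difference there; it matters only for general real-valued features.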
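Latent vectors are introduced so that FM copes better with sparse data. Compared with POLY2, FM loses some of the ability to memorize specific feature combinations precisely, but its generalization ability improves greatly.
Benefits of introducing latent vectors:
- The number of second-order parameters drops from $\frac{n(n-1)}{2}$ to $kn$, which speeds up model inference.
- Previously the parameters were unrelated to one another; now they are tied together through the shared latent vectors.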
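To get a feel for the reduction in parameters, the sketch below compares the two counts. It assumes the 943 users and 1,682 items of MovieLens 100K, each one-hot encoded, and k = 10 latent factors; these numbers are assumptions for illustration, and the helper names are hypothetical.

```python
def poly2_pair_params(n):
    """One independent weight per feature pair: n(n-1)/2."""
    return n * (n - 1) // 2

def fm_pair_params(n, k):
    """One k-dimensional latent vector per feature: k*n."""
    return k * n

n = 943 + 1682   # assumed one-hot width: users + items
k = 10           # assumed number of latent factors

print(poly2_pair_params(n))   # 3444000
print(fm_pair_params(n, k))   # 26250
```

Most of the user-user and item-item pairs never co-occur in a one-hot sample, so the POLY2 weights for them could not be learned anyway, whereas every FM latent vector is updated whenever its feature is active.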
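Solving the model
The FM parameters $w_0$, $w_i$ and $v_{i,f}$ are learned by gradient descent. The gradient of the model output $\hat{y}$ with respect to each parameter $\theta$ is:
$$\frac{\partial \hat{y}}{\partial \theta} = \begin{cases} 1, & \theta = w_0 \\ x_i, & \theta = w_i \\ x_i \sum_{j=1}^{n} v_{j,f}\, x_j - v_{i,f}\, x_i^2, & \theta = v_{i,f} \end{cases}$$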
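The standard FM gradients of the model output are 1 for w0, x_i for w_i, and x_i * sum_j(v_jf * x_j) - v_if * x_i^2 for v_if. The NumPy sketch below implements them and compares the latent-vector gradient with a finite-difference estimate. All names and the random test data are hypothetical. These are gradients of the model output; for an actual training step they would be multiplied by the derivative of the loss with respect to the output (chain rule).

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM output for a single sample x, using the linear-time pairwise term."""
    xv = x @ V                                              # shape (k,)
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return w0 + x @ w + pairwise

def fm_gradients(x, V):
    """Gradients of the FM output with respect to w0, w and V."""
    xv = x @ V                                        # sum_j v_jf * x_j, per factor f
    grad_w0 = 1.0
    grad_w = x.copy()
    grad_V = np.outer(x, xv) - V * (x ** 2)[:, None]  # shape (n, k)
    return grad_w0, grad_w, grad_V

rng = np.random.default_rng(0)
n, k = 5, 3
x = rng.normal(size=n)
w0 = 0.1
w = rng.normal(size=n)
V = rng.normal(size=(n, k))

_, _, grad_V = fm_gradients(x, V)

# Central finite differences on every entry of V.
eps = 1e-6
numeric = np.zeros_like(V)
for i in range(n):
    for f in range(k):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[i, f] += eps
        V_minus[i, f] -= eps
        numeric[i, f] = (fm_predict(x, w0, w, V_plus)
                         - fm_predict(x, w0, w, V_minus)) / (2 * eps)

max_err = np.max(np.abs(numeric - grad_V))
print(max_err < 1e-6)
```

The inner sum `x @ V` does not depend on `i`, so it is computed once per sample and reused for every latent-vector gradient.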
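Code implementation
The data used is MovieLens 100K. To keep things simple and only demonstrate how FM is implemented, only uid and itemId are used as input features, with rating as the label.
Dataset
u.item: movie information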
movie id | movie title | release date | video release date |IMDb URL |unknown | Action | Adventure | Animation |Children's | Comedy | Crime |Documentary | Drama | Fantasy |Film-Noir | Horror | Musical | Mystery |Romance | Sci-Fi |Thriller | War | Western
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
u.user: user information
user id | age | gender | occupation | zip code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
ua.base: training set
ua.test: test set
user id | item id | rating | timestamp
1 1 5 874965758
1 2 3 876893171
1 3 4 878542960
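Data processing
uid and itemId are one-hot encoded and rating is turned into the output label. Ratings range from 1 to 5: a rating greater than 3 becomes 1 (the user is interested), and a rating of 3 or lower becomes 0 (the user is not interested). The binary label is then one-hot encoded into two columns.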
import pandas as pd

# Data loading
def loadData():
    # User info (only uid is used)
    userInfo = pd.read_csv('../data/u.user', sep='|', names=['uid', 'age', 'gender', 'occupation', 'zip code'])
    uid_ = userInfo['uid']
    userId_dum = pd.get_dummies(userInfo['uid'], columns=['uid'], prefix='uid_')
    userId_dum['uid'] = uid_  # keep the raw id as the merge key

    # Item info (only item_id is used)
    header = ['item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children',
              'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
              'Thriller', 'War', 'Western']
    ItemInfo = pd.read_csv('../data/u.item', sep='|', names=header, encoding="ISO-8859-1")
    ItemInfo = ItemInfo.drop(columns=['title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown'])
    item_id_ = ItemInfo['item_id']
    item_Id_dum = pd.get_dummies(ItemInfo['item_id'], columns=['item_id'], prefix='item_id_')
    item_Id_dum['item_id'] = item_id_  # keep the raw id as the merge key

    # Training data
    trainData = pd.read_csv('../data/ua.base', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    trainData = trainData.drop(columns=['time'])
    # rating > 3 -> 1 (interested), otherwise 0
    trainData['rating'] = trainData.rating.apply(lambda x: 1 if int(x) > 3 else 0)
    # Two-column one-hot label: column 0 = negative, column 1 = positive
    Y_train = pd.get_dummies(trainData['rating'], columns=['rating'], prefix='y_')
    # Attach the one-hot user and item columns through the shared id columns
    X_train = pd.merge(trainData, userId_dum, how='left')
    X_train = pd.merge(X_train, item_Id_dum, how='left')
    X_train = X_train.drop(columns=['uid', 'item_id', 'rating'])

    # Test data (same processing as the training data)
    testData = pd.read_csv('../data/ua.test', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    testData = testData.drop(columns=['time'])
    testData['rating'] = testData.rating.apply(lambda x: 1 if int(x) > 3 else 0)
    Y_test = pd.get_dummies(testData['rating'], columns=['rating'], prefix='y_')
    X_test = pd.merge(testData, userId_dum, how='left')
    X_test = pd.merge(X_test, item_Id_dum, how='left')
    X_test = X_test.drop(columns=['uid', 'item_id', 'rating'])

    return X_train.values, Y_train.values, X_test.values, Y_test.values
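The encoding step can be tried out on the three ua.base sample rows shown earlier without reading any files. This is a simplified NumPy sketch, not the pandas pipeline itself: the helper `one_hot`, the toy sizes of 3 users and 3 items, and the single-column label are assumptions made for illustration.

```python
import numpy as np

def one_hot(ids, num_classes):
    """One-hot encode 1-based ids (MovieLens ids start at 1)."""
    ids = np.asarray(ids)
    out = np.zeros((len(ids), num_classes), dtype=np.float32)
    out[np.arange(len(ids)), ids - 1] = 1.0
    return out

# (uid, item_id, rating) from the three sample rows of ua.base
rows = np.array([[1, 1, 5],
                 [1, 2, 3],
                 [1, 3, 4]])

n_users, n_items = 3, 3   # toy sizes, not the real dataset sizes
X = np.hstack([one_hot(rows[:, 0], n_users), one_hot(rows[:, 1], n_items)])
y = (rows[:, 2] > 3).astype(np.int64)   # rating > 3 -> 1, otherwise 0

print(X)
print(y)
```

Every row of `X` has exactly two non-zero entries, one for the user and one for the item, which is the sparse setting FM is designed for. A rating of exactly 3 lands in the negative class.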
FM model
import tensorflow as tf  # uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session)

class FM():
    def __init__(self, vec_dim, learning_rate, feature_length):
        """
        Initialize the hyperparameters.
        :param vec_dim: number of latent factors
        :param learning_rate: learning rate
        :param feature_length: number of features
        """
        self.vec_dim = vec_dim
        self.learning_rate = learning_rate
        self.feature_length = feature_length

    # Create the input placeholders
    def add_input(self):
        self.X = tf.placeholder(shape=[None, self.feature_length], dtype=tf.float32, name='input_X')
        self.Y = tf.placeholder(shape=[None, 2], dtype=tf.float32, name='input_y')

    # Define the forward computation
    def inference(self):
        # First-order part: w0 + X.w
        with tf.variable_scope('linear_layer'):
            w0 = tf.get_variable(name='w0', shape=[2], dtype=tf.float32)
            self.w = tf.get_variable(name='w', shape=[self.feature_length, 2], dtype=tf.float32)
            self.linear_layer = tf.add(tf.matmul(self.X, self.w), w0)

        # Second-order part: 0.5 * sum_f [(X.v)^2 - (X^2).(v^2)]
        with tf.variable_scope('interaction_layer'):
            self.v = tf.get_variable(name='v', shape=[self.feature_length, self.vec_dim], dtype=tf.float32)
            self.interaction_layer = tf.multiply(0.5,
                                                 tf.reduce_sum(
                                                     tf.subtract(
                                                         tf.pow(tf.matmul(self.X, self.v), 2),
                                                         # X^2 equals X for one-hot inputs; squared here so that
                                                         # real-valued features are handled correctly as well
                                                         tf.matmul(tf.pow(self.X, 2), tf.pow(self.v, 2))),
                                                     1, keepdims=True))  # keepdims replaces the deprecated keep_dims
        # Note: interaction_layer has shape [batch, 1] and is broadcast onto both
        # logits, so it affects the loss but not the argmax used for accuracy.
        self.y_out = tf.add(self.linear_layer, self.interaction_layer)

    # Loss
    def add_loss(self):
        self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.Y, logits=self.y_out))

    # Accuracy
    def add_accuracy(self):
        self.correct_prediction = tf.equal(tf.cast(tf.argmax(self.y_out, 1), tf.float32), tf.cast(tf.argmax(self.Y, 1), tf.float32))
        self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32))

    # Training op
    def train(self):
        optimizer = tf.train.FtrlOptimizer(self.learning_rate, l1_regularization_strength=2e-2,
                                           l2_regularization_strength=0)
        self.train_op = optimizer.minimize(self.loss)

    # Build the graph
    def build_graph(self):
        self.add_input()
        self.inference()
        self.add_loss()
        self.add_accuracy()
        self.train()
Training and testing
import numpy as np

def train_model(sess, model, X_train, Y_train, batch_size, epochs=100):
    # Number of mini-batches per epoch
    num = len(X_train) // batch_size + 1
    for step in range(epochs):
        print("epochs{0}:".format(step + 1))
        for i in range(num):
            # Sample a mini-batch at random (with replacement)
            index = np.random.choice(len(X_train), batch_size)
            batch_x = X_train[index]
            batch_y = Y_train[index]
            feed_dict = {
                model.X: batch_x,
                model.Y: batch_y}
            sess.run(model.train_op, feed_dict=feed_dict)
            # Report loss and accuracy on the current mini-batch every 100 iterations
            if (i + 1) % 100 == 0:
                loss, accuracy = sess.run([model.loss, model.accuracy], feed_dict=feed_dict)
                print("Iteration {0}: with minibatch training loss = {1} accuracy = {2}"
                      .format(i + 1, loss, accuracy))

def test_model(sess, model, X_test, Y_test, batch_size):
    # Evaluate the loss on the whole test set at once (batch_size is not used here)
    print(sess.run([model.loss], feed_dict={
        model.X: X_test, model.Y: Y_test}))
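Complete code
import pandas as pd
import numpy as np
import tensorflow as tf

# loadData, FM, train_model and test_model are the definitions listed in the sections above.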
if __name__ == '__main__':
    X_train, Y_train, X_test, Y_test = loadData()
    print(Y_test)

    # Hyperparameters
    learning_rate = 0.001
    batch_size = 64
    vec_dim = 10
    feature_length = X_train.shape[1]

    model = FM(vec_dim, learning_rate, feature_length)
    model.build_graph()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print('start training...')
        train_model(sess, model, X_train, Y_train, batch_size, epochs=10)
        print('start testing...')
        test_model(sess, model, X_test, Y_test, batch_size)