Click-Through Rate (CTR) Prediction — PNN: Theory and Practice

The PNN Model — Strengthening Feature-Interaction Capability

In 2017, researchers at the National University of Singapore proposed NeuralCF, a deep-learning-based collaborative filtering model. Built on two Embedding layers — a user vector and an item vector — it crosses and combines features through interaction layers of different kinds, which can be flexibly stacked. Its core idea is to replace the dot product of classical collaborative filtering with a multi-layer neural network, strengthening the model's expressive power. But NeuralCF only fuses two groups of feature vectors, user and item; what if we want to feed in many kinds of features? In 2016, researchers at Shanghai Jiao Tong University proposed the PNN model, which offers several design ideas for how features can interact.
(Figure: PNN model structure)

The PNN paper is available at: https://arxiv.org/pdf/1611.00144.pdf
Main advantages of PNN:

  • Compared with NeuralCF, it can incorporate features of different forms and from different sources;
  • PNN's architectural innovation is the introduction of the product layer (the inner-product and outer-product operations);
  • Compared with FNN, it captures high-order feature interactions while also retaining low-order features, and it needs no two-stage training.

Disadvantages of PNN:

  • The outer-product operation is heavily simplified for training efficiency;
  • All features are crossed indiscriminately, which to some extent ignores the valuable information carried by the original feature vectors.

Algorithm Principle

This section focuses on the product layer, which consists of two parts: lz, the linear signals, and lp, the quadratic signals. First, the preliminary formulas:
l_1 = relu(l_z + l_p + b_1)

Equation 3: l_1 is the input to the first hidden layer; it consists of the linear part l_z, the product part l_p, and the bias b_1.
A ⊙ B ≜ Σ_{i,j} A_{i,j} B_{i,j}

Equation 4: defines the ⊙ rule used below for "multiplying" the weights with z and p — an element-wise product followed by a sum over all entries.
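As a concrete illustration (toy matrices of my own choosing, not values from the paper), the ⊙ rule is nothing more than an element-wise product followed by a sum over all entries:

```python
import numpy as np

# A ⊙ B = sum over all (i, j) of A[i, j] * B[i, j]
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

odot = np.sum(A * B)  # 1*5 + 2*6 + 3*7 + 4*8 = 70
print(odot)
```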

l_z = (l_z^1, l_z^2, …, l_z^{D1}),  l_z^n = W_z^n ⊙ z
l_p = (l_p^1, l_p^2, …, l_p^{D1}),  l_p^n = W_p^n ⊙ p

Equation 5: computes the linear part l_z and the product part l_p. D1 is the number of nodes in the first hidden layer.

So how are z and p themselves computed?

z = (z_1, z_2, …, z_N) ≜ (f_1, f_2, …, f_N)
p = {p_{i,j}},  i = 1…N, j = 1…N,  p_{i,j} = g(f_i, f_j)

Equation 6: the triangle symbol ≜ means "defined as"; for example z_1 ≜ f_1, where f_1 is the embedding vector of the first field (obtained from Equation 8 in the paper).
Equation 7: p is formed by applying g to every pair of vectors in f; g is either the inner product or the outer product.

IPNN Analysis — the lp Part

Analogous to matrix factorization, approximate the weight matrix as W_p^n = θ^n (θ^n)^T, and rewrite the formula:
l_p^n = W_p^n ⊙ p = Σ_{i=1}^{N} Σ_{j=1}^{N} θ_i^n θ_j^n ⟨f_i, f_j⟩
The derivation is as follows:

W_p^n ⊙ p = Σ_i Σ_j ⟨θ_i^n f_i, θ_j^n f_j⟩ = ⟨Σ_i δ_i^n, Σ_j δ_j^n⟩ = ‖Σ_{i=1}^{N} δ_i^n‖²
where δ_i^n = θ_i^n f_i and δ_i^n ∈ R^M. Combining Equation 5 with Equation 11 yields the final formula:
l_p = (‖Σ_i δ_i^1‖², …, ‖Σ_i δ_i^n‖², …, ‖Σ_i δ_i^{D1}‖²)
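The identity above is easy to sanity-check numerically; the following NumPy sketch uses arbitrary toy values for N, M, θ^n, and the embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3                      # N fields, embedding dimension M
f = rng.normal(size=(N, M))      # embedding vectors f_1 .. f_N
theta = rng.normal(size=N)       # theta^n for one output node

# left side: W_p^n ⊙ p, with W_p^n = theta theta^T and p_{ij} = <f_i, f_j>
W = np.outer(theta, theta)
p = f @ f.T
lhs = np.sum(W * p)

# right side: squared norm of sum_i delta_i^n, where delta_i^n = theta_i^n * f_i
delta_sum = (theta[:, None] * f).sum(axis=0)
rhs = np.sum(delta_sum ** 2)

assert np.isclose(lhs, rhs)
```

This is exactly the trick implemented in the product_inner branch of the code below: it turns the O(N²·M) pairwise computation into an O(N·M) weighted sum per node.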

OPNN Analysis — the lp Part

Replacing the inner product with the outer product as the feature-interaction operation yields PNN's other form, OPNN.
The only difference from IPNN is the formula below; everything else is essentially the same:
p = Σ_{i=1}^{N} Σ_{j=1}^{N} f_i f_j^T = f_Σ (f_Σ)^T,  where f_Σ = Σ_{i=1}^{N} f_i
First compute p with Equation 14, then proceed exactly as in IPNN to obtain the final lp.
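Equation 14's superposition trick can be verified the same way, with toy embeddings: summing all pairwise outer products equals the outer product of the summed vector with itself.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 3
f = rng.normal(size=(N, M))

# sum of all N*N pairwise outer products f_i f_j^T
p_pairwise = sum(np.outer(f[i], f[j]) for i in range(N) for j in range(N))

# single outer product of the field sum with itself
f_sum = f.sum(axis=0)
p_superposed = np.outer(f_sum, f_sum)

assert np.allclose(p_pairwise, p_superposed)
```

This reduces building p from N² outer products to a single one, at the cost of losing information — which is exactly the "heavy simplification" listed among PNN's drawbacks above.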

Code Implementation

The dataset is MovieLens 100K. For simplicity — the goal is only to demonstrate the PNN implementation — only uid and item_id are used as input features, with rating as the label.

Dataset

u.item: movie information

 movie id | movie title | release date | video release date |IMDb URL |unknown | Action | Adventure | Animation |Children's | Comedy | Crime |Documentary | Drama | Fantasy |Film-Noir | Horror | Musical | Mystery |Romance | Sci-Fi |Thriller | War | Western
 
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

u.user: user information

user id | age | gender | occupation | zip code

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067

ua.base: training set
ua.test: test set

user id | item id | rating | timestamp
1	1	5	874965758
1	2	3	876893171
1	3	4	878542960

Data Processing

uid and item_id are one-hot encoded, and rating becomes the label: ratings lie in [1, 5], and a rating greater than 3 is mapped to 1 (the user is interested), otherwise to 0 (not interested).


import pandas as pd


# load the data
def loadData():
    # user info (only uid is used)
    userInfo = pd.read_csv('../data/u.user', sep='|', names=['uid', 'age', 'gender', 'occupation', 'zip code'])
    uid_ = userInfo['uid']
    userId_dum = pd.get_dummies(userInfo['uid'], prefix='uid_')
    userId_dum['uid'] = uid_

    # item info (only item_id is used)
    header = ['item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children',
              'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
              'Thriller', 'War', 'Western']
    ItemInfo = pd.read_csv('../data/u.item', sep='|', names=header, encoding="ISO-8859-1")
    ItemInfo = ItemInfo.drop(columns=['title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown'])
    item_id_ = ItemInfo['item_id']
    item_Id_dum = pd.get_dummies(ItemInfo['item_id'], prefix='item_id_')
    item_Id_dum['item_id'] = item_id_

    # training data
    trainData = pd.read_csv('../data/ua.base', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    trainData = trainData.drop(columns=['time'])

    # binarize the label: rating > 3 -> 1 (interested), otherwise 0
    trainData['rating'] = trainData.rating.apply(lambda x: 1 if int(x) > 3 else 0)

    Y_train = pd.get_dummies(trainData['rating'], prefix='y_')

    X_train = pd.merge(trainData, userId_dum, how='left')
    X_train = pd.merge(X_train, item_Id_dum, how='left')
    X_train = X_train.drop(columns=['uid', 'item_id', 'rating'])

    # test data
    testData = pd.read_csv('../data/ua.test', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    testData = testData.drop(columns=['time'])

    testData['rating'] = testData.rating.apply(lambda x: 1 if int(x) > 3 else 0)
    Y_test = pd.get_dummies(testData['rating'], prefix='y_')

    X_test = pd.merge(testData, userId_dum, how='left')
    X_test = pd.merge(X_test, item_Id_dum, how='left')
    X_test = X_test.drop(columns=['uid', 'item_id', 'rating'])

    # two fields: uid (from the user data) and item_id (from the item data)
    userField = ['uid']
    itemField = ['item_id']
    field = userField + itemField

    # number of one-hot columns in each field
    userFieldLen = [len(uid_)]
    itemFieldLen = [len(item_id_)]

    field_len = userFieldLen + itemFieldLen
    # cumulative offsets: field i occupies columns [field_arange[i], field_arange[i+1])
    field_arange = [0]
    for field_n in range(len(field)):
        field_arange.append(field_arange[field_n] + field_len[field_n])
    return X_train.values, Y_train.values, X_test.values, Y_test.values, field_arange, len(field)
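The offset bookkeeping at the end of loadData can be illustrated in isolation (using 943 users and 1682 movies, the sizes of MovieLens 100K):

```python
# each field occupies a contiguous block of one-hot columns;
# field_arange[i] is the first column of field i
field_len = [943, 1682]  # users, movies in MovieLens 100K

field_arange = [0]
for n in range(len(field_len)):
    field_arange.append(field_arange[n] + field_len[n])

print(field_arange)  # [0, 943, 2625]
```

The embedding layer later uses consecutive pairs of these offsets to slice each field's one-hot block out of the input.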

The PNN Model

import tensorflow as tf


class PNN:
    def __init__(self,vec_dim,learning_rate,feature_length,field_arange,field_len,dnn_layers,dropout_rate,product_way,lamda):
        self.vec_dim=vec_dim
        self.learning_rate=learning_rate
        self.feature_length=feature_length
        self.field_arange=field_arange
        self.field_len=field_len
        self.dnn_layers=dnn_layers
        self.dropout_rate=dropout_rate
        self.product_way = product_way
        self.lamda = float(lamda)
        self.l2_reg = tf.contrib.layers.l2_regularizer(self.lamda)

    def add_input(self):
        self.X = tf.placeholder(tf.float32,name='input_x')
        self.Y = tf.placeholder(tf.float32, shape=[None,2], name='input_y')

    # build the forward computation graph
    def inference(self):
        with tf.variable_scope('Embedding_layer'):
            Embedding = [tf.get_variable(name='Embedding_%d'%i,
                                         shape=[self.field_arange[i+1]-self.field_arange[i], self.vec_dim],
                                         dtype=tf.float32) for i in range(self.field_len)]
            Embedding_layer = tf.concat([tf.matmul(tf.slice(self.X,[0,self.field_arange[i]],
                                                            [-1,self.field_arange[i+1]-self.field_arange[i]]), Embedding[i])
                                         for i in range(self.field_len)], axis=1)

        with tf.variable_scope('linear_part'):
            linear_w = tf.get_variable(shape=[self.field_len * self.vec_dim, self.dnn_layers[0]], dtype=tf.float32,
                                       name='linear_w',regularizer=self.l2_reg)
            self.lz = tf.matmul(Embedding_layer, linear_w)

        with tf.variable_scope('product_part'):
            Embedding_layer = tf.reshape(Embedding_layer, shape=[-1, self.field_len, self.vec_dim])
            lp = []
            if self.product_way=='product_inner':
                inner_product_w = tf.get_variable(name='inner_product_w', shape=[self.dnn_layers[0], self.field_len], dtype=tf.float32,regularizer=self.l2_reg)
                # compute each of the D1 output nodes separately
                for i in range(self.dnn_layers[0]):
                    # Equations 11 and 12 (the optimized form): weight each field's
                    # embedding by theta_i^n, sum over fields, then take the squared norm
                    t = tf.reduce_sum(tf.multiply(Embedding_layer, tf.expand_dims(inner_product_w[i], axis=1)), axis=1)
                    lp.append(tf.reduce_sum(tf.square(t), axis=1, keep_dims=True))
            else:
                outer_product_w = tf.get_variable(name='outer_product_w', shape=[self.dnn_layers[0], self.vec_dim, self.vec_dim], dtype=tf.float32,regularizer=self.l2_reg)

                # Equation 14: sum the field embeddings first (the only step
                # that differs from the inner-product case) ...
                field_sum = tf.reduce_sum(Embedding_layer, axis=1)
                # ... then take the outer product of the sum with itself to get p;
                # each node then sums all entries of p weighted by its W matrix
                p = tf.matmul(tf.expand_dims(field_sum, axis=2), tf.expand_dims(field_sum, axis=1))
                for i in range(self.dnn_layers[0]):
                    lp_i = tf.multiply(p, tf.expand_dims(outer_product_w[i], axis=0))
                    lp.append(tf.expand_dims(tf.reduce_sum(lp_i, axis=[1,2]), axis=1))
            self.lp = tf.concat(lp, axis=1)
            b = tf.get_variable(name='b', shape=[self.dnn_layers[0]], dtype=tf.float32)
            self.product_layer = tf.nn.relu(self.lz+self.lp+b)
        x = self.product_layer
        in_node = self.dnn_layers[0]
        with tf.variable_scope('dnn_part'):
            for i in range(1, len(self.dnn_layers)):
                out_node = self.dnn_layers[i]
                w = tf.get_variable(name='w_%d'%i, shape=[in_node, out_node], dtype=tf.float32,regularizer=self.l2_reg)
                b = tf.get_variable(name='b_%d'%i, shape=[out_node], dtype=tf.float32)
                x = tf.matmul(x, w) + b
                if out_node == 2:
                    self.y_out = x
                else:
                    x = tf.layers.dropout(tf.nn.relu(x), rate=self.dropout_rate)
                in_node = out_node

    def add_loss(self):
        # the labels are one-hot over two classes, so use softmax cross-entropy
        # (not independent sigmoids), and add the L2 terms registered via `regularizer=`
        self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.Y, logits=self.y_out))
        self.loss += tf.losses.get_regularization_loss()

    # accuracy
    def add_accuracy(self):
        self.correct_prediction = tf.equal(tf.argmax(self.y_out, 1), tf.argmax(self.Y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32))

    # optimizer
    def train(self):
        optimizer = tf.train.AdagradOptimizer(self.learning_rate)
        self.train_op = optimizer.minimize(self.loss)

    # build the graph
    def build_graph(self):
        self.add_input()
        self.inference()
        self.add_loss()
        self.add_accuracy()
        self.train()

Training and Testing


import numpy as np
from sklearn import metrics


def train_model(sess, model, X_train, Y_train, batch_size, epochs=100):
    num = len(X_train) // batch_size+1
    for step in range(epochs):
        print("epochs{0}:".format(step+1))
        for i in range(num):
            index = np.random.choice(len(X_train), batch_size)
            batch_x = X_train[index]
            batch_y = Y_train[index]
            feed_dict = {model.X: batch_x,
                         model.Y: batch_y}
            sess.run(model.train_op, feed_dict=feed_dict)

            if (i+1)%100==0:
                loss ,accuracy,y_out= sess.run([model.loss,model.accuracy,model.y_out], feed_dict=feed_dict)
                auc = metrics.roc_auc_score(batch_y, y_out)
                print("Iteration {0}: with minibatch training loss = {1} accuracy = {2} auc={3}"
                      .format(step+1, loss,accuracy,auc))

def test_model(sess, model, X_test, Y_test):
    loss, y_out, accuracy = sess.run([model.loss, model.y_out, model.accuracy],
                                     feed_dict={model.X: X_test, model.Y: Y_test})

    print("loss={0} accuracy={1} auc={2}".format(loss, accuracy, metrics.roc_auc_score(Y_test, y_out)))

Complete Code

import numpy as np
import pandas as pd
import tensorflow as tf




from sklearn import metrics


# load the data
def loadData():
    # user info (only uid is used)
    userInfo = pd.read_csv('../data/u.user', sep='|', names=['uid', 'age', 'gender', 'occupation', 'zip code'])
    uid_ = userInfo['uid']
    userId_dum = pd.get_dummies(userInfo['uid'], prefix='uid_')
    userId_dum['uid'] = uid_

    # item info (only item_id is used)
    header = ['item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children',
              'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
              'Thriller', 'War', 'Western']
    ItemInfo = pd.read_csv('../data/u.item', sep='|', names=header, encoding="ISO-8859-1")
    ItemInfo = ItemInfo.drop(columns=['title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown'])
    item_id_ = ItemInfo['item_id']
    item_Id_dum = pd.get_dummies(ItemInfo['item_id'], prefix='item_id_')
    item_Id_dum['item_id'] = item_id_

    # training data
    trainData = pd.read_csv('../data/ua.base', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    trainData = trainData.drop(columns=['time'])

    # binarize the label: rating > 3 -> 1 (interested), otherwise 0
    trainData['rating'] = trainData.rating.apply(lambda x: 1 if int(x) > 3 else 0)

    Y_train = pd.get_dummies(trainData['rating'], prefix='y_')

    X_train = pd.merge(trainData, userId_dum, how='left')
    X_train = pd.merge(X_train, item_Id_dum, how='left')
    X_train = X_train.drop(columns=['uid', 'item_id', 'rating'])

    # test data
    testData = pd.read_csv('../data/ua.test', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    testData = testData.drop(columns=['time'])

    testData['rating'] = testData.rating.apply(lambda x: 1 if int(x) > 3 else 0)
    Y_test = pd.get_dummies(testData['rating'], prefix='y_')

    X_test = pd.merge(testData, userId_dum, how='left')
    X_test = pd.merge(X_test, item_Id_dum, how='left')
    X_test = X_test.drop(columns=['uid', 'item_id', 'rating'])

    # two fields: uid (from the user data) and item_id (from the item data)
    userField = ['uid']
    itemField = ['item_id']
    field = userField + itemField

    # number of one-hot columns in each field
    userFieldLen = [len(uid_)]
    itemFieldLen = [len(item_id_)]

    field_len = userFieldLen + itemFieldLen
    # cumulative offsets: field i occupies columns [field_arange[i], field_arange[i+1])
    field_arange = [0]
    for field_n in range(len(field)):
        field_arange.append(field_arange[field_n] + field_len[field_n])
    return X_train.values, Y_train.values, X_test.values, Y_test.values, field_arange, len(field)

class PNN:
    def __init__(self,vec_dim,learning_rate,feature_length,field_arange,field_len,dnn_layers,dropout_rate,product_way,lamda):
        self.vec_dim=vec_dim
        self.learning_rate=learning_rate
        self.feature_length=feature_length
        self.field_arange=field_arange
        self.field_len=field_len
        self.dnn_layers=dnn_layers
        self.dropout_rate=dropout_rate
        self.product_way = product_way
        self.lamda = float(lamda)
        self.l2_reg = tf.contrib.layers.l2_regularizer(self.lamda)

    def add_input(self):
        self.X = tf.placeholder(tf.float32,name='input_x')
        self.Y = tf.placeholder(tf.float32, shape=[None,2], name='input_y')

    # build the forward computation graph
    def inference(self):
        with tf.variable_scope('Embedding_layer'):
            Embedding = [tf.get_variable(name='Embedding_%d'%i,
                                         shape=[self.field_arange[i+1]-self.field_arange[i], self.vec_dim],
                                         dtype=tf.float32) for i in range(self.field_len)]
            Embedding_layer = tf.concat([tf.matmul(tf.slice(self.X,[0,self.field_arange[i]],
                                                            [-1,self.field_arange[i+1]-self.field_arange[i]]), Embedding[i])
                                         for i in range(self.field_len)], axis=1)

        with tf.variable_scope('linear_part'):
            linear_w = tf.get_variable(shape=[self.field_len * self.vec_dim, self.dnn_layers[0]], dtype=tf.float32,
                                       name='linear_w',regularizer=self.l2_reg)
            self.lz = tf.matmul(Embedding_layer, linear_w)

        with tf.variable_scope('product_part'):
            Embedding_layer = tf.reshape(Embedding_layer, shape=[-1, self.field_len, self.vec_dim])
            lp = []
            if self.product_way=='product_inner':
                inner_product_w = tf.get_variable(name='inner_product_w', shape=[self.dnn_layers[0], self.field_len], dtype=tf.float32,regularizer=self.l2_reg)
                # compute each of the D1 output nodes separately
                for i in range(self.dnn_layers[0]):
                    # Equations 11 and 12 (the optimized form): weight each field's
                    # embedding by theta_i^n, sum over fields, then take the squared norm
                    t = tf.reduce_sum(tf.multiply(Embedding_layer, tf.expand_dims(inner_product_w[i], axis=1)), axis=1)
                    lp.append(tf.reduce_sum(tf.square(t), axis=1, keep_dims=True))
            else:
                outer_product_w = tf.get_variable(name='outer_product_w', shape=[self.dnn_layers[0], self.vec_dim, self.vec_dim], dtype=tf.float32,regularizer=self.l2_reg)

                # Equation 14: sum the field embeddings first (the only step
                # that differs from the inner-product case) ...
                field_sum = tf.reduce_sum(Embedding_layer, axis=1)
                # ... then take the outer product of the sum with itself to get p;
                # each node then sums all entries of p weighted by its W matrix
                p = tf.matmul(tf.expand_dims(field_sum, axis=2), tf.expand_dims(field_sum, axis=1))
                for i in range(self.dnn_layers[0]):
                    lp_i = tf.multiply(p, tf.expand_dims(outer_product_w[i], axis=0))
                    lp.append(tf.expand_dims(tf.reduce_sum(lp_i, axis=[1,2]), axis=1))
            self.lp = tf.concat(lp, axis=1)
            b = tf.get_variable(name='b', shape=[self.dnn_layers[0]], dtype=tf.float32)
            self.product_layer = tf.nn.relu(self.lz+self.lp+b)
        x = self.product_layer
        in_node = self.dnn_layers[0]
        with tf.variable_scope('dnn_part'):
            for i in range(1, len(self.dnn_layers)):
                out_node = self.dnn_layers[i]
                w = tf.get_variable(name='w_%d'%i, shape=[in_node, out_node], dtype=tf.float32,regularizer=self.l2_reg)
                b = tf.get_variable(name='b_%d'%i, shape=[out_node], dtype=tf.float32)
                x = tf.matmul(x, w) + b
                if out_node == 2:
                    self.y_out = x
                else:
                    x = tf.layers.dropout(tf.nn.relu(x), rate=self.dropout_rate)
                in_node = out_node

    def add_loss(self):
        # the labels are one-hot over two classes, so use softmax cross-entropy
        # (not independent sigmoids), and add the L2 terms registered via `regularizer=`
        self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.Y, logits=self.y_out))
        self.loss += tf.losses.get_regularization_loss()

    # accuracy
    def add_accuracy(self):
        self.correct_prediction = tf.equal(tf.argmax(self.y_out, 1), tf.argmax(self.Y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32))

    # optimizer
    def train(self):
        optimizer = tf.train.AdagradOptimizer(self.learning_rate)
        self.train_op = optimizer.minimize(self.loss)

    # build the graph
    def build_graph(self):
        self.add_input()
        self.inference()
        self.add_loss()
        self.add_accuracy()
        self.train()

def train_model(sess, model, X_train,Y_train,batch_size, epochs=100):
    num = len(X_train) // batch_size+1
    for step in range(epochs):
        print("epochs{0}:".format(step+1))
        for i in range(num):
            index = np.random.choice(len(X_train), batch_size)
            batch_x = X_train[index]
            batch_y = Y_train[index]
            feed_dict = {model.X: batch_x,
                         model.Y: batch_y}
            sess.run(model.train_op, feed_dict=feed_dict)

            if (i+1)%100==0:
                loss ,accuracy,y_out= sess.run([model.loss,model.accuracy,model.y_out], feed_dict=feed_dict)
                auc = metrics.roc_auc_score(batch_y, y_out)
                print("Iteration {0}: with minibatch training loss = {1} accuracy = {2} auc={3}"
                      .format(step+1, loss,accuracy,auc))

def test_model(sess, model, X_test, Y_test):
    loss, y_out, accuracy = sess.run([model.loss, model.y_out, model.accuracy],
                                     feed_dict={model.X: X_test, model.Y: Y_test})

    print("loss={0} accuracy={1} auc={2}".format(loss, accuracy, metrics.roc_auc_score(Y_test, y_out)))




if __name__ == '__main__':
    X_train,Y_train,X_test,Y_test,field_arange,field_len=loadData()
    learning_rate = 0.001
    batch_size = 128
    vec_dim = 10
    feature_length = X_train.shape[1]
    dnn_layers = [128, 128, 2]
    dropout_rate = 0.8  # note: tf.layers.dropout's rate is the fraction of units DROPPED, not kept
    lamda = 0.5

    model = PNN(vec_dim, learning_rate, feature_length, field_arange, field_len, dnn_layers,
                dropout_rate, 'product_inner', lamda)
    model.build_graph()


    with tf.Session() as sess:

        sess.run(tf.global_variables_initializer())
        print('start training...')
        train_model(sess,model,X_train,Y_train,batch_size,epochs=10)
        print('start testing...')
        test_model(sess,model,X_test,Y_test)

Reposted from blog.csdn.net/weixin_41044112/article/details/108003570