Short video recommendation based on Surprise collaborative filtering

foreword

The previous article introduced the realization of simple content recommendation through the basic web project structure, which is not so much a recommendation as a sorting algorithm. Because the popularity calculation method solves the dynamic quality of the timeliness of the content. But compared to users, what everyone sees is almost the same content (the difference may be just the top or bottom of a certain video at a certain time), and it has not been personalized for thousands of people.

Nevertheless, content-based popularity recommendation still has its unique application scenario—hot list. So you only need to change this function to another module, and leave personalized recommendations to algorithms that are better at doing this.

Of course, there are many ways to make a recommendation system, such as spark at the platform level and Surprise that I will talk about today. At the method level, deep learning can be used, collaborative filtering can be used, or combined together, etc. Big manufacturers may be more complete. There are many channels in the recall stage, such as recognizing video content based on convolutional frame capture, text similarity calculation and existing data support, followed by cleaning, rough sorting, fine sorting, rearrangement, etc. And other processes, maybe they are more to ensure the diversity of platform content.

Then we still focus on getting started and using it here, so that our projects can be quickly connected to personalized recommendations. The following is the connection with Surprise in the reason PHP project structure to achieve similarity recommendations between users and items.

environment

python3.8
Flask2.0
pandas2.0
mysql-connector-python
surprise
openpyxl
gunicorn

Introduction

The Surprise library is a tool library for building and analyzing recommendation systems. It provides a variety of recommendation algorithms, including baseline algorithms, neighborhood methods, and matrix decomposition-based algorithms (such as SVD, PMF, SVD++, NMF), etc. A variety of similarity measurement methods are built in, such as cosine similarity, mean square deviation (MSD), Pearson correlation coefficient, etc. These similarity measures can be used to evaluate the similarity between users, thus providing important data support for recommender systems.

Collaborative filtering dataset

Since the collaborative filtering recommendation is to be completed based on the tool library, it is naturally necessary to follow the standards of the library. Surprise is also similar to most collaborative filtering frameworks. The data set only needs users to rate an item. If you don’t have a free Movielens or Jester online, the following is the table I created based on the business, for your own reference.

CREATE TABLE `short_video_rating` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(120) DEFAULT '',
  `item_id` int(11) DEFAULT '0',
  `rating` int(11) unsigned DEFAULT '0' COMMENT '评分',
  `scoring_set` json DEFAULT NULL COMMENT '行为集合',
  `create_time` int(11) DEFAULT '0',
  `action_day_time` int(11) DEFAULT '0' COMMENT '更新当天时间',
  `update_time` int(11) DEFAULT '0' COMMENT '更新时间',
  `delete_time` int(11) DEFAULT '0' COMMENT '删除时间',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=107 DEFAULT CHARSET=utf8mb4 COMMENT='用户对视频评分表';

Business introduction

The web service end records scoring records according to preset standards at the place where the user operates through the interface or embedded point. When the scoring table has data, use python to convert the SQL record into a table and then import it into Surprise, train according to different algorithms, and finally return the corresponding recommended top list according to the received parameters. The python part is a service started by Flask, which interacts with php through http, which will be explained later in snippet code.

coding part

1. PHP request encapsulation

<?php
/**
 * Created by ZERO开发.
 * User: 北桥苏
 * Date: 2023/6/26 0026
 * Time: 14:43
 */

namespace app\common\service;


class Recommend
{
    private $condition;

    private $cfRecommends = [];

    private $output = [];

    public function __construct($flag = 1, $lastRecommendIds = [], $userId = "")
    {
        $this->condition['flag'] = $flag;
        $this->condition['last_recommend_ids'] = $lastRecommendIds;
        $this->condition['user_id'] = $userId;
    }

    public function addObserver($cfRecommend)
    {
        $this->cfRecommends[] = $cfRecommend;
    }

    public function startRecommend()
    {
        foreach ($this->cfRecommends as $cfRecommend) {
            $res = $cfRecommend->recommend($this->condition);
            $this->output = array_merge($res, $this->output);
        }

        $this->output = array_values(array_unique($this->output));

        return $this->output;
    }
}


abstract class cfRecommendBase
{

    protected $cfGatewayUrl = "127.0.0.1:6016";
    protected $limit = 15;

    public function __construct($limit = 15)
    {
        $this->limit = $limit;
        $this->cfGatewayUrl = config('api.video_recommend.gateway_url');
    }

    abstract public function recommend($condition);
}


class mcf extends cfRecommendBase
{
    public function recommend($condition)
    {
        //echo "mcf\n";
        $videoIdArr = [];

        $flag = $condition['flag'] ?? 1;
        $userId = $condition['user_id'] ?? '';
        $url = "{$this->cfGatewayUrl}/mcf_recommend";

        if ($flag == 1 && $userId) {
            //echo "mcf2\n";
            $param['raw_uid'] = (string)$userId;
            $param['top_k'] = $this->limit;

            $list = httpRequest($url, $param, 'json');
            $videoIdArr = json_decode($list, true) ?? [];
        }

        return $videoIdArr;
    }
}


class icf extends cfRecommendBase
{
    public function recommend($condition)
    {
        //echo "icf\n";
        $videoIdArr = [];

        $flag = $condition['flag'] ?? 1;
        $userId = $condition['user_id'] ?? '';
        $lastRecommendIds = $condition['last_recommend_ids'] ?? [];
        $url = "{$this->cfGatewayUrl}/icf_recommend";

        if ($flag > 1 && $lastRecommendIds && $userId) {
            //echo "icf2\n";
            $itemId = $lastRecommendIds[0] ?? 0;
            $param['raw_item_id'] = $itemId;
            $param['top_k'] = $this->limit;

            $list = httpRequest($url, $param, 'json');
            $videoIdArr = json_decode($list, true) ?? [];
        }

        return $videoIdArr;
    }
}

2. PHP initiates recommendation acquisition

Considering the lack of video inventory in the early stage, the method of combining the popularity list with collaborative filtering is adopted. The front-end obtains video recommendations, and the interface returns the video recommendation list with the identification (page number) of the next request. This pagination number is used for the pagination placed on the leaderboard list when the collaborative filtering service is down or there is no recommendation. But it is also necessary to ensure that the number of pages is actually valid, so when the page number is too large and no data is returned, it is recursively reset to the first page, and the page number is also returned to the front end to make data acquisition smoother.

public static function recommend($flag, $videoIds, $userId)
    {
        $nexFlag = $flag + 1;
        $formatterVideoList = [];

        try {
            // 协同过滤推荐
            $isOpen = config('api.video_recommend.is_open');
            $cfVideoIds = [];
            if ($isOpen == 1) {
                $recommend = new Recommend($flag, $videoIds, $userId);
                $recommend->addObserver(new mcf(15));
                $recommend->addObserver(new icf(15));
                $cfVideoIds = $recommend->startRecommend();
            }

            // 已读视频
            $nowTime = strtotime(date('Ymd'));
            $timeBefore = $nowTime - 60 * 60 * 24 * 100;
            $videoIdsFilter = self::getUserVideoRatingByTime($userId, $timeBefore);
            $cfVideoIds = array_diff($cfVideoIds, $videoIdsFilter);

            // 违规视频过滤
            $videoPool = [];
            $cfVideoIds && $videoPool = ShortVideoModel::listByOrderRaw($cfVideoIds, $flag);

            // 冷启动推荐
            !$videoPool && $videoPool = self::hotRank($userId, $videoIdsFilter, $flag);

            if ($videoPool) {
                list($nexFlag, $videoList) = $videoPool;
                $formatterVideoList = self::formatterVideoList($videoList, $userId);
            }
        } catch (\Exception $e) {
            $preFileName = str::snake(__FUNCTION__);
            $path = self::getClassName();
            write_log("msg:" . $e->getMessage(), $preFileName . "_error", $path);
        }

        return [$nexFlag, $formatterVideoList];
    }

3. Dataset Generation

import os
import mysql.connector
import datetime
import pandas as pd

now = datetime.datetime.now()
year = now.year
month = now.month
day = now.day
fullDate = str(year) + str(month) + str(day)

dir_data = './collaborative_filtering/cf_excel'
file_path = '{}/dataset_{}.xlsx'.format(dir_data, fullDate)
db_config = {
    "host": "127.0.0.1",
    "database": "database",
    "user": "user",
    "password": "password"
}

if not os.path.exists(file_path):
    cnx = mysql.connector.connect(user=db_config['user'], password=db_config['password'],
                                  host=db_config['host'], database=db_config['database'])

    df = pd.read_sql_query("SELECT user_id, item_id, rating FROM short_video_rating", cnx)

    print('---------------插入数据集----------------')

    # 将数据帧写入Excel文件
    df.to_excel(file_path, index=False)

if not os.path.exists(file_path):
    raise IOError("Dataset file is not exists!")

4. Collaborative filtering service

import os

from flask import Flask, request, json, Response, abort
from collaborative_filtering import cf_item
from collaborative_filtering import cf_user
from collaborative_filtering import cf_mix
from werkzeug.middleware.proxy_fix import ProxyFix

app = Flask(__name__)

@app.route('/')
def hello_world():
    return abort(404)

@app.route('/mcf_recommend', methods=["POST", "GET"])
def get_mcf_recommendation():
    json_data = request.get_json()

    raw_uid = json_data.get("raw_uid")
    top_k = json_data.get("top_k")

    recommend_result = cf_mix.collaborative_fitlering(raw_uid, top_k)

    return Response(json.dumps(recommend_result), mimetype='application/json')

@app.route('/ucf_recommend', methods=["POST", "GET"])
def get_ucf_recommendation():
    json_data = request.get_json()

    raw_uid = json_data.get("raw_uid")
    top_k = json_data.get("top_k")

    recommend_result = cf_user.collaborative_fitlering(raw_uid, top_k)

    return Response(json.dumps(recommend_result), mimetype='application/json')

@app.route('/icf_recommend', methods=["POST", "GET"])
def get_icf_recommendation():
    json_data = request.get_json()

    raw_item_id = json_data.get("raw_item_id")
    top_k = json_data.get("top_k")

    recommend_result = cf_item.collaborative_fitlering(raw_item_id, top_k)

    return Response(json.dumps(recommend_result), mimetype='application/json')

if __name__ == '__main__':
    app.run(host="0.0.0.0",
            debug=True,
            port=6016
            )

5. Based on user recommendation

# -*- coding: utf-8 -*-
# @File    : cf_recommendation.py
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)
from collections import defaultdict

import os
from surprise import Dataset
from surprise import Reader
from surprise import BaselineOnly
from surprise import KNNBasic
from surprise import KNNBaseline
from heapq import nlargest
import pandas as pd
import datetime
import time

def get_top_n(predictions, n=10):
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    for uid, user_ratings in top_n.items():
        top_n[uid] = nlargest(n, user_ratings, key=lambda s: s[1])

    return top_n

class PredictionSet():

    def __init__(self, algo, trainset, user_raw_id=None, k=40):
        self.algo = algo
        self.trainset = trainset
        self.k = k
        if user_raw_id is not None:
            self.r_uid = user_raw_id
            self.i_uid = trainset.to_inner_uid(user_raw_id)
            self.knn_userset = self.algo.get_neighbors(self.i_uid, self.k)
            user_items = set([j for (j, _) in self.trainset.ur[self.i_uid]])
            self.neighbor_items = set()
            for nnu in self.knn_userset:
                for (j, _) in trainset.ur[nnu]:
                    if j not in user_items:
                        self.neighbor_items.add(j)

    def user_build_anti_testset(self, fill=None):
        fill = self.trainset.global_mean if fill is None else float(fill)

        anti_testset = []
        user_items = set([j for (j, _) in self.trainset.ur[self.i_uid]])
        anti_testset += [(self.r_uid, self.trainset.to_raw_iid(i), fill) for
                         i in self.neighbor_items if
                         i not in user_items]
        return anti_testset

def user_build_anti_testset(trainset, user_raw_id, fill=None):
    fill = trainset.global_mean if fill is None else float(fill)

    i_uid = trainset.to_inner_uid(user_raw_id)

    anti_testset = []

    user_items = set([j for (j, _) in trainset.ur[i_uid]])

    anti_testset += [(user_raw_id, trainset.to_raw_iid(i), fill) for
                     i in trainset.all_items() if
                     i not in user_items]

    return anti_testset


# ================= surprise 推荐部分 ====================
def collaborative_fitlering(raw_uid, top_k):

    now = datetime.datetime.now()
    year = now.year
    month = now.month
    day = now.day
    fullDate = str(year) + str(month) + str(day)

    dir_data = './collaborative_filtering/cf_excel'
    file_path = '{}/dataset_{}.xlsx'.format(dir_data, fullDate)

    if not os.path.exists(file_path):
        raise IOError("Dataset file is not exists!")

    # 读取数据集#####################
    alldata = pd.read_excel(file_path)

    reader = Reader(line_format='user item rating')
    dataset = Dataset.load_from_df(alldata, reader=reader)

    # 所有数据生成训练集
    trainset = dataset.build_full_trainset()

    # ================= BaselineOnly  ==================
    bsl_options = {'method': 'sgd', 'learning_rate': 0.0005}
    algo_BaselineOnly = BaselineOnly(bsl_options=bsl_options)
    algo_BaselineOnly.fit(trainset)

    # 获得推荐结果
    rset = user_build_anti_testset(trainset, raw_uid)

    # 测试休眠5秒，让客户端超时
    # time.sleep(5)
    # print(rset)
    # exit()

    predictions = algo_BaselineOnly.test(rset)
    top_n_baselineonly = get_top_n(predictions, n=5)

    # ================= KNNBasic  ==================
    sim_options = {'name': 'pearson', 'user_based': True}
    algo_KNNBasic = KNNBasic(sim_options=sim_options)
    algo_KNNBasic.fit(trainset)

    # 获得推荐结果  ---  只考虑 knn 用户的
    predictor = PredictionSet(algo_KNNBasic, trainset, raw_uid)
    knn_anti_set = predictor.user_build_anti_testset()
    predictions = algo_KNNBasic.test(knn_anti_set)
    top_n_knnbasic = get_top_n(predictions, n=top_k)

    # ================= KNNBaseline  ==================
    sim_options = {'name': 'pearson_baseline', 'user_based': True}
    algo_KNNBaseline = KNNBaseline(sim_options=sim_options)
    algo_KNNBaseline.fit(trainset)

    # 获得推荐结果  ---  只考虑 knn 用户的
    predictor = PredictionSet(algo_KNNBaseline, trainset, raw_uid)
    knn_anti_set = predictor.user_build_anti_testset()
    predictions = algo_KNNBaseline.test(knn_anti_set)
    top_n_knnbaseline = get_top_n(predictions, n=top_k)

    # =============== 按比例生成推荐结果 ==================
    recommendset = set()
    for results in [top_n_baselineonly, top_n_knnbasic, top_n_knnbaseline]:
        for key in results.keys():
            for recommendations in results[key]:
                iid, rating = recommendations
                recommendset.add(iid)

    items_baselineonly = set()
    for key in top_n_baselineonly.keys():
        for recommendations in top_n_baselineonly[key]:
            iid, rating = recommendations
            items_baselineonly.add(iid)

    items_knnbasic = set()
    for key in top_n_knnbasic.keys():
        for recommendations in top_n_knnbasic[key]:
            iid, rating = recommendations
            items_knnbasic.add(iid)

    items_knnbaseline = set()
    for key in top_n_knnbaseline.keys():
        for recommendations in top_n_knnbaseline[key]:
            iid, rating = recommendations
            items_knnbaseline.add(iid)

    rank = dict()
    for recommendation in recommendset:
        if recommendation not in rank:
            rank[recommendation] = 0
        if recommendation in items_baselineonly:
            rank[recommendation] += 1
        if recommendation in items_knnbasic:
            rank[recommendation] += 1
        if recommendation in items_knnbaseline:
            rank[recommendation] += 1

    max_rank = max(rank, key=lambda s: rank[s])
    if max_rank == 1:
        return list(items_baselineonly)
    else:
        result = nlargest(top_k, rank, key=lambda s: rank[s])

        return list(result)

        # print("排名结果: {}".format(result))

6. Item-based recommendation

# -*- coding: utf-8 -*-
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)
from collections import defaultdict

import io
import os
from surprise import SVD, KNNBaseline, Reader, Dataset
import pandas as pd
import datetime
import mysql.connector
import pickle

# ================= surprise 推荐部分 ====================
def collaborative_fitlering(raw_item_id, top_k):

    now = datetime.datetime.now()
    year = now.year
    month = now.month
    day = now.day
    fullDate = str(year) + str(month) + str(day)

    # dir_data = './collaborative_filtering/cf_excel'
    dir_data = './cf_excel'
    file_path = '{}/dataset_{}.xlsx'.format(dir_data, fullDate)

    if not os.path.exists(file_path):
        raise IOError("Dataset file is not exists!")

    # 读取数据集#####################
    alldata = pd.read_excel(file_path)

    reader = Reader(line_format='user item rating')
    dataset = Dataset.load_from_df(alldata, reader=reader)

    # 使用协同过滤必须有这行，将我们的算法运用于整个数据集，而不进行交叉验证，构建了新的矩阵
    trainset = dataset.build_full_trainset()

    # print(pd.DataFrame(list(trainset.global_mean())))
    # exit()

    # 度量准则：pearson距离，协同过滤：基于item
    sim_options = {'name': 'pearson_baseline', 'user_based': False}
    algo = KNNBaseline(sim_options=sim_options)
    algo.fit(trainset)

    # 将训练好的模型序列化到磁盘上
    # with open('./cf_models/cf_item_model.pkl', 'wb') as f:
    #     pickle.dump(algo, f)

    #从磁盘中读取训练好的模型
    # with open('cf_item_model.pkl', 'rb') as f:
    #     algo = pickle.load(f)

    # 转换为内部id
    toy_story_inner_id = algo.trainset.to_inner_iid(raw_item_id)
    # 根据内部id找到最近的10个邻居
    toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=top_k)
    # 将10个邻居的内部id转换为item id也就是raw
    toy_story_neighbors_rids = (algo.trainset.to_raw_iid(inner_id) for inner_id in toy_story_neighbors)

    result = list(toy_story_neighbors_rids)

    return result

    # print(list(toy_story_neighbors_rids))


if __name__ == "__main__":
    res = collaborative_fitlering(15, 20)
    print(res)

other

1. Recommended service production deployment

In the development environment, it can be started through python recommend_service.py, and gunicorn is required for the later deployment environment by configuring environment variables after installation. Import werkzeug.middleware.proxy_fix in the code, modify the following startup part, and change startup to gunicorn -w 5 -b 0.0.0.0:6016 app:app

app.wsgi_app = ProxyFix(app.wsgi_app)
app.run()

2. Save the model locally

With the accumulation of business data, the data sets that need to be trained naturally become larger and larger, so the model training cycle can be shortened later. That is, after training the model regularly, save it locally, and then make recommendations based on the online data. The method of storing and reading the model is as follows.

2.1. Model Storage

sim_options = {'name': 'pearson_baseline', 'user_based': False}
    algo = KNNBaseline(sim_options=sim_options)
    algo.fit(trainset)

    # 将训练好的模型序列化到磁盘上
    with open('./cf_models/cf_item_model.pkl', 'wb') as f:
        pickle.dump(algo, f)

2.2. Model reading

    with open('cf_item_model.pkl', 'rb') as f:
        algo = pickle.load(f)

    # 转换为内部id
    toy_story_inner_id = algo.trainset.to_inner_iid(raw_item_id)
    # 根据内部id找到最近的10个邻居
    toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=top_k)
    # 将10个邻居的内部id转换为item id也就是raw
    toy_story_neighbors_rids = (algo.trainset.to_raw_iid(inner_id) for inner_id in toy_story_neighbors)

    result = list(toy_story_neighbors_rids)

    return result

write at the end

The above is still only a small part of the recommendation system. When doing data recall, it can cut frames from the video or separate the audio, and use the convolutional neural network to identify the type of audio and the general content of the video. Then, based on the tags formed by the user's past browsing records, content matching and so on are realized. This needs to be continuously learned and improved in the later stage.