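Building a Small Recommender System with Collaborative Filtering

Posted by Tresdin on 2018-11-17

Original article: https://learnku.com/articles/19862?order_by=created_at&

Suppose we have the following user rating table: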

\	p1	p2	p3	p4	p6	p7	p8	p9
u1	1	1	0	0	0	0	0	0
u2	1	2	1	0	1	0	1	0
u3	12	2	1	0	1	0	11	0
u4	0	0	0	0	0	0	0	1
u5	0	0	0	0	0	0	0	2
u6	0	0	0	0	0	0	0	0
u7	11	0	0	7	14	1	3	19

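Each row represents a user, each column represents a product, and each entry is the score that user gave that product. Let v_i be the rating vector of user u_i. The idea behind collaborative filtering here is to use the cosine of the angle between two rating vectors as the similarity between the two users: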
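$$\operatorname{sim}(u_i, u_j) = \cos\theta_{ij} = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert} = e_i \cdot e_j$$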
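where e_i is the unit vector of v_i. So, given a rating matrix A in which each row is one user, normalizing every row to unit length gives a new matrix B, and the matrix C = BB' (B' denotes the transpose of B; the superscript T would not render properly here) is the user similarity matrix. This article calls C the covariance matrix, although strictly it is the Gram matrix of the normalized rows, since the ratings are not mean-centered. C has the following properties: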
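$$C_{ij} = e_i \cdot e_j = C_{ji}, \qquad C_{ii} = 1 \ \text{(for non-zero } v_i \text{)}, \qquad 0 \le C_{ij} \le 1$$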
C_ij is the similarity between user i and user j. Next, find the users most similar to user i, then recommend products to user i based on how those users rated things.
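
As a rough illustration of the idea above, here is a minimal NumPy sketch that row-normalizes the example table and forms C = BB'. The function name `cosine_similarity_matrix` is made up for this example, and treating near-zero rows (such as u6) as zero vectors is an assumption that mirrors the EPSILON rule used later in the PHP code.

```python
import numpy as np

# Rating table from the example above (rows u1..u7, columns in the order listed).
A = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 2, 1, 0, 1, 0, 1, 0],
    [12, 2, 1, 0, 1, 0, 11, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 2],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [11, 0, 0, 7, 14, 1, 3, 19],
], dtype=float)

EPSILON = 1e-9  # rows shorter than this are treated as zero vectors


def cosine_similarity_matrix(ratings):
    """Normalize each row to unit length, then return B @ B.T."""
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    safe = np.where(norms < EPSILON, 1.0, norms)  # avoid dividing by zero
    B = ratings / safe
    return B @ B.T


C = cosine_similarity_matrix(A)
print(np.round(C, 3))
print(C[0, 1])  # similarity of u1 and u2: 3 / (sqrt(2) * sqrt(8)) = 0.75
```

Note that u4 and u5 come out with similarity 1 even though their scores differ, because only the direction of the rating vector matters, and that u6 (no activity) is similar to nobody.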

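Constructing the rating matrix

In practice it is hard to get users to rate products, so the rating matrix can be built from historical records instead. The environment used for this experiment has a user action log table (user_actions):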

user_id	product_id	keywork	action	created_at
17	1	NULL	view	2018-07-04 08:58:10
10	NULL	218	search	2018-07-04 09:26:54
4	109	NULL	view	2018-07-04 09:30:38
28	NULL	218	search	2018-07-04 09:35:41
28	56	NULL	view	2018-07-04 09:36:28
34	109	NULL	view	2018-07-04 10:06:15
34	109	NULL	buy	2018-07-04 10:06:38
34	109	NULL	buy	2018-07-04 10:06:46

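The "search" rows (action=search) are ignored here; only the view and buy rows are used. Because the ratio of views to purchases is roughly 50:1, each view scores 1 point and each purchase scores 50 points. There are of course also a users table and a products table; only their ids are used below, so their schemas do not matter.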
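
The scoring rule can be sketched in a few lines of Python. This is only an illustration of the 1-point/50-point weighting on the sample rows above, not the article's PHP implementation; `ACTION_SCORES` and `score_actions` are names invented for the sketch.

```python
from collections import defaultdict

# Weights taken from the text: one view = 1 point, one purchase = 50 points.
ACTION_SCORES = {"view": 1, "buy": 50}

# (user_id, product_id, action) tuples mirroring the sample rows above.
actions = [
    (17, 1, "view"),
    (10, None, "search"),
    (4, 109, "view"),
    (28, None, "search"),
    (28, 56, "view"),
    (34, 109, "view"),
    (34, 109, "buy"),
    (34, 109, "buy"),
]


def score_actions(rows):
    """Sum the scores per (user_id, product_id), skipping unscored actions."""
    scores = defaultdict(int)
    for user_id, product_id, action in rows:
        if action in ACTION_SCORES and product_id is not None:
            scores[(user_id, product_id)] += ACTION_SCORES[action]
    return dict(scores)


ratings = score_actions(actions)
print(ratings[(34, 109)])  # one view and two purchases: 1 + 50 + 50 = 101
```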

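First, fetch the ids of the users and products that take part in the computation. Because ids may not be contiguous, the mapping between indexes and ids is recorded to simplify later steps. Since the method is user-based, only users with history are read, while products are read from the products table. This means new users who first interact after this query has run do not take part in the computation, whereas later actions of existing users still do. To respect memory limits, users are split into chunks: similarities are computed only between users in the same chunk, and recommendations are only produced among users within a chunk. The code is as follows: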

// Take the user_ids that exist in both users and user_actions
$users = UserAction::select("user_id")->distinct()
    ->join("users", "user_actions.user_id", "=", "users.id")
    ->orderBy("user_id")
    ->get();

// User ids (index => id); initialized so an empty result still yields an array
$ui = [];
foreach ($users as $u) {
    $ui[] = $u->user_id;
}
// User indexes (id => index)
$iu = array_flip($ui);

$products = Product::select("id")->distinct()->orderBy("id")->get();
// Product ids (index => id)
$pi = [];
foreach ($products as $p) {
    $pi[] = $p->id;
}
// Product indexes (id => index)
$ip = array_flip($pi);

// Number of chunks
$k = $this->getChunks();
// Maximum number of users per chunk
$kn = intval(ceil(count($ui) / $k));
$map = [
    "users" => $ui,
    "indexOfUsers" => $iu,
    "products" => $pi,
    "indexOfProducts" => $ip,
    "chunks" => $k,
    "numberPerChunk" => $kn,
];

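For chunking users, it would be more reasonable to set the maximum number of users per chunk and derive the number of chunks from that.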
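
A minimal sketch of that alternative, assuming the cap is given as `max_per_chunk` (a hypothetical parameter, like the function name `plan_chunks`). One reason it may be preferable: with a fixed chunk count the last chunk can come out empty, for example 9 users in 4 chunks gives ceil(9/4) = 3 per chunk and nothing left for the fourth.

```python
import math


def plan_chunks(user_ids, max_per_chunk):
    """Split user ids into consecutive chunks of at most max_per_chunk ids."""
    if max_per_chunk < 1:
        raise ValueError("max_per_chunk must be at least 1")
    k = math.ceil(len(user_ids) / max_per_chunk)  # number of chunks follows from the cap
    return [user_ids[i * max_per_chunk:(i + 1) * max_per_chunk] for i in range(k)]


chunks = plan_chunks(list(range(1, 11)), 4)
print(len(chunks), [len(c) for c in chunks])  # 3 [4, 4, 2]
```

Every chunk produced this way is non-empty, and an empty user list simply yields no chunks.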

The function getRatingByUser(int $id) reads one user's rating vector from the database; zerosArray(int $n) returns an array of length $n filled with 0:

// The ratio of views to purchases is close to 50 : 1
protected $caseSql = "CASE `action` WHEN 'view' THEN 1 WHEN 'buy' THEN 50 ELSE 0 END AS `score`";
/**
 * Get one user's score for every product, i.e. the rating vector
 * @param int $id user id
 * @return array user rating vector
 */
public function getRatingByUser(int $id) : array
{
    $map = $this->getMap();

    $userActions = UserAction::select("product_id", DB::raw($this->caseSql))
        ->where("user_id", "=", $id)
        ->whereIn("product_id", $map["products"]) // drop product_ids that exist only in user_actions
        ->where(function ($query) {
            $query->where("action", "=", "view")
                ->orWhere("action", "=", "buy");
        })
        ->orderBy("product_id")
        ->get();

    // Start from an all-zero vector and add up the score of every action
    $ratingArray = $this->zerosArray(count($map["products"]));
    foreach ($userActions as $ua) {
        $index = $map["indexOfProducts"][$ua->product_id];
        $ratingArray[$index] += $ua->score;
    }
    return $ratingArray;
}

Calling getRatingByUser(int $id) in a loop yields the rating matrix, shown below:
[Figure: the rating matrix]

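The rating matrix already gives some useful information:

Row sums measure user activity; the users in the highlighted rows of the figure are clearly more active than the others

Column sums measure product activity; in the figure the first product is clearly more active than the second

Normalizing each row to unit length and then summing the columns gives one value per product; the larger it is, the more users pay attention to that product

The distribution of values within a row reflects that user's preferences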
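
These summaries can be sketched with NumPy on a toy matrix (the numbers are invented, not the article's data). It also shows why the normalized column sum is worth having: one heavy user can dominate the raw column sum, while the normalized sum reflects how broadly a product is noticed.

```python
import numpy as np

# Toy rating matrix: 3 users x 3 products (made-up values).
R = np.array([
    [1, 1, 0],
    [0, 2, 0],
    [30, 5, 1],
], dtype=float)

user_activity = R.sum(axis=1)     # row sums: how active each user is
product_activity = R.sum(axis=0)  # column sums: how active each product is

# Normalize rows to unit length first, then sum columns.
norms = np.linalg.norm(R, axis=1, keepdims=True)
B = R / np.where(norms == 0, 1.0, norms)
product_reach = B.sum(axis=0)

print(user_activity)      # the third user is by far the most active
print(product_activity)   # product 0 has the largest raw total
print(product_reach)      # product 1 has the largest normalized total
```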

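Because the rating matrix may be used by other programs, it is saved unchunked. This implementation reads the file through a file pointer, so doing so does not cause performance problems.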

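Computing similarity

PHP implementation

getSimilarityMatrix(void) computes the similarity (covariance) matrix from the rating matrix. Vector operations use the math-php library. Because math-php can only build a matrix from column vectors, the covariance is computed as B'B, where B here holds the unit vectors as its columns. EPSILON = 1e-9, and any vector whose length is less than EPSILON is treated as 0: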

/**
 * Get the similarity matrix of the user rating vectors
 * To respect memory limits, users are split into $k groups and the covariance is computed per group
 * If a storage location was passed when the class was constructed, data is saved to files
 * Yields the current progress
 * test:
 *     cache, chunk=1: 60s
 * @yield array
 * @return Generator
 */
public function getSimilarityMatrix() : Generator
{
    $k = $this->getChunks();
    $dir = $this->getDataDir();
    $users = $this->getMap()["users"];
    $urKey = $this->getCacheKey("user_rating");
    $smKey = $this->getCacheKey("sim_mat");
    $nk = intval(ceil(count($users) / $k));
    if ($dir) {
        $file = $dir . DIRECTORY_SEPARATOR . $urKey . ".csv";
        $isBig = filesize($file) > static::MAX_FILE_SIZE;
        // Read large files line by line; otherwise read the whole file into an array
        if ($isBig) {
            $urCsv = fopen($file, "r");
        } else {
            $urCsv = file($file, FILE_IGNORE_NEW_LINES);
        }
    }

    for ($i = 0; $i < $k; $i++) {
        $vs = [];
        // Every chunk holds $nk users except the last, which takes the remainder
        if ($i + 1 < $k) {
            $chunk = $nk;
        } else {
            $chunk = count($users) - $nk * $i;
        }
        for ($j = 0; $j < $chunk; $j++) {
            $index = $i * $nk + $j;
            if ($dir) {
                if ($isBig) {
                    $arr = str_getcsv(fgets($urCsv));
                } else {
                    $arr = str_getcsv($urCsv[$index]);
                }
            } else {
                $arr = Cache::get("$urKey.{$users[$index]}");
            }
            // Normalize to unit length (vectors shorter than EPSILON are left unchanged)
            $v = new Vector($arr);
            $v = $v->length() < static::EPSILON ? $v : $v->normalize();
            $vs[] = $v;
        }

        // Compute the covariance
        $M = MatrixFactory::create($vs);
        $covMatrix = $M->transpose()->multiply($M);
        $covArray = $covMatrix->getMatrix();

        // Save the data
        if ($dir) {
            $file = $dir . DIRECTORY_SEPARATOR . "$smKey.$i.csv";
            $smCsv = fopen($file, "w");
            foreach ($covArray as $row) {
                fputcsv($smCsv, $row);
            }
            fclose($smCsv);
        } else {
            Cache::forever("$smKey.$i", $covArray);
        }
        yield [$i + 1, $k];
    }
    if ($dir && $isBig) {
        fclose($urCsv);
    }
}

Running this function took 76 seconds in this test:

xl@xl:~/apps/blog$ ./artisan tinker
Psy Shell v0.9.9 (PHP 7.2.10 — cli) by Justin Hileman
>>> $rs = new App\Tools\RecommenderSystem(1, storage_path('rsdata'))
=> App\Tools\RecommenderSystem {#2906}
>>> $start = time(); foreach ($rs->getSimilarityMatrix() as $arr){}; $end = time(); echo $end-$start;
76⏎

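Efficiency of math-php

math-php is a library without optimized linear algebra routines; reading its source shows that matrix multiplication takes O(n^3) time. Libraries built on optimized linear algebra kernels have the same asymptotic cost for dense multiplication, but vectorization, cache blocking, and multithreading shrink the constant factor so much that, up to a certain size (depending on hardware), matrix operations feel close to O(1) in practice. In the test environment, for multiplying matrices of order 1000, math-php ran at about 1/10000 the speed of Intel MKL and 1/1000 the speed of NumPy.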
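
To get a feel for the gap, the following sketch times a textbook triple loop in pure Python against NumPy's `@` operator. It is not a reproduction of the article's benchmark: the matrix size is deliberately small, the loop is pure Python rather than PHP, and the timings will vary by machine.

```python
import time
import numpy as np


def naive_matmul(a, b):
    """Textbook triple loop over Python lists: O(n^3) scalar multiplications."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                out[i][j] += aik * b[k][j]
    return out


n = 60  # kept small so the pure-Python loop finishes quickly
rng = np.random.default_rng(0)
X = rng.random((n, n))
Y = rng.random((n, n))

t0 = time.perf_counter()
slow = naive_matmul(X.tolist(), Y.tolist())
t1 = time.perf_counter()
fast = X @ Y  # delegates to the optimized routines NumPy was built against
t2 = time.perf_counter()

print(f"naive loop: {t1 - t0:.4f}s, NumPy: {t2 - t1:.6f}s")
print(np.allclose(slow, fast))  # both approaches compute the same product
```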

Python implementation

Below is a simple Python implementation that uses sklearn.preprocessing.normalize and the NumPy library.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from sklearn.preprocessing import normalize
import numpy as np

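# Load the data
A = np.loadtxt('rs.user_rating.csv', delimiter=',')
# Normalize each row to unit length
B = normalize(A, norm='l2', axis=1)
# Compute the covariance
C = B @ B.transpose()
# Save the data (np.savetxt defaults to fmt='%.18e' and a space delimiter)
np.savetxt('covmat.csv', C)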

Result:

xl@xl:~/apps/blog/storage/rsdata$ ls -lh
total 14M
-rw-rw-r-- 1 xl xl  313 Nov 17 12:44 covMat.py
-rw-rw-r-- 1 xl xl  35K Nov 17 12:51 rs.map.json
-rw-rw-r-- 1 xl xl  13M Nov 17 12:53 rs.sim_mat.0.csv
-rw-rw-r-- 1 xl xl 632K Nov 16 19:44 rs.user_rating.csv
xl@xl:~/apps/blog/storage/rsdata$ time python3 covMat.py 

real    0m2.206s
user    0m2.315s
sys 0m0.660s
xl@xl:~/apps/blog/storage/rsdata$ ls -lh
total 119M
-rw-rw-r-- 1 xl xl 106M Nov 17 13:03 covmat.csv
-rw-rw-r-- 1 xl xl  313 Nov 17 12:44 covMat.py
-rw-rw-r-- 1 xl xl  35K Nov 17 12:51 rs.map.json
-rw-rw-r-- 1 xl xl  13M Nov 17 12:53 rs.sim_mat.0.csv
-rw-rw-r-- 1 xl xl 632K Nov 16 19:44 rs.user_rating.csv

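Because NumPy writes the values at higher precision (np.savetxt defaults to the '%.18e' format), the matrix file (covmat.csv) is much larger than the one PHP generated (rs.sim_mat.0.csv). This run took about 2 seconds, so the speed-up is obvious.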

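Finally, the resulting covariance matrix looks like this:
[Figure: the covariance matrix]
As can be seen, the matrix is symmetric about the diagonal, and the diagonal holds the maximum value 1.

In practice the rating matrix may be very sparse, in which case scipy.sparse can be used to handle it.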
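
A minimal sketch of the sparse variant, assuming SciPy is installed. The tiny matrix is made up for illustration; the point is that the ratings stay in CSR form through normalization and multiplication, and only the final similarity matrix is converted to a dense array.

```python
import numpy as np
from scipy import sparse

# Small made-up rating matrix, stored in compressed sparse row form.
dense = np.array([
    [1, 1, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
    [3, 0, 4, 0],
], dtype=float)
A = sparse.csr_matrix(dense)

# Row L2 norms; zero rows get a scale of 0 so they stay all-zero.
norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=1)).ravel())
scale = np.divide(1.0, norms, out=np.zeros_like(norms), where=norms > 0)

B = sparse.diags(scale) @ A   # row-normalized, still sparse
C = (B @ B.T).toarray()       # similarity matrix as a dense array

print(np.round(C, 3))
```

For a large user base the similarity matrix itself may also be worth keeping sparse, in which case the final `toarray()` call would be dropped.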

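Recommendation by voting

Row i of the covariance matrix holds the similarities between user i and the other users. userLikes($uid, $n) finds the $n users most similar to user $uid; getSimVecByUser($uid) fetches the row vector, and userIdToLocation($id) finds where a user's data is stored given the user id: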

/**
 * Find the $n users most similar to user $uid; returns an array of user ids
 * @param int $uid user's id
 * @param int $n number of users
 * @return array
 */
public function userLikes($uid, $n = 10) : array
{
    $likes = [];
    $map = $this->getMap();
    $users = $map["users"];
    $kn = $map["numberPerChunk"];
    // Get the similarity vector
    $vec = $this->getSimVecByUser($uid);
    if (!$vec) {
        return [];
    }
    // Sort in descending order, preserving the indexes
    arsort($vec);
    // Indexes of the first $n entries
    $topNI = array_slice(array_keys($vec), 0, $n);
    // Convert indexes to ids
    $location = $this->userIdToLocation($uid);
    $i = $location[0];
    foreach ($topNI as $j) {
        $likes[] = $users[$i * $kn + $j];
    }

    return $likes;
}

Python implementation:

# Sort each row of the covariance matrix in descending order and return the indexes
U = np.argsort(-C, axis=1)
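
The argsort line ranks every user against every other user. As a hedged sketch of picking the top n for a single user, the helper below (`top_similar`, a name invented here) works on one row. Unlike the PHP version it skips the user itself and any neighbour with zero similarity; both choices are assumptions about what is wanted, not something the article specifies.

```python
import numpy as np


def top_similar(C, i, n):
    """Indexes of the n rows most similar to row i, excluding i itself."""
    sims = C[i].astype(float).copy()
    sims[i] = -np.inf                          # never pick the user itself
    order = np.argsort(-sims, kind="stable")   # descending similarity
    return [int(j) for j in order[:n] if sims[j] > 0]


# Made-up similarity matrix for four users.
C = np.array([
    [1.0, 0.75, 0.2, 0.0],
    [0.75, 1.0, 0.5, 0.0],
    [0.2, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

print(top_similar(C, 0, 2))  # [1, 2]
print(top_similar(C, 3, 2))  # [] because user 3 is similar to nobody
```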

Next, getUserFavouriteProducts(int $id, int $n) finds the $n products that user $id rated highest in the rating matrix:

/**
 * Get the $n products user $id rated highest; returns an array of product ids
 * @param int $id user's id
 * @param int $n number of products
 * @return array
 */
public function getUserFavouriteProducts(int $id, int $n = 10) : array
{
    $dir = $this->getDataDir();
    $key = $this->getCacheKey("user_rating");
    $map = $this->getMap();
    $pi = $map["products"];
    $iu = $map["indexOfUsers"];
    $m = count($iu);
    $index = $iu[$id];
    if ($dir) {
        $file = $dir . DIRECTORY_SEPARATOR . "$key.csv";
        if (filesize($file) > static::MAX_FILE_SIZE) {
            $csv = fopen($file, "r");
            // Skip the lines before this user's row
            for ($i = 0; $i < $index; $i++) {
                fgets($csv);
            }
            $vec = str_getcsv(fgets($csv));
            fclose($csv);
        } else {
            $arr = file($file, FILE_IGNORE_NEW_LINES);
            $vec = str_getcsv($arr[$index]);
        }
    } else {
        $vec = Cache::get("$key.$id");
    }

    // Sort scores in descending order, keeping the product indexes as keys
    arsort($vec);
    $rn = array_slice($vec, 0, $n, true);
    // Remove entries whose score is 0
    $rn = array_filter($rn, function ($item) {
        return abs($item) > 1e-9;
    });
    $fps = [];
    foreach ($rn as $pid => $score) {
        $fps[] = $pi[$pid];
    }

    return $fps;
}

Python implementation:

# Sort each row of the rating matrix in descending order and return the indexes
RI = np.argsort(-R, axis=1)

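Finally, the recommended products are produced by voting. The voting strategy below is rather crude and does not take the voters' weights into account: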

/**
 * Produce at most $numberOfVoter * $numberOfVote recommended products for user $uid
 * @param int $uid user id
 * @param int $numberOfVoter number of voters
 * @param int $numberOfVote number of votes per voter
 * @return array
 */
public function vote($uid, $numberOfVoter, $numberOfVote) : array
{
    $likes = $this->userLikes($uid, $numberOfVoter);
    $ps = [];
    // Each similar user votes for its own favourite products
    foreach ($likes as $id) {
        $fps = $this->getUserFavouriteProducts($id, $numberOfVote);
        $ps = array_merge($ps, $fps);
    }
    $ps = array_unique($ps);
    // Recommended products do not need to be saved to a file
    $key = $this->getCacheKey("user_recommender_products");
    Cache::forever("$key.$uid", $ps);
    return $ps;
}
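
One possible way to bring voter weights in, sketched in Python: each ballot counts with the voter's similarity to the target user, and products are ranked by the summed weight. This is a suggested refinement under my own assumptions, not what the article's code does; `favourite_products` and `weighted_vote` are hypothetical names, and the rating matrix is made up.

```python
import numpy as np


def favourite_products(R, u, n):
    """Indexes of user u's n highest-rated products, dropping zero ratings."""
    order = np.argsort(-R[u], kind="stable")[:n]
    return [int(p) for p in order if R[u, p] > 1e-9]


def weighted_vote(R, C, u, n_voters, n_votes):
    """Rank products by the summed similarity of the voters who picked them."""
    sims = C[u].astype(float).copy()
    sims[u] = -np.inf  # the user does not vote for itself
    ranked = np.argsort(-sims, kind="stable")[:n_voters]
    voters = [int(v) for v in ranked if sims[v] > 0]
    tally = {}
    for v in voters:
        for p in favourite_products(R, v, n_votes):
            tally[p] = tally.get(p, 0.0) + sims[v]  # ballot weighted by similarity
    return sorted(tally, key=lambda p: (-tally[p], p))


# Made-up ratings: 3 users x 4 products.
R = np.array([
    [5, 0, 0, 0],
    [4, 3, 0, 0],
    [1, 0, 0, 6],
], dtype=float)
B = R / np.linalg.norm(R, axis=1, keepdims=True)
C = B @ B.T

print(weighted_vote(R, C, 0, n_voters=2, n_votes=2))  # [0, 1, 3]
```

Product 3 ranks last here because only the weakly similar user 2 voted for it.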

The following artisan command class shows the training steps:

<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use App\Tools\RecommenderSystem as RS;
use Cache;

class RecommenderSystem extends Command
{
    protected $signature = 'resys:train {--D|datadir= : where to save data, default is cache} {--K|chunk=1 : number of users chunk} {--U|vu=10 : number of voter} {--N|vn=10 : number of votes}';

    protected $description = 'recommender system';

    public function __construct()
    {
        parent::__construct();
    }

    public function handle() : void
    {
        $k = intval($this->option("chunk"));
        $vu = intval($this->option("vu"));
        $vn = intval($this->option("vn"));
        $dir = $this->hasOption("datadir") ? $this->option("datadir") : null;
        $rs = new RS($k, $dir);
        // 1. Build the user rating matrix, O(n^2)
        $this->info("1. Compute user rating:");
        foreach ($rs->getRatingVectors() as $rv) {
            $bar1 = $bar1 ?? $this->output->createProgressBar($rv[1]);
            $bar1->setProgress($rv[0]);
        }
        $bar1->finish();
        $this->line("");

        // 2. Compute the user similarity matrix, O(n^3)
        $this->info("2. Compute user similarity:");
        $bar2 = $this->output->createProgressBar($k);
        foreach ($rs->getSimilarityMatrix() as $sm) {
            $bar2->setProgress($sm[0]);
        }
        $bar2->finish();
        $this->line("");

        // 3. Vote to decide which products to recommend to each user, O(n)
        $this->info("3. vote:");
        $users = $rs->getMap()["users"];
        $bar3 = $this->output->createProgressBar(count($users));
        foreach ($users as $id) {
            $rs->vote($id, $vu, $vn);
            $bar3->advance();
        }
        $bar3->finish();
        $this->line("");

        // 4. Clear the intermediate data
        $this->info("4. clear cache:");
        foreach ($rs->clearCache() as $arr) {
            $bar4 = $bar4 ?? $this->output->createProgressBar($arr[1]);
            $bar4->setProgress($arr[0]);
        }
        $this->info("\nTrain done");
    }
}

Running it looks like this:

xl@xl:~/apps/blog$ ll storage/rsdata/
total 8
drwxrwxr-x 2 xl xl 4096 Nov 17 13:28 ./
drwxr-xr-x 6 xl xl 4096 Nov 15 16:02 ../
xl@xl:~/apps/blog$ redis-cli 
127.0.0.1:6379> SELECT 1
OK
127.0.0.1:6379[1]> FLUSHALL
OK
127.0.0.1:6379[1]> exit
xl@xl:~/apps/blog$ time ./artisan resys:train -D storage/rsdata/ -K 4 -U 5 -N 5
1. Compute user rating:
 2099/2099 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%
2. Compute user similarity:
 4/4 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%
3. vote:
 2099/2099 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%
4. clear cache:
 4/4 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%
Train done

real    0m42.890s
user    0m23.208s
sys 0m2.285s
xl@xl:~/apps/blog$ ls -lh storage/rsdata/
total 668K
-rw-rw-r-- 1 xl xl  35K Nov 17 13:30 rs.map.json
-rw-rw-r-- 1 xl xl 632K Nov 17 13:30 rs.user_rating.csv
xl@xl:~/apps/blog$ redis-cli 
127.0.0.1:6379> SELECT 1
OK
127.0.0.1:6379[1]> KEYS *
   1) "laravel_cache:rs.user_recommender_products.705"
   2) "laravel_cache:rs.user_recommender_products.2957"
   3) "laravel_cache:rs.user_recommender_products.1697"
   4) "laravel_cache:rs.user_recommender_products.1625"
   5) "laravel_cache:rs.user_recommender_products.2844"
   6) "laravel_cache:rs.user_recommender_products.637"
   7) "laravel_cache:rs.user_recommender_products.1275"
   8) "laravel_cache:rs.user_recommender_products.2049"

127.0.0.1:6379[1]> get laravel_cache:rs.user_recommender_products.1060
"a:3:{i:0;i:9;i:1;i:1;i:2;i:10;}"

After training, the cache stores each user's array of recommended product ids under a key of the form xxx.$user_id.

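To be improved

Cold start is not implemented: users with no interactions get no recommendations, and products with no interactions are never recommended;
Caller arguments are always assumed to be reasonable and are not validated;
The file system is always assumed to be stable and reliable, so file reads and writes are not checked;
While building the rating matrix, only one user's history is read at a time, so the number of database queries can be large;
There is no lower bound on the number of recommended products; because real rating matrices tend to be very sparse, parameters that are too small may yield very few recommendations;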
…
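
As one possible direction for the cold-start and lower-bound items, a short recommendation list could be padded with globally popular products, measured by the column sums of the rating matrix. This is only a sketch of one common fallback, not something the article implements; `recommend_with_fallback` is a hypothetical helper and the matrix is made up.

```python
import numpy as np


def recommend_with_fallback(recommended, R, n):
    """Pad a possibly short recommendation list with globally popular products."""
    popularity = R.sum(axis=0)
    ranked = np.argsort(-popularity, kind="stable")
    popular = [int(p) for p in ranked if popularity[p] > 0]
    result = list(recommended)
    for p in popular:
        if len(result) >= n:
            break
        if p not in result:
            result.append(p)
    return result[:n]


# Made-up ratings: product 1 has no interactions at all.
R = np.array([
    [1, 0, 3],
    [0, 0, 50],
    [2, 0, 0],
], dtype=float)

print(recommend_with_fallback([], R, 2))   # [2, 0] for a user with no history
print(recommend_with_fallback([0], R, 3))  # [0, 2]: product 1 still cannot be recommended
```

Products with no interactions remain invisible to this fallback, so that half of the cold-start problem would need a different signal, such as product metadata.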

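SVD and the similarity matrix

Since computing the similarity matrix amounts to computing the matrix BB', it is natural to consider SVD: with [u, s, v] = svd(A), u is the matrix of eigenvectors of AA', and u itself is orthonormal, so using SVD could skip the normalization step. The problem with this approach is that the mapping between product ids and indexes is lost; I have not yet worked out how to solve that.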
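
The relationship can be checked numerically. The sketch below (toy matrix, NumPy only) confirms that the columns of u are eigenvectors of AA' with eigenvalues s squared. It also suggests that row normalization may still be needed if cosine similarity is the goal, because the diagonal of AA' holds squared row lengths rather than ones; that reading is my own and is not claimed by the article.

```python
import numpy as np

# Made-up rating matrix: 3 users x 4 products.
A = np.array([
    [1, 1, 0, 0],
    [1, 2, 1, 0],
    [0, 0, 0, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A A' = U diag(s^2) U', so the columns of U are eigenvectors of A A'.
print(np.allclose(A @ A.T, U @ np.diag(s**2) @ U.T))

# The diagonal of A A' is the squared length of each row, not 1.
print(np.diag(A @ A.T))  # [2. 6. 4.]

# With row-normalized B, the same identity rebuilds the similarity matrix BB'.
B = A / np.linalg.norm(A, axis=1, keepdims=True)
Ub, sb, _ = np.linalg.svd(B, full_matrices=False)
C = Ub @ np.diag(sb**2) @ Ub.T
print(np.round(np.diag(C), 6))  # all ones
```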

Code download

src/app/Tools/RecommenderSystem.php: the training class
src/app/Console/Commands/RecommenderSystem.php: the artisan command class

Download link
