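Collaborative Filtering Algorithms

Published 2016-10-08

Collaborative filtering

Collaborative filtering (CF) is a classic way of putting collective intelligence to work. To see what it means, start with a simple question: you want to watch a movie but don't know which one. What do you do? Most people ask friends what they have enjoyed recently, and we tend to trust recommendations from friends whose taste is close to our own. That is the core idea of collaborative filtering.

In practice, collaborative filtering searches a large population of users for the small group whose tastes resemble yours. These users are called your neighbors. The other things they like are then assembled into a ranked list and recommended to you.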

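Implementing collaborative filtering takes the following steps:

Collect preferences

Find similar users

Recommend items

Collecting preferences

First we need a way to represent different people and their preferences. Here we use a nested Python dictionary.

The data used in this chapter is the u.data file downloaded from the GroupLens website. Each line has four tab-separated columns: user ID, movie ID, rating, and timestamp. We only need the first three columns to build the user preference dictionary.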

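# Build the user preference dictionary: {userId: {itemId: score}}
def make_data():
    result={}
    with open('data/u.data', 'r') as f:
        for line in f:
            # Split each line on tabs: user ID, item ID, rating, timestamp
            (userId , itemId , score,time ) = line.strip().split("\t")
            # Create the inner dict the first time a user appears
            # (dict.has_key() was removed in Python 3, so use "in")
            if userId not in result:
                result[userId]={}
            # Convert to float so later arithmetic works on numbers, not strings
            result[userId][itemId] = float(score)
    return result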
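
To make the file format concrete, here is a minimal sketch that applies the same parsing to a few in-memory lines instead of the real file. The sample rows and the helper name `parse_ratings` are made up for illustration; they are not taken from u.data.

```python
def parse_ratings(lines):
    """Build {user_id: {item_id: rating}} from tab-separated rating lines."""
    prefs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, item_id, score, _timestamp = line.split("\t")
        prefs.setdefault(user_id, {})[item_id] = float(score)
    return prefs


# Hypothetical sample rows in the u.data layout: user, item, rating, timestamp
sample = [
    "1\t10\t3\t881250949",
    "1\t20\t5\t881250950",
    "2\t10\t4\t891717742",
]
prefs = parse_ratings(sample)
print(prefs)  # {'1': {'10': 3.0, '20': 5.0}, '2': {'10': 4.0}}
```

Note that the IDs stay strings while the ratings become floats, which is what the similarity functions later in the article rely on.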


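If you would rather see movie titles in the dictionary, which makes analysis easier, you can build the data set from the movie records in u.item instead.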

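# Build the data set with movie IDs replaced by movie titles
def loadMovieLens(path='data'):
    # Get movie titles
    movies={}
    # u.item contains non-ASCII titles; the MovieLens 100K copy is Latin-1
    # encoded (adjust the encoding if your copy differs)
    for line in open(path+'/u.item', encoding='latin-1'):
        (id,title)=line.split('|')[0:2]
        movies[id]=title

    # Load data
    prefs={}
    for line in open(path+'/u.data'):
        (user,movieid,rating,ts)=line.split('\t')
        prefs.setdefault(user,{})
        prefs[user][movies[movieid]]=float(rating)
    return prefs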


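Either of the two functions above gives us the user data set. It is not very large, so for now we can simply keep it in memory.
The real data set is too abstract to explain with, so from here on we use a small hand-made data set, a simple nested dictionary, for the examples: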

# user: {movie title: rating}
critics={
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,'You, Me and Dupree': 3.5}, 
    'Michael Phillips':{'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,'Superman Returns': 3.5, 'The Night Listener': 4.0},
    'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,'The Night Listener': 4.5, 'Superman Returns': 4.0,'You, Me and Dupree': 2.5},
    'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
    'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
    'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}
}


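Finding similar users

Once the preferences are collected, we need a way to measure how similar two users' tastes are, expressed as a similarity score. There are many ways to compute one; here we look at two common ones: Euclidean distance and the Pearson correlation coefficient.

Euclidean distance

Euclidean distance (the Euclidean metric) is the most commonly used definition of distance: the true distance between two points in n-dimensional space, or the natural length of a vector (the distance from the point to the origin). In two and three dimensions it is the ordinary straight-line distance between two points.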

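Mathematical definition:
Given two points A = (x_1, x_2, …, x_n) and B = (y_1, y_2, …, y_n), the distance between them is:

$$d(A, B) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$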
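
As a rough illustration of the definition, the sketch below computes the distance between two points of any dimension. The function name `euclidean` and the sample points are chosen here for illustration only.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# A 3-4-5 right triangle in the plane
print(euclidean((0, 0), (3, 4)))        # 5.0
# The same function works in three dimensions
print(euclidean((1, 2, 3), (1, 2, 3)))  # 0.0
```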

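Next we only need to map the data set onto a coordinate system to see the similarities at a glance. Taking the ratings for "Snakes on a Plane" and "You, Me and Dupree" as the two axes, the critics are positioned as in the figure below:
[Figure: critics plotted by their ratings of "Snakes on a Plane" and "You, Me and Dupree"]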

Computing the distance between Toby and Mick LaSalle in the figure above:

from math import sqrt
sqrt(pow( 4.5-4 , 2 ) + pow( 1 - 2 , 2))
1.118033988749895

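The expression above gives the actual distance, but in practice we want a score that grows as similarity grows and stays between 0 and 1. To get that, we add 1 to the distance and take the reciprocal (adding 1 avoids division by zero when the distance is 0):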

1/(1+sqrt(pow( 4.5-4 , 2 ) + pow( 1 - 2 , 2)))
0.4721359549995794
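
To see how this mapping behaves at the extremes, the following small sketch applies it to a few distances. The helper name `distance_to_similarity` is an assumption made for this example.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a score in (0, 1]."""
    return 1 / (1 + distance)


# Identical tastes (distance 0) score 1; the score shrinks as distance grows
for d in (0, 1, 9):
    print(d, distance_to_similarity(d))
# 0 1.0
# 1 0.5
# 9 0.1
```

The score never reaches 0, so two users with any ratings in common always get a positive similarity under this measure.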

Now we can wrap this up as a similarity function based on Euclidean distance. A Python implementation:

# Similarity based on Euclidean distance
from math import sqrt

def sim_distance( prefs,person1,person2 ):
    # Collect the items both people have rated
    si={}
    for itemId in prefs[person1]:
        if itemId in prefs[person2]:
            si[itemId] = 1
    # No items in common
    if len(si)==0: return 0

    # Sum of squared rating differences over the shared items
    sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in si])
    return 1/(1 + sqrt(sum_of_squares) )


With the critics test data set, the similarity of two people works out as:

sim_distance( critics , 'Toby', 'Mick LaSalle' )
0.4
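
The intermediate values behind this score can be checked by hand. The sketch below repeats the calculation step by step for Toby and Mick LaSalle, using ratings copied from the critics data set; the variable names are chosen here for illustration.

```python
from math import sqrt

# Ratings copied from the critics data set
toby = {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0}
mick = {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0,
        'Superman Returns': 3.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 2.0}

shared = sorted(set(toby) & set(mick))             # films both have rated
squares = [(toby[f] - mick[f]) ** 2 for f in shared]
distance = sqrt(sum(squares))
similarity = 1 / (1 + distance)

print(shared)      # ['Snakes on a Plane', 'Superman Returns', 'You, Me and Dupree']
print(squares)     # [0.25, 1.0, 1.0]
print(distance)    # 1.5
print(similarity)  # 0.4
```

Only the three films both people rated enter the sum; Mick LaSalle's other ratings are ignored.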

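Pearson correlation

The Pearson correlation coefficient measures how strongly two variables are linearly related. It lies between -1 and 1, where 1 means a perfect positive correlation, 0 means no linear correlation, and -1 means a perfect negative correlation.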

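Formula (x_i and y_i are the two users' ratings of the i-th of the n items both have rated):

$$r = \frac{\sum_{i} x_i y_i - \frac{\sum_{i} x_i \sum_{i} y_i}{n}}{\sqrt{\left(\sum_{i} x_i^2 - \frac{\left(\sum_{i} x_i\right)^2}{n}\right)\left(\sum_{i} y_i^2 - \frac{\left(\sum_{i} y_i\right)^2}{n}\right)}}$$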

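Unlike the Euclidean approach, here we build the Cartesian coordinate system from two users and plot one point per movie that both have rated, as in the figure below:
[Figure: the ratings of two critics plotted against each other, with a best-fit line]
The figure also shows a straight line. It is drawn to come as close as possible to all the points, so it is called the best-fit line. One advantage of the Pearson score is that it corrects for "grade inflation": for example, one person may be a harsher judge of movies and consistently give lower scores.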
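
The effect of that correction can be illustrated with a small sketch. Two hypothetical raters order the films identically, but one scores everything 1.5 points lower. The helper names `pearson` and `euclid_sim` and the ratings are assumptions made for this example; `pearson` uses the mean-centred form of the coefficient, which is algebraically equivalent to the sum-based form.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists (mean-centred form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0  # a constant rater has no defined correlation
    return cov / sqrt(var_x * var_y)


def euclid_sim(xs, ys):
    """Similarity 1 / (1 + Euclidean distance)."""
    return 1 / (1 + sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys))))


generous = [3.0, 4.0, 5.0, 4.5]        # hypothetical ratings
harsh = [r - 1.5 for r in generous]    # same ordering, 1.5 points lower

print(round(pearson(generous, harsh), 6))     # 1.0
print(round(euclid_sim(generous, harsh), 6))  # 0.25
```

The Pearson score treats the two raters as perfectly aligned, while the distance-based score penalizes the constant offset.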

Python implementation:

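# Pearson correlation
def sim_pearson(prefs,p1,p2):
    # Collect the items both people have rated
    si={}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item]=1

    # No items in common
    if len(si)==0: return 0

    n=len(si)

    # Sums of the ratings
    sum1=sum([prefs[p1][it] for it in si])
    sum2=sum([prefs[p2][it] for it in si])

    # Sums of the squared ratings
    sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq=sum([pow(prefs[p2][it],2) for it in si])

    # Sum of the products
    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

    # Numerator and denominator of the Pearson score
    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))

    # One side has zero variance: treat the correlation as 0
    if den==0: return 0

    r=num/den

    return r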


Finally, compute the correlation between Gene Seymour and Lisa Rose from the critics data:

sim_pearson(critics, 'Gene Seymour', 'Lisa Rose')
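
For readers who want to verify this call by hand, the sketch below walks through the same sums for the two critics. The ratings are copied from the critics data set; the list layout and variable names are chosen here for illustration.

```python
from math import sqrt

# Ratings copied from the critics data set, listed in the same film order
films = ['Lady in the Water', 'Snakes on a Plane', 'Just My Luck',
         'Superman Returns', 'You, Me and Dupree', 'The Night Listener']
lisa = [2.5, 3.5, 3.0, 3.5, 2.5, 3.0]
gene = [3.0, 3.5, 1.5, 5.0, 3.5, 3.0]

n = len(films)
sum1, sum2 = sum(lisa), sum(gene)               # 18.0, 19.5
sum1_sq = sum(x * x for x in lisa)              # 55.0
sum2_sq = sum(y * y for y in gene)              # 69.75
p_sum = sum(x * y for x, y in zip(lisa, gene))  # 59.5

num = p_sum - sum1 * sum2 / n                   # 1.0
den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
r = num / den
print(round(r, 4))                              # 0.3961
```

A value near 0.4 indicates a weak positive relationship: the two critics agree on some films but diverge sharply on others, such as "Just My Luck".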

Ranking the critics

We can now compute the similarity between users and use it to build a ranked list of the users whose tastes are closest to a given user's.

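# Return the n users most similar to person
def topMatches(prefs,person,n=5,similarity=sim_distance):
    # List comprehension: score every other user against person
    scores=[(similarity(prefs,person,other),other) for other in prefs if other!=person]
    # Sort so the highest similarity comes first
    scores.sort()
    scores.reverse()
    return scores[0:n]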
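
Once similar users are ranked, the remaining step from the list at the top, recommending items, can be sketched for the user-based case as a similarity-weighted average of other users' ratings. This is one common approach rather than code from the article: the function name `get_recommendations`, the compact `sim_distance`, and the trimmed copy of the data set are all assumptions made for this example.

```python
from math import sqrt

# A trimmed copy of the critics data set (three reviewers plus Toby)
critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
                  'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'The Night Listener': 4.5,
                     'Superman Returns': 4.0, 'You, Me and Dupree': 2.5},
    'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0,
                     'Superman Returns': 3.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0},
}


def sim_distance(prefs, p1, p2):
    """Similarity 1 / (1 + Euclidean distance) over the items both have rated."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0
    return 1 / (1 + sqrt(sum((prefs[p1][i] - prefs[p2][i]) ** 2 for i in shared)))


def get_recommendations(prefs, person, similarity=sim_distance):
    """Rank items `person` has not rated by similarity-weighted average rating."""
    totals, sim_sums = {}, {}
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:
            continue  # ignore users with zero or negative similarity
        for item, rating in prefs[other].items():
            if item in prefs[person]:
                continue  # only score items the person has not rated yet
            totals[item] = totals.get(item, 0) + rating * sim
            sim_sums[item] = sim_sums.get(item, 0) + sim
    rankings = [(total / sim_sums[item], item) for item, total in totals.items()]
    rankings.sort(reverse=True)
    return rankings


recs = get_recommendations(critics, 'Toby')
for score, film in recs:
    print(round(score, 2), film)
```

Dividing by the sum of similarities keeps a film from ranking highly merely because many people have rated it.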


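Item-based collaborative filtering

Everything so far recommends on the basis of similarity between users. Now let's look at recommending on the basis of similarity between items.

As before, we first build the data set: a dictionary keyed by item, in the form {movie: {user: rating, user: rating}}.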

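# Invert the preferences: {user: {item: rating}} -> {item: {user: rating}}
def transformPrefs(prefs):
    itemList ={}
    for person in prefs:
        for item in prefs[person]:
            # Create the inner dict the first time an item appears
            # (dict.has_key() was removed in Python 3, so use setdefault)
            itemList.setdefault(item,{})
            itemList[item][person]=prefs[person][item]
    return itemList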
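
A tiny example makes the inversion easier to picture. The sketch below uses a hypothetical two-user data set and a snake_case variant of the function, named `transform_prefs` here purely for illustration.

```python
def transform_prefs(prefs):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    inverted = {}
    for user, ratings in prefs.items():
        for item, rating in ratings.items():
            inverted.setdefault(item, {})[user] = rating
    return inverted


# Hypothetical two-user example
by_user = {
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0},
    'Lisa Rose': {'Snakes on a Plane': 3.5},
}
by_item = transform_prefs(by_user)
print(by_item)
# {'Snakes on a Plane': {'Toby': 4.5, 'Lisa Rose': 3.5}, 'Superman Returns': {'Toby': 4.0}}

# Applying the transformation twice gives back the original mapping
print(transform_prefs(by_item) == by_user)  # True
```

Because the inverted dictionary has the same shape as the original, the same similarity functions can be reused on items without modification.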


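Next we compute the similarity between items. Similarities between items change far less often than similarities between people, so we can build the item-similarity data set once, save it to a file, and reuse it: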

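# Build the item-similarity data set: {item: [(similarity, other item), ...]}
def calculateSimilarItems(prefs,n=10):
    result={}
    # Invert the preferences so they are keyed by item
    itemPrefs=transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        # Print progress every 10 items (useful on large data sets)
        c += 1
        if c%10==0: print("%d / %d" % (c,len(itemPrefs)))
        # Find the n items most similar to this one
        scores=topMatches(itemPrefs,item,n=n,similarity=sim_distance)
        result[item]=scores
    return result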
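
The article does not show how the result is saved, so here is one possible sketch using the standard json module. The helper names `save_similarities` and `load_similarities`, the file name, and the sample similarity values are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical precomputed table: {item: [(similarity, other_item), ...]}
item_sim = {
    'Superman Returns': [(0.4, 'Snakes on a Plane'), (0.25, 'Just My Luck')],
    'Snakes on a Plane': [(0.4, 'Superman Returns')],
}


def save_similarities(table, path):
    """Write the similarity table to `path` as JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(table, f)


def load_similarities(path):
    """Read the table back, restoring the (similarity, item) tuples."""
    with open(path, encoding='utf-8') as f:
        raw = json.load(f)
    return {item: [(sim, other) for sim, other in pairs] for item, pairs in raw.items()}


path = os.path.join(tempfile.mkdtemp(), 'item_sim.json')
save_similarities(item_sim, path)
print(load_similarities(path) == item_sim)  # True
```

JSON has no tuple type, so the pairs are written as lists and converted back on load; pickle would preserve them directly but is only safe for files you created yourself.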


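The item-based recommendation score is computed by multiplying each rating Toby has given to a movie he has seen by the similarity between that movie and a movie he has not seen, and summing these products for each unseen movie. The sum is then normalized by dividing it by the sum of the similarities, as in the figure below:
[Figure: table of Toby's ratings weighted by item similarity, with normalized totals]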
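
As a worked example of that arithmetic, the sketch below predicts Toby's rating for a single unseen film. Toby's ratings come from the critics data set, but the similarity values are invented for illustration and are not the output of any function in this article.

```python
# Toby's ratings, from the critics data set
toby = {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 1.0}

# Hypothetical similarities between each rated film and one unseen film
sim_to_unseen = {'Snakes on a Plane': 0.5, 'Superman Returns': 0.25, 'You, Me and Dupree': 0.5}

weighted = sum(toby[f] * sim_to_unseen[f] for f in toby)  # 2.25 + 1.0 + 0.5
total_sim = sum(sim_to_unseen.values())                    # 0.5 + 0.25 + 0.5
predicted = weighted / total_sim

print(weighted, total_sim, predicted)  # 3.75 1.25 3.0
```

Without the division, the raw total of 3.75 would depend on how many similar films Toby happens to have rated; the normalized value of 3.0 is back on the original rating scale.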

# Item-based recommendations
def getRecommendedItems(prefs,itemMatch,user):
    userRatings=prefs[user]
    scores={}
    totalSim={}
    # Loop over items rated by this user
    for (item,rating) in userRatings.items( ):
        # Loop over items similar to this one
        for (similarity,item2) in itemMatch[item]:

            # Ignore if this user has already rated this item
            if item2 in userRatings: continue
            # Weighted sum of rating times similarity
            scores.setdefault(item2,0)
            scores[item2]+=similarity*rating
            # Sum of all the similarities
            totalSim.setdefault(item2,0)
            totalSim[item2]+=similarity

    # Divide each total score by total weighting to get an average
    # (items whose similarities sum to 0 are skipped to avoid dividing by zero)
    rankings=[(score/totalSim[item],item) for item,score in scores.items( ) if totalSim[item]!=0]

    # Return the rankings from highest to lowest
    rankings.sort( )
    rankings.reverse( )
    return rankings
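
To show the function's inputs and output side by side, here is a compact, self-contained variant run on a hand-made similarity table. The name `get_recommended_items` and every similarity value in `item_match` are hypothetical; in practice the table would come from the precomputed item similarities.

```python
def get_recommended_items(prefs, item_match, user):
    """Rank unseen items by similarity-weighted average of the user's ratings."""
    ratings = prefs[user]
    scores, total_sim = {}, {}
    for item, rating in ratings.items():
        for similarity, other in item_match.get(item, []):
            if other in ratings:
                continue  # the user has already rated this item
            scores[other] = scores.get(other, 0) + similarity * rating
            total_sim[other] = total_sim.get(other, 0) + similarity
    # Skip items whose similarities sum to zero to avoid dividing by zero
    ranked = [(scores[i] / total_sim[i], i) for i in scores if total_sim[i] > 0]
    ranked.sort(reverse=True)
    return ranked


prefs = {'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 1.0}}

# Hypothetical similarity table: {item: [(similarity, other_item), ...]}
item_match = {
    'Snakes on a Plane': [(0.5, 'The Night Listener'), (0.25, 'Just My Luck')],
    'Superman Returns': [(0.25, 'The Night Listener'), (0.5, 'Snakes on a Plane')],
    'You, Me and Dupree': [(0.5, 'Just My Luck')],
}

ranked = get_recommended_items(prefs, item_match, 'Toby')
print([(round(score, 2), film) for score, film in ranked])
# [(4.33, 'The Night Listener'), (2.17, 'Just My Luck')]
```

"The Night Listener" ranks first because it is most similar to the films Toby rated highly, while "Just My Luck" is pulled down by its similarity to a film he rated 1.0.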


Further thoughts

UserCF compared with ItemCF
Better ways to normalize the scores
How this differs from frequent pattern mining
