Chapter 2: Collaborative Filtering

Posted by CopperDong on 2017-10-04

https://github.com/zacharski/pg2dm-python

1. Finding Similar Users

Manhattan distance

    |x1 - x2| + |y1 - y2|

Euclidean distance

    sqrt((x1 - x2)^2 + (y1 - y2)^2)

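Thinking in N dimensions

    Manhattan: |x1 - y1| + |x2 - y2| + ... + |xn - yn|
    Euclidean: sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

    Here x and y are the two users' rating vectors, and the index runs over the n items both users have rated.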

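A flaw

    Manhattan and Euclidean distance work very well when there are no missing values. How to handle missing values is an active area of academic research.

Generalization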
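Assuming the generalization meant here is the Minkowski distance (which reduces to Manhattan distance for r = 1 and to Euclidean distance for r = 2), a minimal sketch over the same kind of rating dictionaries might look like the following. The function name `minkowski`, the parameter `r`, and the two sample users are chosen for illustration only.

```python
def minkowski(rating1, rating2, r):
    """Minkowski distance over the items both users rated.
    r = 1 gives Manhattan distance, r = 2 gives Euclidean distance.
    Returns -1 if the users have no ratings in common."""
    distance = 0
    common_ratings = False
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key]) ** r
            common_ratings = True
    if common_ratings:
        return distance ** (1 / r)
    return -1  # no ratings in common


# Two made-up users for illustration
a = {"Phoenix": 5.0, "The Strokes": 2.0}
b = {"Phoenix": 2.0, "The Strokes": 6.0, "Deadmau5": 1.0}
print(minkowski(a, b, 1))  # Manhattan: 3 + 4 = 7.0
print(minkowski(a, b, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```

Only items rated by both users contribute, so "Deadmau5" is ignored here. The larger r is, the more a single large difference dominates the result.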

Python

#
#  FILTERINGDATA.py
#
#  Code file for the book Programmer's Guide to Data Mining
#  http://guidetodatamining.com
#  Ron Zacharski
#

from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
        }



def manhattan(rating1, rating2):
    """Computes the Manhattan distance. Both rating1 and rating2 are dictionaries
       of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
    distance = 0
    commonRatings = False 
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])
            commonRatings = True
    if commonRatings:
        return distance
    else:
        return -1 #Indicates no ratings in common


def computeNearestNeighbor(username, users):
    """creates a sorted list of users based on their distance to username"""
    distances = []
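    for user in users:
        if user != username:
            distance = manhattan(users[user], users[username])
            # skip users with no ratings in common (manhattan returns -1),
            # otherwise they would sort to the front as the "closest"
            if distance >= 0:
                distances.append((distance, user))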
    # sort based on distance -- closest first
    distances.sort()
    return distances

def recommend(username, users):
    """Give list of recommendations"""
    # first find nearest neighbor
    nearest = computeNearestNeighbor(username, users)[0][1]

    recommendations = []
    # now find bands neighbor rated that user didn't
    neighborRatings = users[nearest]
    userRatings = users[username]
    for artist in neighborRatings:
        if not artist in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    # using the fn sorted for variety - sort is more efficient
    return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse = True)

# examples - uncomment to run

print( recommend('Hailey', users))
#print( recommend('Chan', users))

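2. Differences in How Users Rate

Pearson Correlation Coefficient

    It is used to find the users most similar to a given user of interest.

    Besides looking somewhat complicated, the defining formula for the Pearson coefficient has another problem: an implementation may need several passes over the data. Fortunately for implementers, there is another formula that approximates the Pearson correlation coefficient in a single pass: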
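Reconstructed from the `pearson` function in the code that follows, the single-pass form can be written as below, where x_i and y_i are the two users' ratings on the n items they have both rated. Algebraically it is the same quantity as the definition; results differ only through floating-point rounding.

```latex
r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}
         {\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}\;
          \sqrt{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}}
```

All five sums can be accumulated in one loop over the common items, which is what makes a single pass sufficient.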

from math import sqrt
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
        }

def pearson(rating1, rating2):
    sum_xy = 0
    sum_x = 0
    sum_y = 0
    sum_x2 = 0
    sum_y2 = 0
    n = 0
    for key in rating1:
        if key in rating2:
            n += 1
            x = rating1[key]
            y = rating2[key]
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += pow(x, 2)
            sum_y2 += pow(y, 2)
    if n == 0:
        return 0  # no ratings in common; avoids dividing by zero below
    # now compute denominator
    denominator = sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n)
    if denominator == 0:
        return 0
    else:
        return (sum_xy - (sum_x * sum_y) / n) / denominator

print( pearson(users['Angelica'], users['Bill'] ) )
print( pearson(users['Angelica'], users['Hailey'] ) )
print( pearson(users['Angelica'], users['Jordyn'] ) )

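3. One Last Formula: Cosine Similarity

    It is very common in text mining and is also widely used in collaborative filtering.

    Example: track how many times a user plays a song and make recommendations based on that information.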
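As a rough sketch of the play-count example, cosine similarity is the dot product of two vectors divided by the product of their lengths. The code below assumes play counts are stored in dictionaries and that a missing song counts as 0; the function name `cosine_similarity` and the sample counts are made up for illustration.

```python
from math import sqrt


def cosine_similarity(counts1, counts2):
    """Cosine similarity between two sparse vectors stored as dicts.
    Items missing from a dict are treated as 0."""
    # only items present in both dicts contribute to the dot product
    dot = sum(v * counts2[k] for k, v in counts1.items() if k in counts2)
    norm1 = sqrt(sum(v * v for v in counts1.values()))
    norm2 = sqrt(sum(v * v for v in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0  # an all-zero vector has no direction
    return dot / (norm1 * norm2)


# Made-up play counts
ann = {"song A": 10, "song B": 5}
ben = {"song A": 20, "song B": 10}   # same taste, listens twice as much
cat = {"song C": 7}                  # nothing in common with ann

print(cosine_similarity(ann, ben))   # approximately 1.0
print(cosine_similarity(ann, cat))   # 0.0
```

Because shared zeros add nothing to either the dot product or the lengths, the many songs that neither user has played do not affect the result.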

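4. Which Similarity Measure to Use

    If the data is subject to grade inflation (different users use different rating scales), use the Pearson correlation coefficient.

    If the data is dense (almost all attributes have non-zero values) and the magnitude of the attribute values matters, use a measure such as Euclidean or Manhattan distance.

    If the data is sparse, consider cosine similarity.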
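To illustrate the grade-inflation case, here is a small sketch with two made-up users who rank three items in the same order but use different parts of the scale. The helper names and the data are illustrative; the Pearson helper uses the same single-pass formula as the `pearson` function earlier in this post.

```python
from math import sqrt


def manhattan_common(r1, r2):
    """Manhattan distance over the items both users rated."""
    return sum(abs(r1[k] - r2[k]) for k in r1 if k in r2)


def pearson_common(r1, r2):
    """Single-pass Pearson correlation over the items both users rated."""
    common = [k for k in r1 if k in r2]
    n = len(common)
    if n == 0:
        return 0
    sum_x = sum(r1[k] for k in common)
    sum_y = sum(r2[k] for k in common)
    sum_xy = sum(r1[k] * r2[k] for k in common)
    sum_x2 = sum(r1[k] ** 2 for k in common)
    sum_y2 = sum(r2[k] ** 2 for k in common)
    denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    if denominator == 0:
        return 0
    return (sum_xy - sum_x * sum_y / n) / denominator


# Made-up ratings: same ordering of preferences, different scales
generous = {"a": 4.0, "b": 4.5, "c": 5.0}
harsh = {"a": 1.0, "b": 2.0, "c": 3.0}

print(manhattan_common(generous, harsh))  # 7.5, the users look far apart
print(pearson_common(generous, harsh))    # approximately 1.0, perfect agreement
```

The distance measure treats the two users as dissimilar, while the correlation sees that their preferences line up exactly.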

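5. K-Nearest Neighbors

    The problem with the approach above: it relies on a single "most similar" user to make recommendations, so whatever quirky tastes that user has get recommended too.

    One solution is to base recommendations on several similar users. This is where the k-nearest-neighbors method comes in.

    The k most similar users together determine the recommendations. The basic idea is to weight each neighbor's ratings by that neighbor's share of the total similarity, as the recommend method of the recommender class later in this post does.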
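The weighting step can be sketched in isolation as follows. This is a simplified illustration, not the class's actual method: the function name `project_rating` and the neighbor values are made up, and it handles a single item only.

```python
def project_rating(neighbors):
    """Project a rating for one item from the k nearest neighbors.

    neighbors is a list of (similarity, rating) pairs. Each neighbor's
    rating is weighted by its share of the total similarity."""
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no basis for weighting
    return sum(rating * sim / total for sim, rating in neighbors)


# Made-up values: three neighbors with Pearson scores 0.8, 0.7 and 0.5
# get weights 0.40, 0.35 and 0.25 respectively
neighbors = [(0.8, 3.5), (0.7, 5.0), (0.5, 4.5)]
print(project_rating(neighbors))  # about 4.275
```

With k = 1 the single neighbor gets weight 1, which reduces to the nearest-neighbor approach from the first section.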

6. A Recommender Class in Python

import codecs 
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},
         
         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
                 "Deadmau5": 4.0, "Phoenix": 2.0,
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},
         
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},
         
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},
         
         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
                     "Norah Jones": 5.0, "Phoenix": 5.0,
                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                     "Vampire Weekend": 4.0},
         
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }



class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if type(data).__name__ == 'dict':
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id


    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print ("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse = True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))
        

        

    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)
                
        
    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator


    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their distance to
        username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # sort based on distance -- closest first
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
       """Give list of recommendations"""
       recommendations = {}
       # first get list of users  ordered by nearness
       nearest = self.computeNearestNeighbor(user)
       #
       # now get the ratings for the user
       #
       userRatings = self.data[user]
       #
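       # determine the total distance (sum of the k similarity scores)
       k = min(self.k, len(nearest))
       totalDistance = 0.0
       for i in range(k):
          totalDistance += nearest[i][1]
       # a zero total would cause a division by zero below
       if totalDistance == 0:
          return []
       # now iterate through the k nearest neighbors
       # accumulating their ratings
       for i in range(k):
          # compute slice of pie 
          weight = nearest[i][1] / totalDistance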
          # get the name of the person
          name = nearest[i][0]
          # get the ratings for this person
          neighborRatings = self.data[name]
          # get the name of the person
          # now find bands neighbor rated that user didn't
          for artist in neighborRatings:
             if not artist in userRatings:
                if artist not in recommendations:
                   recommendations[artist] = (neighborRatings[artist]
                                              * weight)
                else:
                   recommendations[artist] = (recommendations[artist]
                                              + neighborRatings[artist]
                                              * weight)
       # now make list from dictionary
       recommendations = list(recommendations.items())
       recommendations = [(self.convertProductID2name(k), v)
                          for (k, v) in recommendations]
       # finally sort and return
       recommendations.sort(key=lambda artistTuple: artistTuple[1],
                            reverse = True)
       # Return the first n items
       return recommendations[:self.n]

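7. A Real Dataset

BX-Dump.zip: Cai-Nicolas Ziegler collected more than one million book ratings from the Book Crossing website, covering ratings by 278,858 users of 271,379 books.

The CSV data consists of three tables:

BX-Users: as the name suggests, this table holds user information: an integer user ID, a location, and an age.
BX-Books: each book is described by its ISBN, title, author, year of publication, and publisher.
BX-Book-Ratings: contains the user ID, the book's ISBN, and a rating between 0 and 10.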
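As a hedged sketch of how a ratings file in this layout could be read, the snippet below uses Python's csv module rather than the manual `split(';')` in `loadBookDB`. It assumes semicolon-separated, double-quoted fields and no header row; the sample rows are made up and are not taken from the dataset.

```python
import csv
import io

# Illustrative rows in the assumed layout: "user id";"ISBN";"rating"
sample = (
    '"1001";"0000000001";"7"\n'
    '"1001";"0000000002";"0"\n'
    '"1002";"0000000001";"10"\n'
)

ratings = {}
reader = csv.reader(io.StringIO(sample), delimiter=';', quotechar='"')
for user, isbn, rating in reader:
    # build {user: {isbn: rating}}, the same shape as the users dictionary
    ratings.setdefault(user, {})[isbn] = int(rating)

print(ratings)
# {'1001': {'0000000001': 7, '0000000002': 0}, '1002': {'0000000001': 10}}
```

The resulting nested dictionary has the same shape as the small `users` dictionary, so it can be passed to the same distance functions.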

>>> import recommender as r
>>> rc = r.recommender(r.users)
>>> rc.recommend('Jordyn')
[('Blues Traveler', 5.0)]
>>> rc.loadBookDB('BX-Dump/')
1700018

# Now I can get recommendations for user 171118, who is from Toronto
>>> rc.recommend('171118')
[(u"Devil's Waltz (Alex Delaware Novels (Paperback)) by Jonathan Kellerman", 9.0), (u'Silent Partner (Alex Delaware Novels (Paperback)) by Jonathan Kellerman', 8.0), (u'The Outsiders (Now in Speak!) by S. E. Hinton', 8.0), (u'Thinner by Stephen King', 8.0), (u'Sein Language by JERRY SEINFELD', 8.0)]
>>> rc.userRatings('171118', 5)
Ratings for toronto, ontario, canada
2421
The Careful Writer by Theodore M. Bernstein	10
The Darkest Road (The Fionavar Tapestry, Book 3) by Guy Gavriel Kay	10
Wonderful Life: The Burgess Shale and the Nature of History by Stephen Jay Gould	10
Time Power: The Revolutionary Time Management System That Can Change Your Professional and Personal by Charles Hobbs	10
Just So Stories (Penguin Twentieth-Century Classics) by Rudyard Kipling	10
