Chapter 2: Collaborative Filtering
https://github.com/zacharski/pg2dm-python
1. Finding Similar Users
Manhattan Distance
|x1 - x2| + |y1 - y2|
Euclidean Distance
sqrt( (x1-x2)^2 + (y1-y2)^2 )
Thinking in N Dimensions
×××
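A Flaw
Manhattan distance and Euclidean distance work well when there are no missing values. How to handle missing values is an active area of academic research.
Generalization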
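One common way to generalize the two distances is the Minkowski distance. Assuming that is what this heading refers to, the sketch below shows how a single exponent `r` selects between them (r = 1 for Manhattan, r = 2 for Euclidean). The function name `minkowski` and the choice to return `None` when two users share no items are illustrative assumptions, not taken from the original code.

```python
def minkowski(rating1, rating2, r):
    """Minkowski distance over the items both users rated.

    r = 1 gives the Manhattan distance, r = 2 the Euclidean distance.
    Returns None when the two users have no rated items in common.
    """
    common = [key for key in rating1 if key in rating2]
    if not common:
        return None
    total = sum(abs(rating1[key] - rating2[key]) ** r for key in common)
    return total ** (1 / r)


# Two users taken from the ratings data used in this chapter
hailey = {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0,
          "The Strokes": 4.0, "Vampire Weekend": 1.0}
veronica = {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0,
            "Slightly Stoopid": 2.5, "The Strokes": 3.0}

print(minkowski(hailey, veronica, 1))  # Manhattan distance over the 2 shared artists
print(minkowski(hailey, veronica, 2))  # Euclidean distance over the same artists
```

Only the artists both users rated contribute, which is the same convention the `manhattan` function in this chapter follows.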
Python
#
# FILTERINGDATA.py
#
# Code file for the book Programmer's Guide to Data Mining
# http://guidetodatamining.com
# Ron Zacharski
#
from math import sqrt
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
"Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
"Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
"Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
"Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
"Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
"Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
"Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
}
def manhattan(rating1, rating2):
    """Computes the Manhattan distance. Both rating1 and rating2 are dictionaries
       of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
    distance = 0
    commonRatings = False
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])
            commonRatings = True
    if commonRatings:
        return distance
    else:
        return -1 #Indicates no ratings in common

def computeNearestNeighbor(username, users):
    """creates a sorted list of users based on their distance to username"""
    distances = []
    for user in users:
        if user != username:
            distance = manhattan(users[user], users[username])
            distances.append((distance, user))
    # sort based on distance -- closest first
    distances.sort()
    return distances

def recommend(username, users):
    """Give list of recommendations"""
    # first find nearest neighbor
    nearest = computeNearestNeighbor(username, users)[0][1]
    recommendations = []
    # now find bands neighbor rated that user didn't
    neighborRatings = users[nearest]
    userRatings = users[username]
    for artist in neighborRatings:
        if not artist in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    # using the fn sorted for variety - sort is more efficient
    return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse = True)

# examples - uncomment to run
print( recommend('Hailey', users))
#print( recommend('Chan', users))
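2. Differences in How Users Rate
Pearson Correlation Coefficient
Finding the users most similar to a given user of interest
Besides looking somewhat complicated, the formula above has another problem: an implementation may need several passes over the data. Fortunately for implementers, there is an alternative formula that approximates the Pearson correlation coefficient: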
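The formula itself did not survive in this copy of the notes. Reading it back from the `pearson` function that follows, the single-pass form can be written as below, where x_i and y_i are the two users' ratings of the n items they both rated. This is a reconstruction from the code, not a quotation of the original figure.

```latex
r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}
         {\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}
          \;\sqrt{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}}
```

Every sum can be accumulated in one loop over the co-rated items, which is why only a single scan of the data is needed.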
from math import sqrt
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
"Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
"Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
"Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
"Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
"Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
"Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
"Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
}
def pearson(rating1, rating2):
    sum_xy = 0
    sum_x = 0
    sum_y = 0
    sum_x2 = 0
    sum_y2 = 0
    n = 0
    # accumulate the sums over the items both users rated
    for key in rating1:
        if key in rating2:
            n += 1
            x = rating1[key]
            y = rating2[key]
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += pow(x, 2)
            sum_y2 += pow(y, 2)
    # no items in common: avoid dividing by zero below
    if n == 0:
        return 0
    # now compute denominator
    denominator = sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n)
    if denominator == 0:
        return 0
    else:
        return (sum_xy - (sum_x * sum_y) / n) / denominator

print( pearson(users['Angelica'], users['Bill'] ) )
print( pearson(users['Angelica'], users['Hailey'] ) )
print( pearson(users['Angelica'], users['Jordyn'] ) )
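3. One Last Formula: Cosine Similarity
Cosine similarity is very common in text mining and is also widely used in collaborative filtering.
Example: track how many times a user plays each song and make recommendations based on that information.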
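As a rough sketch of the play-count example, the code below computes cosine similarity between two users whose play counts are stored as dictionaries. The function name `cosine_similarity`, the user names, and the song names are all hypothetical; songs a user never played are simply absent from the dictionary and are treated as 0.

```python
from math import sqrt


def cosine_similarity(counts1, counts2):
    """Cosine similarity between two sparse vectors stored as dicts.

    Missing items count as 0, so only shared items add to the dot product,
    while every item adds to the length of its own vector.
    """
    dot = sum(value * counts2[key]
              for key, value in counts1.items() if key in counts2)
    norm1 = sqrt(sum(value * value for value in counts1.values()))
    norm2 = sqrt(sum(value * value for value in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical play counts
ann = {"song_a": 3, "song_b": 4}
ben = {"song_a": 6, "song_b": 8}   # same taste as ann, plays twice as often
cat = {"song_c": 5}                # nothing in common with ann

print(cosine_similarity(ann, ben))  # 1.0
print(cosine_similarity(ann, cat))  # 0.0
```

Because the measure depends on direction rather than magnitude, `ben` scores a perfect match with `ann` even though his counts are doubled.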
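4. Choosing a Similarity Measure
If the data is subject to grade inflation (different users use different rating scales), use the Pearson correlation coefficient.
If the data is dense (almost all attributes have non-zero values) and the magnitude of the attribute values matters, use a distance measure such as Euclidean or Manhattan distance.
If the data is sparse, consider using cosine similarity.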
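To make the first rule concrete, here is a small sketch with two hypothetical users who rank four items in the same order but use different parts of the scale. The helper functions mirror the `manhattan` and `pearson` functions from earlier in the chapter; the users `strict` and `generous` and their ratings are invented for illustration.

```python
from math import sqrt


def manhattan(rating1, rating2):
    """Sum of absolute rating differences over co-rated items."""
    return sum(abs(rating1[k] - rating2[k]) for k in rating1 if k in rating2)


def pearson(rating1, rating2):
    """Single-pass Pearson correlation over co-rated items."""
    common = [k for k in rating1 if k in rating2]
    n = len(common)
    if n == 0:
        return 0
    sum_x = sum(rating1[k] for k in common)
    sum_y = sum(rating2[k] for k in common)
    sum_xy = sum(rating1[k] * rating2[k] for k in common)
    sum_x2 = sum(rating1[k] ** 2 for k in common)
    sum_y2 = sum(rating2[k] ** 2 for k in common)
    denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    if denominator == 0:
        return 0
    return (sum_xy - sum_x * sum_y / n) / denominator


# Hypothetical users: same ordering of items, different rating scales
strict = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
generous = {"A": 3.5, "B": 4.0, "C": 4.5, "D": 5.0}

print(manhattan(strict, generous))  # large distance: the raw scores differ
print(pearson(strict, generous))    # close to 1: the preferences agree
```

The distance measure sees two quite different users, while the correlation treats them as near-perfect matches, which is the behaviour you want when scales differ between users.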
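5. K-Nearest Neighbors
The problem with the approach above is that it relies on a single "most similar" user, so any quirks in that user's tastes get recommended as well.
One remedy is to base recommendations on several similar users, which is where the k-nearest neighbor method comes in.
Basic idea: use the k most similar users together to determine the recommendations.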
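One way to combine the k neighbors, and the one the `recommend` method of the class in the next section uses, is to weight each neighbor's rating by that neighbor's share of the total similarity. The sketch below isolates that step; the function name `weighted_rating` and the similarity and rating values are made up for illustration.

```python
def weighted_rating(neighbors):
    """Combine ratings from the k nearest neighbors.

    neighbors is a list of (similarity, rating) pairs. Each rating is
    weighted by the neighbor's share of the total similarity.
    Returns None if the similarities sum to zero.
    """
    total = sum(similarity for similarity, _ in neighbors)
    if total == 0:
        return None
    return sum(rating * similarity / total for similarity, rating in neighbors)


# Hypothetical: three neighbors with Pearson scores 0.8, 0.7 and 0.5
# who rated the same artist 3.5, 5.0 and 4.5
neighbors = [(0.8, 3.5), (0.7, 5.0), (0.5, 4.5)]
print(weighted_rating(neighbors))  # roughly 4.275
```

The weights here are 0.4, 0.35 and 0.25, so the most similar neighbor has the largest say but no single user decides the result.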
6. A Recommender Class in Python
import codecs
from math import sqrt
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
"Norah Jones": 4.5, "Phoenix": 5.0,
"Slightly Stoopid": 1.5,
"The Strokes": 2.5, "Vampire Weekend": 2.0},
"Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
"Deadmau5": 4.0, "Phoenix": 2.0,
"Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
"Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
"Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
"Slightly Stoopid": 1.0},
"Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
"Deadmau5": 4.5, "Phoenix": 3.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 2.0},
"Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
"Norah Jones": 4.0, "The Strokes": 4.0,
"Vampire Weekend": 1.0},
"Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0,
"Norah Jones": 5.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 4.0},
"Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
"Norah Jones": 3.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.0, "The Strokes": 5.0},
"Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
"Phoenix": 4.0, "Slightly Stoopid": 2.5,
"The Strokes": 3.0}
}
class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if type(data).__name__ == 'dict':
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id

    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse = True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))

    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)

    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator

    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their distance to
        username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # sort based on distance -- closest first
        # (for Pearson a higher score means a closer user, hence reverse=True)
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
        """Give list of recommendations"""
        recommendations = {}
        # first get list of users ordered by nearness
        nearest = self.computeNearestNeighbor(user)
        #
        # now get the ratings for the user
        #
        userRatings = self.data[user]
        #
        # determine the total distance
        totalDistance = 0.0
        for i in range(self.k):
            totalDistance += nearest[i][1]
        # now iterate through the k nearest neighbors
        # accumulating their ratings
        for i in range(self.k):
            # compute slice of pie
            weight = nearest[i][1] / totalDistance
            # get the name of the person
            name = nearest[i][0]
            # get the ratings for this person
            neighborRatings = self.data[name]
            # now find bands neighbor rated that user didn't
            for artist in neighborRatings:
                if not artist in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist]
                                                   * weight)
                    else:
                        recommendations[artist] = (recommendations[artist]
                                                   + neighborRatings[artist]
                                                   * weight)
        # now make list from dictionary
        recommendations = list(recommendations.items())
        recommendations = [(self.convertProductID2name(k), v)
                           for (k, v) in recommendations]
        # finally sort and return
        recommendations.sort(key=lambda artistTuple: artistTuple[1],
                             reverse = True)
        # Return the first n items
        return recommendations[:self.n]

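7. A Real Dataset
BX-Dump.zip: Cai-Nicolas Ziegler collected over one million book ratings from the Book Crossing website, covering 278,858 users rating 271,379 books.
The data comes as CSV files holding three tables:
- BX-Users: as the name suggests, this table holds user information: an integer user ID field plus location and age fields
- BX-Books: each book is described by its ISBN, title, author, year of publication, and publisher
- BX-Book-Ratings: contains the user ID, the book's ISBN, and a rating between 0 and 10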
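The `loadBookDB` method above splits each line on `;`, which would mis-split any quoted field that itself contains a semicolon. As a possible alternative, the sketch below parses rows of the ratings table with Python's standard `csv` module. It assumes the fields are semicolon-separated and double-quoted, as the stripping code in `loadBookDB` suggests; the function name `load_ratings` and the sample rows are hypothetical, not taken from the real files.

```python
import csv
import io


def load_ratings(lines):
    """Build {user_id: {isbn: rating}} from semicolon-separated, quoted rows.

    lines can be any iterable of text lines, such as an open file.
    """
    data = {}
    reader = csv.reader(lines, delimiter=';', quotechar='"')
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        user, isbn, rating = row[0], row[1], int(row[2])
        data.setdefault(user, {})[isbn] = rating
    return data


# Hypothetical sample rows in the shape described above:
# user ID, ISBN, rating between 0 and 10
sample = '"1001";"0000000001";"7"\n"1001";"0000000002";"0"\n"1002";"0000000001";"10"\n'

ratings = load_ratings(io.StringIO(sample))
print(ratings)
```

The resulting dictionary has the same nested shape as the `users` dictionary, so it could be passed to the recommender in the same way.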
>>> import recommender as r
>>> rc = r.recommender(r.users)
>>> rc.recommend('Jordyn')
[('Blues Traveler', 5.0)]
>>> rc.loadBookDB('BX-Dump/')
1700018
# Now I can get recommendations for user 171118, who is from Toronto
>>> rc.recommend('171118')
[(u"Devil's Waltz (Alex Delaware Novels (Paperback)) by Jonathan Kellerman", 9.0), (u'Silent Partner (Alex Delaware Novels (Paperback)) by Jonathan Kellerman', 8.0), (u'The Outsiders (Now in Speak!) by S. E. Hinton', 8.0), (u'Thinner by Stephen King', 8.0), (u'Sein Language by JERRY SEINFELD', 8.0)]
>>> rc.userRatings('171118', 5)
Ratings for toronto, ontario, canada
2421
The Careful Writer by Theodore M. Bernstein 10
The Darkest Road (The Fionavar Tapestry, Book 3) by Guy Gavriel Kay 10
Wonderful Life: The Burgess Shale and the Nature of History by Stephen Jay Gould 10
Time Power: The Revolutionary Time Management System That Can Change Your Professional and Personal by Charles Hobbs 10
Just So Stories (Penguin Twentieth-Century Classics) by Rudyard Kipling 10