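Chapter 4: Content-Based Filtering and Classification: Filtering on Item Attributes
Collaborative filtering, also called social filtering, draws on the user community to make recommendations. Its difficulties include data sparsity and scalability. A further problem is that recommender systems built on collaborative filtering tend to recommend items that are already popular, which produces a "rich get richer" effect. As an extreme example, consider an album just released by a brand-new band: because nobody has rated or bought anything by the band, the album will never be recommended. This is the so-called "cold start" problem.
There is a different approach to recommendation. Consider the streaming music site Pandora, whose recommendations are based on what it calls the Music Genome Project. Pandora hires professional musicians with strong backgrounds in music theory as analysts, and they determine the features of each song (which Pandora calls genes). The analysts receive more than 150 hours of training. Once trained, they spend an average of 20 to 30 minutes analyzing a song to determine its genes, many of which are technical. Each song is scored on more than 400 genes. With roughly 15,000 new songs added every month, this is a labor-intensive process.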
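I. The importance of selecting appropriate values
Feature selection: features such as a song's genre or mood are chosen, and each takes a value from 1 to 5.
The data format as implemented in Python: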
music = {"Dr Dog/Fate": {"piano": 2.5, "vocals": 4, "beat": 3.5, "blues": 3, "guitar": 5, "backup vocals": 4, "rap": 1},
"Phoenix/Lisztomania": {"piano": 2, "vocals": 5, "beat": 5, "blues": 3, "guitar": 2, "backup vocals": 1, "rap": 1},
"Heartless Bastards/Out at Sea": {"piano": 1, "vocals": 5, "beat": 4, "blues": 2, "guitar": 4, "backup vocals": 1, "rap": 1},
"Todd Snider/Don't Tempt Me": {"piano": 4, "vocals": 5, "beat": 4, "blues": 4, "guitar": 1, "backup vocals": 5, "rap": 1},
"The Black Keys/Magic Potion": {"piano": 1, "vocals": 4, "beat": 5, "blues": 3.5, "guitar": 5, "backup vocals": 1, "rap": 1},
"Glee Cast/Jessie's Girl": {"piano": 1, "vocals": 5, "beat": 3.5, "blues": 3, "guitar":4, "backup vocals": 5, "rap": 1},
"La Roux/Bulletproof": {"piano": 5, "vocals": 5, "beat": 4, "blues": 2, "guitar": 1, "backup vocals": 1, "rap": 1},
"Mike Posner": {"piano": 2.5, "vocals": 4, "beat": 4, "blues": 1, "guitar": 1, "backup vocals": 1, "rap": 1},
"Black Eyed Peas/Rock That Body": {"piano": 2, "vocals": 5, "beat": 5, "blues": 1, "guitar": 2, "backup vocals": 2, "rap": 4},
"Lady Gaga/Alejandro": {"piano": 1, "vocals": 5, "beat": 3, "blues": 2, "guitar": 1, "backup vocals": 2, "rap": 1}}
Recommending with Manhattan distance
from math import sqrt
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
"Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
"Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
"Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
"Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
"Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
"Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
"Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
}
def manhattan(rating1, rating2):
    """Computes the Manhattan distance. Both rating1 and rating2 are dictionaries
    of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
    distance = 0
    # sum the absolute differences over the keys both dictionaries share
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])
    return distance

def computeNearestNeighbor(username, users):
    """creates a sorted list of users based on their distance to username"""
    distances = []
    for user in users:
        if user != username:
            distance = manhattan(users[user], users[username])
            distances.append((distance, user))
    # sort based on distance -- closest first
    distances.sort()
    return distances

def recommend(username, users):
    """Give list of recommendations"""
    # first find nearest neighbor
    nearest = computeNearestNeighbor(username, users)[0][1]
    recommendations = []
    # now find bands neighbor rated that user didn't
    neighborRatings = users[nearest]
    userRatings = users[username]
    for artist in neighborRatings:
        if artist not in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    # using the fn sorted for variety - sort is more efficient
    return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse=True)
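
Because the distance function only needs dictionaries of numbers, the same idea applies to song features as well as to user ratings. The sketch below is a minimal illustration, not the original listing: it redefines a compact `manhattan` so the block runs on its own, copies four songs from the music dictionary above, and uses a helper name, `nearest_songs`, that is an assumption made for this example.

```python
def manhattan(rating1, rating2):
    """Manhattan distance over the keys the two dictionaries share."""
    return sum(abs(rating1[k] - rating2[k]) for k in rating1 if k in rating2)


# Four songs copied from the music dictionary above.
music = {
    "Dr Dog/Fate": {"piano": 2.5, "vocals": 4, "beat": 3.5, "blues": 3,
                    "guitar": 5, "backup vocals": 4, "rap": 1},
    "Heartless Bastards/Out at Sea": {"piano": 1, "vocals": 5, "beat": 4, "blues": 2,
                                      "guitar": 4, "backup vocals": 1, "rap": 1},
    "The Black Keys/Magic Potion": {"piano": 1, "vocals": 4, "beat": 5, "blues": 3.5,
                                    "guitar": 5, "backup vocals": 1, "rap": 1},
    "Lady Gaga/Alejandro": {"piano": 1, "vocals": 5, "beat": 3, "blues": 2,
                            "guitar": 1, "backup vocals": 2, "rap": 1},
}


def nearest_songs(title, songs):
    """Return (distance, other_title) pairs sorted closest first (hypothetical helper)."""
    return sorted(
        (manhattan(songs[title], features), other)
        for other, features in songs.items()
        if other != title
    )


neighbors = nearest_songs("The Black Keys/Magic Potion", music)
print(neighbors[0])  # (4.5, 'Heartless Bastards/Out at Sea')
```

Within this four-song subset, the closest song to "The Black Keys/Magic Potion" is "Heartless Bastards/Out at Sea", mainly because the two share similar guitar and backup vocals scores.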
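A problem of value ranges
It is not a good thing when a single feature dominates the distance calculation. In fact, differences in the value ranges of different attributes are a serious problem for any recommender system.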
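
A small made-up example shows how one wide-ranging feature can swamp the rest. The listings below are hypothetical (asking price in dollars and number of bedrooms); they are not data from this chapter.

```python
def manhattan(v1, v2):
    """Manhattan distance between two equal-length feature tuples."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


# Hypothetical listings: (asking price in dollars, number of bedrooms)
target = (250_000, 2)
house_a = (260_000, 2)  # same number of bedrooms, price 10,000 higher
house_b = (251_000, 6)  # four more bedrooms, price only 1,000 higher

dist_a = manhattan(target, house_a)  # 10000 + 0
dist_b = manhattan(target, house_b)  # 1000 + 4
print(dist_a, dist_b)  # 10000 1004
```

House B comes out roughly ten times "closer" than house A even though it has three times as many bedrooms, because differences measured in dollars dwarf differences measured in rooms.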
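II. Normalization
The fix for the problem above is normalization. To remove this skew from the data, we have to standardize, or normalize, it.
One common normalization method rescales each feature's values to the range 0 to 1, for example (val - min) / (max - min).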
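
As a minimal sketch of this min-max rescaling, assuming the feature's values are held in a plain list (the function name and the price figures are made up for illustration):

```python
def min_max_normalize(values):
    """Rescale values to the range 0..1 using (val - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


prices = [100_000, 150_000, 200_000, 300_000]  # hypothetical asking prices
normalized = min_max_normalize(prices)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

The guard for `hi == lo` is a design choice of this sketch; without it a constant feature would cause a division by zero.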
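If you have taken a statistics course, you may be familiar with a more precise way of standardizing data, the standard score: (value - mean) / standard deviation.
The problem with the standard score is that it is strongly affected by outliers.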
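
The outlier sensitivity can be seen in a short experiment. This is only a sketch on hypothetical numbers; it uses the population standard deviation from Python's `statistics` module, which is one of several reasonable choices.

```python
from statistics import mean, pstdev


def standard_scores(values):
    """Return (value - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values, with and without one extreme outlier.
typical = [40, 45, 50, 55, 60]
with_outlier = typical + [1000]

clean = standard_scores(typical)
skewed = standard_scores(with_outlier)

# Gap between the scores of 40 and 60 in each case.
gap_clean = clean[4] - clean[0]
gap_skewed = skewed[4] - skewed[0]
print(gap_clean, gap_skewed)
```

The single outlier drags the mean upward and inflates the standard deviation, so the ordinary values end up squeezed into a much narrower band of scores than before.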
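The modified standard score replaces the mean with the median and the standard deviation with the absolute standard deviation.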
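
A minimal sketch of the modified standard score follows, taking the absolute standard deviation to be the average absolute distance from the median (the definition that the unit test later in this chapter checks). The function name is an assumption, and the sample values are the first test list from that unit test.

```python
from statistics import median


def modified_standard_scores(values):
    """Return (value - median) / absolute standard deviation for each value."""
    med = median(values)
    # Absolute standard deviation: mean absolute distance from the median.
    asd = sum(abs(v - med) for v in values) / len(values)
    if asd == 0:
        return [0.0 for _ in values]
    return [(v - med) / asd for v in values]


values = [54, 72, 78, 49, 65, 63, 75, 67, 54]  # median 65, absolute standard deviation 8
scores = modified_standard_scores(values)
print(scores[:4])  # [-1.375, 0.875, 1.625, -2.0]
```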
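When should you normalize? Keep in mind that normalization has a computational cost. It is worth doing when:
1. The data mining method computes the distance between two objects from their feature values.
2. The features are on different scales (especially when they differ markedly, for example a house's asking price versus its number of bedrooms).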
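III. Python code for a nearest neighbor classifier
The task: recommend songs to a user who likes Green Day.
Data required:
Song attributes: music = { }
The music data converted to vectors for easier computation: items = { }
Each user's ratings for some of the songs: users = { }
Then build a classification function.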
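
The pieces listed above can be put together roughly as follows. This is a sketch under stated assumptions rather than the chapter's actual listing: the item vectors are copied from the music dictionary earlier in these notes, while the like ("L") and dislike ("D") labels, the user name, the unrated song's feature vector, and the function names are all hypothetical. No normalization is applied because every feature here already uses the same 1 to 5 scale.

```python
# Feature order: piano, vocals, beat, blues, guitar, backup vocals, rap
items = {
    "Dr Dog/Fate": [2.5, 4, 3.5, 3, 5, 4, 1],
    "Phoenix/Lisztomania": [2, 5, 5, 3, 2, 1, 1],
    "Heartless Bastards/Out at Sea": [1, 5, 4, 2, 4, 1, 1],
    "Todd Snider/Don't Tempt Me": [4, 5, 4, 4, 1, 5, 1],
    "The Black Keys/Magic Potion": [1, 4, 5, 3.5, 5, 1, 1],
    "Lady Gaga/Alejandro": [1, 5, 3, 2, 1, 2, 1],
}

# Hypothetical labels: "L" means the user liked the song, "D" disliked it.
users = {
    "Angelica": {
        "Dr Dog/Fate": "L",
        "Phoenix/Lisztomania": "L",
        "Heartless Bastards/Out at Sea": "D",
        "Todd Snider/Don't Tempt Me": "D",
        "The Black Keys/Magic Potion": "D",
        "Lady Gaga/Alejandro": "L",
    }
}


def manhattan(vector1, vector2):
    """Manhattan distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(vector1, vector2))


def nearest_neighbor(item_vector, rated_names):
    """Return (distance, name) for the rated song closest to item_vector."""
    return min((manhattan(item_vector, items[name]), name) for name in rated_names)


def classify(user, item_vector):
    """Predict "L" or "D" by copying the label of the nearest rated song."""
    _, name = nearest_neighbor(item_vector, users[user])
    return users[user][name]


new_song = [1, 5, 2.5, 1, 1, 5, 1]  # hypothetical unrated song
print(nearest_neighbor(new_song, users["Angelica"]))  # (4.5, 'Lady Gaga/Alejandro')
print(classify("Angelica", new_song))  # L
```

The prediction is only as good as the single nearest song, so one oddly labeled neighbor can flip the result.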
IV. Identifying an athlete's sport
This uses a small data set in two files: athletesTrainingSet.txt (for training the classifier) and athletesTestSet.txt (for evaluating it).
class Classifier:

    def __init__(self, filename):
        self.medianAndDeviation = []
        # reading the data in from the file
        with open(filename) as f:
            lines = f.readlines()
        # the first line names the type of each tab-separated column
        self.format = lines[0].strip().split('\t')
        self.data = []
        for line in lines[1:]:
            fields = line.strip().split('\t')
            ignore = []
            vector = []
            for i in range(len(fields)):
                if self.format[i] == 'num':
                    vector.append(int(fields[i]))
                elif self.format[i] == 'comment':
                    ignore.append(fields[i])
                elif self.format[i] == 'class':
                    classification = fields[i]
            self.data.append((classification, vector, ignore))
        self.rawData = list(self.data)

    ##################################################
    ###
    ###  FINISH THE FOLLOWING TWO METHODS

    def getMedian(self, alist):
        """return median of alist"""
        """TO BE DONE"""
        return 0

    def getAbsoluteStandardDeviation(self, alist, median):
        """given alist and median return absolute standard deviation"""
        """TO BE DONE"""
        return 0

    ###
    ###
    ##################################################

def unitTest():
    list1 = [54, 72, 78, 49, 65, 63, 75, 67, 54]
    list2 = [54, 72, 78, 49, 65, 63, 75, 67, 54, 68]
    list3 = [69]
    list4 = [69, 72]
    classifier = Classifier('athletesTrainingSet.txt')
    m1 = classifier.getMedian(list1)
    m2 = classifier.getMedian(list2)
    m3 = classifier.getMedian(list3)
    m4 = classifier.getMedian(list4)
    asd1 = classifier.getAbsoluteStandardDeviation(list1, m1)
    asd2 = classifier.getAbsoluteStandardDeviation(list2, m2)
    asd3 = classifier.getAbsoluteStandardDeviation(list3, m3)
    asd4 = classifier.getAbsoluteStandardDeviation(list4, m4)
    assert(round(m1, 3) == 65)
    assert(round(m2, 3) == 66)
    assert(round(m3, 3) == 69)
    assert(round(m4, 3) == 70.5)
    assert(round(asd1, 3) == 8)
    assert(round(asd2, 3) == 7.5)
    assert(round(asd3, 3) == 0)
    assert(round(asd4, 3) == 1.5)
    print("getMedian and getAbsoluteStandardDeviation work correctly")

unitTest()
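
One possible way to complete the two unfinished methods is sketched below as standalone functions, so that it runs without the training file. The function names are assumptions; the bodies could be moved into `getMedian` and `getAbsoluteStandardDeviation` as they are. The absolute standard deviation is taken to be the mean absolute distance from the median, which is the definition that reproduces the expected values in the unit test.

```python
def get_median(alist):
    """Return the median of a non-empty list of numbers."""
    if not alist:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(alist)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]  # odd length: the middle element
    return (ordered[mid - 1] + ordered[mid]) / 2  # even length: mean of the middle pair


def get_absolute_standard_deviation(alist, median):
    """Return the mean absolute distance of the values from the median."""
    return sum(abs(x - median) for x in alist) / len(alist)


# The four lists used by unitTest above.
list1 = [54, 72, 78, 49, 65, 63, 75, 67, 54]
list2 = [54, 72, 78, 49, 65, 63, 75, 67, 54, 68]
list3 = [69]
list4 = [69, 72]

for sample in (list1, list2, list3, list4):
    m = get_median(sample)
    print(m, get_absolute_standard_deviation(sample, m))
```

Sorting a copy with `sorted` leaves the caller's list in its original order, which matters if the same column is reused afterwards.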
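V. The Iris data set
VI. The auto MPG data
This data comes from Carnegie Mellon University and was originally used at the 1983 American Statistical Association exposition.
VII. Miscellany
Keep normalization, and how much it matters, in mind.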