Machine Learning Notes 1 (k-Nearest Neighbors Algorithm)

Posted by 江先生 on 2018-01-12

Life is short, I use Python

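k-nearest neighbors (kNN): put simply, kNN classifies a sample by measuring the distance between its feature values and those of known samples.

Pros: high accuracy, insensitive to outliers, no assumptions about the input data
Cons: high computational cost, high memory cost
Works with: numeric and nominal values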

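How it works:

We start with a collection of sample data, called the training set, in which every sample carries a label; that is, we know which class each sample belongs to. When new, unlabeled data comes in, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general only the top K most similar samples are considered, which is where the K in kNN comes from; K is usually an integer no greater than 20. Finally, the class that occurs most often among those K samples is assigned to the new data.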

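General workflow for kNN:

Collect data: any method can be used.
Prepare data: numeric values are needed for the distance calculation, ideally in a structured data format.
Analyze data: any method can be used.
Train the algorithm: this step does not apply to kNN.
Test the algorithm: compute the error rate.
Use the algorithm: first supply sample data and obtain structured output, then run kNN to determine which class each input belongs to, and finally carry out any follow-up processing on the computed classes.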

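Implementing the kNN classification algorithm -- pseudocode

For every point in the data set whose class is unknown, do the following:

Compute the distance between the current point and each point in the labeled data set;
Sort the points by increasing distance;
Take the K points closest to the current point;
Count how often each class appears among those K points;
Return the most frequent class among the K points as the predicted class of the current point.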

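The distance between two points is computed with the Euclidean distance formula. For points A = (xA0, xA1) and B = (xB0, xB1):

d = sqrt((xA0-xB0)**2+(xA1-xB1)**2)

For example, the distance between (0, 0) and (1, 2) is:

sqrt((1-0)**2+(2-0)**2)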
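As a quick check of the formula, the example distance can be computed directly. This is only a minimal sketch; the helper name `euclidean_distance` is made up for illustration and does not appear in the code that follows.

```python
import math

import numpy as np


def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Square the per-feature differences, sum them, take the square root.
    return float(np.sqrt(((a - b) ** 2).sum()))


d = euclidean_distance([0, 0], [1, 2])
print(d)                               # sqrt(5), roughly 2.236
print(math.isclose(d, math.sqrt(5)))   # True
```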

Implementation:

import numpy as np
import operator
"""
def CreateDataSet():
    group=np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels=['A','A','B','B']
    return group,labels
print(CreateDataSet())
"""
"""
inX -- input vector to classify
dataSet -- training sample set
labels -- label vector
k -- number of nearest neighbors to use
The label vector has as many elements as dataSet has rows.
"""
def classify(inX,dataSet,labels,k):
    dataSetSize=dataSet.shape[0]    # number of rows in the training set

    # Repeat the input vector dataSetSize times along the rows (once along the columns), then subtract the training set
    diffMat=np.tile(inX,(dataSetSize,1))-dataSet
    print("diffMat:")
    print(diffMat)
    # Square each difference
    sqDiffMat=diffMat**2
    print("sqDiffMat:")
    print(sqDiffMat)
    # Sum the squared differences along each row
    sqDistances=sqDiffMat.sum(axis=1)
    print("sqDistances:")
    print(sqDistances)
    # Take the square root to get the distance from the input vector to each training sample
    distances=np.sqrt(sqDistances)
    print("distances")
    print(distances)
    # Indices that would sort the distances in ascending order
    sortedDistIndicies=np.argsort(distances)
    print("sortedDistIndicies")
    print(sortedDistIndicies)
    classCount={}

    # Count the labels of the k nearest samples
    for i in range(k):
        voteIlabel=labels[sortedDistIndicies[i]]
        print("voteIlabel"+str(i))
        print(voteIlabel)
        classCount[voteIlabel]=classCount.get(voteIlabel,0)+1
        print("classCount"+str(i))
        print(classCount)
    # Sort the (label, count) pairs by count, highest first
    sortedClassCount=sorted(classCount.items(),
                            key=operator.itemgetter(1),reverse=True)
    print("sortedClassCount:")
    print(sortedClassCount)
    return sortedClassCount[0][0]

if __name__=='__main__':
    # Training sample set
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])

    # Label vector
    labels = ['A', 'A', 'B', 'B']

    # Input vector
    inX=[0,0]

    # Number of nearest neighbors to use
    k=3
    result=classify(inX,group,labels,k)
    print(result)


"""
Output:
diffMat:
[[-1.  -1.1]
 [-1.  -1. ]
 [ 0.   0. ]
 [ 0.  -0.1]]
sqDiffMat:
[[ 1.    1.21]
 [ 1.    1.  ]
 [ 0.    0.  ]
 [ 0.    0.01]]
sqDistances:
[ 2.21  2.    0.    0.01]
distances
[ 1.48660687  1.41421356  0.          0.1       ]
sortedDistIndicies
[2 3 1 0]
voteIlabel0
B
classCount0
{'B': 1}
voteIlabel1
B
classCount1
{'B': 2}
voteIlabel2
A
classCount2
{'B': 2, 'A': 1}
sortedClassCount:
[('B', 2), ('A', 1)]
B

Process finished with exit code 0
"""

Test result:

For the input [0,0] the function returns B; that is, kNN assigns the input vector [0,0] to class B.
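The same classification can be written more compactly by letting NumPy broadcasting replace `np.tile` and letting `collections.Counter` do the vote counting. The sketch below is an alternative formulation, not the code used in this post; the name `knn_classify` is hypothetical.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Classify in_x by majority vote among its k nearest training points."""
    data_set = np.asarray(data_set, dtype=float)
    # Broadcasting subtracts in_x from every row, so np.tile is not needed.
    diffs = data_set - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first.
    nearest = np.argsort(distances)[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the most frequent label.
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], group, labels, 3))   # B
```

With the sample data above, the three nearest neighbors of [0,0] carry the labels B, B and A, so the vote goes to B, matching the output of `classify`.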

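Example: using kNN to improve matches on a dating site

Collect data: a text file is provided
Prepare data: parse the text file with Python
Analyze data: draw 2-D scatter plots with Matplotlib
Train the algorithm: this step does not apply to kNN
Test the algorithm: use part of the data Helen provided as test samples
The difference between test samples and the rest is that test samples are already classified; whenever the predicted class differs from the actual class, it is counted as an error.
Use the algorithm: build a simple command-line program that takes some feature values as input and predicts whether the other person is someone she would like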

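Preparing data: parsing data from a text file

Features in the text sample data:

Frequent flier miles earned per year
Percentage of time spent playing video games
Liters of ice cream consumed per week

Parser that converts the text records into a NumPy array: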

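import matplotlib.pyplot as plt

def file2matrix(filename):
    # Open the file and read all of its lines
    with open(filename, 'r', encoding='utf-8') as fr:
        arrayOLines = fr.readlines()

    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # Create a matrix filled with zeros
    returnMat = np.zeros((numberOfLines, 3))
    print(returnMat)
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        print(line)
        # Strip the trailing newline and surrounding whitespace
        line = line.strip()
        print(line)
        # Split the line on '\t' into a list of fields
        listFromLine = line.split('\t')
        # Store the first three fields in the feature matrix
        returnMat[index, :] = listFromLine[0:3]
        # Store the last field in the label vector
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

# Raw string so the backslashes in the Windows path are not treated as escapes
datingDataMat,datingLabels=file2matrix(r'D:\liuguojiang_Python\city_58\city_58\datingTestSet2.txt')
fig=plt.figure()
# Create the axes first, then set the title and labels on it
ax=fig.add_subplot(111)
ax.set_title('K-')
ax.set_xlabel('fly')
ax.set_ylabel('consume')

# Plot column 0 against column 1; marker size and color both scale with the class label
ax.scatter(datingDataMat[:,0],datingDataMat[:,1],
           15.0*np.array(datingLabels),15.0*np.array(datingLabels))
plt.show()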

Note: the data files used in the code can be downloaded here: LiuGuoJiang/machinelearninginaction
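For a purely numeric, tab-separated file, NumPy's `np.loadtxt` can do the same parsing in one call. The sketch below assumes each row holds three numeric features followed by an integer label, which is what `file2matrix` expects; the function name `load_dating_data` and the two sample rows are made up for illustration.

```python
import os
import tempfile

import numpy as np


def load_dating_data(filename):
    """Read tab-separated rows into a feature matrix and a label list."""
    # ndmin=2 keeps the result two-dimensional even for a one-line file.
    data = np.loadtxt(filename, delimiter='\t', ndmin=2)
    features = data[:, 0:3]
    class_labels = data[:, -1].astype(int).tolist()
    return features, class_labels


# Made-up rows written to a temporary file, just to exercise the function.
sample = "1000\t5.0\t0.5\t1\n2000\t10.0\t1.0\t2\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)
    path = tmp.name

features, class_labels = load_dating_data(path)
os.remove(path)
print(features.shape)   # (2, 3)
print(class_labels)     # [1, 2]
```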

[Figure: the parsed text data displayed as a scatter plot]

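Preparing data: normalizing numeric values

Take any row of the sample data: when the distance is computed, the frequent flier miles are much larger than the other features, so they dominate the result. The data therefore needs to be normalized, for example to the range 0 to 1 or -1 to 1. The following formula maps a feature value from any range into the interval 0 to 1:

newValue=(oldValue-min)/(max-min)

Here min and max are the smallest and largest values of that feature in the data set.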
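To see why this matters, the sketch below compares the distance between two rows before and after applying the formula. The four rows are made-up values in the same column order as the data file (flier miles, percentage of time gaming, liters of ice cream), not real records.

```python
import numpy as np

# Made-up rows: [flier miles per year, % time gaming, liters of ice cream].
data = np.array([
    [40000.0,  8.0, 0.9],
    [14000.0,  7.0, 1.6],
    [26000.0,  1.0, 0.8],
    [75000.0, 13.0, 0.4],
])

# Raw distance between the first two rows: the mileage gap dominates.
raw_dist = np.sqrt(((data[0] - data[1]) ** 2).sum())

# newValue = (oldValue - min) / (max - min), applied per column.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
norm = (data - col_min) / (col_max - col_min)
norm_dist = np.sqrt(((norm[0] - norm[1]) ** 2).sum())

print(raw_dist)    # about 26000, almost entirely the mileage difference
print(norm_dist)   # below 1; all three features now contribute
```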

Feature normalization function:

def autoNorm(dataSet):
    # Minimum of each column
    minVals=dataSet.min(0)
    # Maximum of each column
    maxVals=dataSet.max(0)
    # Range of each column (max - min)
    ranges=maxVals-minVals
    # Zero-filled matrix with the same shape as dataSet
    normDataSet=np.zeros([dataSet.shape[0],dataSet.shape[1]])
    print(normDataSet)
    # Number of rows in dataSet
    m=dataSet.shape[0]
    # np.tile(minVals,(m,1)) repeats minVals m times along the rows and once along the columns
    normDataSet=dataSet-np.tile(minVals,(m,1))  # (oldValue-min)
    normDataSet=normDataSet/np.tile(ranges,(m,1))   # (oldValue-min)/(max-min)
    return normDataSet,ranges,minVals

normDataSet,ranges,minVals=autoNorm(datingDataMat)
print(normDataSet)
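One edge case worth noting: if a feature column is constant, max-min is 0 and the division produces `nan`. The variant below sketches one possible guard, replacing a zero range with 1 so that the column normalizes to 0. The name `auto_norm_safe` is hypothetical and the choice of fallback is an assumption, not something taken from the code above.

```python
import numpy as np


def auto_norm_safe(data_set):
    """Min-max normalize columns; constant columns map to 0 instead of nan."""
    data_set = np.asarray(data_set, dtype=float)
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Avoid 0/0 for constant columns by dividing by 1 there instead.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    norm = (data_set - min_vals) / safe_ranges
    return norm, safe_ranges, min_vals


# Second column is constant, so its range is 0.
demo = np.array([[10.0, 5.0], [20.0, 5.0], [30.0, 5.0]])
norm, safe_ranges, min_vals = auto_norm_safe(demo)
print(norm)
```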

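Testing the algorithm: an important part of machine learning is evaluating how accurate an algorithm is. Typically 90% of the available data is used as training samples and the remaining 10% is used to test the classifier and measure its accuracy. The 10% should be chosen at random.

Test code for the classifier: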

def datingClassUnitTest():
    hoRatio=0.10    # fraction of the data held out for testing
    # Raw string so the backslashes in the Windows path are not treated as escapes
    datingDataMat, datingLabels = file2matrix(r'D:\liuguojiang_Python\city_58\city_58\datingTestSet2.txt')
    print(datingDataMat)
    normDataSet, ranges, minVals = autoNorm(datingDataMat)
    print(normDataSet)
    m=normDataSet.shape[0]
    numTestVecs=int(m*hoRatio)
    print("numTestVecs")
    print(numTestVecs)
    errorCount=0.0
    for i in range(numTestVecs):
        # Classify test row i against the remaining rows, which serve as the training set
        classifierResult=classify(normDataSet[i,:],normDataSet[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print("the classifier came back with:{},the real answer is:{}".format(classifierResult,datingLabels[i]))
        if classifierResult!=datingLabels[i]:
            errorCount+=1.0
    print("the total error rate is:{}".format(errorCount/float(numTestVecs)))

"""
Output:
the classifier came back with:3,the real answer is:3
the classifier came back with:2,the real answer is:2
the classifier came back with:1,the real answer is:1
.........
the classifier came back with:1,the real answer is:1
the classifier came back with:3,the real answer is:3
the classifier came back with:3,the real answer is:3
the classifier came back with:2,the real answer is:2
the classifier came back with:1,the real answer is:1
the classifier came back with:3,the real answer is:1
the total error rate is:0.05
"""

The classifier's error rate on this data set is 5%, which means it can be used to help classify potential dates.
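The text recommends choosing the 10% test set at random, whereas `datingClassUnitTest` simply takes the first 10% of the rows. A minimal sketch of a shuffled hold-out split follows. `holdout_split` and its parameters are illustrative names, the data is made up, and the fixed seed is only there to make the example repeatable.

```python
import numpy as np


def holdout_split(features, labels, ho_ratio=0.10, seed=0):
    """Randomly split rows into test and training features and labels."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    m = features.shape[0]
    num_test = int(m * ho_ratio)
    # A random permutation of the row indices gives a shuffled split.
    order = np.random.default_rng(seed).permutation(m)
    test_idx, train_idx = order[:num_test], order[num_test:]
    return (features[test_idx], labels[test_idx],
            features[train_idx], labels[train_idx])


# Made-up data: 20 rows, 3 features, labels 1..3.
X = np.arange(60, dtype=float).reshape(20, 3)
y = np.arange(20) % 3 + 1
X_test, y_test, X_train, y_train = holdout_split(X, y)
print(X_test.shape, X_train.shape)   # (2, 3) (18, 3)
```

The test rows could then be classified against the training rows in the same loop as above, using the returned arrays in place of the slices.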

Next, a function that lets the user enter the input vector they want evaluated and uses the classifier to decide which class it belongs to:

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input(
        "percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    # Raw string so the backslashes in the Windows path are not treated as escapes
    datingDataMat, datingLabels = file2matrix(r'D:\liuguojiang_Python\city_58\city_58\datingTestSet2.txt')
    normDataSet, ranges, minVals = autoNorm(datingDataMat)
    # Feature order must match the columns of the data file
    inArr = np.array([ffMiles, percentTats, iceCream])
    # Normalize the input with the training set's min and range before classifying
    classifierResult = classify((inArr - minVals) / ranges,
                                normDataSet, datingLabels, 3)
    # Labels are 1-based, so subtract 1 to index resultList
    print("You will probably like this person: {}".format(resultList[classifierResult - 1]))
if __name__=='__main__':
    classifyPerson()

"""
Output:
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: in small doses
"""

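Summary:

Define the kNN classification routine.
Define a function that turns the text data set into a 2-D array for easier processing.
To keep a feature with large values from dominating the result, define a normalization function using the formula (oldValue-min)/(max-min).
Define a test function to check whether the classifier's error rate is acceptable for use.
Define code that takes an input vector from the user and classifies it.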
