What is spectral clustering?
Spectral clustering treats the samples as nodes of a graph and looks for a suitable cut that partitions the graph. The core idea is to minimize the total weight of the cut edges; for an unweighted graph this simply means cutting as few edges as possible. However, the plain minimum cut tends to produce unbalanced, locally optimal partitions (e.g., splitting off a tiny subgraph), so two normalized objectives are commonly used:
(1) RatioCut: requires each subgraph in the partition to contain as many nodes as possible; the denominator becomes the number of nodes in each subgraph.
(2) NCut: accounts for the edge weights attached to each subgraph; the denominator becomes the total edge weight (volume) of each subgraph.
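Written out, the two objectives are (standard definitions from the spectral clustering literature, added here for reference; $W(A_i,\bar{A}_i)$ is the total weight of the edges cut between subgraph $A_i$ and the rest of the graph, $|A_i|$ is its node count, and $\mathrm{vol}(A_i)$ is the sum of the degrees of its nodes):

$$\mathrm{RatioCut}(A_1,\dots,A_k)=\frac{1}{2}\sum_{i=1}^{k}\frac{W(A_i,\bar{A}_i)}{|A_i|},\qquad
\mathrm{NCut}(A_1,\dots,A_k)=\frac{1}{2}\sum_{i=1}^{k}\frac{W(A_i,\bar{A}_i)}{\mathrm{vol}(A_i)}$$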
For the detailed derivation and solution, see: https://blog.csdn.net/songbinxu/article/details/80838865
What is the overall spectral clustering pipeline?
- Compute the pairwise distance matrix (e.g., with the Euclidean distance)
- Build the adjacency matrix A using KNN
- Compute the degree matrix D and the Laplacian matrix L from A
- Normalize: $L \to D^{-1/2}LD^{-1/2}$
- Eigendecompose $D^{-1/2}LD^{-1/2}$ to obtain the eigenvector matrix $H_{n\times n}$ (the standard algorithm keeps only the eigenvectors of the k smallest eigenvalues)
- Feed the rows of $H_{n\times n}$ into k-means as samples
- Obtain the clustering result $C=(C_1,C_2,\cdots,C_k)$
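Before walking through a manual implementation of these steps, note that the whole pipeline is also available off the shelf. A minimal sketch using scikit-learn's SpectralClustering (my own cross-check, not part of the original walkthrough; `affinity='nearest_neighbors'` mirrors the KNN graph built below, and `n_neighbors=10` is an arbitrary choice):

```python
from sklearn import datasets
from sklearn.cluster import SpectralClustering

X, _ = datasets.make_circles(n_samples=1000, factor=0.5, noise=0.05)
labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                            n_neighbors=10).fit_predict(X)
```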
Python implementation:
(1) First, generate the data:
```python
from sklearn import datasets
import matplotlib.pyplot as plt
%matplotlib inline

x1, y1 = datasets.make_circles(n_samples=1000, factor=0.5, noise=0.05)

plt.title('make_circles function example')
plt.scatter(x1[:, 0], x1[:, 1], marker='o')
plt.show()
```
x1 has shape (1000, 2).
(2) Next, compute the pairwise distance between every two samples:
```python
import numpy as np

def euclidDistance(x1, x2, sqrt_flag=False):
    # squared Euclidean distance by default; take the root only when asked
    res = np.sum((x1 - x2) ** 2)
    if sqrt_flag:
        res = np.sqrt(res)
    return res
```
Store these distances in matrix form:
```python
def calEuclidDistanceMatrix(X):
    # build the symmetric N x N matrix of pairwise (squared) Euclidean distances
    X = np.array(X)
    S = np.zeros((len(X), len(X)))
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            S[i][j] = 1.0 * euclidDistance(X[i], X[j])
            S[j][i] = S[i][j]   # distances are symmetric
    return S

S = calEuclidDistanceMatrix(x1)
```
```
array([[0.00000000e+00, 1.13270081e+00, 2.62565479e+00, ..., 2.99144277e+00, 1.88193070e+00, 1.12840739e+00],
       [1.13270081e+00, 0.00000000e+00, 2.72601994e+00, ..., 2.95125426e+00, 5.11864947e-01, 6.05388856e-05],
       [2.62565479e+00, 2.72601994e+00, 0.00000000e+00, ..., 1.30747922e-02, 1.18180915e+00, 2.74692378e+00],
       ...,
       [2.99144277e+00, 2.95125426e+00, 1.30747922e-02, ..., 0.00000000e+00, 1.26037239e+00, 2.97382982e+00],
       [1.88193070e+00, 5.11864947e-01, 1.18180915e+00, ..., 1.26037239e+00, 0.00000000e+00, 5.22992113e-01],
       [1.12840739e+00, 6.05388856e-05, 2.74692378e+00, ..., 2.97382982e+00, 5.22992113e-01, 0.00000000e+00]])
```
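The double loop above does O(N²) Python-level work. The same squared-distance matrix can be computed in one shot with broadcasting; a sketch (the helper name `calEuclidDistanceMatrixVec` is mine, not from the original post):

```python
def calEuclidDistanceMatrixVec(X):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, evaluated for all pairs at once
    X = np.asarray(X)
    sq = np.sum(X ** 2, axis=1)
    S = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(S, 0.0)      # exact zeros on the diagonal
    return np.maximum(S, 0.0)     # clamp tiny negatives from floating-point error
```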
(3) Use KNN to find each sample's k nearest neighbours, then compute the adjacency matrix:
```python
def myKNN(S, k, sigma=1.0):
    N = len(S)
    # adjacency matrix
    A = np.zeros((N, N))
    for i in range(N):
        # pair each distance with its sample index
        dist_with_index = zip(S[i], range(N))
        # sort by distance
        dist_with_index = sorted(dist_with_index, key=lambda x: x[0])
        # indices of the k smallest distances (k+1 entries, since the point
        # itself appears at distance 0): xi's k nearest neighbours
        neighbours_id = [dist_with_index[m][1] for m in range(k + 1)]
        # build the adjacency matrix with a Gaussian kernel on the distances
        for j in neighbours_id:   # xj is xi's neighbour
            A[i][j] = np.exp(-S[i][j] / 2 / sigma / sigma)
            A[j][i] = A[i][j]     # mutually
    return A

A = myKNN(S, 3)
```
```
array([[1.        , 0.        , 0.        , ..., 0.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        , 0.99996973],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        , 0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        , 0.        ],
       [0.        , 0.99996973, 0.        , ..., 0.        , 0.        , 1.        ]])
```
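Two quick checks on A (my own additions): the matrix is symmetric by construction, and the diagonal is 1 because each point is its own nearest neighbour at distance 0, giving $e^{0}=1$:

```python
assert np.allclose(A, A.T)           # A[j][i] was mirrored from A[i][j]
print(A.diagonal()[:5])              # all 1.0: self-distance is 0, exp(0) = 1
print((A > 0).sum(axis=1).mean())    # average number of neighbours per node
```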
(4) Compute the normalized Laplacian matrix
```python
def calLaplacianMatrix(adjacentMatrix):
    # compute the degree matrix: D = diag(row sums of A)
    degreeMatrix = np.sum(adjacentMatrix, axis=1)
    # compute the unnormalized Laplacian: L = D - A
    laplacianMatrix = np.diag(degreeMatrix) - adjacentMatrix
    # normalize: D^(-1/2) L D^(-1/2)
    sqrtDegreeMatrix = np.diag(1.0 / (degreeMatrix ** 0.5))
    return np.dot(np.dot(sqrtDegreeMatrix, laplacianMatrix), sqrtDegreeMatrix)

L_sys = calLaplacianMatrix(A)
```
```
array([[ 0.66601736,  0.        ,  0.        , ...,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.74997723,  0.        , ...,  0.        ,  0.        , -0.28868642],
       [ 0.        ,  0.        ,  0.74983185, ...,  0.        ,  0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.66662382,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,  0.74953329,  0.        ],
       [ 0.        , -0.28868642,  0.        , ...,  0.        ,  0.        ,  0.66665079]])
```
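A useful sanity check (a standard property of the symmetric normalized Laplacian, added here as an aside): its eigenvalues lie in $[0, 2]$, with one eigenvalue near 0 for each connected component of the graph:

```python
vals = np.linalg.eigvalsh(L_sys)   # eigenvalues of a symmetric matrix, sorted ascending
print(vals.min(), vals.max())      # expect roughly 0 <= vals <= 2
```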
(5) Eigendecomposition
```python
lam, V = np.linalg.eig(L_sys)           # H's shape is n*n
lam = zip(lam, range(len(lam)))         # pair each eigenvalue with its column index
lam = sorted(lam, key=lambda x: x[0])   # sort ascending by eigenvalue
# stack the eigenvectors in that order as the columns of H (all 1000 are kept here)
H = np.vstack([V[:, i] for (v, i) in lam[:1000]]).T
H = np.asarray(H).astype(float)         # eig may return a complex dtype; cast back to float
```
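Since `L_sys` is symmetric, `np.linalg.eigh` is the more natural routine: it returns real eigenvalues already sorted in ascending order, so the zip/sort step is unnecessary. A sketch of the standard variant that keeps only the eigenvectors of the k smallest eigenvalues (my own addition; the code above keeps all of them):

```python
k = 2                                # number of clusters
lam_h, V_h = np.linalg.eigh(L_sys)   # real eigenvalues, ascending order
H_k = V_h[:, :k]                     # embedding from the k smallest eigenvalues
```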
(6) Cluster with Kmeans
```python
from sklearn.cluster import KMeans

def spKmeans(H):
    # run k-means on the rows of the spectral embedding H
    sp_kmeans = KMeans(n_clusters=2).fit(H)
    return sp_kmeans.labels_
```
```python
labels = spKmeans(H)
plt.title('spectral cluster result')
plt.scatter(x1[:, 0], x1[:, 1], marker='o', c=labels)
plt.show()
```
(7) Comparison with plain Kmeans clustering. Because k-means partitions the plane into convex regions, it cannot separate the two concentric rings that spectral clustering recovers:
```python
pure_kmeans = KMeans(n_clusters=2).fit(x1)
plt.title('pure kmeans cluster result')
plt.scatter(x1[:, 0], x1[:, 1], marker='o', c=pure_kmeans.labels_)
plt.show()
```
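Because make_circles also returns ground-truth labels y1, the two results can be scored quantitatively. A sketch using scikit-learn's adjusted Rand index (my own addition; ARI = 1 means a perfect match with the ground truth):

```python
from sklearn.metrics import adjusted_rand_score

print('spectral ARI:   ', adjusted_rand_score(y1, labels))
print('pure kmeans ARI:', adjusted_rand_score(y1, pure_kmeans.labels_))
```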
References:
https://www.cnblogs.com/xiximayou/p/13180579.html
https://www.cnblogs.com/chenmo1/p/11681669.html