As is well known, machine learning can be broadly divided into three categories: supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning we have a large amount of labeled data to train a model, hoping the model learns the data distribution well enough to predict samples it has never seen. The source of that performance, the training data, is therefore critically important: you need enough training data to cover the sample distribution of the real world, or the learned model is meaningless. Unsupervised learning has no labeled data at all; this is the familiar clustering setting, where the data are divided into classes using only their own distribution. Semi-supervised learning, as the name suggests, sits between the two: only a small amount of the data is labeled, and we try to learn useful information from that small labeled set together with a large unlabeled set.
1. Semi-Supervised Learning
Semi-supervised learning comes into play when some of your data has labels and some does not, and typically the vast majority is unlabeled with only a handful of labeled points. A semi-supervised algorithm makes full use of the unlabeled data to capture the underlying distribution of the whole dataset. It rests on three assumptions:
- Smoothness assumption: similar data points have the same label.
- Cluster assumption: data points in the same cluster have the same label.
- Manifold assumption: data points on the same manifold structure have the same label.
For example, suppose we have only two labeled points, as in the figure below. If we train a classifier on them directly, say LR or SVM, we get the decision boundary in the left panel. If the data is actually distributed as in the right panel, then anyone can see that the classifier from the left panel is hopelessly bad: our labeled training data is far too sparse to cover the cases we may meet in the future. But if, as in the right panel, we also take the mass of unlabeled data (in black) into account and look at the global picture, a good algorithm will discover that the data actually forms two rings (lying on two circular manifolds), assign the outer ring to the red class, and assign the inner ring to the blue class. In practice, labeled data is expensive and hard to obtain, while unlabeled data is not; a simple web-crawling script can collect plenty. So being able to exploit large amounts of unlabeled data to improve model learning is extremely valuable.
There are many semi-supervised learning algorithms. Below we introduce the simplest one, label propagation; simple is my favorite.
2. The Label Propagation Algorithm
The core idea of label propagation (LP) is very simple: similar data points should have the same label. The LP algorithm has two main steps: 1) build a similarity matrix; 2) propagate boldly.
2.1 Building the Similarity Matrix
LP is graph-based, so first we need to build a graph over all the data: each node is a data point, labeled or unlabeled, and the edge between nodes i and j carries their similarity. There are many ways to build this graph; here we assume a fully connected graph, with the edge weight between nodes i and j given by

    w_ij = exp( -||x_i - x_j||² / α² )

Here, α is a hyperparameter.
Another very common construction is the kNN graph: keep only the weights to each node's k nearest neighbors and set everything else to 0 (i.e., no edge), which yields a sparse similarity matrix.
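To make the two constructions concrete, here is a minimal vectorized sketch (the function names and dense storage are mine for illustration; the actual implementation in Part 3 uses explicit loops):

import numpy as np

def rbf_graph(X, alpha):
    # pairwise squared distances via ||a - b||² = ||a||² + ||b||² - 2·a·b
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / alpha ** 2)          # fully connected weight matrix W

def knn_graph(X, k):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    W = np.zeros_like(d2)
    # keep only each node's k nearest neighbors (column 0 of argsort is the point itself)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    rows = np.arange(X.shape[0])[:, None]
    W[rows, nn] = 1.0                        # sparse in content, stored densely here
    return W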
2.2 The LP Algorithm
The label propagation algorithm itself is very simple: labels are propagated between nodes along the edges. The larger an edge weight, the more similar the two nodes and the more easily a label travels across. We define an N×N probability transition matrix P:

    P_ij = w_ij / Σ_k w_ik
P_ij is the probability of moving from node i to node j. Suppose there are C classes and L labeled samples. We define an L×C label matrix Y_L whose i-th row is the label indicator vector of sample i: if sample i belongs to class j, the j-th entry of that row is 1 and all others are 0. Likewise, we give the U unlabeled samples a U×C label matrix Y_U. Stacking them gives an N×C soft label matrix F = [Y_L; Y_U]. "Soft" means that for each sample i we keep a probability of belonging to every class, rather than exclusively assigning it to a single class with probability 1. Of course, when we finally decide the class of sample i, we take the max, i.e., the class with the largest probability. As for the Y_U inside F, it is unknown at the start; what should its initial value be? It doesn't matter — any value will do.
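In code, building P and the initial F from a weight matrix W might look like this (a sketch; `labels` is assumed to hold integer class ids for the first L rows of W):

import numpy as np

def init_propagation(W, labels, num_classes):
    P = W / W.sum(axis=1, keepdims=True)      # row-normalize W into the transition matrix P
    num_labeled = len(labels)
    F = np.zeros((W.shape[0], num_classes))   # soft label matrix F = [Y_L; Y_U]
    F[np.arange(num_labeled), labels] = 1.0   # one-hot Y_L; Y_U left at zero (any init works)
    return P, F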
At long last, the simple LP algorithm is as follows:
- 1) Propagate: F = PF.
- 2) Reset the labels of the labeled samples in F: F_L = Y_L.
- Repeat steps 1) and 2) until F converges.
Step 1) multiplies matrix P by matrix F; in this step, every node propagates its label to the other nodes with the probabilities given by P. The more similar two nodes are (the closer in Euclidean space), the more easily one node's label takes over the other's — like cliques forming. Step 2) is crucial: the labels of the labeled data are fixed in advance and must not be dragged along, so after every propagation step they are reset to their original labels. As the labeled points keep propagating their labels outward, the final class boundary ends up crossing through high-density regions and settling in the low-density gaps, as if the labeled samples of different classes had each staked out a sphere of influence. A minimal sketch of this loop follows.
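Putting propagate and clamp together (illustrative names; a fuller implementation follows in Part 3):

import numpy as np

def propagate_labels(W, labels, num_classes, max_iter=500, tol=1e-3):
    P = W / W.sum(axis=1, keepdims=True)    # transition matrix from the similarity matrix
    num_labeled = len(labels)
    Y_L = np.zeros((num_labeled, num_classes))
    Y_L[np.arange(num_labeled), labels] = 1.0
    F = np.zeros((W.shape[0], num_classes))
    F[:num_labeled] = Y_L
    for _ in range(max_iter):
        F_new = P @ F                       # step 1: propagate
        F_new[:num_labeled] = Y_L           # step 2: clamp the labeled rows
        if np.abs(F_new - F).sum() < tol:   # converged
            F = F_new
            break
        F = F_new
    return np.argmax(F, axis=1)             # final hard labels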
2.3 A Variant of the LP Algorithm
As we saw, every iteration computes the whole soft label matrix F = [Y_L; Y_U], but Y_L is already known, computing it is pointless, and in step 2) we have to restore it anyway. What we actually care about is Y_U — so can we compute only Y_U? Yes. Partition the matrix P into labeled (L) and unlabeled (U) blocks:

    P = | P_LL  P_LU |
        | P_UL  P_UU |
Then the algorithm reduces to a single update:

    F_U ← P_UU · F_U + P_UL · Y_L
Iterate this step until convergence and you're done — pretty cool, right? Note that F_U depends not only on the labels of the labeled data and their transition probabilities, but also on the current labels and transition probabilities of the unlabeled data. This is how LP additionally exploits the distribution of the unlabeled data.
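A sketch of this reduced iteration, assuming P was built with the labeled rows first and split as above:

import numpy as np

def lp_variant(P, Y_L, max_iter=500, tol=1e-3):
    L = Y_L.shape[0]
    P_UL, P_UU = P[L:, :L], P[L:, L:]
    const = P_UL @ Y_L                               # P_UL · Y_L never changes across iterations
    F_U = np.zeros((P_UU.shape[0], Y_L.shape[1]))    # arbitrary init
    for _ in range(max_iter):
        F_next = P_UU @ F_U + const
        if np.abs(F_next - F_U).sum() < tol:
            return F_next
        F_U = F_next
    return F_U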
Convergence is also very easy to prove; see reference [1] for details. In fact, the iteration converges to a closed-form fixed point:

    F_U = (I − P_UU)⁻¹ · P_UL · Y_L
So we can also solve this directly to obtain the final Y_U. In practice, however, matrix inversion costs O(n³), so when there is a great deal of unlabeled data, inverting I − P_UU becomes very expensive, and the iterative algorithm is usually the better choice.
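For moderate U, the fixed point can be computed directly; here is a sketch using a linear solve, which is cheaper and more stable than forming the inverse explicitly:

import numpy as np

def lp_closed_form(P, Y_L):
    L = Y_L.shape[0]
    P_UL, P_UU = P[L:, :L], P[L:, L:]
    I = np.eye(P_UU.shape[0])
    # solve (I - P_UU) F_U = P_UL Y_L instead of explicitly computing the inverse
    return np.linalg.solve(I - P_UU, P_UL @ Y_L)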
3. A Python Implementation of LP
I won't repeat how to set up a Python environment; see my earlier posts. The only extra dependencies are the classic numpy and matplotlib. The code includes both graph construction methods, RBF and kNN, and generates two toy datasets: two elongated bands and two concentric rings. In Part 4 we will experiment on a larger dataset; here we first do a simple visual check of the code's correctness before moving on.
Algorithm code:
#***************************************************************************
#*
#* Description: label propagation
#* Author: Zou Xiaoyi (zouxy09@qq.com)
#* Date: 2015-10-15
#* HomePage: http://blog.csdn.net/zouxy09
#*
#**************************************************************************

import numpy as np

# return the indices of the k nearest neighbors of query within dataSet
def navie_knn(dataSet, query, k):
    numSamples = dataSet.shape[0]

    ## step 1: calculate Euclidean distance
    diff = np.tile(query, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2
    squaredDist = np.sum(squaredDiff, axis=1)  # sum is performed by row

    ## step 2: sort the distance
    sortedDistIndices = np.argsort(squaredDist)
    if k > len(sortedDistIndices):
        k = len(sortedDistIndices)

    return sortedDistIndices[0:k]

# build a big graph (normalized weight matrix)
def buildGraph(MatX, kernel_type, rbf_sigma=None, knn_num_neighbors=None):
    num_samples = MatX.shape[0]
    affinity_matrix = np.zeros((num_samples, num_samples), np.float32)
    if kernel_type == 'rbf':
        if rbf_sigma is None:
            raise ValueError('You should input a sigma of rbf kernel!')
        for i in range(num_samples):
            row_sum = 0.0
            for j in range(num_samples):
                diff = MatX[i, :] - MatX[j, :]
                affinity_matrix[i][j] = np.exp(sum(diff ** 2) / (-2.0 * rbf_sigma ** 2))
                row_sum += affinity_matrix[i][j]
            affinity_matrix[i][:] /= row_sum  # row-normalize so each row sums to 1
    elif kernel_type == 'knn':
        if knn_num_neighbors is None:
            raise ValueError('You should input a k of knn kernel!')
        for i in range(num_samples):
            k_neighbors = navie_knn(MatX, MatX[i, :], knn_num_neighbors)
            affinity_matrix[i][k_neighbors] = 1.0 / knn_num_neighbors
    else:
        raise NameError('Not support kernel type! You can use knn or rbf!')

    return affinity_matrix

# label propagation
def labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type='rbf', rbf_sigma=1.5,
                     knn_num_neighbors=10, max_iter=500, tol=1e-3):
    # initialize
    num_label_samples = Mat_Label.shape[0]
    num_unlabel_samples = Mat_Unlabel.shape[0]
    num_samples = num_label_samples + num_unlabel_samples
    labels_list = np.unique(labels)
    num_classes = len(labels_list)

    MatX = np.vstack((Mat_Label, Mat_Unlabel))
    clamp_data_label = np.zeros((num_label_samples, num_classes), np.float32)
    for i in range(num_label_samples):
        clamp_data_label[i][labels[i]] = 1.0  # one-hot rows for the labeled samples

    label_function = np.zeros((num_samples, num_classes), np.float32)
    label_function[0:num_label_samples] = clamp_data_label
    label_function[num_label_samples:num_samples] = -1  # arbitrary init for the unlabeled part

    # graph construction
    affinity_matrix = buildGraph(MatX, kernel_type, rbf_sigma, knn_num_neighbors)

    # start to propagation
    iter = 0
    pre_label_function = np.zeros((num_samples, num_classes), np.float32)
    changed = np.abs(pre_label_function - label_function).sum()
    while iter < max_iter and changed > tol:
        if iter % 1 == 0:
            print("---> Iteration %d/%d, changed: %f" % (iter, max_iter, changed))
        pre_label_function = label_function
        iter += 1

        # propagation
        label_function = np.dot(affinity_matrix, label_function)

        # clamp: the labeled samples must keep their known labels
        label_function[0:num_label_samples] = clamp_data_label

        # check converge
        changed = np.abs(pre_label_function - label_function).sum()

    # get terminate label of unlabeled data
    unlabel_data_labels = np.zeros(num_unlabel_samples)
    for i in range(num_unlabel_samples):
        unlabel_data_labels[i] = np.argmax(label_function[i + num_label_samples])

    return unlabel_data_labels
Test code:
#***************************************************************************
#*
#* Description: label propagation
#* Author: Zou Xiaoyi (zouxy09@qq.com)
#* Date: 2015-10-15
#* HomePage: http://blog.csdn.net/zouxy09
#*
#**************************************************************************

import math
import numpy as np
from label_propagation import labelPropagation

# show the result: labeled samples as diamonds, unlabeled ones as circles
def show(Mat_Label, labels, Mat_Unlabel, unlabel_data_labels):
    import matplotlib.pyplot as plt

    for i in range(Mat_Label.shape[0]):
        if int(labels[i]) == 0:
            plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Dr')
        elif int(labels[i]) == 1:
            plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Db')
        else:
            plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Dy')

    for i in range(Mat_Unlabel.shape[0]):
        if int(unlabel_data_labels[i]) == 0:
            plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'or')
        elif int(unlabel_data_labels[i]) == 1:
            plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'ob')
        else:
            plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'oy')

    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.xlim(0.0, 12.)
    plt.ylim(0.0, 12.)
    plt.show()

# two concentric rings of points around a common center
def loadCircleData(num_data):
    center = np.array([5.0, 5.0])
    radiu_inner = 2
    radiu_outer = 4
    num_inner = num_data // 3
    num_outer = num_data - num_inner

    data = []
    theta = 0.0
    for i in range(num_inner):
        pho = (theta % 360) * math.pi / 180
        tmp = np.zeros(2, np.float32)
        tmp[0] = radiu_inner * math.cos(pho) + np.random.rand(1) + center[0]
        tmp[1] = radiu_inner * math.sin(pho) + np.random.rand(1) + center[1]
        data.append(tmp)
        theta += 2

    theta = 0.0
    for i in range(num_outer):
        pho = (theta % 360) * math.pi / 180
        tmp = np.zeros(2, np.float32)
        tmp[0] = radiu_outer * math.cos(pho) + np.random.rand(1) + center[0]
        tmp[1] = radiu_outer * math.sin(pho) + np.random.rand(1) + center[1]
        data.append(tmp)
        theta += 1

    Mat_Label = np.zeros((2, 2), np.float32)
    Mat_Label[0] = center + np.array([-radiu_inner + 0.5, 0])
    Mat_Label[1] = center + np.array([-radiu_outer + 0.5, 0])
    labels = [0, 1]
    Mat_Unlabel = np.vstack(data)
    return Mat_Label, labels, Mat_Unlabel

# two elongated bands, one around each labeled point
def loadBandData(num_unlabel_samples):
    #Mat_Label = np.array([[5.0, 2.], [5.0, 8.0]])
    #labels = [0, 1]
    #Mat_Unlabel = np.array([[5.1, 2.], [5.0, 8.1]])

    Mat_Label = np.array([[5.0, 2.], [5.0, 8.0]])
    labels = [0, 1]
    num_dim = Mat_Label.shape[1]
    half = num_unlabel_samples // 2
    Mat_Unlabel = np.zeros((num_unlabel_samples, num_dim), np.float32)
    Mat_Unlabel[:half, :] = (np.random.rand(half, num_dim) - 0.5) * np.array([3, 1]) + Mat_Label[0]
    Mat_Unlabel[half:num_unlabel_samples, :] = (np.random.rand(half, num_dim) - 0.5) * np.array([3, 1]) + Mat_Label[1]
    return Mat_Label, labels, Mat_Unlabel

# main function
if __name__ == "__main__":
    num_unlabel_samples = 800
    #Mat_Label, labels, Mat_Unlabel = loadBandData(num_unlabel_samples)
    Mat_Label, labels, Mat_Unlabel = loadCircleData(num_unlabel_samples)

    ## Notice: when using 'rbf' as the kernel, the choice of the hyperparameter 'sigma' is very
    ## important! It should be chosen according to your dataset, specifically the distances between
    ## data points: it should roughly ensure that each point has about 10 near neighbors with a
    ## sufficiently large w_ij. It also influences the speed of convergence. So maybe the 'knn'
    ## kernel is the better choice!
    #unlabel_data_labels = labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type='rbf', rbf_sigma=0.2)
    unlabel_data_labels = labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type='knn', knn_num_neighbors=10, max_iter=400)

    show(Mat_Label, labels, Mat_Unlabel, unlabel_data_labels)
Everything in the code that needs a comment has one; if anything is unclear, feel free to get in touch. The results at different iteration counts are shown below:
A rather pretty propagation process, isn't it? The gradual convergence is also visible numerically as the iterations proceed; the change values over the iterations are as follows:
---> Iteration 0/400, changed: 1602.000000
---> Iteration 1/400, changed: 6.300182
---> Iteration 2/400, changed: 5.129996
---> Iteration 3/400, changed: 4.301994
---> Iteration 4/400, changed: 3.819295
---> Iteration 5/400, changed: 3.501743
---> Iteration 6/400, changed: 3.277122
---> Iteration 7/400, changed: 3.105952
---> Iteration 8/400, changed: 2.967030
---> Iteration 9/400, changed: 2.848606
---> Iteration 10/400, changed: 2.743997
---> Iteration 11/400, changed: 2.649270
---> Iteration 12/400, changed: 2.562057
---> Iteration 13/400, changed: 2.480885
---> Iteration 14/400, changed: 2.404774
---> Iteration 15/400, changed: 2.333075
---> Iteration 16/400, changed: 2.265301
---> Iteration 17/400, changed: 2.201107
---> Iteration 18/400, changed: 2.140209
---> Iteration 19/400, changed: 2.082354
---> Iteration 20/400, changed: 2.027376
---> Iteration 21/400, changed: 1.975071
---> Iteration 22/400, changed: 1.925286
---> Iteration 23/400, changed: 1.877894
---> Iteration 24/400, changed: 1.832743
---> Iteration 25/400, changed: 1.789721
---> Iteration 26/400, changed: 1.748706
---> Iteration 27/400, changed: 1.709593
---> Iteration 28/400, changed: 1.672284
---> Iteration 29/400, changed: 1.636668
---> Iteration 30/400, changed: 1.602668
---> Iteration 31/400, changed: 1.570200
---> Iteration 32/400, changed: 1.539179
---> Iteration 33/400, changed: 1.509530
---> Iteration 34/400, changed: 1.481182
---> Iteration 35/400, changed: 1.454066
---> Iteration 36/400, changed: 1.428120
---> Iteration 37/400, changed: 1.403283
---> Iteration 38/400, changed: 1.379502
---> Iteration 39/400, changed: 1.356734
---> Iteration 40/400, changed: 1.334906
---> Iteration 41/400, changed: 1.313983
---> Iteration 42/400, changed: 1.293921
---> Iteration 43/400, changed: 1.274681
---> Iteration 44/400, changed: 1.256214
---> Iteration 45/400, changed: 1.238491
---> Iteration 46/400, changed: 1.221474
---> Iteration 47/400, changed: 1.205126
---> Iteration 48/400, changed: 1.189417
---> Iteration 49/400, changed: 1.174316
---> Iteration 50/400, changed: 1.159804
---> Iteration 51/400, changed: 1.145844
---> Iteration 52/400, changed: 1.132414
---> Iteration 53/400, changed: 1.119490
---> Iteration 54/400, changed: 1.107032
---> Iteration 55/400, changed: 1.095054
---> Iteration 56/400, changed: 1.083513
---> Iteration 57/400, changed: 1.072397
---> Iteration 58/400, changed: 1.061671
---> Iteration 59/400, changed: 1.051324
---> Iteration 60/400, changed: 1.041363
---> Iteration 61/400, changed: 1.031742
---> Iteration 62/400, changed: 1.022459
---> Iteration 63/400, changed: 1.013494
---> Iteration 64/400, changed: 1.004836
---> Iteration 65/400, changed: 0.996484
---> Iteration 66/400, changed: 0.988407
---> Iteration 67/400, changed: 0.980592
---> Iteration 68/400, changed: 0.973045
---> Iteration 69/400, changed: 0.965744
---> Iteration 70/400, changed: 0.958682
---> Iteration 71/400, changed: 0.951848
---> Iteration 72/400, changed: 0.945227
---> Iteration 73/400, changed: 0.938820
---> Iteration 74/400, changed: 0.932608
---> Iteration 75/400, changed: 0.926590
---> Iteration 76/400, changed: 0.920765
---> Iteration 77/400, changed: 0.915107
---> Iteration 78/400, changed: 0.909628
---> Iteration 79/400, changed: 0.904309
---> Iteration 80/400, changed: 0.899143
---> Iteration 81/400, changed: 0.894122
---> Iteration 82/400, changed: 0.889259
---> Iteration 83/400, changed: 0.884530
---> Iteration 84/400, changed: 0.879933
---> Iteration 85/400, changed: 0.875464
---> Iteration 86/400, changed: 0.871121
---> Iteration 87/400, changed: 0.866888
---> Iteration 88/400, changed: 0.862773
---> Iteration 89/400, changed: 0.858783
---> Iteration 90/400, changed: 0.854879
---> Iteration 91/400, changed: 0.851084
---> Iteration 92/400, changed: 0.847382
---> Iteration 93/400, changed: 0.843779
---> Iteration 94/400, changed: 0.840274
---> Iteration 95/400, changed: 0.836842
---> Iteration 96/400, changed: 0.833501
---> Iteration 97/400, changed: 0.830240
---> Iteration 98/400, changed: 0.827051
---> Iteration 99/400, changed: 0.823950
---> Iteration 100/400, changed: 0.820906
---> Iteration 101/400, changed: 0.817946
---> Iteration 102/400, changed: 0.815053
---> Iteration 103/400, changed: 0.812217
---> Iteration 104/400, changed: 0.809437
---> Iteration 105/400, changed: 0.806724
---> Iteration 106/400, changed: 0.804076
---> Iteration 107/400, changed: 0.801480
---> Iteration 108/400, changed: 0.798937
---> Iteration 109/400, changed: 0.796448
---> Iteration 110/400, changed: 0.794008
---> Iteration 111/400, changed: 0.791612
---> Iteration 112/400, changed: 0.789282
---> Iteration 113/400, changed: 0.786984
---> Iteration 114/400, changed: 0.784728
---> Iteration 115/400, changed: 0.782516
---> Iteration 116/400, changed: 0.780355
---> Iteration 117/400, changed: 0.778216
---> Iteration 118/400, changed: 0.776139
---> Iteration 119/400, changed: 0.774087
---> Iteration 120/400, changed: 0.772072
---> Iteration 121/400, changed: 0.770085
---> Iteration 122/400, changed: 0.768146
---> Iteration 123/400, changed: 0.766232
---> Iteration 124/400, changed: 0.764356
---> Iteration 125/400, changed: 0.762504
---> Iteration 126/400, changed: 0.760685
---> Iteration 127/400, changed: 0.758889
---> Iteration 128/400, changed: 0.757135
---> Iteration 129/400, changed: 0.755406
4. An MPI-Parallel Implementation of LP
Here we test the variant version of LP. From the formula we can see that the second term, P_UL · Y_L, never changes across iterations, so it can be computed once before iterating to avoid redundant work. Either way, though, LP must multiply a U×U matrix P_UU by a U×C matrix F_U. When there is a great deal of unlabeled data and many classes, this computation is slow and uses a lot of memory. On top of that, building the graph requires pairwise similarities, an O(n²) computation, which also becomes considerable when the feature dimension is large. So we have to consider parallelizing, ideally on a cluster. How?
Parallelizing an algorithm generally takes one of two forms: data parallelism and model parallelism.
Data parallelism is easy to understand: partition the data and let each node process only its share — for example, computing each sample's k nearest neighbors during graph construction. With 1000 samples and 20 CPU nodes, we distribute them evenly, let each CPU node compute the k nearest neighbors for 50 samples, and merge everyone's results at the end. The resulting speedup is substantial. A minimal sketch of this pattern follows.
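In mpi4py this pattern might look like the following sketch (a simplified preview of the buildSubGraph_MPI function in the full code below; run it under mpirun):

import numpy as np
import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank, comm_size = comm.Get_rank(), comm.Get_size()

def naive_knn(dataSet, query, k):
    # brute-force kNN, like the helper in Part 3
    dist = np.sum((dataSet - query) ** 2, axis=1)
    return np.argsort(dist)[:k]

def parallel_knn(X, k):
    n = X.shape[0]
    offsets = np.linspace(0, n, comm_size + 1).astype(int)   # even split of the samples
    local = [naive_knn(X, X[i], k)                           # each rank handles its own slice
             for i in range(offsets[comm_rank], offsets[comm_rank + 1])]
    gathered = comm.allgather(local)                         # merge everyone's results
    return [nbrs for part in gathered for nbrs in part]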
Model parallelism usually comes into play when the model is too large to fit in a single machine's memory. When training a huge deep neural network, for instance, the network is cut apart, gradients are computed separately, and a leader node collects everyone's gradients and feeds the updates back (with more refined and efficient engineering techniques in practice, of course). Our LP algorithm also admits model parallelism: if the number of classes C is large, we can split the classes across CPU nodes, which is effectively model parallelism.
Why split the smaller matrix F_U rather than the big matrix P_UU? Because P_UU cannot be partitioned into independent blocks, and a basic principle of parallelism is that the pieces of work must be independent. Updating F_U depends on all of U, so if P_UU were sliced across nodes, every update of F_U would require communication with the other nodes, and that communication is expensive. (In fact, communication is the bottleneck that keeps many parallel systems from reaching linear speedup — linear speedup meaning that adding n machines makes things n times faster.) Splitting F_U over the classes C has no such problem, because the per-class computations are independent; only when deciding each sample's final class do we gather all the F_U blocks back and take the max.
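A sketch of this class-wise model parallelism (illustrative names; the full code below does the same thing with sparse matrices):

import numpy as np
import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank, comm_size = comm.Get_rank(), comm.Get_size()

def lp_model_parallel(P_UU, P_UL, Y_L, num_iter=100):
    C = Y_L.shape[1]
    off = np.linspace(0, C, comm_size + 1).astype(int)    # split the C classes across ranks
    lo, hi = off[comm_rank], off[comm_rank + 1]
    const = P_UL @ Y_L[:, lo:hi]          # this rank's column block of P_UL · Y_L
    F_U = np.zeros((P_UU.shape[0], hi - lo))
    for _ in range(num_iter):
        F_U = P_UU @ F_U + const          # each rank iterates independently: no communication
    blocks = comm.allgather(F_U)          # gather only at the end, for the argmax
    return np.hstack(blocks).argmax(axis=1)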
So the code below contains the seeds of both data parallelism and model parallelism. One more thing worth mentioning: since this is an iterative algorithm, when should it stop? Besides checking convergence, we can evaluate on test labels every few iterations to see how well the model is training overall; this is especially useful for judging whether training is overfitting. The code therefore includes that as well.
All right, the code at last. Feel free to try it on some larger datasets, ideally on an MPI cluster if you have access to one.
The code below depends on numpy, scipy (whose sparse matrices speed up the computation), and mpi4py; mpi4py in turn requires openmpi and Cython, which you can install following my earlier post.
#***************************************************************************
#*
#* Description: label propagation
#* Author: Zou Xiaoyi (zouxy09@qq.com)
#* Date: 2015-10-15
#* HomePage: http://blog.csdn.net/zouxy09
#*
#**************************************************************************

import pickle
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix, eye
import mpi4py.MPI as MPI

#
# Global variables for MPI
#

# instance for invoking MPI related functions
comm = MPI.COMM_WORLD
# the node rank in the whole community
comm_rank = comm.Get_rank()
# the size of the whole community, i.e., the total number of working nodes in the MPI cluster
comm_size = comm.Get_size()

# load mnist dataset
def load_MNIST():
    import gzip
    f = gzip.open("mnist.pkl.gz", "rb")
    train, val, test = pickle.load(f, encoding='latin1')  # latin1 needed for this legacy pickle
    f.close()
    Mat_Label = train[0]
    labels = train[1]
    Mat_Unlabel = test[0]
    groundtruth = test[1]
    labels_id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    return Mat_Label, labels, labels_id, Mat_Unlabel, groundtruth

# return k neighbors index
def navie_knn(dataSet, query, k):
    numSamples = dataSet.shape[0]

    ## step 1: calculate Euclidean distance
    diff = np.tile(query, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2
    squaredDist = np.sum(squaredDiff, axis=1)  # sum is performed by row

    ## step 2: sort the distance
    sortedDistIndices = np.argsort(squaredDist)
    if k > len(sortedDistIndices):
        k = len(sortedDistIndices)

    return sortedDistIndices[0:k]

# build a big graph (normalized weight matrix)
# sparse U x (U + L) matrix
def buildSubGraph(Mat_Label, Mat_Unlabel, knn_num_neighbors):
    num_unlabel_samples = Mat_Unlabel.shape[0]
    data = []; indices = []; indptr = [0]
    Mat_all = np.vstack((Mat_Label, Mat_Unlabel))
    values = np.ones(knn_num_neighbors, np.float32) / knn_num_neighbors
    for i in range(num_unlabel_samples):
        k_neighbors = navie_knn(Mat_all, Mat_Unlabel[i, :], knn_num_neighbors)
        indptr.append(np.int32(indptr[-1]) + knn_num_neighbors)
        indices.extend(k_neighbors)
        data.append(values)
    return csr_matrix((np.hstack(data), indices, indptr))

# build a big graph (normalized weight matrix)
# sparse U x (U + L) matrix, with the rows distributed over the MPI processors
def buildSubGraph_MPI(Mat_Label, Mat_Unlabel, knn_num_neighbors):
    num_unlabel_samples = Mat_Unlabel.shape[0]
    local_data = []; local_indices = []; local_indptr = [0]
    Mat_all = np.vstack((Mat_Label, Mat_Unlabel))
    values = np.ones(knn_num_neighbors, np.float32) / knn_num_neighbors

    # data parallelism: each processor computes the k nearest neighbors of its own slice of samples
    sample_offset = np.linspace(0, num_unlabel_samples, comm_size + 1).astype('int')
    for i in range(sample_offset[comm_rank], sample_offset[comm_rank + 1]):
        k_neighbors = navie_knn(Mat_all, Mat_Unlabel[i, :], knn_num_neighbors)
        local_indptr.append(np.int32(local_indptr[-1]) + knn_num_neighbors)
        local_indices.extend(k_neighbors)
        local_data.append(values)

    # merge the local results from all processors
    data = np.hstack(comm.allgather(local_data))
    indices = np.hstack(comm.allgather(local_indices))
    indptr_tmp = comm.allgather(local_indptr)
    indptr = []
    for i in range(len(indptr_tmp)):
        if i == 0:
            indptr.extend(indptr_tmp[i])
        else:
            last_indptr = indptr[-1]
            del(indptr[-1])
            indptr.extend(indptr_tmp[i] + last_indptr)  # shift each local indptr by the running offset
    return csr_matrix((np.hstack(data), indices, indptr), dtype=np.float32)

# label propagation
def run_label_propagation_sparse(knn_num_neighbors=20, max_iter=100, tol=1e-4, test_per_iter=1):
    # load data and graph
    print("Processor %d/%d loading graph file..." % (comm_rank, comm_size))
    #Mat_Label, labels, Mat_Unlabel, groundtruth = loadFourBandData()
    Mat_Label, labels, labels_id, Mat_Unlabel, unlabel_data_id = load_MNIST()
    if comm_size > len(labels_id):
        raise ValueError("Sorry, the processors must be less than the number of classes")
    #affinity_matrix = buildSubGraph(Mat_Label, Mat_Unlabel, knn_num_neighbors)
    affinity_matrix = buildSubGraph_MPI(Mat_Label, Mat_Unlabel, knn_num_neighbors)

    # get some parameters
    num_classes = len(labels_id)
    num_label_samples = len(labels)
    num_unlabel_samples = Mat_Unlabel.shape[0]
    affinity_matrix_UL = affinity_matrix[:, 0:num_label_samples]
    affinity_matrix_UU = affinity_matrix[:, num_label_samples:num_label_samples + num_unlabel_samples]
    if comm_rank == 0:
        print("Have %d labeled images, %d unlabeled images and %d classes" % (num_label_samples, num_unlabel_samples, num_classes))

    # model parallelism: divide label_function_U and label_function_L over the processors by class
    class_offset = np.linspace(0, num_classes, comm_size + 1).astype('int')

    # initialize local label_function_U
    local_start_class = class_offset[comm_rank]
    local_num_classes = class_offset[comm_rank + 1] - local_start_class
    local_label_function_U = eye(num_unlabel_samples, local_num_classes, 0, np.float32, format='csr')

    # initialize local label_function_L
    local_label_function_L = lil_matrix((num_label_samples, local_num_classes), dtype=np.float32)
    for i in range(num_label_samples):
        class_off = int(labels[i]) - local_start_class
        if class_off >= 0 and class_off < local_num_classes:
            local_label_function_L[i, class_off] = 1.0
    local_label_function_L = local_label_function_L.tocsr()
    # the P_UL * Y_L term never changes, so compute it once before iterating
    local_label_info = affinity_matrix_UL.dot(local_label_function_L)
    print("Processor %d/%d has to process %d classes..." % (comm_rank, comm_size, local_label_function_L.shape[1]))

    # start to propagation
    iter = 1
    changed = 100.0
    evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)
    while True:
        pre_label_function = local_label_function_U.copy()

        # propagation
        local_label_function_U = affinity_matrix_UU.dot(local_label_function_U) + local_label_info

        # check converge: sum the local changes over all processors
        local_changed = abs(pre_label_function - local_label_function_U).sum()
        changed = comm.reduce(local_changed, root=0, op=MPI.SUM)
        status = 'RUN'
        test = False
        if comm_rank == 0:
            if iter % 1 == 0:
                norm_changed = changed / (num_unlabel_samples * num_classes)
                print("---> Iteration %d/%d, changed: %f" % (iter, max_iter, norm_changed))
            if iter >= max_iter or changed < tol:
                status = 'STOP'
                print("************** Iteration over! ****************")
            if iter % test_per_iter == 0:
                test = True
        iter += 1
        # the root decides whether to test or stop, and broadcasts the decision
        test = comm.bcast(test if comm_rank == 0 else None, root=0)
        status = comm.bcast(status if comm_rank == 0 else None, root=0)
        if status == 'STOP':
            break
        if test == True:
            evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)
    evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)

def evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id):
    # get local label with max score
    if comm_rank == 0:
        print("Start to combine local result...")
    local_max_score = np.zeros((num_unlabel_samples, 1), np.float32)
    local_max_label = np.zeros((num_unlabel_samples, 1), np.int32)
    for i in range(num_unlabel_samples):
        local_max_label[i, 0] = np.argmax(local_label_function_U.getrow(i).todense())
        local_max_score[i, 0] = local_label_function_U[i, local_max_label[i, 0]]
        local_max_label[i, 0] += local_start_class  # map the local class index back to the global one

    # gather the results from all the processors
    if comm_rank == 0:
        print("Start to gather results from all processors")
    all_max_label = np.hstack(comm.allgather(local_max_label))
    all_max_score = np.hstack(comm.allgather(local_max_score))

    # get terminate label of unlabeled data
    if comm_rank == 0:
        print("Start to analysis the results...")
        right_predict_count = 0
        for i in range(num_unlabel_samples):
            if i % 1000 == 0:
                print("***", all_max_score[i])
            max_idx = np.argmax(all_max_score[i])
            max_label = all_max_label[i, max_idx]
            if int(unlabel_data_id[i]) == int(labels_id[max_label]):
                right_predict_count += 1
        accuracy = float(right_predict_count) * 100.0 / num_unlabel_samples
        print("Have %d samples, accuracy: %.3f%%!" % (num_unlabel_samples, accuracy))

if __name__ == '__main__':
    run_label_propagation_sparse(knn_num_neighbors=20, max_iter=30)