As is well known, machine learning can be broadly divided into three categories: supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning we have a large amount of labeled data to train a model, hoping the model learns the data distribution well enough to predict samples it has never seen. The source of that performance, the training data, is therefore critically important: you need enough training data to cover the sample distribution of the real world, or the learned model is meaningless. Unsupervised learning has no labeled data at all; this is the familiar clustering setting, where the data are divided into classes using only their own distribution. Semi-supervised learning, as the name suggests, sits between the two: only a small amount of the data is labeled, and we try to learn useful information from that small labeled set together with a large unlabeled set.
1. Semi-Supervised Learning
Semi-supervised learning comes into play when some of your data has labels and some does not, and typically the vast majority is unlabeled with only a handful of labeled points. A semi-supervised algorithm makes full use of the unlabeled data to capture the underlying distribution of the whole dataset. It rests on three assumptions:
- Smoothness assumption: similar data points have the same label.
- Cluster assumption: data points in the same cluster have the same label.
- Manifold assumption: data points on the same manifold structure have the same label.
For example, suppose we have only two labeled points, as in the figure below. If we train a classifier on them directly, say LR or SVM, we get the decision boundary in the left panel. If the data is actually distributed as in the right panel, then anyone can see that the classifier from the left panel is hopelessly bad: our labeled training data is far too sparse to cover the cases we may meet in the future. But if, as in the right panel, we also take the mass of unlabeled data (in black) into account and look at the global picture, a good algorithm will discover that the data actually forms two rings (lying on two circular manifolds), assign the outer ring to the red class, and assign the inner ring to the blue class. In practice, labeled data is expensive and hard to obtain, while unlabeled data is not; a simple web-crawling script can collect plenty. So being able to exploit large amounts of unlabeled data to improve model learning is extremely valuable.
There are many semi-supervised learning algorithms. Below we introduce the simplest one, label propagation; simple is my favorite.
2. The Label Propagation Algorithm
The core idea of label propagation (LP) is very simple: similar data points should have the same label. The LP algorithm has two main steps: 1) build a similarity matrix; 2) propagate boldly.
2.1 Building the Similarity Matrix
LP is graph-based, so first we need to build a graph over all the data: each node is a data point, labeled or unlabeled, and the edge between nodes i and j carries their similarity. There are many ways to build this graph; here we assume a fully connected graph, with the edge weight between nodes i and j given by

    w_ij = exp( -||x_i - x_j||² / α² )

Here, α is a hyperparameter.
Another very common construction is the kNN graph: keep only the weights to each node's k nearest neighbors and set everything else to 0 (i.e., no edge), which yields a sparse similarity matrix.
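To make the two constructions concrete, here is a minimal vectorized sketch (the function names and dense storage are mine for illustration; the actual implementation in Part 3 uses explicit loops):

import numpy as np

def rbf_graph(X, alpha):
    # pairwise squared distances via ||a - b||² = ||a||² + ||b||² - 2·a·b
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / alpha ** 2)          # fully connected weight matrix W

def knn_graph(X, k):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    W = np.zeros_like(d2)
    # keep only each node's k nearest neighbors (column 0 of argsort is the point itself)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    rows = np.arange(X.shape[0])[:, None]
    W[rows, nn] = 1.0                        # sparse in content, stored densely here
    return W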
2.2 The LP Algorithm
The label propagation algorithm itself is very simple: labels are propagated between nodes along the edges. The larger an edge weight, the more similar the two nodes and the more easily a label travels across. We define an N×N probability transition matrix P:

    P_ij = w_ij / Σ_k w_ik
P_ij is the probability of moving from node i to node j. Suppose there are C classes and L labeled samples. We define an L×C label matrix Y_L whose i-th row is the label indicator vector of sample i: if sample i belongs to class j, the j-th entry of that row is 1 and all others are 0. Likewise, we give the U unlabeled samples a U×C label matrix Y_U. Stacking them gives an N×C soft label matrix F = [Y_L; Y_U]. "Soft" means that for each sample i we keep a probability of belonging to every class, rather than exclusively assigning it to a single class with probability 1. Of course, when we finally decide the class of sample i, we take the max, i.e., the class with the largest probability. As for the Y_U inside F, it is unknown at the start; what should its initial value be? It doesn't matter — any value will do.
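In code, building P and the initial F from a weight matrix W might look like this (a sketch; `labels` is assumed to hold integer class ids for the first L rows of W):

import numpy as np

def init_propagation(W, labels, num_classes):
    P = W / W.sum(axis=1, keepdims=True)      # row-normalize W into the transition matrix P
    num_labeled = len(labels)
    F = np.zeros((W.shape[0], num_classes))   # soft label matrix F = [Y_L; Y_U]
    F[np.arange(num_labeled), labels] = 1.0   # one-hot Y_L; Y_U left at zero (any init works)
    return P, F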
At long last, the simple LP algorithm is as follows:
- 1) Propagate: F = PF.
- 2) Reset the labels of the labeled samples in F: F_L = Y_L.
- Repeat steps 1) and 2) until F converges.
Step 1) multiplies matrix P by matrix F; in this step, every node propagates its label to the other nodes with the probabilities given by P. The more similar two nodes are (the closer in Euclidean space), the more easily one node's label takes over the other's — like cliques forming. Step 2) is crucial: the labels of the labeled data are fixed in advance and must not be dragged along, so after every propagation step they are reset to their original labels. As the labeled points keep propagating their labels outward, the final class boundary ends up crossing through high-density regions and settling in the low-density gaps, as if the labeled samples of different classes had each staked out a sphere of influence. A minimal sketch of this loop follows.
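Putting propagate and clamp together (illustrative names; a fuller implementation follows in Part 3):

import numpy as np

def propagate_labels(W, labels, num_classes, max_iter=500, tol=1e-3):
    P = W / W.sum(axis=1, keepdims=True)    # transition matrix from the similarity matrix
    num_labeled = len(labels)
    Y_L = np.zeros((num_labeled, num_classes))
    Y_L[np.arange(num_labeled), labels] = 1.0
    F = np.zeros((W.shape[0], num_classes))
    F[:num_labeled] = Y_L
    for _ in range(max_iter):
        F_new = P @ F                       # step 1: propagate
        F_new[:num_labeled] = Y_L           # step 2: clamp the labeled rows
        if np.abs(F_new - F).sum() < tol:   # converged
            F = F_new
            break
        F = F_new
    return np.argmax(F, axis=1)             # final hard labels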
2.3 A Variant of the LP Algorithm
As we saw, every iteration computes the whole soft label matrix F = [Y_L; Y_U], but Y_L is already known, computing it is pointless, and in step 2) we have to restore it anyway. What we actually care about is Y_U — so can we compute only Y_U? Yes. Partition the matrix P into labeled (L) and unlabeled (U) blocks:

    P = | P_LL  P_LU |
        | P_UL  P_UU |
Then the algorithm reduces to a single update:

    F_U ← P_UU · F_U + P_UL · Y_L
Iterate this step until convergence and you're done — pretty cool, right? Note that F_U depends not only on the labels of the labeled data and their transition probabilities, but also on the current labels and transition probabilities of the unlabeled data. This is how LP additionally exploits the distribution of the unlabeled data.
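A sketch of this reduced iteration, assuming P was built with the labeled rows first and split as above:

import numpy as np

def lp_variant(P, Y_L, max_iter=500, tol=1e-3):
    L = Y_L.shape[0]
    P_UL, P_UU = P[L:, :L], P[L:, L:]
    const = P_UL @ Y_L                               # P_UL · Y_L never changes across iterations
    F_U = np.zeros((P_UU.shape[0], Y_L.shape[1]))    # arbitrary init
    for _ in range(max_iter):
        F_next = P_UU @ F_U + const
        if np.abs(F_next - F_U).sum() < tol:
            return F_next
        F_U = F_next
    return F_U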
Convergence is also very easy to prove; see reference [1] for details. In fact, the iteration converges to a closed-form fixed point:

    F_U = (I − P_UU)⁻¹ · P_UL · Y_L
So we can also solve this directly to obtain the final Y_U. In practice, however, matrix inversion costs O(n³), so when there is a great deal of unlabeled data, inverting I − P_UU becomes very expensive, and the iterative algorithm is usually the better choice.
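For moderate U, the fixed point can be computed directly; here is a sketch using a linear solve, which is cheaper and more stable than forming the inverse explicitly:

import numpy as np

def lp_closed_form(P, Y_L):
    L = Y_L.shape[0]
    P_UL, P_UU = P[L:, :L], P[L:, L:]
    I = np.eye(P_UU.shape[0])
    # solve (I - P_UU) F_U = P_UL Y_L instead of explicitly computing the inverse
    return np.linalg.solve(I - P_UU, P_UL @ Y_L)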
3. A Python Implementation of LP
I won't repeat how to set up a Python environment; see my earlier posts. The only extra dependencies are the classic numpy and matplotlib. The code includes both graph construction methods, RBF and kNN, and generates two toy datasets: two elongated bands and two concentric rings. In Part 4 we will experiment on a larger dataset; here we first do a simple visual check of the code's correctness before moving on.
Algorithm code:
#***************************************************************************
#*
#* Description: label propagation
#* Author: Zou Xiaoyi (zouxy09@qq.com)
#* Date: 2015-10-15
#* HomePage: http://blog.csdn.net/zouxy09
#*
#**************************************************************************

import numpy as np

# return the indices of the k nearest neighbors of query within dataSet
def navie_knn(dataSet, query, k):
    numSamples = dataSet.shape[0]

    ## step 1: calculate Euclidean distance
    diff = np.tile(query, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2
    squaredDist = np.sum(squaredDiff, axis=1)  # sum is performed by row

    ## step 2: sort the distance
    sortedDistIndices = np.argsort(squaredDist)
    if k > len(sortedDistIndices):
        k = len(sortedDistIndices)

    return sortedDistIndices[0:k]

# build a big graph (normalized weight matrix)
def buildGraph(MatX, kernel_type, rbf_sigma=None, knn_num_neighbors=None):
    num_samples = MatX.shape[0]
    affinity_matrix = np.zeros((num_samples, num_samples), np.float32)
    if kernel_type == 'rbf':
        if rbf_sigma is None:
            raise ValueError('You should input a sigma of rbf kernel!')
        for i in range(num_samples):
            row_sum = 0.0
            for j in range(num_samples):
                diff = MatX[i, :] - MatX[j, :]
                affinity_matrix[i][j] = np.exp(sum(diff ** 2) / (-2.0 * rbf_sigma ** 2))
                row_sum += affinity_matrix[i][j]
            affinity_matrix[i][:] /= row_sum  # row-normalize so each row sums to 1
    elif kernel_type == 'knn':
        if knn_num_neighbors is None:
            raise ValueError('You should input a k of knn kernel!')
        for i in range(num_samples):
            k_neighbors = navie_knn(MatX, MatX[i, :], knn_num_neighbors)
            affinity_matrix[i][k_neighbors] = 1.0 / knn_num_neighbors
    else:
        raise NameError('Not support kernel type! You can use knn or rbf!')

    return affinity_matrix

# label propagation
def labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type='rbf', rbf_sigma=1.5,
                     knn_num_neighbors=10, max_iter=500, tol=1e-3):
    # initialize
    num_label_samples = Mat_Label.shape[0]
    num_unlabel_samples = Mat_Unlabel.shape[0]
    num_samples = num_label_samples + num_unlabel_samples
    labels_list = np.unique(labels)
    num_classes = len(labels_list)

    MatX = np.vstack((Mat_Label, Mat_Unlabel))
    clamp_data_label = np.zeros((num_label_samples, num_classes), np.float32)
    for i in range(num_label_samples):
        clamp_data_label[i][labels[i]] = 1.0  # one-hot rows for the labeled samples

    label_function = np.zeros((num_samples, num_classes), np.float32)
    label_function[0:num_label_samples] = clamp_data_label
    label_function[num_label_samples:num_samples] = -1  # arbitrary init for the unlabeled part

    # graph construction
    affinity_matrix = buildGraph(MatX, kernel_type, rbf_sigma, knn_num_neighbors)

    # start to propagation
    iter = 0
    pre_label_function = np.zeros((num_samples, num_classes), np.float32)
    changed = np.abs(pre_label_function - label_function).sum()
    while iter < max_iter and changed > tol:
        if iter % 1 == 0:
            print("---> Iteration %d/%d, changed: %f" % (iter, max_iter, changed))
        pre_label_function = label_function
        iter += 1

        # propagation
        label_function = np.dot(affinity_matrix, label_function)

        # clamp: the labeled samples must keep their known labels
        label_function[0:num_label_samples] = clamp_data_label

        # check converge
        changed = np.abs(pre_label_function - label_function).sum()

    # get terminate label of unlabeled data
    unlabel_data_labels = np.zeros(num_unlabel_samples)
    for i in range(num_unlabel_samples):
        unlabel_data_labels[i] = np.argmax(label_function[i + num_label_samples])

    return unlabel_data_labels
Test code:
#***************************************************************************
#*
#* Description: label propagation
#* Author: Zou Xiaoyi (zouxy09@qq.com)
#* Date: 2015-10-15
#* HomePage: http://blog.csdn.net/zouxy09
#*
#**************************************************************************

import math
import numpy as np
from label_propagation import labelPropagation

# show the result: labeled samples as diamonds, unlabeled ones as circles
def show(Mat_Label, labels, Mat_Unlabel, unlabel_data_labels):
    import matplotlib.pyplot as plt

    for i in range(Mat_Label.shape[0]):
        if int(labels[i]) == 0:
            plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Dr')
        elif int(labels[i]) == 1:
            plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Db')
        else:
            plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Dy')

    for i in range(Mat_Unlabel.shape[0]):
        if int(unlabel_data_labels[i]) == 0:
            plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'or')
        elif int(unlabel_data_labels[i]) == 1:
            plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'ob')
        else:
            plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'oy')

    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.xlim(0.0, 12.)
    plt.ylim(0.0, 12.)
    plt.show()

# two concentric rings of points around a common center
def loadCircleData(num_data):
    center = np.array([5.0, 5.0])
    radiu_inner = 2
    radiu_outer = 4
    num_inner = num_data // 3
    num_outer = num_data - num_inner

    data = []
    theta = 0.0
    for i in range(num_inner):
        pho = (theta % 360) * math.pi / 180
        tmp = np.zeros(2, np.float32)
        tmp[0] = radiu_inner * math.cos(pho) + np.random.rand(1) + center[0]
        tmp[1] = radiu_inner * math.sin(pho) + np.random.rand(1) + center[1]
        data.append(tmp)
        theta += 2

    theta = 0.0
    for i in range(num_outer):
        pho = (theta % 360) * math.pi / 180
        tmp = np.zeros(2, np.float32)
        tmp[0] = radiu_outer * math.cos(pho) + np.random.rand(1) + center[0]
        tmp[1] = radiu_outer * math.sin(pho) + np.random.rand(1) + center[1]
        data.append(tmp)
        theta += 1

    Mat_Label = np.zeros((2, 2), np.float32)
    Mat_Label[0] = center + np.array([-radiu_inner + 0.5, 0])
    Mat_Label[1] = center + np.array([-radiu_outer + 0.5, 0])
    labels = [0, 1]
    Mat_Unlabel = np.vstack(data)
    return Mat_Label, labels, Mat_Unlabel

# two elongated bands, one around each labeled point
def loadBandData(num_unlabel_samples):
    #Mat_Label = np.array([[5.0, 2.], [5.0, 8.0]])
    #labels = [0, 1]
    #Mat_Unlabel = np.array([[5.1, 2.], [5.0, 8.1]])

    Mat_Label = np.array([[5.0, 2.], [5.0, 8.0]])
    labels = [0, 1]
    num_dim = Mat_Label.shape[1]
    half = num_unlabel_samples // 2
    Mat_Unlabel = np.zeros((num_unlabel_samples, num_dim), np.float32)
    Mat_Unlabel[:half, :] = (np.random.rand(half, num_dim) - 0.5) * np.array([3, 1]) + Mat_Label[0]
    Mat_Unlabel[half:num_unlabel_samples, :] = (np.random.rand(half, num_dim) - 0.5) * np.array([3, 1]) + Mat_Label[1]
    return Mat_Label, labels, Mat_Unlabel

# main function
if __name__ == "__main__":
    num_unlabel_samples = 800
    #Mat_Label, labels, Mat_Unlabel = loadBandData(num_unlabel_samples)
    Mat_Label, labels, Mat_Unlabel = loadCircleData(num_unlabel_samples)

    ## Notice: when using 'rbf' as the kernel, the choice of the hyperparameter 'sigma' is very
    ## important! It should be chosen according to your dataset, specifically the distances between
    ## data points: it should roughly ensure that each point has about 10 near neighbors with a
    ## sufficiently large w_ij. It also influences the speed of convergence. So maybe the 'knn'
    ## kernel is the better choice!
    #unlabel_data_labels = labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type='rbf', rbf_sigma=0.2)
    unlabel_data_labels = labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type='knn', knn_num_neighbors=10, max_iter=400)

    show(Mat_Label, labels, Mat_Unlabel, unlabel_data_labels)
Everything in the code that needs a comment has one; if anything is unclear, feel free to get in touch. The results at different iteration counts are shown below:
A rather pretty propagation process, isn't it? The gradual convergence is also visible numerically as the iterations proceed; the change values over the iterations are as follows:
---> Iteration 0/400, changed: 1602.000000
---> Iteration 1/400, changed: 6.300182
---> Iteration 2/400, changed: 5.129996
---> Iteration 3/400, changed: 4.301994
---> Iteration 4/400, changed: 3.819295
---> Iteration 5/400, changed: 3.501743
---> Iteration 6/400, changed: 3.277122
---> Iteration 7/400, changed: 3.105952
---> Iteration 8/400, changed: 2.967030
---> Iteration 9/400, changed: 2.848606
---> Iteration 10/400, changed: 2.743997
---> Iteration 11/400, changed: 2.649270
---> Iteration 12/400, changed: 2.562057
---> Iteration 13/400, changed: 2.480885
---> Iteration 14/400, changed: 2.404774
---> Iteration 15/400, changed: 2.333075
---> Iteration 16/400, changed: 2.265301
---> Iteration 17/400, changed: 2.201107
---> Iteration 18/400, changed: 2.140209
---> Iteration 19/400, changed: 2.082354
---> Iteration 20/400, changed: 2.027376
---> Iteration 21/400, changed: 1.975071
---> Iteration 22/400, changed: 1.925286
---> Iteration 23/400, changed: 1.877894
---> Iteration 24/400, changed: 1.832743
---> Iteration 25/400, changed: 1.789721
---> Iteration 26/400, changed: 1.748706
---> Iteration 27/400, changed: 1.709593
---> Iteration 28/400, changed: 1.672284
---> Iteration 29/400, changed: 1.636668
---> Iteration 30/400, changed: 1.602668
---> Iteration 31/400, changed: 1.570200
---> Iteration 32/400, changed: 1.539179
---> Iteration 33/400, changed: 1.509530
---> Iteration 34/400, changed: 1.481182
---> Iteration 35/400, changed: 1.454066
---> Iteration 36/400, changed: 1.428120
---> Iteration 37/400, changed: 1.403283
---> Iteration 38/400, changed: 1.379502
---> Iteration 39/400, changed: 1.356734
---> Iteration 40/400, changed: 1.334906
---> Iteration 41/400, changed: 1.313983
---> Iteration 42/400, changed: 1.293921
---> Iteration 43/400, changed: 1.274681
---> Iteration 44/400, changed: 1.256214
---> Iteration 45/400, changed: 1.238491
---> Iteration 46/400, changed: 1.221474
---> Iteration 47/400, changed: 1.205126
---> Iteration 48/400, changed: 1.189417
---> Iteration 49/400, changed: 1.174316
---> Iteration 50/400, changed: 1.159804
---> Iteration 51/400, changed: 1.145844
---> Iteration 52/400, changed: 1.132414
---> Iteration 53/400, changed: 1.119490
---> Iteration 54/400, changed: 1.107032
---> Iteration 55/400, changed: 1.095054
---> Iteration 56/400, changed: 1.083513
---> Iteration 57/400, changed: 1.072397
---> Iteration 58/400, changed: 1.061671
---> Iteration 59/400, changed: 1.051324
---> Iteration 60/400, changed: 1.041363
---> Iteration 61/400, changed: 1.031742
---> Iteration 62/400, changed: 1.022459
---> Iteration 63/400, changed: 1.013494
---> Iteration 64/400, changed: 1.004836
---> Iteration 65/400, changed: 0.996484
---> Iteration 66/400, changed: 0.988407
---> Iteration 67/400, changed: 0.980592
---> Iteration 68/400, changed: 0.973045
---> Iteration 69/400, changed: 0.965744
---> Iteration 70/400, changed: 0.958682
---> Iteration 71/400, changed: 0.951848
---> Iteration 72/400, changed: 0.945227
---> Iteration 73/400, changed: 0.938820
---> Iteration 74/400, changed: 0.932608
---> Iteration 75/400, changed: 0.926590
---> Iteration 76/400, changed: 0.920765
---> Iteration 77/400, changed: 0.915107
---> Iteration 78/400, changed: 0.909628
---> Iteration 79/400, changed: 0.904309
---> Iteration 80/400, changed: 0.899143
---> Iteration 81/400, changed: 0.894122
---> Iteration 82/400, changed: 0.889259
---> Iteration 83/400, changed: 0.884530
---> Iteration 84/400, changed: 0.879933
---> Iteration 85/400, changed: 0.875464
---> Iteration 86/400, changed: 0.871121
---> Iteration 87/400, changed: 0.866888
---> Iteration 88/400, changed: 0.862773
---> Iteration 89/400, changed: 0.858783
---> Iteration 90/400, changed: 0.854879
---> Iteration 91/400, changed: 0.851084
---> Iteration 92/400, changed: 0.847382
---> Iteration 93/400, changed: 0.843779
---> Iteration 94/400, changed: 0.840274
---> Iteration 95/400, changed: 0.836842
---> Iteration 96/400, changed: 0.833501
---> Iteration 97/400, changed: 0.830240
---> Iteration 98/400, changed: 0.827051
---> Iteration 99/400, changed: 0.823950
---> Iteration 100/400, changed: 0.820906
---> Iteration 101/400, changed: 0.817946
---> Iteration 102/400, changed: 0.815053
---> Iteration 103/400, changed: 0.812217
---> Iteration 104/400, changed: 0.809437
---> Iteration 105/400, changed: 0.806724
---> Iteration 106/400, changed: 0.804076
---> Iteration 107/400, changed: 0.801480
---> Iteration 108/400, changed: 0.798937
---> Iteration 109/400, changed: 0.796448
---> Iteration 110/400, changed: 0.794008
---> Iteration 111/400, changed: 0.791612
---> Iteration 112/400, changed: 0.789282
---> Iteration 113/400, changed: 0.786984
---> Iteration 114/400, changed: 0.784728
---> Iteration 115/400, changed: 0.782516
---> Iteration 116/400, changed: 0.780355
---> Iteration 117/400, changed: 0.778216
---> Iteration 118/400, changed: 0.776139
---> Iteration 119/400, changed: 0.774087
---> Iteration 120/400, changed: 0.772072
---> Iteration 121/400, changed: 0.770085
---> Iteration 122/400, changed: 0.768146
---> Iteration 123/400, changed: 0.766232
---> Iteration 124/400, changed: 0.764356
---> Iteration 125/400, changed: 0.762504
---> Iteration 126/400, changed: 0.760685
---> Iteration 127/400, changed: 0.758889
---> Iteration 128/400, changed: 0.757135
---> Iteration 129/400, changed: 0.755406
4. An MPI-Parallel Implementation of LP
Here we test the variant version of LP. From the formula we can see that the second term, P_UL · Y_L, never changes across iterations, so it can be computed once before iterating to avoid redundant work. Either way, though, LP must multiply a U×U matrix P_UU by a U×C matrix F_U. When there is a great deal of unlabeled data and many classes, this computation is slow and uses a lot of memory. On top of that, building the graph requires pairwise similarities, an O(n²) computation, which also becomes considerable when the feature dimension is large. So we have to consider parallelizing, ideally on a cluster. How?
Parallelizing an algorithm generally takes one of two forms: data parallelism and model parallelism.
Data parallelism is easy to understand: partition the data and let each node process only its share — for example, computing each sample's k nearest neighbors during graph construction. With 1000 samples and 20 CPU nodes, we distribute them evenly, let each CPU node compute the k nearest neighbors for 50 samples, and merge everyone's results at the end. The resulting speedup is substantial. A minimal sketch of this pattern follows.
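In mpi4py this pattern might look like the following sketch (a simplified preview of the buildSubGraph_MPI function in the full code below; run it under mpirun):

import numpy as np
import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank, comm_size = comm.Get_rank(), comm.Get_size()

def naive_knn(dataSet, query, k):
    # brute-force kNN, like the helper in Part 3
    dist = np.sum((dataSet - query) ** 2, axis=1)
    return np.argsort(dist)[:k]

def parallel_knn(X, k):
    n = X.shape[0]
    offsets = np.linspace(0, n, comm_size + 1).astype(int)   # even split of the samples
    local = [naive_knn(X, X[i], k)                           # each rank handles its own slice
             for i in range(offsets[comm_rank], offsets[comm_rank + 1])]
    gathered = comm.allgather(local)                         # merge everyone's results
    return [nbrs for part in gathered for nbrs in part]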
Model parallelism usually comes into play when the model is too large to fit in a single machine's memory. When training a huge deep neural network, for instance, the network is cut apart, gradients are computed separately, and a leader node collects everyone's gradients and feeds the updates back (with more refined and efficient engineering techniques in practice, of course). Our LP algorithm also admits model parallelism: if the number of classes C is large, we can split the classes across CPU nodes, which is effectively model parallelism.
Why split the smaller matrix F_U rather than the big matrix P_UU? Because P_UU cannot be partitioned into independent blocks, and a basic principle of parallelism is that the pieces of work must be independent. Updating F_U depends on all of U, so if P_UU were sliced across nodes, every update of F_U would require communication with the other nodes, and that communication is expensive. (In fact, communication is the bottleneck that keeps many parallel systems from reaching linear speedup — linear speedup meaning that adding n machines makes things n times faster.) Splitting F_U over the classes C has no such problem, because the per-class computations are independent; only when deciding each sample's final class do we gather all the F_U blocks back and take the max.
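A sketch of this class-wise model parallelism (illustrative names; the full code below does the same thing with sparse matrices):

import numpy as np
import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank, comm_size = comm.Get_rank(), comm.Get_size()

def lp_model_parallel(P_UU, P_UL, Y_L, num_iter=100):
    C = Y_L.shape[1]
    off = np.linspace(0, C, comm_size + 1).astype(int)    # split the C classes across ranks
    lo, hi = off[comm_rank], off[comm_rank + 1]
    const = P_UL @ Y_L[:, lo:hi]          # this rank's column block of P_UL · Y_L
    F_U = np.zeros((P_UU.shape[0], hi - lo))
    for _ in range(num_iter):
        F_U = P_UU @ F_U + const          # each rank iterates independently: no communication
    blocks = comm.allgather(F_U)          # gather only at the end, for the argmax
    return np.hstack(blocks).argmax(axis=1)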
So the code below contains the seeds of both data parallelism and model parallelism. One more thing worth mentioning: since this is an iterative algorithm, when should it stop? Besides checking convergence, we can evaluate on test labels every few iterations to see how well the model is training overall; this is especially useful for judging whether training is overfitting. The code therefore includes that as well.
All right, the code at last. Feel free to try it on some larger datasets, ideally on an MPI cluster if you have access to one.
The code below depends on numpy, scipy (whose sparse matrices speed up the computation), and mpi4py; mpi4py in turn requires openmpi and Cython, which you can install following my earlier post.
#***************************************************************************
#*
#* Description: label propagation
#* Author: Zou Xiaoyi (zouxy09@qq.com)
#* Date: 2015-10-15
#* HomePage: http://blog.csdn.net/zouxy09
#*
#**************************************************************************

import pickle
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix, eye
import mpi4py.MPI as MPI

#
# Global variables for MPI
#

# instance for invoking MPI related functions
comm = MPI.COMM_WORLD
# the node rank in the whole community
comm_rank = comm.Get_rank()
# the size of the whole community, i.e., the total number of working nodes in the MPI cluster
comm_size = comm.Get_size()

# load mnist dataset
def load_MNIST():
    import gzip
    f = gzip.open("mnist.pkl.gz", "rb")
    train, val, test = pickle.load(f, encoding='latin1')  # latin1 needed for this legacy pickle
    f.close()
    Mat_Label = train[0]
    labels = train[1]
    Mat_Unlabel = test[0]
    groundtruth = test[1]
    labels_id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    return Mat_Label, labels, labels_id, Mat_Unlabel, groundtruth

# return k neighbors index
def navie_knn(dataSet, query, k):
    numSamples = dataSet.shape[0]

    ## step 1: calculate Euclidean distance
    diff = np.tile(query, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2
    squaredDist = np.sum(squaredDiff, axis=1)  # sum is performed by row

    ## step 2: sort the distance
    sortedDistIndices = np.argsort(squaredDist)
    if k > len(sortedDistIndices):
        k = len(sortedDistIndices)

    return sortedDistIndices[0:k]

# build a big graph (normalized weight matrix)
# sparse U x (U + L) matrix
def buildSubGraph(Mat_Label, Mat_Unlabel, knn_num_neighbors):
    num_unlabel_samples = Mat_Unlabel.shape[0]
    data = []; indices = []; indptr = [0]
    Mat_all = np.vstack((Mat_Label, Mat_Unlabel))
    values = np.ones(knn_num_neighbors, np.float32) / knn_num_neighbors
    for i in range(num_unlabel_samples):
        k_neighbors = navie_knn(Mat_all, Mat_Unlabel[i, :], knn_num_neighbors)
        indptr.append(np.int32(indptr[-1]) + knn_num_neighbors)
        indices.extend(k_neighbors)
        data.append(values)
    return csr_matrix((np.hstack(data), indices, indptr))

# build a big graph (normalized weight matrix)
# sparse U x (U + L) matrix, with the rows distributed over the MPI processors
def buildSubGraph_MPI(Mat_Label, Mat_Unlabel, knn_num_neighbors):
    num_unlabel_samples = Mat_Unlabel.shape[0]
    local_data = []; local_indices = []; local_indptr = [0]
    Mat_all = np.vstack((Mat_Label, Mat_Unlabel))
    values = np.ones(knn_num_neighbors, np.float32) / knn_num_neighbors

    # data parallelism: each processor computes the k nearest neighbors of its own slice of samples
    sample_offset = np.linspace(0, num_unlabel_samples, comm_size + 1).astype('int')
    for i in range(sample_offset[comm_rank], sample_offset[comm_rank + 1]):
        k_neighbors = navie_knn(Mat_all, Mat_Unlabel[i, :], knn_num_neighbors)
        local_indptr.append(np.int32(local_indptr[-1]) + knn_num_neighbors)
        local_indices.extend(k_neighbors)
        local_data.append(values)

    # merge the local results from all processors
    data = np.hstack(comm.allgather(local_data))
    indices = np.hstack(comm.allgather(local_indices))
    indptr_tmp = comm.allgather(local_indptr)
    indptr = []
    for i in range(len(indptr_tmp)):
        if i == 0:
            indptr.extend(indptr_tmp[i])
        else:
            last_indptr = indptr[-1]
            del(indptr[-1])
            indptr.extend(indptr_tmp[i] + last_indptr)  # shift each local indptr by the running offset
    return csr_matrix((np.hstack(data), indices, indptr), dtype=np.float32)

# label propagation
def run_label_propagation_sparse(knn_num_neighbors=20, max_iter=100, tol=1e-4, test_per_iter=1):
    # load data and graph
    print("Processor %d/%d loading graph file..." % (comm_rank, comm_size))
    #Mat_Label, labels, Mat_Unlabel, groundtruth = loadFourBandData()
    Mat_Label, labels, labels_id, Mat_Unlabel, unlabel_data_id = load_MNIST()
    if comm_size > len(labels_id):
        raise ValueError("Sorry, the processors must be less than the number of classes")
    #affinity_matrix = buildSubGraph(Mat_Label, Mat_Unlabel, knn_num_neighbors)
    affinity_matrix = buildSubGraph_MPI(Mat_Label, Mat_Unlabel, knn_num_neighbors)

    # get some parameters
    num_classes = len(labels_id)
    num_label_samples = len(labels)
    num_unlabel_samples = Mat_Unlabel.shape[0]
    affinity_matrix_UL = affinity_matrix[:, 0:num_label_samples]
    affinity_matrix_UU = affinity_matrix[:, num_label_samples:num_label_samples + num_unlabel_samples]
    if comm_rank == 0:
        print("Have %d labeled images, %d unlabeled images and %d classes" % (num_label_samples, num_unlabel_samples, num_classes))

    # model parallelism: divide label_function_U and label_function_L over the processors by class
    class_offset = np.linspace(0, num_classes, comm_size + 1).astype('int')

    # initialize local label_function_U
    local_start_class = class_offset[comm_rank]
    local_num_classes = class_offset[comm_rank + 1] - local_start_class
    local_label_function_U = eye(num_unlabel_samples, local_num_classes, 0, np.float32, format='csr')

    # initialize local label_function_L
    local_label_function_L = lil_matrix((num_label_samples, local_num_classes), dtype=np.float32)
    for i in range(num_label_samples):
        class_off = int(labels[i]) - local_start_class
        if class_off >= 0 and class_off < local_num_classes:
            local_label_function_L[i, class_off] = 1.0
    local_label_function_L = local_label_function_L.tocsr()
    # the P_UL * Y_L term never changes, so compute it once before iterating
    local_label_info = affinity_matrix_UL.dot(local_label_function_L)
    print("Processor %d/%d has to process %d classes..." % (comm_rank, comm_size, local_label_function_L.shape[1]))

    # start to propagation
    iter = 1
    changed = 100.0
    evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)
    while True:
        pre_label_function = local_label_function_U.copy()

        # propagation
        local_label_function_U = affinity_matrix_UU.dot(local_label_function_U) + local_label_info

        # check converge: sum the local changes over all processors
        local_changed = abs(pre_label_function - local_label_function_U).sum()
        changed = comm.reduce(local_changed, root=0, op=MPI.SUM)
        status = 'RUN'
        test = False
        if comm_rank == 0:
            if iter % 1 == 0:
                norm_changed = changed / (num_unlabel_samples * num_classes)
                print("---> Iteration %d/%d, changed: %f" % (iter, max_iter, norm_changed))
            if iter >= max_iter or changed < tol:
                status = 'STOP'
                print("************** Iteration over! ****************")
            if iter % test_per_iter == 0:
                test = True
        iter += 1
        # the root decides whether to test or stop, and broadcasts the decision
        test = comm.bcast(test if comm_rank == 0 else None, root=0)
        status = comm.bcast(status if comm_rank == 0 else None, root=0)
        if status == 'STOP':
            break
        if test == True:
            evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)
    evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)

def evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id):
    # get local label with max score
    if comm_rank == 0:
        print("Start to combine local result...")
    local_max_score = np.zeros((num_unlabel_samples, 1), np.float32)
    local_max_label = np.zeros((num_unlabel_samples, 1), np.int32)
    for i in range(num_unlabel_samples):
        local_max_label[i, 0] = np.argmax(local_label_function_U.getrow(i).todense())
        local_max_score[i, 0] = local_label_function_U[i, local_max_label[i, 0]]
        local_max_label[i, 0] += local_start_class  # map the local class index back to the global one

    # gather the results from all the processors
    if comm_rank == 0:
        print("Start to gather results from all processors")
    all_max_label = np.hstack(comm.allgather(local_max_label))
    all_max_score = np.hstack(comm.allgather(local_max_score))

    # get terminate label of unlabeled data
    if comm_rank == 0:
        print("Start to analysis the results...")
        right_predict_count = 0
        for i in range(num_unlabel_samples):
            if i % 1000 == 0:
                print("***", all_max_score[i])
            max_idx = np.argmax(all_max_score[i])
            max_label = all_max_label[i, max_idx]
            if int(unlabel_data_id[i]) == int(labels_id[max_label]):
                right_predict_count += 1
        accuracy = float(right_predict_count) * 100.0 / num_unlabel_samples
        print("Have %d samples, accuracy: %.3f%%!" % (num_unlabel_samples, accuracy))

if __name__ == '__main__':
    run_label_propagation_sparse(knn_num_neighbors=20, max_iter=30)