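Systematic Data Science Study: Machine Learning Algorithms # Watermelon Book Study Notes [12] Ensemble Learning in Practice

Posted by xtomh on 2022-03-09
Tags: Machine Learning, Algorithms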

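Boosting comes in many variants. Here we focus only on the most popular one, AdaBoost.

Before building the AdaBoost code, we start with a simple dataset to make sure the implementation works. The dataset is defined as follows: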

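from numpy import *  # the listings in this post use matrix, mat, ones, ... unqualified

def loadSimpData():
    # Five 2-D points and their +1/-1 class labels
    datMat = matrix([[1. , 2.1],
                     [2. , 1.1],
                     [1.3, 1. ],
                     [1. , 1. ],
                     [2. , 1. ]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels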

At the Python prompt, run the following to load the dataset:

>>> import adaboost
>>> datMat, classLabels = adaboost.loadSimpData()

We first give the code for buildStump() and its helper stumpClassify():

Listing 7-1 Decision stump generation functions

'''
Created on Sep 20, 2018
@author: yufei
Adaboost is short for Adaptive Boosting
'''

# Classify by testing whether each value is below or above the threshold being tested
def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):  # just classify the data
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray

# Loop over a weighted dataset:
# buildStump() iterates over all possible inputs to stumpClassify()
# and finds the decision stump with the lowest weighted error
def buildStump(dataArr, classLabels, D):
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    # numSteps controls how many thresholds are tried across each feature's range
    numSteps = 10.0
    # Empty dict that stores the best stump found for the given weight vector D
    bestStump = {}; bestClasEst = mat(zeros((m, 1)))
    # Start at +infinity, then track the smallest error seen so far
    minError = inf
    # First loop: over every feature in the dataset
    for i in range(n):  # loop over all dimensions
        rangeMin = dataMatrix[:, i].min(); rangeMax = dataMatrix[:, i].max()
        # Compute the step size
        stepSize = (rangeMax - rangeMin) / numSteps
        # Second loop: over the thresholds implied by that step size
        for j in range(-1, int(numSteps) + 1):  # loop over all range in current dimension
            # Third loop: toggle the inequality between less-than and greater-than
            for inequal in ['lt', 'gt']:  # go over less than and greater than
                threshVal = (rangeMin + float(j) * stepSize)
                # Call stumpClassify() to get the predicted classes
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                errArr = mat(ones((m, 1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = D.T * errArr  # calc total error multiplied by D
                # print("split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError))
                # Compare the current error with the smallest error so far
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst

To see how this works in practice, run the following at the Python prompt:

>>> from numpy import *
>>> D = mat(ones((5,1))/5)
>>> adaboost.buildStump(datMat, classLabels, D)

split: dim 0, thresh 0.90, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 0.90, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.00, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 1.00, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.10, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 1.10, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.20, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 1.20, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.30, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.30, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.40, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.40, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.50, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.50, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.60, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.60, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.70, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.70, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.80, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.80, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.90, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.90, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 2.00, thresh ineqal: lt, the weighted error is 0.600
split: dim 0, thresh 2.00, thresh ineqal: gt, the weighted error is 0.400
split: dim 1, thresh 0.89, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 0.89, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.00, thresh ineqal: lt, the weighted error is 0.200
split: dim 1, thresh 1.00, thresh ineqal: gt, the weighted error is 0.800
split: dim 1, thresh 1.11, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.11, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.22, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.22, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.33, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.33, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.44, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.44, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.55, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.55, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.66, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.66, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.77, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.77, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.88, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.88, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.99, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.99, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 2.10, thresh ineqal: lt, the weighted error is 0.600
split: dim 1, thresh 2.10, thresh ineqal: gt, the weighted error is 0.400

({'dim': 0, 'thresh': 1.3, 'ineq': 'lt'}, matrix([[0.2]]), array([[-1.],
       [ 1.],
       [-1.],
       [-1.],
       [ 1.]]))

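The split: lines come from the print statement inside buildStump(), which appears commented out in the listing. It can stay commented out; it is enabled here only to show how the function runs.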

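After comparing the current error rate with the smallest one found so far, the function stores the current decision stump in the dictionary bestStump whenever its error is lower. The dictionary, the error rate, and the class estimates are all returned to the AdaBoost algorithm.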
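To make the selection criterion concrete, here is a minimal sketch that scores a single stump by hand. It uses plain NumPy arrays rather than the matrix type from the listing, and the helper name weighted_stump_error is hypothetical; it is not part of adaboost.py.

```python
import numpy as np

# Toy dataset and labels copied from loadSimpData()
X = np.array([[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])

def weighted_stump_error(X, y, D, dim, thresh, ineq):
    """Hypothetical helper: weighted error of one decision stump."""
    pred = np.ones(len(y))
    if ineq == 'lt':
        pred[X[:, dim] <= thresh] = -1.0
    else:
        pred[X[:, dim] > thresh] = -1.0
    # Sum the weights of the misclassified samples
    return float(np.sum(D[pred != y]))

D = np.full(len(y), 1.0 / len(y))  # uniform weights, as in the first round
err_lt = weighted_stump_error(X, y, D, dim=0, thresh=1.3, ineq='lt')
err_gt = weighted_stump_error(X, y, D, dim=0, thresh=1.3, ineq='gt')
print(err_lt, err_gt)
```

With uniform weights, the two errors for the same threshold always add up to 1, because flipping the inequality flips every prediction. This is why the trace alternates between complementary values.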

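We have now built a decision stump, which gives us a weak learner. Next, we combine multiple weak classifiers to build the AdaBoost code.

The full implementation is as follows:

Listing 7-2 AdaBoost training with decision stumps

Input parameters: the dataset, the class labels, and the number of iterations (chosen by the user; defaults to 40)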

def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    weakClassArr = []
    m = shape(dataArr)[0]
    # D holds the weight of every data point, initialised to 1/m
    D = mat(ones((m, 1)) / m)  # init D to all equal
    # Running total of the weighted class estimate for every data point
    aggClassEst = mat(zeros((m, 1)))
    for i in range(numIt):
        # Build a decision stump with buildStump()
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)  # build Stump
        print("D:", D.T)
        # alpha is the weight given to this stump's output;
        # max(error, 1e-16) avoids division by zero when the stump makes no errors
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)  # store Stump Params in Array
        print("classEst: ", classEst.T)
        # Compute D for the next iteration
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)  # exponent for D calc
        D = multiply(D, exp(expon))  # Calc New D for next iteration
        D = D / D.sum()
        # calc training error of all classifiers, if this is 0 quit for loop early (use break)
        # Accumulate the weighted class estimates
        aggClassEst += alpha * classEst
        print("aggClassEst: ", aggClassEst.T)
        # sign() turns the running estimate into a binary classification
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print("total error: ", errorRate)
        # Stop early once the total training error reaches 0
        if errorRate == 0.0: break
    return weakClassArr, aggClassEst
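The max(error, 1e-16) guard in the alpha computation is easiest to appreciate in isolation. The sketch below is only an illustration with plain floats; the function name stump_alpha and its eps parameter are hypothetical, not part of adaboost.py.

```python
import math

def stump_alpha(error, eps=1e-16):
    """Hypothetical helper: alpha = 0.5 * ln((1 - error) / error), with the
    denominator clamped to eps so that error == 0 does not divide by zero."""
    return 0.5 * math.log((1.0 - error) / max(error, eps))

alpha_typical = stump_alpha(0.2)  # stump that is wrong on 20% of the weight
alpha_perfect = stump_alpha(0.0)  # perfect stump: large but finite
alpha_chance = stump_alpha(0.5)   # no better than chance: zero weight
print(alpha_typical, alpha_perfect, alpha_chance)
```

The lower the weighted error, the larger the vote a stump receives, and a stump at 50% error contributes nothing.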

At the Python prompt, run the following and inspect the output:

>>> classifierArray = adaboost.adaBoostTrainDS(datMat, classLabels, 9)
D: [[0.2 0.2 0.2 0.2 0.2]]
classEst:  [[-1.  1. -1. -1.  1.]]
aggClassEst:  [[-0.69314718  0.69314718 -0.69314718 -0.69314718  0.69314718]]
total error:  0.2
D: [[0.5   0.125 0.125 0.125 0.125]]
classEst:  [[ 1.  1. -1. -1. -1.]]
aggClassEst:  [[ 0.27980789  1.66610226 -1.66610226 -1.66610226 -0.27980789]]
total error:  0.2
D: [[0.28571429 0.07142857 0.07142857 0.07142857 0.5       ]]
classEst:  [[1. 1. 1. 1. 1.]]
aggClassEst:  [[ 1.17568763  2.56198199 -0.77022252 -0.77022252  0.61607184]]
total error:  0.0
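The jump in D between the first and second rounds of the trace can be checked by hand. The following sketch repeats the first-round update with plain NumPy arrays; the variable names are illustrative, and the label and prediction vectors are copied from the trace above.

```python
import numpy as np

y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])   # labels from loadSimpData()
h = np.array([-1.0, 1.0, -1.0, -1.0, 1.0])  # first-round classEst from the trace
D = np.full(5, 0.2)                         # initial uniform weights

error = float(np.sum(D[h != y]))            # weight of the misclassified points
alpha = 0.5 * np.log((1.0 - error) / max(error, 1e-16))

# Correct samples are scaled by exp(-alpha), mistakes by exp(+alpha)
D_new = D * np.exp(-alpha * y * h)
D_new /= D_new.sum()                        # renormalise so the weights sum to 1
print(alpha, D_new)
```

Only the first point is misclassified, so its weight grows from 0.2 to 0.5 while the other four shrink to 0.125 each, matching the second D row in the trace.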

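Finally, we look at how the trained classifier is applied to new data, which is the basis for measuring the test error rate.

Listing 7-3 AdaBoost classification function

Take the weak classifiers produced by training and apply them to specific instances.

datToClass: one or more samples to classify

classifierArr: an array made up of multiple weak classifiers

Returns the sign of aggClassEst: 1 if it is greater than 0, -1 if it is less than 0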

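def adaClassify(datToClass, classifierArr):
    dataMatrix = mat(datToClass)  # do stuff similar to last aggClassEst in adaBoostTrainDS
    m = shape(dataMatrix)[0]
    # Running weighted vote for every sample to classify
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'],
                                 classifierArr[i]['thresh'],
                                 classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha'] * classEst
        print(aggClassEst)
    return sign(aggClassEst)

adaBoostTrainDS() returns the list of weak classifiers together with the final aggClassEst, so unpack the pair and pass only the list to adaClassify():

>>> datArr, labelArr = adaboost.loadSimpData()
>>> classifierArr, aggClassEst = adaboost.adaBoostTrainDS(datArr, labelArr, 30)

Enter the following command to classify a point:

>>> adaboost.adaClassify([0,0], classifierArr)
[[-0.69314718]]
[[-1.66610226]]
[[-2.56198199]]
matrix([[-1.]])

As the iterations proceed, the classification of the point [0,0] becomes stronger and stronger. Other points can be classified as well:

>>> adaboost.adaClassify([[5,5],[0,0]], classifierArr)
[[ 0.69314718]
 [-0.69314718]]
[[ 1.66610226]
 [-1.66610226]]
[[ 2.56198199]
 [-2.56198199]]
matrix([[ 1.],
        [-1.]])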
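The weighted vote itself can also be written without the matrix type. The sketch below is one possible standalone version. The helper name weighted_vote is hypothetical, and the three stumps are reconstructed from the three rounds of the training trace printed earlier, which is an assumption of this example rather than output shown in the post.

```python
import numpy as np

# Stump parameters reconstructed from the three-round training trace
# (an assumption of this sketch): alphas are 0.5*ln(4), 0.5*ln(7), 0.5*ln(6).
stumps = [
    {'dim': 0, 'thresh': 1.3, 'ineq': 'lt', 'alpha': 0.5 * np.log(4.0)},
    {'dim': 1, 'thresh': 1.0, 'ineq': 'lt', 'alpha': 0.5 * np.log(7.0)},
    {'dim': 0, 'thresh': 0.9, 'ineq': 'lt', 'alpha': 0.5 * np.log(6.0)},
]

def weighted_vote(points, stumps):
    """Hypothetical helper: return (scores, labels) for an array of points."""
    X = np.atleast_2d(np.asarray(points, dtype=float))
    scores = np.zeros(X.shape[0])
    for s in stumps:
        pred = np.ones(X.shape[0])
        if s['ineq'] == 'lt':
            pred[X[:, s['dim']] <= s['thresh']] = -1.0
        else:
            pred[X[:, s['dim']] > s['thresh']] = -1.0
        scores += s['alpha'] * pred  # each stump votes with weight alpha
    return scores, np.sign(scores)

scores, labels = weighted_vote([[5, 5], [0, 0]], stumps)
print(scores, labels)
```

The magnitude of each score indicates how strongly the ensemble agrees, while only its sign determines the predicted class.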


Source: ITPUB Blog, link: http://blog.itpub.net/386259/viewspace-2868405/. Please credit the source when reposting; otherwise legal action may be pursued.