【Systematic Learning in Data Science】Machine Learning Algorithms # Watermelon Book Study Notes [12]: Ensemble Learning in Practice
There are many variants of boosting; here we focus only on the most popular one, AdaBoost.
Before writing the AdaBoost code itself, we first use a simple dataset to make sure the pieces of the implementation are in place. The dataset is loaded by the following function:
from numpy import *  # supplies matrix, ones, zeros, inf, sign, etc. used throughout adaboost.py

def loadSimpData():
    datMat = matrix([[ 1. ,  2.1],
                     [ 2. ,  1.1],
                     [ 1.3,  1. ],
                     [ 1. ,  1. ],
                     [ 2. ,  1. ]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels
At the Python prompt, run the following to load the dataset:
>>> import adaboost
>>> datMat, classLabels=adaboost.loadSimpData()
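It can help to see why this toy set calls for boosting at all: no single axis-parallel threshold separates the two classes. A minimal plotting sketch (matplotlib is assumed to be installed; this helper is not part of the original adaboost.py):

import matplotlib.pyplot as plt
from numpy import array

datArr = array(datMat)
for point, label in zip(datArr, classLabels):
    # Circles for the +1 class, squares for the -1 class
    plt.scatter(point[0], point[1], marker='o' if label == 1.0 else 's')
plt.xlabel('dim 0'); plt.ylabel('dim 1')
plt.show()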
We first give pseudocode for the function buildStump(), which searches for the decision stump with the lowest weighted error:

    Set minError to +infinity
    For every feature in the dataset:
        For every step over the feature's range:
            For each inequality ('lt' and 'gt'):
                Build a decision stump and test it against the weighted dataset
                If its error is less than minError: keep this stump as the best one
    Return the best stump
Listing 7-1 Decision stump-generating functions
'''
Created on Sep 20, 2018
@author: yufei
Adaboost is short for Adaptive Boosting
'''

# Test whether a value falls below or above the threshold we are testing
def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):#just classify the data
    retArray = ones((shape(dataMatrix)[0],1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:,dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:,dimen] > threshVal] = -1.0
    return retArray

# Loop over a weighted dataset: buildStump() tries every possible input to
# stumpClassify() and returns the decision stump with the lowest error rate
def buildStump(dataArr,classLabels,D):
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m,n = shape(dataMatrix)
    # numSteps controls how many threshold values to try across each feature's range
    numSteps = 10.0
    # Empty dict for storing the best stump found for the given weight vector D
    bestStump = {}; bestClasEst = mat(zeros((m,1)))
    # Initialize to +infinity; tracks the smallest weighted error seen so far
    minError = inf
    # Outer loop: iterate over all features of the dataset
    for i in range(n):
        rangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max()
        # Compute the step size over this feature's range
        stepSize = (rangeMax-rangeMin)/numSteps
        # Middle loop: step through the threshold values in this range
        for j in range(-1,int(numSteps)+1):
            # Inner loop: toggle the inequality between less-than and greater-than
            for inequal in ['lt', 'gt']:
                threshVal = (rangeMin + float(j) * stepSize)
                # Call stumpClassify() to get the predicted classes
                predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)
                errArr = mat(ones((m,1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = D.T*errArr  # total error weighted by D
                print("split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError))
                # Compare the current error with the smallest error so far
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump,minError,bestClasEst
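Before searching over all stumps, it is worth calling stumpClassify() by hand once. For example, classifying on feature 0 with threshold 1.3 and the 'lt' inequality should, by the rule above (-1 wherever the value is <= 1.3), give:

>>> adaboost.stumpClassify(datMat, 0, 1.3, 'lt')
array([[-1.],
       [ 1.],
       [-1.],
       [-1.],
       [ 1.]])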
To see how it actually runs, execute the following at the Python prompt:
>>> from numpy import *
>>> D = mat(ones((5,1))/5)
>>> adaboost.buildStump(datMat, classLabels, D)
split: dim 0, thresh 0.90, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 0.90, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.00, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 1.00, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.10, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 1.10, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.20, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 1.20, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.30, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.30, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.40, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.40, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.50, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.50, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.60, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.60, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.70, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.70, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.80, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.80, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 1.90, thresh ineqal: lt, the weighted error is 0.200
split: dim 0, thresh 1.90, thresh ineqal: gt, the weighted error is 0.800
split: dim 0, thresh 2.00, thresh ineqal: lt, the weighted error is 0.600
split: dim 0, thresh 2.00, thresh ineqal: gt, the weighted error is 0.400
split: dim 1, thresh 0.89, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 0.89, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.00, thresh ineqal: lt, the weighted error is 0.200
split: dim 1, thresh 1.00, thresh ineqal: gt, the weighted error is 0.800
split: dim 1, thresh 1.11, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.11, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.22, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.22, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.33, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.33, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.44, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.44, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.55, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.55, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.66, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.66, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.77, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.77, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.88, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.88, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 1.99, thresh ineqal: lt, the weighted error is 0.400
split: dim 1, thresh 1.99, thresh ineqal: gt, the weighted error is 0.600
split: dim 1, thresh 2.10, thresh ineqal: lt, the weighted error is 0.600
split: dim 1, thresh 2.10, thresh ineqal: gt, the weighted error is 0.400
({'dim': 0, 'thresh': 1.3, 'ineq': 'lt'}, matrix([[0.2]]), array([[-1.],
       [ 1.],
       [-1.],
       [-1.],
       [ 1.]]))
The split lines come from the print() call inside buildStump(); it can be commented out, and is kept here only to make the function's execution easier to follow.
After comparing the current error rate with the smallest error rate found so far, the stump is stored in the dictionary bestStump whenever the current value is smaller. The dictionary, the error rate, and the class estimates are all returned to the AdaBoost algorithm.
With the decision stump built, we now have a weak learner. Next, we combine multiple weak classifiers into the full AdaBoost code.
First, pseudocode for the whole training procedure:

    For each iteration:
        Find the best stump using buildStump()
        Append the best stump to the stump array
        Calculate alpha
        Calculate the new weight vector D
        Update the aggregate class estimate
        If the error rate equals 0.0, break out of the for loop
Listing 7-2 AdaBoost training with decision stumps
Its input arguments are the dataset, the class labels, and the number of iterations numIt, which the user must specify.
def adaBoostTrainDS(dataArr,classLabels,numIt=40):
    weakClassArr = []
    m = shape(dataArr)[0]
    # D holds the weight of every data point; initialize each to 1/m
    D = mat(ones((m,1))/m)
    # Running (aggregate) class estimate for every data point
    aggClassEst = mat(zeros((m,1)))
    for i in range(numIt):
        # Build a decision stump with buildStump()
        bestStump,error,classEst = buildStump(dataArr,classLabels,D)
        print("D:",D.T)
        # Compute alpha, the weight of this stump's vote in the final classifier;
        # max(error, 1e-16) guards against divide-by-zero when there is no error
        alpha = float(0.5*log((1.0-error)/max(error,1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)  # store the stump's parameters
        print("classEst: ",classEst.T)
        # Compute D for the next iteration
        expon = multiply(-1*alpha*mat(classLabels).T,classEst)
        D = multiply(D,exp(expon))
        D = D/D.sum()
        # Accumulate the weighted class estimates
        aggClassEst += alpha*classEst
        print("aggClassEst: ",aggClassEst.T)
        # Apply sign() to turn the aggregate estimate into a binary class,
        # then compute the training error rate over all classifiers so far
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1)))
        errorRate = aggErrors.sum()/m
        print("total error: ",errorRate)
        # If the total error rate is 0, break out of the for loop early
        if errorRate == 0.0: break
    return weakClassArr,aggClassEst
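The two updates inside this loop are the standard AdaBoost rules. With ε the weighted error of the current stump, y_i the true label of point i, and h(x_i) the stump's prediction:

    alpha = (1/2) · ln((1 − ε) / ε)
    D_i ← D_i · exp(−alpha · y_i · h(x_i)), then renormalize so that sum(D) = 1

A correctly classified point (y_i · h(x_i) = +1) has its weight multiplied by e^(−alpha), while a misclassified point (y_i · h(x_i) = −1) has its weight multiplied by e^(alpha), so the next stump concentrates on the points the current one got wrong. As a sanity check: the first stump's error of 0.2 gives alpha = 0.5 · ln(0.8/0.2) = 0.5 · ln 4 ≈ 0.693, exactly the first aggClassEst values in the trace below.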
At the Python prompt, run the training function (with the split print inside buildStump() commented out, so only the training trace shows):
>>> classifierArray = adaboost.adaBoostTrainDS(datMat, classLabels, 9)
D: [[0.2 0.2 0.2 0.2 0.2]]
classEst: [[-1. 1. -1. -1. 1.]]
aggClassEst: [[-0.69314718 0.69314718 -0.69314718 -0.69314718 0.69314718]]
total error: 0.2
D: [[0.5 0.125 0.125 0.125 0.125]]
classEst: [[ 1. 1. -1. -1. -1.]]
aggClassEst: [[ 0.27980789 1.66610226 -1.66610226 -1.66610226 -0.27980789]]
total error: 0.2
D: [[0.28571429 0.07142857 0.07142857 0.07142857 0.5 ]]
classEst: [[1. 1. 1. 1. 1.]]
aggClassEst: [[ 1.17568763 2.56198199 -0.77022252 -0.77022252 0.61607184]]
total error: 0.0
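It is worth inspecting what was stored. classifierArray is the (weakClassArr, aggClassEst) tuple returned by adaBoostTrainDS(), so its first element holds the three stump dictionaries; from the trace above, their contents should be roughly (alpha values shown to four decimals):

>>> classifierArray[0]
[{'dim': 0, 'thresh': 1.3, 'ineq': 'lt', 'alpha': 0.6931},
 {'dim': 1, 'thresh': 1.0, 'ineq': 'lt', 'alpha': 0.9730},
 {'dim': 0, 'thresh': 0.9, 'ineq': 'lt', 'alpha': 0.8959}]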
Finally, we apply the trained ensemble to new data points and observe how it classifies them.
Listing 7-3 AdaBoost classification function
This extracts the weak classifiers produced by the training procedure and applies them to a concrete instance.
datToClass: one or more instances to classify
classifierArr: the array of weak classifiers
Returns the sign of aggClassEst: +1 if it is greater than 0, -1 if it is less than 0
def adaClassify(datToClass,classifierArr):
    dataMatrix = mat(datToClass)
    m = shape(dataMatrix)[0]
    # Accumulate the weighted votes, just like aggClassEst in adaBoostTrainDS()
    aggClassEst = mat(zeros((m,1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'], classifierArr[i]['thresh'],
                                 classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha']*classEst
        print(aggClassEst)
    return sign(aggClassEst)
>>> datArr, labelArr = adaboost.loadSimpData()
>>> classifierArr, aggClassEst = adaboost.adaBoostTrainDS(datArr, labelArr, 30)
Enter the following command to classify a point:
>>> adaboost.adaClassify([0,0], classifierArr)
[[-0.69314718]]
[[-1.66610226]]
[[-2.56198199]]
matrix([[-1.]])
As the votes accumulate, the classification of the data point [0,0] becomes stronger and stronger. We can also classify several points at once:
>>> adaboost.adaClassify([[5,5],[0,0]], classifierArr)
[[ 0.69314718]
 [-0.69314718]]
[[ 1.66610226]
 [-1.66610226]]
[[ 2.56198199]
 [-2.56198199]]
matrix([[ 1.],
        [-1.]])
From the ITPUB blog: http://blog.itpub.net/386259/viewspace-2868405/. If you reproduce this article, please credit the source; otherwise legal liability may be pursued.