Statistical Learning Methods — Implementing AdaBoost

_Posted by 蓑衣客 on 2021-04-23_

AdaBoost

Applicable problem: binary classification

  • Model: additive model

\[f(x)=\sum_{m=1}^{M} \alpha_{m} G_{m}(x) \]

  • Strategy: exponential loss function

\[L(y, f(x))=\exp[-y f(x)] \]

  • Algorithm: forward stagewise algorithm (its solution yields AdaBoost's \(\alpha_m\), as sketched after this list)

\[\left(\beta_{m}, \gamma_{m}\right)=\arg \min _{\beta, \gamma} \sum_{i=1}^{N} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+\beta b\left(x_{i} ; \gamma\right)\right) \]
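
For AdaBoost, the basis function \(b(x;\gamma)\) is a base classifier \(G_m(x)\) taking values in \(\{-1,+1\}\), and its coefficient \(\beta\) is the classifier weight \(\alpha_m\). Substituting the exponential loss into the forward stagewise step gives, in compressed form (this is the standard derivation from the book's boosting chapter):

\[\alpha_{m}=\arg \min _{\alpha} \sum_{i=1}^{N} \bar{w}_{m i} \exp \left[-\alpha y_{i} G_{m}(x_{i})\right]=\frac{1}{2} \log \frac{1-e_{m}}{e_{m}}, \qquad \bar{w}_{m i}=\exp \left[-y_{i} f_{m-1}(x_{i})\right] \]

where \(e_m\) is the normalized weighted misclassification rate of \(G_m\) (step 3) of the algorithm below). This recovers both the classifier coefficient and the sample re-weighting that AdaBoost uses.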

Characteristics: AdaBoost learns one base classifier per iteration. In each round, it increases the weights of the samples misclassified by the previous classifier and decreases the weights of those classified correctly. Finally, AdaBoost takes a linear combination of the base classifiers as the strong classifier, giving large weights to base classifiers with small error rates and small weights to those with large error rates.

Algorithm steps

1) Assign a weight to each training sample \(x_1, x_2, \ldots, x_N\); the initial weight vector \(w_1\) has every entry equal to \(1/N\).

2) Train on the weighted samples to obtain a model \(G_m\) (the initial model is \(G_1\)).

3) Compute the misclassification rate of model \(G_m\): \(e_m=\sum_{i=1}^N w_i I(y_i\not= G_m(x_i))\). (The misclassification rate should be below 0.5; otherwise, flipping the predictions yields a classifier whose rate is below 0.5.)

4) Compute the coefficient of model \(G_m\): \(\alpha_m=0.5\log[(1-e_m)/e_m]\).

5) Update the weight vector from \(w_m\) to \(w_{m+1}\) using the misclassification rate \(e_m\): \(w_{m+1,i}=w_{m,i}\exp(-\alpha_m y_i G_m(x_i))/Z_m\), where \(Z_m\) is a normalization factor that makes the weights sum to 1.

6) Compute the misclassification rate of the combined model \(f(x)=\sum_{m=1}^M\alpha_m G_m(x)\).

7) When the combined model's misclassification rate falls below a threshold or the number of iterations reaches its limit, stop; otherwise, return to step 2).
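
To make steps 3) to 5) concrete, here is a one-round numpy sketch with a fixed, hypothetical decision stump; the data and the threshold 2.5 are made up purely for illustration (the full implementation follows below):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # toy feature values
y = np.array([1, 1, -1, -1, -1, 1])           # toy labels in {-1, +1}
w = np.full(len(x), 1 / len(x))               # step 1): uniform initial weights

def g(z):  # a fixed decision stump: +1 where x < 2.5, else -1
    return np.where(z < 2.5, 1, -1)

pred = g(x)
e = np.sum(w[pred != y])             # step 3): weighted misclassification rate
alpha = 0.5 * np.log((1 - e) / e)    # step 4): classifier coefficient
w = w * np.exp(-alpha * y * pred)    # step 5): up-weight misclassified samples
w /= w.sum()                         # divide by Z so the weights sum to 1 again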

Boosting trees

A boosting tree is a boosting method that uses classification trees or regression trees as base classifiers. Boosting trees are considered among the most effective methods in statistical learning.

Boosting: lifting a weakly learnable algorithm into a strongly learnable one. A boosting method repeatedly modifies the weight distribution of the training data, builds a sequence of base classifiers (weak classifiers), and linearly combines them into a strong classifier. AdaBoost is a representative boosting method.
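
For regression, the boosting tree applies the forward stagewise algorithm with squared loss, which reduces to fitting every new tree to the residuals of the current model. A minimal sketch, assuming scikit-learn's `DecisionTreeRegressor` as the base learner and synthetic data chosen only for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

trees, residual = [], y.copy()
for _ in range(20):  # M = 20 trees, an arbitrary choice
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    trees.append(tree)
    residual -= tree.predict(X)  # each new tree fits the current residuals

f = sum(tree.predict(X) for tree in trees)  # additive model: the sum of all trees
print('training MSE: {:.4f}'.format(np.mean((f - y) ** 2)))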

AdaBoost implementation from scratch

Assume each weak classifier is produced by \(x < v\) or \(x > v\), with the threshold \(v\) chosen so that the classifier has the lowest classification error rate on the training set.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline


def create_data():
    iris = load_iris()  # load the iris dataset
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    data = np.array(df.iloc[:100, [0, 1, -1]])  # first 100 samples; keep only the first two features
    for d in data:
        if d[-1] == 0:
            d[-1] = -1  # relabel class 0 as -1 so labels lie in {-1, +1}
    return data[:, :2], data[:, -1].astype(int)  # np.int was removed from NumPy; use the builtin int


class AdaBoost:
    def __init__(self, num_classifier, increment=0.5):
        """
        num_classifier: number of weak classifiers
        increment: step size used when scanning a feature for the best split point
                   (for sparse data, consider taking candidate split points from the sample values instead)
        """
        self.num_classifier = num_classifier
        self.increment = increment
        
    def fit(self, X, Y):
        self._init_args(X, Y)

        # train the weak classifiers one at a time
        for m in range(self.num_classifier):
            min_error, v_optimal, preds = float('inf'), None, None
            direct_split = None
            feature_idx = None  # column index of the selected feature
            # scan every feature and split point for the smallest classification error
            for j in range(self.num_feature):
                feature_values = self.X[:, j]  # all values of the j-th feature
                _ret = self._get_optimal_split(feature_values)
                v_split, _direct_split, error, pred_labels = _ret

                if error < min_error:
                    min_error = error
                    v_optimal = v_split
                    preds = pred_labels
                    direct_split = _direct_split
                    feature_idx = j

            # compute the classifier weight alpha
            alpha = self._cal_alpha(min_error)
            self.alphas.append(alpha)

            # record the current classifier G(x) as (feature index, threshold, direction)
            self.classifiers.append((feature_idx, v_optimal, direct_split))

            # update the weight distribution over the training samples
            self._update_weights(alpha, preds)
    
    def predict(self, x):
        res = 0.0
        for i in range(len(self.classifiers)):
            idx, v, direct = self.classifiers[i]
            # apply each weak classifier in turn
            if direct == '>':
                output = 1 if x[idx] > v else -1
            else:  # direct == '<'
                output = -1 if x[idx] > v else 1
                
            res += self.alphas[i] * output
        return 1 if res > 0 else -1  # sign(res)
    
    def score(self, X_test, Y_test):
        cnt = 0
        for i, x in enumerate(X_test):
            if self.predict(x) == Y_test[i]:
                cnt += 1
        return cnt / len(X_test)
    
    def _init_args(self, X, Y):
        self.X = X
        self.Y = Y
        self.N, self.num_feature = X.shape  # N: number of samples; num_feature: number of features

        # every sample starts with the same weight
        self.weights = [1/self.N] * self.N

        # the set of weak classifiers
        self.classifiers = []

        # the weight of each classifier G(x)
        self.alphas = []
            
    def _update_weights(self, alpha, pred_labels):
        # compute the normalization factor Z
        Z = self._cal_norm_factor(alpha, pred_labels)
        for i in range(self.N):
            self.weights[i] = (self.weights[i] *
                               np.exp(-1*alpha*self.Y[i]*pred_labels[i]) / Z)
                    
    def _cal_alpha(self, error):
        return 0.5 * np.log((1-error)/error)
                
    def _cal_norm_factor(self, alpha, pred_labels):
        return sum([self.weights[i] * np.exp(-1*alpha*self.Y[i]*pred_labels[i])
                    for i in range(self.N)])
                
    def _get_optimal_split(self, feature_values):
        error = float('inf')  # best classification error so far
        pred_labels = []  # predicted labels at the best split
        v_split_optimal = None  # optimal split point for this feature
        direct_split = None  # inequality direction at the optimal split point
        max_v = max(feature_values)
        min_v = min(feature_values)
        num_step = (max_v - min_v + self.increment)/self.increment
        for i in range(int(num_step)):
            # candidate split point
            v_split = min_v + i * self.increment
            judge_direct = '>'
            preds = [1 if feature_values[k] > v_split else -1
                     for k in range(len(feature_values))]

            # weighted error over the misclassified samples
            weight_error = sum([self.weights[k] for k in range(self.N)
                                if preds[k] != self.Y[k]])

            # error after flipping the predicted labels
            preds_inv = [-p for p in preds]
            weight_error_inv = sum([self.weights[k] for k in range(self.N)
                                    if preds_inv[k] != self.Y[k]])

            # keep the direction with the smaller error
            if weight_error_inv < weight_error:
                preds = preds_inv
                weight_error = weight_error_inv
                judge_direct = '<'

            if weight_error < error:
                error = weight_error
                pred_labels = preds
                v_split_optimal = v_split
                direct_split = judge_direct

        return v_split_optimal, direct_split, error, pred_labels
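
The learned ensemble is simply `self.classifiers`, a list of `(feature_idx, threshold, direction)` tuples, together with the per-classifier weights in `self.alphas`, so it can be inspected directly. A quick smoke test:

X, Y = create_data()
clf = AdaBoost(num_classifier=3)
clf.fit(X, Y)
# print each learned stump and its ensemble weight
for (idx, v, direct), alpha in zip(clf.classifiers, clf.alphas):
    print('feature {} {} {:.2f}, alpha = {:.3f}'.format(idx, direct, v, alpha))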

Testing the model's accuracy:

X, Y = create_data()

res = []
for i in range(10):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

    clf = AdaBoost(num_classifier=50)
    clf.fit(X_train, Y_train)
    res.append(clf.score(X_test, Y_test))
print('My AdaBoost: average accuracy over {} runs: {:.3f}'.format(len(res), sum(res)/len(res)))
My AdaBoost: average accuracy over 10 runs: 0.970

AdaBoost with the sklearn library

from sklearn.ensemble import AdaBoostClassifier

res = []
for i in range(10):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

    clf_sklearn = AdaBoostClassifier(n_estimators=50, learning_rate=0.5)
    clf_sklearn.fit(X_train, Y_train)
    res.append(clf_sklearn.score(X_test, Y_test))
print('sklearn AdaBoostClassifier: average accuracy over {} runs: {:.3f}'.format(
    len(res), sum(res)/len(res)))
sklearn AdaBoostClassifier: average accuracy over 10 runs: 0.945
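
By default `AdaBoostClassifier` uses depth-1 decision trees, so it already matches the hand-written stumps above, but the base learner can also be set explicitly. A sketch, noting that the keyword is `estimator` from scikit-learn 1.2 onward (earlier releases call it `base_estimator`):

from sklearn.tree import DecisionTreeClassifier

clf_stump = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # an explicit decision stump
    n_estimators=50, learning_rate=0.5)
clf_stump.fit(X_train, Y_train)
print(clf_stump.score(X_test, Y_test))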
