Information Gain, ID3, and C4.5 in Decision Trees: Implementation and Summary

Posted by ihades on 2020-02-21

Original URL: https://juejin.im/post/5e4f66a3518825493f6ce023

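At its core, a decision tree looks for the conditional probability distribution of the class (labels or category) given the features. Building a decision tree usually involves three steps:

Feature selection
Decision tree generation
Decision tree pruning

Using the classification decision tree as the model, this article summarizes the ideas behind decision trees and implements feature selection along with two tree-generation algorithms, ID3 and C4.5.

The dataset is the loan application sample dataset from Li Hang's Statistical Learning Methods (2nd edition).

The Age attribute takes the values {0, 1, 2}, meaning {youth, middle-aged, elderly}; the Working (has a job) and Housing (owns a house) attributes take the values {0, 1}, meaning {no, yes}; the Credit (credit rating) attribute takes the values {0, 1, 2}, meaning {fair, good, very good}; and the Category attribute records whether the loan is approved, with values {0, 1} meaning {no, yes}.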
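The code in this article refers to a DataFrame named dataset that is never constructed in the text. The following is a minimal sketch of one way to build it. It assumes the 15 rows of Table 5.1 in the book, the integer encoding described above, and the column names that appear in the article's code (Age, Working, Housing, Credit, Category); the layout of the author's original file may differ.

```python
import pandas as pd

# Rows reconstructed from Table 5.1 (loan application samples) of the book,
# using the integer encoding described above. This is an assumption about
# what the author's downloadable file contains, not a copy of that file.
rows = [
    # Age, Working, Housing, Credit, Category
    (0, 0, 0, 0, 0),
    (0, 0, 0, 1, 0),
    (0, 1, 0, 1, 1),
    (0, 1, 1, 0, 1),
    (0, 0, 0, 0, 0),
    (1, 0, 0, 0, 0),
    (1, 0, 0, 1, 0),
    (1, 1, 1, 1, 1),
    (1, 0, 1, 2, 1),
    (1, 0, 1, 2, 1),
    (2, 0, 1, 2, 1),
    (2, 0, 1, 1, 1),
    (2, 1, 0, 1, 1),
    (2, 1, 0, 2, 1),
    (2, 0, 0, 0, 0),
]
dataset = pd.DataFrame(rows, columns=["Age", "Working", "Housing", "Credit", "Category"])

print(dataset.shape)                    # (15, 5)
print(int(dataset["Category"].sum()))   # 9 approved applications
```

With this frame, 9 of the 15 applications are approved and 6 are rejected, which are the counts the entropy calculations in the rest of the article rely on.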

Dataset download link

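Feature selection

Select the feature that best partitions the dataset $D$.

Information gain

Understanding information gain first requires understanding the abstract concept of entropy.

Entropy is a measure of information. In information theory and in probability and statistics, entropy expresses how uncertain a random variable is: the greater the uncertainty, the larger the entropy, and the more information is needed to understand (evaluate) the variable.

This basic concept is what guides the idea of information gain.

To make that statement precise, we need the formal definitions of entropy and conditional entropy. (When the probabilities in entropy and conditional entropy are estimated from data, for example by maximum likelihood estimation, the quantities are called empirical entropy and empirical conditional entropy, respectively.)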

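Let $X$ be a discrete random variable with probability distribution:

$P(X = x_i) = p_i, \qquad i = 1, 2, ..., n$

Then the entropy (or empirical entropy) is defined as:

$H(p) = -\sum^{n}_{i = 1} p_i \log_2 p_i$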
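To make the definition concrete, here is a small sketch that evaluates the entropy of a discrete distribution in bits. The helper name entropy is hypothetical and not part of the article's code; it uses the equivalent form $\sum_i p_i \log_2 (1/p_i)$ and drops zero probabilities, following the usual convention that $0 \log 0 = 0$.

```python
import numpy as np

def entropy(probs) -> float:
    """Shannon entropy in bits of a discrete distribution (hypothetical helper)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # convention: 0 * log 0 = 0, so zero entries contribute nothing
    # sum p * log2(1/p) equals -sum p * log2(p)
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0.5, 0.5]))            # 1.0   (a fair coin is maximally uncertain)
print(round(entropy([0.9, 0.1]), 3))  # 0.469 (a biased coin is more predictable)
print(entropy([1.0, 0.0]))            # 0.0   (a certain outcome carries no uncertainty)
```

The more evenly the probability is spread, the larger the entropy, which is the sense in which entropy measures uncertainty.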

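Now let $(X, Y)$ be a pair of random variables with joint probability distribution:

$P(X = x_i, Y = y_j) = p_{ij}, \qquad i = 1, 2, ... ,n; \quad j = 1, 2, ..., m$

The conditional entropy $H(Y|X)$ expresses the uncertainty of $Y$ given that $X$ is known:

$H(Y|X)=\sum^{n}_{i=1}p_iH(Y|X=x_i)\qquad p_i=P(X=x_i),i=1,2,...,n$

In other words, it measures how much uncertainty about $Y$ remains once we already know $X$; the smaller it is, the more $X$ helps us understand $Y$.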
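The conditional entropy formula can be checked numerically on small joint tables. The sketch below assumes the joint distribution is given as a 2D array whose entry [i, j] is $P(X = x_i, Y = y_j)$; the function name conditional_entropy is hypothetical.

```python
import numpy as np

def conditional_entropy(joint) -> float:
    """H(Y|X) in bits from a joint table with joint[i, j] = P(X = x_i, Y = y_j)."""
    joint = np.asarray(joint, dtype=float)
    h = 0.0
    for row in joint:              # one row per value x_i
        p_i = row.sum()            # marginal probability P(X = x_i)
        if p_i == 0:
            continue               # a value of X that never occurs contributes nothing
        cond = row[row > 0] / p_i  # P(Y = y_j | X = x_i), zero entries dropped
        h += p_i * float(np.sum(cond * np.log2(1.0 / cond)))
    return h

# X determines Y completely: no uncertainty about Y is left.
print(conditional_entropy([[0.5, 0.0], [0.0, 0.5]]))      # 0.0
# X and Y are independent fair coins: knowing X does not help, so H(Y|X) = H(Y) = 1 bit.
print(conditional_entropy([[0.25, 0.25], [0.25, 0.25]]))  # 1.0
```

The two cases mark the extremes: a feature that fully explains the label drives the conditional entropy to zero, while an irrelevant feature leaves it equal to the entropy of the label itself.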

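Information gain: the information gain $g(D, A)$ of a feature $A$ with respect to the training dataset $D$ is defined as the difference between the empirical entropy $H(D)$ of $D$ and the empirical conditional entropy $H(D|A)$ of $D$ given $A$, that is:

$g(D, A) = H(D) - H(D|A)$

During feature selection, we choose the feature $A$ with the largest information gain.

For a given set $D$, $H(D)$ is fixed. If a feature $A$ maximizes $g(D, A)$, its empirical conditional entropy $H(D|A)$ is the smallest among all features; that is, $A$ reduces the uncertainty of $D$ the most, which means $A$ separates $D$ well. This is the core of the information gain criterion.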
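As a worked check, the definition can be applied by hand to the Age feature. The counts used here are taken from the book's loan table and should be read as an assumption about the dataset: 15 samples with 9 approved and 6 rejected, and three age groups $D_1, D_2, D_3$ of 5 samples each containing 2, 3, and 4 approvals.

$H(D) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} \approx 0.971$

$H(D|A) = \frac{5}{15}H(D_1) + \frac{5}{15}H(D_2) + \frac{5}{15}H(D_3) \approx \frac{1}{3}(0.971 + 0.971 + 0.722) \approx 0.888$

$g(D, A) = H(D) - H(D|A) \approx 0.971 - 0.888 = 0.083$

Here $H(D_1) = H(D_2) \approx 0.971$ because a 2-to-3 and a 3-to-2 split are equally uncertain, and $H(D_3) \approx 0.722$ for the 4-to-1 split.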

Python implementation

import numpy as np
import pandas as pd

def information_gain(X: pd.DataFrame, feature, y) -> float:
    """
    param X: the dataset.
    param feature: name of the feature (a column of X) to evaluate.
    param y: categories or labels (a pandas Series aligned with X).
    1. compute the empirical entropy H(D) of the dataset D
    2. compute the empirical conditional entropy H(D|A) of D given feature A
    3. compute the information gain g(D, A) = H(D) - H(D|A)
    """
    C = set(y)            # distinct class labels
    m = int(y.count())    # number of samples |D|
    # Step 1: H(D) = -sum_k |C_k|/|D| * log2(|C_k|/|D|)
    HD = 0
    for c in C:
        ck = int(y[y == c].count())
        HD += ck / m * np.log2(ck / m)
    HD = -HD

    # Step 2: H(D|A) = sum_i |D_i|/|D| * H(D_i)
    A = X[feature]
    DA = set(A)           # distinct values of the feature
    HDA = 0
    for da in DA:
        di = int(A[A == da].count())
        temp_hd = 0
        for c in C:
            dik = float(A[(A == da) & (y == c)].count())
            if dik == 0:
                continue  # treat 0 * log 0 as 0
            temp_hd += dik / di * np.log2(dik / di)
        temp_hd = -temp_hd
        HDA += di / m * temp_hd
    # Step 3: g(D, A) = H(D) - H(D|A)
    return HD - HDA

Testing the information gain of the Age feature:

# dataset is the loan application DataFrame described above
X = dataset[["Age", "Working", "Housing", "Credit"]]
y = dataset.Category
information_gain(X, "Age", y)       #  0.083

Information gain ratio

Information gain is biased toward features that take many distinct values; the information gain ratio corrects for this bias.

$g_R(D, A) = \frac{g(D, A)}{H_A(D)}$

where

$H_A(D) = -\sum^{n}_{i=1} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$

Here $n$ is the number of distinct values of feature $A$, and $D_i$ is the subset of $D$ on which $A$ takes its $i$-th value.

Python 3 code

def information_gain_rate(X: pd.DataFrame, feature, y) -> float:
    """
    param X: dataset.
    param feature: name of the feature (a column of X) to evaluate.
    param y: labels.
    1. compute g(D, A) -> information gain
    2. compute HA(D), the entropy of D with respect to the values of A
    3. return g(D, A) / HA(D)
    """
    g = information_gain(X, feature, y)
    m = int(y.count())
    A = X[feature]
    N = set(A)
    HAD = 0
    for i in N:
        di = int(A[A == i].count())
        HAD += di / m * np.log2(di / m)
    HAD = -HAD
    if HAD == 0:
        # The feature has a single value in D, so it cannot split D at all.
        return 0.0
    return g / HAD

Testing the information gain ratio of the Working feature:

information_gain_rate(X, "Working", y)     # 0.35
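The bias that the gain ratio corrects can be seen on a toy example. The sketch below is independent of the article's functions and uses hypothetical helpers (entropy_of, gain, gain_ratio) and made-up data: a row identifier that is unique per sample, and a binary feature that is informative but imperfect.

```python
import numpy as np
import pandas as pd

def entropy_of(series: pd.Series) -> float:
    """Empirical entropy in bits of the values in a Series."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(np.sum(p * np.log2(1.0 / p)))

def gain(feature: pd.Series, y: pd.Series) -> float:
    """Information gain g(D, A) = H(D) - H(D|A)."""
    cond = sum(len(part) / len(y) * entropy_of(part) for _, part in y.groupby(feature))
    return entropy_of(y) - cond

def gain_ratio(feature: pd.Series, y: pd.Series) -> float:
    """Gain ratio g(D, A) / H_A(D); assumes the feature has at least two values."""
    return gain(feature, y) / entropy_of(feature)

# Toy data (made up for illustration).
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
row_id = pd.Series(range(8))                  # unique value per sample
binary = pd.Series([0, 0, 0, 1, 1, 1, 1, 1])  # informative but imperfect

print(round(gain(row_id, y), 3), round(gain(binary, y), 3))              # 1.0 0.549
print(round(gain_ratio(row_id, y), 3), round(gain_ratio(binary, y), 3))  # 0.333 0.575
```

Information gain prefers the row identifier, because splitting on it leaves one sample per subset, even though such a feature cannot generalize. Dividing by $H_A(D)$, which is large for a feature with many values, reverses the ranking.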

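Optimal feature selection

The idea is simple: iterate over all features and pick the one with the largest information gain or information gain ratio.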

def select_optima_feature(alg, X: pd.DataFrame, features: list, y) -> tuple:
    """
    Return the best feature and its score under the criterion alg
    (for example information_gain or information_gain_rate).
    """
    # Validate the criterion once, before the loop.
    if not callable(alg):
        raise TypeError("alg must be a callable function.")
    opt_col = ""
    opt_ig = -1
    for c in features:
        ig = alg(X, c, y)
        if ig > opt_ig:
            opt_col = c
            opt_ig = ig
    return opt_col, opt_ig

Testing optimal feature selection

# Using information gain
select_optima_feature(information_gain, X, ["Age", "Housing", "Working", "Credit"], y)    # ('Housing', 0.4199730940219749)

# Using the information gain ratio
select_optima_feature(information_gain_rate, X, ["Age", "Housing", "Working", "Credit"], y)    # ('Housing', 0.4325380677663126)

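Implementing ID3 and C4.5

The core of ID3 is to apply the information gain criterion at every node to select a feature, building the tree recursively. In the simplified form implemented here, C4.5 differs from ID3 only in replacing information gain with the information gain ratio.

Before building the tree, we also need a method that splits the data into subsets according to the values of a feature.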

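def spilt_subset(X: pd.DataFrame, feature, feature_labels, y) -> dict:
    """
    Split X and y by the value of feature.
    Return {value: {'X': rows of X with that value, 'y': the matching labels}}.
    """
    result = {}
    for i in feature_labels:
        s = (X[feature] == i)    # boolean mask of the rows where feature == i
        result[i] = {'X': X[s], 'y': y[s]}
    return result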
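The same partition can also be produced with pandas' groupby, which discovers the feature values itself. This is an alternative sketch, not the article's method; the function name split_by_groupby and the demo data are hypothetical.

```python
import pandas as pd

def split_by_groupby(X: pd.DataFrame, feature: str, y: pd.Series) -> dict:
    """Partition X and y by the values of one feature using DataFrame.groupby."""
    result = {}
    for value, part in X.groupby(feature):
        # part keeps the original row index, so it can be used to pick the labels
        result[value] = {"X": part, "y": y.loc[part.index]}
    return result

# Made-up demo data with the same column names as the article.
X_demo = pd.DataFrame({"Housing": [0, 1, 0, 1], "Working": [0, 0, 1, 1]})
y_demo = pd.Series([0, 1, 1, 1])

parts = split_by_groupby(X_demo, "Housing", y_demo)
print(parts[0]["y"].tolist())        # [0, 1]
print(parts[1]["X"].index.tolist())  # [1, 3]
```

Passing the values explicitly, as the article's function does, keeps control over which branches are created; groupby only creates branches for values that occur in the current subset.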

Building the decision tree

Implementation

def build_decision_tree(fsalg, X: pd.DataFrame, features: list, threshold: float, y):
    """
    Create a decision tree.
    param fsalg: feature selection algorithm (information_gain or information_gain_rate).
    param X: dataset.
    param features: list of feature names in D that may still be used for splitting.
    param threshold: minimum score of the best feature; below it the node becomes a leaf.
    param y: categories or labels aligned with X.
    Return a nested dict representing the decision tree, or a class label for a leaf.
    """
    # 1. Stop if no features are left or all instances in D belong to the same category.
    if (not features) or (len(set(y)) == 1):
        return int(y.mode()[0])
    # 2. Find the feature with the largest information gain (or gain ratio).
    A, ig_value = select_optima_feature(fsalg, X, features, y)
    # 3. Stop if that score is below the threshold; the leaf takes the majority class.
    if ig_value < threshold:
        return int(y.mode()[0])
    DT = {}
    values = list(set(X[A]))
    DT.setdefault(A, {})
    childset = spilt_subset(X, A, values, y)
    # Build a new list instead of calling features.remove(A), so that neither the
    # caller's list nor the feature set seen by sibling branches is mutated.
    remaining = [f for f in features if f != A]
    for v in values:
        DT[A][v] = build_decision_tree(fsalg, childset[v]['X'], remaining, threshold, childset[v]['y'])
    return DT

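Generating the tree involves a threshold: the smallest information gain (or gain ratio) the algorithm will accept before it stops splitting. Both criteria favor the feature that minimizes the conditional entropy $H(D|A)$, the uncertainty of $D$ that remains once $A$ is known. When $H(D|A)$ approaches $H(D)$, the gain approaches zero and feature $A$ tells us essentially nothing about $D$. The threshold sets the lower bound for this situation: if even the best feature scores below it, the node becomes a leaf labeled with the majority class.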

Test:

model_id3 = build_decision_tree(information_gain, X, ["Age", "Housing", "Working", "Credit"], 0.2, y)    # {'Housing': {0: {'Working': {0: 0, 1: 1}}, 1: 1}}

model_c45 = build_decision_tree(information_gain_rate, X, ["Age", "Housing", "Working", "Credit"], 0.2, y)    # {'Housing': {0: {'Working': {0: 0, 1: 1}}, 1: 1}}

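The tree reads as follows: if the applicant owns a house (Housing = 1), the loan is approved; otherwise the decision depends on Working, and the loan is approved only if the applicant has a job.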
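A nested-dict tree of this shape can be used to classify a new sample by walking from the root to a leaf. The sketch below is an assumption about how one might do this; the function name predict is hypothetical and not part of the article's code, and the tree literal is copied from the test output above.

```python
def predict(tree, sample: dict):
    """Walk a nested-dict decision tree until a leaf (a class label) is reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))              # an internal node has one key: the feature name
        node = node[feature][sample[feature]]   # follow the branch for this sample's value
    return node

# Tree copied from the output of build_decision_tree shown above.
model = {'Housing': {0: {'Working': {0: 0, 1: 1}}, 1: 1}}

print(predict(model, {"Age": 0, "Working": 1, "Housing": 0, "Credit": 1}))  # 1
print(predict(model, {"Age": 2, "Working": 0, "Housing": 0, "Credit": 0}))  # 0
print(predict(model, {"Age": 1, "Working": 0, "Housing": 1, "Credit": 2}))  # 1
```

A feature value that never appeared during training raises a KeyError in this sketch; a fuller version would need a fallback such as the majority class at that node.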