"Machine Learning in Python 09_01: Decision Trees, ID3 and C4.5"

Posted by 努力的番茄 on 2020-05-26

Original URL: https://www.cnblogs.com/zhulei227/p/12968987.html

Introduction

Let's start with an example. Suppose a bank decides whether to grant a user a loan with the following rule set:

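if age == young:
    if has_job == yes:
        if credit == excellent:
            approve
        else:
            reject
    else:
        if owns_house == yes:
            if credit == fair:
                reject
            else:
                approve
        else:
            if credit == excellent or credit == good:
                approve
            else:
                if has_job == yes:
                    approve
                else:
                    reject
elif age == middle_aged:
    if owns_house == yes:
        approve
    else:
        if credit == excellent or credit == good:
            approve
        else:
            if has_job == yes:
                approve
            else:
                reject
elif age == senior:
    if owns_house == yes:
        if credit == excellent or credit == good:
            approve
        else:
            reject
    else:
        if credit == excellent or credit == good:
            if has_job == yes:
                approve
            else:
                reject
        else:
            reject
if owns_house == yes:
    approve
else:
    if has_job == yes:
        approve
    else:
        reject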

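Sharp-eyed readers will notice right away that this code has problems. For instance, every user with an excellent credit rating gets the loan, so why bury that test inside the nesting? Many rules are also redundant, so why not refactor? In practice, though, you may not dare to touch it. The project manager may add new rules any day, so people would rather let the code grow ever more redundant and complex than modify existing rules, because changing even a couple of them can cause unexpected disasters. To summarize, deeply nested if/else rules like these tend to suffer from the following pain points: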

(1) The rules may be incomplete: some cases match no rule;

(2) The rules are redundant: several if/else branches actually test the same condition;

(3) In severe cases the rules contradict each other: the same conditions lead to both approve and reject;

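(4) The order in which conditions are tested is muddled: for example, the credit rating could be checked first, because an excellent rating is enough to approve the loan without testing anything else.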

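Decision tree algorithms address these pain points. A decision tree guarantees that its rules are mutually exclusive and complete, meaning that any possible user matches exactly one rule, which resolves pain points 1 to 3. The order in which it tests conditions is also sensible. Decision tree learning is introduced next.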

Decision Tree Learning

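A decision tree algorithm learns an if/else rule set automatically from labeled data. In the figure from the original post, the left side shows a collection of cases for deciding whether to play ball, with four features (outlook, temperature, Humidity, Wind) and the label y (play or not), and the right side shows the decision tree learned from them. A decision tree consists of nodes and directed edges. Nodes are of two kinds, leaf nodes and non-leaf nodes: a non-leaf node tests one feature, each directed edge leaving it represents a condition on that feature's value, and a leaf node holds the prediction for the instance (a class for classification, a value for regression).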
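The structure described above, and the claim that a tree's rules are mutually exclusive and complete, can be checked on a toy example. The sketch below is only an illustration: the feature names, their values and the hand-written `TREE` are hypothetical, not rules learned from data. It enumerates every possible input and counts how many root-to-leaf paths each one matches.

```python
from itertools import product

# Hypothetical feature domains (illustrative only)
DOMAINS = {
    "credit": ["excellent", "good", "fair"],
    "owns_house": ["yes", "no"],
    "has_job": ["yes", "no"],
}

# A hand-written toy tree: an internal node is (feature, {value: subtree}),
# a leaf is just the decision string.
TREE = ("credit", {
    "excellent": "approve",
    "good": ("owns_house", {
        "yes": "approve",
        "no": ("has_job", {"yes": "approve", "no": "reject"}),
    }),
    "fair": ("owns_house", {"yes": "approve", "no": "reject"}),
})

def matching_paths(tree, sample, path=()):
    """Return every root-to-leaf path whose tests the sample satisfies."""
    if isinstance(tree, str):
        return [(path, tree)]
    feature, branches = tree
    found = []
    for value, subtree in branches.items():
        if sample[feature] == value:
            found += matching_paths(subtree, sample, path + ((feature, value),))
    return found

# Enumerate all combinations of feature values
samples = [dict(zip(DOMAINS, values)) for values in product(*DOMAINS.values())]
match_counts = [len(matching_paths(TREE, s)) for s in samples]
print(len(samples), set(match_counts))
```

Because each internal node branches on every value of one feature exactly once, every sample follows a single path to a single leaf, so no input is left unmatched and no input matches two rules.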

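Decision tree learning has two main stages: tree generation and tree pruning. The most important part of tree generation is feature selection. The relevant concepts are introduced below.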

1. Feature Selection

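Feature selection picks the features that are useful for classification. ID3 uses information gain as its criterion and C4.5 uses the information gain ratio. Both are introduced and implemented below.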

Information Gain

First, the mutual information between two random variables is:

\[MI(Y,X)=H(Y)-H(Y|X) \]

Here \(H(X)\) denotes the entropy of \(X\), which was introduced in the earlier section on the maximum entropy model:

\[H(X)=-\sum_{i=1}^n p_i\log p_i,\quad \text{where } p_i=P(X=x_i) \]

The conditional entropy \(H(Y|X)\) measures the uncertainty of the random variable \(Y\) given that the random variable \(X\) is known:

\[H(Y|X)=\sum_{i=1}^n p_iH(Y|X=x_i),\quad \text{where } p_i=P(X=x_i) \]

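Information gain is this mutual information with \(Y\) as the class label and \(X\) as a feature. It measures how much the entropy of \(Y\) drops when the data is split on feature \(X\). The larger the drop, the more concentrated the distribution of \(Y\) within each subset, and the more useful \(X\) is for predicting the label \(Y\). A Python implementation follows: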

"""
Entropy function, packaged in ml_models.utils
"""
import numpy as np
from collections import Counter
import math
def entropy(x,sample_weight=None):
    x=np.asarray(x)
    # number of elements in x
    x_num=len(x)
    # if sample_weight is None, give every sample the same weight
    if sample_weight is None:
        sample_weight=np.asarray([1.0]*x_num)
    x_counter={}
    weight_counter={}
    # count how often each value of x occurs and collect the matching sample weights
    for index in range(0,x_num):
        x_value=x[index]
        if x_counter.get(x_value) is None:
            x_counter[x_value]=0
            weight_counter[x_value]=[]
        x_counter[x_value]+=1
        weight_counter[x_value].append(sample_weight[index])
    
    # compute the entropy
    ent=.0
    for key,value in x_counter.items():
        p_i=1.0*value*np.mean(weight_counter.get(key))/x_num
        ent+=-p_i*math.log(p_i)
    return ent

# test
entropy([1,2])

0.6931471805599453

def cond_entropy(x, y,sample_weight=None):
    """
    Compute the conditional entropy H(y|x)
    """
    x=np.asarray(x)
    y=np.asarray(y)
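    # number of elements in x
    x_num = len(x)
    # if sample_weight is None, give every sample the same weight
    if sample_weight is None:
        sample_weight=np.asarray([1.0]*x_num)
    # accumulate the entropy of y within each value of x, weighted by that value's frequency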
    ent = .0
    for x_value in set(x):
        x_index=np.where(x==x_value)
        new_x=x[x_index]
        new_y=y[x_index]
        new_sample_weight=sample_weight[x_index]
        p_i=1.0*len(new_x)/x_num
        ent += p_i * entropy(new_y,new_sample_weight)
    return ent

# test
cond_entropy([1,2],[1,2])

0.0

def muti_info(x, y,sample_weight=None):
    """
    Mutual information / information gain: H(y)-H(y|x)
    """
    x_num=len(x)
    if sample_weight is None:
        sample_weight=np.asarray([1.0]*x_num)
    return entropy(y,sample_weight) - cond_entropy(x, y,sample_weight)
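As a quick sanity check of these definitions, the following stand-alone sketch recomputes the information gain on a tiny made-up dataset. The helper names `entropy_of` and `information_gain` are hypothetical and ignore sample weights; they are not the `ml_models.utils` functions above.

```python
import math
from collections import Counter

def entropy_of(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())

def information_gain(x, y):
    """H(y) - H(y|x) for two equally long sequences of discrete values."""
    n = len(x)
    cond = 0.0
    for v in set(x):
        subset = [yi for xi, yi in zip(x, y) if xi == v]
        cond += len(subset) / n * entropy_of(subset)
    return entropy_of(y) - cond

y = [0, 0, 1, 1]
x_useful = ["a", "a", "b", "b"]   # determines y completely
x_useless = ["a", "b", "a", "b"]  # carries no information about y

gain_useful = information_gain(x_useful, y)
gain_useless = information_gain(x_useless, y)
print(gain_useful, gain_useless)
```

A feature that determines the label recovers the full entropy of y (log 2 here), while a feature whose subsets each keep the original 50/50 label mix yields a gain of zero.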

Next, run an experiment to see how the number of distinct values of a feature affects information gain:

import random
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# run epochs trials
epochs=100
# number of distinct values of x: from 2 up to class_num_x (exclusive)
class_num_x=100
# number of classes of the label y
class_num_y=2
# number of samples
num_samples=500
info_gains=[]
for _ in range(0,epochs):
    info_gain=[]
    for class_x in range(2,class_num_x):
        x=[]
        y=[]
        for _ in range(0,num_samples):
            x.append(random.randint(1,class_x))
            y.append(random.randint(1,class_num_y))
        info_gain.append(muti_info(x,y))
    info_gains.append(info_gain)
plt.plot(np.asarray(info_gains).mean(axis=0))

[<matplotlib.lines.Line2D at 0x21ed2625ba8>]

[Figure: mean information gain over the trials, plotted against the number of distinct values of x]

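An interesting pattern emerges: the more distinct values a feature has, the more likely it is to be selected, even though \(x\) and \(y\) are generated independently in this experiment. This is easy to understand from an extreme case. If the feature \(x\) takes a different value for every instance, then \(H(Y|X)\) is 0 and \(MI(X,Y)=H(Y)-H(Y|X)\) reaches its maximum, since \(H(Y)\) does not depend on \(X\). This is a weakness of the ID3 algorithm. To correct it, the C4.5 algorithm uses the information gain ratio for feature selection.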
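The extreme case can be verified numerically. The sketch below is a minimal illustration with hypothetical helper names (`entropy_np`, `cond_entropy_np`) and randomly generated labels: a feature that is unique per row, such as a row id, drives the conditional entropy to zero, so its information gain equals the whole of H(Y).

```python
import numpy as np

def entropy_np(values):
    """Entropy (natural log) of the empirical distribution of values."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def cond_entropy_np(x, y):
    """H(y|x): entropy of y inside each group of x, weighted by group frequency."""
    x, y = np.asarray(x), np.asarray(y)
    return float(sum((x == v).mean() * entropy_np(y[x == v]) for v in np.unique(x)))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)   # random binary labels
row_id = np.arange(200)            # a different value for every instance

h_y = entropy_np(y)
h_y_given_id = cond_entropy_np(row_id, y)
gain_id = h_y - h_y_given_id
print(h_y, h_y_given_id, gain_id)
```

Each group contains a single sample, so its label distribution is degenerate and contributes no entropy, regardless of how the labels were generated.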

Information Gain Ratio

The information gain ratio is simply the information gain divided by the entropy of the feature \(X\):

\[\frac{MI(X,Y)}{H(X)} \]

def info_gain_rate(x, y,sample_weight=None):
    """
    Information gain ratio
    """
    x_num=len(x)
    if sample_weight is None:
        sample_weight=np.asarray([1.0]*x_num)
    return 1.0 * muti_info(x, y,sample_weight) / (1e-12 + entropy(x,sample_weight))
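To see what the division by H(X) buys, here is a small stand-alone comparison on made-up data. The helpers `H`, `gain` and `gain_ratio` are hypothetical re-implementations without sample weights. Both features below separate the labels perfectly, so their information gain is identical, but the per-row id pays for its sixteen distinct values in the ratio.

```python
import math
from collections import Counter

def H(values):
    """Entropy (natural log) of the empirical distribution of values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())

def gain(x, y):
    """Information gain H(y) - H(y|x)."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    return H(y) - sum(len(g) / n * H(g) for g in groups.values())

def gain_ratio(x, y, eps=1e-12):
    """Information gain divided by H(x); eps guards against division by zero."""
    return gain(x, y) / (eps + H(x))

y = [0, 1] * 8            # 16 balanced labels
x_binary = list(y)        # two values, perfectly predictive
x_id = list(range(16))    # sixteen values, one per instance

print(gain(x_binary, y), gain(x_id, y))
print(gain_ratio(x_binary, y), gain_ratio(x_id, y))
```

Here the gain ratio of the id feature is log 2 / log 16 = 0.25, while the binary feature stays close to 1, so a C4.5-style criterion would prefer the binary feature.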

Now repeat the same experiment:

# run epochs trials
epochs=100
# number of distinct values of x: from 2 up to class_num_x (exclusive)
class_num_x=100
# number of classes of the label y
class_num_y=2
# number of samples
num_samples=500
info_gain_rates=[]
for _ in range(0,epochs):
    info_gain_rate_=[]
    for class_x in range(2,class_num_x):
        x=[]
        y=[]
        for _ in range(0,num_samples):
            x.append(random.randint(1,class_x))
            y.append(random.randint(1,class_num_y))
        info_gain_rate_.append(info_gain_rate(x,y))
    info_gain_rates.append(info_gain_rate_)
plt.plot(np.asarray(info_gain_rates).mean(axis=0))

[<matplotlib.lines.Line2D at 0x21ed26da978>]

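[Figure: mean information gain ratio over the trials, plotted against the number of distinct values of x]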

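Although the overall trend is still upward, the growth is much milder than for information gain. Plotting the two together makes the difference easy to see: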

plt.plot(np.asarray(info_gains).mean(axis=0),'r')
plt.plot(np.asarray(info_gain_rates).mean(axis=0),'y')

[<matplotlib.lines.Line2D at 0x21ed267e860>]

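[Figure: mean information gain (red) and mean information gain ratio (yellow), plotted against the number of distinct values of x]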

2. Decision Tree Generation

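Generating a decision tree is a recursive application of feature selection. Starting at the root, the feature with the best information gain or information gain ratio is chosen as the node's feature, and one child node is created for each value of that feature. The same procedure is then applied to each child, and it stops when the information gain or gain ratio of every feature is very small or no features remain to choose from. The result is a decision tree. The implementation follows directly: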

import os
os.chdir('../')
from ml_models import utils
from ml_models.wrapper_models import DataBinWrapper
"""
Implementation of the ID3 and C4.5 decision tree classifiers, placed in the ml_models.tree module
"""
class DecisionTreeClassifier(object):
    class Node(object):
        """
        Tree node: stores node information and links to child nodes
        """

        def __init__(self, feature_index: int = None, target_distribute: dict = None, weight_distribute: dict = None,
                     children_nodes: dict = None, num_sample: int = None):
            """
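            :param feature_index: index of the feature this node splits on
            :param target_distribute: distribution of the target classes
            :param weight_distribute: mean sample weight per class
            :param children_nodes: child nodes
            :param num_sample: number of samples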
            """
            self.feature_index = feature_index
            self.target_distribute = target_distribute
            self.weight_distribute = weight_distribute
            self.children_nodes = children_nodes
            self.num_sample = num_sample

    def __init__(self, criterion='c4.5', max_depth=None, min_samples_split=2, min_samples_leaf=1,
                 min_impurity_decrease=0, max_bins=10):
        """
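        :param criterion: split criterion, either id3 or c4.5; defaults to c4.5
        :param max_depth: maximum depth of the tree
        :param min_samples_split: minimum number of samples a node must hold before it can be split; defaults to 2
        :param min_samples_leaf: minimum number of samples in a leaf node; defaults to 1
        :param min_impurity_decrease: a node is split only when the impurity decrease of the best split (measured by the chosen criterion) exceeds this value; defaults to 0
        :param max_bins: maximum number of bins handed to DataBinWrapper when discretizing x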
        """
        self.criterion = criterion
        if criterion == 'c4.5':
            self.criterion_func = utils.info_gain_rate
        else:
            self.criterion_func = utils.muti_info
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_impurity_decrease = min_impurity_decrease

        self.root_node: self.Node = None
        self.sample_weight = None
        self.dbw = DataBinWrapper(max_bins=max_bins)

    def _build_tree(self, current_depth, current_node: Node, x, y, sample_weight):
        """
        Recursively select features and build the tree
        :param x:
        :param y:
        :param sample_weight:
        :return:
        """
        rows, cols = x.shape
        # compute the distribution of y and the mean sample weight of each class
        target_distribute = {}
        weight_distribute = {}
        for index, tmp_value in enumerate(y):
            if tmp_value not in target_distribute:
                target_distribute[tmp_value] = 0.0
                weight_distribute[tmp_value] = []
            target_distribute[tmp_value] += 1.0
            weight_distribute[tmp_value].append(sample_weight[index])
        for key, value in target_distribute.items():
            target_distribute[key] = value / rows
            weight_distribute[key] = np.mean(weight_distribute[key])
        current_node.target_distribute = target_distribute
        current_node.weight_distribute = weight_distribute
        current_node.num_sample = rows
        # check the stopping conditions

        if len(target_distribute) <= 1:
            return

        if rows < self.min_samples_split:
            return

        if self.max_depth is not None and current_depth > self.max_depth:
            return

        # find the best feature
        best_index = None
        best_criterion_value = 0
        for index in range(0, cols):
            criterion_value = self.criterion_func(x[:, index], y)
            if criterion_value > best_criterion_value:
                best_criterion_value = criterion_value
                best_index = index

        # stop if the best criterion value is not large enough
        if best_index is None:
            return
        if best_criterion_value <= self.min_impurity_decrease:
            return
        # split
        current_node.feature_index = best_index
        children_nodes = {}
        current_node.children_nodes = children_nodes
        selected_x = x[:, best_index]
        for item in set(selected_x):
            selected_index = np.where(selected_x == item)
            # skip this value if it has too few samples to form a leaf node
            if len(selected_index[0]) < self.min_samples_leaf:
                continue
            child_node = self.Node()
            children_nodes[item] = child_node
            self._build_tree(current_depth + 1, child_node, x[selected_index], y[selected_index],
                             sample_weight[selected_index])

    def fit(self, x, y, sample_weight=None):
        # default to uniform sample weights
        n_sample = x.shape[0]
        if sample_weight is None:
            self.sample_weight = np.asarray([1.0] * n_sample)
        else:
            self.sample_weight = sample_weight
        # check sample_weight
        if len(self.sample_weight) != n_sample:
            raise Exception('sample_weight size error:', len(self.sample_weight))

        # create an empty root node
        self.root_node = self.Node()

        # bin the features in x
        self.dbw.fit(x)

        # build the tree recursively
        self._build_tree(1, self.root_node, self.dbw.transform(x), y, self.sample_weight)

    # look up the result at the leaf node reached by x
    def _search_node(self, current_node: Node, x, class_num):
        if current_node.feature_index is None or current_node.children_nodes is None or len(
                current_node.children_nodes) == 0 or current_node.children_nodes.get(
            x[current_node.feature_index]) is None:
            result = []
            total_value = 0.0
            for index in range(0, class_num):
                value = current_node.target_distribute.get(index, 0) * current_node.weight_distribute.get(index, 1.0)
                result.append(value)
                total_value += value
            # normalize
            for index in range(0, class_num):
                result[index] = result[index] / total_value
            return result
        else:
            return self._search_node(current_node.children_nodes.get(x[current_node.feature_index]), x, class_num)

    def predict_proba(self, x):
        # compute the class probability distribution for each row
        x = self.dbw.transform(x)
        rows = x.shape[0]
        results = []
        class_num = len(self.root_node.target_distribute)
        for row in range(0, rows):
            results.append(self._search_node(self.root_node, x[row], class_num))
        return np.asarray(results)

    def predict(self, x):
        return np.argmax(self.predict_proba(x), axis=1)
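The recursion at the heart of `_build_tree` can be hard to see among the bookkeeping. As a rough stand-alone sketch (not the author's implementation), the version below builds an ID3-style tree on categorical data as nested tuples. It assumes made-up toy data, ignores sample weights and binning, and, unlike the class above, removes a feature once it has been used; all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Information gain of splitting the rows on one categorical feature."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def build_tree(rows, labels, features, min_gain=1e-9):
    """Return a leaf label, or (feature, {value: subtree}) for an internal node."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no features are left
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    # Stop when even the best feature gains too little
    if info_gain(rows, labels, best) <= min_gain:
        return majority
    remaining = [f for f in features if f != best]
    children = {}
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep], remaining, min_gain)
    return (best, children)

def predict(tree, row):
    """Follow the branch matching the row's feature values down to a leaf."""
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[row[feature]]
    return tree

# Toy data (made up for illustration): play only when it is sunny and not windy
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "stay", "stay", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
print([predict(tree, r) for r in rows])
```

This sketch raises a `KeyError` for feature values that were never seen during training, whereas `_search_node` above falls back to the class distribution of the current node.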

# generate synthetic data
from sklearn.datasets import make_classification
data, target = make_classification(n_samples=100, n_features=2, n_classes=2, n_informative=1, n_redundant=0,
                                   n_repeated=0, n_clusters_per_class=1, class_sep=.5,random_state=21)

# train and inspect the result
tree = DecisionTreeClassifier(max_bins=15)
tree.fit(data, target)
utils.plot_decision_function(data, target, tree)

[Figure: decision boundary of the unpruned tree on the training data]

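As can be seen, if no constraints are placed on the decision tree, it tries to create very fine-grained rules so that every training sample is classified correctly. This clearly makes the model overfit, so the next step is to prune the tree to avoid overfitting.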

3. Decision Tree Pruning

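As the name suggests, pruning removes unnecessary leaf nodes. How do we decide which leaves to remove and which to keep? This can be quantified with a loss function: if removing some leaves lowers the loss, prune them, otherwise leave them in place. A simple loss function can be defined as follows: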

\[C_\alpha(T)=\sum_{t=1}^{\mid T\mid}N_tH_t(T)+\alpha\mid T\mid \]

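Here \(\mid T \mid\) is the number of leaf nodes of tree \(T\), \(t\) is a leaf node of \(T\) containing \(N_t\) sample points, of which \(N_{tk}\) belong to class \(k\), \(k=1,2,3,...,K\), \(H_t(T)\) is the empirical entropy at leaf \(t\), and \(\alpha\geq 0\) is a hyperparameter, where: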

\[H_t(T)=-\sum_k\frac{N_{tk}}{N_t}\log\frac{N_{tk}}{N_t} \]

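The loss has two parts. The first, \(\sum_{t=1}^{\mid T\mid}N_tH_t(T)\), is the empirical loss, and the second, \(\mid T \mid\), is the structural loss; \(\alpha\) is the coefficient that balances them. The larger \(\alpha\) is, the simpler the resulting model and the less prone it is to overfitting. The pruning code follows: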

    def _prune_node(self, current_node: Node, alpha):
        # if there are child nodes, prune within them first
        if current_node.children_nodes is not None and len(current_node.children_nodes) != 0:
            for child_node in current_node.children_nodes.values():
                self._prune_node(child_node, alpha)

        # then try to prune at the current node
        if current_node.children_nodes is not None and len(current_node.children_nodes) != 0:
            # avoid pruning across levels
            for child_node in current_node.children_nodes.values():
                # every child of the node being pruned must be a leaf
                if child_node.children_nodes is not None and len(child_node.children_nodes) > 0:
                    return
            # loss before pruning
            pre_prune_value = alpha * len(current_node.children_nodes)
            for child_node in current_node.children_nodes.values():
                for key, value in child_node.target_distribute.items():
                    pre_prune_value += -1 * child_node.num_sample * value * np.log(
                        value) * child_node.weight_distribute.get(key, 1.0)
            # loss after pruning
            after_prune_value = alpha
            for key, value in current_node.target_distribute.items():
                after_prune_value += -1 * current_node.num_sample * value * np.log(
                    value) * current_node.weight_distribute.get(key, 1.0)

            if after_prune_value <= pre_prune_value:
                # prune: turn the current node into a leaf
                current_node.children_nodes = None
                current_node.feature_index = None

    def prune(self, alpha=0.01):
        """
        Prune the decision tree using the loss C(T)+alpha*|T|
        :param alpha:
        :return:
        """
        # prune recursively
        self._prune_node(self.root_node, alpha)
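The comparison made inside `_prune_node` is easier to follow with concrete numbers. The sketch below evaluates the loss \(C_\alpha(T)\) for one hypothetical parent node before and after collapsing its two leaves. The class counts are made up and sample weights are ignored; `leaf_loss` and `cost` are illustrative helper names.

```python
import math

def leaf_loss(class_counts):
    """N_t * H_t(T) for one leaf, given its per-class sample counts."""
    n = sum(class_counts)
    return -sum(c * math.log(c / n) for c in class_counts if c > 0)

def cost(leaves, alpha):
    """C_alpha(T): sum of N_t * H_t(T) over the leaves, plus alpha * |T|."""
    return sum(leaf_loss(counts) for counts in leaves) + alpha * len(leaves)

# Hypothetical subtree: a parent with class counts [6, 4] split into two leaves
children = [[5, 1], [1, 3]]
parent = [[6, 4]]

for alpha in (0.5, 2.0):
    before = cost(children, alpha)
    after = cost(parent, alpha)
    decision = "prune" if after <= before else "keep"
    print(alpha, round(before, 3), round(after, 3), decision)
```

With a small alpha the entropy reduction from the split outweighs the cost of the extra leaf, so the split is kept; once alpha exceeds that reduction (about 1.78 for these counts), collapsing the subtree gives the lower loss.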

from ml_models.tree import DecisionTreeClassifier
# train and inspect the result
tree = DecisionTreeClassifier(max_bins=15)
tree.fit(data, target)
tree.prune(alpha=1.5)
utils.plot_decision_function(data, target, tree)

[Figure: decision boundary after pruning with alpha=1.5]

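By exploring different values of \(\alpha\) we can obtain a reasonably satisfying pruned tree. This style of pruning is usually called post-pruning, because it starts from a fully grown tree. Its counterpart is pre-pruning, which prunes during training and usually needs a separate validation set for support, so it is not implemented here. A more common practice is to control model complexity through parameters: max_depth limits the maximum depth of the tree, min_samples_leaf sets the minimum number of samples in a leaf, min_impurity_decrease sets the minimum impurity decrease a split must achieve, and min_samples_split sets the minimum number of samples a node needs before it can be split. Tuning these parameters has a similar effect to pruning. For example, setting the minimum number of samples per leaf, as below, achieves the same effect as the pruning above: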

tree = DecisionTreeClassifier(max_bins=15,min_samples_leaf=3)
tree.fit(data, target)
utils.plot_decision_function(data, target, tree)

[Figure: decision boundary with min_samples_leaf=3]
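The same parameter names appear in scikit-learn, so the effect can also be observed there. This is only a loosely related sketch: scikit-learn's tree is a CART implementation with binary splits on continuous features, not the ID3/C4.5 class above, and the alias `SkTree` and the helper `smallest_leaf` are introduced here purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier as SkTree

# Same synthetic data settings as in the article
data, target = make_classification(n_samples=100, n_features=2, n_classes=2, n_informative=1,
                                   n_redundant=0, n_repeated=0, n_clusters_per_class=1,
                                   class_sep=.5, random_state=21)

# One unconstrained tree and one that requires at least 3 samples per leaf
full = SkTree(criterion="entropy", random_state=0).fit(data, target)
limited = SkTree(criterion="entropy", min_samples_leaf=3, random_state=0).fit(data, target)

def smallest_leaf(model):
    """Number of training samples in the smallest leaf of a fitted tree."""
    t = model.tree_
    is_leaf = t.children_left == -1
    return int(t.n_node_samples[is_leaf].min())

print(full.get_n_leaves(), smallest_leaf(full))
print(limited.get_n_leaves(), smallest_leaf(limited))
```

Requiring at least three samples per leaf caps the number of leaves at a third of the training set, which is one way such parameters limit how finely the tree can carve up the data.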

Another View of Decision Trees: Conditional Probability Distributions

A decision tree can also be viewed as the conditional probability distribution of the class given the features:

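(1) During training, the decision tree partitions the feature space into disjoint regions of various sizes, and each region corresponds to a probability distribution over the classes;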

(2) During prediction, the label of a sample point that falls in a region is the class with the highest probability in that region.
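These two points can be sketched in a few lines. The example below assumes a hypothetical partition of a one-dimensional feature space into three regions with made-up class counts; it is not derived from the tree trained above.

```python
import numpy as np

# Hypothetical partition of a 1-D feature space into three regions,
# each storing the class counts of the training points that fell into it
region_edges = np.array([0.3, 0.7])      # boundaries between regions
region_counts = np.array([[8, 2],        # region 0: x < 0.3
                          [3, 3],        # region 1: 0.3 <= x < 0.7
                          [1, 9]])       # region 2: x >= 0.7

def predict_proba(x):
    """P(class | x): the normalized class counts of the region containing x."""
    regions = np.digitize(x, region_edges)
    counts = region_counts[regions]
    return counts / counts.sum(axis=1, keepdims=True)

def predict(x):
    """Class with the highest probability in the region containing x."""
    return np.argmax(predict_proba(x), axis=1)

x_new = np.array([0.1, 0.5, 0.9])
print(predict_proba(x_new))
print(predict(x_new))
```

Note that `np.argmax` breaks ties in favor of the lower class index, so the evenly split middle region is assigned to class 0.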
