A Binary Decision Tree Based on Information Gain and the Gini Index

Posted by 你这过氧化氢掺水了 on 2024-11-07

Original URL: https://www.cnblogs.com/h4o3/p/18531757

# coding: UTF-8
'''
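Implementation of a binary decision tree based on information gain and the Gini index.
The tree can be used for classification problems; it partitions the samples by choosing suitable features.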
'''

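import math
from collections import Counter

class biTree_node:
    '''
    Binary tree node definition.
    Each node is either a leaf node or an internal node.
    '''

    def __init__(self, f=-1, fvalue=None, leafLabel=None, l=None, r=None, splitInfo="gini"):
        '''
        Class initializer.
        para f: int, feature used for the split, given as the feature's index within a sample
        para fvalue: float or int, decision (threshold) value of the split feature
        para leafLabel: int, label of a leaf node
        para l: biTree_node, left subtree of the current node
        para r: biTree_node, right subtree of the current node
        para splitInfo: string, splitting criterion; either 'infogain' or 'gini', for information gain and the Gini index respectively.
        Each node stores the feature it splits on and that feature's threshold value, and points to its left and right subtrees.
        A leaf node stores its label instead.
        '''
        self.f = f  # feature index, i.e. the feature's position within a sample
        self.fvalue = fvalue  # split threshold; decides whether a sample goes to the left or right subtree
        self.leafLabel = leafLabel  # class label, set only on leaf nodes
        self.l = l  # left subtree: the left child of the current node
        self.r = r  # right subtree: the right child of the current node
        self.splitInfo = splitInfo  # splitting criterion used to pick the best feature and threshold


def gini_index(samples):
    '''
    Compute the Gini index.
    para samples: list, list of samples; the last element of each sample is its label.
    return: float, the Gini index.
    '''
    label_counts = sum_of_each_label(samples)
    total = len(samples)
    gini = 1.0
    for label in label_counts:
        prob = label_counts[label] / total
        gini -= prob ** 2
    return gini

def info_entropy(samples):
    '''
    Compute the information entropy.
    para samples: list, list of samples; the last element of each sample is its label.
    return: float, the information entropy.
    '''
    label_counts = sum_of_each_label(samples)
    total = len(samples)
    entropy = 0.0
    for label in label_counts:
        prob = label_counts[label] / total
        entropy -= prob * math.log2(prob)  # base-2 logarithm; prob > 0 because every counted label occurs at least once
    return entropy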

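def split_samples(samples, feature, value):
    '''
    Split the sample set on a feature and a threshold value.
    para samples: list, list of samples.
    para feature: int, feature index.
    para value: float or int, threshold value of the feature.
    return: tuple, two lists: the left subset (feature < value) and the right subset (feature >= value).
    '''
    left = [sample for sample in samples if sample[feature] < value]
    right = [sample for sample in samples if sample[feature] >= value]
    return left, right

def sum_of_each_label(samples):
    '''
    Count how the class labels are distributed over the samples.
    para samples: list, list of samples.
    return: dict, mapping from each label to its number of occurrences.
    '''
    labels = [sample[-1] for sample in samples]
    return Counter(labels)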

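def build_biTree(samples, splitInfo="gini"):
    '''Build a binary decision tree.
    para samples: list, list of samples; each sample is itself a list whose last item is the label and whose other items are features.
    para splitInfo: string, splitting criterion; either 'infogain' or 'gini', for information gain and the Gini index respectively.
    return: biTree_node, root node of the binary decision tree.
    The function builds the tree recursively: at each step it picks the best feature and threshold to split the sample set, and stops when no split improves the criterion.
    '''
    # With no samples, return an empty node
    if len(samples) == 0:
        return biTree_node()

    # Check that the splitting criterion is valid
    if splitInfo != "gini" and splitInfo != "infogain":
        return biTree_node()

    bestInfo = 0.0  # best information gain or Gini index reduction found so far
    bestF = None  # best feature
    bestFvalue = None  # threshold value of the best feature
    bestlson = None  # samples of the left subtree
    bestrson = None  # samples of the right subtree

    # Gini index or information entropy of the current sample set
    curInfo = gini_index(samples) if splitInfo == "gini" else info_entropy(samples)

    sumOfFeatures = len(samples[0]) - 1  # number of features per sample
    for f in range(0, sumOfFeatures):  # iterate over every feature
        featureValues = [sample[f] for sample in samples]  # collect the values of this feature
        for fvalue in featureValues:  # try every observed value as a threshold
            lson, rson = split_samples(samples, f, fvalue)  # split the samples on this feature and value
            # Size-weighted Gini index or information entropy of the two subsets
            if splitInfo == "gini":
                info = (gini_index(lson) * len(lson) + gini_index(rson) * len(rson)) / len(samples)
            else:
                info = (info_entropy(lson) * len(lson) + info_entropy(rson) * len(rson)) / len(samples)

            gain = curInfo - info  # information gain or Gini index reduction

            # Keep the best feature and threshold; both subsets must be non-empty
            if gain > bestInfo and len(lson) > 0 and len(rson) > 0:
                bestInfo = gain
                bestF = f
                bestFvalue = fvalue
                bestlson = lson
                bestrson = rson

    # If a useful split was found
    if bestInfo > 0.0:
        l = build_biTree(bestlson, splitInfo)  # recursively build the left subtree
        r = build_biTree(bestrson, splitInfo)  # recursively build the right subtree
        return biTree_node(f=bestF, fvalue=bestFvalue, l=l, r=r, splitInfo=splitInfo)
    else:
        # No useful split: create a leaf node labeled with the majority class
        label_counts = sum_of_each_label(samples)
        return biTree_node(leafLabel=max(label_counts, key=label_counts.get), splitInfo=splitInfo)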


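def predict(sample, tree):
    '''
    Predict the class of a given sample.
    para sample: list, the sample to classify
    para tree: biTree_node, a classification tree that has already been built
    return: int, the predicted class of the sample
    '''
    if tree.leafLabel is not None:  # the current node is a leaf
        return tree.leafLabel
    else:
        # Otherwise pick a subtree based on the feature value
        sampleValue = sample[tree.f]
        branch = tree.r if sampleValue >= tree.fvalue else tree.l
        return predict(sample, branch)  # recurse into the chosen subtree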


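def print_tree(tree, level='0'):
    '''Print the structure of the tree in a simple form.
    para tree: biTree_node, root node of the tree
    para level: str, position of the current node in the tree: '0' is the root, '0L' its left child, '0R' its right child
    '''
    if tree.leafLabel is not None:  # leaf node
        print('*' + level + '-' + str(tree.leafLabel))  # print the label
    else:
        print('+' + level + '-' + str(tree.f) + '-' + str(tree.fvalue))  # print the feature index and threshold
        print_tree(tree.l, level + 'L')  # print the left subtree
        print_tree(tree.r, level + 'R')  # print the right subtree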


if __name__ == "__main__":

    # Sample dataset: someone's blind-date records
    blind_date = [[35, 176, 0, 20000, 0],
                  [28, 178, 1, 10000, 1],
                  [26, 172, 0, 25000, 0],
                  [29, 173, 2, 20000, 1],
                  [28, 174, 0, 15000, 1]]

    print("Information-gain binary tree:")
    tree = build_biTree(blind_date, splitInfo="infogain")  # build the tree using information gain
    print_tree(tree)  # print the tree structure
    print('Predictions of the information-gain binary tree on the test samples:')

    test_sample = [[24, 178, 2, 17000],
                   [27, 176, 0, 25000],
                   [27, 176, 0, 10000]]

    # Predict the test samples
    for x in test_sample:
        print(predict(x, tree))

    print("Gini-index binary tree:")
    tree = build_biTree(blind_date, splitInfo="gini")  # build the tree using the Gini index
    print_tree(tree)  # print the tree structure
    print('Predictions of the Gini-index binary tree on the test samples:')

    # Predict the test samples again
    for x in test_sample:
        print(predict(x, tree))  # predict and print the result
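
The two splitting criteria can be checked by hand at the root of the sample dataset. What follows is a minimal, self-contained sketch and not part of the original script: the helper names `gini`, `entropy`, and `split_gain` are illustrative assumptions, and the split it scores (feature 3 at threshold 20000) is just one candidate picked for the example.

```python
import math
from collections import Counter

def gini(labels):
    # Gini index of a list of class labels: 1 - sum(p_k^2)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Shannon entropy in bits: -sum(p_k * log2(p_k))
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def split_gain(samples, feature, value, impurity):
    # Impurity reduction of splitting on sample[feature] < value (hypothetical helper)
    parent = [s[-1] for s in samples]
    left = [s[-1] for s in samples if s[feature] < value]
    right = [s[-1] for s in samples if s[feature] >= value]
    if not left or not right:
        return 0.0  # a split that leaves one side empty is useless
    weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / len(samples)
    return impurity(parent) - weighted

# Same sample data as the script above; the last column is the label
blind_date = [[35, 176, 0, 20000, 0],
              [28, 178, 1, 10000, 1],
              [26, 172, 0, 25000, 0],
              [29, 173, 2, 20000, 1],
              [28, 174, 0, 15000, 1]]

root_labels = [s[-1] for s in blind_date]
root_gini = gini(root_labels)
root_entropy = entropy(root_labels)
gain_gini = split_gain(blind_date, 3, 20000, gini)
gain_entropy = split_gain(blind_date, 3, 20000, entropy)

print("root Gini:", root_gini, "root entropy:", root_entropy)
print("Gini reduction:", gain_gini, "information gain:", gain_entropy)
```

With two samples labeled 0 and three labeled 1, the root Gini index works out to 1 - (0.4² + 0.6²) = 0.48. Passing the impurity function as an argument lets the same split be scored under either criterion.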

