XGBoost Feature Selection

Posted by WenwuTao on 2016-11-27

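XGBoost is a heavy hitter in data mining competitions and often outperforms other machine learning algorithms. Data preprocessing, feature engineering, and parameter tuning all have a major effect on how well it performs. This post shows how to use XGBoost itself for feature selection, so that a more effective set of features can be fed into the XGBoost model.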

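The dataset comes from the Kaggle Allstate Claims Severity competition. The training set, shown below, has 116 categorical features (cat1-cat116) and 14 continuous features (cont1-cont14). The categorical features are stored as strings, so they must first be converted to numbers: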

   id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9   ...        cont6  \
0   1    A    B    A    B    A    A    A    A    B   ...     0.718367   
1   2    A    B    A    A    A    A    A    A    B   ...     0.438917   
2   5    A    B    A    A    B    A    A    A    B   ...     0.289648   
3  10    B    B    A    B    A    A    A    A    B   ...     0.440945   
4  11    A    B    A    B    A    A    A    A    B   ...     0.178193   

      cont7    cont8    cont9   cont10    cont11    cont12    cont13  \
0  0.335060  0.30260  0.67135  0.83510  0.569745  0.594646  0.822493   
1  0.436585  0.60087  0.35127  0.43919  0.338312  0.366307  0.611431   
2  0.315545  0.27320  0.26076  0.32446  0.381398  0.373424  0.195709   
3  0.391128  0.31796  0.32128  0.44467  0.327915  0.321570  0.605077   
4  0.247408  0.24564  0.22089  0.21230  0.204687  0.202213  0.246011
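
The numeric encoding of the string columns can be sketched on a small toy frame. This is only an illustration: the three-column `toy` DataFrame and its values are made up, and only the `pd.factorize(..., sort=True)[0] + 1` pattern comes from the code in this post.

```python
import pandas as pd

# Hypothetical toy data that imitates the cat*/cont* column layout.
toy = pd.DataFrame({
    "cat1": ["A", "B", "A", "C"],
    "cat2": ["B", "B", "A", "A"],
    "cont1": [0.72, 0.44, 0.29, 0.18],
})

# Encode every column whose name starts with "cat" as integers.
cat_cols = [c for c in toy.columns if c.startswith("cat")]
for col in cat_cols:
    # sort=True assigns codes in sorted order of the labels (A=0, B=1, C=2);
    # adding 1 shifts the codes so that they start at 1.
    codes, uniques = pd.factorize(toy[col], sort=True)
    toy[col] = codes + 1

print(toy)
```

`pd.factorize` marks missing values with the code -1, so after the shift by 1 a missing category would end up as 0, separate from all real categories.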

The feature selection code using XGBoost is as follows:

import operator
import os

import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb


def create_feature_map(features):
    # Write an XGBoost feature map file: one "<index>\t<name>\t<type>" line
    # per feature, where the type 'q' marks a quantitative feature.
    with open('xgb.fmap', 'w') as outfile:
        for i, feat in enumerate(features):
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))


if __name__ == '__main__':
    train = pd.read_csv("../input/train.csv")

    # Convert the categorical (string) features to integers starting at 1.
    cat_sel = [n for n in train.columns if n.startswith('cat')]
    for column in cat_sel:
        train[column] = pd.factorize(train[column].values, sort=True)[0] + 1

    params = {
        'min_child_weight': 100,
        'eta': 0.02,
        'colsample_bytree': 0.7,
        'max_depth': 12,
        'subsample': 0.7,
        'alpha': 1,
        'gamma': 1,
        'verbosity': 0,  # suppress training log messages
        'seed': 12
    }
    rounds = 10
    y = train['loss']
    X = train.drop(['loss', 'id'], axis=1)

    xgtrain = xgb.DMatrix(X, label=y)
    bst = xgb.train(params, xgtrain, num_boost_round=rounds)

    features = [x for x in train.columns if x not in ['id', 'loss']]
    create_feature_map(features)

    # The F-score is the number of times a feature is used to split a node.
    # Features that are never used in a split do not appear in the result.
    importance = bst.get_fscore(fmap='xgb.fmap')
    importance = sorted(importance.items(), key=operator.itemgetter(1))

    # Normalize the split counts so that they sum to 1.
    df = pd.DataFrame(importance, columns=['feature', 'fscore'])
    df['fscore'] = df['fscore'] / df['fscore'].sum()
    os.makedirs("../input/feat_sel", exist_ok=True)
    df.to_csv("../input/feat_sel/feat_importance.csv", index=False)

    # df.plot creates its own figure for the horizontal bar chart.
    df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(6, 10))
    plt.title('XGBoost Feature Importance')
    plt.xlabel('relative importance')
    plt.show()
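
Once the importance scores are available, one simple way to act on them is to keep only the highest-ranked features. The sketch below is not part of the original script: the helper names `normalize_scores` and `select_top_features`, the cutoff `k`, and the toy split counts are all assumptions made for illustration. It works on a plain dict of the kind `get_fscore` returns (feature name mapped to split count).

```python
def normalize_scores(fscore):
    """Turn raw split counts into relative importances that sum to 1."""
    total = sum(fscore.values())
    return {name: count / total for name, count in fscore.items()}


def select_top_features(fscore, k):
    """Return the names of the k features with the highest scores."""
    ranked = sorted(fscore.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up split counts, standing in for the output of bst.get_fscore().
toy_fscore = {"cont7": 30, "cat80": 50, "cont2": 15, "cat1": 5}

relative = normalize_scores(toy_fscore)
top2 = select_top_features(toy_fscore, k=2)

print(relative)
print(top2)
```

The selected names could then be used to subset the training frame (for example `X[top2]`) before retraining. The cutoff is a modelling choice, and checking it against a validation score is safer than picking a fixed number.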

GitHub : https://github.com/wenwu313/Kaggle-Solution
