xgboost feature selection: ranking features by importance

Posted by weixin_34255793 on 2018-04-17
import pandas as pd
import xgboost as xgb
import operator
from matplotlib import pylab as plt

def ceate_feature_map(features):
    # write a feature-map file in the format xgboost expects:
    # <index>\t<feature name>\t<type>  (q = quantitative)
    with open('xgb.fmap', 'w') as outfile:
        for i, feat in enumerate(features):
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))

def get_data():
    train = pd.read_csv("../input/train.csv")

    features = list(train.columns[2:])

    y_train = train.Hazard

    # target-mean encode categorical columns: replace each category
    # with the mean Hazard of the rows in that category
    for feat in train.select_dtypes(include=['object']).columns:
        m = train.groupby(feat)['Hazard'].mean()
        train[feat] = train[feat].map(m)

    x_train = train[features]

    return features, x_train, y_train

def get_data2():
    from sklearn.datasets import load_iris
    # load the iris dataset as a quick stand-in for the Kaggle data
    iris = load_iris()
    x_train = pd.DataFrame(iris.data)
    features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    x_train.columns = features
    y_train = pd.DataFrame(iris.target)
    return features, x_train, y_train

#features, x_train, y_train = get_data()
features, x_train, y_train = get_data2()
ceate_feature_map(features)

xgb_params = {"objective": "reg:linear",  # renamed "reg:squarederror" in newer xgboost releases
              "eta": 0.01, "max_depth": 8, "seed": 42, "silent": 1}
num_rounds = 1000

dtrain = xgb.DMatrix(x_train, label=y_train)
gbdt = xgb.train(xgb_params, dtrain, num_rounds)

importance = gbdt.get_fscore(fmap='xgb.fmap')
importance = sorted(importance.items(), key=operator.itemgetter(1))

df = pd.DataFrame(importance, columns=['feature', 'fscore'])
df['fscore'] = df['fscore'] / df['fscore'].sum()

# horizontal bar chart of the normalized importance scores
df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(16, 10))
plt.title('XGBoost Feature Importance')
plt.xlabel('relative importance')
plt.gcf().savefig('feature_importance_xgb.png')

Each split (which feature at which threshold) is chosen by the gain in the structure score; a feature's importance here is simply the total number of times it is used as a split point across all trees.

Reference: https://blog.csdn.net/q383700092/article/details/53698760
Aside: I ran into a problem while using xgboost.

A workaround circulating online:
       Create a new Python file and copy your code into it; renaming the file can also work. If that still fails, copy the code somewhere else (outside the original folder) so it gets recompiled, after which it runs normally.
       But I don't think this addresses the root cause, even though it works as a stopgap. Discussion welcome!
Root cause:
This is a mistake beginners, or anyone not very familiar with Python, tend to make, and avoiding it takes just one rule: never use a module name as a file name, for any file type! My error came from having a file named xgboost.* in the working folder; `import xgboost` searches the current directory first, which triggers the problem. So, once more: never name a file after a module!
Aside: if you see this warning:

D:\Program\Python3.5\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
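Note that this warning actually comes from scikit-learn, not xgboost: the old `sklearn.cross_validation` module was deprecated in 0.18 and removed in 0.20 in favor of `sklearn.model_selection`, so updating the import silences it regardless of which xgboost build is installed. A minimal migration sketch:

```python
# The deprecated import (removed in scikit-learn 0.20):
#   from sklearn.cross_validation import train_test_split
# migrates to:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```

Reinstalling xgboost, as described below, may still be worthwhile for other reasons, but it is not what the warning is about.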

First uninstall the existing xgboost: pip uninstall xgboost

Then download and install a newer xgboost wheel from: https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost

Command: pip install xgboost-0.6-cp35-none-win_amd64.whl
