Modeling and Hyperparameter Tuning

Posted by Olaf-雪寶 on 2020-09-24

First, a brief overview of the models.

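Logistic regression

Logistic regression is a linear model that suits binary classification. Its decision function is the sigmoid, which maps the linear score to a probability in (0, 1); comparing that probability with a threshold (usually 0.5) gives the 0/1 label, which is why the model fits two-class problems. It has several advantages. Training and prediction are fast, because the cost depends mainly on the number of features, and memory use is small, because the model only has to store one weight per feature. It also has drawbacks: missing values and outliers must be handled beforehand, because the model cannot work with missing values.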
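To make the sigmoid-plus-threshold step concrete, here is a minimal NumPy sketch. The weights, bias, and sample rows are made-up illustration values, not anything fitted on the loan data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a 2-feature model (illustration only).
w = np.array([1.5, -2.0])
b = 0.25

X_demo = np.array([[2.0, 0.5],    # linear score = 3.0 - 1.0 + 0.25 = 2.25
                   [0.0, 1.0],    # linear score = 0.0 - 2.0 + 0.25 = -1.75
                   [1.0, 1.0]])   # linear score = 1.5 - 2.0 + 0.25 = -0.25

proba = sigmoid(X_demo @ w + b)        # probability of the positive class
labels = (proba >= 0.5).astype(int)    # thresholding turns it into 0/1
print(proba.round(3), labels)
```

Note that the model only needs `w` and `b` at prediction time, which is where the small memory footprint comes from.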

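Decision tree

The biggest advantage of a decision tree is that it is very intuitive once visualized: you can see exactly which feature each split uses. The data needs little preparation, normalization is not required, and some implementations can handle missing values directly. There are two kinds, regression trees and classification trees. The drawbacks are just as clear. A tree overfits very easily, which is why so many pruning algorithms exist; they fall into two broad groups, pre-pruning and post-pruning. And because the tree is grown greedily, it tends to end up in a local optimum; techniques such as simulated annealing can be used to escape it.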
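As a rough illustration of the two pruning families, the sketch below uses scikit-learn on synthetic data. Treating `max_depth` / `min_samples_leaf` as pre-pruning and cost-complexity pruning (`ccp_alpha`) as post-pruning is my choice of example; the specific values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example is self-contained.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Unpruned: grows until every leaf is pure, so it memorizes the training set.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Pre-pruning: stop growing early through depth and leaf-size limits.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)
# Post-pruning: grow fully, then cut back with cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unpruned", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", tree.get_n_leaves(),
          "train acc:", round(tree.score(X_tr, y_tr), 3),
          "test acc:", round(tree.score(X_te, y_te), 3))
```

Comparing the train and test accuracy per row is a quick way to see how much of the unpruned tree's fit is memorization.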

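Ensemble models

Ensemble learning completes a task by combining several learners. An ensemble method can combine multiple weak learners into one strong learner, so an ensemble usually generalizes better than a single classifier. The two main families are Bagging and Boosting. Both combine existing classification or regression algorithms into a stronger model; they differ in how the learners are combined, and that leads to different results. A common Bagging-based model is random forest; Boosting-based models include AdaBoost, GBDT, XGBoost, and LightGBM.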
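A quick way to get a feel for the difference is to cross-validate one single tree, one Bagging model, and one Boosting model on the same data. This is a sketch on synthetic data with scikit-learn estimators standing in for the libraries named above; the sizes and settings are arbitrary assumptions, and the numbers say nothing about the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees trained independently on bootstrap samples, then averaged.
    "random forest (bagging)": RandomForestClassifier(n_estimators=50, random_state=0),
    # Boosting: trees trained one after another, each correcting the previous ones.
    "GBDT (boosting)": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {name: cross_val_score(model, X_demo, y_demo, cv=3, scoring="roc_auc").mean()
          for name, model in models.items()}
for name, auc in scores.items():
    print(f"{name}: mean AUC = {auc:.3f}")
```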

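Model evaluation metric

AUC is the evaluation metric used here. To understand it we first need ROC, which in turn builds on the confusion matrix, recall, and related quantities for classification problems. Chapter 2 of the "watermelon book" (Zhou Zhihua's Machine Learning) covers these in detail, so I will assume you have read it and go straight to AUC. In logistic regression, positive and negative examples are separated by a threshold: samples scoring above it are labelled positive, the rest negative. Lowering the threshold marks more samples as positive, which raises the share of true positives that are caught, but it also causes more negatives to be misclassified as positive. ROC is introduced to show this trade-off visually. Each threshold gives one point in ROC space, and connecting the points gives the ROC curve, with the False Positive Rate (FPR) on the horizontal axis and the True Positive Rate (TPR) on the vertical axis. In general the curve should lie above the line joining (0,0) and (1,1). Four points in ROC space are worth knowing: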

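  • Point (0,1): FPR=0 and TPR=1, which means FN=0 and FP=0; every sample is classified correctly.

  • Point (1,0): FPR=1 and TPR=0, the worst possible classifier; it gets every sample wrong.

  • Point (0,0): FPR=TPR=0, so FP=TP=0; the classifier predicts every instance as negative.

  • Point (1,1): the classifier predicts every instance as positive.

  • In short: the closer the ROC curve is to the top-left corner, the better the classifier performs. As a rule of thumb, a smooth ROC curve also suggests that there is no severe overfitting.
    But given two models, how do we decide which one generalizes better? There are two main approaches: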

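If the ROC curve of model A completely encloses the ROC curve of model B, we consider model A better than model B.

If the two curves cross, we compare the area under each ROC curve, that is, the area between the curve and the horizontal axis. The larger the area, the better the model. This area is called AUC (area under the ROC curve).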
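To see how a ROC curve and its AUC come out of raw scores, here is a small hand-rolled sketch: sweep the threshold from high to low, record (FPR, TPR) at each step, and integrate with the trapezoidal rule. The eight labels and scores are invented for the example; `roc_auc_score` from scikit-learn is used only as a cross-check.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (illustration only).
y_toy = np.array([0, 0, 1, 1, 0, 1, 0, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])

def roc_points(y_true, y_score):
    """Return (FPR, TPR) arrays obtained by lowering the threshold step by step."""
    # Start above every score so the first point is (0, 0).
    thresholds = np.r_[np.inf, np.sort(np.unique(y_score))[::-1]]
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for t in thresholds:
        pred_pos = y_score >= t
        tpr.append((pred_pos & (y_true == 1)).sum() / n_pos)
        fpr.append((pred_pos & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

fpr_toy, tpr_toy = roc_points(y_toy, s_toy)
# Trapezoidal rule: sum of width * average height over consecutive points.
auc_manual = np.sum(np.diff(fpr_toy) * (tpr_toy[1:] + tpr_toy[:-1]) / 2)

print(list(zip(fpr_toy, tpr_toy)))
print("manual AUC:", auc_manual, "sklearn AUC:", roc_auc_score(y_toy, s_toy))
```

The curve starts at (0,0), where nothing is predicted positive, and ends at (1,1), where everything is, which matches two of the four points listed above.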

import pandas as pd
import numpy as np
import warnings
import os
import seaborn as sns
import matplotlib.pyplot as plt

# Seaborn settings.
# Styles: darkgrid (default), whitegrid, dark, white, ticks.
# Contexts, from small to large: paper, notebook (default), talk, poster.
# Everything is passed in one call, because a later bare sns.set(...) would
# reset a style or context chosen earlier. font='SimHei' lets Seaborn show Chinese text.
sns.set(style='whitegrid', context='talk', font='SimHei')
# Chinese font for matplotlib (SimHei)
plt.rcParams['font.sans-serif'] = ['SimHei']
# Show the minus sign '-' correctly instead of a box in saved figures
plt.rcParams['axes.unicode_minus'] = False
# Reduce memory usage by downcasting each column to the smallest dtype that can hold its values
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2  # bytes -> MB
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Note: float16 keeps only about 3 significant decimal digits,
                # so values are rounded when a column is stored this way.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Text columns become pandas categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2  # bytes -> MB
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

data = pd.read_csv('data_for_model01.csv')
data = reduce_mem_usage(data)
D:\Anaconda1\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)


Memory usage of dataframe is 756.49 MB
Memory usage after optimization is: 172.85 MB
Decreased by 77.2%
data.head()
   loanAmnt  term  interestRate  installment  grade  subGrade  employmentTitle  employmentLength  homeOwnership  annualIncome  ...
0   35008.0     5     19.515625     918.0000      5        21           161280               2.0              2      110000.0  ...
1   18000.0     5     18.484375     462.0000      4        16            89538               5.0              0       46000.0  ...
2   12000.0     5     16.984375     298.2500      4        17           159367               8.0              0       74000.0  ...
3    2050.0     3      7.691406      63.9375      1         3            59830               9.0              0       35000.0  ...
4   11504.0     3     14.976562     398.5000      3        12            85242               1.0              1       30000.0  ...

   ...  grade_to_std_n11  grade_to_mean_n12  grade_to_std_n12  grade_to_mean_n13  grade_to_std_n13  grade_to_mean_n14  grade_to_std_n14  sample  n2.2  n2.3
0  ...          4.011719           1.852539          4.011719           1.857422          4.003906           1.856445          3.992188   train   NaN   NaN
1  ...          3.207031           1.482422          3.207031           1.486328          3.205078           1.485352          3.193359   train   NaN   NaN
2  ...          3.207031           1.482422          3.207031           1.486328          3.205078           1.315430          3.146484   train   NaN   NaN
3  ...          0.801758           0.370605          0.801758           0.371582          0.801270           0.344238          0.793457   train   NaN   NaN
4  ...          2.406250           1.111328          2.406250           1.114258          2.402344           1.114258          2.394531   train   NaN   NaN

5 rows × 122 columns
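One side effect of the downcasting is visible in the preview: values such as 35008.0 and 19.515625 look like rounded versions of "nicer" numbers. The sketch below shows how float16 storage produces exactly that kind of rounding. The raw values 35000.0, 19.52, and 917.97 are my assumption about what the unrounded inputs may have been; they are not taken from the file.

```python
import numpy as np

# Hypothetical raw values, chosen to resemble the first row of the preview.
raw = np.array([35000.0, 19.52, 917.97])

as_f16 = raw.astype(np.float16)   # what reduce_mem_usage picks for small-range float columns
as_f32 = raw.astype(np.float32)   # a safer floor if precision matters

print(as_f16.astype(np.float64))  # float16 keeps roughly 3 significant decimal digits
print(as_f32.astype(np.float64))
print(np.finfo(np.float16).max)   # largest finite float16 value
```

Columns whose values exceed the float16 maximum (65504), such as annualIncome at 110000.0, fail the float16 range check and are stored as float32 instead. If the rounding matters for a feature, one option is to skip the float16 branch and start at float32.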

from sklearn.model_selection import KFold
# Separate the train and test rows so that cross-validation is convenient
X_train = data.loc[data['sample']=='train', :].drop(['isDefault', 'sample'], axis=1)
X_test = data.loc[data['sample']=='test', :].drop(['isDefault', 'sample'], axis=1)
y_train = data.loc[data['sample']=='train', 'isDefault']

# 5-fold cross-validation
folds = 5
seed = 2020
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
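For an imbalanced target, plain `KFold` does not guarantee the same share of positives in every fold, while `StratifiedKFold` does. The toy sketch below counts positives per validation fold for both; the 20-sample label vector is invented, and which splitter suits the loan data best is left open here.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y_toy = np.array([0] * 16 + [1] * 4)          # toy labels with 20% positives
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

kf_toy = KFold(n_splits=4, shuffle=True, random_state=2020)
skf_toy = StratifiedKFold(n_splits=4, shuffle=True, random_state=2020)

# split() yields positional indices; count the positives in each validation part.
kf_pos = [int(y_toy[val_idx].sum()) for _, val_idx in kf_toy.split(X_toy)]
skf_pos = [int(y_toy[val_idx].sum()) for _, val_idx in skf_toy.split(X_toy, y_toy)]

print("positives per validation fold, KFold:          ", kf_pos)
print("positives per validation fold, StratifiedKFold:", skf_pos)
```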
"""Split the training data into a training set and a validation set"""
from sklearn.model_selection import train_test_split
import lightgbm as lgb
# Split the data set
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
valid_matrix = lgb.Dataset(X_val, label=y_val)

params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'learning_rate': 0.1,
            'metric': 'auc',
            'min_child_weight': 1e-3,
            'num_leaves': 31,
            'max_depth': -1,
            'reg_lambda': 0,
            'reg_alpha': 0,
            'feature_fraction': 1,
            'bagging_fraction': 1,
            'bagging_freq': 0,
            'seed': 2020,
            'nthread': 8,
            'silent': True,
            'verbose': -1,
}

"""Train the model on the training split"""
# Note: lgb.train accepts verbose_eval and early_stopping_rounds only in LightGBM < 4.0.
# From 4.0 on, pass callbacks=[lgb.early_stopping(200), lgb.log_evaluation(1000)] instead.
model = lgb.train(params, train_set=train_matrix, valid_sets=valid_matrix, num_boost_round=20000, verbose_eval=1000, early_stopping_rounds=200)
D:\Anaconda1\lib\site-packages\lightgbm\basic.py:794: UserWarning: silent keyword has been found in `params` and will be ignored.
Please use silent argument of the Dataset constructor to pass this parameter.
  .format(key))


Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[330]	valid_0's auc: 0.731887
from sklearn import metrics
from sklearn.metrics import roc_auc_score

"""Predict and compute the ROC metrics"""
val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the untuned single LightGBM model on the validation set: {}'.format(roc_auc))
"""Plot the ROC curve"""
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, 'b', label = 'Val AUC = %0.4f' % roc_auc)
plt.ylim(0,1)
plt.xlim(0,1)
plt.legend(loc='best')
plt.title('ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# Draw the diagonal
plt.plot([0,1],[0,1],'r--')
plt.show()
AUC of the untuned single LightGBM model on the validation set: 0.7318871300593701

[Figure: ROC curve on the validation set (output_7_1.png)]

import lightgbm as lgb
"""Model and predict with LightGBM using 5-fold cross-validation"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    # kf.split yields positions, so index both X and y with .iloc
    X_train_split, y_train_split = X_train.iloc[train_index], y_train.iloc[train_index]
    X_val, y_val = X_train.iloc[valid_index], y_train.iloc[valid_index]

    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)

    params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'learning_rate': 0.1,
                'metric': 'auc',

                'min_child_weight': 1e-3,
                'num_leaves': 31,
                'max_depth': -1,
                'reg_lambda': 0,
                'reg_alpha': 0,
                'feature_fraction': 1,
                'bagging_fraction': 1,
                'bagging_freq': 0,
                'seed': 2020,
                'nthread': 8,
                'silent': True,
                'verbose': -1,
    }

    model = lgb.train(params, train_set=train_matrix, num_boost_round=20000, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)

    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_score_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
************************************ 1 ************************************


D:\Anaconda1\lib\site-packages\lightgbm\basic.py:794: UserWarning: silent keyword has been found in `params` and will be ignored.
Please use silent argument of the Dataset constructor to pass this parameter.
  .format(key))


Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[308]	valid_0's auc: 0.729253
[0.729252686605049]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[337]	valid_0's auc: 0.730723
[0.729252686605049, 0.7307233610934907]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[527]	valid_0's auc: 0.732105
[0.729252686605049, 0.7307233610934907, 0.7321048628412448]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[381]	valid_0's auc: 0.727511
[0.729252686605049, 0.7307233610934907, 0.7321048628412448, 0.7275111359476779]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[462]	valid_0's auc: 0.732217
[0.729252686605049, 0.7307233610934907, 0.7321048628412448, 0.7275111359476779, 0.7322174754202134]
lgb_score_list:[0.729252686605049, 0.7307233610934907, 0.7321048628412448, 0.7275111359476779, 0.7322174754202134]
lgb_score_mean:0.7303619043815351
lgb_score_std:0.0017871174424543119
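When reporting the cross-validated result, it helps to give the mean together with a spread. The snippet below recomputes the summary from the five fold AUCs printed above. `np.std` defaults to the population formula (`ddof=0`), which is what the loop prints; the sample standard deviation (`ddof=1`) is shown next to it as an alternative, not as the author's choice.

```python
import numpy as np

# AUC values copied from the five folds above.
fold_aucs = [0.729252686605049, 0.7307233610934907, 0.7321048628412448,
             0.7275111359476779, 0.7322174754202134]

auc_mean = np.mean(fold_aucs)
auc_std_pop = np.std(fold_aucs)              # ddof=0, matches lgb_score_std above
auc_std_sample = np.std(fold_aucs, ddof=1)   # slightly larger with only 5 folds

print(f"AUC = {auc_mean:.4f} +/- {auc_std_sample:.4f} (sample std over {len(fold_aucs)} folds)")
```

A tuned model whose mean AUC differs from this baseline by less than about one standard deviation is hard to distinguish from fold-to-fold noise.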

from sklearn.model_selection import GridSearchCV, StratifiedKFold

def get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0, 
                       feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):
    # 5-fold stratified cross-validation
    cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
    
    model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,
                                   n_estimators=n_estimators,
                                   num_leaves=num_leaves,
                                   max_depth=max_depth,
                                   bagging_fraction=bagging_fraction,
                                   feature_fraction=feature_fraction,
                                   bagging_freq=bagging_freq,
                                   min_data_in_leaf=min_data_in_leaf,
                                   min_child_weight=min_child_weight,
                                   min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda,
                                   reg_alpha=reg_alpha,
                                   n_jobs= 8
                                  )
    # Search only the parameters listed in param_grid; everything else stays fixed
    grid_search = GridSearchCV(estimator=model_lgb, 
                               cv=cv_fold,
                               param_grid=param_grid,
                               scoring='roc_auc'
                              )
    grid_search.fit(X_train, y_train)

    print('Current best parameters: {}'.format(grid_search.best_params_))
    print('Current best score: {}'.format(grid_search.best_score_))
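The signature of this function, with a default for every parameter plus a `param_grid`, suggests it is meant to be called in stages: search a few parameters, fix the winners, then search the next group. The sketch below shows that pattern under loud assumptions: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in so it runs without LightGBM, the data is synthetic, and the helper `search` and both grids are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

def search(fixed, grid):
    """Grid-search `grid` while holding the already chosen `fixed` parameters constant."""
    estimator = GradientBoostingClassifier(n_estimators=30, random_state=0, **fixed)
    gs = GridSearchCV(estimator, param_grid=grid, cv=cv_demo, scoring="roc_auc")
    gs.fit(X_demo, y_demo)
    return gs.best_params_, gs.best_score_

# Stage 1: tree shape.
best, score1 = search({}, {"max_depth": [2, 3, 4]})
# Stage 2: keep the stage-1 winner and tune row subsampling.
best_stage2, score2 = search(best, {"subsample": [0.7, 1.0]})
best.update(best_stage2)

print("chosen parameters:", best, "AUC after stage 2:", round(score2, 4))
```

Searching stage by stage keeps the number of fits manageable, at the cost of possibly missing interactions between parameters tuned in different stages.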
# 5-fold stratified cross-validation

from sklearn.model_selection import KFold,StratifiedKFold
cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
final_params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',  # added: without it LightGBM defaults to a regression objective
                'learning_rate': 0.01,
                'num_leaves': 29,
                'max_depth': 7,
                'min_data_in_leaf':45,
                'min_child_weight':0.001,
                'bagging_fraction': 0.9,
                'feature_fraction': 0.9,
                'bagging_freq': 40,
                'min_split_gain': 0,
                'reg_lambda':0,
                'reg_alpha':0,
                'nthread': 6
               }

# lgb_train was never defined in the original notebook, so this cell stopped with
# "NameError: name 'lgb_train' is not defined". Building it from the full
# training set is an assumption.
lgb_train = lgb.Dataset(X_train, label=y_train)

cv_result = lgb.cv(train_set=lgb_train,
                   early_stopping_rounds=20,
                   num_boost_round=5000,
                   nfold=5,
                   stratified=True,
                   shuffle=True,
                   params=final_params,
                   metrics='auc',
                   seed=0,
                  )

# In LightGBM >= 4.0 the result key is 'valid auc-mean' instead of 'auc-mean'
print('Number of iterations: {}'.format(len(cv_result['auc-mean'])))
print('Cross-validated AUC: {}'.format(max(cv_result['auc-mean'])))
pip install bayesian-optimization
from sklearn.model_selection import cross_val_score

"""Define the objective function to optimize"""
def rf_cv_lgb(num_leaves, max_depth, bagging_fraction, feature_fraction, bagging_freq, min_data_in_leaf, 
              min_child_weight, min_split_gain, reg_lambda, reg_alpha):
    # Build the model; the optimizer proposes floats, so integer parameters are cast with int()
    model_lgb = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metric='auc',
                                   learning_rate=0.1, n_estimators=5000,
                                   num_leaves=int(num_leaves), max_depth=int(max_depth), 
                                   bagging_fraction=round(bagging_fraction, 2), feature_fraction=round(feature_fraction, 2),
                                   bagging_freq=int(bagging_freq), min_data_in_leaf=int(min_data_in_leaf),
                                   min_child_weight=min_child_weight, min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda, reg_alpha=reg_alpha,
                                   n_jobs= 8
                                  )
    
    # X_train_split / y_train_split are the training part left over from the last fold above
    val = cross_val_score(model_lgb, X_train_split, y_train_split, cv=5, scoring='roc_auc').mean()
    
    return val
from bayes_opt import BayesianOptimization
"""Define the search range of each parameter"""
bayes_lgb = BayesianOptimization(
    rf_cv_lgb, 
    {
        'num_leaves':(10, 200),
        'max_depth':(3, 20),
        'bagging_fraction':(0.5, 1.0),
        'feature_fraction':(0.5, 1.0),
        'bagging_freq':(0, 100),
        'min_data_in_leaf':(10,100),
        'min_child_weight':(0, 10),
        'min_split_gain':(0.0, 1.0),
        'reg_alpha':(0.0, 10),
        'reg_lambda':(0.0, 10),
    }
)

"""Run the optimization"""
bayes_lgb.maximize(n_iter=10)
|   iter    |  target   | baggin... | baggin... | featur... | max_depth | min_ch... | min_da... | min_sp... | num_le... | reg_alpha | reg_la... |
-------------------------------------------------------------------------------------------------------------------------------------------------
|  1        |  0.7171   |  0.5841   |  45.89    |  0.9789   |  15.1     |  4.607    |  48.88    |  0.4838   |  16.29    |  1.699    |  1.449    |

[Traceback omitted: the run was interrupted by hand while cross_val_score was fitting LightGBM for the second parameter set, so only the first evaluation above was recorded.]

KeyboardInterrupt: 
bayes_lgb.max
{'target': 0.7170845006643078,
 'params': {'bagging_fraction': 0.5841220522935171,
  'bagging_freq': 45.89371469870785,
  'feature_fraction': 0.9788842825399383,
  'max_depth': 15.098220845321368,
  'min_child_weight': 4.606814369239687,
  'min_data_in_leaf': 48.875222916404226,
  'min_split_gain': 0.4837879568993534,
  'num_leaves': 16.292948242912633,
  'reg_alpha': 1.699317625022757,
  'reg_lambda': 1.4494033099871717}}
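The optimizer returns every parameter as a float, while LightGBM expects integers for several of them. Below is a small sketch of converting the result into usable parameters, mirroring the `int(...)` and `round(..., 2)` casts inside `rf_cv_lgb`. The helper name `to_lgb_params` is hypothetical, and the input dict is copied from the output above.

```python
# 'params' part of bayes_lgb.max, copied from the output above.
best_raw = {
    'bagging_fraction': 0.5841220522935171,
    'bagging_freq': 45.89371469870785,
    'feature_fraction': 0.9788842825399383,
    'max_depth': 15.098220845321368,
    'min_child_weight': 4.606814369239687,
    'min_data_in_leaf': 48.875222916404226,
    'min_split_gain': 0.4837879568993534,
    'num_leaves': 16.292948242912633,
    'reg_alpha': 1.699317625022757,
    'reg_lambda': 1.4494033099871717,
}

INT_PARAMS = ('num_leaves', 'max_depth', 'bagging_freq', 'min_data_in_leaf')
ROUNDED_PARAMS = ('bagging_fraction', 'feature_fraction')

def to_lgb_params(raw):
    """Apply the same casts the objective function used when it scored these values."""
    params = dict(raw)
    for key in INT_PARAMS:
        params[key] = int(params[key])        # truncates, like int() in rf_cv_lgb
    for key in ROUNDED_PARAMS:
        params[key] = round(params[key], 2)
    return params

tuned = to_lgb_params(best_raw)
print(tuned)
```

Using the same casts as the objective function means the final model is trained with exactly the values that produced the reported target.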
# Tuned parameters. They differ from bayes_lgb.max above, which comes from the
# single evaluation that finished before the interrupt.
base_params_lgb = {
                    'boosting_type': 'gbdt',
                    'objective': 'binary',
                    'metric': 'auc',
                    'learning_rate': 0.01,
                    'num_leaves': 14,
                    'max_depth': 19,
                    'min_data_in_leaf': 37,
                    'min_child_weight':1.6,
                    'bagging_fraction': 0.98,
                    'feature_fraction': 0.69,
                    'bagging_freq': 96,
                    'reg_lambda': 9,
                    'reg_alpha': 7,
                    'min_split_gain': 0.4,
                    'nthread': 8,
                    'seed': 2020,
                    'silent': True,
                    'verbose': -1,
}

# Use a small learning rate and let cross-validation find the number of boosting rounds
cv_result_lgb = lgb.cv(
    train_set=train_matrix,
    early_stopping_rounds=1000, 
    num_boost_round=20000,
    nfold=5,
    stratified=True,
    shuffle=True,
    params=base_params_lgb,
    metrics='auc',
    seed=0
)

print('Number of iterations: {}'.format(len(cv_result_lgb['auc-mean'])))
print('AUC of the final model: {}'.format(max(cv_result_lgb['auc-mean'])))
import lightgbm as lgb
"""Model and predict with LightGBM using 5-fold cross-validation (tuned parameters)"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    # kf.split yields positions, so index both X and y with .iloc
    X_train_split, y_train_split = X_train.iloc[train_index], y_train.iloc[train_index]
    X_val, y_val = X_train.iloc[valid_index], y_train.iloc[valid_index]

    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)

    params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'learning_rate': 0.01,
                'num_leaves': 14,
                'max_depth': 19,
                'min_data_in_leaf': 37,
                'min_child_weight':1.6,
                'bagging_fraction': 0.98,
                'feature_fraction': 0.69,
                'bagging_freq': 96,
                'reg_lambda': 9,
                'reg_alpha': 7,
                'min_split_gain': 0.4,
                'nthread': 8,
                'seed': 2020,
                'silent': True,
    }

    model = lgb.train(params, train_set=train_matrix, num_boost_round=14269, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)

    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_score_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
base_params_lgb = {
                    'boosting_type': 'gbdt',
                    'objective': 'binary',
                    'metric': 'auc',
                    'learning_rate': 0.01,
                    'num_leaves': 14,
                    'max_depth': 19,
                    'min_data_in_leaf': 37,
                    'min_child_weight':1.6,
                    'bagging_fraction': 0.98,
                    'feature_fraction': 0.69,
                    'bagging_freq': 96,
                    'reg_lambda': 9,
                    'reg_alpha': 7,
                    'min_split_gain': 0.4,
                    'nthread': 8,
                    'seed': 2020,
                    'silent': True,
}

"""Train the final model on the training split"""
final_model_lgb = lgb.train(base_params_lgb, train_set=train_matrix, valid_sets=valid_matrix, num_boost_round=13000, verbose_eval=1000, early_stopping_rounds=200)

"""Predict and compute the ROC metrics"""
val_pre_lgb = final_model_lgb.predict(X_val)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the tuned single LightGBM model on the validation set: {}'.format(roc_auc))
"""Plot the ROC curve"""
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, 'b', label = 'Val AUC = %0.4f' % roc_auc)
plt.ylim(0,1)
plt.xlim(0,1)
plt.legend(loc='best')
plt.title('ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# Draw the diagonal
plt.plot([0,1],[0,1],'r--')
plt.show()
import pickle
# Save the final model; the dataset/ directory must already exist
with open('dataset/model_lgb_best.pkl', 'wb') as f:
    pickle.dump(final_model_lgb, f)
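Before relying on a saved model, it is worth checking that it reloads and predicts the same values. The sketch below does that round trip with `pickle`. A scikit-learn `LogisticRegression` fitted on synthetic data stands in for `final_model_lgb`, and a temporary directory stands in for `dataset/`; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model and data (illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
stand_in = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # The with-blocks close the file handles even if dump or load fails.
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same = bool((restored.predict_proba(X_demo) == stand_in.predict_proba(X_demo)).all())
print("identical predictions after reload:", same)
```

Pickle files are tied to the library versions that wrote them, so it is safest to load the model with the same versions that were used for saving.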





