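Modeling and Hyperparameter Tuning
First, a brief overview of the models involved.
Logistic regression
Logistic regression is a linear model suited to binary classification. Its decision function passes a linear combination of the features through the sigmoid function, which yields a probability between 0 and 1; comparing that probability with a threshold gives the 0/1 class label, which is why it fits binary problems. It has several advantages: training and prediction are fast, because the cost of scoring a sample depends only on the number of features, and the memory footprint is small, because the model stores only one weight per feature. It also has drawbacks: for example, missing values and outliers must be handled beforehand, because the model cannot work with missing values.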
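To make the "sigmoid, then threshold" step concrete, here is a minimal sketch using only the standard library. The weights, bias and sample values are made up for illustration; they are not taken from the loan dataset used later.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(weights, bias, features, threshold=0.5):
    """Turn the probability into a 0/1 label by comparing it with a threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical model with two features (illustrative values only)
weights = [0.8, -1.2]
bias = 0.1
sample = [2.0, 0.5]

p = predict_proba(weights, bias, sample)
label = predict_label(weights, bias, sample)
print(p, label)
```

Scoring a sample is one multiply-add per feature, and the model itself is just `weights` plus `bias`, which is where the speed and memory advantages come from.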
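Decision tree
Its biggest advantage is that a visualized tree is very intuitive: you can see exactly which criterion drives each split. It also needs little preprocessing: no normalization is required, and some implementations can handle missing data directly. Decision trees come in two kinds, regression trees and classification trees. The main drawback is that a tree overfits very easily, which is why many pruning algorithms exist; they fall roughly into two groups, pre-pruning and post-pruning. Because a tree is grown with a greedy algorithm, the result is often only locally optimal; there are various ways to try to escape a local optimum, such as simulated annealing.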
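The "greedy" part can be sketched as follows: at a single node, try every candidate threshold and keep the one with the lowest weighted Gini impurity, without looking ahead to later splits. This is a simplified illustration for one numeric feature and 0/1 labels; the function names are hypothetical and real libraries do considerably more.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    n = len(labels)
    # Skip the largest value: splitting there would leave the right side empty
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(values, labels)
print(threshold, impurity)
```

A full tree repeats this choice recursively on each side of the split, so each decision is the best one locally but the tree as a whole is not guaranteed to be the best possible one.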
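Ensemble models
Ensemble learning completes a learning task by combining several learners. Ensemble methods can combine multiple weak learners into one strong learner, so an ensemble usually generalizes better than a single classifier. The two main families are Bagging and Boosting. Both combine existing classification or regression algorithms into a more powerful model; they differ in how the base learners are combined, and therefore produce different results. Common Bagging-based ensembles include random forests; Boosting-based ensembles include AdaBoost, GBDT, XGBoost and LightGBM.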
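As a rough illustration of the Bagging idea only (Boosting is not shown), the sketch below trains several one-threshold "stumps" on bootstrap samples and combines them by majority vote. The helper names and the toy data are assumptions made for this example.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Weak learner: pick the threshold t minimising errors of 'predict 1 if x > t'."""
    best_t, best_err = None, len(ys) + 1
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def bagging_predict(stumps, x):
    """Majority vote over all stumps."""
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = bagging_fit(xs, ys)
print(bagging_predict(stumps, 0), bagging_predict(stumps, 9))
```

Bagging trains its learners independently on resampled data and averages them, whereas Boosting trains them sequentially, with each new learner focusing on the errors of the previous ones.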
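Model evaluation metric
AUC is the evaluation metric used here. To explain it we first need ROC, which builds on the confusion matrix, precision and recall for classification problems. Chapter 2 of the "watermelon book" (Zhou Zhihua's Machine Learning) covers these in detail, so I will assume you have read it and go straight to AUC. In logistic regression, positive and negative examples are separated by a threshold: a sample scoring above the threshold is predicted positive, and one below it negative. Lowering the threshold classifies more samples as positive, which raises the share of positives that are caught, but it also causes more negatives to be misclassified as positive. ROC was introduced to visualize this trade-off. Each threshold gives one point in ROC space, and connecting these points forms the ROC curve. The x-axis is the False Positive Rate (FPR) and the y-axis is the True Positive Rate (TPR). In general the curve should lie above the line connecting (0,0) and (1,1). Four points in ROC space are worth noting:
- Point (0,1): FPR=0 and TPR=1, meaning FN=0 and FP=0; every sample is classified correctly.
- Point (1,0): FPR=1 and TPR=0; the worst possible classifier, which gets every sample wrong.
- Point (0,0): FPR=TPR=0, so FP=TP=0; the classifier predicts every instance as negative.
- Point (1,1): FPR=TPR=1; the classifier predicts every instance as positive.
In short: the closer the ROC curve is to the top-left corner, the better the classifier performs on the data it was evaluated on. As a rule of thumb, a smooth ROC curve also suggests that there is no severe overfitting.
But given two models, how do we judge which one generalizes better? There are two main approaches:
- If model A's ROC curve completely encloses model B's, model A is considered better than model B.
- If the two curves cross, compare the area between each ROC curve and the x-axis; the larger the area, the better the model. This area is called AUC (area under the ROC curve).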
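The construction described above can be sketched in plain Python: sweep the threshold from high to low, record one (FPR, TPR) point per distinct score, and integrate with the trapezoidal rule. This is a simplified illustration with made-up labels and scores; in practice `sklearn.metrics.roc_curve` and `roc_auc_score` do this work.

```python
def roc_points(y_true, scores):
    """ROC points (FPR, TPR) obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Consume all samples sharing this score so that ties yield a single point
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
pts = roc_points(y_true, scores)
print(pts)
print(auc(pts))
```

The curve always starts at (0,0) and ends at (1,1); a classifier that ranks every positive above every negative passes through (0,1) and reaches an AUC of 1.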
import pandas as pd
import numpy as np
import warnings
import os
import seaborn as sns
import matplotlib.pyplot as plt
# Seaborn settings.
# sns.set() applies style, context and font in a single call, so set them together:
# a later call such as sns.set(font=...) would reset style and context to their defaults.
# Styles: darkgrid (default), whitegrid, dark, white, ticks.
# Contexts, from smallest to largest: paper, notebook (default), talk, poster.
# SimHei is a CJK font, needed so that Chinese text renders in plots.
sns.set(style='whitegrid', context='talk', font='SimHei')
plt.rcParams['font.sans-serif'] = ['SimHei']
# Render the minus sign '-' correctly instead of as a box in saved figures
plt.rcParams['axes.unicode_minus'] = False
# Reduce memory usage by downcasting each column to the smallest dtype that fits its values
def reduce_mem_usage(df):
    # memory_usage() returns bytes; convert to MB
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Caution: float16 keeps only about 3 significant decimal digits,
                # so large values are rounded (35000 is stored as 35008.0)
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns are stored as categoricals
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
data = pd.read_csv('data_for_model01.csv')
data = reduce_mem_usage(data)
D:\Anaconda1\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
Memory usage of dataframe is 756.49 MB
Memory usage after optimization is: 172.85 MB
Decreased by 77.2%
data.head()
| | loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | annualIncome | ... | grade_to_std_n11 | grade_to_mean_n12 | grade_to_std_n12 | grade_to_mean_n13 | grade_to_std_n13 | grade_to_mean_n14 | grade_to_std_n14 | sample | n2.2 | n2.3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35008.0 | 5 | 19.515625 | 918.0000 | 5 | 21 | 161280 | 2.0 | 2 | 110000.0 | ... | 4.011719 | 1.852539 | 4.011719 | 1.857422 | 4.003906 | 1.856445 | 3.992188 | train | NaN | NaN |
| 1 | 18000.0 | 5 | 18.484375 | 462.0000 | 4 | 16 | 89538 | 5.0 | 0 | 46000.0 | ... | 3.207031 | 1.482422 | 3.207031 | 1.486328 | 3.205078 | 1.485352 | 3.193359 | train | NaN | NaN |
| 2 | 12000.0 | 5 | 16.984375 | 298.2500 | 4 | 17 | 159367 | 8.0 | 0 | 74000.0 | ... | 3.207031 | 1.482422 | 3.207031 | 1.486328 | 3.205078 | 1.315430 | 3.146484 | train | NaN | NaN |
| 3 | 2050.0 | 3 | 7.691406 | 63.9375 | 1 | 3 | 59830 | 9.0 | 0 | 35000.0 | ... | 0.801758 | 0.370605 | 0.801758 | 0.371582 | 0.801270 | 0.344238 | 0.793457 | train | NaN | NaN |
| 4 | 11504.0 | 3 | 14.976562 | 398.5000 | 3 | 12 | 85242 | 1.0 | 1 | 30000.0 | ... | 2.406250 | 1.111328 | 2.406250 | 1.114258 | 2.402344 | 1.114258 | 2.394531 | train | NaN | NaN |
5 rows × 122 columns
from sklearn.model_selection import KFold
# Separate features and labels so that cross-validation is easy to run
X_train = data.loc[data['sample']=='train', :].drop(['isDefault', 'sample'], axis=1)
X_test = data.loc[data['sample']=='test', :].drop(['isDefault', 'sample'], axis=1)
y_train = data.loc[data['sample']=='train', 'isDefault']
# 5-fold cross-validation
folds = 5
seed = 2020
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
# Split the training data into a training part and a validation part, then train on it
from sklearn.model_selection import train_test_split
import lightgbm as lgb
# Hold out 20% of the training data for validation
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
valid_matrix = lgb.Dataset(X_val, label=y_val)
# Baseline (untuned) parameters. 'silent' is not a training parameter and only
# triggers a UserWarning when placed in params, so it is left out.
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'learning_rate': 0.1,
    'metric': 'auc',
    'min_child_weight': 1e-3,
    'num_leaves': 31,
    'max_depth': -1,
    'reg_lambda': 0,
    'reg_alpha': 0,
    'feature_fraction': 1,
    'bagging_fraction': 1,
    'bagging_freq': 0,
    'seed': 2020,
    'nthread': 8,
    'verbose': -1,
}
# Train on the training part while monitoring AUC on the validation part.
# verbose_eval and early_stopping_rounds are accepted by the LightGBM version used here;
# LightGBM 4.0 and later expect callbacks=[lgb.early_stopping(200), lgb.log_evaluation(1000)] instead.
model = lgb.train(params, train_set=train_matrix, valid_sets=valid_matrix, num_boost_round=20000, verbose_eval=1000, early_stopping_rounds=200)
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[330] valid_0's auc: 0.731887
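The log above says training stopped once the validation AUC had not improved for 200 rounds and that round 330 was the best one. A simplified sketch of that early-stopping rule, using a hypothetical helper and made-up scores (this is not LightGBM's own code):

```python
def best_round_with_early_stopping(val_scores, patience):
    """Return (best_round, best_score); stop once `patience` rounds pass without improvement."""
    best_score = float("-inf")
    best_round = 0
    for rnd, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            break  # no improvement in the last `patience` rounds
    return best_round, best_score

# Made-up validation AUC per boosting round
val_scores = [0.70, 0.72, 0.73, 0.729, 0.728, 0.74]
best_round, best_score = best_round_with_early_stopping(val_scores, patience=2)
print(best_round, best_score)
```

Training stops at round 5, so the later score of 0.74 is never seen; that is the trade-off of a small patience value, and it is why `num_boost_round` can be set very high (20000 here) when early stopping is active.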
from sklearn import metrics
from sklearn.metrics import roc_auc_score
# Predict on the validation set and compute the ROC metrics
val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the untuned LightGBM model on the validation set: {}'.format(roc_auc))
# Plot the ROC curve
plt.figure(figsize=(8, 8))
plt.title('Validation ROC')
plt.plot(fpr, tpr, 'b', label='Val AUC = %0.4f' % roc_auc)
plt.ylim(0, 1)
plt.xlim(0, 1)
plt.legend(loc='best')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# Draw the diagonal, which corresponds to random guessing
plt.plot([0, 1], [0, 1], 'r--')
plt.show()
AUC of the untuned LightGBM model on the validation set: 0.7318871300593701
[Figure: validation ROC curve (output_7_1.png)]
import lightgbm as lgb
# Model with LightGBM using 5-fold cross-validation
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    # kf.split returns positions, so index both X and y with .iloc
    X_train_split, y_train_split = X_train.iloc[train_index], y_train.iloc[train_index]
    X_val, y_val = X_train.iloc[valid_index], y_train.iloc[valid_index]
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'learning_rate': 0.1,
        'metric': 'auc',
        'min_child_weight': 1e-3,
        'num_leaves': 31,
        'max_depth': -1,
        'reg_lambda': 0,
        'reg_alpha': 0,
        'feature_fraction': 1,
        'bagging_fraction': 1,
        'bagging_freq': 0,
        'seed': 2020,
        'nthread': 8,
        'verbose': -1,
    }
    model = lgb.train(params, train_set=train_matrix, num_boost_round=20000, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)
print("lgb_score_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[308] valid_0's auc: 0.729253
[0.729252686605049]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[337] valid_0's auc: 0.730723
[0.729252686605049, 0.7307233610934907]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[527] valid_0's auc: 0.732105
[0.729252686605049, 0.7307233610934907, 0.7321048628412448]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[381] valid_0's auc: 0.727511
[0.729252686605049, 0.7307233610934907, 0.7321048628412448, 0.7275111359476779]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[462] valid_0's auc: 0.732217
[0.729252686605049, 0.7307233610934907, 0.7321048628412448, 0.7275111359476779, 0.7322174754202134]
lgb_score_list:[0.729252686605049, 0.7307233610934907, 0.7321048628412448, 0.7275111359476779, 0.7322174754202134]
lgb_score_mean:0.7303619043815351
lgb_score_std:0.0017871174424543119
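The cross-validation loop relies on scikit-learn's KFold to produce the train/validation positions for each fold. As a rough standard-library sketch of the idea (not scikit-learn's implementation, and the shuffle order will not match KFold's), the hypothetical generator `kfold_indices` below splits the sample positions into folds and uses each fold once for validation.

```python
import random

def kfold_indices(n_samples, n_splits=5, shuffle=True, seed=2020):
    """Yield (train_positions, valid_positions) for each of the n_splits folds."""
    idx = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(idx)
    # The first (n_samples % n_splits) folds receive one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, valid
        start += size

all_folds = list(kfold_indices(12, n_splits=5))
for fold_no, (train, valid) in enumerate(all_folds, start=1):
    print(fold_no, sorted(valid))
```

Every sample lands in exactly one validation fold, so the mean and standard deviation of the five fold scores summarize performance over the whole training set rather than over a single random split.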
from sklearn.model_selection import GridSearchCV, StratifiedKFold
def get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0,
                       feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001,
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):
    # Stratified 5-fold cross-validation
    cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
    model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,
                                   n_estimators=n_estimators,
                                   num_leaves=num_leaves,
                                   max_depth=max_depth,
                                   bagging_fraction=bagging_fraction,
                                   feature_fraction=feature_fraction,
                                   bagging_freq=bagging_freq,
                                   min_data_in_leaf=min_data_in_leaf,
                                   min_child_weight=min_child_weight,
                                   min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda,
                                   reg_alpha=reg_alpha,
                                   n_jobs=8)
    # Try every combination in param_grid and score each one by cross-validated AUC
    grid_search = GridSearchCV(estimator=model_lgb,
                               cv=cv_fold,
                               param_grid=param_grid,
                               scoring='roc_auc')
    grid_search.fit(X_train, y_train)
    print('Current best parameters: {}'.format(grid_search.best_params_))
    print('Current best score: {}'.format(grid_search.best_score_))
# Stratified 5-fold cross-validation
from sklearn.model_selection import KFold, StratifiedKFold
cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
# Assumption: lgb_train is the full training set wrapped as a LightGBM Dataset.
# The original cell never defined it and failed with "NameError: name 'lgb_train' is not defined".
lgb_train = lgb.Dataset(X_train, label=y_train)
final_params = {
    'boosting_type': 'gbdt',
    'learning_rate': 0.01,
    'num_leaves': 29,
    'max_depth': 7,
    'min_data_in_leaf': 45,
    'min_child_weight': 0.001,
    'bagging_fraction': 0.9,
    'feature_fraction': 0.9,
    'bagging_freq': 40,
    'min_split_gain': 0,
    'reg_lambda': 0,
    'reg_alpha': 0,
    'nthread': 6
}
cv_result = lgb.cv(train_set=lgb_train,
                   early_stopping_rounds=20,
                   num_boost_round=5000,
                   nfold=5,
                   stratified=True,
                   shuffle=True,
                   params=final_params,
                   metrics='auc',
                   seed=0,
                   )
# The result key is 'auc-mean' in the LightGBM version used here; newer releases may name it differently
print('Number of iterations: {}'.format(len(cv_result['auc-mean'])))
print('Cross-validated AUC: {}'.format(max(cv_result['auc-mean'])))
pip install bayesian-optimization
from sklearn.model_selection import cross_val_score
# Objective function for the optimizer: mean cross-validated AUC for one set of hyperparameters
def rf_cv_lgb(num_leaves, max_depth, bagging_fraction, feature_fraction, bagging_freq, min_data_in_leaf,
              min_child_weight, min_split_gain, reg_lambda, reg_alpha):
    # Build the model. The optimizer proposes floats, so integer parameters are cast with int().
    model_lgb = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metric='auc',
                                   learning_rate=0.1, n_estimators=5000,
                                   num_leaves=int(num_leaves), max_depth=int(max_depth),
                                   bagging_fraction=round(bagging_fraction, 2), feature_fraction=round(feature_fraction, 2),
                                   bagging_freq=int(bagging_freq), min_data_in_leaf=int(min_data_in_leaf),
                                   min_child_weight=min_child_weight, min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda, reg_alpha=reg_alpha,
                                   n_jobs=8)
    # X_train_split / y_train_split are whatever the most recent split left in scope
    val = cross_val_score(model_lgb, X_train_split, y_train_split, cv=5, scoring='roc_auc').mean()
    return val
from bayes_opt import BayesianOptimization
# Search space: (lower bound, upper bound) for each hyperparameter
bayes_lgb = BayesianOptimization(
    rf_cv_lgb,
    {
        'num_leaves': (10, 200),
        'max_depth': (3, 20),
        'bagging_fraction': (0.5, 1.0),
        'feature_fraction': (0.5, 1.0),
        'bagging_freq': (0, 100),
        'min_data_in_leaf': (10, 100),
        'min_child_weight': (0, 10),
        'min_split_gain': (0.0, 1.0),
        'reg_alpha': (0.0, 10),
        'reg_lambda': (0.0, 10),
    }
)
# Run the optimization
bayes_lgb.maximize(n_iter=10)
| iter | target | baggin... | baggin... | featur... | max_depth | min_ch... | min_da... | min_sp... | num_le... | reg_alpha | reg_la... |
-------------------------------------------------------------------------------------------------------------------------------------------------
| 1 | 0.7171 | 0.5841 | 45.89 | 0.9789 | 15.1 | 4.607 | 48.88 | 0.4838 | 16.29 | 1.699 | 1.449 |
The run was then interrupted by hand (KeyboardInterrupt) while the second set of parameters was being evaluated inside cross_val_score, so only the first iteration was recorded. Each evaluation fits five models with n_estimators=5000 and no early stopping, which makes a single iteration slow.
bayes_lgb.max
{'target': 0.7170845006643078,
'params': {'bagging_fraction': 0.5841220522935171,
'bagging_freq': 45.89371469870785,
'feature_fraction': 0.9788842825399383,
'max_depth': 15.098220845321368,
'min_child_weight': 4.606814369239687,
'min_data_in_leaf': 48.875222916404226,
'min_split_gain': 0.4837879568993534,
'num_leaves': 16.292948242912633,
'reg_alpha': 1.699317625022757,
'reg_lambda': 1.4494033099871717}}
# Tuned parameters used for the final model
base_params_lgb = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.01,
    'num_leaves': 14,
    'max_depth': 19,
    'min_data_in_leaf': 37,
    'min_child_weight': 1.6,
    'bagging_fraction': 0.98,
    'feature_fraction': 0.69,
    'bagging_freq': 96,
    'reg_lambda': 9,
    'reg_alpha': 7,
    'min_split_gain': 0.4,
    'nthread': 8,
    'seed': 2020,
    'verbose': -1,
}
# With the smaller learning rate, use cross-validation to find how many boosting rounds are needed
cv_result_lgb = lgb.cv(
    train_set=train_matrix,
    early_stopping_rounds=1000,
    num_boost_round=20000,
    nfold=5,
    stratified=True,
    shuffle=True,
    params=base_params_lgb,
    metrics='auc',
    seed=0
)
print('Number of iterations: {}'.format(len(cv_result_lgb['auc-mean'])))
print('AUC of the final model: {}'.format(max(cv_result_lgb['auc-mean'])))
import lightgbm as lgb
# Model with LightGBM using 5-fold cross-validation and the tuned parameters
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    # kf.split returns positions, so index both X and y with .iloc
    X_train_split, y_train_split = X_train.iloc[train_index], y_train.iloc[train_index]
    X_val, y_val = X_train.iloc[valid_index], y_train.iloc[valid_index]
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.01,
        'num_leaves': 14,
        'max_depth': 19,
        'min_data_in_leaf': 37,
        'min_child_weight': 1.6,
        'bagging_fraction': 0.98,
        'feature_fraction': 0.69,
        'bagging_freq': 96,
        'reg_lambda': 9,
        'reg_alpha': 7,
        'min_split_gain': 0.4,
        'nthread': 8,
        'seed': 2020,
    }
    model = lgb.train(params, train_set=train_matrix, num_boost_round=14269, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)
print("lgb_score_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
# Train the final model with base_params_lgb, the tuned parameter dictionary defined earlier.
# train_matrix and valid_matrix are the ones left over from the last cross-validation fold.
final_model_lgb = lgb.train(base_params_lgb, train_set=train_matrix, valid_sets=valid_matrix, num_boost_round=13000, verbose_eval=1000, early_stopping_rounds=200)
# Predict on the validation set and compute the ROC metrics
val_pre_lgb = final_model_lgb.predict(X_val, num_iteration=final_model_lgb.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the tuned LightGBM model on the validation set: {}'.format(roc_auc))
# Plot the ROC curve
plt.figure(figsize=(8, 8))
plt.title('Validation ROC')
plt.plot(fpr, tpr, 'b', label='Val AUC = %0.4f' % roc_auc)
plt.ylim(0, 1)
plt.xlim(0, 1)
plt.legend(loc='best')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# Draw the diagonal, which corresponds to random guessing
plt.plot([0, 1], [0, 1], 'r--')
plt.show()
import pickle
# Save the final model; the 'dataset' directory must already exist
with open('dataset/model_lgb_best.pkl', 'wb') as f:
    pickle.dump(final_model_lgb, f)