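Financial Risk Control: Loan Default Prediction, Task04: Modeling and Hyperparameter Tuning
Financial Risk Control Learning Competition
https://tianchi.aliyun.com/competition/entrance/531830/information
1. Competition Data
The task is to predict whether a borrower will default on a loan. The dataset can be viewed and downloaded after registration. It comes from the loan records of a credit platform and contains more than 1.2 million records with 47 columns, 15 of which are anonymized variables. To keep the competition fair, 800,000 records are drawn as the training set, 200,000 as test set A and 200,000 as test set B, and fields such as employmentTitle, purpose, postCode and title are desensitized.
Import the data analysis libraries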
# Standard library
import io, os, sys, types, time, datetime, math, random, subprocess, tempfile
# Third-party libraries
import requests
# Data handling
import numpy as np
import pandas as pd
import scipy as sc
import sklearn as sk
# Visualization
import matplotlib.pyplot as plt
from tqdm import tqdm
import missingno
import seaborn as sns
import pyecharts.options as opts
from pyecharts.charts import Line, Grid
# from pandas.tools.plotting import scatter_matrix  # No module named 'pandas.tools'
from mpl_toolkits.mplot3d import Axes3D
# plt.style.use('seaborn')  # change the plot style
plt.rcParams['font.family'] = ['Arial Unicode MS', 'Microsoft Yahei', 'SimHei', 'sans-serif']  # fonts that can render Chinese text
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign from rendering as a box with SimHei
# Feature selection and encoding
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize  # Imputer
# from fancyimpute import BiScaler, KNN, NuclearNormMinimization, SoftImpute
# Machine learning
import sklearn.ensemble as ske
from sklearn import datasets, model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
# Grid search and randomized search
import scipy.stats as st
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
# Model metrics (classification)
from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
# Warning handling
import warnings
warnings.filterwarnings('ignore')
# Render plots inline in Jupyter
%matplotlib inline
Loading the datasets
- train
- test
# Dataset paths
train_path = 'train.csv'
test_path = 'testA.csv'
dataset_path = './'
data_train_path = dataset_path + train_path
data_test_path = dataset_path + test_path
# Read the CSV files
train = pd.read_csv(data_train_path)
test_a = pd.read_csv(data_test_path)
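Task4: Modeling and Hyperparameter Tuning
- Learn the machine learning models commonly used in financial risk control
- Learn the modeling process and the hyperparameter tuning workflow for these models
Model Principles
Because the theory behind these algorithms is lengthy, the blogs and textbooks below are recommended for beginners who want to fill in the background.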
- 1 Logistic regression: https://blog.csdn.net/han_xiaoyang/article/details/49123419
- 2 Decision trees: https://blog.csdn.net/c406495762/article/details/76262487
- 3 GBDT: https://zhuanlan.zhihu.com/p/45145899
- 4 XGBoost: https://blog.csdn.net/wuzhongqiang/article/details/104854890
- 5 LightGBM: https://blog.csdn.net/wuzhongqiang/article/details/105350579
- 6 CatBoost: https://mp.weixin.qq.com/s/xloTLr5NJBgBspMQtxPoFA
- 7 Time series models (optional)
  - RNN: https://zhuanlan.zhihu.com/p/45289691
  - LSTM: https://zhuanlan.zhihu.com/p/83496936
- 8 Recommended textbooks
  - 《機器學習》 (Machine Learning) https://book.douban.com/subject/26708119/
  - 《統計學習方法》 (Statistical Learning Methods) https://book.douban.com/subject/10590856/
  - 《面向機器學習的特徵工程》 (Feature Engineering for Machine Learning) https://book.douban.com/subject/26826639/
  - 《信用評分模型技術與應用》 (Credit Scoring Models: Techniques and Applications) https://book.douban.com/subject/1488075/
  - 《資料化風控》 (Data-Driven Risk Control) https://book.douban.com/subject/30282558/
Modeling Code
Import the modules and initialize the configuration
import pandas as pd
import numpy as np
import warnings
import os
import seaborn as sns
import matplotlib.pyplot as plt
# Seaborn settings
# Seaborn has five plot styles: darkgrid, whitegrid, dark, white and ticks; the default is darkgrid.
# It has four preset contexts, from smallest to largest: paper, notebook, talk and poster; the default is notebook.
# Set style, context and a Chinese-capable font in a single call, because a later sns.set()
# call would reset the style and context to their defaults
sns.set(style='whitegrid', context='talk', font='SimHei')
# Use SimHei for Chinese text
plt.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign '-' from showing as a box in saved figures
plt.rcParams['axes.unicode_minus'] = False
Reading the Data
- The reduce_mem_usage function downcasts each column to a smaller dtype. This reduces memory usage and is useful when the dataset is large.
def reduce_mem_usage(df):
    """Downcast numeric columns and convert object columns to category."""
    # memory_usage() returns bytes; convert to MB so the printed unit is correct
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the smallest integer dtype whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Pick the smallest float dtype whose range contains the column
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns become pandas categoricals
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
# Read the data
train = pd.read_csv('./train.csv')
test = pd.read_csv('./testA.csv')
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
Memory usage of dataframe is 286.87 MB
Memory usage after optimization is: 69.46 MB
Decreased by 75.8%
Memory usage of dataframe is 70.19 MB
Memory usage after optimization is: 17.20 MB
Decreased by 75.5%
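The dtype selection inside reduce_mem_usage can be checked in isolation. The sketch below uses a hypothetical helper, smallest_int_dtype (not part of the original code), that applies the same strict range test to integers. It also shows a side effect worth keeping in mind: float16 has an 11-bit significand, so not every integer above 2048 can be stored exactly, which may matter for amount-like columns that end up as float16.

```python
import numpy as np

def smallest_int_dtype(c_min, c_max):
    """Hypothetical helper: smallest signed integer dtype whose range
    strictly contains [c_min, c_max], mirroring reduce_mem_usage."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if c_min > info.min and c_max < info.max:
            return dtype
    return np.int64

# A 0/1 flag fits in int8; an id running up to 799999 needs int32
print(np.dtype(smallest_int_dtype(0, 1)).name)
print(np.dtype(smallest_int_dtype(0, 799999)).name)

# float16 cannot represent 2049 exactly, so the stored value differs
stored = float(np.float16(2049.0))
print(stored, stored == 2049.0)
```

If exact values matter for a column, one option is to skip the float16 branch and start at float32.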
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
id 800000 non-null int32
loanAmnt 800000 non-null float16
term 800000 non-null int8
interestRate 800000 non-null float16
installment 800000 non-null float16
grade 800000 non-null category
subGrade 800000 non-null category
employmentTitle 799999 non-null float32
employmentLength 753201 non-null category
homeOwnership 800000 non-null int8
annualIncome 800000 non-null float32
verificationStatus 800000 non-null int8
issueDate 800000 non-null category
isDefault 800000 non-null int8
purpose 800000 non-null int8
postCode 799999 non-null float16
regionCode 800000 non-null int8
dti 799761 non-null float16
delinquency_2years 800000 non-null float16
ficoRangeLow 800000 non-null float16
ficoRangeHigh 800000 non-null float16
openAcc 800000 non-null float16
pubRec 800000 non-null float16
pubRecBankruptcies 799595 non-null float16
revolBal 800000 non-null float32
revolUtil 799469 non-null float16
totalAcc 800000 non-null float16
initialListStatus 800000 non-null int8
applicationType 800000 non-null int8
earliesCreditLine 800000 non-null category
title 799999 non-null float16
policyCode 800000 non-null float16
n0 759730 non-null float16
n1 759730 non-null float16
n2 759730 non-null float16
n3 759730 non-null float16
n4 766761 non-null float16
n5 759730 non-null float16
n6 759730 non-null float16
n7 759730 non-null float16
n8 759729 non-null float16
n9 759730 non-null float16
n10 766761 non-null float16
n11 730248 non-null float16
n12 759730 non-null float16
n13 759730 non-null float16
n14 759730 non-null float16
dtypes: category(5), float16(30), float32(3), int32(1), int8(8)
memory usage: 69.5 MB
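The non-null counts above show that several columns contain missing values. A quick way to see how much is missing is to turn those counts into ratios. The sketch below copies a few counts from the output above; the selection of columns is arbitrary. LightGBM handles NaN natively by default, which is why no imputation step appears before modeling here.

```python
# Non-null counts copied from the info() output (800000 rows in total)
n_rows = 800000
non_null = {
    'employmentLength': 753201,
    'dti': 799761,
    'n0': 759730,
    'n4': 766761,
    'n11': 730248,
}

# Missing ratio = 1 - non_null / total
missing_ratio = {col: 1 - count / n_rows for col, count in non_null.items()}

# Print columns from most to least missing
for col, ratio in sorted(missing_ratio.items(), key=lambda kv: kv[1], reverse=True):
    print('{:<18s}{:.2%}'.format(col, ratio))
```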
# For feature engineering, see the previous article in this series
from sklearn.model_selection import KFold
# Separate features and target to make cross-validation easier
y_train = train.loc[:, 'isDefault']
X_train = train.drop(['id', 'issueDate', 'isDefault'], axis=1)
X_test = test.drop(['id', 'issueDate'], axis=1)
# 5-fold cross-validation
folds = 5
seed = 2020
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
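The split above uses KFold, while the grid-search helper later in this article uses StratifiedKFold. The difference is whether each fold keeps the class ratio of the full dataset. The following sketch compares the two on synthetic labels; the 100 samples with 20% positives are an assumption made for illustration, not the competition data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels (assumption): 100 samples, 20 of them positive
y = np.array([1] * 20 + [0] * 80)
X = np.zeros((len(y), 1))  # feature values do not affect the split itself

def positives_per_fold(splitter):
    """Count the positive labels that land in each validation fold."""
    return [int(y[valid_idx].sum()) for _, valid_idx in splitter.split(X, y)]

kfold_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=2020))
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=2020))

print('KFold          :', kfold_counts)   # counts may vary from fold to fold
print('StratifiedKFold:', strat_counts)   # positives are spread evenly
```

With 800,000 rows the imbalance between plain KFold folds is usually small, but stratification removes it entirely at no extra cost.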
Modeling Example
- Use the ensemble algorithm LightGBM with 5-fold cross-validation
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
"""Model and predict with LightGBM using 5-fold cross-validation"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    # Use positional indexing for both the features and the labels
    X_train_split, y_train_split = X_train.iloc[train_index], y_train.iloc[train_index]
    X_val, y_val = X_train.iloc[valid_index], y_train.iloc[valid_index]
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'learning_rate': 0.1,
        'metric': 'auc',
        'min_child_weight': 1e-3,
        'num_leaves': 31,
        'max_depth': -1,
        'reg_lambda': 0,
        'reg_alpha': 0,
        'feature_fraction': 1,
        'bagging_fraction': 1,
        'bagging_freq': 0,
        'seed': 2020,
        'nthread': 8,
        'verbose': -1,
    }
    # LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments of lgb.train;
    # the equivalent callbacks below are available from LightGBM 3.3 onwards
    model = lgb.train(params, train_set=train_matrix, num_boost_round=2000, valid_sets=[valid_matrix],
                      callbacks=[lgb.early_stopping(stopping_rounds=10), lgb.log_evaluation(period=1000)])
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)
print("lgb_score_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
************************************ 1 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[108] valid_0's auc: 0.71916
[0.7191601264391831]
************************************ 2 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[110] valid_0's auc: 0.715545
[0.7191601264391831, 0.715544695574905]
************************************ 3 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[88] valid_0's auc: 0.718961
[0.7191601264391831, 0.715544695574905, 0.7189611956227128]
************************************ 4 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[132] valid_0's auc: 0.718808
[0.7191601264391831, 0.715544695574905, 0.7189611956227128, 0.7188078144554632]
************************************ 5 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[126] valid_0's auc: 0.71875
[0.7191601264391831, 0.715544695574905, 0.7189611956227128, 0.7188078144554632, 0.7187502453796062]
lgb_score_list:[0.7191601264391831, 0.715544695574905, 0.7189611956227128, 0.7188078144554632, 0.7187502453796062]
lgb_score_mean:0.718244815494374
lgb_score_std:0.0013575028097615738
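The log lines "Training until validation scores don't improve for 10 rounds" and "best iteration is" come from early stopping. The following is a minimal pure-Python sketch of that stopping rule for a metric that should be maximized, such as AUC. The function name and the validation curve are made up for illustration; this is not LightGBM's implementation.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, rounds_run) for a maximized metric, rounds are 1-based.

    Training stops as soon as `patience` rounds have passed without a score
    strictly better than the best one seen so far.
    """
    best_score, best_round = float('-inf'), 0
    for round_no, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no
    return best_round, len(scores)

# Made-up validation AUC curve: improves for 4 rounds, then plateaus
curve = [0.60, 0.65, 0.68, 0.70, 0.70, 0.69, 0.70, 0.69]
print(early_stopping_round(curve, patience=3))   # (4, 7)
print(early_stopping_round(curve, patience=10))  # (4, 8)
```

Read this way, "best iteration is [108]" with a patience of 10 suggests that fold 1 trained through round 118 and kept the model from round 108, which is why predictions use num_iteration=model.best_iteration.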
from sklearn import metrics
from sklearn.metrics import roc_auc_score
"""Predict and compute the ROC metrics (model, X_val and y_val come from the last fold above)"""
val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the untuned single LightGBM model on the validation set: {}'.format(roc_auc))
"""Plot the ROC curve"""
plt.figure(figsize=(8, 8))
plt.title('Validation ROC')
plt.plot(fpr, tpr, 'b', label='Val AUC = %0.4f' % roc_auc)
plt.ylim(0, 1)
plt.xlim(0, 1)
plt.legend(loc='best')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# Draw the diagonal, which corresponds to a random classifier
plt.plot([0, 1], [0, 1], 'r--')
plt.show()
AUC of the untuned single LightGBM model on the validation set: 0.7187502453796062
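Hyperparameter Tuning
- Grid search (generally recommended)
- sklearn provides GridSearchCV for grid search: given the model and a grid of candidate parameter values, it returns the best score and the corresponding parameters. Compared with greedy tuning, grid search usually finds a better result, but it is only practical on small datasets. Once the data volume grows, it becomes hard to obtain a result in reasonable time.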
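To see why grid search gets expensive, it helps to count model fits: every combination in the grid is trained once per fold. The sketch below counts the fits for the five grids that appear in the tuning steps of this article, assuming 5-fold cross-validation as in get_best_cv_params. The stage labels are made up, and GridSearchCV's final refit on the full data is not counted.

```python
from math import prod

N_FOLDS = 5  # number of cross-validation folds

# Parameter grids copied from the tuning steps in this article
grids = {
    'coarse leaves/depth': {'num_leaves': range(10, 80, 5), 'max_depth': range(3, 10, 2)},
    'fine leaves/depth': {'num_leaves': range(25, 35, 1), 'max_depth': range(5, 9, 1)},
    'bagging/feature': {'bagging_fraction': [i / 10 for i in range(5, 10, 1)],
                        'feature_fraction': [i / 10 for i in range(5, 10, 1)],
                        'bagging_freq': range(0, 81, 10)},
    'regularization': {'reg_lambda': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5],
                       'reg_alpha': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5]},
    'min_split_gain': {'min_split_gain': [i / 10 for i in range(0, 11, 1)]},
}

def n_fits(grid, n_folds=N_FOLDS):
    """Number of candidate combinations times the number of folds."""
    return prod(len(values) for values in grid.values()) * n_folds

for name, grid in grids.items():
    print('{:<22s}{:>5d} fits'.format(name, n_fits(grid)))
total_fits = sum(n_fits(grid) for grid in grids.values())
print('{:<22s}{:>5d} fits'.format('total', total_fits))
```

Each fit trains a full boosted model on 80% of the training rows, so the total grows quickly with both grid size and data size.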
- Taking LightGBM as the example again, tune it with grid search:
"""Find the best parameters with grid search"""
from sklearn.model_selection import GridSearchCV, StratifiedKFold
def get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0,
                       feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001,
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):
    # 5-fold stratified cross-validation
    cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
    # Parameters listed in param_grid are overridden by GridSearchCV for each candidate
    model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,
                                   n_estimators=n_estimators,
                                   num_leaves=num_leaves,
                                   max_depth=max_depth,
                                   bagging_fraction=bagging_fraction,
                                   feature_fraction=feature_fraction,
                                   bagging_freq=bagging_freq,
                                   min_data_in_leaf=min_data_in_leaf,
                                   min_child_weight=min_child_weight,
                                   min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda,
                                   reg_alpha=reg_alpha,
                                   n_jobs=8
                                   )
    grid_search = GridSearchCV(estimator=model_lgb,
                               cv=cv_fold,
                               param_grid=param_grid,
                               scoring='roc_auc'
                               )
    grid_search.fit(X_train, y_train)
    print('Current best parameters: {}'.format(grid_search.best_params_))
    print('Current best score: {}'.format(grid_search.best_score_))
"""The code below was not executed for this article because it takes a long time; run it with care. The best parameters found in each step must be carried into the next step by hand."""
"""
Note that only the step that determines num_boost_round uses native lightgbm (because it relies on the built-in cv);
the steps that work with GridSearchCV must use lightgbm's sklearn interface.
"""
"""Fix n_estimators at 581 and tune num_leaves and max_depth, first with a coarse grid and then with a fine one"""
lgb_params = {'num_leaves': range(10, 80, 5), 'max_depth': range(3, 10, 2)}
get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=None, max_depth=None, min_data_in_leaf=20,
                   min_child_weight=0.001, bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0,
                   min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
"""The coarse search gives num_leaves=30 and max_depth=7; refine num_leaves and max_depth around these values"""
lgb_params = {'num_leaves': range(25, 35, 1), 'max_depth': range(5, 9, 1)}
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=None, max_depth=None, min_data_in_leaf=20,
                   min_child_weight=0.001, bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0,
                   min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
"""
With min_data_in_leaf fixed at 45 and min_child_weight at 0.001, tune bagging_fraction, feature_fraction and bagging_freq
"""
lgb_params = {'bagging_fraction': [i/10 for i in range(5, 10, 1)],
              'feature_fraction': [i/10 for i in range(5, 10, 1)],
              'bagging_freq': range(0, 81, 10)
              }
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45,
                   min_child_weight=0.001, bagging_fraction=None, feature_fraction=None, bagging_freq=None,
                   min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
"""
With bagging_fraction=0.9, feature_fraction=0.9 and bagging_freq=40 (the values passed below), tune reg_lambda and reg_alpha
"""
lgb_params = {'reg_lambda': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5], 'reg_alpha': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5]}
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45,
                   min_child_weight=0.001, bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40,
                   min_split_gain=0, reg_lambda=None, reg_alpha=None, param_grid=lgb_params)
"""
With reg_lambda and reg_alpha both fixed at 0, tune min_split_gain
"""
lgb_params = {'min_split_gain': [i/10 for i in range(0, 11, 1)]}
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45,
                   min_child_weight=0.001, bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40,
                   min_split_gain=None, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
"""
With the parameters fixed, set a smaller learning_rate (0.01 in the code below) to determine the final num_boost_round
"""
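# 5-fold cross-validation
# cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
final_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',  # added: without it LightGBM defaults to a regression objective
    'learning_rate': 0.01,
    'num_leaves': 29,
    'max_depth': 7,
    'min_data_in_leaf': 45,
    'min_child_weight': 0.001,
    'bagging_fraction': 0.9,
    'feature_fraction': 0.9,
    'bagging_freq': 40,
    'min_split_gain': 0,
    'reg_lambda': 0,
    'reg_alpha': 0,
    'nthread': 6
}
# lgb_train was not defined in the original code; build it from the full training set
lgb_train = lgb.Dataset(X_train, label=y_train)
cv_result = lgb.cv(train_set=lgb_train,
                   num_boost_round=5000,
                   nfold=5,
                   stratified=True,
                   shuffle=True,
                   params=final_params,
                   metrics='auc',
                   seed=0,
                   # LightGBM 4.0 removed early_stopping_rounds from lgb.cv; use the callback instead
                   callbacks=[lgb.early_stopping(stopping_rounds=20)],
                   )
# The result key is 'auc-mean' in LightGBM 3.x and 'valid auc-mean' in 4.x
auc_key = [key for key in cv_result if key.endswith('auc-mean')][0]
print('Number of iterations: {}'.format(len(cv_result[auc_key])))
print('Cross-validated AUC: {}'.format(max(cv_result[auc_key])))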
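One detail of the coarse num_leaves/max_depth grid above is worth checking before spending compute on it. A binary tree of depth d has at most 2**d leaves, so when max_depth is set, any num_leaves above that cap cannot be reached. The sketch below (the helper name is made up) groups the coarse grid by the leaf limit that actually applies, assuming all other parameters stay fixed.

```python
def effective_grid(num_leaves_values, max_depth_values):
    """Group (num_leaves, max_depth) pairs by the leaf limit that actually applies."""
    groups = {}
    for depth in max_depth_values:
        for leaves in num_leaves_values:
            # A binary tree of depth `depth` has at most 2**depth leaves
            effective = min(leaves, 2 ** depth)
            groups.setdefault((effective, depth), []).append(leaves)
    return groups

num_leaves_values = range(10, 80, 5)
max_depth_values = range(3, 10, 2)
groups = effective_grid(num_leaves_values, max_depth_values)

print('grid points          :', len(num_leaves_values) * len(max_depth_values))
print('distinct leaf limits :', len(groups))
print('max_depth=3 collapses:', groups[(8, 3)])
```

With max_depth=3 every num_leaves value in the grid exceeds the cap of 8, so those candidates constrain the trees identically. Pruning such duplicates is one way to shorten the search.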
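- In practice, first set a relatively large learning rate (0.1 in the example above) and use LightGBM's native cv function to determine the number of trees, then tune the remaining parameters with the example code above.
- Finally, set a smaller learning rate for the best parameters (for example 0.05), use the cv function again to determine the number of trees, and fix the final parameters.
- Note that on large datasets every tuning stage above takes a long time.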
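The procedure above tunes one group of parameters at a time and carries the best values forward. This is cheaper than searching all parameters jointly, but it can miss interactions between groups. The sketch below shows that control flow in plain Python. The function names and the scoring function are made up for illustration: toy_score stands in for a cross-validated AUC and does not train any model.

```python
from itertools import product

def staged_grid_search(score_fn, base_params, stages):
    """Tune one group of parameters per stage, carrying the best values forward."""
    best_params = dict(base_params)
    best_score = score_fn(best_params)
    for grid in stages:
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            candidate = {**best_params, **dict(zip(names, values))}
            score = score_fn(candidate)
            if score > best_score:  # keep only strict improvements
                best_params, best_score = candidate, score
    return best_params, best_score

def toy_score(params):
    """Made-up score that peaks at num_leaves=29, max_depth=7, reg_alpha=0."""
    return (1.0
            - 0.001 * abs(params['num_leaves'] - 29)
            - 0.01 * abs(params['max_depth'] - 7)
            - params['reg_alpha'])

stages = [
    {'num_leaves': range(25, 35), 'max_depth': range(5, 9)},  # stage 1: tree shape
    {'reg_alpha': [0, 0.001, 0.01]},                          # stage 2: regularization
]
start = {'num_leaves': 31, 'max_depth': -1, 'reg_alpha': 0}
best, score = staged_grid_search(toy_score, start, stages)
print(best, score)
```

In a real run, score_fn would wrap a cross-validation call, so the number of calls to it is what determines the total tuning time.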
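Summary
- The basic modeling workflow here consists of splitting the dataset, evaluating model performance with K-fold cross-validation, and plotting the model's ROC curve. Hyperparameter tuning then adapts the model to the distribution of the dataset to reach a good classification result.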
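2. Evaluation Metric
The submission contains, for each test sample, the predicted probability that y is 1. Models are evaluated by AUC (higher is better).
Commonly used classification metrics:
- Accuracy, AUC, Recall, Precision, F1, Kappa
This learning competition uses AUC:
- AUC is the area under the ROC curve, that is, the area enclosed by the curve and the coordinate axes.
- ROC space uses the false positive rate (FPR) as the X axis and the true positive rate (TPR) as the Y axis.
- TPR: among all samples that are actually positive, the fraction correctly predicted as positive.
- FPR: among all samples that are actually negative, the fraction wrongly predicted as positive.
- AUC takes values in [0, 1]. A random classifier scores about 0.5, so useful models fall between 0.5 and 1. The larger the area, the better the model ranks positive samples above negative ones, and an AUC of 1 means every positive sample is scored above every negative sample.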
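The definitions above can be made concrete with a few lines of plain Python. The sketch below computes TPR and FPR at a single threshold, and computes AUC through its ranking interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function names and the four-sample data are made up for illustration; on real data sklearn's roc_auc_score is the practical choice.

```python
def tpr_fpr(y_true, scores, threshold):
    """TPR and FPR when samples with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg

def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(tpr_fpr(y_true, scores, threshold=0.5))        # (0.5, 0.0)
print(auc_by_ranking(y_true, scores))                # 0.75
print(auc_by_ranking(y_true, [-s for s in scores]))  # 0.25
```

The last line reverses the ranking and yields 0.25, which shows that AUC can fall below 0.5 when a model ranks samples worse than chance.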
3. Submitting Results
Before submitting, make sure the prediction file has the same format as sample_submit.csv and that its file extension is csv.
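A small check before uploading can catch format mistakes. The sketch below compares a submission file with the sample file on header and row count, and verifies that the last column holds probabilities. The helper name is hypothetical, the column names in the stand-in files are assumptions rather than values taken from sample_submit.csv, and treating the last column as the prediction is also an assumption.

```python
import csv
import os
import tempfile

def check_submission(submission_path, sample_path):
    """Return a list of format problems found by comparing with the sample file."""
    with open(sample_path, newline='') as f:
        sample = list(csv.reader(f))
    with open(submission_path, newline='') as f:
        submission = list(csv.reader(f))
    problems = []
    if not submission_path.endswith('.csv'):
        problems.append('file extension is not .csv')
    if submission[0] != sample[0]:
        problems.append('header {} differs from sample header {}'.format(submission[0], sample[0]))
    if len(submission) != len(sample):
        problems.append('row count {} differs from sample row count {}'.format(len(submission), len(sample)))
    for row in submission[1:]:
        # Assumption: the last column holds the predicted probability
        if not 0.0 <= float(row[-1]) <= 1.0:
            problems.append('prediction {} is not a probability'.format(row[-1]))
            break
    return problems

# Tiny stand-in files; the column names are assumptions for illustration
tmp_dir = tempfile.mkdtemp()
sample_file = os.path.join(tmp_dir, 'sample_submit.csv')
good_file = os.path.join(tmp_dir, 'submission.csv')
bad_file = os.path.join(tmp_dir, 'bad_submission.csv')
with open(sample_file, 'w', newline='') as f:
    csv.writer(f).writerows([['id', 'isDefault'], ['800000', '0.5'], ['800001', '0.5']])
with open(good_file, 'w', newline='') as f:
    csv.writer(f).writerows([['id', 'isDefault'], ['800000', '0.12'], ['800001', '0.87']])
with open(bad_file, 'w', newline='') as f:
    csv.writer(f).writerows([['id', 'prob'], ['800000', '1.7']])

print(check_submission(good_file, sample_file))  # no problems expected
print(check_submission(bad_file, sample_file))   # header, row count and range problems
```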