Financial Risk Control - Loan Default Prediction - Task04: Modeling and Parameter Tuning

Posted by 火星有星火 on 2020-09-24

Financial Risk Control Learning Competition

https://tianchi.aliyun.com/competition/entrance/531830/information

I. Competition Data

The task is to predict whether a user will default on a loan. The dataset, visible and downloadable after registration, comes from the loan records of a consumer-credit platform. It contains over 1.2 million records with 47 columns of variables, 15 of which are anonymized. To keep the competition fair, 800,000 records are drawn as the training set and 200,000 each as test set A and test set B, and fields such as employmentTitle, purpose, postCode and title are desensitized.

Import the data-analysis libraries

# Import standard libraries (duplicate io/math imports removed)
import io, os, sys, types, time, datetime, math, random, requests, subprocess, tempfile

# Import third-party libraries
# Data processing
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
from tqdm import tqdm
import missingno
import seaborn as sns
# from pandas.tools.plotting import scatter_matrix  # No module named 'pandas.tools'
from mpl_toolkits.mplot3d import Axes3D
# plt.style.use('seaborn')  # change the plot style
plt.rcParams['font.family'] = ['Arial Unicode MS', 'Microsoft Yahei', 'SimHei', 'sans-serif']  # fix garbled Chinese text in plots
plt.rcParams['axes.unicode_minus'] = False  # fix the minus sign rendering as a box with the SimHei font

# Feature selection and encoding
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize # Imputer
# from fancyimpute import BiScaler, KNN, NuclearNormMinimization, SoftImpute

# Machine learning
import sklearn.ensemble as ske
from sklearn import datasets, model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

# Grid search and randomized search
import scipy.stats as st
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

# Model metrics (classification)
from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

# Warning handling
import warnings
warnings.filterwarnings('ignore')

# Inline plotting in Jupyter
%matplotlib inline

# Data preprocessing (numpy, matplotlib and seaborn are already imported above)
import scipy as sc
import sklearn as sk

# Plotting toolkits
import pyecharts.options as opts
from pyecharts.charts import Line, Grid

Loading the datasets

  • train
  • test
# Dataset paths

train_path = 'train.csv'
test_path = 'testA.csv'
dataset_path = './'
data_train_path = dataset_path + train_path
data_test_path = dataset_path + test_path


# Read the CSV datasets
train = pd.read_csv(data_train_path)
test_a = pd.read_csv(data_test_path)

Task 4: Modeling and Parameter Tuning

  • Learn the machine learning models commonly used in financial risk control
  • Learn the modeling process and parameter-tuning workflow for machine learning models

Introduction to the Model Principles

Since the underlying algorithms take considerable space to explain, here are some recommended blog posts and textbooks that beginners can use to fill in the background.

  • 1 Logistic regression

    • https://blog.csdn.net/han_xiaoyang/article/details/49123419
  • 2 Decision trees

    • https://blog.csdn.net/c406495762/article/details/76262487
  • 3 GBDT

    • https://zhuanlan.zhihu.com/p/45145899
  • 4 XGBoost

    • https://blog.csdn.net/wuzhongqiang/article/details/104854890
  • 5 LightGBM

    • https://blog.csdn.net/wuzhongqiang/article/details/105350579
  • 6 CatBoost

    • https://mp.weixin.qq.com/s/xloTLr5NJBgBspMQtxPoFA
  • 7 Time-series models (optional)

    • RNN: https://zhuanlan.zhihu.com/p/45289691

    • LSTM: https://zhuanlan.zhihu.com/p/83496936

  • 8 Recommended textbooks:

    • 《機器學習》 (Machine Learning, Zhou Zhihua) https://book.douban.com/subject/26708119/

    • 《統計學習方法》 (Statistical Learning Methods, Li Hang) https://book.douban.com/subject/10590856/

    • 《面向機器學習的特徵工程》 (Feature Engineering for Machine Learning) https://book.douban.com/subject/26826639/

    • 《信用評分模型技術與應用》 (Credit Scoring Models: Technology and Application) https://book.douban.com/subject/1488075/

    • 《資料化風控》 (Data-Driven Risk Control) https://book.douban.com/subject/30282558/
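All of the models listed above follow the same scikit-learn style fit/predict workflow, so they are easy to benchmark side by side before settling on one. The sketch below is illustrative and not part of the original post: it assumes X_train and y_train are prepared as in the modeling section later in this post and contain only numeric columns.

# Hedged baseline sketch: compare a few of the models listed above on AUC.
# Assumes X_train / y_train are prepared as in the modeling section below
# and contain only numeric (non-category) columns for the sklearn models.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb

baselines = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(max_depth=6),
    'LightGBM': lgb.LGBMClassifier(n_estimators=100),
}
for name, clf in baselines.items():
    # 3-fold cross-validation on AUC, the competition metric
    scores = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
    print('{:<20s} AUC: {:.4f} (+/- {:.4f})'.format(name, scores.mean(), scores.std()))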

Modeling Code

Import the relevant modules and initialize the configuration

import pandas as pd
import numpy as np
import warnings
import os
import seaborn as sns
import matplotlib.pyplot as plt
"""
sns 相關設定
@return:
"""
# 宣告使用 Seaborn 樣式
sns.set()
# 有五種seaborn的繪圖風格,它們分別是:darkgrid, whitegrid, dark, white, ticks。預設的主題是darkgrid。
sns.set_style("whitegrid")
# 有四個預置的環境,按大小從小到大排列分別為:paper, notebook, talk, poster。其中,notebook是預設的。
sns.set_context('talk')
# 中文字型設定-黑體
plt.rcParams['font.sans-serif'] = ['SimHei']
# 解決儲存影像是負號'-'顯示為方塊的問題
plt.rcParams['axes.unicode_minus'] = False
# 解決Seaborn中文顯示問題並調整字型大小
sns.set(font='SimHei')

Reading the Data

  • The reduce_mem_usage function downcasts each column of a DataFrame to the narrowest dtype that can hold its values, which substantially reduces memory usage; it is well worth using when the dataset is large.
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() 
    print('Memory usage of dataframe is {:.2f} bytes'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() 
    print('Memory usage after optimization is: {:.2f} bytes'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
# Read the data
train = pd.read_csv('./train.csv')
test = pd.read_csv('./testA.csv')
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
Memory usage of dataframe is 300800080.00 bytes
Memory usage after optimization is: 72834896.00 bytes
Decreased by 75.8%
Memory usage of dataframe is 73600080.00 bytes
Memory usage after optimization is: 18034472.00 bytes
Decreased by 75.5%
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
id                    800000 non-null int32
loanAmnt              800000 non-null float16
term                  800000 non-null int8
interestRate          800000 non-null float16
installment           800000 non-null float16
grade                 800000 non-null category
subGrade              800000 non-null category
employmentTitle       799999 non-null float32
employmentLength      753201 non-null category
homeOwnership         800000 non-null int8
annualIncome          800000 non-null float32
verificationStatus    800000 non-null int8
issueDate             800000 non-null category
isDefault             800000 non-null int8
purpose               800000 non-null int8
postCode              799999 non-null float16
regionCode            800000 non-null int8
dti                   799761 non-null float16
delinquency_2years    800000 non-null float16
ficoRangeLow          800000 non-null float16
ficoRangeHigh         800000 non-null float16
openAcc               800000 non-null float16
pubRec                800000 non-null float16
pubRecBankruptcies    799595 non-null float16
revolBal              800000 non-null float32
revolUtil             799469 non-null float16
totalAcc              800000 non-null float16
initialListStatus     800000 non-null int8
applicationType       800000 non-null int8
earliesCreditLine     800000 non-null category
title                 799999 non-null float16
policyCode            800000 non-null float16
n0                    759730 non-null float16
n1                    759730 non-null float16
n2                    759730 non-null float16
n3                    759730 non-null float16
n4                    766761 non-null float16
n5                    759730 non-null float16
n6                    759730 non-null float16
n7                    759730 non-null float16
n8                    759729 non-null float16
n9                    759730 non-null float16
n10                   766761 non-null float16
n11                   730248 non-null float16
n12                   759730 non-null float16
n13                   759730 non-null float16
n14                   759730 non-null float16
dtypes: category(5), float16(30), float32(3), int32(1), int8(8)
memory usage: 69.5 MB
# For feature engineering, see the previous post in this series
from sklearn.model_selection import KFold
# Split out features and target to prepare for cross-validation
y_train = train.loc[:,'isDefault']
X_train = train.drop(['id','issueDate','isDefault'], axis=1)
X_test = test.drop(['id','issueDate'], axis=1)

# 5-fold cross-validation
folds = 5
seed = 2020
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
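Note that after reduce_mem_usage the frame still contains pandas category columns (grade, subGrade, employmentLength, earliesCreditLine). LightGBM consumes the category dtype natively, but plain sklearn estimators need numeric input. The helper below is a hypothetical sketch, not part of the original pipeline, showing a minimal label encoding for those columns.

# Hypothetical helper: label-encode the remaining 'category' columns so that
# plain sklearn estimators can use the same frame. LightGBM itself handles
# the category dtype directly, so this step is optional for the code below.
def encode_categories(df):
    df = df.copy()
    for col in df.select_dtypes(include='category').columns:
        df[col] = df[col].cat.codes  # NaN becomes -1
    return df

# X_train_enc = encode_categories(X_train)
# X_test_enc = encode_categories(X_test)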

Modeling Example

  • Model with the gradient-boosting ensemble algorithm LightGBM, evaluated with 5-fold cross-validation
import lightgbm as lgb
"""使用lightgbm 5折交叉驗證進行建模預測"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]
    
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)

    params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'learning_rate': 0.1,
                'metric': 'auc',
        
                'min_child_weight': 1e-3,
                'num_leaves': 31,
                'max_depth': -1,
                'reg_lambda': 0,
                'reg_alpha': 0,
                'feature_fraction': 1,
                'bagging_fraction': 1,
                'bagging_freq': 0,
                'seed': 2020,
                'nthread': 8,
                'verbose': -1,
    }
    
    model = lgb.train(params, train_set=train_matrix, num_boost_round=2000, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=10)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    
    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_scotrainre_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
************************************ 1 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[108]	valid_0's auc: 0.71916
[0.7191601264391831]
************************************ 2 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[110]	valid_0's auc: 0.715545
[0.7191601264391831, 0.715544695574905]
************************************ 3 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[88]	valid_0's auc: 0.718961
[0.7191601264391831, 0.715544695574905, 0.7189611956227128]
************************************ 4 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[132]	valid_0's auc: 0.718808
[0.7191601264391831, 0.715544695574905, 0.7189611956227128, 0.7188078144554632]
************************************ 5 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[126]	valid_0's auc: 0.71875
[0.7191601264391831, 0.715544695574905, 0.7189611956227128, 0.7188078144554632, 0.7187502453796062]
lgb_score_list:[0.7191601264391831, 0.715544695574905, 0.7189611956227128, 0.7188078144554632, 0.7187502453796062]
lgb_score_mean:0.718244815494374
lgb_score_std:0.0013575028097615738
from sklearn import metrics
from sklearn.metrics import roc_auc_score

"""預測並計算roc的相關指標"""
val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the untuned LightGBM model on the validation set: {}'.format(roc_auc))
"""畫出roc曲線圖"""
plt.figure(figsize=(8, 8))
plt.title('Validation ROC')
plt.plot(fpr, tpr, 'b', label = 'Val AUC = %0.4f' % roc_auc)
plt.ylim(0,1)
plt.xlim(0,1)
plt.legend(loc='best')
plt.title('ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# Plot the diagonal (random-guess baseline)
plt.plot([0,1],[0,1],'r--')
plt.show()
AUC of the untuned LightGBM model on the validation set: 0.7187502453796062

[Figure: ROC curve of the untuned LightGBM model on the validation set]

Model Tuning

  • Grid search (generally recommended)

  • sklearn provides GridSearchCV for grid search: pass in the model and a parameter grid, and it returns the best score and parameters. Compared with greedy tuning, grid search yields better results, but it is only practical for small datasets; once the data scale grows, it struggles to finish in reasonable time.

  • Again taking the LightGBM algorithm as an example, tune it with grid search:

"""通過網格搜尋確定最優引數"""
from sklearn.model_selection import GridSearchCV

def get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0, 
                       feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):
    # 5-fold stratified cross-validation
    cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
    
    model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,
                                   n_estimators=n_estimators,
                                   num_leaves=num_leaves,
                                   max_depth=max_depth,
                                   bagging_fraction=bagging_fraction,
                                   feature_fraction=feature_fraction,
                                   bagging_freq=bagging_freq,
                                   min_data_in_leaf=min_data_in_leaf,
                                   min_child_weight=min_child_weight,
                                   min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda,
                                   reg_alpha=reg_alpha,
                                   n_jobs= 8
                                  )
    grid_search = GridSearchCV(estimator=model_lgb, 
                               cv=cv_fold,
                               param_grid=param_grid,
                               scoring='roc_auc'
                              )
    grid_search.fit(X_train, y_train)

    print('Current best parameters: {}'.format(grid_search.best_params_))
    print('Current best score: {}'.format(grid_search.best_score_))
"""以下程式碼未執行,耗時較長,請謹慎執行,且每一步的最優引數需要在下一步進行手動更新,請注意"""

"""
需要注意一下的是,除了獲取上面的獲取num_boost_round時候用的是原生的lightgbm(因為要用自帶的cv)
下面配合GridSearchCV時必須使用sklearn介面的lightgbm。
"""
"""設定n_estimators 為581,調整num_leaves和max_depth,這裡選擇先粗調再細調"""
lgb_params = {'num_leaves': range(10, 80, 5), 'max_depth': range(3,10,2)}
get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=None, max_depth=None, min_data_in_leaf=20, 
                   min_child_weight=0.001,bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, 
                   min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)

"""num_leaves為30,max_depth為7,進一步細調num_leaves和max_depth"""
lgb_params = {'num_leaves': range(25, 35, 1), 'max_depth': range(5,9,1)}
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=None, max_depth=None, min_data_in_leaf=20, 
                   min_child_weight=0.001,bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, 
                   min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)

"""
確定min_data_in_leaf為45,min_child_weight為0.001 ,下面進行bagging_fraction、feature_fraction和bagging_freq的調參
"""
lgb_params = {'bagging_fraction': [i/10 for i in range(5,10,1)], 
              'feature_fraction': [i/10 for i in range(5,10,1)],
              'bagging_freq': range(0,81,10)
             }
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, 
                   min_child_weight=0.001,bagging_fraction=None, feature_fraction=None, bagging_freq=None, 
                   min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)

"""
確定bagging_fraction為0.4、feature_fraction為0.6、bagging_freq為 ,下面進行reg_lambda、reg_alpha的調參
"""
lgb_params = {'reg_lambda': [0,0.001,0.01,0.03,0.08,0.3,0.5], 'reg_alpha': [0,0.001,0.01,0.03,0.08,0.3,0.5]}
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, 
                   min_child_weight=0.001,bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40, 
                   min_split_gain=0, reg_lambda=None, reg_alpha=None, param_grid=lgb_params)

"""
確定reg_lambda、reg_alpha都為0,下面進行min_split_gain的調參
"""
lgb_params = {'min_split_gain': [i/10 for i in range(0,11,1)]}
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, 
                   min_child_weight=0.001,bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40, 
                   min_split_gain=None, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
"""
引數確定好了以後,我們設定一個比較小的learning_rate 0.005,來確定最終的num_boost_round
"""
# 5-fold stratified cross-validation
# cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
final_params = {
                'boosting_type': 'gbdt',
                'learning_rate': 0.01,
                'num_leaves': 29,
                'max_depth': 7,
                'min_data_in_leaf':45,
                'min_child_weight':0.001,
                'bagging_fraction': 0.9,
                'feature_fraction': 0.9,
                'bagging_freq': 40,
                'min_split_gain': 0,
                'reg_lambda':0,
                'reg_alpha':0,
                'nthread': 6
               }

# Build the full training Dataset for lgb.cv (lgb_train is not defined elsewhere in this post)
lgb_train = lgb.Dataset(X_train, label=y_train)

cv_result = lgb.cv(train_set=lgb_train,
                   early_stopping_rounds=20,
                   num_boost_round=5000,
                   nfold=5,
                   stratified=True,
                   shuffle=True,
                   params=final_params,
                   metrics='auc',
                   seed=0,
                  )

print('Number of boosting rounds: {}'.format(len(cv_result['auc-mean'])))
print('Cross-validated AUC: {}'.format(max(cv_result['auc-mean'])))
  • In practice, first set a relatively large learning rate (0.1 in the example above) and use LightGBM's native cv function to determine the number of trees, then optimize the other parameters with the example code above.

  • Finally, fix the optimal parameters, set a smaller learning rate (e.g. 0.05), and run the cv function again to determine the number of trees and the final parameters.

  • Note that on a large dataset every tuning round above can take a long time; when the full grid is too expensive, a randomized search is a practical alternative, as shown in the sketch below.
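A randomized search samples a fixed number of parameter combinations instead of enumerating the full grid, which keeps the cost predictable on large data. The sketch below is an illustrative alternative to the grid search above, not from the original post; the distributions, n_iter, and the use of subsample/colsample_bytree (the sklearn-API names for bagging_fraction/feature_fraction) are assumptions.

# Hedged sketch: randomized search over 20 sampled parameter combinations.
# subsample / colsample_bytree are the sklearn-API equivalents of
# bagging_fraction / feature_fraction; subsample_freq > 0 enables bagging.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
import lightgbm as lgb

param_dist = {
    'num_leaves': randint(10, 80),
    'max_depth': randint(3, 10),
    'subsample': uniform(0.5, 0.5),         # samples from [0.5, 1.0]
    'colsample_bytree': uniform(0.5, 0.5),
}
rs = RandomizedSearchCV(
    estimator=lgb.LGBMClassifier(learning_rate=0.1, n_estimators=100,
                                 subsample_freq=1, n_jobs=8),
    param_distributions=param_dist,
    n_iter=20,
    scoring='roc_auc',
    cv=3,
    random_state=2020,
)
# rs.fit(X_train, y_train)
# print(rs.best_params_, rs.best_score_)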

Summary

  • The basic modeling workflow evaluates model performance by splitting the dataset and running K-fold cross-validation, visualizes the result by plotting the model's ROC curve, and then tunes the parameters so the model fits the distribution of the dataset and achieves the desired classification performance.

II. Evaluation Metric

The submission is, for each test sample, the probability that it belongs to class 1, i.e. the probability that y = 1. Models are evaluated by AUC (larger is better).

Commonly used classification metrics include:

  • Accuracy, AUC, Recall, Precision, F1-score, Kappa

This learning competition uses AUC as its evaluation metric.

  • AUC is the area under the ROC curve.
  • The ROC space plots the false positive rate (FPR) on the X axis and the true positive rate (TPR) on the Y axis.
    • TPR: among all samples that are actually positive, the proportion correctly classified as positive.
    • FPR: among all samples that are actually negative, the proportion incorrectly classified as positive.
  • For a useful classifier, AUC lies between 0.5 and 1: the larger the area, the better the model separates the two classes, so the closer the AUC is to 1.0 the better, and an AUC of 1 means perfect separation. A short worked example follows this list.
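As a concrete illustration with toy numbers (not competition data), sklearn's roc_curve and roc_auc_score compute the FPR/TPR pairs and the area directly:

# Toy example: ROC points and AUC for hand-made labels and scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]            # actual labels
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of class 1
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)                             # [0.  0.  0.5 0.5 1. ]
print(tpr)                             # [0.  0.5 0.5 1.  1. ]
print(roc_auc_score(y_true, y_score))  # 0.75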

III. Submitting Results

Before submitting, make sure the format of your predictions matches sample_submit.csv and that the file extension is csv. A minimal sketch follows.
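A minimal sketch of producing the submission file with the trained native-API model from the modeling section above; the column names 'id' and 'isDefault' are assumptions based on the competition format, so check them against sample_submit.csv.

# Predict default probabilities on the test set and write the submission.
# Column names ('id', 'isDefault') are assumed to match sample_submit.csv.
test_pred = model.predict(X_test, num_iteration=model.best_iteration)
submission = pd.DataFrame({'id': test['id'], 'isDefault': test_pred})
submission.to_csv('submission.csv', index=False)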
