Datawhale Introduction to Financial Risk Control from Scratch, Task 5: Model Ensembling (Check-in)

Posted by 蘋果樹MM on 2020-09-27

Original article: https://blog.csdn.net/weixin_33569641/article/details/108839169

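Task 5: Model ensembling

Tip: This is the Task 5 (model ensembling) part of the Introduction to Financial Risk Control from Scratch series. Feedback and discussion are welcome.

Competition: Introduction to Data Mining from Scratch - Introduction to Financial Risk Control: Loan Default Prediction

Project repository: https://github.com/datawhalechina/team-learning-data-mining/tree/master/FinancialRiskControl

Competition page: https://tianchi.aliyun.com/competition/entrance/531830/introduction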

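5.1 Learning objectives

Ensemble the results of the earlier modeling and parameter-tuning tasks. Try several ensembling schemes, submit the ensembled result, and check in. (Model ensembling is generally used near the end of the A-leaderboard stage and throughout the B-leaderboard stage.)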

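5.2 Overview

Model ensembling is an important way to gain score late in a competition. In team competitions especially, ensembling the models of different teammates can bring unexpectedly good results. The more the models differ from each other, provided each one performs well on its own, the more the ensembled result tends to improve. The ensembling methods are listed below.

Averaging:
- Simple averaging
- Weighted averaging
Voting:
- Simple voting
- Weighted voting
Combined:
- Rank fusion
- Log fusion
stacking:
- Build multi-layer models and fit a further model on the predictions.
blending:
- Train on part of the data, use the resulting predictions as new features, and feed them into the remaining data for prediction.
boosting/bagging (already covered in Task 4, so not repeated here)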
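
The list names rank fusion and log fusion without showing them. The following is a minimal sketch of one common reading of each, assuming that rank fusion means averaging each model's normalized ranks and that log fusion means averaging log-probabilities (a geometric mean). The function names and the arrays `pred_a` and `pred_b` are made up for illustration and are not competition output.

```python
import numpy as np

def rank_average(preds):
    """Average the normalized ranks of each model's scores."""
    preds = np.asarray(preds, dtype=float)        # shape: (n_models, n_samples)
    # argsort applied twice yields 0-based ranks (ties are broken by position)
    ranks = preds.argsort(axis=1).argsort(axis=1)
    return (ranks / (preds.shape[1] - 1)).mean(axis=0)

def log_average(preds, eps=1e-15):
    """Geometric mean of probabilities, computed as exp(mean(log(p)))."""
    preds = np.clip(np.asarray(preds, dtype=float), eps, 1.0)
    return np.exp(np.log(preds).mean(axis=0))

# Illustrative scores from two hypothetical models
pred_a = np.array([0.10, 0.40, 0.35, 0.80])
pred_b = np.array([0.20, 0.90, 0.30, 0.95])

rank_fused = rank_average([pred_a, pred_b])
log_fused = log_average([pred_a, pred_b])
print(rank_fused)
print(log_fused)
```

Rank fusion only uses the ordering of the scores, so it can suit an order-based metric such as AUC when the models' scores are on different scales.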

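5.3 Stacking and blending in detail

Stacking takes the predictions produced by several base learners and uses them as a new training set for another learner. Suppose there are five base learners: the data is fed into all five to obtain predictions, and those predictions are then fed into a sixth model for training and prediction. However, if the sixth model is trained directly on predictions the base learners made for the same data they were fitted on, it overfits easily. So when predicting with the five base models, consider using K-fold cross-validation to prevent overfitting.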
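
The K-fold idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the competition pipeline: synthetic data from `make_classification` stands in for the loan data, only two base learners are used instead of five, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary data stands in for the competition data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

base_models = [GaussianNB(),
               RandomForestClassifier(n_estimators=50, random_state=0)]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions for the training rows, one column per base model
oof = np.zeros((X_train.shape[0], len(base_models)))
# Test-set predictions, averaged over the fold models
test_meta = np.zeros((X_test.shape[0], len(base_models)))

for j, model in enumerate(base_models):
    for fit_idx, hold_idx in kf.split(X_train):
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # Each training row is predicted by a model that never saw it.
        oof[hold_idx, j] = model.predict_proba(X_train[hold_idx])[:, 1]
        test_meta[:, j] += model.predict_proba(X_test)[:, 1] / kf.get_n_splits()

# Second layer: fit the meta-model on the out-of-fold predictions.
meta_model = LogisticRegression()
meta_model.fit(oof, y_train)
stacked_pred = meta_model.predict_proba(test_meta)[:, 1]
```

Because the meta-model only sees predictions made on held-out folds, it cannot simply learn to trust base models that memorized their own training rows.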

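Blending differs from stacking: the predicted values are used as new features and merged with the original features to form a new feature set for prediction. To prevent overfitting, the training data is split into two parts, d1 and d2. The base models are trained on d1 and predict d2, and those predictions become the new feature. A second model is then trained on d2 together with the new feature and used to predict the test set, which gets the same new feature from the d1-trained base models.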
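
A minimal sketch of this d1/d2 scheme, with the new feature concatenated to the original features, might look like the following. It assumes synthetic data in place of the real training and test sets and a single base model; every name here is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data stands in for the real training and test sets (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Split the training data into two disjoint parts, d1 and d2.
X_d1, X_d2, y_d1, y_d2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=1)

# Layer 1: fit the base model on d1 only.
base = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_d1, y_d1)

# The base model's predicted probability becomes one extra feature column
# placed next to the original features.
d2_aug = np.hstack([X_d2, base.predict_proba(X_d2)[:, [1]]])
test_aug = np.hstack([X_test, base.predict_proba(X_test)[:, [1]]])

# Layer 2: fit on d2 (original features plus the new feature), then predict the test set.
blender = LogisticRegression(max_iter=1000).fit(d2_aug, y_d2)
blend_pred = blender.predict_proba(test_aug)[:, 1]
```

The base model never sees d2 during fitting, so the new feature on d2 is an honest out-of-sample prediction.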

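Differences between blending and stacking
- Advantages of blending
  Because the two layers of blending are trained on different data (d1 and d2), information leakage is avoided.
  In team competitions, you do not need to share your random seed for the fold split with teammates.
- Disadvantages of blending
  Because blending splits the data into two parts, some of the data is ignored by each layer in the final prediction.
  The second layer may also overfit because it is trained on less data.
Reference: still not fully clear? See the following article for more detail: https://blog.csdn.net/wuzhongqiang/article/details/105012739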

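5.4 Code examples

5.4.1 Averaging

Simple averaging fuses the results directly by taking the mean of several predictions. pre1 through pren are the predictions of n models, each given equal weight.

pre = (pre1 + pre2 + pre3 + … + pren) / n

Weighted averaging usually sets the weights according to how accurate each model was earlier, giving the more accurate models higher weights.

pre = 0.3*pre1 + 0.3*pre2 + 0.4*pre3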
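
The two formulas above can be sketched with NumPy. The three prediction arrays below are made-up values for illustration, and the weights are the ones from the formula.

```python
import numpy as np

# Hypothetical predicted default probabilities from three models (illustrative values)
pre1 = np.array([0.2, 0.7, 0.5])
pre2 = np.array([0.4, 0.9, 0.3])
pre3 = np.array([0.3, 0.8, 0.4])
preds = np.vstack([pre1, pre2, pre3])     # shape: (n_models, n_samples)

# Simple average: every model has the same weight
simple_avg = preds.mean(axis=0)

# Weighted average: the weights should sum to 1
weights = [0.3, 0.3, 0.4]
weighted_avg = np.average(preds, axis=0, weights=weights)

print(simple_avg)
print(weighted_avg)
```

In practice the weights would be chosen from validation scores rather than fixed by hand.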

5.4.2 Voting

Simple voting

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
# x_train, y_train and x_test are assumed to come from the earlier tasks
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=4, min_child_weight=2, subsample=0.7, objective='binary:logistic')
 
# With no voting argument, VotingClassifier uses hard (majority) voting
vclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('xgb', clf3)])
vclf = vclf.fit(x_train, y_train)
print(vclf.predict(x_test))

Weighted voting
Add the arguments voting='soft' and weights=[2, 1, 1] to VotingClassifier; weights adjusts the weight of each base model.

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=4, min_child_weight=2, subsample=0.7, objective='binary:logistic')
 
# Soft voting averages the predicted probabilities; lr counts twice as much as rf and xgb
vclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('xgb', clf3)], voting='soft', weights=[2, 1, 1])
vclf = vclf.fit(x_train, y_train)
print(vclf.predict(x_test))
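
To make the two voting modes concrete, here is a sketch that computes hard voting and weighted soft voting by hand with NumPy. The class probabilities are made-up values chosen so that the two modes disagree; the weights match the weights=[2, 1, 1] used above.

```python
import numpy as np

# Made-up class probabilities for 2 samples from 3 models,
# shape: (n_models, n_samples, n_classes)
probas = np.array([
    [[0.90, 0.10], [0.45, 0.55]],   # model 1
    [[0.40, 0.60], [0.40, 0.60]],   # model 2
    [[0.40, 0.60], [0.90, 0.10]],   # model 3
])
weights = np.array([2, 1, 1])

# Hard voting: each model casts one vote for its most likely class.
votes = probas.argmax(axis=2)                       # shape: (n_models, n_samples)
hard = np.array([np.bincount(col, minlength=2).argmax() for col in votes.T])

# Weighted soft voting: weighted mean of the probabilities, then argmax.
soft = np.average(probas, axis=0, weights=weights).argmax(axis=1)

print(hard)   # majority class per sample
print(soft)   # class with the highest weighted mean probability
```

Here hard voting picks class 1 for both samples, while weighted soft voting picks class 0, because one model is very confident and, for the first sample, also carries double weight.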

5.4.3 Stacking

import warnings
warnings.filterwarnings('ignore')
import itertools
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from mlxtend.plotting import plot_decision_regions


# Use the iris dataset bundled with scikit-learn as an example
iris = datasets.load_iris()
# Keep only two features (columns 1 and 2) so the decision regions can be drawn in 2D
X, y = iris.data[:, 1:3], iris.target


clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)


labels = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]

fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0, 1], repeat=2)


clf_cv_mean = []
clf_cv_std = []
for clf, label, grd in zip(clf_list, labels, grid):
    # 5-fold cross-validated accuracy of each model
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %.2f (+/- %.2f) [%s]" % (scores.mean(), scores.std(), label))
    clf_cv_mean.append(scores.mean())
    clf_cv_std.append(scores.std())

    # Refit on all the data and draw the decision regions in this model's subplot
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(label)


plt.show()

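5.4.4 Blending

# Continues from the stacking example above (iris, np, LogisticRegression,
# RandomForestClassifier and train_test_split are already imported)
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Use the iris dataset bundled with scikit-learn as an example;
# the first 100 rows contain only classes 0 and 1, which makes the task binary
data_0 = iris.data
data = data_0[:100,:]


target_0 = iris.target
target = target_0[:100]
 
# Base learners used in the ensemble
clfs = [LogisticRegression(),
        RandomForestClassifier(),
        ExtraTreesClassifier(),
        GradientBoostingClassifier()]
 
# Hold out part of the data as the test set
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=914)


# Split the training data into two parts, d1 and d2
X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.5, random_state=914)
# dataset_d1 holds the base models' predictions on d2 (second-layer training features);
# dataset_d2 holds their predictions on the test set
dataset_d1 = np.zeros((X_d2.shape[0], len(clfs)))
dataset_d2 = np.zeros((X_predict.shape[0], len(clfs)))
 
for j, clf in enumerate(clfs):
    # Train each base model in turn on d1
    clf.fit(X_d1, y_d1)
    y_submission = clf.predict_proba(X_d2)[:, 1]
    dataset_d1[:, j] = y_submission
    # For the test set, use the predictions of these models directly as the new features.
    dataset_d2[:, j] = clf.predict_proba(X_predict)[:, 1]
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_d2[:, j]))


# Model used for the second (blending) layer
clf = GradientBoostingClassifier()
clf.fit(dataset_d1, y_d2)
y_submission = clf.predict_proba(dataset_d2)[:, 1]
print("Val auc Score of Blending: %f" % (roc_auc_score(y_predict, y_submission)))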

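5.5 Lessons learned

Simple averaging and weighted averaging are the two ensembling methods most commonly used in competitions. Their advantage is that they are quick and simple.
Stacking dominates in many competitions, but anyone who has run the code will have felt how slow it is. The gain from adding more stacking layers does not make up for the extra time and memory it costs, so it is still hard to apply in a production setting. In competitions with a defense round, the organizers also take model complexity into account to some extent, so more ensembling layers is not always better.
Of course, mixing weighted averaging, stacking, and blending in one competition is also a strategy and may bring unexpectedly good results!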

5.6 Reference

https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.33.3b302f06qm1e4L&postId=129323

I copied the homework at the last minute before the check-in deadline, which also helped me organize my own thinking.
