Loan Default Prediction Project: Data Binning

Posted by ReidChenJX on 2020-11-09

Original URL: https://blog.csdn.net/ReidChenJX/article/details/109575305

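We know that in models that use the sum of squared errors as the loss function, outliers greatly inflate the error. But simply deleting the outlier samples shrinks the training set and can also reduce model accuracy, especially in this project, where outliers are very likely to be positive cases. To keep the outliers while still improving the model, we can use data binning.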

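Feature Filtering

Before binning, we first remove features that carry little information. As a rough rule of thumb from information theory, a signal with large variance carries more information; likewise, a feature whose values are widely spread carries more information. So before processing the data we can drop near-constant features, that is, features dominated by a single value, since they carry little information.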

# Feature filtering: check the share of the most frequent value and drop features where it is at least 95%
def drop_single_variable(data):
    drop_list = []
    for col in data.columns:
        percent = data[col].value_counts().max() / float(len(data))
        if percent >= 0.95:
            drop_list.append(col)
    data.drop(drop_list,axis=1,inplace=True)
    return drop_list
drop_list = drop_single_variable(data_train)
data_test.drop(drop_list,axis=1,inplace=True)    # drop the same columns from the test set
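
To make the 95% rule concrete, here is a minimal sketch on made-up data. The DataFrame `toy` and its column names are hypothetical and only illustrate the check that `drop_single_variable` performs.

```python
import pandas as pd

# Hypothetical toy data: 'flag' is 0 in 98 of 100 rows, 'amount' takes 100 distinct values.
toy = pd.DataFrame({
    "flag": [0] * 98 + [1] * 2,
    "amount": list(range(100)),
})

# Share of the most frequent value in each column (value_counts ignores NaN by default).
dominant_share = {
    col: toy[col].value_counts().max() / len(toy) for col in toy.columns
}

# Columns whose most frequent value covers at least 95% of the rows.
to_drop = [col for col, share in dominant_share.items() if share >= 0.95]
print(dominant_share)
print(to_drop)
```

Only the near-constant column is selected for removal; the widely spread one is kept.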

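Chi-Square Binning

The first method that comes to mind is chi-square binning. It uses the chi-square statistic to merge adjacent groups of data until the desired bins are reached.
The chi-square statistic is:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where fo ($f_o$) is the observed frequency and fe ($f_e$) is the expected frequency. The chi-square value is computed as follows: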

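1. Compute the ratio of positive samples to all samples and use it as the expected ratio fp.
2. Sort the distinct values of feature A in ascending order. For each value a, count the samples with A=a and multiply by the expected ratio fp to get the expected frequency fe for A=a; count the positive samples with A=a to get the observed frequency fo. Apply the formula above to get the chi-square value for A=a.
3. Repeat for every possible value of feature A to obtain a chi-square value for each one.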

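What the chi-square value means: if the chi-square value computed for the samples with A=a is close to 0, then whether feature A equals a and whether the label Y is positive are independent events. In other words, A=a contributes nothing to predicting the label.
With the steps above we have computed a chi-square value for every possible value of feature A, which implicitly treats each distinct value as its own initial bin. Next we need to decide how to merge bins.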
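
As a small worked example of the per-value calculation, the sketch below uses made-up counts and the single-term form (observed positives against expected positives) that the code in this post computes. The helper name `chi_square_for_value` is hypothetical.

```python
def chi_square_for_value(n_samples, n_positive, expected_ratio):
    """Chi-square term for one feature value: (fo - fe)^2 / fe."""
    expected = n_samples * expected_ratio    # fe: expected number of positives
    return (n_positive - expected) ** 2 / expected

# Made-up data: 1000 samples overall, 200 of them positive, so fp = 0.2.
expected_ratio = 200 / 1000

# Value a: 50 samples, 20 positive (twice the expected 10), so it is informative.
chi_a = chi_square_for_value(50, 20, expected_ratio)

# Value b: 50 samples, 10 positive (exactly as expected), so it tells us nothing.
chi_b = chi_square_for_value(50, 10, expected_ratio)

print(chi_a, chi_b)
```

The first value deviates from the overall positive rate and gets a chi-square value of about 10, while the second matches the overall rate and gets a value of about 0.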

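4. Pick the bin with the smallest chi-square value and merge it with a neighbor: of the preceding and following bins, choose the one with the smaller chi-square value. Then recompute the chi-square value of the merged bin.
5. Repeat step 4 until the stopping condition is met, for example the number of bins has dropped to the target count, or every bin's chi-square value reaches a given threshold.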

Once the ordered data has been binned, we can assign samples to bins by interval using the pandas cut function.
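
A brief sketch of how `pd.cut` maps values onto bin indices, using hypothetical boundaries. With the default settings each interval is open on the left, so a value equal to the lowest boundary is not assigned to any bin unless `include_lowest=True` is passed.

```python
import pandas as pd

values = pd.Series([1, 5, 10, 20, 50])
boundary = [1, 10, 50]    # hypothetical bin edges: (1, 10] and (10, 50]

# Default: the lowest edge itself is excluded, so the value 1 becomes NaN.
default_bins = pd.cut(values, bins=boundary, labels=False)

# include_lowest=True closes the first interval on the left: [1, 10].
inclusive_bins = pd.cut(values, bins=boundary, labels=False, include_lowest=True)

print(default_bins.tolist())
print(inclusive_bins.tolist())
```

Values outside the outermost boundaries also become NaN, which matters when boundaries learned on the training set are applied to the test set.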

The chi-square binning code is as follows:

import numpy as np
import pandas as pd
from scipy.stats import chi2

# Compute the chi-square value for each distinct value of a feature
def get_chi2(data, col):
    # Overall positive ratio, used as the expected ratio
    pos_cnt = data['isDefault'].sum()
    all_cnt = data['isDefault'].count()
    expected_ratio = float(pos_cnt/all_cnt)
    
    # Distinct values of the feature, in ascending order
    df = data[[col,'isDefault']]
    col_value = list(set(df[col]))    # set removes duplicates
    col_value.sort()
    
    # Compute the chi-square statistic for each value
    chi_list = []
    pos_list = []
    expected_pos_list = []
    
    for value in col_value:
        df_pos_cnt = df.loc[df[col]==value, 'isDefault'].sum()    # observed frequency
        df_all_cnt = df.loc[df[col]==value, 'isDefault'].count()
        
        expected_pos_cnt = df_all_cnt * expected_ratio    # expected frequency
        chi_square = (df_pos_cnt - expected_pos_cnt)**2 / expected_pos_cnt
        
        chi_list.append(chi_square)
        pos_list.append(df_pos_cnt)
        expected_pos_list.append(expected_pos_cnt)
    
    # Collect the results in a DataFrame
    chi_result = pd.DataFrame({col:col_value,'chi_square':chi_list,
                              'pos_cnt':pos_list,'expected_pos_cnt':expected_pos_list})
    return chi_result

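# Look up the chi-square threshold for a given degree of freedom and significance level
def cal_chisqure_threshold(dfree=4, cf=0.1):
    
    percents = [0.95, 0.90, 0.5, 0.1, 0.05, 0.025, 0.01, 0.005]
    
    # Chi-square thresholds for degrees of freedom 1 to 29 at each significance level
    df = pd.DataFrame(np.array([chi2.isf(percents, df=i) for i in range(1, 30)]))
    df.columns = percents
    df.index = df.index+1
    
    pd.set_option('display.precision', 3)    # full option name; the bare 'precision' can be ambiguous in newer pandas versions
    return df.loc[dfree, cf]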

# Given a dataset and a feature name, use the maximum number of bins and the chi-square threshold
# to produce the chi-square table and the bin boundaries
def chiMerge_chisqure(data, col, dfree=4, cf=0.1, maxInterval=7):

    chi_result = get_chi2(data, col)
    threshold = cal_chisqure_threshold(dfree, cf)
    min_chiSquare = chi_result['chi_square'].min()
    group_cnt = len(chi_result)
    
    # Keep merging while the smallest chi-square value is below the threshold or there are
    # more than maxInterval bins; stop at two rows so that a merge partner always exists
    while group_cnt > 2 and (min_chiSquare < threshold or group_cnt > maxInterval):
        min_index = chi_result['chi_square'].idxmin()    # first row with the smallest chi-square value
        
        # If the bin is the first one, merge it with the next bin
        if min_index == 0:
            chi_result = merge_chiSquare(chi_result, min_index+1, min_index)    # this argument order keeps the smallest value first, which makes cutting intervals easier
        
        # If the bin is the last one, merge it with the previous bin
        elif min_index == group_cnt-1:
            chi_result = merge_chiSquare(chi_result, min_index-1, min_index)    # this argument order keeps the largest value last, which makes cutting intervals easier
        
        # If the bin is in the middle, merge it with whichever neighbor has the smaller chi-square value
        else:
            if chi_result.loc[min_index-1, 'chi_square'] > chi_result.loc[min_index+1, 'chi_square']:
                chi_result = merge_chiSquare(chi_result, min_index, min_index+1)
            else:
                chi_result = merge_chiSquare(chi_result, min_index-1, min_index)
        
        min_chiSquare = chi_result['chi_square'].min()
        
        group_cnt = len(chi_result)

    # The first column holds the feature value that represents each remaining bin
    boundary = list(chi_result.iloc[:,0])
    
    return chi_result, boundary


# Merge row `index` into row `mergeIndex` and recompute the chi-square value; mergeIndex is the row that remains after the merge
def merge_chiSquare(chi_result, index, mergeIndex, a = 'expected_pos_cnt',b = 'pos_cnt', c = 'chi_square'):

    chi_result.loc[mergeIndex, a] = chi_result.loc[mergeIndex, a] + chi_result.loc[index, a]
    chi_result.loc[mergeIndex, b] = chi_result.loc[mergeIndex, b] + chi_result.loc[index, b]
    # Chi-square value of the merged bin, recomputed from the summed observed and expected counts
    chi_result.loc[mergeIndex, c] = (chi_result.loc[mergeIndex, b] - chi_result.loc[mergeIndex, a])**2 /chi_result.loc[mergeIndex, a]
    
    chi_result = chi_result.drop([index])
    # Reset the index so that rows are numbered consecutively again
    chi_result = chi_result.reset_index(drop=True)
    
    return chi_result

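for fea in continuous_fea:
    chi_result, boundary = chiMerge_chisqure(data_train, fea)
    # include_lowest=True keeps samples equal to the smallest boundary in the first bin instead of turning them into NaN
    data_train[fea+'kf_bins'] = pd.cut(data_train[fea], bins= boundary, labels=False, include_lowest=True)
    data_test[fea+'kf_bins'] = pd.cut(data_test[fea], bins= boundary, labels=False, include_lowest=True)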
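
For reference, the threshold that `cal_chisqure_threshold` returns for its default arguments (4 degrees of freedom, significance level 0.1) can be checked directly with `scipy.stats.chi2`. This is only a sanity-check sketch, not a replacement for the lookup table above.

```python
from scipy.stats import chi2

# Inverse survival function: the value x such that P(X > x) = 0.1
# for a chi-square variable X with 4 degrees of freedom.
threshold = chi2.isf(0.1, df=4)
print(round(threshold, 3))    # about 7.779

# Round trip: the survival function at the threshold gives back the significance level.
tail_probability = chi2.sf(threshold, df=4)
print(round(tail_probability, 3))
```

Bins whose chi-square value falls below this threshold are the ones the merge loop treats as candidates for merging.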

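Advantages of chi-square binning:

1. It reduces the influence of outliers and anomalous points on the model. Collapsing tens of thousands of values into a single-digit number of bins effectively limits the impact of abnormal data. Missing values can also be given their own bin, which replaces missing-value imputation.
2. It helps prevent overfitting. The model is a binary classifier, and coarsening the data is an effective guard against overfitting.
3. For models trained with gradient descent, it speeds up fitting. Replacing the raw values with bin indices acts as a form of standardization and keeps differences in scale from interfering with the model.

Disadvantage of chi-square binning: it is slow, because the process involves a large amount of repeated computation. It can be sped up by starting from a fixed number of initial bins, for example by grouping values in units of 100 for the initial binning.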
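
One possible way to set up such an initial binning is sketched below. It assumes equal-frequency pre-binning with `pd.qcut` into at most 100 groups, which is our choice for illustration and not necessarily what the original project did; the data and the names `amounts` and `coarse` are made up.

```python
import numpy as np
import pandas as pd

# Made-up continuous feature with thousands of distinct values.
rng = np.random.default_rng(0)
amounts = pd.Series(rng.normal(loc=10_000, scale=3_000, size=5_000))

# Assumption: equal-frequency pre-binning into at most 100 initial groups.
codes, edges = pd.qcut(amounts, q=100, labels=False, retbins=True, duplicates="drop")

# Represent every sample by the upper edge of its initial bin, so the
# chi-square merge starts from at most 100 distinct values.
coarse = pd.Series(edges[1:][codes.to_numpy()], index=amounts.index)

print(amounts.nunique(), coarse.nunique())
```

The coarsened column could then be passed to the chi-square routine in place of the raw feature, so the merge loop starts from about 100 rows rather than one row per distinct value.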

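Decision Tree Binning

The practical point of binning is to choose suitable cut points and split the dataset at them.
Among traditional machine learning algorithms, the decision tree is the most direct model for splitting data.
We build a decision tree that uses information entropy as the split criterion; the number of leaf nodes is the number of bins we want. After training the tree, we read off the split thresholds chosen while it was grown, which gives the bin intervals. The process is as follows: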

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Use a decision tree to obtain the list of bin boundaries
def optimal_binning_boundary(x: pd.Series, y: pd.Series, nan: float = -999.) -> list:
    
    boundary = []  # list of bin boundaries to return
    
    x = x.fillna(nan).values  # fill missing values with a sentinel
    y = y.values
    
    clf = DecisionTreeClassifier(criterion='entropy',    # split by minimizing information entropy
                                 max_leaf_nodes=6,       # maximum number of leaf nodes
                                 min_samples_leaf=0.05)  # minimum fraction of samples in a leaf

    clf.fit(x.reshape(-1, 1), y)  # train the decision tree
    
    n_nodes = clf.tree_.node_count
    children_left = clf.tree_.children_left
    children_right = clf.tree_.children_right
    threshold = clf.tree_.threshold
    
    for i in range(n_nodes):
        if children_left[i] != children_right[i]:  # internal node: record its split threshold
            boundary.append(threshold[i])

    boundary.sort()

    min_x = x.min() - 0.1  
    max_x = x.max() + 0.1  # the -0.1 and +0.1 margins make sure samples at the feature's minimum and maximum are included in later groupby operations
    boundary = [min_x] + boundary + [max_x]
    return boundary

for fea in continuous_fea:
    boundary = optimal_binning_boundary(x=data_train[fea],y=data_train['isDefault'])
    data_train[fea+'_tr_bins'] = pd.cut(data_train[fea], bins= boundary, labels=False)
    data_test[fea+'_tr_bins'] = pd.cut(data_test[fea], bins= boundary, labels=False)
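
To see what the threshold extraction recovers, here is a minimal sketch on synthetic data where the label flips at a known point. The data is made up, and `random_state=0` is added only to keep the run reproducible.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic feature 0..99; the label is 1 exactly when x >= 60.
x = np.arange(100, dtype=float)
y = (x >= 60).astype(int)

clf = DecisionTreeClassifier(criterion="entropy",
                             max_leaf_nodes=6,
                             min_samples_leaf=0.05,
                             random_state=0)
clf.fit(x.reshape(-1, 1), y)

# Internal nodes have different left and right children; leaves have both set to -1.
tree = clf.tree_
is_split = tree.children_left != tree.children_right
thresholds = sorted(float(t) for t in tree.threshold[is_split])
print(thresholds)
```

Because the two classes separate cleanly, the tree needs a single split, placed midway between 59 and 60, even though up to six leaves are allowed.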

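Decision tree binning is clearly faster than chi-square binning, and the WOE values it yields are no worse. In practice, we recommend using a decision tree for binning.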

WOE and IV

After binning, the data can be put to further use by converting it to WOE and IV values.

# Compute the WOE and IV values of a feature
def call_WOE_IV(data, var, target):
    eps = 0.0001    # smoothing term that avoids division by zero and log(0)
    gbi = pd.crosstab(data[var], data[target]) + eps    # per-bin count of each class
    gb = data[target].value_counts() + eps              # total count of each class
    # Per-bin share of each class; the class labels are the integers 0 and 1, so rename with integer keys
    gbri = (gbi / gb).rename(columns={0:'0_i', 1:'1_i'})

    gbri['WOE'] = np.log(gbri['1_i'] / gbri['0_i'])
    gbri['IV'] = (gbri['1_i'] - gbri['0_i']) * gbri['WOE']
    
    congb = pd.concat([gbi,gbri],axis=1)
    return congb
# Compute the WOE values of the binned features and add them as new features
for col in data_train.columns:
    if 'tr_bins' in col:
        WOE_table = dict(call_WOE_IV(data_train,col,'isDefault')['WOE'])
        data_train[col+'_woe'] = data_train[col].map(WOE_table)
        data_test[col+'_woe'] = data_test[col].map(WOE_table)

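Many blog posts explain the meaning of WOE and IV, so we won't repeat it here. Feeding the WOE values into the model as new features can improve its accuracy to some extent.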
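
For readers who want a quick numeric reference, the sketch below works through WOE and IV on a made-up two-bin example. It follows the same sign convention as the code above (share of defaults divided by share of non-defaults) and leaves out the smoothing term; the DataFrame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up binned feature: bin 0 has 80 non-defaults and 20 defaults,
# bin 1 has 40 non-defaults and 60 defaults.
toy = pd.DataFrame({
    "bin": [0] * 100 + [1] * 100,
    "isDefault": [0] * 80 + [1] * 20 + [0] * 40 + [1] * 60,
})

counts = pd.crosstab(toy["bin"], toy["isDefault"])    # rows: bins, columns: classes 0 and 1
dist = counts / counts.sum(axis=0)                    # each bin's share of each class

woe = np.log(dist[1] / dist[0])             # WOE per bin
iv_per_bin = (dist[1] - dist[0]) * woe      # IV contribution per bin
iv_total = iv_per_bin.sum()                 # IV of the feature

print(woe.round(4).to_dict())
print(round(iv_total, 4))
```

Bin 0 holds a quarter of the defaults but two thirds of the non-defaults, so its WOE is negative (about -0.98); bin 1 is the reverse and gets a positive WOE (about 0.81). The feature's IV is the sum of the per-bin contributions, about 0.75 here.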
