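6. Confusion matrix:
A confusion matrix is drawn on a pair of axes: the x-axis takes the values 0 and 1, and so does the y-axis. The x-axis shows the predicted value and the y-axis shows the true value. It lets us compare the true values against the predictions and compute the evaluation metrics for the current model.
Accuracy here is (136+138)/(136+13+9+138). Recall was defined earlier as TP/(TP+FN); with the counts in this matrix that is 138/(138+9).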
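As a quick check of that arithmetic, here is a minimal sketch that derives accuracy and recall from the four cells. It assumes the layout TN=136, FP=13, FN=9, TP=138, which is inferred from the recall of roughly 0.94 reported for this test set; the variable names are only illustrative.

```python
# Cell counts assumed from the figures quoted above
# (rows = true label, columns = predicted label).
tn, fp = 136, 13   # true class 0: predicted 0, predicted 1
fn, tp = 9, 138    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all samples predicted correctly
recall = tp / (tp + fn)        # share of actual positives that were found

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

Under these assumed counts, accuracy comes out near 0.93 and recall near 0.94.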
Next, define a function that plots the confusion matrix:
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    # Plot the confusion matrix cm and write the count into each cell
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    # Use white text on dark cells and black text on light cells
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",  # must be spelled "center", not "cneter"
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Next, using the best C value found above, plot the confusion matrix for the undersampled data set.
import itertools
# Train on the undersampled training set with the best C from cross-validation
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset:", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Recall reaches about 93%. However, the test set used above is itself the undersampled one, so only a small fraction of the available data is being used.
Next, test against the test set that was split from the original data:
# Same model as before: trained on the undersampled training set
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
# This time predict on the full (original) test set
y_pred = lr.predict(X_test.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset:", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title="Confusion matrix")
plt.show()
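This time the test set has more than 80,000 samples. The result is acceptable, but over 10,000 samples are mispredicted, which is rather a lot.
So what happens if we build the model directly on the original data set? Let us look at recall when the class distribution is imbalanced.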
best_c = printing_Kfold_scores(X_train, y_train)
-------------------------------------------
C parameter: 0.01
-------------------------------------------
Iteration 0 : recall score = 0.4925373134328358
Iteration 1 : recall score = 0.6027397260273972
Iteration 2 : recall score = 0.6833333333333333
Iteration 3 : recall score = 0.5692307692307692
Iteration 4 : recall score = 0.45
Mean recall score 0.5595682284048672
-------------------------------------------
C parameter: 0.1
-------------------------------------------
Iteration 0 : recall score = 0.5671641791044776
Iteration 1 : recall score = 0.6164383561643836
Iteration 2 : recall score = 0.6833333333333333
Iteration 3 : recall score = 0.5846153846153846
Iteration 4 : recall score = 0.525
Mean recall score 0.5953102506435158
-------------------------------------------
C parameter: 1
-------------------------------------------
Iteration 0 : recall score = 0.5522388059701493
Iteration 1 : recall score = 0.6164383561643836
Iteration 2 : recall score = 0.7166666666666667
Iteration 3 : recall score = 0.6153846153846154
Iteration 4 : recall score = 0.5625
Mean recall score 0.612645688837163
-------------------------------------------
C parameter: 10
-------------------------------------------
Iteration 0 : recall score = 0.5522388059701493
Iteration 1 : recall score = 0.6164383561643836
Iteration 2 : recall score = 0.7333333333333333
Iteration 3 : recall score = 0.6153846153846154
Iteration 4 : recall score = 0.575
Mean recall score 0.6184790221704963
-------------------------------------------
C parameter: 100
-------------------------------------------
Iteration 0 : recall score = 0.5522388059701493
Iteration 1 : recall score = 0.6164383561643836
Iteration 2 : recall score = 0.7333333333333333
Iteration 3 : recall score = 0.6153846153846154
Iteration 4 : recall score = 0.575
Mean recall score 0.6184790221704963
*********************************************************************************
Best model to choose from cross validation is with C parameter 10.0
*********************************************************************************
As the log shows, recall sits at around 60%.
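To make the log easier to read, here is a small sketch that recomputes one of the printed means and picks the C with the highest mean recall. How printing_Kfold_scores makes its choice internally is not shown in this section, so "highest mean recall, first one wins on a tie" is an assumption that happens to match the log; chosen_c and the other names are illustrative.

```python
from statistics import mean

# Per-fold recall scores copied from the "C parameter: 10" block of the log.
fold_recalls_c10 = [
    0.5522388059701493,
    0.6164383561643836,
    0.7333333333333333,
    0.6153846153846154,
    0.575,
]
mean_recall_c10 = mean(fold_recalls_c10)

# Mean recall per C value, as printed in the log.
mean_recall_by_c = {
    0.01: 0.5595682284048672,
    0.1: 0.5953102506435158,
    1: 0.612645688837163,
    10: 0.6184790221704963,
    100: 0.6184790221704963,
}

# max() returns the first key with the highest score, so the tie between
# C = 10 and C = 100 resolves to 10 here (assumed tie-breaking rule).
chosen_c = max(mean_recall_by_c, key=mean_recall_by_c.get)

print("mean recall for C=10:", round(mean_recall_c10, 4))
print("chosen C:", chosen_c)
```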
Plot the confusion matrix to see:
# Train directly on the original (imbalanced) training set
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
# Predictions for the original test set
y_pred = lr.predict(X_test.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
As you can see, building the model directly on data with an imbalanced class distribution does not give good results.
In the logistic regression model we studied earlier, results are classified against a default threshold of 0.5. That suggests a question: can we vary this threshold and find out which value gives the best final result for the model?
lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
# predict_proba returns the predicted probability of each class
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # candidate thresholds
plt.figure(figsize=(10, 10))
for j, i in enumerate(thresholds, start=1):
    # Predict class 1 when its probability exceeds the threshold i
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i
    plt.subplot(3, 3, j)
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset:", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold > %s' % i)
plt.show()
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 0.9795918367346939
Recall metric in the testing dataset: 0.9387755102040817
Recall metric in the testing dataset: 0.891156462585034
Recall metric in the testing dataset: 0.8367346938775511
Recall metric in the testing dataset: 0.7687074829931972
Recall metric in the testing dataset: 0.5850340136054422
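The figure shows what the confusion matrix looks like at each threshold. Weighing accuracy, recall, and the number of mispredictions together, thresholds of 0.5 and 0.6 give good results for this model.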
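The trade-off can also be seen on a toy example. The sketch below uses made-up probabilities and labels (not the credit card data) and a hypothetical helper, recall_and_false_positives, to show that lowering the threshold raises recall while letting in more false positives.

```python
def recall_and_false_positives(probabilities, labels, threshold):
    """Predict 1 when the probability exceeds threshold; return (recall, false positives)."""
    predictions = [p > threshold for p in probabilities]
    pairs = list(zip(predictions, labels))
    tp = sum(pred and y == 1 for pred, y in pairs)
    fn = sum((not pred) and y == 1 for pred, y in pairs)
    fp = sum(pred and y == 0 for pred, y in pairs)
    return tp / (tp + fn), fp

# Hypothetical predicted probabilities of class 1 and the matching true labels.
probabilities = [0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.20, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    recall, false_positives = recall_and_false_positives(probabilities, labels, threshold)
    print(f"threshold {threshold}: recall = {recall:.2f}, false positives = {false_positives}")
```

On this toy data recall falls from 1.00 to 0.50 as the threshold rises from 0.3 to 0.7, while the false positives drop from 2 to 0, which is the same pattern the nine confusion matrices display.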
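7. Oversampling
Oversampling (the SMOTE algorithm):
(1) For each sample x in the minority class, compute its Euclidean distance to every sample in the minority class set and obtain its k nearest neighbours.
(2) Set a sampling ratio according to the class imbalance to determine the sampling rate N. For each minority sample x, randomly select several samples from its k nearest neighbours; call a selected neighbour xn.
(3) For each randomly chosen neighbour xn, build a new sample from it and the original sample using the formula x_new = x + rand(0, 1) × (xn − x).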
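The three steps can be sketched in plain Python. This is only an illustration of the idea, not the imbalanced-learn implementation: the function smote_sketch, its parameters, and the sample points are all made up for the example, and a brute-force neighbour search stands in for anything more efficient.

```python
import math
import random

def smote_sketch(minority, k=2, n_new_per_sample=1, seed=0):
    """Create synthetic minority samples by interpolating toward nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: the k nearest minority neighbours of x by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_new_per_sample):
            # Step 2: pick one of those neighbours at random.
            xn = rng.choice(neighbours)
            # Step 3: x_new = x + gap * (xn - x), with gap drawn from [0, 1).
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, xn)])
    return synthetic

# Hypothetical two-dimensional minority samples.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0]]
new_points = smote_sketch(minority, k=2, n_new_per_sample=2)
print(len(new_points), "synthetic samples")
```

Every synthetic point lies on the line segment between a real minority sample and one of its neighbours, which is why SMOTE adds variety instead of simply duplicating rows.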
Import the required Python libraries
import pandas as pd
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
Get the feature and label data
credit_cards = pd.read_csv('creditcard.csv')
columns = credit_cards.columns
# The labels are in the last column ('Class'). Simply remove it to obtain the feature columns
features_columns = columns.delete(len(columns) - 1)
features = credit_cards[features_columns]
labels = credit_cards['Class']
Split into training and test sets
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)
Build the oversampled data set with the SMOTE algorithm
oversampler = SMOTE(random_state=0)
# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)  # "os" = oversampled
Check the size of the oversampled data set
len(os_labels[os_labels == 1])
227454
Next, run cross-validation and build the logistic regression model on the oversampled data set
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features, os_labels)
-------------------------------------------
C parameter: 0.01
-------------------------------------------
Iteration 0 : recall score = 0.8903225806451613
Iteration 1 : recall score = 0.8947368421052632
Iteration 2 : recall score = 0.9687728228394379
Iteration 3 : recall score = 0.9578813158791396
Iteration 4 : recall score = 0.958167089831943
Mean recall score 0.933976130260189
-------------------------------------------
C parameter: 0.1
-------------------------------------------
Iteration 0 : recall score = 0.8903225806451613
Iteration 1 : recall score = 0.8947368421052632
Iteration 2 : recall score = 0.9703884032311608
Iteration 3 : recall score = 0.9593981160901727
Iteration 4 : recall score = 0.9605082379837548
Mean recall score 0.9350708360111024
-------------------------------------------
C parameter: 1
-------------------------------------------
Iteration 0 : recall score = 0.8903225806451613
Iteration 1 : recall score = 0.8947368421052632
Iteration 2 : recall score = 0.9704105344694036
Iteration 3 : recall score = 0.9585847594552709
Iteration 4 : recall score = 0.9595410030665743
Mean recall score 0.9347191439483347
-------------------------------------------
C parameter: 10
-------------------------------------------
Iteration 0 : recall score = 0.8903225806451613
Iteration 1 : recall score = 0.8947368421052632
Iteration 2 : recall score = 0.9705433218988603
Iteration 3 : recall score = 0.9601894901133203
Iteration 4 : recall score = 0.9604862553720007
Mean recall score 0.9352556980269211
-------------------------------------------
C parameter: 100
-------------------------------------------
Iteration 0 : recall score = 0.8903225806451613
Iteration 1 : recall score = 0.8947368421052632
Iteration 2 : recall score = 0.9703220095164324
Iteration 3 : recall score = 0.9604093162308613
Iteration 4 : recall score = 0.9607170727954188
Mean recall score 0.9353015642586275
*********************************************************************************
Best model to choose from cross validation is with C parameter 100.0
*********************************************************************************
Now look at the confusion matrix
# Train on the oversampled training set with the best C from cross-validation
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(os_features, os_labels.values.ravel())
# Evaluate on the untouched test split
y_pred = lr.predict(features_test.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Taking accuracy, recall, and the number of mispredictions together, oversampling turns out to work somewhat better than undersampling.
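8. Summary:
With imbalanced data, the more of the data you can use, the better. Undersampling produces a lot of mispredictions, and that is a problem built into the approach: the model is trained on equally few 0s and 1s, so it behaves as if the original data were balanced too, which pushes the number of mispredictions up. In this case study oversampling gave the better result. Recall is a little lower, but the overall performance is still good.
Workflow:
(1) First inspect the data to see whether the classes are balanced; if they are not, some remedy is needed. (This data set is fairly clean, so no other preprocessing was required and it could be used as is. In many cases you cannot get ready-made feature data directly.)
(2) Standardize the data so that its values vary over a smaller range, then select the data.
(3) Use the confusion matrix and the model evaluation metrics, then choose parameters by cross-validation.
(4) Compare the predicted probability against a threshold to get the final prediction. Different thresholds change the result considerably.
(5) The SMOTE algorithm.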
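Step (2) is easy to illustrate on its own. The sketch below standardizes a made-up column of amounts to zero mean and unit standard deviation; the standardize helper and the sample values are hypothetical, and it uses the population standard deviation, the same convention as scikit-learn's StandardScaler.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical transaction amounts; not taken from creditcard.csv.
amounts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(amounts)
print(scaled)
```

After scaling, a feature with large raw values no longer dominates the regularized logistic regression simply because of its units.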
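Working through this credit card fraud detection case covered the main approaches to imbalanced sample data in machine learning, along with cross-validation, regularization penalties, the confusion matrix, model evaluation methods, and more.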