Machine Learning Project in Practice: Credit Card Fraud Detection (Part 1)

舊市拾荒 | Posted on 2019-07-18

Original article: https://www.cnblogs.com/xiaoyh/p/11194053.html

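1. Task background

The dataset contains credit card transactions made by European cardholders in September 2013. It covers two days of transactions, in which 492 of 284,807 transactions were fraudulent. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions. Because of confidentiality constraints, the dataset providers cannot release the original features or further background information. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that were not PCA-transformed are 'Time' and 'Amount'. 'Time' holds the number of seconds elapsed between each transaction and the first transaction in the dataset. 'Class' is the response variable: it is 1 for a fraudulent transaction and 0 otherwise.

The goal is to classify the normal and fraudulent transactions in the dataset and to make predictions on the test data.

Dataset link: https://pan.baidu.com/s/1GTeCYPhDEan_8c5t7Si_qw  Extraction code: b93f

First, import the libraries we need.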

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Read the dataset file and look at its first five rows.

data = pd.read_csv("creditcard.csv")
data.head()

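In the output, the Class column is the label: 0 marks a normal transaction and 1 marks a fraudulent one.

This is a credit card fraud detection task. The data contains both normal and problematic records and, as is typical, the problematic ones make up only a tiny fraction.

The bar chart below makes the difference in counts between normal and fraudulent records visible.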

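# Series.value_counts() replaces the deprecated top-level pd.value_counts()
count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind='bar')  # pandas can draw simple charts directly
# Bar chart of the class counts
plt.title("Fraud class histogram")
plt.xlabel("Class")
# Number of records per class
plt.ylabel("Frequency")
plt.show()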

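The output shows roughly 280,000 normal samples (class 0). The fraudulent samples (class 1) are so few that their bar is barely visible in the chart, but they are there: only a few hundred.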
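To put numbers on the imbalance, the two counts quoted in the dataset description can be turned into a ratio directly. This is a small sketch that uses only those two figures; the variable names are chosen here for illustration.

```python
# Counts taken from the dataset description.
total_transactions = 284807
fraud_transactions = 492

normal_transactions = total_transactions - fraud_transactions
fraud_ratio = fraud_transactions / total_transactions

print("Normal transactions:", normal_transactions)
print("Fraud transactions:", fraud_transactions)
print("Fraud share of all transactions: {:.4f}".format(fraud_ratio))
print("Normal transactions per fraud: {:.0f}".format(
    normal_transactions / fraud_transactions))
```

With hundreds of normal transactions for every fraudulent one, a model can look accurate while ignoring the fraud class entirely.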

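The Amount column has a much wider range than the other features. Machine learning models generally behave better when features are on comparable scales, so Amount is standardized as a preprocessing step.

The Time column is of little use here, and Amount is replaced by its standardized version, so both original columns are dropped.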

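# Preprocessing: standardize the Amount column
from sklearn.preprocessing import StandardScaler
# reshape(-1, 1) produces a single column; -1 lets NumPy infer the number of rows.
# Unlike the original course code, .values is needed here before reshape.
# Add the standardized amount as a new feature column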
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()
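What standardization does to a column can be sketched without scikit-learn: subtract the mean and divide by the standard deviation. The sketch below assumes the population standard deviation (the convention StandardScaler uses); the function name standardize and the sample amounts are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return (x - mean) / std for each x, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical transaction amounts, chosen only for illustration.
amounts = [1.0, 2.5, 10.0, 250.0, 1200.0]
scaled = standardize(amounts)

print([round(v, 3) for v in scaled])
print("mean:", round(mean(scaled), 6), "std:", round(pstdev(scaled), 6))
```

After the transformation the column has mean 0 and standard deviation 1, which puts it on a scale comparable to the PCA features.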

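2. Handling the imbalanced class distribution

As noted above, the numbers of normal and fraudulent records in the dataset differ enormously. There are two common strategies for this kind of class imbalance:

(1) Undersampling: the counts above show about 280,000 samples of class 0 and only a few hundred of class 1. Reduce class 0 to a few hundred as well, so that both classes are equally small.
(2) Oversampling: with the same counts, class 0 is large and class 1 is small. Generate additional class 1 samples until there are as many of them as class 0 samples.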
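The two strategies can be sketched on a toy list of labels using only the standard library. This is an illustration under stated assumptions: the labels are made up, and the oversampling shown is plain random duplication of minority rows, a simpler stand-in for generating new samples.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Toy labels: 20 normal (0) and 4 fraud (1) records, identified by position.
labels = [0] * 20 + [1] * 4
normal_idx = [i for i, y in enumerate(labels) if y == 0]
fraud_idx = [i for i, y in enumerate(labels) if y == 1]

# Undersampling: keep every fraud row and draw the same number of
# normal rows without replacement.
under_idx = fraud_idx + random.sample(normal_idx, len(fraud_idx))

# Random oversampling: keep every normal row and draw fraud rows with
# replacement until both classes have the same size.
over_idx = normal_idx + random.choices(fraud_idx, k=len(normal_idx))

print("undersampled size:", len(under_idx))
print("oversampled size:", len(over_idx))
```

Undersampling discards most of the majority class, while oversampling keeps all the data at the cost of repeated (or, in more elaborate methods, synthetic) minority rows.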

We start with the undersampling strategy.

# loc indexes by label; iloc indexes by integer position
# ix accepted both, but it was deprecated and later removed from pandas

# X = data.ix[:, data.columns != 'Class']
# # print(X)
# y = data.ix[:, data.columns == 'Class']

X = data.iloc[:, data.columns != 'Class']  # feature columns
# print(X)
y = data.iloc[:, data.columns == 'Class']  # label column

# Number of data points in the minority class, and the indices of the fraud records
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
# replace=False samples without replacement, so no index is drawn twice
random_normal_indices = np.random.choice(normal_indices,
                                         number_records_fraud,
                                         replace=False)
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

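# Under sample dataset
# The collected values are index labels, so select with loc
# (with the default RangeIndex, iloc would pick the same rows)
under_sample_data = data.loc[under_sample_indices, :]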

X_undersample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']

# Show the class ratio in the undersampled data
print(
    "Percentage of normal transactions:",
    len(under_sample_data[under_sample_data.Class == 0]) /
    len(under_sample_data))
print(
    "Percentage of fraud transactions:",
    len(under_sample_data[under_sample_data.Class == 1]) /
    len(under_sample_data))
print("Total number of transactions in resampled data:",
      len(under_sample_data))

Percentage of normal transactions: 0.5
Percentage of fraud transactions: 0.5
Total number of transactions in resampled data: 984

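After undersampling, normal and fraudulent records each make up 50% of the data, and the total sample count is only a small fraction of the original.

Next, split both the original dataset and the undersampled dataset into training and test sets.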

# With newer scikit-learn versions, the old import fails:
# from sklearn.cross_validation import train_test_split
# ModuleNotFoundError: No module named 'sklearn.cross_validation'
# That module has been removed; import from sklearn.model_selection instead
from sklearn.model_selection import train_test_split

# Whole dataset; test_size=0.3 holds out 30% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=0)

print("Number transactions train dataset:", len(X_train))
print("Number transactions test dataset:", len(X_test))
print("Total number of transactions:", len(X_train) + len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)

print("")
print("Number transactions train dataset:", len(X_train_undersample))
print("Number transactions test dataset:", len(X_test_undersample))
print("Total number of transactions:", len(X_train_undersample) + len(X_test_undersample))

Number transactions train dataset: 199364
Number transactions test dataset: 85443
Total number of transactions: 284807

Number transactions train dataset: 688
Number transactions test dataset: 296
Total number of transactions: 984

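3. Model evaluation methods

Suppose we have data on 1,000 patients: 990 do not have cancer and 10 do. The most common evaluation metric is accuracy, which measures how often the predictions agree with the true values. Write the true labels as y and the predictions as y1. Take 10 samples numbered 1 to 10, each with a true label and a predicted label that is either 0 or 1, and check sample by sample whether the two agree. If they agree on 8 of the 10 samples, the accuracy is 8/10 = 80%.

Now build a model on the 990 healthy and 10 cancer patients that predicts "no cancer" for everyone. Run on all 1,000 samples, its accuracy is 990/1000 = 99%. Yet it predicts every sample as negative and never flags a single positive. A hospital wants to identify cancer, and this model finds 0 cases; despite its 99% accuracy it is worthless, because it cannot find a single cancer patient. Building a model is easy; the hard part is deciding how to evaluate it, and that has to be settled before modeling.

Accuracy can be misleading, especially when the classes are imbalanced. This is where recall (also called sensitivity or the true positive rate) comes in. Recall ranges from 0 to 1. Our goal is to find the 10 patients with cancer, so the metric follows from the goal: of the 10 cancer patients, how many does the model detect? If it detects none, recall is 0/10 = 0; if it detects 2, recall is 2/10 = 20%. Recall is a more meaningful measure of the model here. Building a model comes down to choosing some parameters, and expressing recall takes a little more setup. It is defined with four terms that come up often in statistics: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).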

# Recall = TP / (TP + FN)
from sklearn.linear_model import LogisticRegression  # logistic regression model
# from sklearn.cross_validation import KFold, cross_val_score  # no longer available in newer versions
from sklearn.model_selection import KFold, cross_val_score  # KFold splits the data into k folds for cross-validation
from sklearn.metrics import confusion_matrix, recall_score, classification_report
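The cancer example can be checked with a few lines of plain Python. This is a minimal sketch: the helper functions accuracy and recall are written here for illustration (scikit-learn's recall_score computes the same ratio), and the second model, which finds 2 of the 10 patients, is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """TP / (TP + FN): share of the actual positives that were found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


# 990 healthy patients (0) followed by 10 cancer patients (1).
y_true = [0] * 990 + [1] * 10

# A useless model that predicts "healthy" for everyone.
all_negative = [0] * 1000
# A hypothetical model that finds 2 of the 10 cancer patients.
finds_two = [0] * 990 + [1, 1] + [0] * 8

print("all negative:", accuracy(y_true, all_negative), recall(y_true, all_negative))
print("finds two:   ", accuracy(y_true, finds_two), recall(y_true, finds_two))
```

Both models score about 99% on accuracy, but recall separates them: 0 for the first and 0.2 for the second.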

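4. Regularization penalty

Suppose model A has weight parameters θ1, θ2, θ3 ... θ10 and model B has its own θ1, θ2, θ3 ... θ10, and both reach a recall of 90%. Does that mean either one will do?
Suppose the parameters of model A vary widely, with some weights of large magnitude,

while the parameters of model B stay within a narrow range.

Although both models reach 90% recall, model A's weights swing too much. We want a model that is more stable, one that fits not only the training data but also, as far as possible, the test data. Smaller variation among the weights lowers the risk of overfitting.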

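Overfitting means the model performs well on the training set but poorly on the test set. It is very common, and large swings in the weights are a major cause, so we prefer model B, whose weights vary less. How do we get model B? This is where regularization comes in: it penalizes the model parameters θ, since the weights are sometimes spread widely and sometimes narrowly. We want to penalize a model like A heavily and a model like B lightly. Regularization quantifies the preference for a simpler model by adding a penalty term to the loss function.

Let C₀ be the loss function before the regularization penalty is added, C the new loss function with the penalty, and w the weight parameters. L1 regularization adds λ times the sum of |w| over all weights to C₀. For model A, whose weights swing widely, the sum of |w| is large, so the penalized loss is larger; the parameter λ controls how strongly the weights are penalized. The other common form is L2 regularization, which penalizes the squared weights instead.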
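Written out with the symbols C₀, C, w and λ, the two penalties commonly take the following form. The summation over all weights and the factor 1/2 in the L2 term are conventional choices assumed here; other texts scale the terms differently.

```latex
% L1 regularization: penalize the absolute values of the weights
C = C_0 + \lambda \sum_{w} |w|

% L2 regularization: penalize the squared weights
C = C_0 + \frac{\lambda}{2} \sum_{w} w^{2}
```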

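The main question, then, is how strong the penalty should be. It could be 0.1, a light penalty, or 1, or 10. Which value works best is not known in advance; cross-validation is used to evaluate which value gives the best results. In the code, c_param_range = [0.01, 0.1, 1, 10, 100] lists the five candidates, and each is tried in turn. Note that scikit-learn's C is the inverse of the regularization strength λ described above, so a smaller C means a stronger penalty.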
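To see how the penalty separates a model with widely varying weights from one with small weights, the sketch below computes L1 and L2 penalties for two hypothetical weight vectors. The weights are invented for illustration, the 1/2 factor in the L2 term is a convention assumed here, and λ is taken as 1/C following scikit-learn's definition of C as the inverse of the regularization strength.

```python
def l1_penalty(weights, lam):
    """lambda * sum(|w|)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """(lambda / 2) * sum(w^2)."""
    return lam / 2 * sum(w * w for w in weights)


# Hypothetical weights: model A swings widely, model B stays small.
weights_a = [12.0, -35.0, 50.0, -8.0, 21.0]
weights_b = [0.4, -0.7, 1.1, -0.2, 0.5]

for c in [0.01, 0.1, 1, 10, 100]:
    lam = 1 / c  # smaller C means a larger lambda, i.e. a stronger penalty
    print("C =", c,
          "| L1 penalty A:", round(l1_penalty(weights_a, lam), 3),
          "| L1 penalty B:", round(l1_penalty(weights_b, lam), 3))
```

For any fixed λ, model A pays a far larger penalty than model B, which is why the optimizer is pushed toward weights like B's.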

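5. Cross-validation

Take a dataset called data. When building a machine learning model, the data is first split: for example, the first 80% becomes the training set and the remaining 20% the test set. The training set is used to build the model and the test set to test it. This first split into training and test sets is mandatory. The second step is to divide the training set evenly, say into 3 parts: subsets 1, 2, and 3.

Whatever model is built, it comes with parameters to choose. Is a larger value better, or a smaller one? Experience alone cannot settle this reliably; the way to decide is cross-validation.

So what is cross-validation?

Round 1: train the model on subsets 1 and 2 together, then use subset 3 to validate the resulting model. Subset 3 is the validation set; it is part of the training set and is used to judge how good the model is.
Round 2: train on subsets 1 and 3, then validate on subset 2.
Round 3: train on subsets 2 and 3, then validate on subset 1.

Validating only once is risky. If only round 1 is run and subset 3 happens to be easy, the model's score will come out too high; if the validation set happens to contain erroneous values and outliers, the score will come out too low. We want neither, so we run several rounds and average the results. With subsets 1, 2, and 3 each serving as the validation set once, each round yields a score; adding the three scores and dividing by 3 gives an overall estimate of the model's performance.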
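The three rounds can be sketched with a small index generator. This is an illustration only: k_fold_indices is a hypothetical helper that splits positions into contiguous folds without shuffling, and the per-fold scores are made-up numbers standing in for a real model's recall. The order in which folds serve as the validation set does not affect the average.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds, no shuffling."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Hypothetical per-fold scores, standing in for a real model's recall.
fold_scores = [0.90, 0.80, 0.85]

for (train_idx, val_idx), score in zip(k_fold_indices(9, 3), fold_scores):
    print("train:", train_idx, "validate:", val_idx, "score:", score)

mean_score = sum(fold_scores) / len(fold_scores)
print("mean score:", round(mean_score, 4))
```

Each sample is used for validation exactly once and for training in the other rounds, so no single lucky or unlucky split decides the result.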

def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(5,shuffle=False)
    
    # Different C parameters
    c_param_range = [0.01,0.1,1,10,100]
    
    # One row per candidate C (the original range(len(c_param_range), 2) was an empty range)
    result_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    result_table['C_parameter'] = c_param_range
    
    # fold.split() yields pairs of index arrays: indices[0] for training, indices[1] for validation
    j = 0  # row counter; loop over the C values to find the best penalty strength
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter:',c_param)
        print('-------------------------------------------')
        print('')
        
        recall_accs = []
        for iteration,indices in enumerate(fold.split(x_train_data)):
            
            # Call the logistic regression model with a certain C parameter.
            # solver='liblinear' is passed explicitly to silence the solver warning.
            # If the model fails to converge, this warning appears:
            #   ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
            # In that case raise max_iter (set to 10000 here).
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear', max_iter=10000)
            # Use the training data to fit the model. In this case, we use the portion
            # of the fold to train the model with indices[0], We then predict on the
            # portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
            
            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:])
            
            # Calculate the recall score and append it to a list for recall scores 
            # representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ',iteration,': recall score = ',recall_acc)
            
        # the mean value of those recall scores is the metric we want to save and get
        # hold of.
        result_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ',np.mean(recall_accs))
        print('')
        
    # The original code raised an error here; astype('float64') was added so that idxmax() works
    best_c = result_table.loc[result_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter',best_c)
    print('*********************************************************************************')
    
    return best_c

Call the function above on the undersampled dataset.

best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)

Output:

-------------------------------------------
C parameter: 0.01
-------------------------------------------

Iteration  0 : recall score =  0.958904109589041
Iteration  1 : recall score =  0.9178082191780822
Iteration  2 : recall score =  1.0
Iteration  3 : recall score =  0.9864864864864865
Iteration  4 : recall score =  0.9545454545454546

Mean recall score  0.9635488539598128

-------------------------------------------
C parameter: 0.1
-------------------------------------------

Iteration  0 : recall score =  0.8356164383561644
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9322033898305084
Iteration  3 : recall score =  0.9459459459459459
Iteration  4 : recall score =  0.8939393939393939

Mean recall score  0.8941437733404299

-------------------------------------------
C parameter: 1
-------------------------------------------

Iteration  0 : recall score =  0.8493150684931506
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9830508474576272
Iteration  3 : recall score =  0.9459459459459459
Iteration  4 : recall score =  0.9090909090909091

Mean recall score  0.9100832939235539

-------------------------------------------
C parameter: 10
-------------------------------------------

Iteration  0 : recall score =  0.863013698630137
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9830508474576272
Iteration  3 : recall score =  0.9324324324324325
Iteration  4 : recall score =  0.9242424242424242

Mean recall score  0.9131506202785514

-------------------------------------------
C parameter: 100
-------------------------------------------

Iteration  0 : recall score =  0.863013698630137
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9830508474576272
Iteration  3 : recall score =  0.9459459459459459
Iteration  4 : recall score =  0.9242424242424242

Mean recall score  0.9158533229812542

*********************************************************************************
Best model to choose from cross validation is with C parameter 0.01
*********************************************************************************

The results show that the mean recall is highest when the parameter C is 0.01.
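The selection can be reproduced from the printed numbers alone. The sketch below copies the mean recall values from the output into a dictionary (the names mean_recall and folds_c001 are chosen here for illustration), picks the C with the highest mean, and re-derives the mean for C = 0.01 from its five fold scores.

```python
# Mean recall per C value, copied from the printed output.
mean_recall = {
    0.01: 0.9635488539598128,
    0.1: 0.8941437733404299,
    1: 0.9100832939235539,
    10: 0.9131506202785514,
    100: 0.9158533229812542,
}

# Pick the C whose mean recall is largest.
best_c = max(mean_recall, key=mean_recall.get)
print("Best C:", best_c, "mean recall:", mean_recall[best_c])

# The mean for C = 0.01 can be re-derived from its five fold scores.
folds_c001 = [0.958904109589041, 0.9178082191780822, 1.0,
              0.9864864864864865, 0.9545454545454546]
print("Recomputed mean:", sum(folds_c001) / len(folds_c001))
```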

To be continued...
