Pandas多維特徵資料預處理及sklearn資料不均衡處理相關技術實踐-大資料ML樣本集案例實戰

開心雲技術社群發表於2018-12-19

原文網址 : https://juejin.im/post/5c1a76fef265da61561f3759

版權宣告：本套技術專欄是作者（秦凱新）平時工作的總結和昇華，通過從真實商業環境抽取案例進行總結和分享，並給出商業應用的調優建議和叢集環境容量規劃等內容，請持續關注本套部落格。QQ郵箱地址：1120746959@qq.com，如有任何學術交流，可隨時聯絡。

1 機器學習調優步驟（第一行不平衡問題處理）

Pandas多維特徵資料預處理及sklearn資料不均衡處理相關技術實踐-大資料ML樣本集案例實戰

2 Pandas多維特徵資料預處理

資料初始化展示

  import pandas as pd
  
  取出第一行
  loans_2007 = pd.read_csv('C:\\ML\\MLData\\filtered_loans_2007.csv', skiprows=1)
  print(len(loans_2007))
  half_count = len(loans_2007) / 2
  loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
  #loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)
  loans_2007.to_csv('loans_2007.csv', index=False)
  
  loans_2007.head(3)
複製程式碼

顯示第0行資料

  import pandas as pd
  loans_2007 = pd.read_csv("loans_2007.csv")
  #loans_2007.drop_duplicates()
  print(loans_2007.iloc[0])
  print(loans_2007.shape[1])
  
  id                                1077501
  member_id                      1.2966e+06
  loan_amnt                            5000
  funded_amnt                          5000
  funded_amnt_inv                      4975
  term                            36 months
  int_rate                           10.65%
  installment                        162.87
  grade                                   B
  sub_grade                              B2
  emp_title                             NaN
  emp_length                      10+ years
  home_ownership                       RENT
  annual_inc                          24000
  verification_status              Verified
  issue_d                          Dec-2011
  loan_status                    Fully Paid
  pymnt_plan                              n
  purpose                       credit_card
  title                            Computer
  zip_code                            860xx
  addr_state                             AZ
  dti                                 27.65
  delinq_2yrs                             0
  earliest_cr_line                 Jan-1985
  inq_last_6mths                          1
  open_acc                                3
  pub_rec                                 0
  revol_bal                           13648
  revol_util                          83.7%
  total_acc                               9
  initial_list_status                     f
  out_prncp                               0
  out_prncp_inv                           0
  total_pymnt                       5863.16
  total_pymnt_inv                   5833.84
  total_rec_prncp                      5000
  total_rec_int                      863.16
  total_rec_late_fee                      0
  recoveries                              0
  collection_recovery_fee                 0
  last_pymnt_d                     Jan-2015
  last_pymnt_amnt                    171.62
  last_credit_pull_d               Nov-2016
  collections_12_mths_ex_med              0
  policy_code                             1
  application_type               INDIVIDUAL
  acc_now_delinq                          0
  chargeoff_within_12_mths                0
  delinq_amnt                             0
  pub_rec_bankruptcies                    0
  tax_liens                               0
  Name: 0, dtype: object
  52
複製程式碼

刪除無意義列

  loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
  loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
  loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
  print(loans_2007.iloc[0])
  print(loans_2007.shape[1])
複製程式碼

檢視預測值的狀態型別

  print(loans_2007['loan_status'].value_counts())
  
  Fully Paid                                             33902
  Charged Off                                             5658
  Does not meet the credit policy. Status:Fully Paid      1988
  Does not meet the credit policy. Status:Charged Off      761
  Current                                                  201
  Late (31-120 days)                                        10
  In Grace Period                                            9
  Late (16-30 days)                                          5
  Default                                                    1
  Name: loan_status, dtype: int64
複製程式碼

根據貸款狀態，捨棄部分不清晰結論，給出明確分類0和1，進行替換

  loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]
  
  status_replace = {
      "loan_status" : {
          "Fully Paid": 1,
          "Charged Off": 0,
      }
  }
  
  loans_2007 = loans_2007.replace(status_replace)
複製程式碼

去除每一列值都相同的列

  #let's look for any columns that contain only one unique value and remove them
  
  orig_columns = loans_2007.columns
  drop_columns = []
  for col in orig_columns:
      col_series = loans_2007[col].dropna().unique()
      if len(col_series) == 1:
          drop_columns.append(col)
  loans_2007 = loans_2007.drop(drop_columns, axis=1)
  print(drop_columns)
  print loans_2007.shape
  loans_2007.to_csv('filtered_loans_2007.csv', index=False)

  ['initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
  (39560, 24)
複製程式碼

空值處理

  import pandas as pd
  loans = pd.read_csv('filtered_loans_2007.csv')
  null_counts = loans.isnull().sum()
  print(null_counts)
  
  loan_amnt                 0
  term                      0
  int_rate                  0
  installment               0
  emp_length                0
  home_ownership            0
  annual_inc                0
  verification_status       0
  loan_status               0
  pymnt_plan                0
  purpose                   0
  title                    10
  addr_state                0
  dti                       0
  delinq_2yrs               0
  earliest_cr_line          0
  inq_last_6mths            0
  open_acc                  0
  pub_rec                   0
  revol_bal                 0
  revol_util               50
  total_acc                 0
  last_credit_pull_d        2
  pub_rec_bankruptcies    697
  dtype: int64
  
  loans = loans.drop("pub_rec_bankruptcies", axis=1)
  loans = loans.dropna(axis=0)
複製程式碼

String型別分佈

  print(loans.dtypes.value_counts())
  object     12
  float64    10
  int64       1
  dtype: int64
  
  object_columns_df = loans.select_dtypes(include=["object"])
  print(object_columns_df.iloc[0])
  
  term                     36 months
  int_rate                    10.65%
  emp_length               10+ years
  home_ownership                RENT
  verification_status       Verified
  pymnt_plan                       n
  purpose                credit_card
  title                     Computer
  addr_state                      AZ
  earliest_cr_line          Jan-1985
  revol_util                   83.7%
  last_credit_pull_d        Nov-2016
  Name: 0, dtype: object
  
  cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
  for c in cols:
      print(loans[c].value_counts())
      
  RENT        18780
  MORTGAGE    17574
  OWN          3045
  OTHER          96
  NONE            3
  Name: home_ownership, dtype: int64
  
  Not Verified       16856
  Verified           12705
  Source Verified     9937
  Name: verification_status, dtype: int64
  
  10+ years    8821
  < 1 year     4563
  2 years      4371
  3 years      4074
  4 years      3409
  5 years      3270
  1 year       3227
  6 years      2212
  7 years      1756
  8 years      1472
  9 years      1254
  n/a          1069
  Name: emp_length, dtype: int64
  
   36 months    29041
   60 months    10457
  Name: term, dtype: int64
  
  CA    7070
  NY    3788
  FL    2856
  TX    2714
  NJ    1838
  IL    1517
  PA    1504
  VA    1400
  GA    1393
  MA    1336
  OH    1208
  MD    1049
  AZ     874
  WA     834
  CO     786
  NC     780
  CT     747
  MI     722
  MO     682
  MN     611
  NV     492
  SC     470
  WI     453
  AL     446
  OR     445
  LA     435
  KY     325
  OK     298
  KS     269
  UT     256
  AR     243
  DC     211
  RI     198
  NM     188
  WV     176
  HI     172
  NH     172
  DE     113
  MT      84
  WY      83
  AK      79
  SD      63
  VT      54
  MS      19
  TN      17
  IN       9
  ID       6
  IA       5
  NE       5
  ME       3
  Name: addr_state, dtype: int64   
複製程式碼

String型別分佈2

print(loans["purpose"].value_counts())
print(loans["title"].value_counts())

debt_consolidation    18533
credit_card            5099
other                  3963
home_improvement       2965
major_purchase         2181
small_business         1815
car                    1544
wedding                 945
medical                 692
moving                  581
vacation                379
house                   378
educational             320
renewable_energy        103
Name: purpose, dtype: int64

Debt Consolidation                         2168
Debt Consolidation Loan                    1706
Personal Loan                               658
Consolidation                               509
debt consolidation                          502
Credit Card Consolidation                   356
Home Improvement                            354
Debt consolidation                          333
Small Business Loan                         322
Credit Card Loan                            313
Personal                                    308
Consolidation Loan                          255
Home Improvement Loan                       246
personal loan                               234
personal                                    220
Loan                                        212
Wedding Loan                                209
consolidation                               200
Car Loan                                    200
Other Loan                                  190
Credit Card Payoff                          155
Wedding                                     152
Major Purchase Loan                         144
Credit Card Refinance                       143
Consolidate                                 127
Medical                                     122
Credit Card                                 117
home improvement                            111
My Loan                                      94
Credit Cards                                 93
                                           ... 
DebtConsolidationn                            1
 Freedom                                      1
Credit Card Consolidation Loan - SEG          1
SOLAR PV                                      1
Pay on Credit card                            1
To pay off balloon payments due               1
Paying off the debt                           1
Payoff ING PLOC                               1
Josh CC Loan                                  1
House payoff                                  1
Taking care of Business                       1
Gluten Free Bakery in ideal town for it       1
Startup Money for Small Business              1
FundToFinanceCar                              1
getting ready for Baby                        1
Dougs Wedding Loan                            1
d rock                                        1
LC Loan 2                                     1
swimming pool repair                          1
engagement                                    1
Cut the credit cards Loan                     1
vinman                                        1
working hard to get out of debt               1
consolidate the rest of my debt               1
Medical/Vacation                              1
2BDebtFree                                    1
Paying Off High Interest Credit Cards!        1
Baby on the way!                              1
cart loan                                     1
Consolidaton                                  1
Name: title, dtype: int64
複製程式碼

型別轉換

  mapping_dict = {
      "emp_length": {
          "10+ years": 10,
          "9 years": 9,
          "8 years": 8,
          "7 years": 7,
          "6 years": 6,
          "5 years": 5,
          "4 years": 4,
          "3 years": 3,
          "2 years": 2,
          "1 year": 1,
          "< 1 year": 0,
          "n/a": 0
      }
  }
  loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
  loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
  loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
  loans = loans.replace(mapping_dict)
複製程式碼

獨熱編碼

  cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
  dummy_df = pd.get_dummies(loans[cat_columns])
  loans = pd.concat([loans, dummy_df], axis=1)
  loans = loans.drop(cat_columns, axis=1)
  loans = loans.drop("pymnt_plan", axis=1)
複製程式碼

檢視轉換型別

  import pandas as pd
  loans = pd.read_csv("cleaned_loans2007.csv")
  print(loans.info())
  
  <class 'pandas.core.frame.DataFrame'>
  RangeIndex: 39498 entries, 0 to 39497
  Data columns (total 37 columns):
  loan_amnt                              39498 non-null float64
  int_rate                               39498 non-null float64
  installment                            39498 non-null float64
  annual_inc                             39498 non-null float64
  loan_status                            39498 non-null int64
  dti                                    39498 non-null float64
  delinq_2yrs                            39498 non-null float64
  inq_last_6mths                         39498 non-null float64
  open_acc                               39498 non-null float64
  pub_rec                                39498 non-null float64
  revol_bal                              39498 non-null float64
  revol_util                             39498 non-null float64
  total_acc                              39498 non-null float64
  home_ownership_MORTGAGE                39498 non-null int64
  home_ownership_NONE                    39498 non-null int64
  home_ownership_OTHER                   39498 non-null int64
  home_ownership_OWN                     39498 non-null int64
  home_ownership_RENT                    39498 non-null int64
  verification_status_Not Verified       39498 non-null int64
  verification_status_Source Verified    39498 non-null int64
  verification_status_Verified           39498 non-null int64
  purpose_car                            39498 non-null int64
  purpose_credit_card                    39498 non-null int64
  purpose_debt_consolidation             39498 non-null int64
  purpose_educational                    39498 non-null int64
  purpose_home_improvement               39498 non-null int64
  purpose_house                          39498 non-null int64
  purpose_major_purchase                 39498 non-null int64
  purpose_medical                        39498 non-null int64
  purpose_moving                         39498 non-null int64
  purpose_other                          39498 non-null int64
  purpose_renewable_energy               39498 non-null int64
  purpose_small_business                 39498 non-null int64
  purpose_vacation                       39498 non-null int64
  purpose_wedding                        39498 non-null int64
  term_ 36 months                        39498 non-null int64
  term_ 60 months                        39498 non-null int64
  dtypes: float64(12), int64(25)
  memory usage: 11.1 MB
複製程式碼

3 準確率理論及不均衡處理

初始定義

  import pandas as pd
  # False positives.
  fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
  fp = len(predictions[fp_filter])
  
  # True positives.
  tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
  tp = len(predictions[tp_filter])
  
  # False negatives.
  fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
  fn = len(predictions[fn_filter])
  
  # True negatives
  tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
  tn = len(predictions[tn_filter])
複製程式碼

邏輯迴歸不處理不均衡

      from sklearn.linear_model import LogisticRegression
      lr = LogisticRegression()
      cols = loans.columns
      train_cols = cols.drop("loan_status")
      features = loans[train_cols]
      target = loans["loan_status"]
      lr.fit(features, target)
      predictions = lr.predict(features)
      
      from sklearn.linear_model import LogisticRegression
      from sklearn.cross_validation import cross_val_predict, KFold
      lr = LogisticRegression()
      kf = KFold(features.shape[0], random_state=1)
      predictions = cross_val_predict(lr, features, target, cv=kf)
      predictions = pd.Series(predictions)
      
      # False positives.
      fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
      fp = len(predictions[fp_filter])
      
      # True positives.
      tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
      tp = len(predictions[tp_filter])
      
      # False negatives.
      fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
      fn = len(predictions[fn_filter])
      
      # True negatives
      tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
      tn = len(predictions[tn_filter])
      
      # Rates
      tpr = tp / float((tp + fn))
      fpr = fp / float((fp + tn))
      
      print(tpr)
      print(fpr)
      print predictions[:20]
      
      0.999084438406
      0.998049299521
      0     1
      1     1
      2     1
      3     1
      4     1
      5     1
      6     1
      7     1
      8     1
      9     1
      10    1
      11    1
      12    1
      13    1
      14    1
      15    1
      16    1
      17    1
      18    1
      19    1
      dtype: int64
複製程式碼

邏輯迴歸balanced處理不均衡

  from sklearn.linear_model import LogisticRegression
  from sklearn.cross_validation import cross_val_predict
  lr = LogisticRegression(class_weight="balanced")
  kf = KFold(features.shape[0], random_state=1)
  predictions = cross_val_predict(lr, features, target, cv=kf)
  predictions = pd.Series(predictions)
  
  # False positives.
  fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
  fp = len(predictions[fp_filter])
  
  # True positives.
  tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
  tp = len(predictions[tp_filter])
  
  # False negatives.
  fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
  fn = len(predictions[fn_filter])
  
  # True negatives
  tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
  tn = len(predictions[tn_filter])
  
  # Rates
  tpr = tp / float((tp + fn))
  fpr = fp / float((fp + tn))
  
  print(tpr)
  print(fpr)
  print predictions[:20]
  
  0.670781771464
  0.400780280192
  0     1
  1     0
  2     0
  3     1
  4     1
  5     0
  6     0
  7     0
  8     0
  9     0
  10    1
  11    0
  12    1
  13    1
  14    0
  15    0
  16    1
  17    1
  18    1
  19    0
  dtype: int64
複製程式碼

邏輯迴歸penalty處理不均衡

  from sklearn.linear_model import LogisticRegression
  from sklearn.cross_validation import cross_val_predict
  penalty = {
      0: 5,
      1: 1
  }
  
  lr = LogisticRegression(class_weight=penalty)
  kf = KFold(features.shape[0], random_state=1)
  predictions = cross_val_predict(lr, features, target, cv=kf)
  predictions = pd.Series(predictions)
  
  # False positives.
  fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
  fp = len(predictions[fp_filter])
  
  # True positives.
  tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
  tp = len(predictions[tp_filter])
  
  # False negatives.
  fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
  fn = len(predictions[fn_filter])
  
  # True negatives
  tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
  tn = len(predictions[tn_filter])
  
  # Rates
  tpr = tp / float((tp + fn))
  fpr = fp / float((fp + tn))
  
  print(tpr)
  print(fpr)

  0.731799521545
  0.478985635751
複製程式碼

隨機森林balanced處理不均衡

  from sklearn.ensemble import RandomForestClassifier
  from sklearn.cross_validation import cross_val_predict
  rf = RandomForestClassifier(n_estimators=10,class_weight="balanced", random_state=1)
  #print help(RandomForestClassifier)
  kf = KFold(features.shape[0], random_state=1)
  predictions = cross_val_predict(rf, features, target, cv=kf)
  predictions = pd.Series(predictions)
  
  # False positives.
  fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
  fp = len(predictions[fp_filter])
  
  # True positives.
  tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
  tp = len(predictions[tp_filter])
  
  # False negatives.
  fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
  fn = len(predictions[fn_filter])
  
  # True negatives
  tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
  tn = len(predictions[tn_filter])
  
  # Rates
  tpr = tp / float((tp + fn))
  fpr = fp / float((fp + tn))
複製程式碼

4 總結

通過本文，對於資料特徵工程，具有非常重要的意義。

秦凱新於深圳 20181220

Python環境安裝及資料基本預處理-大資料ML樣本集案例實戰
2019-03-01
Python大資料
時間序列資料的預處理及基於ARIMA模型進行趨勢預測-大資料ML樣本集案例實戰
2018-12-10
模型大資料
資料處理利器python與scala物件導向對比分析2-大資料ML樣本集案例實戰
2018-12-16
Python物件大資料
Python技術棧與Spark交叉資料分析雙向整合技術實戰--大資料ML樣本集案例實戰
2018-12-17
PythonSpark大資料
（特徵工程實戰）ML最實用的資料預處理與特徵工程常用函式！
2020-12-13
特徵工程函式
基於Scikit-learn迴歸基礎問題及TPR及ROC指標相關技術實踐-大資料ML樣本集案例實戰
2019-02-17
指標大資料
Python技術棧與Spark交叉資料分析雙向整合進階實戰–大資料ML樣本集案例實戰
2019-03-01
PythonSpark大資料
Python技術棧與Spark交叉資料分析雙向整合進階實戰--大資料ML樣本集案例實戰
2018-12-17
PythonSpark大資料
sklearn基礎及資料處理
2019-09-03
java大資料處理：如何使用Java技術實現高效的大資料處理
2023-11-22
Java大資料
我的《海量資料處理與大資料技術實戰》出版啦！
2020-08-28
大資料
sklearn中常用資料預處理方法
2018-03-27
資料預處理和特徵工程
2020-07-24
特徵工程
大資料處理的關鍵技術及應用
2022-05-19
大資料
sklearn 第二篇：資料預處理
2019-07-30
資料預處理之 pandas 讀表
2020-03-01
特徵工程之資料預處理（下）
2019-02-13
特徵工程
Python基礎演算法庫及視覺化庫使用實踐-大資料ML樣本集案例實戰
2018-12-11
Python演算法視覺化大資料
阿里巴巴資深大資料工程師：大資料處理實踐
2018-06-11
阿里大資料工程師
資料分析--資料預處理
2023-12-14
資料預處理-資料清理
2020-01-19
資料預處理
2021-09-09
Sklearn之資料預處理——StandardScaler歸一化
2020-10-18
資料處理--pandas問題
2024-08-04
Python資料處理-pandas用法
2020-12-17
Python
處理XML資料應用實踐
2021-02-25
XML
信用卡欺詐行為邏輯迴歸資料分析-大資料ML樣本集案例實戰
2018-12-08
邏輯迴歸大資料
快手關於海量模型資料處理的實踐
2024-02-07
模型
資料預處理-資料歸約
2020-01-19
Serverless 在大規模資料處理的實踐
2020-06-08
Server
基於python的大資料分析-資料處理（程式碼實戰）
2019-08-30
Python大資料
文字資料預處理：sklearn 中 CountVectorizer、TfidfTransformer 和 TfidfVectorizer
2018-09-13
ORM
資料清洗與預處理：使用 Python Pandas 庫
2024-07-26
Python
基於邏輯迴歸及隨機森林的多分類問題資料分析-大資料ML樣本集案例實戰
2018-12-08
邏輯迴歸隨機森林大資料
資料預處理（資料清洗）的一般方法及python實現
2019-01-28
Python
資料預處理 demo
2020-02-19
GO web 開發實戰三，資料庫預處理
2023-02-08
GoWeb資料庫
Python資料處理（二）：處理 Excel 資料
2019-02-16
PythonExcel

Pandas多維特徵資料預處理及sklearn資料不均衡處理相關技術實踐-大資料ML樣本集案例實戰

1 機器學習調優步驟（第一行不平衡問題處理）

2 Pandas多維特徵資料預處理

3 準確率理論及不均衡處理

4 總結

相關文章