Regression Basics with Scikit-learn, plus the TPR and ROC Metrics in Practice - Big Data ML Sample Set Case Study

Published by the Kaixin Cloud technical community on 2019-02-17

Copyright notice: this column series is the author's (Qin Kaixin) summary and distillation of day-to-day work, drawing on cases taken from real business environments, together with tuning advice for commercial applications and cluster capacity planning. Please keep following this blog series. QQ email: 1120746959@qq.com; feel free to get in touch for any technical exchange.

1 Linear Regression in Practice (Predicting a Car's Fuel Consumption)

  • A look at the dataset

      import pandas as pd
      import matplotlib.pyplot as plt
      columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
      cars = pd.read_table(r"C:\ML\MLData\auto-mpg.data", delim_whitespace=True, names=columns)
      print(cars.head(5))
      
             mpg       cylinders  displacement horsepower  weight  acceleration  model year  
          0  18.0          8         307.0      130.0      3504.0          12.0          70   
          1  15.0          8         350.0      165.0      3693.0          11.5          70   
          2  18.0          8         318.0      150.0      3436.0          11.0          70   
          3  16.0          8         304.0      150.0      3433.0          12.0          70   
          4  17.0          8         302.0      140.0      3449.0          10.5          70   
          
               origin                   car name  
          0       1           chevrolet chevelle malibu  
          1       1                buick skylark 320  
          2       1             plymouth satellite  
          3       1                   amc rebel sst  
          4       1                   ford torino  
  • Plot the data distribution (ax=ax1 selects the target subplot)

      fig = plt.figure()
      ax1 = fig.add_subplot(2,1,1)
      ax2 = fig.add_subplot(2,1,2)
      cars.plot("weight", "mpg", kind="scatter", ax=ax1)
      cars.plot("acceleration", "mpg", kind="scatter", ax=ax2)
      plt.show()
  • Build a linear regression model

      import sklearn
      from sklearn.linear_model import LinearRegression
      lr = LinearRegression()
      lr.fit(cars[["weight"]], cars["mpg"])
  • fit trains the linear model; predict generates predictions for display

      import sklearn
      from sklearn.linear_model import LinearRegression
      lr = LinearRegression(fit_intercept=True)
      lr.fit(cars[["weight"]], cars["mpg"])
      predictions = lr.predict(cars[["weight"]])
      print(predictions[0:5])
      print(cars["mpg"][0:5])
      
      [19.41852276 17.96764345 19.94053224 19.96356207 19.84073631]
      0    18.0
      1    15.0
      2    18.0
      3    16.0
      4    17.0
      Name: mpg, dtype: float64
  • Distribution of predicted vs. actual values

      plt.scatter(cars["weight"], cars["mpg"], c="red")
      plt.scatter(cars["weight"], predictions, c="blue")
      plt.show()
  • Mean squared error (MSE)

      lr = LinearRegression()
      lr.fit(cars[["weight"]], cars["mpg"])
      predictions = lr.predict(cars[["weight"]])
      from sklearn.metrics import mean_squared_error
      mse = mean_squared_error(cars["mpg"], predictions)
      print(mse)
      18.780939734628397
  • Root mean squared error (RMSE)

      mse = mean_squared_error(cars["mpg"], predictions)
      rmse = mse ** (0.5)
      print (rmse)
      4.333698159150957
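As a quick sanity check, MSE and RMSE can also be computed straight from their definitions with plain NumPy. The five values below are the predictions printed earlier, rounded to two decimals, so they are illustrative only:

```python
import numpy as np

# First five actual mpg values and the model's predictions from above (rounded)
y_true = np.array([18.0, 15.0, 18.0, 16.0, 17.0])
y_pred = np.array([19.42, 17.97, 19.94, 19.96, 19.84])

mse = np.mean((y_true - y_pred) ** 2)   # mean of the squared residuals
rmse = np.sqrt(mse)                     # back in the original units (mpg)
print(mse, rmse)                        # roughly 7.67 and 2.77
```

RMSE is usually the easier number to interpret, since it lives on the same scale as the target variable.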

2 Logistic Regression in Practice

  • A look at the dataset

      import pandas as pd
      import matplotlib.pyplot as plt
      admissions = pd.read_csv(r"C:\ML\MLData\admissions.csv")
      print(admissions.head())
      plt.scatter(admissions["gpa"], admissions["admit"])
      plt.show()
      
              admit    gpa         gre
          0      0  3.177277  594.102992
          1      0  3.412655  631.528607
          2      0  2.728097  553.714399
          3      0  3.093559  551.089985
          4      0  3.141923  537.184894
          <Figure size 640x480 with 1 Axes>
  • The sigmoid function

      import numpy as np
      # Logistic (sigmoid) function; the name `logit` follows the original code,
      # although strictly speaking the logit is this function's inverse.
      def logit(x):
          # np.exp(x) raises e to the power x, ie e^x. e ~= 2.71828
          return np.exp(x) / (1 + np.exp(x))

      # Generate 50 real values, evenly spaced, between -6 and 6.
      x = np.linspace(-6, 6, 50, dtype=float)

      # Transform each number in x using the sigmoid function.
      y = logit(x)
      
      # Plot the resulting data.
      plt.plot(x, y)
      plt.ylabel("Probability")
      plt.show()
  • Logistic regression

      from sklearn.linear_model import LinearRegression
      linear_model = LinearRegression()
      linear_model.fit(admissions[["gpa"]], admissions["admit"])
      
      from sklearn.linear_model import LogisticRegression
      logistic_model = LogisticRegression()
      logistic_model.fit(admissions[["gpa"]], admissions["admit"])
      
      LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
        intercept_scaling=1, max_iter=100, multi_class='warn',
        n_jobs=None, penalty='l2', random_state=None, solver='warn',
        tol=0.0001, verbose=0, warm_start=False)
  • Predicting probabilities with logistic regression

      logistic_model = LogisticRegression()
      logistic_model.fit(admissions[["gpa"]], admissions["admit"])
      pred_probs = logistic_model.predict_proba(admissions[["gpa"]])
      
      print(pred_probs)
      
      ## pred_probs[:,1] is the probability of admission; pred_probs[:,0] of rejection
      plt.scatter(admissions["gpa"], pred_probs[:,1])
      plt.show()
  • Predicting class labels with logistic regression

      logistic_model = LogisticRegression()
      logistic_model.fit(admissions[["gpa"]], admissions["admit"])
      fitted_labels = logistic_model.predict(admissions[["gpa"]])
      plt.scatter(admissions["gpa"], fitted_labels)
      plt.show()
  • Evaluating the logistic regression model

  • A worked evaluation example (first pick the target, e.g. "female"; that becomes the positive class P)

  • The TPR metric (recall). Under class imbalance, accuracy alone is misleading: with 100 samples of which 10 are anomalies, a model that labels everything normal still reaches 90% accuracy while catching none of the anomalies. TPR checks how well the positive class is recovered, e.g. within all the "female" samples.
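The imbalance point can be made concrete with a tiny synthetic example matching the numbers above (100 hypothetical labels, 10 of them anomalies): a model that predicts "normal" everywhere scores 90% accuracy yet has a TPR of 0.

```python
# Hypothetical sample matching the text: 100 labels, 10 positives (anomalies)
actual = [1] * 10 + [0] * 90
predicted = [0] * 100          # naive model: predict "normal" for everything

accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
tpr = tp / (tp + fn)           # recall on the positive class

print(accuracy, tpr)           # 0.9 0.0
```

High accuracy with zero recall is exactly the failure mode TPR is designed to expose.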

  • Accuracy

      admissions["actual_label"] = admissions["admit"]
      admissions["predicted_label"] = logistic_model.predict(admissions[["gpa"]])
      matches = admissions["predicted_label"] == admissions["actual_label"]
      correct_predictions = admissions[matches]
      print(correct_predictions.head())
      accuracy = len(correct_predictions) / float(len(admissions))
      print(accuracy)
      
         admit       gpa         gre  predicted_label  actual_label
      0      0  3.177277  594.102992                0             0
      1      0  3.412655  631.528607                0             0
      2      0  2.728097  553.714399                0             0
      3      0  3.093559  551.089985                0             0
      4      0  3.141923  537.184894                0             0
      0.645962732919
  • The TPR metric (how well positives are recovered)

      true_positive_filter = (admissions["predicted_label"] == 1) & (admissions["actual_label"] == 1)
      true_positives = len(admissions[true_positive_filter])
      
      true_negative_filter = (admissions["predicted_label"] == 0) & (admissions["actual_label"] == 0)
      true_negatives = len(admissions[true_negative_filter])
      
      print(true_positives)
      print(true_negatives)
      
      31
      385   
      
      true_positive_filter = (admissions["predicted_label"] == 1) & (admissions["actual_label"] == 1)
      true_positives = len(admissions[true_positive_filter])
      false_negative_filter = (admissions["predicted_label"] == 0) & (admissions["actual_label"] == 1)
      false_negatives = len(admissions[false_negative_filter])
      
      sensitivity = true_positives / float((true_positives + false_negatives))
      
      print(sensitivity)
      
      0.127049180328
  • The FPR side (how well negatives are recovered; the code computes specificity, and FPR = 1 - specificity)

      true_positive_filter = (admissions["predicted_label"] == 1) & (admissions["actual_label"] == 1)
      true_positives = len(admissions[true_positive_filter])
      false_negative_filter = (admissions["predicted_label"] == 0) & (admissions["actual_label"] == 1)
      false_negatives = len(admissions[false_negative_filter])
      true_negative_filter = (admissions["predicted_label"] == 0) & (admissions["actual_label"] == 0)
      true_negatives = len(admissions[true_negative_filter])
      false_positive_filter = (admissions["predicted_label"] == 1) & (admissions["actual_label"] == 0)
      false_positives = len(admissions[false_positive_filter])
      specificity = (true_negatives) / float((false_positives + true_negatives))
      print(specificity)
      0.9625
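The four repeated filters above can be folded into one helper that tallies all the confusion-matrix counts in a single pass. This is a plain-Python sketch with a made-up five-sample input:

```python
def confusion_counts(predicted, actual):
    """Return (tp, fp, tn, fn) for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p == 1 and a == 1:
            tp += 1          # true positive
        elif p == 1 and a == 0:
            fp += 1          # false positive
        elif p == 0 and a == 0:
            tn += 1          # true negative
        else:
            fn += 1          # false negative
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts([1, 0, 1, 0, 0], [1, 0, 0, 1, 0])
sensitivity = tp / (tp + fn)      # TPR: 0.5 here
specificity = tn / (fp + tn)      # TNR: 2/3 here; FPR = 1 - specificity
```

Computing all four counts in one place makes it hard for the TPR and FPR definitions to drift apart.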

3 ROC: a Combined Evaluation Metric (look at the area under the curve; sweeping the threshold, the ideal is TPR close to 1 while FPR stays close to 0)

  • Shuffling with numpy (permutation returns a shuffled index)

      import numpy as np
      np.random.seed(8)
      admissions = pd.read_csv(r"C:\ML\MLData\admissions.csv")
      admissions["actual_label"] = admissions["admit"]
      admissions = admissions.drop("admit", axis=1)
      
      # permutation returns a shuffled copy of the index
      shuffled_index = np.random.permutation(admissions.index)
      shuffled_admissions = admissions.loc[shuffled_index]
      
      train = shuffled_admissions.iloc[0:515]
      test = shuffled_admissions.iloc[515:len(shuffled_admissions)]
      
      print(shuffled_admissions.head())
    
                    gpa         gre  actual_label
          260  3.177277  594.102992             0
          173  3.412655  631.528607             0
          256  2.728097  553.714399             0
          167  3.093559  551.089985             0
          400  3.141923  537.184894             0
  • Accuracy after shuffling

      shuffled_index = np.random.permutation(admissions.index)
      shuffled_admissions = admissions.loc[shuffled_index]
      train = shuffled_admissions.iloc[0:515]
      test = shuffled_admissions.iloc[515:len(shuffled_admissions)]
      model = LogisticRegression()
      model.fit(train[["gpa"]], train["actual_label"])
      
      labels = model.predict(test[["gpa"]])
      test["predicted_label"] = labels
      
      matches = test["predicted_label"] == test["actual_label"]
      correct_predictions = test[matches]
      accuracy = len(correct_predictions) / float(len(test))
      print(accuracy)
  • TPR and FPR on the test split

      model = LogisticRegression()
      model.fit(train[["gpa"]], train["actual_label"])
      labels = model.predict(test[["gpa"]])
      test["predicted_label"] = labels
      matches = test["predicted_label"] == test["actual_label"]
      correct_predictions = test[matches]
      accuracy = len(correct_predictions) / len(test)
      true_positive_filter = (test["predicted_label"] == 1) & (test["actual_label"] == 1)
      true_positives = len(test[true_positive_filter])
      false_negative_filter = (test["predicted_label"] == 0) & (test["actual_label"] == 1)
      false_negatives = len(test[false_negative_filter])
      
      sensitivity = true_positives / float((true_positives + false_negatives))
      print(sensitivity)
      
      false_positive_filter = (test["predicted_label"] == 1) & (test["actual_label"] == 0)
      false_positives = len(test[false_positive_filter])
      true_negative_filter = (test["predicted_label"] == 0) & (test["actual_label"] == 0)
      true_negatives = len(test[true_negative_filter])
      
      specificity = (true_negatives) / float((false_positives + true_negatives))
      print(specificity)
  • The ROC curve (solving for ROC values across different prediction thresholds)

    import matplotlib.pyplot as plt
    from sklearn import metrics
    
    # Look at the area under the curve (a combined evaluation metric)
    probabilities = model.predict_proba(test[["gpa"]])
    fpr, tpr, thresholds = metrics.roc_curve(test["actual_label"], probabilities[:,1])
    print(fpr)
    print(tpr)
    
    print (thresholds)
    plt.plot(fpr, tpr)
    plt.show()
    
    [0.         0.         0.         0.01204819 0.01204819 0.03614458
     0.03614458 0.04819277 0.04819277 0.08433735 0.08433735 0.12048193
     0.12048193 0.14457831 0.14457831 0.15662651 0.15662651 0.1686747
     0.1686747  0.20481928 0.20481928 0.22891566 0.22891566 0.25301205
     0.25301205 0.26506024 0.26506024 0.28915663 0.28915663 0.36144578
     0.36144578 0.37349398 0.37349398 0.39759036 0.39759036 0.42168675
     0.42168675 0.4939759  0.4939759  0.51807229 0.51807229 0.56626506
     0.56626506 0.57831325 0.57831325 0.59036145 0.59036145 0.60240964
     0.60240964 0.71084337 0.71084337 0.74698795 0.74698795 0.80722892
     0.80722892 1.        ]
    [0.         0.04347826 0.10869565 0.10869565 0.13043478 0.13043478
     0.15217391 0.15217391 0.17391304 0.17391304 0.30434783 0.30434783
     0.34782609 0.34782609 0.39130435 0.39130435 0.45652174 0.45652174
     0.47826087 0.47826087 0.5        0.5        0.52173913 0.52173913
     0.54347826 0.54347826 0.58695652 0.58695652 0.60869565 0.60869565
     0.63043478 0.63043478 0.65217391 0.65217391 0.67391304 0.67391304
     0.69565217 0.69565217 0.7173913  0.7173913  0.73913043 0.73913043
     0.76086957 0.76086957 0.82608696 0.82608696 0.84782609 0.84782609
     0.91304348 0.91304348 0.95652174 0.95652174 0.97826087 0.97826087
     1.         1.        ]
    [1.56004565 0.56004565 0.54243179 0.53628712 0.51105566 0.48893268
     0.48527053 0.48509715 0.4813168  0.46947103 0.45346756 0.44834885
     0.44706728 0.44427478 0.43934894 0.43829984 0.43663204 0.43619054
     0.43403896 0.42768342 0.42552944 0.42037834 0.41902425 0.4106753
     0.40808302 0.40553849 0.40263935 0.39614129 0.39051837 0.3829498
     0.38157447 0.38068362 0.38048866 0.37569719 0.37508847 0.37197763
     0.36994072 0.36246461 0.36217119 0.36104391 0.36079299 0.3545887
     0.354584   0.35222722 0.35015605 0.34812221 0.34735443 0.34725692
     0.34146873 0.32738061 0.32312995 0.32262176 0.32259245 0.3058184
     0.30278689 0.19267703]
  • Producing a single summary score (AUC)

      from sklearn.metrics import roc_auc_score
      probabilities = model.predict_proba(test[["gpa"]])
      
      # Importing the name directly means we can call roc_auc_score() instead of metrics.roc_auc_score()
      # Compute the area under the ROC curve
      auc_score = roc_auc_score(test["actual_label"], probabilities[:,1])
      print(auc_score)
      0.7210690192008302
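The score that roc_auc_score reports is just the area under the (FPR, TPR) points returned by roc_curve, so it can be approximated by hand with the trapezoidal rule (sklearn's metrics.auc computes the same area). A minimal sketch:

```python
def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule.
    Assumes fpr is sorted in increasing order, as roc_curve returns it."""
    area = 0.0
    for i in range(1, len(fpr)):
        # width of the step times the average height of its two endpoints
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# The chance diagonal has AUC 0.5; a perfect classifier has AUC 1.0
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))            # 0.5
print(trapezoid_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # 1.0
```

Feeding the fpr and tpr arrays printed in the previous section into trapezoid_auc should land very close to the roc_auc_score result above.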

4 Cross-Validation

  • Hand-rolled cross-validation

      import pandas as pd
      import numpy as np
      
      admissions = pd.read_csv(r"C:\ML\MLData\admissions.csv")
      admissions["actual_label"] = admissions["admit"]
      admissions = admissions.drop("admit", axis=1)
      
      shuffled_index = np.random.permutation(admissions.index)
      shuffled_admissions = admissions.loc[shuffled_index]
      admissions = shuffled_admissions.reset_index()
      admissions.loc[0:128, "fold"] = 1
      admissions.loc[129:257, "fold"] = 2
      admissions.loc[258:386, "fold"] = 3
      admissions.loc[387:514, "fold"] = 4
      admissions.loc[515:644, "fold"] = 5
      # Ensure the column is set to integer type.
      admissions["fold"] = admissions["fold"].astype(int)
      
      print(admissions.head())
      print(admissions.tail())
    
         index       gpa         gre  actual_label  fold
      0    510  3.177277  594.102992             0     1
      1    129  3.412655  631.528607             0     1
      2    208  2.728097  553.714399             0     1
      3     32  3.093559  551.089985             0     1
      4    420  3.141923  537.184894             0     1
           index       gpa         gre  actual_label  fold
      639    306  3.381359  720.718438             1     5
      640    376  3.083956  556.918021             1     5
      641    486  3.114419  734.297679             1     5
      642    296  3.549012  604.697503             1     5
      643    449  3.532753  588.986175             1     5
  • One cross-validation fold worked through

      from sklearn.linear_model import LogisticRegression
      # Training
      model = LogisticRegression()
      train_iteration_one = admissions[admissions["fold"] != 1]
      test_iteration_one = admissions[admissions["fold"] == 1]
      model.fit(train_iteration_one[["gpa"]], train_iteration_one["actual_label"])
      
      # Predicting
      labels = model.predict(test_iteration_one[["gpa"]])
      test_iteration_one["predicted_label"] = labels
      
      matches = test_iteration_one["predicted_label"] == test_iteration_one["actual_label"]
      correct_predictions = test_iteration_one[matches]
      iteration_one_accuracy = len(correct_predictions) / float(len(test_iteration_one))
      print(iteration_one_accuracy)
      
      0.6744186046511628
  • Predicting across all folds

      import numpy as np
      fold_ids = [1,2,3,4,5]
      def train_and_test(df, folds):
          fold_accuracies = []
          for fold in folds:
              model = LogisticRegression()
              train = admissions[admissions["fold"] != fold]
              test = admissions[admissions["fold"] == fold]
              model.fit(train[["gpa"]], train["actual_label"])
              labels = model.predict(test[["gpa"]])
              test["predicted_label"] = labels
      
              matches = test["predicted_label"] == test["actual_label"]
              correct_predictions = test[matches]
              fold_accuracies.append(len(correct_predictions) / float(len(test)))
          return(fold_accuracies)
      
      accuracies = train_and_test(admissions, fold_ids)
      print(accuracies)
      average_accuracy = np.mean(accuracies)
      print(average_accuracy)
      
      [0.6744186046511628, 0.7209302325581395, 0.8372093023255814, 0.1015625, 0.0]
      0.46682412790697675
  • Combined cross-validated results with roc_auc

      from sklearn.model_selection import KFold
      from sklearn.model_selection  import cross_val_score
      
      admissions = pd.read_csv(r"C:\ML\MLData\admissions.csv")
      admissions["actual_label"] = admissions["admit"]
      admissions = admissions.drop("admit", axis=1)
      lr = LogisticRegression()
      
      #roc_auc 
      accuracies = cross_val_score(lr, admissions[["gpa"]], admissions["actual_label"], scoring="roc_auc", cv=3)
      average_accuracy = sum(accuracies) / len(accuracies)
      
      print(accuracies)
      print(average_accuracy)
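Under the hood, a k-fold splitter only partitions the row indices, just as the fold column did above. A hand-rolled, unshuffled and unstratified index generator can be sketched as follows (illustrative only; scikit-learn's KFold additionally supports shuffling, and classifiers are often split with stratification):

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs splitting range(n) into k contiguous folds."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        stop = n if i == k - 1 else start + fold_size  # last fold absorbs the remainder
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx

for train_idx, test_idx in kfold_indices(10, 5):
    print(len(train_idx), test_idx)   # 8 rows to train on, 2 held out each time
```

Each sample lands in exactly one test fold, which is why averaging the per-fold scores gives an honest estimate.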

5 Multi-Class Problems: Converting Binary Classification into a Three-Class Problem

  • One-hot encoding (for the cylinder-count categories)

      import pandas as pd
      import matplotlib.pyplot as plt
      columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year", "origin", "car name"]
      cars = pd.read_table(r"C:\ML\MLData\auto-mpg.data", delim_whitespace=True, names=columns)
      print(cars.head(5))
      print(cars.tail(5))
      
         mpg  cylinders  displacement horsepower  weight  acceleration  year  
      0  18.0          8         307.0      130.0  3504.0          12.0    70   
      1  15.0          8         350.0      165.0  3693.0          11.5    70   
      2  18.0          8         318.0      150.0  3436.0          11.0    70   
      3  16.0          8         304.0      150.0  3433.0          12.0    70   
      4  17.0          8         302.0      140.0  3449.0          10.5    70   
      
          origin                   car name  
      0       1  chevrolet chevelle malibu  
      1       1          buick skylark 320  
      2       1         plymouth satellite  
      3       1              amc rebel sst  
      4       1                ford torino  
      
            mpg  cylinders  displacement horsepower  weight  acceleration  year  
      393  27.0          4         140.0      86.00  2790.0          15.6    82   
      394  44.0          4          97.0      52.00  2130.0          24.6    82   
      395  32.0          4         135.0      84.00  2295.0          11.6    82   
      396  28.0          4         120.0      79.00  2625.0          18.6    82   
      397  31.0          4         119.0      82.00  2720.0          19.4    82   
      
           origin         car name  
      393       1  ford mustang gl  
      394       2        vw pickup  
      395       1    dodge rampage  
      396       1      ford ranger  
      397       1       chevy s-10
  • get_dummies encodes every category as its own column; the original feature is then dropped

      dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
      #print dummy_cylinders
      cars = pd.concat([cars, dummy_cylinders], axis=1)
      print(cars.head())
      
      dummy_years = pd.get_dummies(cars["year"], prefix="year")
      #print dummy_years
      cars = pd.concat([cars, dummy_years], axis=1)
      cars = cars.drop("year", axis=1)
      cars = cars.drop("cylinders", axis=1)
      print(cars.head())
         
             mpg  cylinders  displacement horsepower  weight  acceleration  year  
          0  18.0          8         307.0      130.0  3504.0          12.0    70   
          1  15.0          8         350.0      165.0  3693.0          11.5    70   
          2  18.0          8         318.0      150.0  3436.0          11.0    70   
          3  16.0          8         304.0      150.0  3433.0          12.0    70   
          4  17.0          8         302.0      140.0  3449.0          10.5    70   
          
             origin                   car name  cyl_3  cyl_4  cyl_5  cyl_6  cyl_8  
          0       1  chevrolet chevelle malibu      0      0      0      0      1  
          1       1          buick skylark 320      0      0      0      0      1  
          2       1         plymouth satellite      0      0      0      0      1  
          3       1              amc rebel sst      0      0      0      0      1  
          4       1                ford torino      0      0      0      0      1  
          
          
          
              mpg  displacement horsepower  weight  acceleration  origin  
          0  18.0         307.0      130.0  3504.0          12.0       1   
          1  15.0         350.0      165.0  3693.0          11.5       1   
          2  18.0         318.0      150.0  3436.0          11.0       1   
          3  16.0         304.0      150.0  3433.0          12.0       1   
          4  17.0         302.0      140.0  3449.0          10.5       1   
          
                              car name  cyl_3  cyl_4  cyl_5   ...     year_73  year_74  
          0  chevrolet chevelle malibu      0      0      0   ...           0        0   
          1          buick skylark 320      0      0      0   ...           0        0   
          2         plymouth satellite      0      0      0   ...           0        0   
          3              amc rebel sst      0      0      0   ...           0        0   
          4                ford torino      0      0      0   ...           0        0   
          
             year_75  year_76  year_77  year_78  year_79  year_80  year_81  year_82  
          0        0        0        0        0        0        0        0        0  
          1        0        0        0        0        0        0        0        0  
          2        0        0        0        0        0        0        0        0  
          3        0        0        0        0        0        0        0        0  
          4        0        0        0        0        0        0        0        0  
          
          [5 rows x 25 columns]
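What get_dummies does can be sketched in a few lines of plain Python: each distinct category becomes its own 0/1 column (illustrative only; get_dummies also handles column naming, prefixes and missing values):

```python
def one_hot(values):
    """Return one 0/1 column per distinct category, in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Excerpt of the cylinders column: the categories come out as [4, 8]
print(one_hot([8, 4, 8]))  # [[0, 1], [1, 0], [0, 1]]
```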
  • Shuffling the data

      import numpy as np
      shuffled_rows = np.random.permutation(cars.index)
      shuffled_cars = cars.iloc[shuffled_rows]
      highest_train_row = int(cars.shape[0] * .70)
      train = shuffled_cars.iloc[0:highest_train_row]
      test = shuffled_cars.iloc[highest_train_row:]
      
      from sklearn.linear_model import LogisticRegression
      
      unique_origins = cars["origin"].unique()
      unique_origins.sort()
      
      models = {}
      features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]
  • unique_origins holds three categories, so we train three models

      for origin in unique_origins:
          model = LogisticRegression()
          
          X_train = train[features]
          y_train = train["origin"] == origin
      
          model.fit(X_train, y_train)
          models[origin] = model
          
      testing_probs = pd.DataFrame(columns=unique_origins)  
      print (testing_probs)
      
      # Three-class prediction, one one-vs-rest model per class
      for origin in unique_origins:
          # Select testing features.
          X_test = test[features]   
          # Compute probability of observation being in the origin.
          testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]
      print (testing_probs)    
      
      Empty DataFrame
      Columns: [1, 2, 3]
      Index: []
                  1         2         3
      0    0.562520  0.139051  0.311761
      1    0.344396  0.305419  0.340340
      2    0.562520  0.139051  0.311761
      3    0.845885  0.039920  0.136193
      4    0.344396  0.305419  0.340340
      5    0.356995  0.251510  0.382683
      6    0.962674  0.024810  0.034025
      7    0.962987  0.036239  0.022234
      8    0.945593  0.035397  0.035385
      9    0.871548  0.076721  0.061084
      10   0.962229  0.020634  0.039798
      11   0.272269  0.326451  0.392319
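With one probability column per origin, the final multi-class prediction is the class whose one-vs-rest model reports the highest probability; on the DataFrame above, testing_probs.idxmax(axis=1) does exactly this. A pure-Python sketch of that argmax step, fed two of the probability rows printed above:

```python
def predict_from_probs(prob_rows, classes):
    """For each row of per-class probabilities, return the argmax class."""
    predictions = []
    for row in prob_rows:
        best = max(range(len(classes)), key=lambda i: row[i])
        predictions.append(classes[best])
    return predictions

rows = [[0.562520, 0.139051, 0.311761],   # row 0 above  -> origin 1
        [0.272269, 0.326451, 0.392319]]   # row 11 above -> origin 3
print(predict_from_probs(rows, [1, 2, 3]))  # [1, 3]
```

Note that the three one-vs-rest probabilities in a row need not sum to 1, since each comes from an independently fitted model; only their relative order matters here.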

6 Summary

These examples are basic, a narrow glimpse of a much larger field, but they state the problems plainly, and that is precisely the purpose of this article.

Qin Kaixin, Shenzhen
