[Machine Learning] A Practical Guide to PyCaret, a Low-Code Machine Learning Library

Published by 落痕的寒假 on 2024-06-01

PyCaret is an open-source, low-code Python machine learning library that automates machine learning workflows. It is an end-to-end machine learning and model management tool that dramatically shortens the experiment cycle and boosts productivity. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks (such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, and Ray); compared with other open-source machine learning libraries, it can replace hundreds of lines of code with just a few. PyCaret's open-source repository: pycaret; official documentation: pycaret-docs.

Install the base version of PyCaret with:

pip install pycaret

Install the full version with:

pip install pycaret[full]

In addition, the following models can use the GPU:

  • Extreme Gradient Boosting
  • CatBoost
  • Light Gradient Boosting Machine (requires a GPU-enabled lightgbm installation)
  • Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, K Neighbors Regressor, Support Vector Machine, Linear Regression, Ridge Regression, Lasso Regression (requires cuML 0.15 or later)
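
A minimal sketch of enabling GPU training (assuming a CUDA-capable environment with the dependencies above installed); GPU usage is requested through the use_gpu flag of setup:

from pycaret.datasets import get_data
from pycaret.classification import ClassificationExperiment

data = get_data('diabetes', verbose=False)
s = ClassificationExperiment()
# use_gpu=True asks PyCaret to train supported estimators
# (XGBoost, CatBoost, LightGBM, cuML-backed models) on the GPU
s.setup(data, target='Class variable', session_id=0, use_gpu=True, verbose=False)
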
# Check the PyCaret version
import pycaret
pycaret.__version__
'3.3.2'

Contents
  • 1 Quick Start
    • 1.1 Classification
    • 1.2 Regression
    • 1.3 Clustering
    • 1.4 Anomaly Detection
    • 1.5 Time Series Forecasting
  • 2 Data Processing and Cleaning
    • 2.1 Missing Value Handling
    • 2.2 Type Conversion
    • 2.3 One-Hot Encoding
    • 2.4 Data Balancing
    • 2.5 Outlier Handling
    • 2.6 Feature Importance
    • 2.7 Normalization
  • 3 References

1 Quick Start

PyCaret supports a variety of machine learning tasks, including classification, regression, clustering, anomaly detection, and time series forecasting. This section introduces the basics of building models for these tasks with PyCaret. For more detailed usage, see:

Topic Notebook Link
Binary Classification link
Multiclass Classification link
Regression link
Clustering link
Anomaly Detection link
Time Series Forecasting link

1.1 Classification

PyCaret's classification module handles binary and multiclass classification, assigning elements to discrete groups. Common use cases include predicting customer default, predicting customer churn, and diagnosing disease (positive or negative). Example code follows:

Data Preparation

Load the diabetes example dataset:

from pycaret.datasets import get_data
# Load data from a local file; dataset is the file name
data = get_data(dataset='./datasets/diabetes', verbose=False)
# Or download the public dataset from the PyCaret repository
# data = get_data('diabetes', verbose=False)
# Check the data type and dimensions
type(data), data.shape
(pandas.core.frame.DataFrame, (768, 9))
# The last column indicates whether the patient has diabetes; the rest are feature columns
data.head()
Number of times pregnant Plasma glucose concentration a 2 hours in an oral glucose tolerance test Diastolic blood pressure (mm Hg) Triceps skin fold thickness (mm) 2-Hour serum insulin (mu U/ml) Body mass index (weight in kg/(height in m)^2) Diabetes pedigree function Age (years) Class variable
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

Use PyCaret's core setup function to initialize the modeling environment and prepare the data for model training and evaluation:

from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
# target: target column; session_id: random seed; preprocess: whether to preprocess the data; train_size: training-set fraction; normalize: whether to scale the data; normalize_method: scaling method
s.setup(data, target = 'Class variable', session_id = 0, verbose= False, train_size = 0.7, normalize = True, normalize_method = 'minmax')
<pycaret.classification.oop.ClassificationExperiment at 0x200b939df40>

View the variables created by the setup function:

s.get_config()
{'USI',
 'X',
 'X_test',
 'X_test_transformed',
 'X_train',
 'X_train_transformed',
 'X_transformed',
 '_available_plots',
 '_ml_usecase',
 'data',
 'dataset',
 'dataset_transformed',
 'exp_id',
 'exp_name_log',
 'fix_imbalance',
 'fold_generator',
 'fold_groups_param',
 'fold_shuffle_param',
 'gpu_n_jobs_param',
 'gpu_param',
 'html_param',
 'idx',
 'is_multiclass',
 'log_plots_param',
 'logging_param',
 'memory',
 'n_jobs_param',
 'pipeline',
 'seed',
 'target_param',
 'test',
 'test_transformed',
 'train',
 'train_transformed',
 'variable_and_property_keys',
 'variables',
 'y',
 'y_test',
 'y_test_transformed',
 'y_train',
 'y_train_transformed',
 'y_transformed'}

View the normalized data:

s.get_config('X_train_transformed')
Number of times pregnant Plasma glucose concentration a 2 hours in an oral glucose tolerance test Diastolic blood pressure (mm Hg) Triceps skin fold thickness (mm) 2-Hour serum insulin (mu U/ml) Body mass index (weight in kg/(height in m)^2) Diabetes pedigree function Age (years)
34 0.588235 0.613065 0.639344 0.313131 0.000000 0.411326 0.185312 0.400000
221 0.117647 0.793970 0.737705 0.000000 0.000000 0.470939 0.310418 0.750000
531 0.000000 0.537688 0.622951 0.000000 0.000000 0.675112 0.259607 0.050000
518 0.764706 0.381910 0.491803 0.000000 0.000000 0.488823 0.043553 0.333333
650 0.058824 0.457286 0.442623 0.252525 0.118203 0.375559 0.066610 0.033333
... ... ... ... ... ... ... ... ...
628 0.294118 0.643216 0.655738 0.000000 0.000000 0.515648 0.028181 0.400000
456 0.058824 0.678392 0.442623 0.000000 0.000000 0.397914 0.260034 0.683333
398 0.176471 0.412060 0.573770 0.000000 0.000000 0.314456 0.132792 0.066667
6 0.176471 0.391960 0.409836 0.323232 0.104019 0.461997 0.072588 0.083333
294 0.000000 0.809045 0.409836 0.000000 0.000000 0.326379 0.075149 0.733333

537 rows × 8 columns

Plot a histogram of one column:

s.get_config('X_train_transformed')['Number of times pregnant'].hist()
<AxesSubplot:>

[figure: histogram of the 'Number of times pregnant' column]

The experiment can also be initialized with the functional API:

from pycaret.classification import setup
# s = setup(data, target = 'Class variable', session_id = 0, preprocess = True, train_size = 0.7, verbose = False)

Model Training and Evaluation

PyCaret provides the compare_models function, which trains and evaluates the performance of all estimators available in the model library using 10-fold cross-validation by default:

best = s.compare_models()
# Compare only selected models
# best = s.compare_models(include = ['dt', 'rf', 'et', 'gbc', 'lightgbm'])
# Return the n_select best models ranked by recall
# best_recall_models_top3 = s.compare_models(sort = 'Recall', n_select = 3)
Model Accuracy AUC Recall Prec. F1 Kappa MCC TT (Sec)
lr Logistic Regression 0.7633 0.8132 0.4968 0.7436 0.5939 0.4358 0.4549 0.2720
ridge Ridge Classifier 0.7633 0.8113 0.5178 0.7285 0.6017 0.4406 0.4560 0.0090
lda Linear Discriminant Analysis 0.7633 0.8110 0.5497 0.7069 0.6154 0.4489 0.4583 0.0080
ada Ada Boost Classifier 0.7465 0.7768 0.5655 0.6580 0.6051 0.4208 0.4255 0.0190
svm SVM - Linear Kernel 0.7408 0.8087 0.5921 0.6980 0.6020 0.4196 0.4480 0.0080
nb Naive Bayes 0.7391 0.7939 0.5442 0.6515 0.5857 0.3995 0.4081 0.0080
rf Random Forest Classifier 0.7337 0.8033 0.5406 0.6331 0.5778 0.3883 0.3929 0.0350
et Extra Trees Classifier 0.7298 0.7899 0.5181 0.6416 0.5677 0.3761 0.3840 0.0450
gbc Gradient Boosting Classifier 0.7281 0.8007 0.5567 0.6267 0.5858 0.3857 0.3896 0.0260
lightgbm Light Gradient Boosting Machine 0.7242 0.7811 0.5827 0.6096 0.5935 0.3859 0.3876 0.0860
qda Quadratic Discriminant Analysis 0.7150 0.7875 0.4962 0.6225 0.5447 0.3428 0.3524 0.0080
knn K Neighbors Classifier 0.7131 0.7425 0.5287 0.6005 0.5577 0.3480 0.3528 0.2200
dt Decision Tree Classifier 0.6685 0.6461 0.5722 0.5266 0.5459 0.2868 0.2889 0.0100
dummy Dummy Classifier 0.6518 0.5000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0120

Return the best model among all models trained in the current setup:

best_ml = s.automl()
# best_ml
# Print the best model
print(best)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Data Visualization

PyCaret also provides the plot_model function to visualize a model's evaluation metrics; the plot argument selects the chart type. The available values are listed below (note that not all models support every chart):

  • pipeline: Schematic drawing of the preprocessing pipeline
  • auc: Area Under the Curve
  • threshold: Discrimination Threshold
  • pr: Precision Recall Curve
  • confusion_matrix: Confusion Matrix
  • error: Class Prediction Error
  • class_report: Classification Report
  • boundary: Decision Boundary
  • rfe: Recursive Feature Selection
  • learning: Learning Curve
  • manifold: Manifold Learning
  • calibration: Calibration Curve
  • vc: Validation Curve
  • dimension: Dimension Learning
  • feature: Feature Importance
  • feature_all: Feature Importance (All)
  • parameter: Model Hyperparameter
  • lift: Lift Curve
  • gain: Gain Chart
  • tree: Decision Tree
  • ks: KS Statistic Plot
# Pull the score grid of all compared models
models_results = s.pull()
models_results
Model Accuracy AUC Recall Prec. F1 Kappa MCC TT (Sec)
lr Logistic Regression 0.7633 0.8132 0.4968 0.7436 0.5939 0.4358 0.4549 0.272
ridge Ridge Classifier 0.7633 0.8113 0.5178 0.7285 0.6017 0.4406 0.4560 0.009
lda Linear Discriminant Analysis 0.7633 0.8110 0.5497 0.7069 0.6154 0.4489 0.4583 0.008
ada Ada Boost Classifier 0.7465 0.7768 0.5655 0.6580 0.6051 0.4208 0.4255 0.019
svm SVM - Linear Kernel 0.7408 0.8087 0.5921 0.6980 0.6020 0.4196 0.4480 0.008
nb Naive Bayes 0.7391 0.7939 0.5442 0.6515 0.5857 0.3995 0.4081 0.008
rf Random Forest Classifier 0.7337 0.8033 0.5406 0.6331 0.5778 0.3883 0.3929 0.035
et Extra Trees Classifier 0.7298 0.7899 0.5181 0.6416 0.5677 0.3761 0.3840 0.045
gbc Gradient Boosting Classifier 0.7281 0.8007 0.5567 0.6267 0.5858 0.3857 0.3896 0.026
lightgbm Light Gradient Boosting Machine 0.7242 0.7811 0.5827 0.6096 0.5935 0.3859 0.3876 0.086
qda Quadratic Discriminant Analysis 0.7150 0.7875 0.4962 0.6225 0.5447 0.3428 0.3524 0.008
knn K Neighbors Classifier 0.7131 0.7425 0.5287 0.6005 0.5577 0.3480 0.3528 0.220
dt Decision Tree Classifier 0.6685 0.6461 0.5722 0.5266 0.5459 0.2868 0.2889 0.010
dummy Dummy Classifier 0.6518 0.5000 0.0000 0.0000 0.0000 0.0000 0.0000 0.012
s.plot_model(best, plot = 'confusion_matrix')

[figure: confusion matrix of the best model]
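
Several charts can be rendered in one pass; a short sketch (save=True writes each chart to an image file in the working directory):

for plot_type in ['auc', 'pr', 'class_report']:
    s.plot_model(best, plot=plot_type, save=True)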

In a Jupyter environment, the evaluate_model function provides an interactive view of model performance:

# s.evaluate_model(best)

Model Prediction

The predict_model function generates predictions for a dataset and returns a pandas DataFrame containing the prediction label (prediction_label) and score (prediction_score). When data is None, it predicts on the test set created during setup.

# Predict on the full dataset
res = s.predict_model(best, data=data)
# Inspect row-level predictions
# res
Model Accuracy AUC Recall Prec. F1 Kappa MCC
0 Logistic Regression 0.7708 0.8312 0.5149 0.7500 0.6106 0.4561 0.4723
# Predict on the hold-out test set created during setup
res = s.predict_model(best)
Model Accuracy AUC Recall Prec. F1 Kappa MCC
0 Logistic Regression 0.7576 0.8553 0.5062 0.7193 0.5942 0.4287 0.4422
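
A sketch of scoring genuinely unseen, unlabeled data (mimicking deployment by dropping the target column; per the description above, predict_model appends the prediction_label and prediction_score columns):

# New samples without the label column
new_data = data.drop('Class variable', axis=1)
pred = s.predict_model(best, data=new_data)
pred[['prediction_label', 'prediction_score']].head()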

Model Saving and Loading

# Save the model to disk
_ = s.save_model(best, 'best_model', verbose = False)
# Load the model
model = s.load_model('best_model')
# Inspect the model structure
# model
Transformation Pipeline and Model Successfully Loaded
# Predict on the full dataset
res = s.predict_model(model, data=data)
Model Accuracy AUC Recall Prec. F1 Kappa MCC
0 Logistic Regression 0.7708 0.8312 0.5149 0.7500 0.6106 0.4561 0.4723

1.2 Regression

PyCaret provides the regression module for regression tasks; its usage is identical to the classification module.

# Load the insurance-charges example dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/insurance', verbose=False)
# Or download from the PyCaret repository
# data = get_data(dataset='insurance', verbose=False)
data.head()
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
# Create the experiment
from pycaret.regression import RegressionExperiment
s = RegressionExperiment()
# Predict the 'charges' column
s.setup(data, target = 'charges', session_id = 0)
# Alternative functional API
# from pycaret.regression import *
# s = setup(data, target = 'charges', session_id = 0)
Description Value
0 Session id 0
1 Target charges
2 Target type Regression
3 Original data shape (1338, 7)
4 Transformed data shape (1338, 10)
5 Transformed train set shape (936, 10)
6 Transformed test set shape (402, 10)
7 Numeric features 3
8 Categorical features 3
9 Preprocess True
10 Imputation type simple
11 Numeric imputation mean
12 Categorical imputation mode
13 Maximum one-hot encoding 25
14 Encoding method None
15 Fold Generator KFold
16 Fold Number 10
17 CPU Jobs -1
18 Use GPU False
19 Log Experiment False
20 Experiment Name reg-default-name
21 USI eb9d
<pycaret.regression.oop.RegressionExperiment at 0x200dedc2d30>
# Compare all models
best = s.compare_models()
Model MAE MSE RMSE R2 RMSLE MAPE TT (Sec)
gbr Gradient Boosting Regressor 2723.2453 23787529.5872 4832.4785 0.8254 0.4427 0.3140 0.0550
lightgbm Light Gradient Boosting Machine 2998.1311 25738691.2181 5012.2404 0.8106 0.5525 0.3709 0.1140
rf Random Forest Regressor 2915.7018 26780127.0016 5109.5098 0.8031 0.4855 0.3520 0.0670
et Extra Trees Regressor 2841.8257 28559316.9533 5243.5828 0.7931 0.4671 0.3218 0.0670
ada AdaBoost Regressor 4180.2669 28289551.0048 5297.6817 0.7886 0.5935 0.6545 0.0210
ridge Ridge Regression 4304.2640 38786967.4768 6188.6966 0.7152 0.5794 0.4283 0.0230
lar Least Angle Regression 4293.9886 38781666.5991 6188.3301 0.7151 0.5893 0.4263 0.0210
llar Lasso Least Angle Regression 4294.2135 38780221.0039 6188.1906 0.7151 0.5891 0.4264 0.0200
br Bayesian Ridge 4299.8532 38785479.0984 6188.6026 0.7151 0.5784 0.4274 0.0200
lasso Lasso Regression 4294.2186 38780210.5665 6188.1898 0.7151 0.5892 0.4264 0.0370
lr Linear Regression 4293.9886 38781666.5991 6188.3301 0.7151 0.5893 0.4263 0.0350
dt Decision Tree Regressor 3550.6534 51149204.9032 7095.9170 0.6127 0.5839 0.4537 0.0230
huber Huber Regressor 3769.3076 53638697.2337 7254.7108 0.6095 0.4528 0.2187 0.0250
par Passive Aggressive Regressor 4144.7180 62949698.1775 7862.7604 0.5433 0.4634 0.2465 0.0210
en Elastic Net 7248.9376 89841235.9517 9405.5846 0.3534 0.7346 0.9238 0.0210
omp Orthogonal Matching Pursuit 8916.1927 130904492.3067 11356.4120 0.0561 0.8781 1.1598 0.0180
knn K Neighbors Regressor 8161.8875 137982796.8000 11676.3735 -0.0011 0.8744 0.9742 0.0250
dummy Dummy Regressor 8892.4478 141597492.8000 11823.4271 -0.0221 0.9868 1.4909 0.0210
print(best)
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='squared_error',
                          max_depth=3, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          n_estimators=100, n_iter_no_change=None,
                          random_state=0, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
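
Since the regression experiment mirrors the classification API, hold-out prediction and persistence work the same way; a short sketch (the file name 'best_regression_model' is arbitrary):

# Predict on the hold-out test set created during setup
res = s.predict_model(best)
# Save the whole transformation pipeline and model to disk
_ = s.save_model(best, 'best_regression_model', verbose=False)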

1.3 Clustering

PyCaret provides the clustering module for unsupervised clustering.

Data Preparation

# Load the jewellery dataset
from pycaret.datasets import get_data
# Cluster samples based on the dataset features
data = get_data('./datasets/jewellery')
# data = get_data('jewellery')
Age Income SpendingScore Savings
0 58 77769 0.791329 6559.829923
1 59 81799 0.791082 5417.661426
2 62 74751 0.702657 9258.992965
3 59 74373 0.765680 7346.334504
4 87 17760 0.348778 16869.507130
# Create the experiment
from pycaret.clustering import ClusteringExperiment
s = ClusteringExperiment()
# normalize: scale the data
s.setup(data, normalize = True, verbose = False)
# Alternative functional API
# from pycaret.clustering import *
# s = setup(data, normalize = True)
<pycaret.clustering.oop.ClusteringExperiment at 0x200dec86340>

Model Creation

For clustering tasks, PyCaret provides create_model to build a model with a chosen algorithm, rather than comparing all of them:

kmeans = s.create_model('kmeans')
Silhouette Calinski-Harabasz Davies-Bouldin Homogeneity Rand Index Completeness
0 0.7581 1611.2647 0.3743 0 0 0

The clustering algorithms supported by create_model are:

s.models()
Name Reference
ID
kmeans K-Means Clustering sklearn.cluster._kmeans.KMeans
ap Affinity Propagation sklearn.cluster._affinity_propagation.Affinity...
meanshift Mean Shift Clustering sklearn.cluster._mean_shift.MeanShift
sc Spectral Clustering sklearn.cluster._spectral.SpectralClustering
hclust Agglomerative Clustering sklearn.cluster._agglomerative.AgglomerativeCl...
dbscan Density-Based Spatial Clustering sklearn.cluster._dbscan.DBSCAN
optics OPTICS Clustering sklearn.cluster._optics.OPTICS
birch Birch Clustering sklearn.cluster._birch.Birch
print(kmeans)
# Check the number of clusters
print(kmeans.n_clusters)
KMeans(algorithm='lloyd', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init='auto', random_state=1459, tol=0.0001, verbose=0)
4
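
The number of clusters can be set explicitly through create_model's num_clusters argument (default 4, matching the printout above); a sketch:

# Build a k-means model with 6 clusters instead of the default 4
kmeans6 = s.create_model('kmeans', num_clusters=6)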

Visualization

# Interactive visualization in a Jupyter environment
# s.evaluate_model(kmeans)
# Result visualization; available plot values:
# 'cluster' - Cluster PCA Plot (2d)
# 'tsne' - Cluster t-SNE (3d)
# 'elbow' - Elbow Plot
# 'silhouette' - Silhouette Plot
# 'distance' - Distance Plot
# 'distribution' - Distribution Plot
s.plot_model(kmeans, plot = 'elbow')

[figure: elbow plot for k-means]

Label Assignment and Prediction

Assign cluster labels to the training data:

result = s.assign_model(kmeans)
result.head()
Age Income SpendingScore Savings Cluster
0 58 77769 0.791329 6559.830078 Cluster 2
1 59 81799 0.791082 5417.661621 Cluster 2
2 62 74751 0.702657 9258.993164 Cluster 2
3 59 74373 0.765680 7346.334473 Cluster 2
4 87 17760 0.348778 16869.507812 Cluster 1

Assign labels to new data:

predictions = s.predict_model(kmeans, data = data)
predictions.head()
Age Income SpendingScore Savings Cluster
0 -0.042287 0.062733 1.103593 -1.072467 Cluster 2
1 -0.000821 0.174811 1.102641 -1.303473 Cluster 2
2 0.123577 -0.021200 0.761727 -0.526556 Cluster 2
3 -0.000821 -0.031712 1.004705 -0.913395 Cluster 2
4 1.160228 -1.606165 -0.602619 1.012686 Cluster 1

1.4 Anomaly Detection

PyCaret's anomaly detection module is an unsupervised machine learning module for identifying rare items, events, or observations that differ significantly from the majority of the data. Typically, such anomalies translate into problems such as bank fraud, structural defects, medical issues, or errors. The module is used much like the clustering module.

Data Preparation

from pycaret.datasets import get_data
data = get_data('./datasets/anomaly')
# data = get_data('anomaly')
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10
0 0.263995 0.764929 0.138424 0.935242 0.605867 0.518790 0.912225 0.608234 0.723782 0.733591
1 0.546092 0.653975 0.065575 0.227772 0.845269 0.837066 0.272379 0.331679 0.429297 0.367422
2 0.336714 0.538842 0.192801 0.553563 0.074515 0.332993 0.365792 0.861309 0.899017 0.088600
3 0.092108 0.995017 0.014465 0.176371 0.241530 0.514724 0.562208 0.158963 0.073715 0.208463
4 0.325261 0.805968 0.957033 0.331665 0.307923 0.355315 0.501899 0.558449 0.885169 0.182754
from pycaret.anomaly import AnomalyExperiment
s = AnomalyExperiment()
s.setup(data, session_id = 0)
# Alternative functional API
# from pycaret.anomaly import *
# s = setup(data, session_id = 0)
Description Value
0 Session id 0
1 Original data shape (1000, 10)
2 Transformed data shape (1000, 10)
3 Numeric features 10
4 Preprocess True
5 Imputation type simple
6 Numeric imputation mean
7 Categorical imputation mode
8 CPU Jobs -1
9 Use GPU False
10 Log Experiment False
11 Experiment Name anomaly-default-name
12 USI 54db
<pycaret.anomaly.oop.AnomalyExperiment at 0x200e14f5250>

Model Creation

iforest = s.create_model('iforest')
print(iforest)
IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=0, verbose=0)
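
The expected proportion of outliers is controlled by create_model's fraction argument (default 0.05, visible as contamination=0.05 in the printout above); a sketch:

# Flag 10% of the samples as anomalies instead of the default 5%
iforest10 = s.create_model('iforest', fraction=0.1)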

The models supported by the anomaly detection module are:

s.models()
Name Reference
ID
abod Angle-base Outlier Detection pyod.models.abod.ABOD
cluster Clustering-Based Local Outlier pycaret.internal.patches.pyod.CBLOFForceToDouble
cof Connectivity-Based Local Outlier pyod.models.cof.COF
iforest Isolation Forest pyod.models.iforest.IForest
histogram Histogram-based Outlier Detection pyod.models.hbos.HBOS
knn K-Nearest Neighbors Detector pyod.models.knn.KNN
lof Local Outlier Factor pyod.models.lof.LOF
svm One-class SVM detector pyod.models.ocsvm.OCSVM
pca Principal Component Analysis pyod.models.pca.PCA
mcd Minimum Covariance Determinant pyod.models.mcd.MCD
sod Subspace Outlier Detection pyod.models.sod.SOD
sos Stochastic Outlier Selection pyod.models.sos.SOS

Label Assignment and Prediction

Assign anomaly labels to the training data:

result = s.assign_model(iforest)
result.head()
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Anomaly Anomaly_Score
0 0.263995 0.764929 0.138424 0.935242 0.605867 0.518790 0.912225 0.608234 0.723782 0.733591 0 -0.016205
1 0.546092 0.653975 0.065575 0.227772 0.845269 0.837066 0.272379 0.331679 0.429297 0.367422 0 -0.068052
2 0.336714 0.538842 0.192801 0.553563 0.074515 0.332993 0.365792 0.861309 0.899017 0.088600 1 0.009221
3 0.092108 0.995017 0.014465 0.176371 0.241530 0.514724 0.562208 0.158963 0.073715 0.208463 1 0.056690
4 0.325261 0.805968 0.957033 0.331665 0.307923 0.355315 0.501899 0.558449 0.885169 0.182754 0 -0.012945

Assign labels to new data:

predictions = s.predict_model(iforest, data = data)
predictions.head()
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Anomaly Anomaly_Score
0 0.263995 0.764929 0.138424 0.935242 0.605867 0.518790 0.912225 0.608234 0.723782 0.733591 0 -0.016205
1 0.546092 0.653975 0.065575 0.227772 0.845269 0.837066 0.272379 0.331679 0.429297 0.367422 0 -0.068052
2 0.336714 0.538842 0.192801 0.553563 0.074515 0.332993 0.365792 0.861309 0.899017 0.088600 1 0.009221
3 0.092108 0.995017 0.014465 0.176371 0.241530 0.514724 0.562208 0.158963 0.073715 0.208463 1 0.056690
4 0.325261 0.805968 0.957033 0.331665 0.307923 0.355315 0.501899 0.558449 0.885169 0.182754 0 -0.012945

1.5 Time Series Forecasting

PyCaret's Time Series module supports a variety of forecasting methods, such as ARIMA, exponential smoothing, and Prophet. It also provides utilities for handling missing values, decomposing time series, and visualizing the data.

Data Preparation

# Airline passengers time series
from pycaret.datasets import get_data
# Source: https://raw.githubusercontent.com/sktime/sktime/main/sktime/datasets/data/Airline/Airline.csv
data = get_data('./datasets/airline')
# data = get_data('airline')
Date Passengers
0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121
import pandas as pd
data['Date'] = pd.to_datetime(data['Date'])
# Set Date as the index
data.set_index('Date', inplace=True)
from pycaret.time_series import TSForecastingExperiment
s = TSForecastingExperiment()
# fh: forecast horizon (default 1, i.e. one step ahead); fold: number of cross-validation folds
s.setup(data, fh = 3, fold = 5, session_id = 0, verbose = False)
# from pycaret.time_series import *
# s = setup(data, fh = 3, fold = 5, session_id = 0)
<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x200dee26910>

Model Training and Evaluation

best = s.compare_models()
Model MASE RMSSE MAE RMSE MAPE SMAPE R2 TT (Sec)
stlf STLF 0.4240 0.4429 12.8002 15.1933 0.0266 0.0268 0.4296 0.0300
exp_smooth Exponential Smoothing 0.5063 0.5378 15.2900 18.4455 0.0334 0.0335 -0.0521 0.0500
ets ETS 0.5520 0.5801 16.6164 19.8391 0.0354 0.0357 -0.0740 0.0680
arima ARIMA 0.6480 0.6501 19.5728 22.3027 0.0412 0.0420 -0.0796 0.0420
auto_arima Auto ARIMA 0.6526 0.6300 19.7405 21.6202 0.0414 0.0421 -0.0560 10.5220
theta Theta Forecaster 0.8458 0.8223 25.7024 28.3332 0.0524 0.0541 -0.7710 0.0220
huber_cds_dt Huber w/ Cond. Deseasonalize & Detrending 0.9002 0.8900 27.2568 30.5782 0.0550 0.0572 -0.0309 0.0680
knn_cds_dt K Neighbors w/ Cond. Deseasonalize & Detrending 0.9381 0.8830 28.5678 30.5007 0.0555 0.0575 0.0908 0.0920
lr_cds_dt Linear w/ Cond. Deseasonalize & Detrending 0.9469 0.9297 28.6337 31.9163 0.0581 0.0605 -0.1620 0.0820
ridge_cds_dt Ridge w/ Cond. Deseasonalize & Detrending 0.9469 0.9297 28.6340 31.9164 0.0581 0.0605 -0.1620 0.0680
en_cds_dt Elastic Net w/ Cond. Deseasonalize & Detrending 0.9499 0.9320 28.7271 31.9952 0.0582 0.0606 -0.1579 0.0700
llar_cds_dt Lasso Least Angular Regressor w/ Cond. Deseasonalize & Detrending 0.9520 0.9336 28.7917 32.0528 0.0583 0.0607 -0.1559 0.0560
lasso_cds_dt Lasso w/ Cond. Deseasonalize & Detrending 0.9521 0.9337 28.7941 32.0557 0.0583 0.0607 -0.1560 0.0720
br_cds_dt Bayesian Ridge w/ Cond. Deseasonalize & Detrending 0.9551 0.9347 28.9018 32.1013 0.0582 0.0606 -0.1377 0.0580
et_cds_dt Extra Trees w/ Cond. Deseasonalize & Detrending 1.0322 0.9942 31.4048 34.3054 0.0607 0.0633 -0.1660 0.1280
rf_cds_dt Random Forest w/ Cond. Deseasonalize & Detrending 1.0851 1.0286 32.9791 35.4666 0.0641 0.0670 -0.3545 0.1400
lightgbm_cds_dt Light Gradient Boosting w/ Cond. Deseasonalize & Detrending 1.1409 1.1040 34.5999 37.9918 0.0670 0.0701 -0.3994 0.0900
ada_cds_dt AdaBoost w/ Cond. Deseasonalize & Detrending 1.1441 1.0843 34.7451 37.3681 0.0664 0.0697 -0.3004 0.0920
gbr_cds_dt Gradient Boosting w/ Cond. Deseasonalize & Detrending 1.1697 1.1094 35.4408 38.1373 0.0697 0.0729 -0.4163 0.0900
omp_cds_dt Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending 1.1793 1.1250 35.7348 38.6755 0.0706 0.0732 -0.5095 0.0620
dt_cds_dt Decision Tree w/ Cond. Deseasonalize & Detrending 1.2704 1.2371 38.4976 42.4846 0.0773 0.0814 -1.0382 0.0860
snaive Seasonal Naive Forecaster 1.7700 1.5999 53.5333 54.9143 0.1136 0.1211 -4.1630 0.1580
naive Naive Forecaster 1.8145 1.7444 54.8667 59.8160 0.1135 0.1151 -3.7710 0.1460
polytrend Polynomial Trend Forecaster 2.3154 2.2507 70.1138 77.3400 0.1363 0.1468 -4.6202 0.1080
croston Croston 2.6211 2.4985 79.3645 85.8439 0.1515 0.1684 -5.2294 0.0140
grand_means Grand Means Forecaster 7.1261 6.3506 216.0214 218.4259 0.4377 0.5682 -59.2684 0.1400

Visualization

# Interactive visualization in a Jupyter environment
# Supported plot values:
# - 'ts' - Time Series Plot
# - 'train_test_split' - Train Test Split
# - 'cv' - Cross Validation
# - 'acf' - Auto Correlation (ACF)
# - 'pacf' - Partial Auto Correlation (PACF)
# - 'decomp' - Classical Decomposition
# - 'decomp_stl' - STL Decomposition
# - 'diagnostics' - Diagnostics Plot
# - 'diff' - Difference Plot
# - 'periodogram' - Frequency Components (Periodogram)
# - 'fft' - Frequency Components (FFT)
# - 'ccf' - Cross Correlation (CCF)
# - 'forecast' - "Out-of-Sample" Forecast Plot
# - 'insample' - "In-Sample" Forecast Plot
# - 'residuals' - Residuals Plot
# s.plot_model(best, plot = 'forecast', data_kwargs = {'fh' : 24})

Forecasting

# Refit the model on the full dataset, including the test samples
final_best = s.finalize_model(best)
s.predict_model(best, fh = 24)
s.save_model(final_best, 'final_best_model')
Transformation Pipeline and Model Successfully Saved

(ForecastingPipeline(steps=[('forecaster',
                             TransformedTargetForecaster(steps=[('model',
                                                                 ForecastingPipeline(steps=[('forecaster',
                                                                                             TransformedTargetForecaster(steps=[('model',
                                                                                                                                 STLForecaster(sp=12))]))]))]))]),
 'final_best_model.pkl')
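
A sketch of reloading the saved pipeline in a fresh experiment and forecasting 12 steps ahead (assuming final_best_model.pkl from above is in the working directory):

from pycaret.time_series import TSForecastingExperiment

exp = TSForecastingExperiment()
# Reload the finalized pipeline saved above
loaded = exp.load_model('final_best_model')
# Forecast 12 steps beyond the end of the training data
future = exp.predict_model(loaded, fh=12)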

2 Data Processing and Cleaning

2.1 Missing Value Handling

Datasets may contain missing values or empty records for many reasons. Dropping the affected samples is a common strategy, but it throws away potentially valuable data. An alternative is to impute the missing values. The following setup parameters control missing-value handling:

  • imputation_type: 'simple', 'iterative', or None. With 'simple', PyCaret fills missing values using simple strategies (numeric_imputation and categorical_imputation). With 'iterative', missing values are estimated by a model (numeric_iterative_imputer, categorical_iterative_imputer). With None, no imputation is performed
  • numeric_imputation: how to fill missing numeric values:
    • mean: fill with the column mean (default)
    • drop: drop rows containing missing values
    • median: fill with the column median
    • mode: fill with the column's most frequent value
    • knn: impute with k-nearest neighbors
    • int or float: fill with the given value
  • categorical_imputation:
    • mode: fill with the column's most frequent value (default)
    • drop: drop rows containing missing values
    • str: fill with the given string
  • numeric_iterative_imputer: estimator used to impute numeric values; accepts a string or an sklearn-compatible model, default lightgbm
  • categorical_iterative_imputer: estimator used to impute categorical values; accepts a string or an sklearn-compatible model, default lightgbm

Load Data

# load dataset
from pycaret.datasets import get_data
# Load data from a local file; dataset is the file name
data = get_data(dataset='./datasets/hepatitis', verbose=False)
# data = get_data('hepatitis',verbose=False)
# Note the NaN in the STEROID column at row index 3
data.head()
Class AGE SEX STEROID ANTIVIRALS FATIGUE MALAISE ANOREXIA LIVER BIG LIVER FIRM SPLEEN PALPABLE SPIDERS ASCITES VARICES BILIRUBIN ALK PHOSPHATE SGOT ALBUMIN PROTIME HISTOLOGY
0 0 30 2 1.0 2 2 2 2 1.0 2.0 2.0 2.0 2.0 2.0 1.0 85.0 18.0 4.0 NaN 1
1 0 50 1 1.0 2 1 2 2 1.0 2.0 2.0 2.0 2.0 2.0 0.9 135.0 42.0 3.5 NaN 1
2 0 78 1 2.0 2 1 2 2 2.0 2.0 2.0 2.0 2.0 2.0 0.7 96.0 32.0 4.0 NaN 1
3 0 31 1 NaN 1 2 2 2 2.0 2.0 2.0 2.0 2.0 2.0 0.7 46.0 52.0 4.0 80.0 1
4 0 34 1 2.0 2 2 2 2 2.0 2.0 2.0 2.0 2.0 2.0 1.0 NaN 200.0 4.0 NaN 1
# Impute with the column mean
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
# The column mean
# s.data['STEROID'].mean()
s.setup(data = data, session_id=0, target = 'Class', verbose=False, 
        # Set data_split_shuffle and data_split_stratify to False to keep the row order
        data_split_shuffle = False, data_split_stratify = False,
        imputation_type='simple', numeric_imputation='mean')
# View the transformed data
s.get_config('dataset_transformed').head()
AGE SEX STEROID ANTIVIRALS FATIGUE MALAISE ANOREXIA LIVER BIG LIVER FIRM SPLEEN PALPABLE SPIDERS ASCITES VARICES BILIRUBIN ALK PHOSPHATE SGOT ALBUMIN PROTIME HISTOLOGY Class
0 30.0 2.0 1.000000 2.0 2.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 1.0 85.000000 18.0 4.0 66.53968 1.0 0
1 50.0 1.0 1.000000 2.0 1.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 0.9 135.000000 42.0 3.5 66.53968 1.0 0
2 78.0 1.0 2.000000 2.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.7 96.000000 32.0 4.0 66.53968 1.0 0
3 31.0 1.0 1.509434 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.7 46.000000 52.0 4.0 80.00000 1.0 0
4 34.0 1.0 2.000000 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 99.659088 200.0 4.0 66.53968 1.0 0
# Impute with k-nearest neighbors
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
s.setup(data = data, session_id=0, target = 'Class',verbose=False, 
        # Set data_split_shuffle and data_split_stratify to False to keep the row order
        data_split_shuffle = False, data_split_stratify = False,
        imputation_type='simple', numeric_imputation = 'knn')
# View the transformed data
s.get_config('dataset_transformed').head()
AGE SEX STEROID ANTIVIRALS FATIGUE MALAISE ANOREXIA LIVER BIG LIVER FIRM SPLEEN PALPABLE SPIDERS ASCITES VARICES BILIRUBIN ALK PHOSPHATE SGOT ALBUMIN PROTIME HISTOLOGY Class
0 30.0 2.0 1.0 2.0 2.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 1.0 85.000000 18.0 4.0 91.800003 1.0 0
1 50.0 1.0 1.0 2.0 1.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 0.9 135.000000 42.0 3.5 61.599998 1.0 0
2 78.0 1.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.7 96.000000 32.0 4.0 75.800003 1.0 0
3 31.0 1.0 1.8 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.7 46.000000 52.0 4.0 80.000000 1.0 0
4 34.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 108.400002 200.0 4.0 62.799999 1.0 0
# Impute with a LightGBM model (iterative imputation)
# from pycaret.classification import ClassificationExperiment
# s = ClassificationExperiment()
# s.setup(data = data, session_id=0, target = 'Class',verbose=False, 
#         # Set data_split_shuffle and data_split_stratify to False to keep the row order
#         data_split_shuffle = False, data_split_stratify = False,
#         imputation_type='iterative', numeric_iterative_imputer = 'lightgbm')
# View the transformed data
# s.get_config('dataset_transformed').head()

2.2 Type Conversion

Although PyCaret automatically infers feature types, it also exposes parameters for customizing data types, giving users finer control so that model training and feature engineering better match their expectations. These parameters are:

  • numeric_features: columns to treat as numeric (continuous) features
  • categorical_features: columns to treat as categorical (discrete) features
  • date_features: columns to treat as date features
  • create_date_columns: whether to derive new date-related columns from the date features
  • text_features: columns to treat as text features
  • text_features_method: the method used to process text features
  • ignore_features: columns to ignore during modeling
  • keep_features: columns to always keep during modeling
# Override variable types
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/hepatitis', verbose=False)

from pycaret.classification import *
s = setup(data = data, target = 'Class', ignore_features  = ['SEX','AGE'], categorical_features=['STEROID'],verbose = False,
         data_split_shuffle = False, data_split_stratify = False)
# View the transformed data: the two ignored columns are gone and STEROID is categorical
s.get_config('dataset_transformed').head()
STEROID ANTIVIRALS FATIGUE MALAISE ANOREXIA LIVER BIG LIVER FIRM SPLEEN PALPABLE SPIDERS ASCITES VARICES BILIRUBIN ALK PHOSPHATE SGOT ALBUMIN PROTIME HISTOLOGY Class
0 0.0 2.0 2.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 1.0 85.000000 18.0 4.0 66.53968 1.0 0
1 0.0 2.0 1.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 0.9 135.000000 42.0 3.5 66.53968 1.0 0
2 1.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.7 96.000000 32.0 4.0 66.53968 1.0 0
3 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.7 46.000000 52.0 4.0 80.00000 1.0 0
4 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 99.659088 200.0 4.0 66.53968 1.0 0

2.3 One-Hot Encoding

When a dataset contains categorical variables, they usually need to be converted into a numeric form the model can understand. One-hot encoding is a common approach: each categorical variable becomes a set of binary variables, one per possible category, where exactly one variable is 1 at any time and the rest are 0. The columns to encode can be specified via the categorical_features parameter. For example:

# load dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/pokemon', verbose=False)
# data = get_data('pokemon')
data.head()
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
# One-hot encode Type 1; first count its distinct categories
len(set(data['Type 1']))
18
from pycaret.classification import *
s = setup(data = data, categorical_features =["Type 1"],target = 'Legendary', verbose=False)
# View the transformed data: Type 1 is now one-hot encoded
s.get_config('dataset_transformed').head()
# Name Type 1_Grass Type 1_Ghost Type 1_Water Type 1_Steel Type 1_Psychic Type 1_Fire Type 1_Poison Type 1_Fairy ... Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
202 187.0 Hoppip 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Flying 250.0 35.0 35.0 40.0 35.0 55.0 50.0 2.0 False
477 429.0 Mismagius 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN 495.0 60.0 60.0 60.0 105.0 105.0 105.0 4.0 False
349 319.0 SharpedoMega Sharpedo 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... Dark 560.0 70.0 140.0 70.0 110.0 65.0 105.0 3.0 False
777 707.0 Klefki 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... Fairy 470.0 57.0 80.0 91.0 80.0 87.0 75.0 6.0 False
50 45.0 Vileplume 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Poison 490.0 75.0 80.0 85.0 110.0 90.0 50.0 1.0 False

5 rows × 30 columns
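
One-hot encoding is applied only to columns whose cardinality does not exceed setup's max_encoding_ohe threshold (25 by default, as shown in the setup summary in section 1.2); columns above the threshold fall back to the encoder given by encoding_method. A sketch of tightening the threshold so the 18-category Type 1 column is no longer one-hot encoded:

from pycaret.classification import *
# With max_encoding_ohe=10, Type 1 (18 categories) exceeds the
# threshold and is handled by the fallback encoder instead
s = setup(data, categorical_features=['Type 1'], target='Legendary',
          max_encoding_ohe=10, verbose=False)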

2.4 Data Balancing

In PyCaret, fix_imbalance and fix_imbalance_method are the two parameters for handling imbalanced datasets. They preprocess the data before model training to address class imbalance.

  • fix_imbalance: a boolean indicating whether to handle an imbalanced dataset. When True, PyCaret automatically detects class imbalance in the data and resolves it by resampling. When False, the original imbalanced data is used for model training
  • fix_imbalance_method: specifies the resampling method. Options include:
    • SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority-class samples to balance the classes (the default, 'smote')
    • an estimator provided by imbalanced-learn (see the sketch at the end of this subsection)
# Load data
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/credit', verbose=False)
# data = get_data('credit')
data.head()
LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default
0 20000 2 2 1 24 2 2 -1 -1 -2 ... 0.0 0.0 0.0 0.0 689.0 0.0 0.0 0.0 0.0 1
1 90000 2 2 2 34 0 0 0 0 0 ... 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 1000.0 1000.0 5000.0 0
2 50000 2 2 1 37 0 0 0 0 0 ... 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 1100.0 1069.0 1000.0 0
3 50000 1 2 1 57 -1 0 -1 0 0 ... 20940.0 19146.0 19131.0 2000.0 36681.0 10000.0 9000.0 689.0 679.0 0
4 50000 1 1 2 37 0 0 0 0 0 ... 19394.0 19619.0 20024.0 2500.0 1815.0 657.0 1000.0 1000.0 800.0 0

5 rows × 24 columns

# Check the class counts
category_counts = data['default'].value_counts()
category_counts
default
0    18694
1     5306
Name: count, dtype: int64
from pycaret.classification import *
s = setup(data = data, target = 'default', fix_imbalance = True, verbose = False)
# Class 1 now has more samples
s.get_config('dataset_transformed')['default'].value_counts()
default
0    18694
1    14678
Name: count, dtype: int64
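
To swap the default SMOTE for another resampler, fix_imbalance_method also accepts an imbalanced-learn estimator instance; a sketch (assuming imbalanced-learn is installed, which the full PyCaret install includes):

from imblearn.over_sampling import RandomOverSampler
from pycaret.classification import *
# Plain random oversampling of the minority class instead of SMOTE
s = setup(data = data, target = 'default', fix_imbalance = True,
          fix_imbalance_method = RandomOverSampler(), verbose = False)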

2.5 Outlier Handling

PyCaret's remove_outliers option identifies and removes outliers from the dataset before model training. Per the PyCaret documentation, outliers are identified through PCA linear dimensionality reduction using the singular value decomposition technique, and the proportion of outliers can be controlled with the outliers_threshold parameter of setup (default 0.05).

from pycaret.datasets import get_data

data = get_data(dataset='./datasets/insurance', verbose=False)
# insurance = get_data('insurance')
# Data dimensions
data.shape
(1338, 7)
from pycaret.regression import *
s = setup(data = data, target = 'charges', remove_outliers = True ,verbose = False, outliers_threshold = 0.02)
# After outlier removal, fewer rows remain
s.get_config('dataset_transformed').shape
(1319, 10)

2.6 Feature Importance

Feature selection is the process of choosing the features in a dataset that contribute most to predicting the target variable. Using only the selected features, rather than all of them, can reduce the risk of overfitting, improve accuracy, and shorten training time. In PyCaret this is enabled through the feature_selection parameter. The related parameters are explained below:

  • feature_selection: whether to perform feature selection during model training; True or False.
  • feature_selection_method: the feature selection method:
    • 'univariate': uses sklearn's SelectKBest to select the features most related to the target variable based on statistical tests.
    • 'classic' (default): uses sklearn's SelectFromModel, selecting the most important features based on a supervised model's feature importances or coefficients.
    • 'sequential': uses sklearn's SequentialFeatureSelector, which adds or removes features step by step (e.g. forward or backward selection) based on a performance metric such as cross-validation score.
  • n_features_to_select: the maximum number or fraction of features to select. Values below 1 are interpreted as a fraction of the starting features. Default 0.2. Features in ignore_features and keep_features are not counted toward this limit.
from pycaret.datasets import get_data
data = get_data('./datasets/diabetes')
Number of times pregnant Plasma glucose concentration a 2 hours in an oral glucose tolerance test Diastolic blood pressure (mm Hg) Triceps skin fold thickness (mm) 2-Hour serum insulin (mu U/ml) Body mass index (weight in kg/(height in m)^2) Diabetes pedigree function Age (years) Class variable
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
from pycaret.regression import *
# feature_selection enables selection; n_features_to_select sets the fraction of features to keep
s = setup(data = data, target = 'Class variable', feature_selection = True, feature_selection_method = 'univariate',
          n_features_to_select = 0.3, verbose = False)
# See which features were kept
s.get_config('X_transformed').columns
s.get_config('X_transformed').head()
Plasma glucose concentration a 2 hours in an oral glucose tolerance test Body mass index (weight in kg/(height in m)^2)
56 187.0 37.700001
541 128.0 32.400002
269 146.0 27.500000
304 150.0 21.000000
32 88.0 24.799999

2.7 Normalization

Data Normalization

In PyCaret, the normalize and normalize_method parameters control feature scaling during preprocessing. Feature scaling rescales feature values into a small, specific range, removing the influence of differing magnitudes across features and making model training more stable and accurate. The two parameters:

  • normalize: a boolean specifying whether to scale the features. Default False; set to True to enable scaling.
  • normalize_method: a string specifying the scaling method. Options:
    • zscore (default): Z-score standardization; subtracts the column mean and divides by the standard deviation, giving each feature mean 0 and standard deviation 1.
    • minmax: Min-Max scaling; linearly rescales each feature into a given range, [0, 1] by default.
    • maxabs: MaxAbs scaling; divides each feature by its maximum absolute value, scaling it into [-1, 1].
    • robust: RobustScaler; centers and scales each feature using its median and interquartile range.
from pycaret.datasets import get_data
data = get_data('./datasets/pokemon')
data.head()
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
# Normalize the features
from pycaret.classification import *
s = setup(data, target='Legendary', normalize=True, normalize_method='robust', verbose=False)

The normalized data:

s.get_config('X_transformed').head()
# Name Type 1_Water Type 1_Normal Type 1_Ice Type 1_Psychic Type 1_Fire Type 1_Rock Type 1_Fighting Type 1_Grass ... Type 2_Electric Type 2_Normal Total HP Attack Defense Sp. Atk Sp. Def Speed Generation
403 -0.021629 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.195387 -0.333333 0.200000 0.875 1.088889 0.125 -0.288889 0.000000
471 0.139870 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.179104 0.333333 0.555556 -0.100 -0.111111 -0.100 1.111111 0.333333
238 -0.448450 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 -1.080054 -0.500000 -0.555556 -0.750 -0.777778 -1.000 -0.333333 -0.333333
646 0.604182 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 -0.618725 -0.166667 -0.333333 -0.500 -0.555556 -0.500 0.222222 0.666667
69 -0.898342 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 -0.265943 -0.833333 -0.888889 -1.000 1.222222 0.000 0.888889 -0.666667

5 rows × 46 columns

Feature Transformation

Normalization rescales the data into a new range to reduce the impact of magnitude on variance. Feature transformation is a more radical technique: it changes the shape of the data's distribution so that the transformed data is normally (or approximately normally) distributed. In PyCaret, feature transformation is enabled with the transformation parameter, and transformation_method selects the method: yeo-johnson (default) or quantile. Besides feature transformation there is also target transformation, which changes the distribution of the target variable rather than of the features; it is only available in the pycaret.regression module and is enabled with transform_target, with transform_target_method selecting the method.

from pycaret.classification import *
s = setup(data = data, target = 'Legendary', transformation = True, verbose = False)
# The transformed features
s.get_config('X_transformed').head()
# Name Type 1_Psychic Type 1_Water Type 1_Rock Type 1_Grass Type 1_Dragon Type 1_Ghost Type 1_Bug Type 1_Fairy ... Type 2_Electric Type 2_Bug Total HP Attack Defense Sp. Atk Sp. Def Speed Generation
165 52.899003 0.009216 0.043322 -0.000000 -0.000000 -0.000000 -0.000000 -0.0 -0.0 -0.0 ... -0.0 -0.0 93.118403 12.336844 23.649090 13.573010 10.692443 8.081703 26.134255 0.900773
625 140.730289 0.009216 -0.000000 0.095739 -0.000000 -0.000000 -0.000000 -0.0 -0.0 -0.0 ... -0.0 -0.0 66.091344 9.286671 20.259153 13.764668 8.160482 6.056644 9.552506 3.679456
628 141.283084 0.009216 -0.000000 -0.000000 0.043322 -0.000000 -0.000000 -0.0 -0.0 -0.0 ... -0.0 -0.0 89.747939 10.823299 29.105379 11.029571 11.203335 6.942091 27.793080 3.679456
606 137.396878 0.009216 -0.000000 -0.000000 -0.000000 0.061897 -0.000000 -0.0 -0.0 -0.0 ... -0.0 -0.0 56.560577 8.043018 10.276208 10.604937 6.949265 6.302465 19.943809 3.679456
672 149.303914 0.009216 -0.000000 -0.000000 -0.000000 -0.000000 0.029706 -0.0 -0.0 -0.0 ... -0.0 -0.0 72.626190 10.202245 26.061259 11.435493 7.199607 6.302465 20.141156 3.679456

5 rows × 46 columns
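
Target transformation, as noted above, is only available in the regression module; a minimal sketch on the insurance data from section 1.2:

from pycaret.datasets import get_data
from pycaret.regression import *

data = get_data(dataset='./datasets/insurance', verbose=False)
# transform_target reshapes the distribution of the target ('charges'),
# not of the features; 'yeo-johnson' is the default method
s = setup(data, target='charges', transform_target=True,
          transform_target_method='yeo-johnson', verbose=False)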

3 References

  • pycaret
  • pycaret-docs
  • pycaret-datasets
  • lightgbm
  • cuml
  • imbalanced-learn
