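PyCaret is an open-source, low-code machine learning library for Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that shortens the experiment cycle and improves productivity. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, and Ray. Compared with other open-source machine learning libraries, PyCaret can replace hundreds of lines of code with just a few. The PyCaret open-source repository is pycaret, and the official documentation is pycaret-docs.
The basic version of PyCaret is installed with:
pip install pycaret
The full version is installed with:
pip install pycaret[full]
In addition, the following models can use the GPU:
- Extreme Gradient Boosting
- CatBoost
- Light Gradient Boosting (requires lightgbm to be installed)
- Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, K Neighbors Regressor, Support Vector Machine, Linear Regression, Ridge Regression, Lasso Regression (requires cuML 0.15 or later)
# Check the PyCaret version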
import pycaret
pycaret.__version__
'3.3.2'
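- 1 Quick Start
- 1.1 Classification
- 1.2 Regression
- 1.3 Clustering
- 1.4 Anomaly Detection
- 1.5 Time Series Forecasting
- 2 Data Processing and Cleaning
- 2.1 Missing Value Handling
- 2.2 Type Conversion
- 2.3 One-Hot Encoding
- 2.4 Data Balancing
- 2.5 Outlier Handling
- 2.6 Feature Importance
- 2.7 Normalization
- 3 References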
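1 Quick Start
PyCaret supports a range of machine learning tasks, including classification, regression, clustering, anomaly detection, and time series forecasting. This section covers the basics of building a model for each task with PyCaret. For more detailed usage of each task, see:
Topic | Notebook Link |
---|---|
Binary Classification | link |
Multiclass Classification | link |
Regression | link |
Clustering | link |
Anomaly Detection | link |
Time Series Forecasting | link |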
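1.1 Classification
PyCaret's classification module handles binary and multiclass classification, that is, assigning elements to different groups. Common use cases include predicting whether a customer will default, predicting whether a customer will churn, and diagnosing a disease (positive or negative). Example code follows.
Data preparation
Load the diabetes example dataset:
from pycaret.datasets import get_data
# Load the data from a local file; note that dataset is the file name of the data
data = get_data(dataset='./datasets/diabetes', verbose=False)
# Download the public dataset from the PyCaret open-source repository
# data = get_data('diabetes', verbose=False)
# Check the data type and shape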
type(data), data.shape
(pandas.core.frame.DataFrame, (768, 9))
# The last column indicates whether the patient has diabetes; the other columns are features
data.head()
| | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | Class variable |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
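Use PyCaret's core function setup to initialize the modeling environment and prepare the data for model training and evaluation:
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
# target: target column; session_id: random seed; preprocess: whether to preprocess the data; train_size: proportion of the data used for training; normalize: whether to normalize the data; normalize_method: normalization method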
s.setup(data, target = 'Class variable', session_id = 0, verbose= False, train_size = 0.7, normalize = True, normalize_method = 'minmax')
<pycaret.classification.oop.ClassificationExperiment at 0x200b939df40>
List the variables created by the setup function:
s.get_config()
{'USI',
'X',
'X_test',
'X_test_transformed',
'X_train',
'X_train_transformed',
'X_transformed',
'_available_plots',
'_ml_usecase',
'data',
'dataset',
'dataset_transformed',
'exp_id',
'exp_name_log',
'fix_imbalance',
'fold_generator',
'fold_groups_param',
'fold_shuffle_param',
'gpu_n_jobs_param',
'gpu_param',
'html_param',
'idx',
'is_multiclass',
'log_plots_param',
'logging_param',
'memory',
'n_jobs_param',
'pipeline',
'seed',
'target_param',
'test',
'test_transformed',
'train',
'train_transformed',
'variable_and_property_keys',
'variables',
'y',
'y_test',
'y_test_transformed',
'y_train',
'y_train_transformed',
'y_transformed'}
View the normalized data:
s.get_config('X_train_transformed')
| | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) |
---|---|---|---|---|---|---|---|---|
34 | 0.588235 | 0.613065 | 0.639344 | 0.313131 | 0.000000 | 0.411326 | 0.185312 | 0.400000 |
221 | 0.117647 | 0.793970 | 0.737705 | 0.000000 | 0.000000 | 0.470939 | 0.310418 | 0.750000 |
531 | 0.000000 | 0.537688 | 0.622951 | 0.000000 | 0.000000 | 0.675112 | 0.259607 | 0.050000 |
518 | 0.764706 | 0.381910 | 0.491803 | 0.000000 | 0.000000 | 0.488823 | 0.043553 | 0.333333 |
650 | 0.058824 | 0.457286 | 0.442623 | 0.252525 | 0.118203 | 0.375559 | 0.066610 | 0.033333 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
628 | 0.294118 | 0.643216 | 0.655738 | 0.000000 | 0.000000 | 0.515648 | 0.028181 | 0.400000 |
456 | 0.058824 | 0.678392 | 0.442623 | 0.000000 | 0.000000 | 0.397914 | 0.260034 | 0.683333 |
398 | 0.176471 | 0.412060 | 0.573770 | 0.000000 | 0.000000 | 0.314456 | 0.132792 | 0.066667 |
6 | 0.176471 | 0.391960 | 0.409836 | 0.323232 | 0.104019 | 0.461997 | 0.072588 | 0.083333 |
294 | 0.000000 | 0.809045 | 0.409836 | 0.000000 | 0.000000 | 0.326379 | 0.075149 | 0.733333 |
537 rows × 8 columns
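The min-max normalization requested through normalize_method = 'minmax' can be illustrated with a minimal, standard-library sketch. This is not PyCaret's implementation. The function name min_max_scale and the sample values are hypothetical, and the column range of 0 to 17 is an assumption chosen for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical pregnancy counts, assuming the column ranges from 0 to 17.
pregnancies = [0, 2, 10, 13, 17]
scaled = min_max_scale(pregnancies)
print([round(s, 6) for s in scaled])
```

In a real pipeline the minimum and maximum are learned from the training split only and then reused for the test split, so test values can fall slightly outside the 0 to 1 range.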
Plot a histogram of one column:
s.get_config('X_train_transformed')['Number of times pregnant'].hist()
<AxesSubplot:>
You can also initialize the environment with the functional API:
from pycaret.classification import setup
# s = setup(data, target = 'Class variable', session_id = 0, preprocess = True, train_size = 0.7, verbose = False)
Model training and evaluation
PyCaret provides the compare_models function, which trains every estimator available in the model library and evaluates its performance, using 10-fold cross-validation by default:
best = s.compare_models()
# Compare only selected models
# best = s.compare_models(include = ['dt', 'rf', 'et', 'gbc', 'lightgbm'])
# Return the n_select best models ranked by recall
# best_recall_models_top3 = s.compare_models(sort = 'Recall', n_select = 3)
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
---|---|---|---|---|---|---|---|---|---|
lr | Logistic Regression | 0.7633 | 0.8132 | 0.4968 | 0.7436 | 0.5939 | 0.4358 | 0.4549 | 0.2720 |
ridge | Ridge Classifier | 0.7633 | 0.8113 | 0.5178 | 0.7285 | 0.6017 | 0.4406 | 0.4560 | 0.0090 |
lda | Linear Discriminant Analysis | 0.7633 | 0.8110 | 0.5497 | 0.7069 | 0.6154 | 0.4489 | 0.4583 | 0.0080 |
ada | Ada Boost Classifier | 0.7465 | 0.7768 | 0.5655 | 0.6580 | 0.6051 | 0.4208 | 0.4255 | 0.0190 |
svm | SVM - Linear Kernel | 0.7408 | 0.8087 | 0.5921 | 0.6980 | 0.6020 | 0.4196 | 0.4480 | 0.0080 |
nb | Naive Bayes | 0.7391 | 0.7939 | 0.5442 | 0.6515 | 0.5857 | 0.3995 | 0.4081 | 0.0080 |
rf | Random Forest Classifier | 0.7337 | 0.8033 | 0.5406 | 0.6331 | 0.5778 | 0.3883 | 0.3929 | 0.0350 |
et | Extra Trees Classifier | 0.7298 | 0.7899 | 0.5181 | 0.6416 | 0.5677 | 0.3761 | 0.3840 | 0.0450 |
gbc | Gradient Boosting Classifier | 0.7281 | 0.8007 | 0.5567 | 0.6267 | 0.5858 | 0.3857 | 0.3896 | 0.0260 |
lightgbm | Light Gradient Boosting Machine | 0.7242 | 0.7811 | 0.5827 | 0.6096 | 0.5935 | 0.3859 | 0.3876 | 0.0860 |
qda | Quadratic Discriminant Analysis | 0.7150 | 0.7875 | 0.4962 | 0.6225 | 0.5447 | 0.3428 | 0.3524 | 0.0080 |
knn | K Neighbors Classifier | 0.7131 | 0.7425 | 0.5287 | 0.6005 | 0.5577 | 0.3480 | 0.3528 | 0.2200 |
dt | Decision Tree Classifier | 0.6685 | 0.6461 | 0.5722 | 0.5266 | 0.5459 | 0.2868 | 0.2889 | 0.0100 |
dummy | Dummy Classifier | 0.6518 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0120 |
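Each row in the table above is an average over cross-validation folds. As a rough illustration of how 10-fold cross-validation partitions the 537 training rows, here is a plain, unshuffled k-fold splitter written with the standard library. The function name k_fold_indices is hypothetical, and PyCaret's own fold generator (for example a stratified one for classification) may assign rows differently.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold cross-validation."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample so all rows are used.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(537, n_folds=10))
print([len(val) for _, val in folds])
```

Every row is used for validation exactly once, and the model is refit ten times on the remaining rows; the reported metric is the mean over those ten validation folds.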
Return the best model among all models trained in the current setup:
best_ml = s.automl()
# best_ml
# Print the best-performing model
print(best)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=1000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
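Data visualization
PyCaret also provides the plot_model function to visualize a model's evaluation metrics; its plot parameter selects the type of plot. The available values for plot are listed below (note that not every model supports every plot):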
- pipeline: Schematic drawing of the preprocessing pipeline
- auc: Area Under the Curve
- threshold: Discrimination Threshold
- pr: Precision Recall Curve
- confusion_matrix: Confusion Matrix
- error: Class Prediction Error
- class_report: Classification Report
- boundary: Decision Boundary
- rfe: Recursive Feature Selection
- learning: Learning Curve
- manifold: Manifold Learning
- calibration: Calibration Curve
- vc: Validation Curve
- dimension: Dimension Learning
- feature: Feature Importance
- feature_all: Feature Importance (All)
- parameter: Model Hyperparameter
- lift: Lift Curve
- gain: Gain Chart
- tree: Decision Tree
- ks: KS Statistic Plot
# Retrieve the comparison results for all models
models_results = s.pull()
models_results
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
---|---|---|---|---|---|---|---|---|---|
lr | Logistic Regression | 0.7633 | 0.8132 | 0.4968 | 0.7436 | 0.5939 | 0.4358 | 0.4549 | 0.272 |
ridge | Ridge Classifier | 0.7633 | 0.8113 | 0.5178 | 0.7285 | 0.6017 | 0.4406 | 0.4560 | 0.009 |
lda | Linear Discriminant Analysis | 0.7633 | 0.8110 | 0.5497 | 0.7069 | 0.6154 | 0.4489 | 0.4583 | 0.008 |
ada | Ada Boost Classifier | 0.7465 | 0.7768 | 0.5655 | 0.6580 | 0.6051 | 0.4208 | 0.4255 | 0.019 |
svm | SVM - Linear Kernel | 0.7408 | 0.8087 | 0.5921 | 0.6980 | 0.6020 | 0.4196 | 0.4480 | 0.008 |
nb | Naive Bayes | 0.7391 | 0.7939 | 0.5442 | 0.6515 | 0.5857 | 0.3995 | 0.4081 | 0.008 |
rf | Random Forest Classifier | 0.7337 | 0.8033 | 0.5406 | 0.6331 | 0.5778 | 0.3883 | 0.3929 | 0.035 |
et | Extra Trees Classifier | 0.7298 | 0.7899 | 0.5181 | 0.6416 | 0.5677 | 0.3761 | 0.3840 | 0.045 |
gbc | Gradient Boosting Classifier | 0.7281 | 0.8007 | 0.5567 | 0.6267 | 0.5858 | 0.3857 | 0.3896 | 0.026 |
lightgbm | Light Gradient Boosting Machine | 0.7242 | 0.7811 | 0.5827 | 0.6096 | 0.5935 | 0.3859 | 0.3876 | 0.086 |
qda | Quadratic Discriminant Analysis | 0.7150 | 0.7875 | 0.4962 | 0.6225 | 0.5447 | 0.3428 | 0.3524 | 0.008 |
knn | K Neighbors Classifier | 0.7131 | 0.7425 | 0.5287 | 0.6005 | 0.5577 | 0.3480 | 0.3528 | 0.220 |
dt | Decision Tree Classifier | 0.6685 | 0.6461 | 0.5722 | 0.5266 | 0.5459 | 0.2868 | 0.2889 | 0.010 |
dummy | Dummy Classifier | 0.6518 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.012 |
s.plot_model(best, plot = 'confusion_matrix')
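In a Jupyter environment, the evaluate_model function displays model performance interactively:
# s.evaluate_model(best)
Model prediction
The predict_model function makes predictions on data and returns a pandas DataFrame containing the predicted label, prediction_label, and the score, prediction_score. When data is None, it predicts labels and scores on the test set created during setup.
# Predict on the entire dataset
res = s.predict_model(best, data=data)
# View the per-row predictions
# res
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |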
---|---|---|---|---|---|---|---|---|
0 | Logistic Regression | 0.7708 | 0.8312 | 0.5149 | 0.7500 | 0.6106 | 0.4561 | 0.4723 |
# Predict on the hold-out test set created by setup
res = s.predict_model(best)
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
---|---|---|---|---|---|---|---|---|
0 | Logistic Regression | 0.7576 | 0.8553 | 0.5062 | 0.7193 | 0.5942 | 0.4287 | 0.4422 |
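To make the metric columns concrete, the sketch below computes Accuracy, Recall, Precision, F1, Cohen's Kappa, and MCC from the four cells of a binary confusion matrix using textbook formulas. The function name binary_metrics is hypothetical. The counts are an assumption: they were chosen because they are consistent with the hold-out metrics reported above, and they are not taken from the source.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "kappa": kappa, "mcc": mcc}


# Assumed counts for a 231-row test set (inferred, not given in the source).
metrics = binary_metrics(tp=41, fp=16, fn=40, tn=134)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

AUC is omitted because it depends on the ranking of prediction scores rather than on a single confusion matrix.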
Saving and loading models
# Save the model to disk
_ = s.save_model(best, 'best_model', verbose = False)
# Load the model
model = s.load_model('best_model')
# Inspect the model structure
# model
Transformation Pipeline and Model Successfully Loaded
# Predict on the entire dataset
res = s.predict_model(model, data=data)
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
---|---|---|---|---|---|---|---|---|
0 | Logistic Regression | 0.7708 | 0.8312 | 0.5149 | 0.7500 | 0.6106 | 0.4561 | 0.4723 |
1.2 Regression
PyCaret provides the regression module for regression tasks; it is used in the same way as the classification module.
# Load the insurance charges example dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/insurance', verbose=False)
# Download from the network
# data = get_data(dataset='insurance', verbose=False)
data.head()
| | age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
# Set up the data pipeline
from pycaret.regression import RegressionExperiment
s = RegressionExperiment()
# Predict the charges column
s.setup(data, target = 'charges', session_id = 0)
# Alternative way to set up the data pipeline
# from pycaret.regression import *
# s = setup(data, target = 'charges', session_id = 0)
| | Description | Value |
---|---|---|
0 | Session id | 0 |
1 | Target | charges |
2 | Target type | Regression |
3 | Original data shape | (1338, 7) |
4 | Transformed data shape | (1338, 10) |
5 | Transformed train set shape | (936, 10) |
6 | Transformed test set shape | (402, 10) |
7 | Numeric features | 3 |
8 | Categorical features | 3 |
9 | Preprocess | True |
10 | Imputation type | simple |
11 | Numeric imputation | mean |
12 | Categorical imputation | mode |
13 | Maximum one-hot encoding | 25 |
14 | Encoding method | None |
15 | Fold Generator | KFold |
16 | Fold Number | 10 |
17 | CPU Jobs | -1 |
18 | Use GPU | False |
19 | Log Experiment | False |
20 | Experiment Name | reg-default-name |
21 | USI | eb9d |
<pycaret.regression.oop.RegressionExperiment at 0x200dedc2d30>
# Evaluate all available models
best = s.compare_models()
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
---|---|---|---|---|---|---|---|---|
gbr | Gradient Boosting Regressor | 2723.2453 | 23787529.5872 | 4832.4785 | 0.8254 | 0.4427 | 0.3140 | 0.0550 |
lightgbm | Light Gradient Boosting Machine | 2998.1311 | 25738691.2181 | 5012.2404 | 0.8106 | 0.5525 | 0.3709 | 0.1140 |
rf | Random Forest Regressor | 2915.7018 | 26780127.0016 | 5109.5098 | 0.8031 | 0.4855 | 0.3520 | 0.0670 |
et | Extra Trees Regressor | 2841.8257 | 28559316.9533 | 5243.5828 | 0.7931 | 0.4671 | 0.3218 | 0.0670 |
ada | AdaBoost Regressor | 4180.2669 | 28289551.0048 | 5297.6817 | 0.7886 | 0.5935 | 0.6545 | 0.0210 |
ridge | Ridge Regression | 4304.2640 | 38786967.4768 | 6188.6966 | 0.7152 | 0.5794 | 0.4283 | 0.0230 |
lar | Least Angle Regression | 4293.9886 | 38781666.5991 | 6188.3301 | 0.7151 | 0.5893 | 0.4263 | 0.0210 |
llar | Lasso Least Angle Regression | 4294.2135 | 38780221.0039 | 6188.1906 | 0.7151 | 0.5891 | 0.4264 | 0.0200 |
br | Bayesian Ridge | 4299.8532 | 38785479.0984 | 6188.6026 | 0.7151 | 0.5784 | 0.4274 | 0.0200 |
lasso | Lasso Regression | 4294.2186 | 38780210.5665 | 6188.1898 | 0.7151 | 0.5892 | 0.4264 | 0.0370 |
lr | Linear Regression | 4293.9886 | 38781666.5991 | 6188.3301 | 0.7151 | 0.5893 | 0.4263 | 0.0350 |
dt | Decision Tree Regressor | 3550.6534 | 51149204.9032 | 7095.9170 | 0.6127 | 0.5839 | 0.4537 | 0.0230 |
huber | Huber Regressor | 3769.3076 | 53638697.2337 | 7254.7108 | 0.6095 | 0.4528 | 0.2187 | 0.0250 |
par | Passive Aggressive Regressor | 4144.7180 | 62949698.1775 | 7862.7604 | 0.5433 | 0.4634 | 0.2465 | 0.0210 |
en | Elastic Net | 7248.9376 | 89841235.9517 | 9405.5846 | 0.3534 | 0.7346 | 0.9238 | 0.0210 |
omp | Orthogonal Matching Pursuit | 8916.1927 | 130904492.3067 | 11356.4120 | 0.0561 | 0.8781 | 1.1598 | 0.0180 |
knn | K Neighbors Regressor | 8161.8875 | 137982796.8000 | 11676.3735 | -0.0011 | 0.8744 | 0.9742 | 0.0250 |
dummy | Dummy Regressor | 8892.4478 | 141597492.8000 | 11823.4271 | -0.0221 | 0.9868 | 1.4909 | 0.0210 |
print(best)
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
init=None, learning_rate=0.1, loss='squared_error',
max_depth=3, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_iter_no_change=None,
random_state=0, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False)
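The regression columns MAE, MSE, RMSE, R2, RMSLE, and MAPE follow standard definitions. The sketch below implements the textbook formulas with the standard library only; the function name regression_metrics and the sample values are hypothetical, and PyCaret's own metric code may handle edge cases (such as negative or zero targets) differently.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute textbook regression metrics; assumes positive targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    # RMSLE penalizes relative error; log1p keeps small values well behaved.
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "RMSLE": rmsle, "MAPE": mape}


# Hypothetical charges and predictions.
result = regression_metrics([100, 200, 300, 400], [110, 190, 320, 380])
print(result)
```

An R2 below zero, as for the dummy regressor in the table above, means the model does worse than always predicting the mean of the validation targets.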
1.3 Clustering
PyCaret provides the clustering module for unsupervised clustering.
Data preparation
# Load the jewellery dataset
from pycaret.datasets import get_data
# Cluster on the dataset's features
data = get_data('./datasets/jewellery')
# data = get_data('jewellery')
| | Age | Income | SpendingScore | Savings |
---|---|---|---|---|
0 | 58 | 77769 | 0.791329 | 6559.829923 |
1 | 59 | 81799 | 0.791082 | 5417.661426 |
2 | 62 | 74751 | 0.702657 | 9258.992965 |
3 | 59 | 74373 | 0.765680 | 7346.334504 |
4 | 87 | 17760 | 0.348778 | 16869.507130 |
# Set up the data pipeline
from pycaret.clustering import ClusteringExperiment
s = ClusteringExperiment()
# normalize: whether to normalize the data
s.setup(data, normalize = True, verbose = False)
# Alternative way to set up the data pipeline
# from pycaret.clustering import *
# s = setup(data, normalize = True)
<pycaret.clustering.oop.ClusteringExperiment at 0x200dec86340>
Model creation
For clustering tasks, PyCaret provides create_model to build a clustering model with a chosen method, rather than comparing all of them.
kmeans = s.create_model('kmeans')
| | Silhouette | Calinski-Harabasz | Davies-Bouldin | Homogeneity | Rand Index | Completeness |
---|---|---|---|---|---|---|
0 | 0.7581 | 1611.2647 | 0.3743 | 0 | 0 | 0 |
The clustering methods supported by create_model are listed below:
s.models()
ID | Name | Reference |
---|---|---|
kmeans | K-Means Clustering | sklearn.cluster._kmeans.KMeans |
ap | Affinity Propagation | sklearn.cluster._affinity_propagation.Affinity... |
meanshift | Mean Shift Clustering | sklearn.cluster._mean_shift.MeanShift |
sc | Spectral Clustering | sklearn.cluster._spectral.SpectralClustering |
hclust | Agglomerative Clustering | sklearn.cluster._agglomerative.AgglomerativeCl... |
dbscan | Density-Based Spatial Clustering | sklearn.cluster._dbscan.DBSCAN |
optics | OPTICS Clustering | sklearn.cluster._optics.OPTICS |
birch | Birch Clustering | sklearn.cluster._birch.Birch |
print(kmeans)
# Check the number of clusters
print(kmeans.n_clusters)
KMeans(algorithm='lloyd', copy_x=True, init='k-means++', max_iter=300,
n_clusters=4, n_init='auto', random_state=1459, tol=0.0001, verbose=0)
4
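The printed estimator uses algorithm='lloyd'. As a rough illustration of what Lloyd's algorithm does, here is a minimal standard-library sketch. It is not scikit-learn's implementation: the function names are hypothetical, and it initializes the centroids naively from the first k points instead of using k-means++.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = [tuple(p) for p in points[:k]]  # naive init, an assumption of this sketch
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Two hypothetical, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Because the distance is Euclidean, features on very different scales (such as Income and SpendingScore) would dominate the result, which is why the setup call above normalizes the data first.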
Visualization
# Interactive visualization in a Jupyter environment
# s.evaluate_model(kmeans)
# Plot the results
# 'cluster' - Cluster PCA Plot (2d)
# 'tsne' - Cluster t-SNE (3d)
# 'elbow' - Elbow Plot
# 'silhouette' - Silhouette Plot
# 'distance' - Distance Plot
# 'distribution' - Distribution Plot
s.plot_model(kmeans, plot = 'elbow')
Label assignment and prediction
Assign cluster labels to the training data:
result = s.assign_model(kmeans)
result.head()
| | Age | Income | SpendingScore | Savings | Cluster |
---|---|---|---|---|---|
0 | 58 | 77769 | 0.791329 | 6559.830078 | Cluster 2 |
1 | 59 | 81799 | 0.791082 | 5417.661621 | Cluster 2 |
2 | 62 | 74751 | 0.702657 | 9258.993164 | Cluster 2 |
3 | 59 | 74373 | 0.765680 | 7346.334473 | Cluster 2 |
4 | 87 | 17760 | 0.348778 | 16869.507812 | Cluster 1 |
Assign labels to new data:
predictions = s.predict_model(kmeans, data = data)
predictions.head()
| | Age | Income | SpendingScore | Savings | Cluster |
---|---|---|---|---|---|
0 | -0.042287 | 0.062733 | 1.103593 | -1.072467 | Cluster 2 |
1 | -0.000821 | 0.174811 | 1.102641 | -1.303473 | Cluster 2 |
2 | 0.123577 | -0.021200 | 0.761727 | -0.526556 | Cluster 2 |
3 | -0.000821 | -0.031712 | 1.004705 | -0.913395 | Cluster 2 |
4 | 1.160228 | -1.606165 | -0.602619 | 1.012686 | Cluster 1 |
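1.4 Anomaly Detection
PyCaret's anomaly detection module is an unsupervised machine learning module for identifying rare items, events, or observations that differ significantly from the majority of the data. Such anomalies often correspond to a problem of some kind, such as bank fraud, a structural defect, a medical problem, or an error. The anomaly detection module is used in much the same way as the clustering module.
Data preparation
from pycaret.datasets import get_data
data = get_data('./datasets/anomaly')
# data = get_data('anomaly')
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 |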
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 |
from pycaret.anomaly import AnomalyExperiment
s = AnomalyExperiment()
s.setup(data, session_id = 0)
# Alternative way to set up the experiment
# from pycaret.anomaly import *
# s = setup(data, session_id = 0)
| | Description | Value |
---|---|---|
0 | Session id | 0 |
1 | Original data shape | (1000, 10) |
2 | Transformed data shape | (1000, 10) |
3 | Numeric features | 10 |
4 | Preprocess | True |
5 | Imputation type | simple |
6 | Numeric imputation | mean |
7 | Categorical imputation | mode |
8 | CPU Jobs | -1 |
9 | Use GPU | False |
10 | Log Experiment | False |
11 | Experiment Name | anomaly-default-name |
12 | USI | 54db |
<pycaret.anomaly.oop.AnomalyExperiment at 0x200e14f5250>
Model creation
iforest = s.create_model('iforest')
print(iforest)
IForest(behaviour='new', bootstrap=False, contamination=0.05,
max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
random_state=0, verbose=0)
The models supported by the anomaly detection module are listed below:
s.models()
ID | Name | Reference |
---|---|---|
abod | Angle-base Outlier Detection | pyod.models.abod.ABOD |
cluster | Clustering-Based Local Outlier | pycaret.internal.patches.pyod.CBLOFForceToDouble |
cof | Connectivity-Based Local Outlier | pyod.models.cof.COF |
iforest | Isolation Forest | pyod.models.iforest.IForest |
histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
lof | Local Outlier Factor | pyod.models.lof.LOF |
svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
pca | Principal Component Analysis | pyod.models.pca.PCA |
mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |
sod | Subspace Outlier Detection | pyod.models.sod.SOD |
sos | Stochastic Outlier Selection | pyod.models.sos.SOS |
Label assignment and prediction
Assign anomaly labels to the training data:
result = s.assign_model(iforest)
result.head()
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | -0.016205 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | -0.068052 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 1 | 0.009221 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.056690 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | -0.012945 |
Assign labels to new data:
predictions = s.predict_model(iforest, data = data)
predictions.head()
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | -0.016205 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | -0.068052 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 1 | 0.009221 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.056690 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | -0.012945 |
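The relationship between Anomaly_Score, the Anomaly label, and the contamination parameter printed earlier (contamination=0.05) can be sketched as a simple cut-off: the highest-scoring fraction of rows is flagged. This is a simplified illustration and not PyOD's implementation; the function name label_anomalies is hypothetical, and the contamination of 0.4 is used only because the sample has five rows.

```python
def label_anomalies(scores, contamination=0.05):
    """Flag the highest-scoring fraction of samples as anomalies (1), the rest as normal (0)."""
    n_outliers = max(1, round(len(scores) * contamination))
    # The n-th largest score acts as the decision threshold; ties may flag extra rows.
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= threshold else 0 for s in scores]


# Scores of the first five rows shown above; contamination=0.4 is a hypothetical
# value chosen so that two of these five rows are flagged.
scores = [-0.016205, -0.068052, 0.009221, 0.056690, -0.012945]
flags = label_anomalies(scores, contamination=0.4)
print(flags)
```

Under this sketch, a contamination of 0.05 on the 1000-row dataset would flag about 50 rows.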
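1.5 Time Series Forecasting
PyCaret's Time Series module supports a variety of forecasting methods, such as ARIMA, exponential smoothing, and Prophet. It also provides features for handling missing values, time series decomposition, and data visualization.
Data preparation
# Airline passenger time series data
from pycaret.datasets import get_data
# Download URL: https://raw.githubusercontent.com/sktime/sktime/main/sktime/datasets/data/Airline/Airline.csv
data = get_data('./datasets/airline')
# data = get_data('airline')
| | Date | Passengers |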
---|---|---|
0 | 1949-01 | 112 |
1 | 1949-02 | 118 |
2 | 1949-03 | 132 |
3 | 1949-04 | 129 |
4 | 1949-05 | 121 |
import pandas as pd
data['Date'] = pd.to_datetime(data['Date'])
# Set Date as the index
data.set_index('Date', inplace=True)
from pycaret.time_series import TSForecastingExperiment
s = TSForecastingExperiment()
# fh: forecast horizon, i.e. how many steps ahead to forecast; the default is 1 (one step ahead). fold: number of cross-validation folds
s.setup(data, fh = 3, fold = 5, session_id = 0, verbose = False)
# from pycaret.time_series import *
# s = setup(data, fh = 3, fold = 5, session_id = 0)
<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x200dee26910>
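With fh = 3 and fold = 5, cross-validation for a time series cannot shuffle rows; each fold must train on the past and test on the next fh points. The sketch below shows one common scheme, an expanding window. It is an assumption that PyCaret's splitter produces exactly these windows; the function name expanding_window_splits is hypothetical, and the length of 141 assumes 144 monthly observations minus a 3-point hold-out.

```python
def expanding_window_splits(n_obs, fh=3, fold=5):
    """Yield (train_idx, test_idx) pairs where the training window grows by fh each fold."""
    first_train_end = n_obs - fh * fold
    if first_train_end <= 0:
        raise ValueError("not enough observations for the requested fh and fold")
    for i in range(fold):
        train_end = first_train_end + i * fh
        # Train on everything before train_end, test on the next fh points.
        yield list(range(train_end)), list(range(train_end, train_end + fh))


splits = list(expanding_window_splits(141, fh=3, fold=5))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

Every test window lies strictly after its training window, which avoids leaking future values into the model.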
Model training and evaluation
best = s.compare_models()
| | Model | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 | TT (Sec) |
---|---|---|---|---|---|---|---|---|---|
stlf | STLF | 0.4240 | 0.4429 | 12.8002 | 15.1933 | 0.0266 | 0.0268 | 0.4296 | 0.0300 |
exp_smooth | Exponential Smoothing | 0.5063 | 0.5378 | 15.2900 | 18.4455 | 0.0334 | 0.0335 | -0.0521 | 0.0500 |
ets | ETS | 0.5520 | 0.5801 | 16.6164 | 19.8391 | 0.0354 | 0.0357 | -0.0740 | 0.0680 |
arima | ARIMA | 0.6480 | 0.6501 | 19.5728 | 22.3027 | 0.0412 | 0.0420 | -0.0796 | 0.0420 |
auto_arima | Auto ARIMA | 0.6526 | 0.6300 | 19.7405 | 21.6202 | 0.0414 | 0.0421 | -0.0560 | 10.5220 |
theta | Theta Forecaster | 0.8458 | 0.8223 | 25.7024 | 28.3332 | 0.0524 | 0.0541 | -0.7710 | 0.0220 |
huber_cds_dt | Huber w/ Cond. Deseasonalize & Detrending | 0.9002 | 0.8900 | 27.2568 | 30.5782 | 0.0550 | 0.0572 | -0.0309 | 0.0680 |
knn_cds_dt | K Neighbors w/ Cond. Deseasonalize & Detrending | 0.9381 | 0.8830 | 28.5678 | 30.5007 | 0.0555 | 0.0575 | 0.0908 | 0.0920 |
lr_cds_dt | Linear w/ Cond. Deseasonalize & Detrending | 0.9469 | 0.9297 | 28.6337 | 31.9163 | 0.0581 | 0.0605 | -0.1620 | 0.0820 |
ridge_cds_dt | Ridge w/ Cond. Deseasonalize & Detrending | 0.9469 | 0.9297 | 28.6340 | 31.9164 | 0.0581 | 0.0605 | -0.1620 | 0.0680 |
en_cds_dt | Elastic Net w/ Cond. Deseasonalize & Detrending | 0.9499 | 0.9320 | 28.7271 | 31.9952 | 0.0582 | 0.0606 | -0.1579 | 0.0700 |
llar_cds_dt | Lasso Least Angular Regressor w/ Cond. Deseasonalize & Detrending | 0.9520 | 0.9336 | 28.7917 | 32.0528 | 0.0583 | 0.0607 | -0.1559 | 0.0560 |
lasso_cds_dt | Lasso w/ Cond. Deseasonalize & Detrending | 0.9521 | 0.9337 | 28.7941 | 32.0557 | 0.0583 | 0.0607 | -0.1560 | 0.0720 |
br_cds_dt | Bayesian Ridge w/ Cond. Deseasonalize & Detrending | 0.9551 | 0.9347 | 28.9018 | 32.1013 | 0.0582 | 0.0606 | -0.1377 | 0.0580 |
et_cds_dt | Extra Trees w/ Cond. Deseasonalize & Detrending | 1.0322 | 0.9942 | 31.4048 | 34.3054 | 0.0607 | 0.0633 | -0.1660 | 0.1280 |
rf_cds_dt | Random Forest w/ Cond. Deseasonalize & Detrending | 1.0851 | 1.0286 | 32.9791 | 35.4666 | 0.0641 | 0.0670 | -0.3545 | 0.1400 |
lightgbm_cds_dt | Light Gradient Boosting w/ Cond. Deseasonalize & Detrending | 1.1409 | 1.1040 | 34.5999 | 37.9918 | 0.0670 | 0.0701 | -0.3994 | 0.0900 |
ada_cds_dt | AdaBoost w/ Cond. Deseasonalize & Detrending | 1.1441 | 1.0843 | 34.7451 | 37.3681 | 0.0664 | 0.0697 | -0.3004 | 0.0920 |
gbr_cds_dt | Gradient Boosting w/ Cond. Deseasonalize & Detrending | 1.1697 | 1.1094 | 35.4408 | 38.1373 | 0.0697 | 0.0729 | -0.4163 | 0.0900 |
omp_cds_dt | Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending | 1.1793 | 1.1250 | 35.7348 | 38.6755 | 0.0706 | 0.0732 | -0.5095 | 0.0620 |
dt_cds_dt | Decision Tree w/ Cond. Deseasonalize & Detrending | 1.2704 | 1.2371 | 38.4976 | 42.4846 | 0.0773 | 0.0814 | -1.0382 | 0.0860 |
snaive | Seasonal Naive Forecaster | 1.7700 | 1.5999 | 53.5333 | 54.9143 | 0.1136 | 0.1211 | -4.1630 | 0.1580 |
naive | Naive Forecaster | 1.8145 | 1.7444 | 54.8667 | 59.8160 | 0.1135 | 0.1151 | -3.7710 | 0.1460 |
polytrend | Polynomial Trend Forecaster | 2.3154 | 2.2507 | 70.1138 | 77.3400 | 0.1363 | 0.1468 | -4.6202 | 0.1080 |
croston | Croston | 2.6211 | 2.4985 | 79.3645 | 85.8439 | 0.1515 | 0.1684 | -5.2294 | 0.0140 |
grand_means | Grand Means Forecaster | 7.1261 | 6.3506 | 216.0214 | 218.4259 | 0.4377 | 0.5682 | -59.2684 | 0.1400 |
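The table is sorted by MASE, the mean absolute scaled error. A minimal sketch of the usual definition follows: the forecast's mean absolute error divided by the in-sample error of a (seasonal) naive forecast. The function name mase and the sample numbers are hypothetical, and the exact seasonal period PyCaret uses for scaling is not stated in the source.

```python
def mase(y_train, y_true, y_pred, sp=1):
    """Mean absolute scaled error with a naive forecast of seasonal period sp as the scale."""
    naive_errors = [abs(y_train[i] - y_train[i - sp]) for i in range(sp, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / scale


# Hypothetical series: the in-sample naive error is 2 at every step.
y_train = [10, 12, 14, 16, 18]
score = mase(y_train, y_true=[20, 22], y_pred=[19, 24], sp=1)
print(score)
```

Values below 1 mean the forecast errors are smaller than the in-sample naive errors; values above 1 mean they are larger.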
Visualization
# Interactive visualization in a Jupyter environment
# Supported values for the plot parameter:
# - 'ts' - Time Series Plot
# - 'train_test_split' - Train Test Split
# - 'cv' - Cross Validation
# - 'acf' - Auto Correlation (ACF)
# - 'pacf' - Partial Auto Correlation (PACF)
# - 'decomp' - Classical Decomposition
# - 'decomp_stl' - STL Decomposition
# - 'diagnostics' - Diagnostics Plot
# - 'diff' - Difference Plot
# - 'periodogram' - Frequency Components (Periodogram)
# - 'fft' - Frequency Components (FFT)
# - 'ccf' - Cross Correlation (CCF)
# - 'forecast' - "Out-of-Sample" Forecast Plot
# - 'insample' - "In-Sample" Forecast Plot
# - 'residuals' - Residuals Plot
# s.plot_model(best, plot = 'forecast', data_kwargs = {'fh' : 24})
Prediction
# Refit the model on the complete dataset, including the test samples
final_best = s.finalize_model(best)
s.predict_model(best, fh = 24)
s.save_model(final_best, 'final_best_model')
Transformation Pipeline and Model Successfully Saved
(ForecastingPipeline(steps=[('forecaster',
TransformedTargetForecaster(steps=[('model',
ForecastingPipeline(steps=[('forecaster',
TransformedTargetForecaster(steps=[('model',
STLForecaster(sp=12))]))]))]))]),
'final_best_model.pkl')
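2 Data Processing and Cleaning
2.1 Missing Value Handling
Datasets can contain missing values or empty records for many reasons. Removing samples with missing values is a common strategy, but it discards potentially valuable data. An alternative is to impute the missing values. The following setup parameters control missing value handling:
- imputation_type: 'simple', 'iterative', or None. When set to 'simple', PyCaret fills missing values using simple strategies (numeric_imputation and categorical_imputation). When set to 'iterative', it fills them using model-based estimation (numeric_iterative_imputer and categorical_iterative_imputer). When set to None, no imputation is performed
- numeric_imputation: strategy for missing numeric values:
  - mean: fill with the column mean (default)
  - drop: drop rows that contain missing values
  - median: fill with the column median
  - mode: fill with the most frequent value in the column
  - knn: estimate the value with the K-nearest neighbors method
  - int or float: replace with the given value
- categorical_imputation: strategy for missing categorical values:
  - mode: fill with the most frequent value in the column (default)
  - drop: drop rows that contain missing values
  - str: replace with the given string
- numeric_iterative_imputer: estimator used to impute numeric values; accepts a str or an sklearn model, defaults to lightgbm
- categorical_iterative_imputer: estimator used to impute categorical values; accepts a str or an sklearn model, defaults to lightgbm
Load the data
# load dataset
from pycaret.datasets import get_data
# Load the data from a local file; note that dataset is the file name of the data
data = get_data(dataset='./datasets/hepatitis', verbose=False)
# data = get_data('hepatitis',verbose=False)
# The row with index 3 has a NaN value in the STEROID column
data.head()
| | Class | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY |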
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 30 | 2 | 1.0 | 2 | 2 | 2 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1 |
1 | 0 | 50 | 1 | 1.0 | 2 | 1 | 2 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1 |
2 | 0 | 78 | 1 | 2.0 | 2 | 1 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1 |
3 | 0 | 31 | 1 | NaN | 1 | 2 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1 |
4 | 0 | 34 | 1 | 2.0 | 2 | 2 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1 |
# Fill missing values with the mean
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
# Mean of the column
# s.data['STEROID'].mean()
s.setup(data = data, session_id=0, target = 'Class',verbose=False,
# Set data_split_shuffle and data_split_stratify to False so the data is not shuffled
data_split_shuffle = False, data_split_stratify = False,
imputation_type='simple', numeric_imputation='mean')
# View the transformed data
s.get_config('dataset_transformed').head()
| | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30.0 | 2.0 | 1.000000 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.000000 | 18.0 | 4.0 | 66.53968 | 1.0 | 0 |
1 | 50.0 | 1.0 | 1.000000 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.000000 | 42.0 | 3.5 | 66.53968 | 1.0 | 0 |
2 | 78.0 | 1.0 | 2.000000 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.000000 | 32.0 | 4.0 | 66.53968 | 1.0 | 0 |
3 | 31.0 | 1.0 | 1.509434 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.000000 | 52.0 | 4.0 | 80.00000 | 1.0 | 0 |
4 | 34.0 | 1.0 | 2.000000 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 99.659088 | 200.0 | 4.0 | 66.53968 | 1.0 | 0 |
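Mean imputation itself is simple enough to show in a few lines. The sketch below fills NaN entries of one column with the mean of the observed entries. It is a standard-library illustration only; the function name impute_mean and the five sample values are hypothetical.

```python
import math


def impute_mean(column):
    """Replace NaN entries with the mean of the non-missing entries."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]


# Hypothetical STEROID-like column with one missing value.
steroid = [1.0, 1.0, 2.0, float("nan"), 2.0]
filled = impute_mean(steroid)
print(filled)
```

In the transformed data above, every missing PROTIME entry receives the same value (66.53968), which is the signature of a single column-wide statistic being used as the fill value.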
# Impute missing values with KNN
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
s.setup(data = data, session_id=0, target = 'Class',verbose=False,
# Set data_split_shuffle and data_split_stratify to False so the data is not shuffled
data_split_shuffle = False, data_split_stratify = False,
imputation_type='simple', numeric_imputation = 'knn')
# View the transformed data
s.get_config('dataset_transformed').head()
| | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.000000 | 18.0 | 4.0 | 91.800003 | 1.0 | 0 |
1 | 50.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.000000 | 42.0 | 3.5 | 61.599998 | 1.0 | 0 |
2 | 78.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.000000 | 32.0 | 4.0 | 75.800003 | 1.0 | 0 |
3 | 31.0 | 1.0 | 1.8 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.000000 | 52.0 | 4.0 | 80.000000 | 1.0 | 0 |
4 | 34.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 108.400002 | 200.0 | 4.0 | 62.799999 | 1.0 | 0 |
# Impute missing values with LightGBM
# from pycaret.classification import ClassificationExperiment
# s = ClassificationExperiment()
# s.setup(data = data, session_id=0, target = 'Class',verbose=False,
#         # Set data_split_shuffle and data_split_stratify to False so the rows are not shuffled
#         data_split_shuffle = False, data_split_stratify = False,
#         imputation_type='iterative', numeric_iterative_imputer = 'lightgbm')
# Inspect the transformed data
# s.get_config('dataset_transformed').head()
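2.2 Type conversion
PyCaret infers feature types automatically, but it also exposes parameters for declaring them explicitly. These give finer control over the dataset so that feature engineering and model training behave as intended. The parameters are:
- numeric_features: the columns to treat as numeric features. They are handled as continuous variables
- categorical_features: the columns to treat as categorical features. They are handled as discrete variables
- date_features: the columns to treat as date features. They are handled as date variables
- create_date_columns: which new date-related columns to derive from the date features
- text_features: the columns to treat as text features. They are handled as text variables
- text_features_method: the method used to process text features
- ignore_features: the columns to ignore during modelling
- keep_features: the columns to keep during modelling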
# Convert variable types
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/hepatitis', verbose=False)
from pycaret.classification import *
s = setup(data = data, target = 'Class', ignore_features = ['SEX','AGE'], categorical_features=['STEROID'],verbose = False,
          data_split_shuffle = False, data_split_stratify = False)
# Inspect the transformed data: SEX and AGE are gone and STEROID is now a categorical variable
s.get_config('dataset_transformed').head()
STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY | Class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.000000 | 18.0 | 4.0 | 66.53968 | 1.0 | 0 |
1 | 0.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.000000 | 42.0 | 3.5 | 66.53968 | 1.0 | 0 |
2 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.000000 | 32.0 | 4.0 | 66.53968 | 1.0 | 0 |
3 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.000000 | 52.0 | 4.0 | 80.00000 | 1.0 | 0 |
4 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 99.659088 | 200.0 | 4.0 | 66.53968 | 1.0 | 0 |
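The example above exercises only ignore_features and categorical_features. As a rough picture of what the type declarations ask for, the following pandas sketch applies the same ideas by hand to a toy frame. The frame, its column names, and the derived signup_* names are made up for illustration, and expanding a date into day, month, and year is an assumption about a typical date expansion, not a description of PyCaret's internals.

```python
import pandas as pd

# Toy frame; every column name here is hypothetical.
df = pd.DataFrame({
    "signup": ["2024-01-15", "2023-12-31", "2024-02-29"],
    "plan": ["basic", "pro", "basic"],
    "row_id": [101, 102, 103],
    "churned": [0, 1, 0],
})

# ignore_features: the column is dropped before modelling.
features = df.drop(columns=["row_id"])

# categorical_features: treat the column as discrete labels.
features["plan"] = features["plan"].astype("category")

# date_features + create_date_columns: parse the date, derive its parts,
# and discard the raw string column.
signup = pd.to_datetime(features.pop("signup"))
for part in ["day", "month", "year"]:
    features[f"signup_{part}"] = getattr(signup.dt, part)

print(features)
```

Declaring the types up front is mainly useful when automatic inference would guess wrong, for example an integer-coded category such as STEROID.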
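2.3 One-hot encoding
When a dataset contains categorical variables, they usually have to be converted into a numeric form that a model can use. One-hot encoding is a common approach: it turns each categorical variable into a set of binary variables, one per possible category, so that for any given row exactly one of them is 1 and the rest are 0. Pass the categorical_features parameter to specify which columns to one-hot encode. For example: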
# load dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/pokemon', verbose=False)
# data = get_data('pokemon')
data.head()
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
# One-hot encode Type 1; first count its distinct values
len(set(data['Type 1']))
18
from pycaret.classification import *
s = setup(data = data, categorical_features =["Type 1"],target = 'Legendary', verbose=False)
# Inspect the transformed data: Type 1 is now one-hot encoded
s.get_config('dataset_transformed').head()
# | Name | Type 1_Grass | Type 1_Ghost | Type 1_Water | Type 1_Steel | Type 1_Psychic | Type 1_Fire | Type 1_Poison | Type 1_Fairy | ... | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
202 | 187.0 | Hoppip | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Flying | 250.0 | 35.0 | 35.0 | 40.0 | 35.0 | 55.0 | 50.0 | 2.0 | False |
477 | 429.0 | Mismagius | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | 495.0 | 60.0 | 60.0 | 60.0 | 105.0 | 105.0 | 105.0 | 4.0 | False |
349 | 319.0 | SharpedoMega Sharpedo | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Dark | 560.0 | 70.0 | 140.0 | 70.0 | 110.0 | 65.0 | 105.0 | 3.0 | False |
777 | 707.0 | Klefki | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Fairy | 470.0 | 57.0 | 80.0 | 91.0 | 80.0 | 87.0 | 75.0 | 6.0 | False |
50 | 45.0 | Vileplume | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Poison | 490.0 | 75.0 | 80.0 | 85.0 | 110.0 | 90.0 | 50.0 | 1.0 | False |
5 rows × 30 columns
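The column count above fits the description: the original 13 columns lose Type 1 and gain 18 indicator columns, giving 13 - 1 + 18 = 30. The encoding itself can be reproduced by hand on a few values. The sketch below uses plain NumPy and a short hypothetical list standing in for the Type 1 column. It illustrates the idea and is not PyCaret's encoder.

```python
import numpy as np

# Hypothetical values standing in for the "Type 1" column.
type_1 = np.array(["Grass", "Fire", "Water", "Grass", "Fire"])

# np.unique returns the sorted categories and, for each row,
# the index of the category that row belongs to.
categories, codes = np.unique(type_1, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]

for name, column in zip(categories, one_hot.T):
    print(f"Type 1_{name}: {column}")
```

Each row of one_hot sums to 1, which is the "exactly one variable is 1" property described above.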
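2.4 Data balancing
In PyCaret, fix_imbalance and fix_imbalance_method are the two parameters for handling imbalanced datasets. They preprocess the data before model training to address class imbalance.
- fix_imbalance: a boolean parameter that indicates whether to treat the imbalance. When set to True, PyCaret applies a resampling method to mitigate the class imbalance. When set to False, PyCaret trains on the original, imbalanced data
- fix_imbalance_method: specifies how the imbalance is handled. Possible values include:
  - SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic samples to balance the classes (the default, smote)
  - an estimator provided by imbalanced-learn
# Load the data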
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/credit', verbose=False)
# data = get_data('credit')
data.head()
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
1 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
2 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
3 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
4 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 19394.0 | 19619.0 | 20024.0 | 2500.0 | 1815.0 | 657.0 | 1000.0 | 1000.0 | 800.0 | 0 |
5 rows × 24 columns
# Count the rows in each class
category_counts = data['default'].value_counts()
category_counts
default
0 18694
1 5306
Name: count, dtype: int64
from pycaret.classification import *
s = setup(data = data, target = 'default', fix_imbalance = True, verbose = False)
# Class 1 now has more rows
s.get_config('dataset_transformed')['default'].value_counts()
default
0 18694
1 14678
Name: count, dtype: int64
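The class-1 count of 14678 is consistent with oversampling only the training split: assuming a 70/30 stratified split, roughly 13086 class-0 rows fall into the training set, class 1 is raised to the same number there, and the roughly 1592 class-1 rows left in the test set bring the total to 14678. The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in NumPy as follows. This is a simplified illustration on synthetic data with a made-up function name, not the imbalanced-learn implementation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)            # random minority rows
    mate = neighbours[base, rng.integers(0, k, size=n_new)]   # one of their k neighbours
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[mate] - X_min[base])

rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(40, 2))   # 40 majority rows
X_minor = rng.normal(3.0, 0.5, size=(10, 2))   # 10 minority rows

synthetic = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, synthetic])
print(len(X_major), len(X_minor_balanced))
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.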
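2.5 Outlier handling
PyCaret's remove_outliers parameter identifies and removes outliers from the dataset before the model is trained. The detection method is set with outliers_method; in PyCaret 3.x the default is 'iforest' (Isolation Forest), whereas the documentation of earlier releases described PCA linear dimensionality reduction based on singular value decomposition. The outliers_threshold parameter of setup controls the proportion of outliers (default 0.05).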
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/insurance', verbose=False)
# data = get_data('insurance')
# Data dimensions
data.shape
(1338, 7)
from pycaret.regression import *
s = setup(data = data, target = 'charges', remove_outliers = True ,verbose = False, outliers_threshold = 0.02)
# After the outliers are removed, the dataset has fewer rows
s.get_config('dataset_transformed').shape
(1319, 10)
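The row count drops from 1338 to 1319, that is, 19 rows. This is roughly 2% of a 70% training split (about 936 rows), which fits outliers_threshold = 0.02 if outliers are removed from the training data only. That reading is an assumption based on the numbers. The sketch below shows how a threshold of this kind acts on anomaly scores. The function name is made up, and the score used here (distance from the column medians) is a deliberately simple stand-in, not the detector PyCaret uses.

```python
import numpy as np

def drop_most_anomalous(X, scores, threshold):
    """Remove the `threshold` fraction of rows with the highest anomaly score."""
    n_drop = int(np.ceil(threshold * len(X)))
    keep = np.argsort(scores)[: len(X) - n_drop]   # indices of the lowest scores
    return X[np.sort(keep)]                         # preserve the original row order

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:2] += 10.0                                       # two obviously extreme rows

# Toy anomaly score (an assumption, NOT Isolation Forest):
# Euclidean distance from the column medians.
scores = np.linalg.norm(X - np.median(X, axis=0), axis=1)

X_clean = drop_most_anomalous(X, scores, threshold=0.02)
print(X.shape, "->", X_clean.shape)
```

A larger threshold removes more rows regardless of how extreme they are, so it is worth checking the dropped rows before trusting the setting.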
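2.6 Feature importance
Feature importance is used here to select the features in the dataset that contribute most to predicting the target variable. Compared with using every feature, training on a selected subset can reduce the risk of overfitting, improve accuracy, and shorten training time. In PyCaret this is done through the feature_selection parameter. The parameters related to feature selection are:
- feature_selection: whether to perform feature selection during model training. It can be True or False.
- feature_selection_method: the feature selection method:
  - 'univariate': uses sklearn's SelectKBest, which picks the features most related to the target variable based on a statistical test.
  - 'classic' (default): uses sklearn's SelectFromModel, which picks the most important features from the feature importances or coefficients of a supervised model.
  - 'sequential': uses sklearn's SequentialFeatureSelector, which selects features step by step according to a given algorithm (forward selection, backward selection, and so on) and a performance metric (such as the cross-validation score).
- n_features_to_select: the maximum number of features to select, or a fraction of them. If < 1, it is the fraction of the starting features. The default is 0.2. Features listed in ignore_features and keep_features are not counted.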
from pycaret.datasets import get_data
data = get_data('./datasets/diabetes')
Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | Class variable | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
# 'Class variable' is a binary label, so the classification module is used here
from pycaret.classification import *
# feature_selection enables feature selection; n_features_to_select sets the fraction of features to keep
s = setup(data = data, target = 'Class variable', feature_selection = True, feature_selection_method = 'univariate',
          n_features_to_select = 0.3, verbose = False)
# Check which features were kept
s.get_config('X_transformed').columns
s.get_config('X_transformed').head()
Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Body mass index (weight in kg/(height in m)^2) | |
---|---|---|
56 | 187.0 | 37.700001 |
541 | 128.0 | 32.400002 |
269 | 146.0 | 27.500000 |
304 | 150.0 | 21.000000 |
32 | 88.0 | 24.799999 |
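With 8 starting features and n_features_to_select = 0.3, two features survive, which matches the two columns shown above. The univariate idea can be sketched without sklearn by ranking columns on the absolute Pearson correlation with the target. This is a simplified stand-in for SelectKBest: for a numeric or binary target the usual F-statistic is a monotone function of the squared correlation, so the ranking agrees, but the function name, the synthetic data, and the rounding of the fraction are assumptions made for this illustration.

```python
import numpy as np

def select_k_by_correlation(X, y, fraction):
    """Keep the top `fraction` of columns ranked by |Pearson correlation| with y."""
    k = max(1, int(fraction * X.shape[1]))            # e.g. 0.3 * 8 columns -> 2
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    top = np.argsort(-np.abs(corr))[:k]
    return np.sort(top), corr

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only columns 1 and 5 drive the target in this synthetic data.
y = 2.0 * X[:, 1] - 1.5 * X[:, 5] + rng.normal(scale=0.1, size=200)

selected, corr = select_k_by_correlation(X, y, fraction=0.3)
print(selected)
```

Univariate scoring looks at one feature at a time, so it cannot notice redundant features or interactions; the 'classic' and 'sequential' methods trade extra computation for that.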
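2.7 Normalization
Data normalization
In PyCaret, the normalize and normalize_method parameters control feature scaling during preprocessing. Feature scaling rescales feature values proportionally so that they fall within a small, specific range. This removes the effect of differing units between features and makes model training more stable and accurate. The two parameters are:
- normalize: a boolean parameter that specifies whether to scale the features. The default is False, meaning no scaling. Setting it to True enables feature scaling.
- normalize_method: a string parameter that specifies the scaling method. Possible values:
  - zscore (default): Z-score standardization. The mean is subtracted from each feature value and the result is divided by the standard deviation, so the feature has mean 0 and standard deviation 1.
  - minmax: Min-max scaling. Feature values are mapped linearly onto a given minimum and maximum, [0, 1] by default.
  - maxabs: MaxAbs scaling. Each feature value is divided by the feature's maximum absolute value, which scales the feature into [-1, 1].
  - robust: RobustScaler. Each feature is centered and scaled using its median and interquartile range.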
from pycaret.datasets import get_data
data = get_data('./datasets/pokemon')
data.head()
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
# Normalize
from pycaret.classification import *
s = setup(data, target='Legendary', normalize=True, normalize_method='robust', verbose=False)
Normalization result:
s.get_config('X_transformed').head()
# | Name | Type 1_Water | Type 1_Normal | Type 1_Ice | Type 1_Psychic | Type 1_Fire | Type 1_Rock | Type 1_Fighting | Type 1_Grass | ... | Type 2_Electric | Type 2_Normal | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
403 | -0.021629 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.195387 | -0.333333 | 0.200000 | 0.875 | 1.088889 | 0.125 | -0.288889 | 0.000000 |
471 | 0.139870 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.179104 | 0.333333 | 0.555556 | -0.100 | -0.111111 | -0.100 | 1.111111 | 0.333333 |
238 | -0.448450 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -1.080054 | -0.500000 | -0.555556 | -0.750 | -0.777778 | -1.000 | -0.333333 | -0.333333 |
646 | 0.604182 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -0.618725 | -0.166667 | -0.333333 | -0.500 | -0.555556 | -0.500 | 0.222222 | 0.666667 |
69 | -0.898342 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -0.265943 | -0.833333 | -0.888889 | -1.000 | 1.222222 | 0.000 | 0.888889 | -0.666667 |
5 rows × 46 columns
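The four scaling methods listed above reduce to short formulas. The NumPy sketch below applies each of them to the HP values from the first five rows of the pokemon table. It is meant to show the arithmetic only; the function name is made up, and PyCaret itself fits the scaler on the training data rather than on five rows.

```python
import numpy as np

def scale(x, method="zscore"):
    """Scale a 1-D array with one of the four methods listed above."""
    x = np.asarray(x, dtype=float)
    if method == "zscore":
        return (x - x.mean()) / x.std()
    if method == "minmax":
        return (x - x.min()) / (x.max() - x.min())
    if method == "maxabs":
        return x / np.abs(x).max()
    if method == "robust":
        q1, median, q3 = np.percentile(x, [25, 50, 75])
        return (x - median) / (q3 - q1)
    raise ValueError(f"unknown method: {method}")

hp = [45, 60, 80, 80, 39]   # HP of the first five rows
for method in ["zscore", "minmax", "maxabs", "robust"]:
    print(method, np.round(scale(hp, method), 3))
```

The robust variant depends only on the median and the quartiles, so a single extreme value shifts it far less than it shifts zscore or minmax.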
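Feature transformation
Normalization rescales the data into a new range, which reduces the influence of magnitude on the variance. Feature transformation is a more radical technique: it changes the shape of the distribution so that the transformed data follow, or approximately follow, a normal distribution. In PyCaret, the transformation parameter enables feature transformation and transformation_method selects the method: 'yeo-johnson' (default) or 'quantile'. Besides feature transformation there is also target transformation, which changes the shape of the target variable's distribution rather than that of the features. It is available only in the pycaret.regression module. Use transform_target to enable it and transform_target_method to choose the method.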
from pycaret.classification import *
s = setup(data = data, target = 'Legendary', transformation = True, verbose = False)
# Feature transformation result
s.get_config('X_transformed').head()
# | Name | Type 1_Psychic | Type 1_Water | Type 1_Rock | Type 1_Grass | Type 1_Dragon | Type 1_Ghost | Type 1_Bug | Type 1_Fairy | ... | Type 2_Electric | Type 2_Bug | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
165 | 52.899003 | 0.009216 | 0.043322 | -0.000000 | -0.000000 | -0.000000 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 93.118403 | 12.336844 | 23.649090 | 13.573010 | 10.692443 | 8.081703 | 26.134255 | 0.900773 |
625 | 140.730289 | 0.009216 | -0.000000 | 0.095739 | -0.000000 | -0.000000 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 66.091344 | 9.286671 | 20.259153 | 13.764668 | 8.160482 | 6.056644 | 9.552506 | 3.679456 |
628 | 141.283084 | 0.009216 | -0.000000 | -0.000000 | 0.043322 | -0.000000 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 89.747939 | 10.823299 | 29.105379 | 11.029571 | 11.203335 | 6.942091 | 27.793080 | 3.679456 |
606 | 137.396878 | 0.009216 | -0.000000 | -0.000000 | -0.000000 | 0.061897 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 56.560577 | 8.043018 | 10.276208 | 10.604937 | 6.949265 | 6.302465 | 19.943809 | 3.679456 |
672 | 149.303914 | 0.009216 | -0.000000 | -0.000000 | -0.000000 | -0.000000 | 0.029706 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 72.626190 | 10.202245 | 26.061259 | 11.435493 | 7.199607 | 6.302465 | 20.141156 | 3.679456 |
5 rows × 46 columns
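For reference, the Yeo-Johnson transform is a piecewise power function controlled by a parameter lambda. The sketch below implements the standard formula for a fixed lambda. In practice lambda is estimated separately for each feature, so the fixed values used here are assumptions chosen only to show how the function behaves.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Apply the Yeo-Johnson transform with a fixed lambda to a 1-D array."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative inputs: ((x + 1)^lambda - 1) / lambda, or log(x + 1) when lambda = 0
    if abs(lmbda) > 1e-12:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Negative inputs: -((1 - x)^(2 - lambda) - 1) / (2 - lambda), or -log(1 - x) when lambda = 2
    if abs(lmbda - 2.0) > 1e-12:
        out[~pos] = -(((1.0 - x[~pos]) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

x = np.array([-2.0, 0.0, 1.0, 10.0, 100.0])
print(yeo_johnson(x, 1.0))   # lambda = 1 leaves the data unchanged
print(yeo_johnson(x, 0.0))   # lambda = 0 compresses large positive values like a log
```

Unlike the Box-Cox transform, this formula is defined for zero and negative inputs, which is why it can be applied to arbitrary numeric features.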
3 References
- pycaret
- pycaret-docs
- pycaret-datasets
- lightgbm
- cuml
- imbalanced-learn