pip install pycaret
pip install pycaret[full]
- Extreme Gradient Boosting
- Catboost
- Light Gradient Boosting (requires lightgbm)
- Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, K Neighbors Regressor, Support Vector Machine, Linear Regression, Ridge Regression, Lasso Regression (requires cuML 0.15 or later)
# Check the PyCaret version
import pycaret
pycaret.__version__
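- 1 Quick start
- 1.1 Classification
- 1.2 Regression
- 1.3 Clustering
- 1.4 Anomaly detection
- 1.5 Time series forecasting
- 2 Data processing and cleaning
- 2.1 Missing value handling
- 2.2 Type conversion
- 2.3 One-hot encoding
- 2.4 Data balancing
- 2.5 Outlier handling
- 2.6 Feature importance
- 2.7 Normalization
- 3 References
1 Quick start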
Topic | NotebookLink |
Binary classification | link |
Multiclass classification | link |
Regression | link |
Clustering | link |
Anomaly detection | link |
Time series forecasting | link |
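1.1 Classification
from pycaret.datasets import get_data
# Load data from a local file; note that dataset is the file name of the data
data = get_data(dataset='./datasets/diabetes', verbose=False)
# Download a public dataset from the PyCaret open-source repository
# data = get_data('diabetes', verbose=False)
# Check the data type and dimensions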
type(data), data.shape
(pandas.core.frame.DataFrame, (768, 9))
# The last column indicates whether the patient has diabetes; the other columns are features
Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | Class variable | |
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
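# target: target column; session_id: random seed; preprocess: whether to preprocess the data; train_size: training set proportion; normalize: whether to normalize the data; normalize_method: normalization method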
s.setup(data, target = 'Class variable', session_id = 0, verbose= False, train_size = 0.7, normalize = True, normalize_method = 'minmax')
<pycaret.classification.oop.ClassificationExperiment at 0x200b939df40>
Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | |
34 | 0.588235 | 0.613065 | 0.639344 | 0.313131 | 0.000000 | 0.411326 | 0.185312 | 0.400000 |
221 | 0.117647 | 0.793970 | 0.737705 | 0.000000 | 0.000000 | 0.470939 | 0.310418 | 0.750000 |
531 | 0.000000 | 0.537688 | 0.622951 | 0.000000 | 0.000000 | 0.675112 | 0.259607 | 0.050000 |
518 | 0.764706 | 0.381910 | 0.491803 | 0.000000 | 0.000000 | 0.488823 | 0.043553 | 0.333333 |
650 | 0.058824 | 0.457286 | 0.442623 | 0.252525 | 0.118203 | 0.375559 | 0.066610 | 0.033333 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
628 | 0.294118 | 0.643216 | 0.655738 | 0.000000 | 0.000000 | 0.515648 | 0.028181 | 0.400000 |
456 | 0.058824 | 0.678392 | 0.442623 | 0.000000 | 0.000000 | 0.397914 | 0.260034 | 0.683333 |
398 | 0.176471 | 0.412060 | 0.573770 | 0.000000 | 0.000000 | 0.314456 | 0.132792 | 0.066667 |
6 | 0.176471 | 0.391960 | 0.409836 | 0.323232 | 0.104019 | 0.461997 | 0.072588 | 0.083333 |
294 | 0.000000 | 0.809045 | 0.409836 | 0.000000 | 0.000000 | 0.326379 | 0.075149 | 0.733333 |
537 rows × 8 columns
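To make the effect of normalize_method = 'minmax' concrete, here is a minimal pure-Python sketch of min-max scaling. It is not PyCaret's implementation (PyCaret fits its scaler on the training split), and the sample values and the assumed column extremes of 0 and 17 are hypothetical.

```python
def min_max_scale(values):
    """Rescale numbers to the range [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical pregnancy counts; 0 and 17 are assumed to be the column's extremes.
pregnancies = [0, 2, 10, 17]
scaled = min_max_scale(pregnancies)
print([round(v, 6) for v in scaled])
```

Under these assumptions a count of 10 maps to 10/17, about 0.588235, which has the same form as the values in the transformed table above.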
s.get_config('X_train_transformed')['Number of times pregnant'].hist()
from pycaret.classification import setup
# s = setup(data, target = 'Class variable', session_id = 0, preprocess = True, train_size = 0.7, verbose = False)
best = s.compare_models()
# Compare only selected models
# best = s.compare_models(include = ['dt', 'rf', 'et', 'gbc', 'lightgbm'])
# Return the n_select best models ranked by recall
# best_recall_models_top3 = s.compare_models(sort = 'Recall', n_select = 3)
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
lr | Logistic Regression | 0.7633 | 0.8132 | 0.4968 | 0.7436 | 0.5939 | 0.4358 | 0.4549 | 0.2720 |
ridge | Ridge Classifier | 0.7633 | 0.8113 | 0.5178 | 0.7285 | 0.6017 | 0.4406 | 0.4560 | 0.0090 |
lda | Linear Discriminant Analysis | 0.7633 | 0.8110 | 0.5497 | 0.7069 | 0.6154 | 0.4489 | 0.4583 | 0.0080 |
ada | Ada Boost Classifier | 0.7465 | 0.7768 | 0.5655 | 0.6580 | 0.6051 | 0.4208 | 0.4255 | 0.0190 |
svm | SVM - Linear Kernel | 0.7408 | 0.8087 | 0.5921 | 0.6980 | 0.6020 | 0.4196 | 0.4480 | 0.0080 |
nb | Naive Bayes | 0.7391 | 0.7939 | 0.5442 | 0.6515 | 0.5857 | 0.3995 | 0.4081 | 0.0080 |
rf | Random Forest Classifier | 0.7337 | 0.8033 | 0.5406 | 0.6331 | 0.5778 | 0.3883 | 0.3929 | 0.0350 |
et | Extra Trees Classifier | 0.7298 | 0.7899 | 0.5181 | 0.6416 | 0.5677 | 0.3761 | 0.3840 | 0.0450 |
gbc | Gradient Boosting Classifier | 0.7281 | 0.8007 | 0.5567 | 0.6267 | 0.5858 | 0.3857 | 0.3896 | 0.0260 |
lightgbm | Light Gradient Boosting Machine | 0.7242 | 0.7811 | 0.5827 | 0.6096 | 0.5935 | 0.3859 | 0.3876 | 0.0860 |
qda | Quadratic Discriminant Analysis | 0.7150 | 0.7875 | 0.4962 | 0.6225 | 0.5447 | 0.3428 | 0.3524 | 0.0080 |
knn | K Neighbors Classifier | 0.7131 | 0.7425 | 0.5287 | 0.6005 | 0.5577 | 0.3480 | 0.3528 | 0.2200 |
dt | Decision Tree Classifier | 0.6685 | 0.6461 | 0.5722 | 0.5266 | 0.5459 | 0.2868 | 0.2889 | 0.0100 |
dummy | Dummy Classifier | 0.6518 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0120 |
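The metric columns in the comparison table can be read more easily with their definitions at hand. The sketch below computes them from the four cells of a binary confusion matrix using the standard textbook formulas. The counts are hypothetical, and this is not the code PyCaret runs internally (it averages metrics over cross-validation folds).

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"Accuracy": accuracy, "Recall": recall, "Prec.": precision,
            "F1": f1, "Kappa": kappa, "MCC": mcc}


# Hypothetical counts for a model that finds 40 of 70 positives.
model = binary_metrics(tp=40, fp=10, fn=30, tn=120)
# A classifier that always predicts the majority (negative) class.
dummy = binary_metrics(tp=0, fp=0, fn=70, tn=130)
print(model)
print(dummy)
```

The second call shows why the Dummy Classifier row has a non-zero accuracy but zeros for Recall, Prec., F1, Kappa and MCC: it never predicts the positive class.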
best_ml = s.automl()
# best_ml
# Print the best-performing model
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=1000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
- pipeline: Schematic drawing of the preprocessing pipeline
- auc: Area Under the Curve
- threshold: Discrimination Threshold
- pr: Precision Recall Curve
- confusion_matrix: Confusion Matrix
- error: Class Prediction Error
- class_report: Classification Report
- boundary: Decision Boundary
- rfe: Recursive Feature Selection
- learning: Learning Curve
- manifold: Manifold Learning
- calibration: Calibration Curve
- vc: Validation Curve
- dimension: Dimension Learning
- feature: Feature Importance
- feature_all: Feature Importance (All)
- parameter: Model Hyperparameter
- lift: Lift Curve
- gain: Gain Chart
- tree: Decision Tree
- ks: KS Statistic Plot
# Pull the results table for all compared models
models_results = s.pull()
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
lr | Logistic Regression | 0.7633 | 0.8132 | 0.4968 | 0.7436 | 0.5939 | 0.4358 | 0.4549 | 0.272 |
ridge | Ridge Classifier | 0.7633 | 0.8113 | 0.5178 | 0.7285 | 0.6017 | 0.4406 | 0.4560 | 0.009 |
lda | Linear Discriminant Analysis | 0.7633 | 0.8110 | 0.5497 | 0.7069 | 0.6154 | 0.4489 | 0.4583 | 0.008 |
ada | Ada Boost Classifier | 0.7465 | 0.7768 | 0.5655 | 0.6580 | 0.6051 | 0.4208 | 0.4255 | 0.019 |
svm | SVM - Linear Kernel | 0.7408 | 0.8087 | 0.5921 | 0.6980 | 0.6020 | 0.4196 | 0.4480 | 0.008 |
nb | Naive Bayes | 0.7391 | 0.7939 | 0.5442 | 0.6515 | 0.5857 | 0.3995 | 0.4081 | 0.008 |
rf | Random Forest Classifier | 0.7337 | 0.8033 | 0.5406 | 0.6331 | 0.5778 | 0.3883 | 0.3929 | 0.035 |
et | Extra Trees Classifier | 0.7298 | 0.7899 | 0.5181 | 0.6416 | 0.5677 | 0.3761 | 0.3840 | 0.045 |
gbc | Gradient Boosting Classifier | 0.7281 | 0.8007 | 0.5567 | 0.6267 | 0.5858 | 0.3857 | 0.3896 | 0.026 |
lightgbm | Light Gradient Boosting Machine | 0.7242 | 0.7811 | 0.5827 | 0.6096 | 0.5935 | 0.3859 | 0.3876 | 0.086 |
qda | Quadratic Discriminant Analysis | 0.7150 | 0.7875 | 0.4962 | 0.6225 | 0.5447 | 0.3428 | 0.3524 | 0.008 |
knn | K Neighbors Classifier | 0.7131 | 0.7425 | 0.5287 | 0.6005 | 0.5577 | 0.3480 | 0.3528 | 0.220 |
dt | Decision Tree Classifier | 0.6685 | 0.6461 | 0.5722 | 0.5266 | 0.5459 | 0.2868 | 0.2889 | 0.010 |
dummy | Dummy Classifier | 0.6518 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.012 |
s.plot_model(best, plot = 'confusion_matrix')
# s.evaluate_model(best)
# Predict on the entire dataset
res = s.predict_model(best, data=data)
# View the per-row predictions
# res
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
0 | Logistic Regression | 0.7708 | 0.8312 | 0.5149 | 0.7500 | 0.6106 | 0.4561 | 0.4723 |
# Predict on the hold-out test set created during setup
res = s.predict_model(best)
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
0 | Logistic Regression | 0.7576 | 0.8553 | 0.5062 | 0.7193 | 0.5942 | 0.4287 | 0.4422 |
# Save the model locally
_ = s.save_model(best, 'best_model', verbose = False)
# Load the model
model = s.load_model('best_model')
# Inspect the model structure
# model
Transformation Pipeline and Model Successfully Loaded
# Predict on the entire dataset
res = s.predict_model(model, data=data)
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
0 | Logistic Regression | 0.7708 | 0.8312 | 0.5149 | 0.7500 | 0.6106 | 0.4561 | 0.4723 |
1.2 Regression
# Load the insurance charges sample dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/insurance', verbose=False)
# Download from the web
# data = get_data(dataset='insurance', verbose=False)
age | sex | bmi | children | smoker | region | charges | |
0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
# Create the data pipeline
from pycaret.regression import RegressionExperiment
s = RegressionExperiment()
# Predict the charges column
s.setup(data, target = 'charges', session_id = 0)
# Alternative way to create the data pipeline
# from pycaret.regression import *
# s = setup(data, target = 'charges', session_id = 0)
Description | Value | |
0 | Session id | 0 |
1 | Target | charges |
2 | Target type | Regression |
3 | Original data shape | (1338, 7) |
4 | Transformed data shape | (1338, 10) |
5 | Transformed train set shape | (936, 10) |
6 | Transformed test set shape | (402, 10) |
7 | Numeric features | 3 |
8 | Categorical features | 3 |
9 | Preprocess | True |
10 | Imputation type | simple |
11 | Numeric imputation | mean |
12 | Categorical imputation | mode |
13 | Maximum one-hot encoding | 25 |
14 | Encoding method | None |
15 | Fold Generator | KFold |
16 | Fold Number | 10 |
17 | CPU Jobs | -1 |
18 | Use GPU | False |
19 | Log Experiment | False |
20 | Experiment Name | reg-default-name |
21 | USI | eb9d |
<pycaret.regression.oop.RegressionExperiment at 0x200dedc2d30>
# Evaluate the available models
best = s.compare_models()
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
gbr | Gradient Boosting Regressor | 2723.2453 | 23787529.5872 | 4832.4785 | 0.8254 | 0.4427 | 0.3140 | 0.0550 |
lightgbm | Light Gradient Boosting Machine | 2998.1311 | 25738691.2181 | 5012.2404 | 0.8106 | 0.5525 | 0.3709 | 0.1140 |
rf | Random Forest Regressor | 2915.7018 | 26780127.0016 | 5109.5098 | 0.8031 | 0.4855 | 0.3520 | 0.0670 |
et | Extra Trees Regressor | 2841.8257 | 28559316.9533 | 5243.5828 | 0.7931 | 0.4671 | 0.3218 | 0.0670 |
ada | AdaBoost Regressor | 4180.2669 | 28289551.0048 | 5297.6817 | 0.7886 | 0.5935 | 0.6545 | 0.0210 |
ridge | Ridge Regression | 4304.2640 | 38786967.4768 | 6188.6966 | 0.7152 | 0.5794 | 0.4283 | 0.0230 |
lar | Least Angle Regression | 4293.9886 | 38781666.5991 | 6188.3301 | 0.7151 | 0.5893 | 0.4263 | 0.0210 |
llar | Lasso Least Angle Regression | 4294.2135 | 38780221.0039 | 6188.1906 | 0.7151 | 0.5891 | 0.4264 | 0.0200 |
br | Bayesian Ridge | 4299.8532 | 38785479.0984 | 6188.6026 | 0.7151 | 0.5784 | 0.4274 | 0.0200 |
lasso | Lasso Regression | 4294.2186 | 38780210.5665 | 6188.1898 | 0.7151 | 0.5892 | 0.4264 | 0.0370 |
lr | Linear Regression | 4293.9886 | 38781666.5991 | 6188.3301 | 0.7151 | 0.5893 | 0.4263 | 0.0350 |
dt | Decision Tree Regressor | 3550.6534 | 51149204.9032 | 7095.9170 | 0.6127 | 0.5839 | 0.4537 | 0.0230 |
huber | Huber Regressor | 3769.3076 | 53638697.2337 | 7254.7108 | 0.6095 | 0.4528 | 0.2187 | 0.0250 |
par | Passive Aggressive Regressor | 4144.7180 | 62949698.1775 | 7862.7604 | 0.5433 | 0.4634 | 0.2465 | 0.0210 |
en | Elastic Net | 7248.9376 | 89841235.9517 | 9405.5846 | 0.3534 | 0.7346 | 0.9238 | 0.0210 |
omp | Orthogonal Matching Pursuit | 8916.1927 | 130904492.3067 | 11356.4120 | 0.0561 | 0.8781 | 1.1598 | 0.0180 |
knn | K Neighbors Regressor | 8161.8875 | 137982796.8000 | 11676.3735 | -0.0011 | 0.8744 | 0.9742 | 0.0250 |
dummy | Dummy Regressor | 8892.4478 | 141597492.8000 | 11823.4271 | -0.0221 | 0.9868 | 1.4909 | 0.0210 |
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
init=None, learning_rate=0.1, loss='squared_error',
max_depth=3, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_iter_no_change=None,
random_state=0, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False)
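For reference, the regression metrics in the comparison table follow standard definitions. The sketch below computes them in plain Python for a handful of hypothetical values. It assumes non-negative targets and predictions (needed for RMSLE) and non-zero targets (needed for MAPE), and it is not PyCaret's internal code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R2, RMSLE and MAPE for paired lists of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R2 compares the model's squared error with that of predicting the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    # RMSLE works on log(1 + y), so it assumes non-negative values.
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2
            for t, p in zip(y_true, y_pred)) / n
    )
    # MAPE is a fraction of the true value, so it assumes non-zero targets.
    mape = sum(abs(e) / abs(t) for t, e in zip(y_true, errors)) / n

    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "R2": r2, "RMSLE": rmsle, "MAPE": mape}


# Hypothetical charges and predictions.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 320, 380]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

An R2 of 0 corresponds to always predicting the mean of the evaluated targets. A model that does worse than that gets a negative R2, which is how the K Neighbors and Dummy regressors end up slightly below zero in the table.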
1.3 Clustering
# Load the jewellery dataset
from pycaret.datasets import get_data
# Cluster the data based on its features
data = get_data('./datasets/jewellery')
# data = get_data('jewellery')
Age | Income | SpendingScore | Savings | |
0 | 58 | 77769 | 0.791329 | 6559.829923 |
1 | 59 | 81799 | 0.791082 | 5417.661426 |
2 | 62 | 74751 | 0.702657 | 9258.992965 |
3 | 59 | 74373 | 0.765680 | 7346.334504 |
4 | 87 | 17760 | 0.348778 | 16869.507130 |
# Create the data pipeline
from pycaret.clustering import ClusteringExperiment
s = ClusteringExperiment()
# normalize: normalize the data
s.setup(data, normalize = True, verbose = False)
# Alternative way to create the data pipeline
# from pycaret.clustering import *
# s = setup(data, normalize = True)
<pycaret.clustering.oop.ClusteringExperiment at 0x200dec86340>
kmeans = s.create_model('kmeans')
Silhouette | Calinski-Harabasz | Davies-Bouldin | Homogeneity | Rand Index | Completeness | |
0 | 0.7581 | 1611.2647 | 0.3743 | 0 | 0 | 0 |
Name | Reference | |
ID | ||
kmeans | K-Means Clustering | sklearn.cluster._kmeans.KMeans |
ap | Affinity Propagation | sklearn.cluster._affinity_propagation.Affinity... |
meanshift | Mean Shift Clustering | sklearn.cluster._mean_shift.MeanShift |
sc | Spectral Clustering | sklearn.cluster._spectral.SpectralClustering |
hclust | Agglomerative Clustering | sklearn.cluster._agglomerative.AgglomerativeCl... |
dbscan | Density-Based Spatial Clustering | sklearn.cluster._dbscan.DBSCAN |
optics | OPTICS Clustering | sklearn.cluster._optics.OPTICS |
birch | Birch Clustering | sklearn.cluster._birch.Birch |
# Check the number of clusters
KMeans(algorithm='lloyd', copy_x=True, init='k-means++', max_iter=300,
n_clusters=4, n_init='auto', random_state=1459, tol=0.0001, verbose=0)
# Interactive visualization in a Jupyter environment
# s.evaluate_model(kmeans)
# Visualize the results
# 'cluster' - Cluster PCA Plot (2d)
# 'tsne' - Cluster t-SNE (3d)
# 'elbow' - Elbow Plot
# 'silhouette' - Silhouette Plot
# 'distance' - Distance Plot
# 'distribution' - Distribution Plot
s.plot_model(kmeans, plot = 'elbow')
result = s.assign_model(kmeans)
Age | Income | SpendingScore | Savings | Cluster | |
0 | 58 | 77769 | 0.791329 | 6559.830078 | Cluster 2 |
1 | 59 | 81799 | 0.791082 | 5417.661621 | Cluster 2 |
2 | 62 | 74751 | 0.702657 | 9258.993164 | Cluster 2 |
3 | 59 | 74373 | 0.765680 | 7346.334473 | Cluster 2 |
4 | 87 | 17760 | 0.348778 | 16869.507812 | Cluster 1 |
predictions = s.predict_model(kmeans, data = data)
Age | Income | SpendingScore | Savings | Cluster | |
0 | -0.042287 | 0.062733 | 1.103593 | -1.072467 | Cluster 2 |
1 | -0.000821 | 0.174811 | 1.102641 | -1.303473 | Cluster 2 |
2 | 0.123577 | -0.021200 | 0.761727 | -0.526556 | Cluster 2 |
3 | -0.000821 | -0.031712 | 1.004705 | -0.913395 | Cluster 2 |
4 | 1.160228 | -1.606165 | -0.602619 | 1.012686 | Cluster 1 |
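The predict_model output above shows scaled features rather than raw ones because setup was called with normalize = True, whose default method in PyCaret is z-score standardization. Below is a minimal sketch of that transformation on five ages taken from the table. Since the real pipeline uses statistics from the whole dataset, the numbers it prints will not match the article's output.

```python
import math


def z_score(values):
    """Standardize values to zero mean and unit variance: (x - mean) / std."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by n), the convention used by
    # scikit-learn's StandardScaler.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in values]


# The first five Age values from the jewellery sample shown above.
ages = [58, 59, 62, 59, 87]
standardized = z_score(ages)
print([round(v, 4) for v in standardized])
```

After standardization every column has mean 0 and variance 1, so features measured on very different scales (Income versus SpendingScore, for instance) contribute comparably to the distance computations used by k-means.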
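1.4 Anomaly detection
PyCaret's anomaly detection module is an unsupervised machine learning module for identifying rare items, events, or observations that differ significantly from the majority of the data. Such anomalies usually point to some kind of problem, such as bank fraud, structural defects, medical issues, or errors. The anomaly detection module is used in much the same way as the clustering module.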
from pycaret.datasets import get_data
data = get_data('./datasets/anomaly')
# data = get_data('anomaly')
Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | |
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 |
from pycaret.anomaly import AnomalyExperiment
s = AnomalyExperiment()
s.setup(data, session_id = 0)
# Alternative way to set up the experiment
# from pycaret.anomaly import *
# s = setup(data, session_id = 0)
Description | Value | |
0 | Session id | 0 |
1 | Original data shape | (1000, 10) |
2 | Transformed data shape | (1000, 10) |
3 | Numeric features | 10 |
4 | Preprocess | True |
5 | Imputation type | simple |
6 | Numeric imputation | mean |
7 | Categorical imputation | mode |
8 | CPU Jobs | -1 |
9 | Use GPU | False |
10 | Log Experiment | False |
11 | Experiment Name | anomaly-default-name |
12 | USI | 54db |
<pycaret.anomaly.oop.AnomalyExperiment at 0x200e14f5250>
iforest = s.create_model('iforest')
IForest(behaviour='new', bootstrap=False, contamination=0.05,
max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
random_state=0, verbose=0)
The models supported by the anomaly detection module are listed below:
Name | Reference | |
ID | ||
abod | Angle-base Outlier Detection | pyod.models.abod.ABOD |
cluster | Clustering-Based Local Outlier | pycaret.internal.patches.pyod.CBLOFForceToDouble |
cof | Connectivity-Based Local Outlier | pyod.models.cof.COF |
iforest | Isolation Forest | pyod.models.iforest.IForest |
histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
lof | Local Outlier Factor | pyod.models.lof.LOF |
svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
pca | Principal Component Analysis | pyod.models.pca.PCA |
mcd | Minimum Covariance Determinant | |
sod | Subspace Outlier Detection | pyod.models.sod.SOD |
sos | Stochastic Outlier Selection | pyod.models.sos.SOS |
result = s.assign_model(iforest)
Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score | |
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | -0.016205 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | -0.068052 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 1 | 0.009221 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.056690 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | -0.012945 |
predictions = s.predict_model(iforest, data = data)
Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score | |
0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | -0.016205 |
1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | -0.068052 |
2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 1 | 0.009221 |
3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.056690 |
4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | -0.012945 |
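The contamination=0.05 value in the IForest output above is the expected share of anomalies. One simplified way to picture how a score column turns into a 0/1 Anomaly column is rank-based thresholding, sketched below. The scores are hypothetical and the function is an illustration only, not the PyOD or PyCaret implementation.

```python
def flag_anomalies(scores, contamination=0.05):
    """Label the highest-scoring fraction of points as anomalies (1), others as 0."""
    # Number of points to flag; always flag at least one.
    n_flagged = max(1, round(len(scores) * contamination))
    # The score of the n_flagged-th highest point acts as the threshold.
    threshold = sorted(scores, reverse=True)[n_flagged - 1]
    # Ties at the threshold are all flagged, so slightly more points may be labeled.
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical anomaly scores for 20 observations (higher means more anomalous).
scores = [-0.10 + 0.01 * i for i in range(20)]
labels = flag_anomalies(scores, contamination=0.05)
print(labels)
```

With 20 observations and a contamination of 0.05, exactly one point (the one with the highest score) is flagged. Raising contamination flags proportionally more points.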
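1.5 Time series forecasting
PyCaret's Time Series module supports a variety of forecasting methods, such as ARIMA, exponential smoothing, and Prophet. It also provides functionality for handling missing values, decomposing time series, and visualizing data.
# Airline passenger time series data
from pycaret.datasets import get_data
# Download path:
data = get_data('./datasets/airline')
# data = get_data('airline')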
Date | Passengers | |
0 | 1949-01 | 112 |
1 | 1949-02 | 118 |
2 | 1949-03 | 132 |
3 | 1949-04 | 129 |
4 | 1949-05 | 121 |
import pandas as pd
data['Date'] = pd.to_datetime(data['Date'])
# Set Date as the index
data.set_index('Date', inplace=True)
from pycaret.time_series import TSForecastingExperiment
s = TSForecastingExperiment()
# fh: the forecast horizon; defaults to 1, i.e. forecast one step ahead. fold: number of cross-validation folds
s.setup(data, fh = 3, fold = 5, session_id = 0, verbose = False)
# from pycaret.time_series import *
# s = setup(data, fh = 3, fold = 5, session_id = 0)
<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x200dee26910>
best = s.compare_models()
Model | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 | TT (Sec) | |
stlf | STLF | 0.4240 | 0.4429 | 12.8002 | 15.1933 | 0.0266 | 0.0268 | 0.4296 | 0.0300 |
exp_smooth | Exponential Smoothing | 0.5063 | 0.5378 | 15.2900 | 18.4455 | 0.0334 | 0.0335 | -0.0521 | 0.0500 |
ets | ETS | 0.5520 | 0.5801 | 16.6164 | 19.8391 | 0.0354 | 0.0357 | -0.0740 | 0.0680 |
arima | ARIMA | 0.6480 | 0.6501 | 19.5728 | 22.3027 | 0.0412 | 0.0420 | -0.0796 | 0.0420 |
auto_arima | Auto ARIMA | 0.6526 | 0.6300 | 19.7405 | 21.6202 | 0.0414 | 0.0421 | -0.0560 | 10.5220 |
theta | Theta Forecaster | 0.8458 | 0.8223 | 25.7024 | 28.3332 | 0.0524 | 0.0541 | -0.7710 | 0.0220 |
huber_cds_dt | Huber w/ Cond. Deseasonalize & Detrending | 0.9002 | 0.8900 | 27.2568 | 30.5782 | 0.0550 | 0.0572 | -0.0309 | 0.0680 |
knn_cds_dt | K Neighbors w/ Cond. Deseasonalize & Detrending | 0.9381 | 0.8830 | 28.5678 | 30.5007 | 0.0555 | 0.0575 | 0.0908 | 0.0920 |
lr_cds_dt | Linear w/ Cond. Deseasonalize & Detrending | 0.9469 | 0.9297 | 28.6337 | 31.9163 | 0.0581 | 0.0605 | -0.1620 | 0.0820 |
ridge_cds_dt | Ridge w/ Cond. Deseasonalize & Detrending | 0.9469 | 0.9297 | 28.6340 | 31.9164 | 0.0581 | 0.0605 | -0.1620 | 0.0680 |
en_cds_dt | Elastic Net w/ Cond. Deseasonalize & Detrending | 0.9499 | 0.9320 | 28.7271 | 31.9952 | 0.0582 | 0.0606 | -0.1579 | 0.0700 |
llar_cds_dt | Lasso Least Angular Regressor w/ Cond. Deseasonalize & Detrending | 0.9520 | 0.9336 | 28.7917 | 32.0528 | 0.0583 | 0.0607 | -0.1559 | 0.0560 |
lasso_cds_dt | Lasso w/ Cond. Deseasonalize & Detrending | 0.9521 | 0.9337 | 28.7941 | 32.0557 | 0.0583 | 0.0607 | -0.1560 | 0.0720 |
br_cds_dt | Bayesian Ridge w/ Cond. Deseasonalize & Detrending | 0.9551 | 0.9347 | 28.9018 | 32.1013 | 0.0582 | 0.0606 | -0.1377 | 0.0580 |
et_cds_dt | Extra Trees w/ Cond. Deseasonalize & Detrending | 1.0322 | 0.9942 | 31.4048 | 34.3054 | 0.0607 | 0.0633 | -0.1660 | 0.1280 |
rf_cds_dt | Random Forest w/ Cond. Deseasonalize & Detrending | 1.0851 | 1.0286 | 32.9791 | 35.4666 | 0.0641 | 0.0670 | -0.3545 | 0.1400 |
lightgbm_cds_dt | Light Gradient Boosting w/ Cond. Deseasonalize & Detrending | 1.1409 | 1.1040 | 34.5999 | 37.9918 | 0.0670 | 0.0701 | -0.3994 | 0.0900 |
ada_cds_dt | AdaBoost w/ Cond. Deseasonalize & Detrending | 1.1441 | 1.0843 | 34.7451 | 37.3681 | 0.0664 | 0.0697 | -0.3004 | 0.0920 |
gbr_cds_dt | Gradient Boosting w/ Cond. Deseasonalize & Detrending | 1.1697 | 1.1094 | 35.4408 | 38.1373 | 0.0697 | 0.0729 | -0.4163 | 0.0900 |
omp_cds_dt | Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending | 1.1793 | 1.1250 | 35.7348 | 38.6755 | 0.0706 | 0.0732 | -0.5095 | 0.0620 |
dt_cds_dt | Decision Tree w/ Cond. Deseasonalize & Detrending | 1.2704 | 1.2371 | 38.4976 | 42.4846 | 0.0773 | 0.0814 | -1.0382 | 0.0860 |
snaive | Seasonal Naive Forecaster | 1.7700 | 1.5999 | 53.5333 | 54.9143 | 0.1136 | 0.1211 | -4.1630 | 0.1580 |
naive | Naive Forecaster | 1.8145 | 1.7444 | 54.8667 | 59.8160 | 0.1135 | 0.1151 | -3.7710 | 0.1460 |
polytrend | Polynomial Trend Forecaster | 2.3154 | 2.2507 | 70.1138 | 77.3400 | 0.1363 | 0.1468 | -4.6202 | 0.1080 |
croston | Croston | 2.6211 | 2.4985 | 79.3645 | 85.8439 | 0.1515 | 0.1684 | -5.2294 | 0.0140 |
grand_means | Grand Means Forecaster | 7.1261 | 6.3506 | 216.0214 | 218.4259 | 0.4377 | 0.5682 | -59.2684 | 0.1400 |
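The table is sorted by MASE (mean absolute scaled error). As a reminder of what that column measures, the sketch below divides the forecast's MAE by the in-sample MAE of a naive forecast that repeats the value from sp steps earlier. The training values are the first five Passengers entries shown above; the test values and forecasts are hypothetical, and PyCaret's own computation may differ in details such as the seasonal period it uses.

```python
def mase(y_train, y_true, y_pred, sp=1):
    """Mean absolute scaled error relative to an in-sample (seasonal) naive forecast."""
    # In-sample errors of the naive forecast y[i] ~ y[i - sp].
    naive_errors = [abs(y_train[i] - y_train[i - sp])
                    for i in range(sp, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    # Out-of-sample mean absolute error of the model being evaluated.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / scale


y_train = [112, 118, 132, 129, 121]   # first five Passengers values from the table above
y_true = [125, 140]                   # hypothetical held-out values
y_pred = [120, 142]                   # hypothetical forecasts
score = mase(y_train, y_true, y_pred, sp=1)
print(round(score, 4))
```

A MASE below 1 means the forecast's average error is smaller than that of the naive benchmark on the training data; values above 1 mean it is larger.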
# Interactive visualization in a Jupyter environment
# Supported values for the plot argument:
# - 'ts' - Time Series Plot
# - 'train_test_split' - Train Test Split
# - 'cv' - Cross Validation
# - 'acf' - Auto Correlation (ACF)
# - 'pacf' - Partial Auto Correlation (PACF)
# - 'decomp' - Classical Decomposition
# - 'decomp_stl' - STL Decomposition
# - 'diagnostics' - Diagnostics Plot
# - 'diff' - Difference Plot
# - 'periodogram' - Frequency Components (Periodogram)
# - 'fft' - Frequency Components (FFT)
# - 'ccf' - Cross Correlation (CCF)
# - 'forecast' - "Out-of-Sample" Forecast Plot
# - 'insample' - "In-Sample" Forecast Plot
# - 'residuals' - Residuals Plot
# s.plot_model(best, plot = 'forecast', data_kwargs = {'fh' : 24})
# Fit the model on the complete dataset, including the test samples
final_best = s.finalize_model(best)
s.predict_model(best, fh = 24)
s.save_model(final_best, 'final_best_model')
Transformation Pipeline and Model Successfully Saved
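2 Data processing and cleaning
2.1 Missing value handling
- imputation_type: one of 'simple', 'iterative', or None. With 'simple', PyCaret fills missing values using simple strategies (numeric_imputation and categorical_imputation). With 'iterative', missing values are estimated by a model (numeric_iterative_imputer, categorical_iterative_imputer). With None, no imputation is performed
- numeric_imputation: strategy for missing values in numeric columns. Options:
- mean: fill with the column mean (default)
- drop: drop rows that contain missing values
- median: fill with the column median
- mode: fill with the column's most frequent value
- knn: impute using the K-nearest neighbors method
- int or float: fill with the specified value
- categorical_imputation: strategy for missing values in categorical columns. Options:
- mode: fill with the column's most frequent value (default)
- drop: drop rows that contain missing values
- str: fill with the specified string
- numeric_iterative_imputer: estimator used to impute numeric values; accepts a str or a sklearn model, default lightgbm
- categorical_iterative_imputer: estimator used to impute categorical values; accepts a str or a sklearn model, default lightgbm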
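The simple strategies listed above can be sketched in a few lines of plain Python. The function below is a single-column illustration in which None marks a missing value. It is not PyCaret's imputer: the knn and iterative strategies are left out, and 'drop' here removes only the missing entries of one column, whereas PyCaret drops whole rows.

```python
from statistics import mean, median, mode


def impute(column, strategy="mean"):
    """Fill None entries in a column using a simple strategy or a constant."""
    observed = [v for v in column if v is not None]
    if strategy == "drop":
        return observed
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        # Any other value (int, float or str) is used as the constant fill value.
        fill = strategy
    return [fill if v is None else v for v in column]


# STEROID values of the first five hepatitis rows; the fourth is missing.
steroid = [1.0, 1.0, 2.0, None, 2.0]
filled_mean = impute(steroid, "mean")
filled_const = impute(steroid, 0)
dropped = impute(steroid, "drop")
print(filled_mean, filled_const, dropped)
```

On these five values the mean strategy fills the gap with 1.5. On the full training data the fill value is the mean of the whole column, which is why the article's transformed output shows a different number.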
# load dataset
from pycaret.datasets import get_data
# Load data from a local file; note that dataset is the file name of the data
data = get_data(dataset='./datasets/hepatitis', verbose=False)
# data = get_data('hepatitis',verbose=False)
# The row with index 3 has a NaN value in the STEROID column
0 | 0 | 30 | 2 | 1.0 | 2 | 2 | 2 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1 |
1 | 0 | 50 | 1 | 1.0 | 2 | 1 | 2 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1 |
2 | 0 | 78 | 1 | 2.0 | 2 | 1 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1 |
3 | 0 | 31 | 1 | NaN | 1 | 2 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1 |
4 | 0 | 34 | 1 | 2.0 | 2 | 2 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1 |
# Fill missing values with the mean
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
# Mean imputation
s.setup(data = data, session_id=0, target = 'Class',verbose=False,
        # Set data_split_shuffle and data_split_stratify to False so the data is not shuffled
        data_split_shuffle = False, data_split_stratify = False,
        imputation_type='simple', numeric_imputation='mean')
# View the transformed data
0 | 30.0 | 2.0 | 1.000000 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.000000 | 18.0 | 4.0 | 66.53968 | 1.0 | 0 |
1 | 50.0 | 1.0 | 1.000000 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.000000 | 42.0 | 3.5 | 66.53968 | 1.0 | 0 |
2 | 78.0 | 1.0 | 2.000000 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.000000 | 32.0 | 4.0 | 66.53968 | 1.0 | 0 |
3 | 31.0 | 1.0 | 1.509434 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.000000 | 52.0 | 4.0 | 80.00000 | 1.0 | 0 |
4 | 34.0 | 1.0 | 2.000000 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 99.659088 | 200.0 | 4.0 | 66.53968 | 1.0 | 0 |
# Impute missing values with KNN
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
s.setup(data = data, session_id=0, target = 'Class',verbose=False,
        # Set data_split_shuffle and data_split_stratify to False so the data is not shuffled
        data_split_shuffle = False, data_split_stratify = False,
        imputation_type='simple', numeric_imputation = 'knn')
# View the transformed data
0 | 30.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.000000 | 18.0 | 4.0 | 91.800003 | 1.0 | 0 |
1 | 50.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.000000 | 42.0 | 3.5 | 61.599998 | 1.0 | 0 |
2 | 78.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.000000 | 32.0 | 4.0 | 75.800003 | 1.0 | 0 |
3 | 31.0 | 1.0 | 1.8 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.000000 | 52.0 | 4.0 | 80.000000 | 1.0 | 0 |
4 | 34.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 108.400002 | 200.0 | 4.0 | 62.799999 | 1.0 | 0 |
# Impute missing values with LightGBM
# from pycaret.classification import ClassificationExperiment
# s = ClassificationExperiment()
# s.setup(data = data, session_id=0, target = 'Class',verbose=False,
# # Set data_split_shuffle and data_split_stratify to False so the data is not shuffled
# data_split_shuffle = False, data_split_stratify = False,
# imputation_type='iterative', numeric_iterative_imputer = 'lightgbm')
# View the transformed data
# s.get_config('dataset_transformed').head()
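2.2 Type conversion
Although PyCaret can infer feature types automatically, it also provides parameters for declaring them explicitly. These give users finer control over the dataset, so that model training and feature engineering behave as expected. The parameters are:
- numeric_features: the columns to treat as numeric features. These are processed as continuous variables
- categorical_features: the columns to treat as categorical features. These are processed as discrete variables
- date_features: the columns to treat as date features. These are processed as date variables
- create_date_columns: the date-related columns to create from the date features
- text_features: the columns to treat as text features. These are processed as text variables
- text_features_method: the method used to process text features
- ignore_features: the columns to ignore during modeling
- keep_features: the columns to keep during modeling
# Convert variable types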
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/hepatitis', verbose=False)
from pycaret.classification import *
s = setup(data = data, target = 'Class', ignore_features = ['SEX','AGE'], categorical_features=['STEROID'],verbose = False,
data_split_shuffle = False, data_split_stratify = False)
# View the transformed data: the first two columns are gone and STEROID is now a categorical variable
0 | 0.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.000000 | 18.0 | 4.0 | 66.53968 | 1.0 | 0 |
1 | 0.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.000000 | 42.0 | 3.5 | 66.53968 | 1.0 | 0 |
2 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.000000 | 32.0 | 4.0 | 66.53968 | 1.0 | 0 |
3 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.000000 | 52.0 | 4.0 | 80.00000 | 1.0 | 0 |
4 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 99.659088 | 200.0 | 4.0 | 66.53968 | 1.0 | 0 |
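2.3 One-hot encoding
When a dataset contains categorical variables, they usually need to be converted into a numeric form that models can work with. One-hot encoding is a common approach: it converts each categorical variable into a set of binary variables, one per possible category value, of which exactly one is 1 for any given row and the rest are 0. Pass the categorical_features parameter to specify the columns to one-hot encode. For example: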
# load dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/pokemon', verbose=False)
# data = get_data('pokemon')
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
# One-hot encode Type 1
len(set(data['Type 1']))
from pycaret.classification import *
s = setup(data = data, categorical_features =["Type 1"],target = 'Legendary', verbose=False)
# View the transformed data: Type 1 is now one-hot encoded
# | Name | Type 1_Grass | Type 1_Ghost | Type 1_Water | Type 1_Steel | Type 1_Psychic | Type 1_Fire | Type 1_Poison | Type 1_Fairy | ... | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
202 | 187.0 | Hoppip | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Flying | 250.0 | 35.0 | 35.0 | 40.0 | 35.0 | 55.0 | 50.0 | 2.0 | False |
477 | 429.0 | Mismagius | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | 495.0 | 60.0 | 60.0 | 60.0 | 105.0 | 105.0 | 105.0 | 4.0 | False |
349 | 319.0 | SharpedoMega Sharpedo | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Dark | 560.0 | 70.0 | 140.0 | 70.0 | 110.0 | 65.0 | 105.0 | 3.0 | False |
777 | 707.0 | Klefki | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Fairy | 470.0 | 57.0 | 80.0 | 91.0 | 80.0 | 87.0 | 75.0 | 6.0 | False |
50 | 45.0 | Vileplume | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Poison | 490.0 | 75.0 | 80.0 | 85.0 | 110.0 | 90.0 | 50.0 | 1.0 | False |
5 rows × 30 columns
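The Type 1_Grass, Type 1_Ghost and similar columns in the output above are the result of one-hot encoding. A minimal pure-Python sketch of the idea follows. The helper name one_hot and its prefix argument are made up for illustration, and the column order is simply first-seen order, which may not match PyCaret's encoder.

```python
def one_hot(values, prefix):
    """Turn a list of category labels into one binary column per category."""
    # Collect the distinct categories in the order they first appear.
    categories = []
    for v in values:
        if v not in categories:
            categories.append(v)
    # Each row gets 1.0 in the column of its own category and 0.0 elsewhere.
    return [
        {f"{prefix}_{c}": 1.0 if v == c else 0.0 for c in categories}
        for v in values
    ]


# Type 1 values of the five rows shown in the transformed output above.
types = ["Grass", "Ghost", "Water", "Steel", "Grass"]
encoded = one_hot(types, prefix="Type 1")
for row in encoded:
    print(row)
```

Every encoded row contains exactly one 1.0, so the original label can always be recovered from the binary columns.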
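2.4 Data balancing
In PyCaret, fix_imbalance and fix_imbalance_method are the two setup parameters for handling imbalanced datasets. They act as a preprocessing step before model training to address class imbalance.
- fix_imbalance: a Boolean that specifies whether to rebalance the data. When True, PyCaret resamples the training data with the method given by fix_imbalance_method. When False, the model is trained on the original, imbalanced data.
- fix_imbalance_method: specifies the resampling method. Options include:
  - SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority-class samples to balance the classes (the default, smote)
  - Any resampling estimator provided by imbalanced-learn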
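The idea behind SMOTE can be sketched in a few lines of pure Python: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. This is a simplified illustration, not the imbalanced-learn implementation; the function name smote_like, the parameter k, and the toy points are all assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        u = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + u * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy two-dimensional data: six majority points, three minority points.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.4), (0.4, 0.2)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]

new_points = smote_like(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies between two existing minority samples, the new samples stay inside the region the minority class already occupies.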
# Load the data
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/credit', verbose=False)
# data = get_data('credit')
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default | |
0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
1 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
2 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
3 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
4 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 19394.0 | 19619.0 | 20024.0 | 2500.0 | 1815.0 | 657.0 | 1000.0 | 1000.0 | 800.0 | 0 |
5 rows × 24 columns
# Count the samples in each class
category_counts = data['default'].value_counts()
0 18694
1 5306
Name: count, dtype: int64
from pycaret.classification import *
s = setup(data = data, target = 'default', fix_imbalance = True, verbose = False)
# Class 1 now has more samples
0 18694
1 14678
Name: count, dtype: int64
2.5 Outlier handling
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/insurance', verbose=False)
# data = get_data('insurance')
# Data dimensions
(1338, 7)
from pycaret.regression import *
s = setup(data = data, target = 'charges', remove_outliers = True ,verbose = False, outliers_threshold = 0.02)
# After removing outliers, there are fewer rows
(1319, 10)
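The drop from 1338 to 1319 rows can be checked with a little arithmetic. The sketch below assumes a 70% training split, that outliers are removed from the training split only, and that outliers_threshold is the fraction of training rows to drop. These assumptions reproduce the row count above, but they are a reading of the output rather than a description of PyCaret's internals.

```python
total_rows = 1338          # rows in the insurance dataset
train_size = 0.7           # assumed training fraction
outliers_threshold = 0.02  # value passed to setup above

# Assumed: only the training split is filtered for outliers.
train_rows = int(total_rows * train_size)
removed = round(outliers_threshold * train_rows)
remaining = total_rows - removed

print(train_rows, removed, remaining)
```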
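2.6 Feature importance
- feature_selection: whether to perform feature selection as part of model training. Set to True or False.
- feature_selection_method: the feature selection method:
  - 'univariate': uses sklearn's SelectKBest, which picks the features most related to the target variable based on statistical tests.
  - 'classic' (default): uses sklearn's SelectFromModel, which picks the most important features from a supervised model's feature importances or coefficients.
  - 'sequential': uses sklearn's SequentialFeatureSelector, which selects features step by step with a given strategy (forward selection, backward selection, and so on) and a performance metric such as the cross-validation score.
- n_features_to_select: the maximum number or fraction of features to select. A value below 1 is read as a fraction of the starting features. The default is 0.2. Features listed in ignore_features and keep_features are not counted.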
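As a rough illustration of the 'univariate' method and of how a fractional n_features_to_select becomes a feature count, the sketch below ranks features by absolute Pearson correlation with the target and keeps the top k. The correlation score is a stand-in for the statistical test used by SelectKBest, the truncating conversion is an assumption (it matches the diabetes example in this section, where 8 features at 0.3 leave 2), and the feat_* names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_k_best(features, target, n_features_to_select):
    """Keep the features with the highest absolute correlation to the target."""
    if n_features_to_select < 1:
        # Assumed rounding: a fraction is truncated to a whole number of features.
        k = int(n_features_to_select * len(features))
    else:
        k = int(n_features_to_select)
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy features; none of them is constant.
features = {
    "feat_a": [1, 2, 3, 7, 8, 9],
    "feat_b": [5, 5, 5, 1, 1, 1],
    "feat_c": [1, 2, 1, 2, 1, 2],
    "feat_d": [3, 1, 2, 2, 3, 1],
}
target = [0, 0, 0, 1, 1, 1]

selected = select_k_best(features, target, n_features_to_select=0.5)
print(selected)
```

A strongly negative correlation counts as much as a strongly positive one here, which is why the absolute value is used for ranking.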
from pycaret.datasets import get_data
data = get_data('./datasets/diabetes')
Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | Class variable | |
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
from pycaret.regression import *
# feature_selection enables feature selection; n_features_to_select sets the fraction of features to keep
s = setup(data = data, target = 'Class variable', feature_selection = True, feature_selection_method = 'univariate',
n_features_to_select = 0.3, verbose = False)
# See which features were kept
Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Body mass index (weight in kg/(height in m)^2) | |
56 | 187.0 | 37.700001 |
541 | 128.0 | 32.400002 |
269 | 146.0 | 27.500000 |
304 | 150.0 | 21.000000 |
32 | 88.0 | 24.799999 |
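2.7 Normalization
In PyCaret, the normalize and normalize_method parameters control feature scaling during preprocessing. Feature scaling rescales each feature's values into a small, fixed range, which removes the effect of differing units across features and can make model training more stable and accurate. The two parameters work as follows:
- normalize: a Boolean that specifies whether to scale the features. The default is False, meaning no scaling. Set it to True to enable scaling.
- normalize_method: a string that specifies the scaling method. Options:
  - zscore (default): z-score standardization. Each value has the feature's mean subtracted and is then divided by the feature's standard deviation, so the feature ends up with mean 0 and standard deviation 1.
  - minmax: min-max scaling, also called normalization. Each value is mapped linearly into a range between a given minimum and maximum, [0, 1] by default.
  - maxabs: MaxAbs scaling. Each value is divided by the feature's maximum absolute value, which scales the feature into the range [-1, 1].
  - robust: scaling with RobustScaler. Each feature is centered and scaled using its median and interquartile range.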
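The four methods can be written out for a single feature in a few lines of standard-library Python. This is a sketch of the formulas described above, not PyCaret's or sklearn's code; the function names and the sample values are assumptions, and the quartiles use the "inclusive" (linear interpolation) method.

```python
import statistics


def zscore(xs):
    """Subtract the mean, divide by the population standard deviation."""
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def minmax(xs):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def maxabs(xs):
    """Divide by the largest absolute value, giving values in [-1, 1]."""
    scale = max(abs(x) for x in xs)
    return [x / scale for x in xs]


def robust(xs):
    """Subtract the median, divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


values = [-2.0, 0.0, 2.0, 4.0, 6.0]  # hypothetical feature column
for scaler in (zscore, minmax, maxabs, robust):
    print(scaler.__name__, [round(v, 3) for v in scaler(values)])
```

All four functions divide by a spread measure, so a constant feature (zero spread) would need special handling before any of them is applied.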
from pycaret.datasets import get_data
data = get_data('./datasets/pokemon')
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
# Normalize
from pycaret.classification import *
s = setup(data, target='Legendary', normalize=True, normalize_method='robust', verbose=False)
# | Name | Type 1_Water | Type 1_Normal | Type 1_Ice | Type 1_Psychic | Type 1_Fire | Type 1_Rock | Type 1_Fighting | Type 1_Grass | ... | Type 2_Electric | Type 2_Normal | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | |
403 | -0.021629 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.195387 | -0.333333 | 0.200000 | 0.875 | 1.088889 | 0.125 | -0.288889 | 0.000000 |
471 | 0.139870 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.179104 | 0.333333 | 0.555556 | -0.100 | -0.111111 | -0.100 | 1.111111 | 0.333333 |
238 | -0.448450 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -1.080054 | -0.500000 | -0.555556 | -0.750 | -0.777778 | -1.000 | -0.333333 | -0.333333 |
646 | 0.604182 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -0.618725 | -0.166667 | -0.333333 | -0.500 | -0.555556 | -0.500 | 0.222222 | 0.666667 |
69 | -0.898342 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -0.265943 | -0.833333 | -0.888889 | -1.000 | 1.222222 | 0.000 | 0.888889 | -0.666667 |
5 rows × 46 columns
from pycaret.classification import *
s = setup(data = data, target = 'Legendary', transformation = True, verbose = False)
# Result of the feature transformation
# | Name | Type 1_Psychic | Type 1_Water | Type 1_Rock | Type 1_Grass | Type 1_Dragon | Type 1_Ghost | Type 1_Bug | Type 1_Fairy | ... | Type 2_Electric | Type 2_Bug | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | |
165 | 52.899003 | 0.009216 | 0.043322 | -0.000000 | -0.000000 | -0.000000 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 93.118403 | 12.336844 | 23.649090 | 13.573010 | 10.692443 | 8.081703 | 26.134255 | 0.900773 |
625 | 140.730289 | 0.009216 | -0.000000 | 0.095739 | -0.000000 | -0.000000 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 66.091344 | 9.286671 | 20.259153 | 13.764668 | 8.160482 | 6.056644 | 9.552506 | 3.679456 |
628 | 141.283084 | 0.009216 | -0.000000 | -0.000000 | 0.043322 | -0.000000 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 89.747939 | 10.823299 | 29.105379 | 11.029571 | 11.203335 | 6.942091 | 27.793080 | 3.679456 |
606 | 137.396878 | 0.009216 | -0.000000 | -0.000000 | -0.000000 | 0.061897 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 56.560577 | 8.043018 | 10.276208 | 10.604937 | 6.949265 | 6.302465 | 19.943809 | 3.679456 |
672 | 149.303914 | 0.009216 | -0.000000 | -0.000000 | -0.000000 | -0.000000 | 0.029706 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 72.626190 | 10.202245 | 26.061259 | 11.435493 | 7.199607 | 6.302465 | 20.141156 | 3.679456 |
5 rows × 46 columns
3 References
- pycaret
- pycaret-docs
- pycaret-datasets
- lightgbm
- cuml
- imbalanced-learn