[Machine Learning] A Guide to PyCaret, a Low-Code Machine Learning Library

Posted by 落痕的寒假 on 2024-06-01

Original article: https://www.cnblogs.com/luohenyueji/p/18225558

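PyCaret is an open-source, low-code Python machine learning library that automates machine learning workflows. It is an end-to-end machine learning and model management tool that shortens the experiment cycle and improves productivity. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, and Ray. Compared with other open-source machine learning libraries, PyCaret can replace hundreds of lines of code with just a few. The PyCaret open-source repository is pycaret, and the official documentation is pycaret-docs.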

The basic version of PyCaret is installed with:

pip install pycaret

The full version is installed with:

pip install pycaret[full]

In addition, the following models can use the GPU:

Extreme Gradient Boosting
CatBoost
Light Gradient Boosting (requires lightgbm to be installed)
Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, K Neighbors Regressor, Support Vector Machine, Linear Regression, Ridge Regression, Lasso Regression (requires cuML 0.15 or later)

# Check the PyCaret version
import pycaret
pycaret.__version__

'3.3.2'

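1 Quick Start
- 1.1 Classification
- 1.2 Regression
- 1.3 Clustering
- 1.4 Anomaly Detection
- 1.5 Time Series Forecasting
2 Data Processing and Cleaning
- 2.1 Missing Value Handling
- 2.2 Type Conversion
- 2.3 One-Hot Encoding
- 2.4 Data Balancing
- 2.5 Outlier Handling
- 2.6 Feature Importance
- 2.7 Normalization
3 References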

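1 Quick Start

PyCaret supports several machine learning tasks: classification, regression, clustering, anomaly detection, and time series forecasting. This section covers the basics of building a model for each task with PyCaret. For more detailed usage of each task, see:

Topic	Notebook Link
Binary Classification	link
Multiclass Classification	link
Regression	link
Clustering	link
Anomaly Detection	link
Time Series Forecasting	link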

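1.1 Classification

PyCaret's classification module handles binary and multiclass classification, that is, assigning elements to different groups. Common use cases include predicting whether a customer will default, predicting whether a customer will churn, and diagnosing a disease (positive or negative). Example code follows.

Data preparation

Load the diabetes example dataset:

from pycaret.datasets import get_data
# Load the data from a local file; note that dataset is the file name of the data
data = get_data(dataset='./datasets/diabetes', verbose=False)
# Download the public dataset from the PyCaret open-source repository
# data = get_data('diabetes', verbose=False)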

# Check the data type and dimensions
type(data), data.shape

(pandas.core.frame.DataFrame, (768, 9))

# The last column indicates whether the patient has diabetes; the other columns are features
data.head()

	Number of times pregnant	Plasma glucose concentration a 2 hours in an oral glucose tolerance test	Diastolic blood pressure (mm Hg)	Triceps skin fold thickness (mm)	2-Hour serum insulin (mu U/ml)	Body mass index (weight in kg/(height in m)^2)	Diabetes pedigree function	Age (years)	Class variable
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

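Use setup, PyCaret's core function, to initialize the modeling environment and prepare the data for model training and evaluation:

from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
# target: target column; session_id: random seed; preprocess: whether to preprocess the data; train_size: training set fraction; normalize: whether to normalize the data; normalize_method: normalization method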
s.setup(data, target = 'Class variable', session_id = 0, verbose= False, train_size = 0.7, normalize = True, normalize_method = 'minmax')

<pycaret.classification.oop.ClassificationExperiment at 0x200b939df40>

List the variables created by the setup function:

s.get_config()

{'USI',
 'X',
 'X_test',
 'X_test_transformed',
 'X_train',
 'X_train_transformed',
 'X_transformed',
 '_available_plots',
 '_ml_usecase',
 'data',
 'dataset',
 'dataset_transformed',
 'exp_id',
 'exp_name_log',
 'fix_imbalance',
 'fold_generator',
 'fold_groups_param',
 'fold_shuffle_param',
 'gpu_n_jobs_param',
 'gpu_param',
 'html_param',
 'idx',
 'is_multiclass',
 'log_plots_param',
 'logging_param',
 'memory',
 'n_jobs_param',
 'pipeline',
 'seed',
 'target_param',
 'test',
 'test_transformed',
 'train',
 'train_transformed',
 'variable_and_property_keys',
 'variables',
 'y',
 'y_test',
 'y_test_transformed',
 'y_train',
 'y_train_transformed',
 'y_transformed'}

View the normalized data:

s.get_config('X_train_transformed')

	Number of times pregnant	Plasma glucose concentration a 2 hours in an oral glucose tolerance test	Diastolic blood pressure (mm Hg)	Triceps skin fold thickness (mm)	2-Hour serum insulin (mu U/ml)	Body mass index (weight in kg/(height in m)^2)	Diabetes pedigree function	Age (years)
34	0.588235	0.613065	0.639344	0.313131	0.000000	0.411326	0.185312	0.400000
221	0.117647	0.793970	0.737705	0.000000	0.000000	0.470939	0.310418	0.750000
531	0.000000	0.537688	0.622951	0.000000	0.000000	0.675112	0.259607	0.050000
518	0.764706	0.381910	0.491803	0.000000	0.000000	0.488823	0.043553	0.333333
650	0.058824	0.457286	0.442623	0.252525	0.118203	0.375559	0.066610	0.033333
...	...	...	...	...	...	...	...	...
628	0.294118	0.643216	0.655738	0.000000	0.000000	0.515648	0.028181	0.400000
456	0.058824	0.678392	0.442623	0.000000	0.000000	0.397914	0.260034	0.683333
398	0.176471	0.412060	0.573770	0.000000	0.000000	0.314456	0.132792	0.066667
6	0.176471	0.391960	0.409836	0.323232	0.104019	0.461997	0.072588	0.083333
294	0.000000	0.809045	0.409836	0.000000	0.000000	0.326379	0.075149	0.733333

537 rows × 8 columns
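
The values above fall in [0, 1] because normalize_method = 'minmax' rescales each feature by its minimum and maximum. The following is a rough, library-independent sketch of that idea, not PyCaret's internal code; the helper names fit_min_max and apply_min_max and the sample numbers are made up for illustration. The scaling statistics are learned from the training values and then reused on other data:

```python
def fit_min_max(values):
    """Learn the minimum and maximum of one feature column."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values with (x - lo) / (hi - lo); a constant column maps to 0.0."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Illustrative numbers only (a made-up "times pregnant" column).
train_col = [0, 1, 6, 10, 17]
test_col = [3, 17, 20]  # 20 lies outside the training range

lo, hi = fit_min_max(train_col)
train_scaled = apply_min_max(train_col, lo, hi)
test_scaled = apply_min_max(test_col, lo, hi)

print(train_scaled)
print(test_scaled)
```

A value outside the training range, such as 20 here, maps outside [0, 1], which is worth keeping in mind when scoring new data.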

Plot a histogram of one column:

s.get_config('X_train_transformed')['Number of times pregnant'].hist()

<AxesSubplot:>

[Figure: histogram of the normalized Number of times pregnant column]

The environment can also be initialized with the functional API:

from pycaret.classification import setup
# s = setup(data, target = 'Class variable', session_id = 0, preprocess = True, train_size = 0.7, verbose = False)

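Model training and evaluation

PyCaret provides the compare_models function, which trains every estimator available in the model library and evaluates its performance, using 10-fold cross-validation by default:

best = s.compare_models()
# Compare only selected models
# best = s.compare_models(include = ['dt', 'rf', 'et', 'gbc', 'lightgbm'])
# Return the n_select best models ranked by recall
# best_recall_models_top3 = s.compare_models(sort = 'Recall', n_select = 3)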

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
lr	Logistic Regression	0.7633	0.8132	0.4968	0.7436	0.5939	0.4358	0.4549	0.2720
ridge	Ridge Classifier	0.7633	0.8113	0.5178	0.7285	0.6017	0.4406	0.4560	0.0090
lda	Linear Discriminant Analysis	0.7633	0.8110	0.5497	0.7069	0.6154	0.4489	0.4583	0.0080
ada	Ada Boost Classifier	0.7465	0.7768	0.5655	0.6580	0.6051	0.4208	0.4255	0.0190
svm	SVM - Linear Kernel	0.7408	0.8087	0.5921	0.6980	0.6020	0.4196	0.4480	0.0080
nb	Naive Bayes	0.7391	0.7939	0.5442	0.6515	0.5857	0.3995	0.4081	0.0080
rf	Random Forest Classifier	0.7337	0.8033	0.5406	0.6331	0.5778	0.3883	0.3929	0.0350
et	Extra Trees Classifier	0.7298	0.7899	0.5181	0.6416	0.5677	0.3761	0.3840	0.0450
gbc	Gradient Boosting Classifier	0.7281	0.8007	0.5567	0.6267	0.5858	0.3857	0.3896	0.0260
lightgbm	Light Gradient Boosting Machine	0.7242	0.7811	0.5827	0.6096	0.5935	0.3859	0.3876	0.0860
qda	Quadratic Discriminant Analysis	0.7150	0.7875	0.4962	0.6225	0.5447	0.3428	0.3524	0.0080
knn	K Neighbors Classifier	0.7131	0.7425	0.5287	0.6005	0.5577	0.3480	0.3528	0.2200
dt	Decision Tree Classifier	0.6685	0.6461	0.5722	0.5266	0.5459	0.2868	0.2889	0.0100
dummy	Dummy Classifier	0.6518	0.5000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0120
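
Each row in the table above is an average over the cross-validation folds. The sketch below shows plain k-fold index splitting in pure Python to make the mechanics concrete. It is a simplification: PyCaret delegates splitting to scikit-learn, and a classification setup may stratify the folds by class. k_fold_indices is a hypothetical helper name.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, valid_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


# 537 is the number of training rows reported for this dataset.
folds = list(k_fold_indices(537, n_folds=10))
fold_sizes = [len(valid) for _, valid in folds]
print(fold_sizes)
```

A model is fitted on each train_idx and scored on the matching valid_idx; the reported metric is the mean of the ten scores.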

Return the best model among all models trained in the current setup:

best_ml = s.automl()
# best_ml

# Print the best-performing model
print(best)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

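Visualization

PyCaret also provides the plot_model function to visualize a model's evaluation metrics; its plot argument selects the plot type. The available values for plot are listed below (note that not every model supports every plot):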

pipeline: Schematic drawing of the preprocessing pipeline
auc: Area Under the Curve
threshold: Discrimination Threshold
pr: Precision Recall Curve
confusion_matrix: Confusion Matrix
error: Class Prediction Error
class_report: Classification Report
boundary: Decision Boundary
rfe: Recursive Feature Selection
learning: Learning Curve
manifold: Manifold Learning
calibration: Calibration Curve
vc: Validation Curve
dimension: Dimension Learning
feature: Feature Importance
feature_all: Feature Importance (All)
parameter: Model Hyperparameter
lift: Lift Curve
gain: Gain Chart
tree: Decision Tree
ks: KS Statistic Plot

# Retrieve the most recent score grid (here, the model comparison results)
models_results = s.pull()
models_results

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
lr	Logistic Regression	0.7633	0.8132	0.4968	0.7436	0.5939	0.4358	0.4549	0.272
ridge	Ridge Classifier	0.7633	0.8113	0.5178	0.7285	0.6017	0.4406	0.4560	0.009
lda	Linear Discriminant Analysis	0.7633	0.8110	0.5497	0.7069	0.6154	0.4489	0.4583	0.008
ada	Ada Boost Classifier	0.7465	0.7768	0.5655	0.6580	0.6051	0.4208	0.4255	0.019
svm	SVM - Linear Kernel	0.7408	0.8087	0.5921	0.6980	0.6020	0.4196	0.4480	0.008
nb	Naive Bayes	0.7391	0.7939	0.5442	0.6515	0.5857	0.3995	0.4081	0.008
rf	Random Forest Classifier	0.7337	0.8033	0.5406	0.6331	0.5778	0.3883	0.3929	0.035
et	Extra Trees Classifier	0.7298	0.7899	0.5181	0.6416	0.5677	0.3761	0.3840	0.045
gbc	Gradient Boosting Classifier	0.7281	0.8007	0.5567	0.6267	0.5858	0.3857	0.3896	0.026
lightgbm	Light Gradient Boosting Machine	0.7242	0.7811	0.5827	0.6096	0.5935	0.3859	0.3876	0.086
qda	Quadratic Discriminant Analysis	0.7150	0.7875	0.4962	0.6225	0.5447	0.3428	0.3524	0.008
knn	K Neighbors Classifier	0.7131	0.7425	0.5287	0.6005	0.5577	0.3480	0.3528	0.220
dt	Decision Tree Classifier	0.6685	0.6461	0.5722	0.5266	0.5459	0.2868	0.2889	0.010
dummy	Dummy Classifier	0.6518	0.5000	0.0000	0.0000	0.0000	0.0000	0.0000	0.012

s.plot_model(best, plot = 'confusion_matrix')

[Figure: confusion matrix of the best model]
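
The Accuracy, Recall, Prec. and F1 columns reported by PyCaret can all be derived from the four cells of a binary confusion matrix. A minimal sketch with made-up counts (not the values from this experiment; binary_metrics is a hypothetical helper):

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts: 40 true positives, 10 false positives,
# 40 false negatives, 110 true negatives.
metrics = binary_metrics(tp=40, fp=10, fn=40, tn=110)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A model that never predicts the positive class has tp = fp = 0, so its recall, precision and F1 are all 0, which matches the Dummy Classifier row in the comparison table.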

In a Jupyter environment, the evaluate_model function displays the model's performance interactively:

# s.evaluate_model(best)

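Model prediction

The predict_model function makes predictions on data and returns a pandas DataFrame containing the predicted label prediction_label and the score prediction_score. When data is None, it predicts labels and scores on the test set created during setup.

# Predict on the whole dataset
res = s.predict_model(best, data=data)
# Inspect the per-row predictions
# res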

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	Logistic Regression	0.7708	0.8312	0.5149	0.7500	0.6106	0.4561	0.4723

# Predict on the hold-out test set created during setup
res = s.predict_model(best)

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	Logistic Regression	0.7576	0.8553	0.5062	0.7193	0.5942	0.4287	0.4422

Saving and loading models

# Save the model locally
_ = s.save_model(best, 'best_model', verbose = False)

# Load the model
model = s.load_model('best_model')
# Inspect the model structure
# model

Transformation Pipeline and Model Successfully Loaded

# Predict on the whole dataset
res = s.predict_model(model, data=data)

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	Logistic Regression	0.7708	0.8312	0.5149	0.7500	0.6106	0.4561	0.4723

1.2 Regression

PyCaret provides the regression module for regression tasks; it is used in the same way as the classification module.

# Load the insurance example dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/insurance', verbose=False)
# Download from the network
# data = get_data(dataset='insurance', verbose=False)

data.head()

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

# Create the data pipeline
from pycaret.regression import RegressionExperiment
s = RegressionExperiment()
# Predict the charges column
s.setup(data, target = 'charges', session_id = 0)
# Alternative way to create the data pipeline
# from pycaret.regression import *
# s = setup(data, target = 'charges', session_id = 0)

	Description	Value
0	Session id	0
1	Target	charges
2	Target type	Regression
3	Original data shape	(1338, 7)
4	Transformed data shape	(1338, 10)
5	Transformed train set shape	(936, 10)
6	Transformed test set shape	(402, 10)
7	Numeric features	3
8	Categorical features	3
9	Preprocess	True
10	Imputation type	simple
11	Numeric imputation	mean
12	Categorical imputation	mode
13	Maximum one-hot encoding	25
14	Encoding method	None
15	Fold Generator	KFold
16	Fold Number	10
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	False
20	Experiment Name	reg-default-name
21	USI	eb9d

<pycaret.regression.oop.RegressionExperiment at 0x200dedc2d30>

# Evaluate the candidate models
best = s.compare_models()

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE	TT (Sec)
gbr	Gradient Boosting Regressor	2723.2453	23787529.5872	4832.4785	0.8254	0.4427	0.3140	0.0550
lightgbm	Light Gradient Boosting Machine	2998.1311	25738691.2181	5012.2404	0.8106	0.5525	0.3709	0.1140
rf	Random Forest Regressor	2915.7018	26780127.0016	5109.5098	0.8031	0.4855	0.3520	0.0670
et	Extra Trees Regressor	2841.8257	28559316.9533	5243.5828	0.7931	0.4671	0.3218	0.0670
ada	AdaBoost Regressor	4180.2669	28289551.0048	5297.6817	0.7886	0.5935	0.6545	0.0210
ridge	Ridge Regression	4304.2640	38786967.4768	6188.6966	0.7152	0.5794	0.4283	0.0230
lar	Least Angle Regression	4293.9886	38781666.5991	6188.3301	0.7151	0.5893	0.4263	0.0210
llar	Lasso Least Angle Regression	4294.2135	38780221.0039	6188.1906	0.7151	0.5891	0.4264	0.0200
br	Bayesian Ridge	4299.8532	38785479.0984	6188.6026	0.7151	0.5784	0.4274	0.0200
lasso	Lasso Regression	4294.2186	38780210.5665	6188.1898	0.7151	0.5892	0.4264	0.0370
lr	Linear Regression	4293.9886	38781666.5991	6188.3301	0.7151	0.5893	0.4263	0.0350
dt	Decision Tree Regressor	3550.6534	51149204.9032	7095.9170	0.6127	0.5839	0.4537	0.0230
huber	Huber Regressor	3769.3076	53638697.2337	7254.7108	0.6095	0.4528	0.2187	0.0250
par	Passive Aggressive Regressor	4144.7180	62949698.1775	7862.7604	0.5433	0.4634	0.2465	0.0210
en	Elastic Net	7248.9376	89841235.9517	9405.5846	0.3534	0.7346	0.9238	0.0210
omp	Orthogonal Matching Pursuit	8916.1927	130904492.3067	11356.4120	0.0561	0.8781	1.1598	0.0180
knn	K Neighbors Regressor	8161.8875	137982796.8000	11676.3735	-0.0011	0.8744	0.9742	0.0250
dummy	Dummy Regressor	8892.4478	141597492.8000	11823.4271	-0.0221	0.9868	1.4909	0.0210

print(best)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='squared_error',
                          max_depth=3, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          n_estimators=100, n_iter_no_change=None,
                          random_state=0, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
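
For readers unfamiliar with the columns of the regression comparison table, the sketch below computes MAE, MSE, RMSE, R2 and MAPE from their textbook definitions (RMSLE is omitted). The numbers are made up and regression_metrics is a hypothetical helper; PyCaret's own values come from its underlying metric implementations, so treat this only as an illustration of what each column measures.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R2 and MAPE for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    # MAPE is undefined when a true value is 0; this sketch assumes none are.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "MAPE": mape}


# Made-up charges, not taken from the insurance dataset.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 330.0, 380.0]
scores = regression_metrics(y_true, y_pred)
print(scores)
```

R2 becomes negative when a model does worse than predicting the mean of the evaluated data, which is what the last two rows of the comparison table show.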

1.3 Clustering

PyCaret provides the clustering module for unsupervised clustering.

Data preparation

# Load the jewellery dataset
from pycaret.datasets import get_data
# Cluster on the dataset's features
data = get_data('./datasets/jewellery')
# data = get_data('jewellery')

	Age	Income	SpendingScore	Savings
0	58	77769	0.791329	6559.829923
1	59	81799	0.791082	5417.661426
2	62	74751	0.702657	9258.992965
3	59	74373	0.765680	7346.334504
4	87	17760	0.348778	16869.507130

# Create the data pipeline
from pycaret.clustering import ClusteringExperiment
s = ClusteringExperiment()
# normalize: normalize the data
s.setup(data, normalize = True, verbose = False)
# Alternative way to create the data pipeline
# from pycaret.clustering import *
# s = setup(data, normalize = True)

<pycaret.clustering.oop.ClusteringExperiment at 0x200dec86340>

Model creation

For clustering tasks, PyCaret provides create_model to build a clustering model with a chosen method, rather than comparing all of them.

kmeans = s.create_model('kmeans')

	Silhouette	Calinski-Harabasz	Davies-Bouldin	Homogeneity	Rand Index	Completeness
0	0.7581	1611.2647	0.3743	0	0	0

The clustering methods supported by create_model are listed by:

s.models()

	Name	Reference
ID
kmeans	K-Means Clustering	sklearn.cluster._kmeans.KMeans
ap	Affinity Propagation	sklearn.cluster._affinity_propagation.Affinity...
meanshift	Mean Shift Clustering	sklearn.cluster._mean_shift.MeanShift
sc	Spectral Clustering	sklearn.cluster._spectral.SpectralClustering
hclust	Agglomerative Clustering	sklearn.cluster._agglomerative.AgglomerativeCl...
dbscan	Density-Based Spatial Clustering	sklearn.cluster._dbscan.DBSCAN
optics	OPTICS Clustering	sklearn.cluster._optics.OPTICS
birch	Birch Clustering	sklearn.cluster._birch.Birch

print(kmeans)
# Check the number of clusters
print(kmeans.n_clusters)

KMeans(algorithm='lloyd', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init='auto', random_state=1459, tol=0.0001, verbose=0)
4
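
The printed model uses algorithm='lloyd'. As a rough illustration of what Lloyd's algorithm does, here is a pure-Python sketch on made-up 2-D points. It is not scikit-learn's implementation: the initialisation simply takes the first k points (the model above uses k-means++), and k_means and squared_distance are hypothetical helper names.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=100):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    # Simplistic initialisation (an assumption of this sketch): the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step; an empty cluster keeps its previous centroid.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Even though both starting centroids lie in the first group, the alternating assign and update steps move one centroid to the second group within a few iterations.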

Visualization

# Interactive visualization in a Jupyter environment
# s.evaluate_model(kmeans)

# Plot the results
# 'cluster' - Cluster PCA Plot (2d)
# 'tsne' - Cluster t-SNE (3d)
# 'elbow' - Elbow Plot
# 'silhouette' - Silhouette Plot
# 'distance' - Distance Plot
# 'distribution' - Distribution Plot
s.plot_model(kmeans, plot = 'elbow')

[Figure: elbow plot for the k-means model]

Label assignment and prediction

Assign cluster labels to the training data:

result = s.assign_model(kmeans)
result.head()

	Age	Income	SpendingScore	Savings	Cluster
0	58	77769	0.791329	6559.830078	Cluster 2
1	59	81799	0.791082	5417.661621	Cluster 2
2	62	74751	0.702657	9258.993164	Cluster 2
3	59	74373	0.765680	7346.334473	Cluster 2
4	87	17760	0.348778	16869.507812	Cluster 1

Assign labels to new data:

predictions = s.predict_model(kmeans, data = data)
predictions.head()

	Age	Income	SpendingScore	Savings	Cluster
0	-0.042287	0.062733	1.103593	-1.072467	Cluster 2
1	-0.000821	0.174811	1.102641	-1.303473	Cluster 2
2	0.123577	-0.021200	0.761727	-0.526556	Cluster 2
3	-0.000821	-0.031712	1.004705	-0.913395	Cluster 2
4	1.160228	-1.606165	-0.602619	1.012686	Cluster 1

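1.4 Anomaly Detection

PyCaret's anomaly detection module is an unsupervised machine learning module for identifying rare items, events, or observations that differ significantly from the majority of the data. Such anomalies usually point to some kind of problem, such as bank fraud, a structural defect, a medical problem, or an error. The anomaly detection module is used much like the clustering module.

Data preparation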

from pycaret.datasets import get_data
data = get_data('./datasets/anomaly')
# data = get_data('anomaly')

	Col1	Col2	Col3	Col4	Col5	Col6	Col7	Col8	Col9	Col10
0	0.263995	0.764929	0.138424	0.935242	0.605867	0.518790	0.912225	0.608234	0.723782	0.733591
1	0.546092	0.653975	0.065575	0.227772	0.845269	0.837066	0.272379	0.331679	0.429297	0.367422
2	0.336714	0.538842	0.192801	0.553563	0.074515	0.332993	0.365792	0.861309	0.899017	0.088600
3	0.092108	0.995017	0.014465	0.176371	0.241530	0.514724	0.562208	0.158963	0.073715	0.208463
4	0.325261	0.805968	0.957033	0.331665	0.307923	0.355315	0.501899	0.558449	0.885169	0.182754

from pycaret.anomaly import AnomalyExperiment
s = AnomalyExperiment()
s.setup(data, session_id = 0)
# Alternative way to initialize
# from pycaret.anomaly import *
# s = setup(data, session_id = 0)

	Description	Value
0	Session id	0
1	Original data shape	(1000, 10)
2	Transformed data shape	(1000, 10)
3	Numeric features	10
4	Preprocess	True
5	Imputation type	simple
6	Numeric imputation	mean
7	Categorical imputation	mode
8	CPU Jobs	-1
9	Use GPU	False
10	Log Experiment	False
11	Experiment Name	anomaly-default-name
12	USI	54db

<pycaret.anomaly.oop.AnomalyExperiment at 0x200e14f5250>

Model creation

iforest = s.create_model('iforest')
print(iforest)

IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=0, verbose=0)

The models supported by the anomaly detection module are listed by:

s.models()

	Name	Reference
ID
abod	Angle-base Outlier Detection	pyod.models.abod.ABOD
cluster	Clustering-Based Local Outlier	pycaret.internal.patches.pyod.CBLOFForceToDouble
cof	Connectivity-Based Local Outlier	pyod.models.cof.COF
iforest	Isolation Forest	pyod.models.iforest.IForest
histogram	Histogram-based Outlier Detection	pyod.models.hbos.HBOS
knn	K-Nearest Neighbors Detector	pyod.models.knn.KNN
lof	Local Outlier Factor	pyod.models.lof.LOF
svm	One-class SVM detector	pyod.models.ocsvm.OCSVM
pca	Principal Component Analysis	pyod.models.pca.PCA
mcd	Minimum Covariance Determinant	pyod.models.mcd.MCD
sod	Subspace Outlier Detection	pyod.models.sod.SOD
sos	Stochastic Outlier Selection	pyod.models.sos.SOS

Label assignment and prediction

Assign anomaly labels to the training data:

result = s.assign_model(iforest)
result.head()

	Col1	Col2	Col3	Col4	Col5	Col6	Col7	Col8	Col9	Col10	Anomaly	Anomaly_Score
0	0.263995	0.764929	0.138424	0.935242	0.605867	0.518790	0.912225	0.608234	0.723782	0.733591	0	-0.016205
1	0.546092	0.653975	0.065575	0.227772	0.845269	0.837066	0.272379	0.331679	0.429297	0.367422	0	-0.068052
2	0.336714	0.538842	0.192801	0.553563	0.074515	0.332993	0.365792	0.861309	0.899017	0.088600	1	0.009221
3	0.092108	0.995017	0.014465	0.176371	0.241530	0.514724	0.562208	0.158963	0.073715	0.208463	1	0.056690
4	0.325261	0.805968	0.957033	0.331665	0.307923	0.355315	0.501899	0.558449	0.885169	0.182754	0	-0.012945

Assign labels to new data:

predictions = s.predict_model(iforest, data = data)
predictions.head()

	Col1	Col2	Col3	Col4	Col5	Col6	Col7	Col8	Col9	Col10	Anomaly	Anomaly_Score
0	0.263995	0.764929	0.138424	0.935242	0.605867	0.518790	0.912225	0.608234	0.723782	0.733591	0	-0.016205
1	0.546092	0.653975	0.065575	0.227772	0.845269	0.837066	0.272379	0.331679	0.429297	0.367422	0	-0.068052
2	0.336714	0.538842	0.192801	0.553563	0.074515	0.332993	0.365792	0.861309	0.899017	0.088600	1	0.009221
3	0.092108	0.995017	0.014465	0.176371	0.241530	0.514724	0.562208	0.158963	0.073715	0.208463	1	0.056690
4	0.325261	0.805968	0.957033	0.331665	0.307923	0.355315	0.501899	0.558449	0.885169	0.182754	0	-0.012945
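
The contamination=0.05 value in the IForest printout is the expected fraction of anomalies. As a simplified illustration of how such a fraction can be turned into 0/1 labels, the sketch below flags the highest-scoring fraction of rows. It is a generic sketch, not PyOD's or PyCaret's actual implementation; label_by_contamination is a hypothetical helper and the scores are invented.

```python
def label_by_contamination(scores, contamination=0.05):
    """Flag the highest-scoring fraction of samples as anomalies (1), the rest as normal (0)."""
    n_flagged = round(len(scores) * contamination)
    if n_flagged <= 0:
        return [0] * len(scores)
    # Score of the last sample that still counts as an anomaly.
    cutoff = sorted(scores, reverse=True)[n_flagged - 1]
    return [1 if s >= cutoff else 0 for s in scores]


# Made-up anomaly scores for 20 samples; higher means more anomalous.
scores = [-0.07, -0.02, 0.06, -0.05, -0.01, -0.03, -0.04, -0.06, -0.02, -0.05,
          -0.01, -0.03, -0.08, -0.04, -0.02, -0.06, -0.01, -0.05, -0.03, -0.04]
labels = label_by_contamination(scores, contamination=0.05)
print(sum(labels), labels.index(1))
```

In the outputs above, the rows labelled Anomaly = 1 are likewise the ones with the higher Anomaly_Score.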

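1.5 Time Series Forecasting

PyCaret's Time Series module supports a range of forecasting methods, such as ARIMA, exponential smoothing, and Prophet. It also provides functionality for handling missing values, time series decomposition, and data visualization.

Data preparation

# Airline passenger time series
from pycaret.datasets import get_data
# Download path: https://raw.githubusercontent.com/sktime/sktime/main/sktime/datasets/data/Airline/Airline.csv
data = get_data('./datasets/airline')
# data = get_data('airline')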

	Date	Passengers
0	1949-01	112
1	1949-02	118
2	1949-03	132
3	1949-04	129
4	1949-05	121

import pandas as pd
data['Date'] = pd.to_datetime(data['Date'])
# Set Date as the index
data.set_index('Date', inplace=True)

from pycaret.time_series import TSForecastingExperiment
s = TSForecastingExperiment()
# fh: forecast horizon; the default is 1, i.e. forecast one step ahead. fold: number of cross-validation folds
s.setup(data, fh = 3, fold = 5, session_id = 0, verbose = False)
# from pycaret.time_series import *
# s = setup(data, fh = 3, fold = 5, session_id = 0)

<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x200dee26910>

Model training and evaluation

best = s.compare_models()

	Model	MASE	RMSSE	MAE	RMSE	MAPE	SMAPE	R2	TT (Sec)
stlf	STLF	0.4240	0.4429	12.8002	15.1933	0.0266	0.0268	0.4296	0.0300
exp_smooth	Exponential Smoothing	0.5063	0.5378	15.2900	18.4455	0.0334	0.0335	-0.0521	0.0500
ets	ETS	0.5520	0.5801	16.6164	19.8391	0.0354	0.0357	-0.0740	0.0680
arima	ARIMA	0.6480	0.6501	19.5728	22.3027	0.0412	0.0420	-0.0796	0.0420
auto_arima	Auto ARIMA	0.6526	0.6300	19.7405	21.6202	0.0414	0.0421	-0.0560	10.5220
theta	Theta Forecaster	0.8458	0.8223	25.7024	28.3332	0.0524	0.0541	-0.7710	0.0220
huber_cds_dt	Huber w/ Cond. Deseasonalize & Detrending	0.9002	0.8900	27.2568	30.5782	0.0550	0.0572	-0.0309	0.0680
knn_cds_dt	K Neighbors w/ Cond. Deseasonalize & Detrending	0.9381	0.8830	28.5678	30.5007	0.0555	0.0575	0.0908	0.0920
lr_cds_dt	Linear w/ Cond. Deseasonalize & Detrending	0.9469	0.9297	28.6337	31.9163	0.0581	0.0605	-0.1620	0.0820
ridge_cds_dt	Ridge w/ Cond. Deseasonalize & Detrending	0.9469	0.9297	28.6340	31.9164	0.0581	0.0605	-0.1620	0.0680
en_cds_dt	Elastic Net w/ Cond. Deseasonalize & Detrending	0.9499	0.9320	28.7271	31.9952	0.0582	0.0606	-0.1579	0.0700
llar_cds_dt	Lasso Least Angular Regressor w/ Cond. Deseasonalize & Detrending	0.9520	0.9336	28.7917	32.0528	0.0583	0.0607	-0.1559	0.0560
lasso_cds_dt	Lasso w/ Cond. Deseasonalize & Detrending	0.9521	0.9337	28.7941	32.0557	0.0583	0.0607	-0.1560	0.0720
br_cds_dt	Bayesian Ridge w/ Cond. Deseasonalize & Detrending	0.9551	0.9347	28.9018	32.1013	0.0582	0.0606	-0.1377	0.0580
et_cds_dt	Extra Trees w/ Cond. Deseasonalize & Detrending	1.0322	0.9942	31.4048	34.3054	0.0607	0.0633	-0.1660	0.1280
rf_cds_dt	Random Forest w/ Cond. Deseasonalize & Detrending	1.0851	1.0286	32.9791	35.4666	0.0641	0.0670	-0.3545	0.1400
lightgbm_cds_dt	Light Gradient Boosting w/ Cond. Deseasonalize & Detrending	1.1409	1.1040	34.5999	37.9918	0.0670	0.0701	-0.3994	0.0900
ada_cds_dt	AdaBoost w/ Cond. Deseasonalize & Detrending	1.1441	1.0843	34.7451	37.3681	0.0664	0.0697	-0.3004	0.0920
gbr_cds_dt	Gradient Boosting w/ Cond. Deseasonalize & Detrending	1.1697	1.1094	35.4408	38.1373	0.0697	0.0729	-0.4163	0.0900
omp_cds_dt	Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending	1.1793	1.1250	35.7348	38.6755	0.0706	0.0732	-0.5095	0.0620
dt_cds_dt	Decision Tree w/ Cond. Deseasonalize & Detrending	1.2704	1.2371	38.4976	42.4846	0.0773	0.0814	-1.0382	0.0860
snaive	Seasonal Naive Forecaster	1.7700	1.5999	53.5333	54.9143	0.1136	0.1211	-4.1630	0.1580
naive	Naive Forecaster	1.8145	1.7444	54.8667	59.8160	0.1135	0.1151	-3.7710	0.1460
polytrend	Polynomial Trend Forecaster	2.3154	2.2507	70.1138	77.3400	0.1363	0.1468	-4.6202	0.1080
croston	Croston	2.6211	2.4985	79.3645	85.8439	0.1515	0.1684	-5.2294	0.0140
grand_means	Grand Means Forecaster	7.1261	6.3506	216.0214	218.4259	0.4377	0.5682	-59.2684	0.1400
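
Two of the less familiar columns in the table above are MASE and SMAPE. The sketch below computes both from their textbook definitions. The training values are the first five airline rows shown earlier; the "actual" and "forecast" values are invented. mase and smape are hypothetical helpers, and PyCaret's own numbers come from its underlying metric implementations, which may differ in details such as the seasonal period used for scaling.

```python
def mase(y_train, y_true, y_pred, sp=1):
    """Mean absolute scaled error.

    Forecast MAE divided by the in-sample MAE of a naive forecast
    that repeats the value observed sp steps earlier.
    """
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [abs(y_train[i] - y_train[i - sp]) for i in range(sp, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale


def smape(y_true, y_pred):
    """Symmetric MAPE: mean of 2|y - yhat| / (|y| + |yhat|)."""
    terms = [2 * abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


y_train = [112, 118, 132, 129, 121]  # first five Passengers values
y_true = [130, 140, 150]             # invented actuals
y_pred = [125, 145, 140]             # invented forecasts

print(round(mase(y_train, y_true, y_pred, sp=1), 4))
print(round(smape(y_true, y_pred), 4))
```

A MASE below 1 means the forecast errors are smaller than those of the naive forecast on the training data, which is why the naive-style forecasters sit near the bottom of the table with values well above 1.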

Visualization

# Interactive visualization in a Jupyter environment
# The plot argument supports:
# - 'ts' - Time Series Plot
# - 'train_test_split' - Train Test Split
# - 'cv' - Cross Validation
# - 'acf' - Auto Correlation (ACF)
# - 'pacf' - Partial Auto Correlation (PACF)
# - 'decomp' - Classical Decomposition
# - 'decomp_stl' - STL Decomposition
# - 'diagnostics' - Diagnostics Plot
# - 'diff' - Difference Plot
# - 'periodogram' - Frequency Components (Periodogram)
# - 'fft' - Frequency Components (FFT)
# - 'ccf' - Cross Correlation (CCF)
# - 'forecast' - "Out-of-Sample" Forecast Plot
# - 'insample' - "In-Sample" Forecast Plot
# - 'residuals' - Residuals Plot
# s.plot_model(best, plot = 'forecast', data_kwargs = {'fh' : 24})

Forecasting

# Refit the model on the full dataset, including the test samples
final_best = s.finalize_model(best)
# Forecast with the finalized model
s.predict_model(final_best, fh = 24)
s.save_model(final_best, 'final_best_model')

Transformation Pipeline and Model Successfully Saved

(ForecastingPipeline(steps=[('forecaster',
                             TransformedTargetForecaster(steps=[('model',
                                                                 ForecastingPipeline(steps=[('forecaster',
                                                                                             TransformedTargetForecaster(steps=[('model',
                                                                                                                                 STLForecaster(sp=12))]))]))]))]),
 'final_best_model.pkl')

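2 Data Processing and Cleaning

2.1 Missing Value Handling

Datasets can contain missing values or empty records for many reasons. Removing samples with missing values is a common strategy, but it discards potentially valuable data. An alternative is to impute the missing values. The following setup parameters control missing value handling:

imputation_type: 'simple', 'iterative', or None. With 'simple', PyCaret fills missing values using simple strategies (numeric_imputation and categorical_imputation). With 'iterative', it fills them using model-based estimation (numeric_iterative_imputer and categorical_iterative_imputer). With None, no imputation is performed
numeric_imputation: how missing numeric values are filled:
- mean: fill with the column mean (default)
- drop: drop rows containing missing values
- median: fill with the column median
- mode: fill with the column's most frequent value
- knn: fill using the K-nearest neighbors method
- int or float: replace with the specified value
categorical_imputation: how missing categorical values are filled:
- mode: fill with the column's most frequent value (default)
- drop: drop rows containing missing values
- str: replace with the specified string
numeric_iterative_imputer: estimator used to impute numeric values; accepts a str or an sklearn model, lightgbm by default
categorical_iterative_imputer: estimator used to impute categorical values; accepts a str or an sklearn model, lightgbm by default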
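
To make the simple numeric strategies listed above concrete, here is a small pure-Python sketch of mean, median, mode and constant filling. It only illustrates the idea and is not PyCaret's implementation; impute_numeric is a hypothetical helper, the column is made up, and the drop and knn options are left out.

```python
import math
import statistics


def impute_numeric(column, strategy="mean"):
    """Replace NaN entries using a statistic of the observed values or a constant."""
    observed = [v for v in column if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    elif isinstance(strategy, (int, float)):
        fill = strategy  # constant fill value
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    return [fill if math.isnan(v) else v for v in column]


nan = float("nan")
steroid = [1.0, 1.0, 2.0, nan, 2.0, 2.0]  # made-up column with one gap

print(impute_numeric(steroid, "mean"))
print(impute_numeric(steroid, "median"))
print(impute_numeric(steroid, "mode"))
print(impute_numeric(steroid, 0))
```

The mean strategy can produce values that never occur in the data (here a fraction between 1 and 2), which is harmless for continuous features but may be undesirable for coded ones.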

Load the data

# load dataset
from pycaret.datasets import get_data
# Load the data from a local file; note that dataset is the file name of the data
data = get_data(dataset='./datasets/hepatitis', verbose=False)
# data = get_data('hepatitis',verbose=False)
# The row with index 3 has a NaN in the STEROID column
data.head()

	AGE	SEX	STEROID	ANTIVIRALS	FATIGUE	MALAISE	ANOREXIA	LIVER BIG	LIVER FIRM	SPLEEN PALPABLE	SPIDERS	ASCITES	VARICES	BILIRUBIN	ALK PHOSPHATE	SGOT	ALBUMIN	PROTIME	HISTOLOGY
0	30	2	1.0	2	2	2	2	1.0	2.0	2.0	2.0	2.0	2.0	1.0	85.0	18.0	4.0	NaN	1
1	50	1	1.0	2	1	2	2	1.0	2.0	2.0	2.0	2.0	2.0	0.9	135.0	42.0	3.5	NaN	1
2	78	1	2.0	2	1	2	2	2.0	2.0	2.0	2.0	2.0	2.0	0.7	96.0	32.0	4.0	NaN	1
3	31	1	NaN	1	2	2	2	2.0	2.0	2.0	2.0	2.0	2.0	0.7	46.0	52.0	4.0	80.0	1
4	34	1	2.0	2	2	2	2	2.0	2.0	2.0	2.0	2.0	2.0	1.0	NaN	200.0	4.0	NaN	1

# Fill missing values with the mean
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
# Column mean
# s.data['STEROID'].mean()
s.setup(data = data, session_id=0, target = 'Class',verbose=False,
        # Set data_split_shuffle and data_split_stratify to False so the data is not shuffled
        data_split_shuffle = False, data_split_stratify = False,
        imputation_type='simple', numeric_imputation='mean')
# View the transformed data
s.get_config('dataset_transformed').head()

	AGE	SEX	STEROID	ANTIVIRALS	FATIGUE	MALAISE	ANOREXIA	LIVER BIG	LIVER FIRM	SPLEEN PALPABLE	SPIDERS	ASCITES	VARICES	BILIRUBIN	ALK PHOSPHATE	SGOT	ALBUMIN	PROTIME	HISTOLOGY
0	30.0	2.0	1.000000	2.0	2.0	2.0	2.0	1.0	2.0	2.0	2.0	2.0	2.0	1.0	85.000000	18.0	4.0	66.53968	1.0
1	50.0	1.0	1.000000	2.0	1.0	2.0	2.0	1.0	2.0	2.0	2.0	2.0	2.0	0.9	135.000000	42.0	3.5	66.53968	1.0
2	78.0	1.0	2.000000	2.0	1.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	0.7	96.000000	32.0	4.0	66.53968	1.0
3	31.0	1.0	1.509434	1.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	0.7	46.000000	52.0	4.0	80.00000	1.0
4	34.0	1.0	2.000000	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	1.0	99.659088	200.0	4.0	66.53968	1.0

# Impute missing values with KNN
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
s.setup(data = data, session_id=0, target = 'Class',verbose=False,
        # Set data_split_shuffle and data_split_stratify to False so the data is not shuffled
        data_split_shuffle = False, data_split_stratify = False,
        imputation_type='simple', numeric_imputation = 'knn')
# View the transformed data
s.get_config('dataset_transformed').head()

	AGE	SEX	STEROID	ANTIVIRALS	FATIGUE	MALAISE	ANOREXIA	LIVER BIG	LIVER FIRM	SPLEEN PALPABLE	SPIDERS	ASCITES	VARICES	BILIRUBIN	ALK PHOSPHATE	SGOT	ALBUMIN	PROTIME	HISTOLOGY
0	30.0	2.0	1.0	2.0	2.0	2.0	2.0	1.0	2.0	2.0	2.0	2.0	2.0	1.0	85.000000	18.0	4.0	91.800003	1.0
1	50.0	1.0	1.0	2.0	1.0	2.0	2.0	1.0	2.0	2.0	2.0	2.0	2.0	0.9	135.000000	42.0	3.5	61.599998	1.0
2	78.0	1.0	2.0	2.0	1.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	0.7	96.000000	32.0	4.0	75.800003	1.0
3	31.0	1.0	1.8	1.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	0.7	46.000000	52.0	4.0	80.000000	1.0
4	34.0	1.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	1.0	108.400002	200.0	4.0	62.799999	1.0

# Impute missing values with lightgbm
# from pycaret.classification import ClassificationExperiment
# s = ClassificationExperiment()
# s.setup(data = data, session_id=0, target = 'Class',verbose=False,
#         # Set data_split_shuffle and data_split_stratify to False so the data is not shuffled
#         data_split_shuffle = False, data_split_stratify = False,
#         imputation_type='iterative', numeric_iterative_imputer = 'lightgbm')
# View the transformed data
# s.get_config('dataset_transformed').head()

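2.2 Type Conversion

Although PyCaret infers feature types automatically, it also provides parameters for overriding data types, giving finer control over the dataset so that model training and feature engineering behave as intended. These parameters are:

numeric_features: the columns to treat as numeric features; they are handled as continuous variables
categorical_features: the columns to treat as categorical features; they are handled as discrete variables
date_features: the columns to treat as date features; they are handled as date variables
create_date_columns: controls the date-related columns created from the date features
text_features: the columns to treat as text features; they are handled as text variables
text_features_method: the method used to process text features
ignore_features: the columns to ignore during modeling
keep_features: the columns to keep during modeling

# Convert variable types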
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/hepatitis', verbose=False)

from pycaret.classification import *
s = setup(data = data, target = 'Class', ignore_features  = ['SEX','AGE'], categorical_features=['STEROID'],verbose = False,
         data_split_shuffle = False, data_split_stratify = False)

# View the transformed data: the first two columns (AGE and SEX) are gone and STEROID is now a categorical variable
s.get_config('dataset_transformed').head()

	STEROID	ANTIVIRALS	FATIGUE	MALAISE	ANOREXIA	LIVER BIG	LIVER FIRM	SPLEEN PALPABLE	SPIDERS	ASCITES	VARICES	BILIRUBIN	ALK PHOSPHATE	SGOT	ALBUMIN	PROTIME	HISTOLOGY
0	0.0	2.0	2.0	2.0	2.0	1.0	2.0	2.0	2.0	2.0	2.0	1.0	85.000000	18.0	4.0	66.53968	1.0
1	0.0	2.0	1.0	2.0	2.0	1.0	2.0	2.0	2.0	2.0	2.0	0.9	135.000000	42.0	3.5	66.53968	1.0
2	1.0	2.0	1.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	0.7	96.000000	32.0	4.0	66.53968	1.0
3	1.0	1.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	0.7	46.000000	52.0	4.0	80.00000	1.0
4	1.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	1.0	99.659088	200.0	4.0	66.53968	1.0

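2.3 One-Hot Encoding

When a dataset contains categorical variables, they usually need to be converted into a numeric form that models can work with. One-hot encoding is a common approach: it turns each categorical variable into a set of binary variables, one per possible category value, of which exactly one is 1 for any given sample and the rest are 0. The categorical_features parameter specifies which columns to one-hot encode. For example: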

# load dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/pokemon', verbose=False)
# data = get_data('pokemon')
data.head()

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False
3	3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	False
4	4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False

# One-hot encode Type 1; first count its distinct values
len(set(data['Type 1']))

from pycaret.classification import *
s = setup(data = data, categorical_features =["Type 1"],target = 'Legendary', verbose=False)
# View the transformed data: Type 1 is now one-hot encoded
s.get_config('dataset_transformed').head()

	#	Name	Type 1_Grass	Type 1_Ghost	Type 1_Water	Type 1_Steel	...	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
202	187.0	Hoppip	1.0	0.0	0.0	0.0	...	Flying	250.0	35.0	35.0	40.0	35.0	55.0	50.0	2.0	False
477	429.0	Mismagius	0.0	1.0	0.0	0.0	...	NaN	495.0	60.0	60.0	60.0	105.0	105.0	105.0	4.0	False
349	319.0	SharpedoMega Sharpedo	0.0	0.0	1.0	0.0	...	Dark	560.0	70.0	140.0	70.0	110.0	65.0	105.0	3.0	False
777	707.0	Klefki	0.0	0.0	0.0	1.0	...	Fairy	470.0	57.0	80.0	91.0	80.0	87.0	75.0	6.0	False
50	45.0	Vileplume	1.0	0.0	0.0	0.0	...	Poison	490.0	75.0	80.0	85.0	110.0	90.0	50.0	1.0	False

5 rows × 30 columns
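
The encoded output replaces the Type 1 column with indicator columns such as Type 1_Grass. The pure-Python sketch below mirrors that naming on a few made-up values to show what the transformation does; it is not PyCaret's encoder, and one_hot_encode is a hypothetical helper.

```python
def one_hot_encode(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator column per distinct category."""
    categories = []
    for v in values:
        if v not in categories:  # keep first-seen order
            categories.append(v)
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    names = [f"{prefix}_{c}" for c in categories]
    return names, rows


type_1 = ["Grass", "Ghost", "Water", "Steel", "Grass"]
names, rows = one_hot_encode(type_1, prefix="Type 1")
print(names)
for row in rows:
    print(row)
```

Every row contains exactly one 1.0, so a column with many distinct values produces many mostly-zero columns; this is why the width of the transformed table grows with the number of categories.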

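2.4 Data balancing

In PyCaret, fix_imbalance and fix_imbalance_method are the two setup parameters for handling imbalanced datasets. They act as a preprocessing step before model training to address class imbalance.

fix_imbalance: a boolean that controls whether the imbalanced dataset is rebalanced. When set to True, PyCaret resamples the data to address the class imbalance. When set to False, PyCaret trains on the original, imbalanced data.
fix_imbalance_method: specifies the method used to rebalance the data. Options include:
- SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic samples to balance the classes (the default, smote)
- an estimator provided by imbalanced-learn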

# Load the data
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/credit', verbose=False)
# data = get_data('credit')
data.head()

	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_1	PAY_2	PAY_3	PAY_4	PAY_5	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default
0	20000	2	2	1	24	2	2	-1	-1	-2	...	0.0	0.0	0.0	0.0	689.0	0.0	0.0	0.0	0.0	1
1	90000	2	2	2	34	0	0	0	0	0	...	14331.0	14948.0	15549.0	1518.0	1500.0	1000.0	1000.0	1000.0	5000.0	0
2	50000	2	2	1	37	0	0	0	0	0	...	28314.0	28959.0	29547.0	2000.0	2019.0	1200.0	1100.0	1069.0	1000.0	0
3	50000	1	2	1	57	-1	0	-1	0	0	...	20940.0	19146.0	19131.0	2000.0	36681.0	10000.0	9000.0	689.0	679.0	0
4	50000	1	1	2	37	0	0	0	0	0	...	19394.0	19619.0	20024.0	2500.0	1815.0	657.0	1000.0	1000.0	800.0	0

5 rows × 24 columns

# Count the samples in each class
category_counts = data['default'].value_counts()
category_counts

default
0    18694
1     5306
Name: count, dtype: int64

from pycaret.classification import *
s = setup(data = data, target = 'default', fix_imbalance = True, verbose = False)

# Class 1 now has more samples
s.get_config('dataset_transformed')['default'].value_counts()

default
0    18694
1    14678
Name: count, dtype: int64
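
The core idea behind SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in a few lines. This is an illustrative simplification rather than the imbalanced-learn implementation; smote_like and its parameters are hypothetical names.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like(minority, n_new=6)
print(len(new_points), "synthetic minority samples")
```

Because each new point lies on a segment between two real minority samples, the synthetic data stays inside the region already occupied by the minority class.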

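2.5 Outlier handling

The remove_outliers parameter of setup identifies and removes outliers from the dataset before the model is trained. In PyCaret 3.x the detector is chosen with outliers_method (Isolation Forest by default), whereas earlier releases identified outliers with PCA linear dimensionality reduction based on singular value decomposition. The outliers_threshold parameter of setup controls the proportion of outliers (default 0.05).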

from pycaret.datasets import get_data

data = get_data(dataset='./datasets/insurance', verbose=False)
# data = get_data('insurance')
# Data dimensions
data.shape

(1338, 7)

from pycaret.regression import *
s = setup(data = data, target = 'charges', remove_outliers = True ,verbose = False, outliers_threshold = 0.02)
# After the outliers are removed, there are fewer rows
s.get_config('dataset_transformed').shape

(1319, 10)
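
The effect of a proportion-based threshold can be sketched without PyCaret: score every row, then drop the given fraction with the most extreme scores. The score used here (distance from the median) is only a stand-in chosen for this sketch and is not PyCaret's detector; drop_top_outliers is a hypothetical name.

```python
import statistics


def drop_top_outliers(values, threshold=0.05):
    """Drop the fraction `threshold` of values that lie farthest from the median."""
    median = statistics.median(values)
    n_drop = round(threshold * len(values))
    # Indices of the n_drop values farthest from the median
    ranked = sorted(range(len(values)),
                    key=lambda i: abs(values[i] - median),
                    reverse=True)
    to_drop = set(ranked[:n_drop])
    return [v for i, v in enumerate(values) if i not in to_drop]


charges = [1200.0, 1350.0, 1100.0, 1280.0, 1420.0,
           1190.0, 1310.0, 1260.0, 1330.0, 60000.0]
cleaned = drop_top_outliers(charges, threshold=0.1)
print(len(charges), "->", len(cleaned))
```

In the run above, 1338 - 1319 = 19 rows were removed. Assuming the default 70% training split applies and that outliers are removed from the training rows only, that is close to 0.02 × 936 ≈ 18.7.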

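2.6 Feature importance

Feature importance is used to select the features in a dataset that contribute most to predicting the target variable. Compared with using all features, using only the selected ones can reduce the risk of overfitting, improve accuracy, and shorten training time. In PyCaret this is done with the feature_selection parameter. The parameters related to feature selection are:

feature_selection: whether to perform feature selection during model training. Can be True or False.
feature_selection_method: the feature selection method:
- 'univariate': uses sklearn's SelectKBest, which relies on statistical tests to pick the features most related to the target variable.
- 'classic' (default): uses sklearn's SelectFromModel, which picks the most important features from a supervised model's feature importances or coefficients.
- 'sequential': uses sklearn's SequentialFeatureSelector, which adds or removes features step by step (forward or backward selection) according to a performance metric such as the cross-validation score.
n_features_to_select: the maximum number or fraction of features to select. If < 1, it is a fraction of the starting features. The default is 0.2. Features listed in ignore_features and keep_features are not counted.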

from pycaret.datasets import get_data
data = get_data('./datasets/diabetes')

	Number of times pregnant	Plasma glucose concentration a 2 hours in an oral glucose tolerance test	Diastolic blood pressure (mm Hg)	Triceps skin fold thickness (mm)	2-Hour serum insulin (mu U/ml)	Body mass index (weight in kg/(height in m)^2)	Diabetes pedigree function	Age (years)	Class variable
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

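# The target here is binary, so pycaret.classification would be the more usual choice;
# the regression import is kept as in the original run
from pycaret.regression import *
# feature_selection enables feature selection; n_features_to_select sets the fraction of features to keep
s = setup(data = data, target = 'Class variable', feature_selection = True, feature_selection_method = 'univariate',
          n_features_to_select = 0.3, verbose = False)

# Check which features were kept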
s.get_config('X_transformed').columns
s.get_config('X_transformed').head()

	Plasma glucose concentration a 2 hours in an oral glucose tolerance test	Body mass index (weight in kg/(height in m)^2)
56	187.0	37.700001
541	128.0	32.400002
269	146.0	27.500000
304	150.0	21.000000
32	88.0	24.799999
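
A rough sketch of univariate selection with a fractional n_features_to_select follows. It ranks features by absolute Pearson correlation with the target as a stand-in for the statistical test used by SelectKBest, and it assumes the fraction is truncated to a whole number of features. The helper names and the toy data are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def select_univariate(columns, target, n_features_to_select=0.2):
    """columns maps feature name -> list of values; returns the kept names."""
    if n_features_to_select < 1:
        # Assumption: a fraction is truncated to a whole number of features
        k = int(n_features_to_select * len(columns))
    else:
        k = int(n_features_to_select)
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


columns = {
    "glucose": [85, 89, 90, 148, 183, 137],
    "bmi": [26.6, 28.1, 25.0, 33.6, 23.3, 43.1],
    "age": [31, 21, 40, 50, 32, 33],
    "noise": [3, 1, 2, 2, 1, 3],
}
target = [0, 0, 0, 1, 1, 1]
selected = select_univariate(columns, target, n_features_to_select=0.5)
print(selected)
```

Under the same truncation assumption, 0.3 × 8 starting features gives 2, which is consistent with the two columns kept in the output above.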

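2.7 Normalization

Data normalization

In PyCaret, the normalize and normalize_method parameters control feature scaling during preprocessing. Feature scaling rescales feature values proportionally so that they fall into a small, specific range. This removes the effect of differing units between features and makes model training more stable and accurate. The two parameters are described below:

normalize: a boolean that specifies whether features are scaled. The default is False, meaning no scaling. Setting it to True enables feature scaling.
normalize_method: a string that specifies the scaling method. Possible values are:
- zscore (default): Z-score standardization. Each value has the feature's mean subtracted and is then divided by the feature's standard deviation, so that the feature ends up with mean 0 and standard deviation 1.
- minmax: Min-Max scaling, also called normalization. Values are mapped linearly onto a given minimum and maximum, [0, 1] by default.
- maxabs: MaxAbs scaling. Each value is divided by the feature's maximum absolute value, which scales the feature into the range [-1, 1].
- robust: RobustScaler scaling. Each feature is centered on its median and scaled by its interquartile range.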

from pycaret.datasets import get_data
data = get_data('./datasets/pokemon')
data.head()

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False
3	3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	False
4	4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False

# Normalization
from pycaret.classification import *
s = setup(data, target='Legendary', normalize=True, normalize_method='robust', verbose=False)

Normalization result:

s.get_config('X_transformed').head()

	#	Type 1_Water	Type 1_Normal	Type 1_Ice	Type 1_Psychic	...	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation
403	-0.021629	1.0	0.0	0.0	0.0	...	0.195387	-0.333333	0.200000	0.875	1.088889	0.125	-0.288889	0.000000
471	0.139870	0.0	1.0	0.0	0.0	...	0.179104	0.333333	0.555556	-0.100	-0.111111	-0.100	1.111111	0.333333
238	-0.448450	0.0	0.0	1.0	0.0	...	-1.080054	-0.500000	-0.555556	-0.750	-0.777778	-1.000	-0.333333	-0.333333
646	0.604182	0.0	1.0	0.0	0.0	...	-0.618725	-0.166667	-0.333333	-0.500	-0.555556	-0.500	0.222222	0.666667
69	-0.898342	0.0	0.0	0.0	1.0	...	-0.265943	-0.833333	-0.888889	-1.000	1.222222	0.000	0.888889	-0.666667

5 rows × 46 columns
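
The four scaling methods listed above reduce to short formulas. The sketch below implements them for a single feature using only the standard library. It follows the usual textbook definitions and is not taken from PyCaret's code; the sample values are the HP column of the five rows shown by data.head().

```python
import statistics


def zscore(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def minmax(xs):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def maxabs(xs):
    """Divide by the maximum absolute value, giving values in [-1, 1]."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]


def robust(xs):
    """Center on the median and scale by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


hp = [45.0, 60.0, 80.0, 80.0, 39.0]
for scale in (zscore, minmax, maxabs, robust):
    print(scale.__name__, [round(v, 3) for v in scale(hp)])
```

The robust variant depends only on the median and the quartiles, so a single extreme value shifts it far less than it shifts zscore or minmax.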

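Feature transformation

Normalization rescales the data into a new range to reduce the effect of magnitude on variance. Feature transformation is a more radical technique: it changes the shape of the distribution so that the transformed data is normally or approximately normally distributed. In PyCaret, the transformation parameter enables feature transformation and transformation_method selects the method: yeo-johnson (default) or quantile. Besides feature transformation there is also target transformation, which changes the shape of the target variable's distribution rather than that of the features. It is only available in the pycaret.regression module. Use transform_target to enable it and transform_target_method to select the method.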

from pycaret.classification import *
s = setup(data = data, target = 'Legendary', transformation = True, verbose = False)
# Result of the feature transformation
s.get_config('X_transformed').head()

	#	Name	Type 1_Psychic	Type 1_Water	Type 1_Rock	Type 1_Grass	Type 1_Dragon	Type 1_Ghost	Type 1_Bug	Type 1_Fairy	...	Type 2_Electric	Type 2_Bug	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation
165	52.899003	0.009216	0.043322	-0.000000	-0.000000	-0.000000	-0.000000	-0.0	-0.0	-0.0	...	-0.0	-0.0	93.118403	12.336844	23.649090	13.573010	10.692443	8.081703	26.134255	0.900773
625	140.730289	0.009216	-0.000000	0.095739	-0.000000	-0.000000	-0.000000	-0.0	-0.0	-0.0	...	-0.0	-0.0	66.091344	9.286671	20.259153	13.764668	8.160482	6.056644	9.552506	3.679456
628	141.283084	0.009216	-0.000000	-0.000000	0.043322	-0.000000	-0.000000	-0.0	-0.0	-0.0	...	-0.0	-0.0	89.747939	10.823299	29.105379	11.029571	11.203335	6.942091	27.793080	3.679456
606	137.396878	0.009216	-0.000000	-0.000000	-0.000000	0.061897	-0.000000	-0.0	-0.0	-0.0	...	-0.0	-0.0	56.560577	8.043018	10.276208	10.604937	6.949265	6.302465	19.943809	3.679456
672	149.303914	0.009216	-0.000000	-0.000000	-0.000000	-0.000000	0.029706	-0.0	-0.0	-0.0	...	-0.0	-0.0	72.626190	10.202245	26.061259	11.435493	7.199607	6.302465	20.141156	3.679456

5 rows × 46 columns
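
For reference, the Yeo-Johnson transform itself is a piecewise power function. The sketch below applies it with a hand-picked lambda. In practice the value is estimated from the data for each feature, a step this sketch leaves out, so the lambda values used here are assumptions for illustration only.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if abs(lmbda - 2) < 1e-12:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


speeds = [45.0, 60.0, 80.0, 80.0, 65.0]
# lambda = 1 leaves the data unchanged; lambda < 1 compresses large positive values
print([yeo_johnson(v, 1.0) for v in speeds])
print([round(yeo_johnson(v, 0.5), 3) for v in speeds])
```

Unlike the Box-Cox transform, this definition also handles zero and negative inputs, which is why it can be applied to arbitrary numeric features.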

3 References

pycaret
pycaret-docs
pycaret-datasets
lightgbm
cuml
imbalanced-learn
