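I. Overview
In this post we first build a wine-quality prediction program with a regression algorithm, then reimplement it with AutoML, and compare the two implementations to learn how AutoML is used.
The data is the UCI Wine Quality Dataset from the competition site kaggle.com, available at: https://www.kaggle.com/c/uci-wine-quality-dataset/data
The inputs are physicochemical measurements of each wine, such as alcohol content, and the output is a taster's score. The fields are described as follows: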
Data fields

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Other:
13 - id (unique ID for each sample, needed for submission)
II. Code
    using System;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    namespace Regression_WineQuality
    {
        // One row of the CSV file; LoadColumn gives the zero-based column index.
        public class WineData
        {
            [LoadColumn(0)] public float FixedAcidity;
            [LoadColumn(1)] public float VolatileAcidity;
            [LoadColumn(2)] public float CitricACID;
            [LoadColumn(3)] public float ResidualSugar;
            [LoadColumn(4)] public float Chlorides;
            [LoadColumn(5)] public float FreeSulfurDioxide;
            [LoadColumn(6)] public float TotalSulfurDioxide;
            [LoadColumn(7)] public float Density;
            [LoadColumn(8)] public float PH;
            [LoadColumn(9)] public float Sulphates;
            [LoadColumn(10)] public float Alcohol;
            [LoadColumn(11)] [ColumnName("Label")] public float Quality;
            [LoadColumn(12)] public float Id;
        }

        public class WinePrediction
        {
            [ColumnName("Score")] public float PredictionQuality;
        }

        class Program
        {
            static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "MLModel", "model.zip");

            static void Main(string[] args)
            {
                Train();
                Prediction();

                Console.WriteLine("Hit any key to finish the app");
                Console.ReadKey();
            }

            public static void Train()
            {
                MLContext mlContext = new MLContext(seed: 1);

                // Prepare the data: load the full file and hold out 20% for testing
                string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-full.csv");
                var fulldata = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);
                var trainTestData = mlContext.Data.TrainTestSplit(fulldata, testFraction: 0.2);
                var trainData = trainTestData.TrainSet;
                var testData = trainTestData.TestSet;

                // Build the learning pipeline and fit the model on the training data
                var dataProcessPipeline = mlContext.Transforms.DropColumns("Id")
                    .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.FreeSulfurDioxide)))
                    .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.TotalSulfurDioxide)))
                    .Append(mlContext.Transforms.Concatenate("Features", new string[] {
                        nameof(WineData.FixedAcidity),
                        nameof(WineData.VolatileAcidity),
                        nameof(WineData.CitricACID),
                        nameof(WineData.ResidualSugar),
                        nameof(WineData.Chlorides),
                        nameof(WineData.FreeSulfurDioxide),
                        nameof(WineData.TotalSulfurDioxide),
                        nameof(WineData.Density),
                        nameof(WineData.PH),
                        nameof(WineData.Sulphates),
                        nameof(WineData.Alcohol)}));

                var trainer = mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: "Label", featureColumnName: "Features");
                var trainingPipeline = dataProcessPipeline.Append(trainer);
                var trainedModel = trainingPipeline.Fit(trainData);

                // Evaluate on the held-out test data
                var predictions = trainedModel.Transform(testData);
                var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
                PrintRegressionMetrics(trainer.ToString(), metrics);

                // Save the model
                Console.WriteLine("====== Save model to local file =========");
                mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
            }

            static void Prediction()
            {
                MLContext mlContext = new MLContext(seed: 1);

                ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
                var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

                WineData wineData = new WineData
                {
                    FixedAcidity = 7.6f,
                    VolatileAcidity = 0.33f,
                    CitricACID = 0.36f,
                    ResidualSugar = 2.1f,
                    Chlorides = 0.034f,
                    FreeSulfurDioxide = 26f,
                    TotalSulfurDioxide = 172f,
                    Density = 0.9944f,
                    PH = 3.42f,
                    Sulphates = 0.48f,
                    Alcohol = 10.5f
                };

                var wineQuality = predictor.Predict(wineData);
                Console.WriteLine($"Wine Data Quality is:{wineQuality.PredictionQuality} ");
            }

            // Console helper that the original listing calls but does not show.
            // This version is an assumption, modeled on the metrics output printed later in this post.
            static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
            {
                Console.WriteLine("*************************************************");
                Console.WriteLine($"*       Metrics for {name} regression model");
                Console.WriteLine("*------------------------------------------------");
                Console.WriteLine($"*       LossFn:        {metrics.LossFunction:0.##}");
                Console.WriteLine($"*       R2 Score:      {metrics.RSquared:0.##}");
                Console.WriteLine($"*       Absolute loss: {metrics.MeanAbsoluteError:#.##}");
                Console.WriteLine($"*       Squared loss:  {metrics.MeanSquaredError:#.##}");
                Console.WriteLine($"*       RMS loss:      {metrics.RootMeanSquaredError:#.##}");
                Console.WriteLine("*************************************************");
            }
        }
    }
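The Poisson regression algorithm was already introduced in the earlier post on judging facial attractiveness. This program involves no new concepts, so it is not explained again; it mainly serves as a baseline to compare against the AutoML code below.
III. Automated Learning (AutoML)
The overall machine learning workflow is much the same every time: prepare the data, define the features, choose an algorithm, train, and so on. This often raises questions such as: which algorithm should we choose, and how should its parameters be configured? AutoML addresses exactly this problem. The framework repeatedly cycles through data selection, algorithm selection, parameter tuning, and evaluation, and keeps the model with the best evaluation result.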
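To make the loop described above concrete, here is a minimal sketch in plain C# with no ML.NET dependency. It is not how ML.NET's AutoML is implemented; `Candidate`, `TinyAutoMl`, and the two toy trainers are hypothetical names invented for this illustration. The sketch trains each candidate, scores it on validation data by R-squared, reports progress, and keeps the best one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one "algorithm + parameters" combination.
// Train takes (x, y) training arrays and returns a prediction function.
public sealed class Candidate
{
    public string Name { get; }
    public Func<double[], double[], Func<double, double>> Train { get; }

    public Candidate(string name, Func<double[], double[], Func<double, double>> train)
    {
        Name = name;
        Train = train;
    }
}

public static class TinyAutoMl
{
    // R-squared = 1 - (sum of squared residuals / total sum of squares).
    public static double RSquared(double[] actual, double[] predicted)
    {
        double mean = actual.Average();
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - ssRes / ssTot;
    }

    // Train every candidate, score it on the validation split,
    // report each result, and return the best-scoring candidate.
    public static (string Name, double Score) Search(
        IEnumerable<Candidate> candidates,
        double[] trainX, double[] trainY,
        double[] validX, double[] validY,
        IProgress<(string Name, double Score)> progress = null)
    {
        (string Name, double Score) best = (null, double.NegativeInfinity);
        foreach (var candidate in candidates)
        {
            Func<double, double> model = candidate.Train(trainX, trainY);
            double score = RSquared(validY, validX.Select(model).ToArray());
            progress?.Report((candidate.Name, score));
            if (score > best.Score) best = (candidate.Name, score);
        }
        return best;
    }

    // Toy candidate 1: always predicts the mean of the training labels.
    public static Candidate MeanBaseline() =>
        new Candidate("MeanBaseline", (x, y) =>
        {
            double mean = y.Average();
            return _ => mean;
        });

    // Toy candidate 2: ordinary least-squares line through the training points.
    public static Candidate LeastSquaresLine() =>
        new Candidate("LeastSquaresLine", (x, y) =>
        {
            double mx = x.Average(), my = y.Average();
            double slope = x.Zip(y, (a, b) => (a - mx) * (b - my)).Sum()
                         / x.Sum(a => (a - mx) * (a - mx));
            double intercept = my - slope * mx;
            return v => intercept + slope * v;
        });
}
```

Calling `TinyAutoMl.Search` with both toy candidates returns the name and validation score of whichever fits better. The real framework additionally varies hyperparameters between trials and stops when its time budget runs out, which this sketch leaves out.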
The complete code is as follows:
    using System;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.AutoML;
    using Microsoft.ML.Data;

    namespace Regression_WineQuality
    {
        public class WineData
        {
            [LoadColumn(0)] public float FixedAcidity;
            [LoadColumn(1)] public float VolatileAcidity;
            [LoadColumn(2)] public float CitricACID;
            [LoadColumn(3)] public float ResidualSugar;
            [LoadColumn(4)] public float Chlorides;
            [LoadColumn(5)] public float FreeSulfurDioxide;
            [LoadColumn(6)] public float TotalSulfurDioxide;
            [LoadColumn(7)] public float Density;
            [LoadColumn(8)] public float PH;
            [LoadColumn(9)] public float Sulphates;
            [LoadColumn(10)] public float Alcohol;
            [LoadColumn(11)] [ColumnName("Label")] public float Quality;
            [LoadColumn(12)] public float ID;
        }

        public class WinePrediction
        {
            [ColumnName("Score")] public float PredictionQuality;
        }

        class Program
        {
            static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "MLModel", "model.zip");
            static readonly string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-train.csv");
            static readonly string TestDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-test.csv");

            static void Main(string[] args)
            {
                TrainAndSave();
                LoadAndPrediction();

                Console.WriteLine("Hit any key to finish the app");
                Console.ReadKey();
            }

            public static void TrainAndSave()
            {
                MLContext mlContext = new MLContext(seed: 1);

                // Prepare the data: separate training and test files
                var trainData = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);
                var testData = mlContext.Data.LoadFromTextFile<WineData>(path: TestDataPath, separatorChar: ',', hasHeader: true);

                // Run the AutoML experiment with a time budget in seconds
                var progressHandler = new RegressionExperimentProgressHandler();
                uint ExperimentTime = 200;

                ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
                    .CreateRegressionExperiment(ExperimentTime)
                    .Execute(trainData, "Label", progressHandler: progressHandler);

                // Debugger.PrintTopModels is listed later in this post.
                Debugger.PrintTopModels(experimentResult);

                RunDetail<RegressionMetrics> best = experimentResult.BestRun;
                ITransformer trainedModel = best.Model;

                // Evaluate BestRun on the test data
                var predictions = trainedModel.Transform(testData);
                var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
                // Debugger.PrintRegressionMetrics is a console helper that is not shown in this post.
                Debugger.PrintRegressionMetrics(best.TrainerName, metrics);

                // Save the model
                Console.WriteLine("====== Save model to local file =========");
                mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
            }

            static void LoadAndPrediction()
            {
                MLContext mlContext = new MLContext(seed: 1);

                ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
                var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

                WineData wineData = new WineData
                {
                    FixedAcidity = 7.6f,
                    VolatileAcidity = 0.33f,
                    CitricACID = 0.36f,
                    ResidualSugar = 2.1f,
                    Chlorides = 0.034f,
                    FreeSulfurDioxide = 26f,
                    TotalSulfurDioxide = 172f,
                    Density = 0.9944f,
                    PH = 3.42f,
                    Sulphates = 0.48f,
                    Alcohol = 10.5f
                };

                var wineQuality = predictor.Predict(wineData);
                Console.WriteLine($"Wine Data Quality is:{wineQuality.PredictionQuality} ");
            }
        }
    }
IV. Code Analysis
1. The AutoML process
    var progressHandler = new RegressionExperimentProgressHandler();
    uint ExperimentTime = 200;

    ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
        .CreateRegressionExperiment(ExperimentTime)
        .Execute(trainData, "Label", progressHandler: progressHandler);

    Debugger.PrintTopModels(experimentResult); // print the metrics of all models
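ExperimentTime is the time budget allowed for the experiment, in seconds. progressHandler is a reporter: each time one trial finishes, the framework calls its Report method once.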
    public class RegressionExperimentProgressHandler : IProgress<RunDetail<RegressionMetrics>>
    {
        private int _iterationIndex;

        // Called by the experiment once per completed trial.
        public void Report(RunDetail<RegressionMetrics> iterationResult)
        {
            _iterationIndex++;
            Console.WriteLine($"Report index:{_iterationIndex},TrainerName:{iterationResult.TrainerName},RuntimeInSeconds:{iterationResult.RuntimeInSeconds}");
        }
    }
The debug output is as follows:
    Report index:1,TrainerName:SdcaRegression,RuntimeInSeconds:12.5244426
    Report index:2,TrainerName:LightGbmRegression,RuntimeInSeconds:11.2034988
    Report index:3,TrainerName:FastTreeRegression,RuntimeInSeconds:14.810409
    Report index:4,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:14.7338553
    Report index:5,TrainerName:FastForestRegression,RuntimeInSeconds:15.6224459
    Report index:6,TrainerName:LbfgsPoissonRegression,RuntimeInSeconds:11.1668197
    Report index:7,TrainerName:OnlineGradientDescentRegression,RuntimeInSeconds:10.5353
    Report index:8,TrainerName:OlsRegression,RuntimeInSeconds:10.8905459
    Report index:9,TrainerName:LightGbmRegression,RuntimeInSeconds:10.5703296
    Report index:10,TrainerName:FastTreeRegression,RuntimeInSeconds:19.4470509
    Report index:11,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:63.638882
    Report index:12,TrainerName:LightGbmRegression,RuntimeInSeconds:10.7710518
When the experiment finishes, Debugger.PrintTopModels prints the metrics of every model:
    using System;
    using System.Linq;
    using Microsoft.ML.AutoML;
    using Microsoft.ML.Data;

    public class Debugger
    {
        private const int Width = 114;

        public static void PrintTopModels(ExperimentResult<RegressionMetrics> experimentResult)
        {
            // Keep only runs with valid metrics, best R-squared first
            var topRuns = experimentResult.RunDetails
                .Where(r => r.ValidationMetrics != null && !double.IsNaN(r.ValidationMetrics.RSquared))
                .OrderByDescending(r => r.ValidationMetrics.RSquared)
                .ToList();

            Console.WriteLine("Top models ranked by R-Squared --");
            PrintRegressionMetricsHeader();
            for (var i = 0; i < topRuns.Count; i++)
            {
                var run = topRuns[i];
                PrintIterationMetrics(i + 1, run.TrainerName, run.ValidationMetrics, run.RuntimeInSeconds);
            }
        }

        public static void PrintRegressionMetricsHeader()
        {
            CreateRow($"{"",-4} {"Trainer",-35} {"RSquared",8} {"Absolute-loss",13} {"Squared-loss",12} {"RMS-loss",8} {"Duration",9}", Width);
        }

        public static void PrintIterationMetrics(int iteration, string trainerName, RegressionMetrics metrics, double? runtimeInSeconds)
        {
            CreateRow($"{iteration,-4} {trainerName,-35} {metrics?.RSquared ?? double.NaN,8:F4} {metrics?.MeanAbsoluteError ?? double.NaN,13:F2} {metrics?.MeanSquaredError ?? double.NaN,12:F2} {metrics?.RootMeanSquaredError ?? double.NaN,8:F2} {runtimeInSeconds.Value,9:F1}", Width);
        }

        // Pads the message to a fixed width and frames it with '|' characters
        public static void CreateRow(string message, int width)
        {
            Console.WriteLine("|" + message.PadRight(width - 2) + "|");
        }
    }
CreateRow only handles the layout of each row. The output is as follows:
    Top models ranked by R-Squared --
    |     Trainer                             RSquared Absolute-loss Squared-loss RMS-loss  Duration |
    |1    FastTreeTweedieRegression             0.4731          0.46         0.41     0.64      63.6 |
    |2    FastTreeTweedieRegression             0.4431          0.49         0.43     0.65      14.7 |
    |3    FastTreeRegression                    0.4386          0.54         0.49     0.70      19.4 |
    |4    LightGbmRegression                    0.4177          0.52         0.45     0.67      10.8 |
    |5    FastTreeRegression                    0.4102          0.51         0.45     0.67      14.8 |
    |6    LightGbmRegression                    0.3944          0.52         0.46     0.68      11.2 |
    |7    LightGbmRegression                    0.3501          0.60         0.57     0.75      10.6 |
    |8    FastForestRegression                  0.3381          0.60         0.58     0.76      15.6 |
    |9    OlsRegression                         0.2829          0.56         0.53     0.73      10.9 |
    |10   LbfgsPoissonRegression                0.2760          0.62         0.63     0.80      11.2 |
    |11   SdcaRegression                        0.2746          0.58         0.56     0.75      12.5 |
    |12   OnlineGradientDescentRegression       0.0593          0.69         0.81     0.90      10.5 |
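The results show that some algorithms were tried more than once. Each trial of the same algorithm used different configuration parameters, such as thresholds or depth.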
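One way to see the repeated trials at a glance is to group the runs by trainer and keep the best score of each. The sketch below does this in plain C# over the twelve rows of the table above. `RunSummary` and `RunAnalysis` are hypothetical helper types for this illustration; in the real program the same grouping could presumably be applied to experimentResult.RunDetails using TrainerName and ValidationMetrics.RSquared.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for one AutoML run (trainer name + validation R-squared).
public sealed class RunSummary
{
    public string TrainerName { get; }
    public double RSquared { get; }

    public RunSummary(string trainerName, double rSquared)
    {
        TrainerName = trainerName;
        RSquared = rSquared;
    }
}

public static class RunAnalysis
{
    // Values copied from the "Top models ranked by R-Squared" table.
    public static readonly IReadOnlyList<RunSummary> Runs = new[]
    {
        new RunSummary("FastTreeTweedieRegression", 0.4731),
        new RunSummary("FastTreeTweedieRegression", 0.4431),
        new RunSummary("FastTreeRegression", 0.4386),
        new RunSummary("LightGbmRegression", 0.4177),
        new RunSummary("FastTreeRegression", 0.4102),
        new RunSummary("LightGbmRegression", 0.3944),
        new RunSummary("LightGbmRegression", 0.3501),
        new RunSummary("FastForestRegression", 0.3381),
        new RunSummary("OlsRegression", 0.2829),
        new RunSummary("LbfgsPoissonRegression", 0.2760),
        new RunSummary("SdcaRegression", 0.2746),
        new RunSummary("OnlineGradientDescentRegression", 0.0593),
    };

    // For each trainer: how many trials it got and the best R-squared among them,
    // ordered from best to worst.
    public static List<(string Trainer, int Trials, double BestRSquared)> BestPerTrainer(
        IEnumerable<RunSummary> runs) =>
        runs.GroupBy(r => r.TrainerName)
            .Select(g => (Trainer: g.Key, Trials: g.Count(), BestRSquared: g.Max(r => r.RSquared)))
            .OrderByDescending(t => t.BestRSquared)
            .ToList();
}
```

Applied to the table, this collapses the twelve runs into eight trainers and makes it easy to see that the tree-based trainers received the extra trials.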
2. Getting the best model
    RunDetail<RegressionMetrics> best = experimentResult.BestRun;
    ITransformer trainedModel = best.Model;
Once we have the best model, evaluating and saving it works the same way as in the earlier code. Evaluating it on the test data gives:
    *************************************************
    *       Metrics for FastTreeTweedieRegression regression model
    *------------------------------------------------
    *       LossFn:        0.67
    *       R2 Score:      0.34
    *       Absolute loss: .63
    *       Squared loss:  .67
    *       RMS loss:      .82
    *************************************************
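A result like this cannot be used in production. An R2 score of 0.34 means the model explains only about a third of the variation in quality scores, and the RMS loss of 0.82 means predictions are typically off by close to one point on the 0 to 10 scale. The most likely cause is that we have not found the key features that determine wine quality.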
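To make the numbers in the metrics block easier to interpret, the following sketch computes the same four quantities by hand from actual and predicted scores, using the standard textbook definitions. `RegressionMetricsByHand` is a hypothetical helper for illustration, not part of ML.NET. Note that the RMS loss is just the square root of the squared loss, which matches the output above (the square root of 0.67 is about 0.82).

```csharp
using System;
using System.Linq;

public static class RegressionMetricsByHand
{
    // Returns R-squared, mean absolute error, mean squared error and root mean squared error.
    public static (double RSquared, double MeanAbsoluteError, double MeanSquaredError, double RootMeanSquaredError)
        Compute(double[] actual, double[] predicted)
    {
        if (actual.Length == 0 || actual.Length != predicted.Length)
            throw new ArgumentException("Arrays must be non-empty and of equal length.");

        int n = actual.Length;
        double mean = actual.Average();
        double sumAbs = 0, sumSquared = 0, sumTotal = 0;

        for (int i = 0; i < n; i++)
        {
            double error = actual[i] - predicted[i];
            sumAbs += Math.Abs(error);                              // absolute loss
            sumSquared += error * error;                            // squared loss
            sumTotal += (actual[i] - mean) * (actual[i] - mean);    // spread of the labels
        }

        double mse = sumSquared / n;
        double rSquared = 1.0 - sumSquared / sumTotal;  // 1 = perfect, 0 = no better than the mean
        return (rSquared, sumAbs / n, mse, Math.Sqrt(mse));
    }
}
```

For example, with actual scores 5, 6, 7, 6 and predictions 5.5, 6, 6.5, 6, the mean absolute error is 0.25, the mean squared error is 0.125, and R-squared is 0.75.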
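V. Summary
With this post, the "ML.NET Study Notes" series comes to an end. The source code used along the way mostly comes from: https://github.com/dotnet/machinelearning-samples .
That repository also contains examples of other algorithms, including clustering, matrix factorization, and anomaly detection. Their overall workflow is much the same, so interested readers can explore them on their own using this series as a foundation.
VI. Resources
Source code download: https://github.com/seabluescn/Study_ML.NET
Regression project name: Regression_WineQuality
AutoML project name: Regression_WineQuality_AutoML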