ML.NET Machine Learning Framework Study Notes [9]: Automated Machine Learning (AutoML)

Posted by seabluescn on 2019-06-10

I. Overview

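In this article we first implement a wine quality prediction program with a regression algorithm, then re-implement it with AutoML, and compare the two implementations to learn how AutoML is used.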

The dataset is the UCI Wine Quality Dataset from the competition site kaggle.com, available at: https://www.kaggle.com/c/uci-wine-quality-dataset/data

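The inputs of this dataset are physicochemical measurements of each wine, such as alcohol content, and the output is the score given by wine tasters. The fields are described as follows: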

Data fields
Input variables (based on physicochemical tests): 
1 - fixed acidity 
2 - volatile acidity 
3 - citric acid 
4 - residual sugar 
5 - chlorides 
6 - free sulfur dioxide 
7 - total sulfur dioxide 
8 - density 
9 - pH 
10 - sulphates 
11 - alcohol

Output variable (based on sensory data): 
12 - quality (score between 0 and 10)

Other:
13 - id (unique ID for each sample, needed for submission)

   

II. Code

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Regression_WineQuality
{
    public class WineData
    {
        [LoadColumn(0)]
        public float FixedAcidity;

        [LoadColumn(1)]
        public float VolatileAcidity;

        [LoadColumn(2)]
        public float CitricACID;

        [LoadColumn(3)]
        public float ResidualSugar;

        [LoadColumn(4)]
        public float Chlorides;

        [LoadColumn(5)]
        public float FreeSulfurDioxide;

        [LoadColumn(6)]
        public float TotalSulfurDioxide;

        [LoadColumn(7)]
        public float Density;

        [LoadColumn(8)]
        public float PH;

        [LoadColumn(9)]
        public float Sulphates;

        [LoadColumn(10)]
        public float Alcohol;
      
        [LoadColumn(11)]
        [ColumnName("Label")]
        public float Quality;
       
        [LoadColumn(12)]
        public float Id;
    }

    public class WinePrediction
    {
        [ColumnName("Score")]
        public float PredictionQuality;
    }

    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "MLModel", "model.zip");

        static void Main(string[] args)
        { 
            Train();
            Prediction();

            Console.WriteLine("Hit any key to finish the app");
            Console.ReadKey();
        }

        public static void Train()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Load the data
            string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-full.csv");
            var fulldata = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);

            var trainTestData = mlContext.Data.TrainTestSplit(fulldata, testFraction: 0.2);
            var trainData = trainTestData.TrainSet;
            var testData = trainTestData.TestSet;

            // Build the learning pipeline and fit the model on the training data
            var dataProcessPipeline = mlContext.Transforms.DropColumns("Id")
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.FreeSulfurDioxide)))
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.TotalSulfurDioxide)))
                .Append(mlContext.Transforms.Concatenate("Features", new string[] { nameof(WineData.FixedAcidity),
                                                                                    nameof(WineData.VolatileAcidity),
                                                                                    nameof(WineData.CitricACID),
                                                                                    nameof(WineData.ResidualSugar),
                                                                                    nameof(WineData.Chlorides),
                                                                                    nameof(WineData.FreeSulfurDioxide),
                                                                                    nameof(WineData.TotalSulfurDioxide),
                                                                                    nameof(WineData.Density),
                                                                                    nameof(WineData.PH),
                                                                                    nameof(WineData.Sulphates),
                                                                                    nameof(WineData.Alcohol)}));

            var trainer = mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: "Label", featureColumnName: "Features");
            var trainingPipeline = dataProcessPipeline.Append(trainer);
            var trainedModel = trainingPipeline.Fit(trainData);

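            // Evaluate on the held-out test set
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            // PrintRegressionMetrics is a console helper that is not shown in this listing
            PrintRegressionMetrics(trainer.ToString(), metrics);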

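            // Save the model (create the target directory first, in case it does not exist yet)
            Console.WriteLine("====== Save model to local file =========");
            Directory.CreateDirectory(Path.GetDirectoryName(ModelFilePath));
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);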
        }

        static void Prediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricACID = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"Wine Data  Quality is:{wineQuality.PredictionQuality} ");           
        }        
    }
}

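We already covered the Poisson regression algorithm in the article on rating facial attractiveness. This program introduces nothing new, so I won't explain it again; its main purpose is to serve as a baseline for comparison with the AutoML code below.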

 

III. Automated Machine Learning

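The machine learning workflow is largely the same every time: prepare the data, identify the features, choose an algorithm, train, and so on. Along the way we often face questions such as: which algorithm should we choose, and how should its parameters be configured? AutoML addresses exactly this. The framework repeatedly cycles through data selection, algorithm selection, parameter tuning, and evaluation, and through this process finds the model with the best evaluation result.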
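To make the idea concrete, here is a toy sketch of the "try candidates, score each one, keep the best" loop. It is not how ML.NET's AutoML works internally: the `Candidate` class, the three hand-written prediction functions, and the tiny validation set are all hypothetical stand-ins, and nothing is actually trained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToyModelSearch
{
    // Hypothetical stand-in for a trained model: a name plus a prediction function.
    public sealed class Candidate
    {
        public string Name { get; }
        public Func<double, double> Predict { get; }

        public Candidate(string name, Func<double, double> predict)
        {
            Name = name;
            Predict = predict;
        }
    }

    // Root mean squared error of one candidate on (x, y) validation pairs.
    public static double Rmse(Candidate c, IReadOnlyList<(double X, double Y)> data)
    {
        double sumSq = data.Sum(p => Math.Pow(c.Predict(p.X) - p.Y, 2));
        return Math.Sqrt(sumSq / data.Count);
    }

    // Score every candidate and return the one with the lowest validation error.
    public static Candidate SelectBest(IEnumerable<Candidate> candidates,
                                       IReadOnlyList<(double X, double Y)> validation)
    {
        return candidates.OrderBy(c => Rmse(c, validation)).First();
    }

    public static void Main()
    {
        // Made-up validation data that follows y = 2x + 1.
        var validation = new List<(double X, double Y)> { (1, 3), (2, 5), (3, 7) };

        var candidates = new[]
        {
            new Candidate("constant-5", x => 5.0),
            new Candidate("slope-1", x => x + 1.0),
            new Candidate("slope-2", x => 2.0 * x + 1.0),
        };

        Candidate best = SelectBest(candidates, validation);
        Console.WriteLine($"Best candidate: {best.Name}, RMSE = {Rmse(best, validation):F2}");
    }
}
```

A real AutoML run differs in that each candidate is a full pipeline (transforms, trainer, parameter settings) that has to be trained before it can be scored, which is why the experiment is bounded by a time budget.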

The complete code is as follows:

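using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

namespace Regression_WineQuality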
{
    public class WineData
    {
        [LoadColumn(0)]
        public float FixedAcidity;

        [LoadColumn(1)]
        public float VolatileAcidity;

        [LoadColumn(2)]
        public float CitricACID;

        [LoadColumn(3)]
        public float ResidualSugar;

        [LoadColumn(4)]
        public float Chlorides;

        [LoadColumn(5)]
        public float FreeSulfurDioxide;

        [LoadColumn(6)]
        public float TotalSulfurDioxide;

        [LoadColumn(7)]
        public float Density;

        [LoadColumn(8)]
        public float PH;

        [LoadColumn(9)]
        public float Sulphates;

        [LoadColumn(10)]
        public float Alcohol;
      
        [LoadColumn(11)]
        [ColumnName("Label")]
        public float Quality;

        [LoadColumn(12)]       
        public float ID; 
    }

    public class WinePrediction
    {
        [ColumnName("Score")]
        public float PredictionQuality;
    }
 

    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "MLModel", "model.zip");
        static readonly string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-train.csv");
        static readonly string TestDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-test.csv");

        static void Main(string[] args)
        {           
            TrainAndSave();
            LoadAndPrediction();

            Console.WriteLine("Hit any key to finish the app");
            Console.ReadKey();
        }

        public static void TrainAndSave()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Load the training and test data
            var trainData = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);
            var testData = mlContext.Data.LoadFromTextFile<WineData>(path: TestDataPath, separatorChar: ',', hasHeader: true);
         
            var progressHandler = new RegressionExperimentProgressHandler();
            uint ExperimentTime = 200; // time budget for the whole experiment, in seconds

            ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
               .CreateRegressionExperiment(ExperimentTime)
               .Execute(trainData, "Label", progressHandler: progressHandler);           

            Debugger.PrintTopModels(experimentResult);

            RunDetail<RegressionMetrics> best = experimentResult.BestRun;
            ITransformer trainedModel = best.Model;

            // Evaluate the best run on the test data
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            Debugger.PrintRegressionMetrics(best.TrainerName, metrics);

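            // Save the model (create the target directory first, in case it does not exist yet)
            Console.WriteLine("====== Save model to local file =========");
            Directory.CreateDirectory(Path.GetDirectoryName(ModelFilePath));
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);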
        }
       

        static void LoadAndPrediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricACID = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"Wine Data  Quality is:{wineQuality.PredictionQuality} ");           
        }
    }
}
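One difference from the first program is worth noting: the regression version drops the Id column before training, while the AutoML listing loads ID together with the other fields, so AutoML will presumably treat it like any other numeric column. The sketch below shows one way the experiment could be configured to ignore that column and to state the optimization metric explicitly. It assumes the `RegressionExperimentSettings` and `ColumnInformation` types of the Microsoft.ML.AutoML package; member names may differ between package versions, and the `AutoMLConfigSketch` class and its `Run` method are my own illustrative names, not part of the original project.

```csharp
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public static class AutoMLConfigSketch
{
    // Runs a regression experiment with explicit settings.
    // Assumption: trainData was loaded with the WineData class above,
    // so the label column is "Label" and the sample id column is "ID".
    public static ExperimentResult<RegressionMetrics> Run(
        MLContext mlContext, IDataView trainData, uint seconds)
    {
        var settings = new RegressionExperimentSettings
        {
            MaxExperimentTimeInSeconds = seconds,          // same role as ExperimentTime
            OptimizingMetric = RegressionMetric.RSquared   // metric used to pick the best run
        };

        // Tell AutoML which column is the label and which column to leave out.
        var columnInfo = new ColumnInformation { LabelColumnName = "Label" };
        columnInfo.IgnoredColumnNames.Add("ID");

        return mlContext.Auto()
            .CreateRegressionExperiment(settings)
            .Execute(trainData, columnInfo);
    }
}
```

Whether ignoring the ID column changes the results on this dataset is something I have not measured; the point is only that the column set can be controlled when the default is not what you want.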

  

IV. Code Analysis

1. The AutoML process

            var progressHandler = new RegressionExperimentProgressHandler();
            uint ExperimentTime = 200;

            ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
               .CreateRegressionExperiment(ExperimentTime)
               .Execute(trainData, "Label", progressHandler: progressHandler);           

            Debugger.PrintTopModels(experimentResult); // print the metrics of all models

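ExperimentTime is the maximum time allowed for the experiment, in seconds. progressHandler is a progress reporter: each time one training run completes, the framework calls its Report method once.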

    public class RegressionExperimentProgressHandler : IProgress<RunDetail<RegressionMetrics>>
    {
        private int _iterationIndex;

        public void Report(RunDetail<RegressionMetrics> iterationResult)
        {
            _iterationIndex++;
            Console.WriteLine($"Report index:{_iterationIndex},TrainerName:{iterationResult.TrainerName},RuntimeInSeconds:{iterationResult.RuntimeInSeconds}");            
        }
    }

The debug output is as follows:

Report index:1,TrainerName:SdcaRegression,RuntimeInSeconds:12.5244426
Report index:2,TrainerName:LightGbmRegression,RuntimeInSeconds:11.2034988
Report index:3,TrainerName:FastTreeRegression,RuntimeInSeconds:14.810409
Report index:4,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:14.7338553
Report index:5,TrainerName:FastForestRegression,RuntimeInSeconds:15.6224459
Report index:6,TrainerName:LbfgsPoissonRegression,RuntimeInSeconds:11.1668197
Report index:7,TrainerName:OnlineGradientDescentRegression,RuntimeInSeconds:10.5353
Report index:8,TrainerName:OlsRegression,RuntimeInSeconds:10.8905459
Report index:9,TrainerName:LightGbmRegression,RuntimeInSeconds:10.5703296
Report index:10,TrainerName:FastTreeRegression,RuntimeInSeconds:19.4470509
Report index:11,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:63.638882
Report index:12,TrainerName:LightGbmRegression,RuntimeInSeconds:10.7710518
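Because the handler is an ordinary `IProgress<T>` implementation, it can do more than print. As a minimal sketch, the handler below also remembers the best run reported so far. To keep the example runnable without the ML.NET packages, it uses a hypothetical `RunSummary` class in place of `RunDetail<RegressionMetrics>`; in the real handler the R-squared value would come from `iterationResult.ValidationMetrics`.

```csharp
using System;

// Hypothetical stand-in for the fields of RunDetail<RegressionMetrics> used here.
public sealed class RunSummary
{
    public string TrainerName { get; set; }
    public double RSquared { get; set; }
}

// Counts the reports and keeps track of the run with the highest R-squared.
public sealed class BestSoFarHandler : IProgress<RunSummary>
{
    public int Count { get; private set; }
    public RunSummary Best { get; private set; }

    public void Report(RunSummary run)
    {
        Count++;

        // Skip runs without a usable metric; otherwise keep the higher R-squared.
        if (!double.IsNaN(run.RSquared) && (Best == null || run.RSquared > Best.RSquared))
        {
            Best = run;
        }

        Console.WriteLine(
            $"Run {Count}: {run.TrainerName}, R2 = {run.RSquared:F4}, best so far: {Best?.TrainerName ?? "none"}");
    }
}

public static class ProgressDemo
{
    public static void Main()
    {
        // The first three runs from the log above, with their R-squared values
        // taken from the results table later in this article.
        var handler = new BestSoFarHandler();
        handler.Report(new RunSummary { TrainerName = "SdcaRegression", RSquared = 0.2746 });
        handler.Report(new RunSummary { TrainerName = "LightGbmRegression", RSquared = 0.3944 });
        handler.Report(new RunSummary { TrainerName = "FastTreeRegression", RSquared = 0.4102 });
    }
}
```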

After training finishes, we print the data for all models with Debugger.PrintTopModels:

   public class Debugger
    {
        private const int Width = 114;
        public  static void PrintTopModels(ExperimentResult<RegressionMetrics> experimentResult)
        {            
            // Keep only runs with a valid validation R-squared, best first; materialize the list once
            var topRuns = experimentResult.RunDetails
                .Where(r => r.ValidationMetrics != null && !double.IsNaN(r.ValidationMetrics.RSquared))
                .OrderByDescending(r => r.ValidationMetrics.RSquared)
                .ToList();

            Console.WriteLine("Top models ranked by R-Squared --");
            PrintRegressionMetricsHeader();
            for (var i = 0; i < topRuns.Count; i++)
            {
                var run = topRuns[i];
                PrintIterationMetrics(i + 1, run.TrainerName, run.ValidationMetrics, run.RuntimeInSeconds);
            }
        }       

        public static void PrintRegressionMetricsHeader()
        {
            CreateRow($"{"",-4} {"Trainer",-35} {"RSquared",8} {"Absolute-loss",13} {"Squared-loss",12} {"RMS-loss",8} {"Duration",9}", Width);
        }

        public static void PrintIterationMetrics(int iteration, string trainerName, RegressionMetrics metrics, double? runtimeInSeconds)
        {
            CreateRow($"{iteration,-4} {trainerName,-35} {metrics?.RSquared ?? double.NaN,8:F4} {metrics?.MeanAbsoluteError ?? double.NaN,13:F2} {metrics?.MeanSquaredError ?? double.NaN,12:F2} {metrics?.RootMeanSquaredError ?? double.NaN,8:F2} {runtimeInSeconds.Value,9:F1}", Width);
        }

        public static void CreateRow(string message, int width)
        {
            Console.WriteLine("|" + message.PadRight(width - 2) + "|");
        }
}

CreateRow only handles the row layout. The debug output is as follows:

Top models ranked by R-Squared --
|     Trainer                             RSquared Absolute-loss Squared-loss RMS-loss  Duration                 |
|1    FastTreeTweedieRegression             0.4731          0.46         0.41     0.64      63.6                 |
|2    FastTreeTweedieRegression             0.4431          0.49         0.43     0.65      14.7                 |
|3    FastTreeRegression                    0.4386          0.54         0.49     0.70      19.4                 |
|4    LightGbmRegression                    0.4177          0.52         0.45     0.67      10.8                 |
|5    FastTreeRegression                    0.4102          0.51         0.45     0.67      14.8                 |
|6    LightGbmRegression                    0.3944          0.52         0.46     0.68      11.2                 |
|7    LightGbmRegression                    0.3501          0.60         0.57     0.75      10.6                 |
|8    FastForestRegression                  0.3381          0.60         0.58     0.76      15.6                 |
|9    OlsRegression                         0.2829          0.56         0.53     0.73      10.9                 |
|10   LbfgsPoissonRegression                0.2760          0.62         0.63     0.80      11.2                 |
|11   SdcaRegression                        0.2746          0.58         0.56     0.75      12.5                 |
|12   OnlineGradientDescentRegression       0.0593          0.69         0.81     0.90      10.5                 |

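As the results show, some algorithms were tried more than once, but each run of the same algorithm used different parameter settings, such as thresholds and tree depth.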
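The repetition is easy to see by grouping the runs in the table above by trainer name. The sketch below does this with LINQ on the (trainer, R-squared) pairs copied from that table; the `RunsByTrainer` class is an illustrative helper of my own, and with a real `ExperimentResult` the same grouping could presumably be applied to `experimentResult.RunDetails`.

```csharp
using System;
using System.Linq;

public static class RunsByTrainer
{
    // (trainer, validation R-squared) pairs copied from the results table above.
    public static readonly (string Trainer, double RSquared)[] Runs =
    {
        ("FastTreeTweedieRegression", 0.4731),
        ("FastTreeTweedieRegression", 0.4431),
        ("FastTreeRegression", 0.4386),
        ("LightGbmRegression", 0.4177),
        ("FastTreeRegression", 0.4102),
        ("LightGbmRegression", 0.3944),
        ("LightGbmRegression", 0.3501),
        ("FastForestRegression", 0.3381),
        ("OlsRegression", 0.2829),
        ("LbfgsPoissonRegression", 0.2760),
        ("SdcaRegression", 0.2746),
        ("OnlineGradientDescentRegression", 0.0593),
    };

    // One row per trainer: how many times it was tried and its best R-squared.
    public static (string Trainer, int Count, double BestRSquared)[] Summarize()
    {
        return Runs
            .GroupBy(r => r.Trainer)
            .Select(g => (Trainer: g.Key, Count: g.Count(), BestRSquared: g.Max(r => r.RSquared)))
            .OrderByDescending(s => s.BestRSquared)
            .ToArray();
    }

    public static void Main()
    {
        foreach (var s in Summarize())
        {
            Console.WriteLine($"{s.Trainer,-35} runs: {s.Count}, best R2: {s.BestRSquared:F4}");
        }
    }
}
```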

 

2. Getting the best model

            RunDetail<RegressionMetrics> best = experimentResult.BestRun;
            ITransformer trainedModel = best.Model;

Once we have the best model, evaluating and saving it work the same way as in the earlier code. Evaluating it on the test data gives:

*************************************************
*       Metrics for FastTreeTweedieRegression regression model
*------------------------------------------------
*       LossFn:        0.67
*       R2 Score:      0.34
*       Absolute loss: .63
*       Squared loss:  .67
*       RMS loss:      .82
*************************************************

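Judging by these results, the model fits poorly: on the test data the R² score is only 0.34 and the RMS loss is 0.82 on a quality scale of 0 to 10. A result like this cannot be used in production. The likely cause is that we have not found the key features that determine wine quality.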
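For readers who want to check the figures in the report above, here is a minimal sketch of the usual textbook definitions of these regression metrics, which I assume match what ML.NET reports. The `RegressionMetricsByHand` class and the sample numbers are illustrative only. Note that the RMS loss is the square root of the squared loss, which agrees with the report: √0.67 ≈ 0.82.

```csharp
using System;
using System.Linq;

public static class RegressionMetricsByHand
{
    // Computes mean absolute error, mean squared error, root mean squared error
    // and R-squared (1 - SS_res / SS_tot) for paired labels and predictions.
    // Note: R-squared is undefined when all labels are identical (SS_tot = 0).
    public static (double Mae, double Mse, double Rmse, double RSquared) Compute(
        double[] labels, double[] predictions)
    {
        if (labels.Length != predictions.Length || labels.Length == 0)
        {
            throw new ArgumentException("labels and predictions must be non-empty and of equal length");
        }

        double[] errors = labels.Zip(predictions, (y, p) => y - p).ToArray();

        double mae = errors.Average(e => Math.Abs(e));
        double mse = errors.Average(e => e * e);
        double rmse = Math.Sqrt(mse);

        double mean = labels.Average();
        double ssTot = labels.Sum(y => (y - mean) * (y - mean)); // total variation of the labels
        double ssRes = errors.Sum(e => e * e);                    // variation left unexplained
        double rSquared = 1.0 - ssRes / ssTot;

        return (mae, mse, rmse, rSquared);
    }

    public static void Main()
    {
        // Made-up quality scores and predictions, just to exercise the formulas.
        var m = Compute(new double[] { 3, 5, 7 }, new double[] { 4, 5, 6 });
        Console.WriteLine($"MAE={m.Mae:F4} MSE={m.Mse:F4} RMSE={m.Rmse:F4} R2={m.RSquared:F4}");
    }
}
```

An R² of 1 means the predictions match the labels exactly, and an R² of 0 means the model does no better than always predicting the mean label.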

 

V. Summary

This article concludes the ML.NET Study Notes series. The source code used along the way comes mainly from: https://github.com/dotnet/machinelearning-samples .

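That repository also contains examples of other algorithms, including clustering, matrix factorization, and anomaly detection. Their overall workflow is much the same, so with this series as a foundation, interested readers can explore them on their own.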

  

VI. Resources

Source code download: https://github.com/seabluescn/Study_ML.NET

Regression project name: Regression_WineQuality

AutoML project name: Regression_WineQuality_AutoML


 
