Spark - ML Tuning
Official documentation: https://spark.apache.org/docs/2.2.0/ml-tuning.html
This section describes how to use MLlib's tooling to tune ML algorithms and Pipelines. Built-in cross-validation and other tools allow users to optimize the hyperparameters of a model or of an entire Pipeline.
Contents:
- Model selection (i.e., hyperparameter tuning)
- Cross-validation
- Train-validation split
Model Selection (Hyperparameter Tuning)
An important task in machine learning is model selection: using data to find the best model or parameters for a given task, also known as tuning. Tuning may be done for an individual Estimator, or for an entire Pipeline at once; users can tune a whole Pipeline in one go rather than tuning each element of the Pipeline separately.
MLlib supports model selection tools such as CrossValidator and TrainValidationSplit. These tools require the following items:
- Estimator: the algorithm or Pipeline to tune;
- Set of ParamMaps: the parameter grid to search over (see the short sketch after this list);
- Evaluator: a metric measuring how well a fitted Model performs on held-out test data;
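As a minimal sketch of the second item (the estimator and values below are illustrative, not taken from the text), a ParamGridBuilder expands per-parameter value lists into the list of ParamMaps that these tools search over:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

lr = LogisticRegression(maxIter=10)
# build() returns a list of ParamMaps, one per combination: here 2 x 2 = 4.
grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.elasticNetParam, [0.0, 1.0]) \
    .build()
print(len(grid))  # 4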
At a high level, these tools work as follows:
- They split the input data into separate training and test datasets;
- For each (training, test) pair, they iterate over the set of ParamMaps:
  - For each ParamMap, they fit the Estimator with those parameters, obtain the fitted Model, and evaluate the Model's performance using the Evaluator;
- They select the Model produced by the best-performing set of parameters;
The Evaluator can be a RegressionEvaluator for regression problems, a BinaryClassificationEvaluator for binary data, or a MulticlassClassificationEvaluator for multiclass problems. The default metric used to choose the best ParamMap can be overridden via setMetricName in each of these evaluators.
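For example, a short sketch of overriding the metric on a BinaryClassificationEvaluator (whose default metric is areaUnderROC):
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print(evaluator.getMetricName())   # areaUnderROC (the default)
# Use area under the precision-recall curve instead.
evaluator.setMetricName("areaUnderPR")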
Cross-Validation
CrossValidator begins by splitting the dataset into a set of folds, which are used as separate training and test datasets. With k=3 folds, for example, CrossValidator generates 3 (training, test) dataset pairs from the fold combinations, each pair using 2 folds for training and the remaining fold for testing. To evaluate a particular parameter combination, CrossValidator computes the average performance of the 3 models produced by fitting on the 3 different (training, test) pairs.
After identifying the best parameter combination, CrossValidator finally re-fits the Estimator on the entire dataset using those best parameters.
Example: model selection via cross-validation.
Note that cross-validation over a grid of parameters is expensive. In the example below, the parameter grid gives numFeatures 3 possible values and regParam 2 possible values, and CrossValidator uses 2 folds, so (3×2)×2 = 12 different models are trained. In realistic settings it is common to try many more parameters, more values per parameter, and more folds. In other words, CrossValidator is inherently expensive; however, compared with hand-tuning, it remains a more statistically sound and automated way to choose parameters.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

# Obtain a SparkSession (`spark` is predefined in the pyspark shell;
# this makes the snippet runnable as a standalone script).
spark = SparkSession.builder.appName("CrossValidatorExample").getOrCreate()
# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])
# Make predictions on test documents. cvModel uses the best model found.
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)
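As a follow-up sketch (an addition, not part of the original example), the fitted CrossValidatorModel also exposes the average metric for each ParamMap, along with the best PipelineModel re-fitted on the full training data:
# cvModel.avgMetrics[i] is the average evaluation metric for paramGrid[i].
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, "->", metric)
# cvModel.bestModel is the PipelineModel re-fitted with the best ParamMap.
print(cvModel.bestModel.stages)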
Train-Validation Split
For hyperparameter tuning, Spark also supports TrainValidationSplit, which evaluates each parameter combination only once, as opposed to the k times of CrossValidator. It is therefore less expensive, but it will not produce as reliable results when the training dataset is not sufficiently large.
Unlike CrossValidator, TrainValidationSplit creates a single (training, test) pair, splitting the dataset into two parts according to trainRatio. With trainRatio=0.75, for example, 75% of the data is used for training and 25% for validation.
Like CrossValidator, TrainValidationSplit ultimately fits a final Estimator on the entire dataset using the best parameter combination.
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.sql import SparkSession

# Obtain a SparkSession (`spark` is predefined in the pyspark shell).
spark = SparkSession.builder.appName("TrainValidationSplitExample").getOrCreate()
# Prepare training and test data.
data = spark.read.format("libsvm") \
    .load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)
lr = LinearRegression(maxIter=10)
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()
# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)
# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)
# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test) \
    .select("features", "label", "prediction") \
    .show()
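Similarly, a short sketch (an addition to the original example) of inspecting the fitted TrainValidationSplitModel: validationMetrics holds one score per ParamMap (RMSE by default for RegressionEvaluator, so lower is better), and bestModel is the LinearRegressionModel re-fitted on all of the training data:
# model.validationMetrics[i] is the validation score for paramGrid[i].
for params, metric in zip(paramGrid, model.validationMetrics):
    print({p.name: v for p, v in params.items()}, "->", metric)
print(model.bestModel.coefficients, model.bestModel.intercept)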