Notes on "Machine Learning with Spark": Spark regression models (least-squares regression and decision tree regression; model performance evaluation, target variable transformation, parameter tuning)
Dataset description:
Dataset download address:
http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip
=========================================
Both hour.csv and day.csv contain the attributes below; the only difference is that day.csv lacks the hr attribute.
- instant: record ID
- dteday: date
- season: season (1: spring, 2: summer, 3: autumn, 4: winter)
- yr: year (0: 2011, 1: 2012)
- mnth: month (1 to 12)
- hr: hour of the day (0 to 23)
- holiday: whether the day is a holiday (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday: day of the week
- workingday: 1 if the day is a working day, 0 otherwise
- weathersit: weather situation
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp: normalized temperature in Celsius; the values are divided by 41 (max)
- atemp: normalized feeling temperature in Celsius; the values are divided by 50 (max)
- hum: normalized humidity; the values are divided by 100 (max)
- windspeed: normalized wind speed; the values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: the target variable; count of total bike rentals per hour, including both casual and registered users
=========================================
Remove the header line with sed 1d hour.csv > hour_noheader.csv.
Start jupyter notebook to enter the development environment.
Code:
from pyspark import SparkContext  # import Spark's Python API
sc = SparkContext("local", "bike")  # run locally
path = "file:///home/chenjie/bike/hour_noheader.csv"  # the dataset, already stripped of its header line
raw_data = sc.textFile(path)  # load the data
records = raw_data.map(lambda x: x.split(","))  # split each line on "," to obtain the records
num_data = raw_data.count()  # total number of records
first = records.first()  # the first record
print first  # print the first record
print num_data  # print the total number of records
def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()
# The mapping function above first collects the distinct values of column idx, then
# uses zipWithIndex to assign each value a unique index, producing a dict whose
# keys are the categorical values and whose values are their indices.
print records.map(lambda fields: fields[2]).distinct().collect()
# Output: [u'1', u'3', u'2', u'4']
# i.e. the third column (season) takes only the four values 1, 3, 2, 4
print "Mapping of first categorical feature column: %s" % get_mapping(records, 2)
# Mapping of first categorical feature column: {u'1': 0, u'3': 1, u'2': 2, u'4': 3}
# 1, 3, 2, 4 are mapped to 0, 1, 2, 3
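# A quick sanity check of the 1-of-k encoding this mapping implies (an illustrative
# snippet, not part of the original walkthrough): with the mapping above, the season
# value u'3' gets index 1, i.e. the binary sub-vector [0, 1, 0, 0].
season_mapping = get_mapping(records, 2)
encoded = [0] * len(season_mapping)
encoded[season_mapping[u'3']] = 1
print encoded  # expected: [0, 1, 0, 0]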
mappings = [get_mapping(records, i) for i in range(2, 10)]  # build a mapping for each categorical column (columns 2 to 9)
cat_len = sum(map(len, mappings))  # length of the categorical part of the feature vector
num_len = len(records.first()[10:14])  # length of the numerical part; [10:14] matches the columns used below (the original [11:15] happens to give the same length)
total_len = num_len + cat_len  # total feature vector length
print "Feature vector length for categorical features : %d" % cat_len
print "Feature vector length for numerical features : %d" % num_len
print "Total feature vector length : %d" % total_len
# 1. Build feature vectors for the linear model
from pyspark.mllib.regression import LabeledPoint
import numpy as np
# The function below builds the feature vector for the linear model
def extract_features(record):
    cat_vec = np.zeros(cat_len)  # start from a zero vector of length cat_len
    i = 0
    step = 0
    for field in record[2:10]:  # all eight mapped columns; record[2:9] would silently skip the last one (weathersit)
        m = mappings[i]  # the value-to-index mapping for this column
        idx = m[field]  # index of this value within the column's mapping
        cat_vec[idx + step] = 1  # set the corresponding position of the binary vector to 1
        i = i + 1
        step = step + len(m)  # advance the offset past this column's block
    num_vec = np.array([float(field) for field in record[10:14]])  # the numerical features as a real-valued vector
    return np.concatenate((cat_vec, num_vec))  # concatenate the categorical and numerical parts
def extract_label(record):
    return float(record[-1])  # the label is the last column, cnt
data = records.map(lambda r : LabeledPoint(extract_label(r), extract_features(r)))
first_point = data.first()
print "Raw data :" + str(first[2:])
print "Label:" + str(first_point.label)
print "Linear Model feature vector:\n" + str(first_point.features)
print "Linear Model feature vector length :" + str(len(first_point.features))
# The code below builds feature vectors for the decision tree
def extract_features_dt(record):
    return np.array(map(float, record[2:14]))  # trees use the raw values directly; no binary encoding needed
data_dt = records.map(lambda r : LabeledPoint(extract_label(r), extract_features_dt(r)))
first_point_dt = data_dt.first()
print "Decision Tree feature vector: " + str(first_point_dt.features)
print "Decision Tree feature vector length :" + str(len(first_point_dt.features))
# Train a model with linear regression
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.tree import DecisionTree
linear_model = LinearRegressionWithSGD.train(data, iterations = 10, step = 0.1, intercept = False)
true_vs_predicted = data.map(lambda p : (p.label, linear_model.predict(p.features)))
print "Linear Model predictions " + str(true_vs_predicted.take(5))
# Train a model with a decision tree
dt_model = DecisionTree.trainRegressor(data_dt, {})
preds = dt_model.predict(data_dt.map(lambda p : p.features))
actual = data.map(lambda p : p.label)
true_vs_predicted_dt = actual.zip(preds)
print "Decision Tree predictions :" + str(true_vs_predicted_dt.take(5))
print "Decision Tree depth :" + str(dt_model.depth())
print "Decision Tree number of nodes :" + str(dt_model.numNodes())
# Squared error (per sample; averaging below gives the mean squared error)
def squared_error(actual, pred):
    return (pred - actual) ** 2
# Absolute error (per sample; averaging gives the mean absolute error)
def abs_error(actual, pred):
    return np.abs(pred - actual)
# Squared log error (per sample; the root of its mean gives the RMSLE)
def squared_log_error(actual, pred):
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2
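# For reference, the aggregate metrics computed below are (y = actual, p = predicted,
# n = number of samples):
#   MSE   = (1/n) * sum_i (p_i - y_i)^2
#   MAE   = (1/n) * sum_i |p_i - y_i|
#   RMSLE = sqrt((1/n) * sum_i (log(p_i + 1) - log(y_i + 1))^2)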
# Compute the three metrics for the linear model
mse = true_vs_predicted.map(lambda (t, p): squared_error(t, p)).mean()
print "Linear Model - Mean Squared Error : %2.4f" % mse
mae = true_vs_predicted.map(lambda (t, p): abs_error(t, p)).mean()
print "Linear Model - Mean Absolute Error: %2.4f" % mae
rmsle = np.sqrt(true_vs_predicted.map(lambda (t, p): squared_log_error(t, p)).mean())
print "Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle
# Compute the three metrics for the decision tree
mse_dt = true_vs_predicted_dt.map(lambda (t, p) : squared_error(t, p)).mean()
print "Decision Tree - Mean Squared Error : %2.4f" % mse_dt
mae_dt = true_vs_predicted_dt.map(lambda (t, p) : abs_error(t, p)).mean()
print "Decision Tree - Mean Absolute Error: %2.4f" % mae_dt
rmsle_dt = np.sqrt(true_vs_predicted_dt.map(lambda (t, p): squared_log_error(t, p)).mean())
print "Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle_dt
# Examine the distribution of the target variable
targets = records.map(lambda r : float(r[-1])).collect()
from matplotlib import pyplot as plt
plt.hist(targets, bins=40, color='green', normed=True)
fig = plt.gcf()
fig.set_size_inches(16,10)
plt.show()
# Transform the target variable by taking the log; the raw counts are right-skewed, which hurts models that assume a roughly normal target
log_targets = records.map(lambda r : np.log(float(r[-1]))).collect()
plt.hist(log_targets, bins=40, color='green', normed=True)
fig = plt.gcf()
fig.set_size_inches(16,10)
plt.show()
# Square-root transform
sqrt_targets = records.map(lambda r : np.sqrt(float(r[-1]))).collect()
plt.hist(sqrt_targets, bins=40, color='green', normed=True)
fig = plt.gcf()
fig.set_size_inches(16,10)
plt.show()
data_log = data.map(lambda lp : LabeledPoint(np.log(lp.label), lp.features))
model_log = LinearRegressionWithSGD.train(data_log, iterations=10, step=0.1)
true_vs_predicted_log = data_log.map(lambda p : (np.exp(p.label), np.exp(model_log.predict(p.features))))
mse_log = true_vs_predicted_log.map(lambda (t, p): squared_error(t, p)).mean()
print "Linear Model - Mean Squared Error : %2.4f" % mse_log
mae_log = true_vs_predicted_log.map(lambda (t, p): abs_error(t, p)).mean()
print "Linear Model - Mean Absolute Error: %2.4f" % mae_log
rmsle_log = np.sqrt(true_vs_predicted_log.map(lambda (t, p): squared_log_error(t, p)).mean())  # note: computed on the log-model predictions, not the original ones
print "Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle_log
print "未對原資料進行對數操作時:\n" + str(true_vs_predicted.take(3))
print "對原資料進行對數操作時:\n" + str(true_vs_predicted_log.take(3))
data_dt_log = data_dt.map(lambda lp : LabeledPoint(np.log(lp.label), lp.features))
dt_model_log = DecisionTree.trainRegressor(data_dt_log, {})
preds_log = dt_model_log.predict(data_dt_log.map(lambda p : p.features))
actual_log = data_dt_log.map(lambda p: p.label)
true_vs_predicted_dt_log = actual_log.zip(preds_log).map(lambda (t, p) : (np.exp(t), np.exp(p)))
mse_log_dt = true_vs_predicted_dt_log.map(lambda (t, p): squared_error(t, p)).mean()
print "Decision Tree - Mean Squared Error : %2.4f" % mse_log_dt
mae_log_dt = true_vs_predicted_dt_log.map(lambda (t, p): abs_error(t, p)).mean()
print "Decision Tree - Mean Absolute Error: %2.4f" % mae_log_dt
rmsle_log_dt = np.sqrt(true_vs_predicted_dt_log.map(lambda (t, p): squared_log_error(t, p)).mean())
print "Decision Tree - Root Mean Squared Log Error: %2.4f" % rmsle_log_dt
print "未對原資料進行對數操作時:\n" + str(true_vs_predicted_dt.take(3))
print "對原資料進行對數操作時:\n" + str(true_vs_predicted_dt_log.take(3))
# Create an 80/20 train/test split: key each point by its index, then sample 20% of the keys for the test set
data_with_idx = data.zipWithIndex().map(lambda (k, v) : (v, k))
test = data_with_idx.sample(False, 0.2, 42)
train = data_with_idx.subtractByKey(test)
train_data = train.map(lambda (idx, p) : p)
test_data = test.map(lambda (idx, p) : p)
train_size = train_data.count()
test_size = test_data.count()
print "訓練集大小:%d" % train_size
print "測試集大小:%d" % test_size
print "總共大小:%d" % num_data
print "訓練集大小 + 測試集大小:%d" % (train_size + test_size)
# The same split, applied to the decision tree dataset
data_with_idx_dt = data_dt.zipWithIndex().map(lambda (k, v) : (v, k))
test_dt = data_with_idx_dt.sample(False, 0.2, 42)
train_dt = data_with_idx_dt.subtractByKey(test_dt)
train_data_dt = train_dt.map(lambda (idx, p) : p)
test_data_dt = test_dt.map(lambda (idx, p) : p)
train_size_dt = train_data_dt.count()
test_size_dt = test_data_dt.count()
print "訓練集大小:%d" % train_size_dt
print "測試集大小:%d" % test_size_dt
print "總共大小:%d" % num_data
print "訓練集大小 + 測試集大小:%d" % (train_size_dt + test_size_dt)
# Train a linear model with the given hyperparameters and return its RMSLE on the test set
def evaluate(train, test, iterations, step, regParam, regType, intercept):
    model = LinearRegressionWithSGD.train(train, iterations, step, regParam=regParam, regType=regType, intercept=intercept)
    tp = test.map(lambda p: (p.label, model.predict(p.features)))
    rmsle = np.sqrt(tp.map(lambda (t, p): squared_log_error(t, p)).mean())
    return rmsle
# Effect of the number of iterations on performance
params = [1, 5, 10, 20, 50, 100]
metrics = [evaluate(train_data, test_data, param, 0.01, 0.0, 'l2', False) for param in params]
# note: the 'l' in 'l2' is the lowercase letter L, not the digit 1
print params
print metrics
plt.plot(params, metrics)
fig = plt.gcf()
plt.xlabel("iterations")
plt.ylabel("RMSLE")
plt.show()
# Effect of the step size on performance
params = [0.01, 0.025, 0.05, 0.1, 1.0]
metrics = [evaluate(train_data, test_data, 10, param, 0.0, 'l2', False) for param in params]
# note: the 'l' in 'l2' is the lowercase letter L, not the digit 1
print params
print metrics
plt.plot(params, metrics)
fig = plt.gcf()
plt.xlabel("step")
plt.ylabel("RMSLE")
plt.show()
# Effect of the L2 regularization parameter on performance
params = [0.0, 0.01, 0.1, 1.0, 5.0, 10.0, 20.0]  # the original listed 0.1 twice; the first entry should be 0.0
metrics = [evaluate(train_data, test_data, 10, 0.1, param, 'l2', False) for param in params]
# note: the 'l' in 'l2' is the lowercase letter L, not the digit 1
print params
print metrics
plt.plot(params, metrics)
fig = plt.gcf()
plt.xlabel("regParam")
plt.xscale('log')
plt.ylabel("RMSLE")
plt.show()
# Effect of the L1 regularization parameter on performance
params = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]  # again, the leading duplicate 0.1 should be 0.0
metrics = [evaluate(train_data, test_data, 10, 0.1, param, 'l1', False) for param in params]
# note: the 'l' in 'l1' is the lowercase letter L, not the digit 1
print params
print metrics
plt.plot(params, metrics)
fig = plt.gcf()
plt.xlabel("regParam")
plt.xscale('log')
plt.ylabel("RMSLE")
plt.title("when regType change to l1 from l2")
plt.show()
model_l1 = LinearRegressionWithSGD.train(train_data, 10, 0.1, regParam=1.0, regType='l1', intercept=False)
model_l1_10 = LinearRegressionWithSGD.train(train_data, 10, 0.1, regParam=10.0, regType='l1', intercept=False)
model_l1_100 = LinearRegressionWithSGD.train(train_data, 10, 0.1, regParam=100.0, regType='l1', intercept=False)
print "L1(1.0)權重向量中0個書目:" + str(sum(model_l1.weights.array == 0))
print "L1(10.0)權重向量中0個書目:" + str(sum(model_l1_10.weights.array == 0))
print "L1(100.0)權重向量中0個書目:" + str(sum(model_l1_100.weights.array == 0))
# Effect of fitting an intercept on performance
params = [False, True]
metrics = [evaluate(train_data, test_data, 10, 0.1, 1.0, 'l2', param) for param in params]
# note: the 'l' in 'l2' is the lowercase letter L, not the digit 1
print params
print metrics
plt.bar(params, metrics)
fig = plt.gcf()
plt.xlabel("intercept")
plt.ylabel("RMSLE")
plt.show()
# Train a decision tree with the given maxDepth/maxBins and return its RMSLE on the test set
def evaluate_dt(train, test, maxDepth, maxBins):
    model = DecisionTree.trainRegressor(train, {}, impurity='variance', maxDepth=maxDepth, maxBins=maxBins)
    # print model
    preds = model.predict(test.map(lambda p: p.features))
    actual = test.map(lambda p: p.label)
    tp = actual.zip(preds)
    rmsle = np.sqrt(tp.map(lambda (t, p): squared_log_error(t, p)).mean())
    return rmsle
# Effect of maxDepth on performance
params = [1, 2, 3, 4, 5, 10, 20]
metrics = [evaluate_dt(train_data_dt, test_data_dt, param, 32) for param in params]
print params
print metrics
plt.plot(params, metrics)
fig = plt.gcf()
plt.title("maxDepth's influence on DecisionTree")
plt.xlabel("maxDepth")
plt.ylabel("RMSLE")
plt.show()
# Effect of maxBins on performance
params = [2, 4, 8, 16, 32, 64, 100]
metrics = [evaluate_dt(train_data_dt, test_data_dt, 5, param) for param in params]
print params
print metrics
plt.plot(params, metrics)
fig = plt.gcf()
plt.title("maxBins's influence on DecisionTree")
plt.xlabel("maxBins")
plt.ylabel("RMSLE")
plt.show()