Kaggle Bike Rental Prediction Competition Project

Posted by 大樹2 on 2018-02-05

Author: 大樹

Updated: 01.20

email: 59888745@qq.com

Data processing, machine learning

 


 

 

In [ ]:
There are many interesting projects on Kaggle that are worth trying when you have time. One of them is about predicting Hong Kong horse races: if you do well and your predictions are accurate, you can easily get money. I remember a Hong Kong newspaper reporting that a university professor won HK$50 million on horse racing through statistical modeling. With machine learning and deep learning, betting accuracy can surely be improved further. Good luck getting rich, and keep studying!


The Kaggle bike rental prediction competition is a continuous-value prediction problem,
i.e. what we call a regression problem in machine learning. Let's take a look at it.

The data comes from a city bike-sharing system: two years of hourly bike rental records
from Washington, D.C. The training set consists of the first 19 days of each month, and
the test set consists of the days from the 20th onward (which we need to predict ourselves).

Kaggle bike rental prediction competition: https://www.kaggle.com/c/bike-sharing-demand
        
1. Load the data
2. Analyze the data
3. Extract the feature columns
4. Prepare the training and test data
5. Model selection: first run a baseline model with a suitable algorithm, then carry out the follow-up analysis steps and improve it step by step
6. Parameter tuning: use Grid Search to find the best parameters
7. Use the model to predict and score
In [61]:
# load the data, review the fields and data types
import pandas as pd

df_train = pd.read_csv('kaggle_bike_competition_train.csv',header=0)
df_train.head(5)
df_train.dtypes
Out[61]:
datetime       object
season          int64
holiday         int64
workingday      int64
weather         int64
temp          float64
atemp         float64
humidity        int64
windspeed     float64
casual          int64
registered      int64
count           int64
dtype: object
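
Note that datetime arrives as a plain object (string) column. If preferred, pandas can parse it directly at load time; a minimal sketch, assuming the same CSV file as above:
In [ ]:
# parse the datetime column while reading, so it arrives as datetime64[ns]
df_train = pd.read_csv('kaggle_bike_competition_train.csv',
                       header=0, parse_dates=['datetime'])
print(df_train['datetime'].dtype)  # datetime64[ns]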
In [10]:
# look at the number of rows and columns
df_train.shape
Out[10]:
(10886, 12)
In [8]:
# check whether any column has missing values; none found
df_train.count()
Out[8]:
datetime      10886
season        10886
holiday       10886
workingday    10886
weather       10886
temp          10886
atemp         10886
humidity      10886
windspeed     10886
casual        10886
registered    10886
count         10886
dtype: int64
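
count() works here, but a more explicit missing-value check is isnull().sum(), which reports the number of NaNs per column directly; a quick sketch:
In [ ]:
# count missing values per column; all zeros confirms there is no missing data
df_train.isnull().sum()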
In [34]:
# process the time field: it carries a lot of information, since after all everything changes over time
df_train.head()
df_train['hour']=pd.DatetimeIndex(df_train.datetime).hour
df_train['day']=pd.DatetimeIndex(df_train.datetime).dayofweek
df_train['month']=pd.DatetimeIndex(df_train.datetime).month

# alternative method: pd.to_datetime plus apply
# df_train['dt']=pd.to_datetime(df_train['datetime'])
# df_train['day_of_week']=df_train['dt'].apply(lambda x:x.dayofweek)
# df_train['day_of_month']=df_train['dt'].apply(lambda x:x.day)

df_train.head()
Out[34]:
 
   datetime             season  holiday  workingday  weather  temp  atemp   humidity  windspeed  casual  registered  count  hour  day  month
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395        81        0.0       3          13     16     0    5      1
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635        80        0.0       8          32     40     1    5      1
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635        80        0.0       5          27     32     2    5      1
3  2011-01-01 03:00:00       1        0           0        1  9.84  14.395        75        0.0       3          10     13     3    5      1
4  2011-01-01 04:00:00       1        0           0        1  9.84  14.395        75        0.0       0           1      1     4    5      1
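
The commented-out alternative above goes through apply(); with a parsed datetime Series, the vectorized .dt accessor is the more idiomatic (and faster) route. A minimal sketch of the same feature extraction:
In [ ]:
# vectorized datetime feature extraction via the .dt accessor
dt = pd.to_datetime(df_train['datetime'])
df_train['hour'] = dt.dt.hour
df_train['day'] = dt.dt.dayofweek
df_train['month'] = dt.dt.month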
In [42]:
# select the relevant feature columns
# (alternative: df_train.drop(['datetime','casual','registered'], axis=1, inplace=True);
#  note that with inplace=True the drop returns None, so don't assign the result)
df_train = df_train[['season','holiday','workingday','weather','temp','atemp',
               'humidity','windspeed','count','month','day','hour']]
df_train.head(5)
Out[42]:
 
   season  holiday  workingday  weather  temp  atemp   humidity  windspeed  count  month  day  hour
0       1        0           0        1  9.84  14.395        81        0.0     16      1    5     0
1       1        0           0        1  9.02  13.635        80        0.0     40      1    5     1
2       1        0           0        1  9.02  13.635        80        0.0     32      1    5     2
3       1        0           0        1  9.84  14.395        75        0.0     13      1    5     3
4       1        0           0        1  9.84  14.395        75        0.0      1      1    5     4
In [43]:
df_train.shape
Out[43]:
(10886, 12)
In [ ]:
Prepare the training and test data:
1. df_train_target: the target, i.e. the count column
2. df_train_data: the data used to build the features
In [51]:
df_train_target = df_train['count'].values 
print(df_train_target.shape) 
df_train_data = df_train.drop(['count'],axis =1).values
print(df_train_data.shape) 
 
(10886,)
(10886, 11)
In [ ]:
Algorithms
We will again use cross-validation (the validation set is about 20% of all the data) to evaluate the models.
We will try Support Vector Regression, Ridge Regression, and
Random Forest Regressor; each model runs 3 splits and we look at the averaged results.
In [63]:
# NOTE: cross_validation, learning_curve and grid_search are the pre-0.18/0.20
# sklearn module paths; in modern sklearn these all live in sklearn.model_selection
# (a modern equivalent is sketched after the results below)
from sklearn import linear_model
from sklearn import cross_validation
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.learning_curve import learning_curve
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import explained_variance_score

# split the data (into training and validation folds)
cv = cross_validation.ShuffleSplit(len(df_train_data), n_iter=3, test_size=0.2,
    random_state=0)

# run each model through the same splits

print("Ridge Regression")    
for train, test in cv:    
    svc = linear_model.Ridge().fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(df_train_data[train], df_train_target[train]), svc.score(df_train_data[test], df_train_target[test])))
    
print("Support Vector Regression / SVR(kernel='rbf', C=10, gamma=.001)")
for train, test in cv:
    
    svc = svm.SVR(kernel ='rbf', C = 10, gamma = .001).fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(df_train_data[train], df_train_target[train]), svc.score(df_train_data[test], df_train_target[test])))
    
print("Random Forest Regressor (n_estimators=100)")    
for train, test in cv:    
    svc = RandomForestRegressor(n_estimators = 100).fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(df_train_data[train], df_train_target[train]), svc.score(df_train_data[test], df_train_target[test])))
 
Ridge Regression
train score: 0.339, test score: 0.332

train score: 0.330, test score: 0.370

train score: 0.342, test score: 0.320

Support Vector Regression / SVR(kernel='rbf', C=10, gamma=.001)
train score: 0.417, test score: 0.408

train score: 0.406, test score: 0.452

train score: 0.419, test score: 0.390

Random Forest Regressor (n_estimators=100)
train score: 0.981, test score: 0.867

train score: 0.981, test score: 0.880

train score: 0.981, test score: 0.869
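
As noted above, these imports use the pre-0.20 sklearn module layout. Under sklearn >= 0.20 the same comparison would look like the sketch below: ShuffleSplit moved to sklearn.model_selection, takes n_splits instead of n_iter, and yields index pairs from split():
In [ ]:
from sklearn.model_selection import ShuffleSplit
from sklearn.ensemble import RandomForestRegressor

# 3 random 80/20 splits, the same protocol as above
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

for train, test in cv.split(df_train_data):
    model = RandomForestRegressor(n_estimators=100).fit(
        df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}".format(
        model.score(df_train_data[train], df_train_target[train]),
        model.score(df_train_data[test], df_train_target[test])))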

In [ ]:
Random Forest Regression obtained the best results.
However, whether its parameters are set optimally is something we can test with Grid Search, which finds the best parameters for us.
In [67]:
X = df_train_data
y = df_train_target

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)

tuned_parameters = [{'n_estimators':[10,100,500,550]}]   
    
scores = ['r2']

for score in scores:
    
    print(score)
    
    clf = GridSearchCV(RandomForestRegressor(), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)

    print("Best parameters found:")
    print("")
    #best_estimator_ returns the best estimator chosen by the search
    print(clf.best_estimator_)
    print("")
    print("Scores for each parameter setting:")
    print("")
    # grid_scores_ returns, for each candidate:
    #    * a dict of parameter settings
    #    * the mean score over the cross-validation folds 
    #    * the list of scores for each fold
    for params, mean_score, scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() / 2, params))
    print("")
 
r2
Best parameters found:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=550, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Scores for each parameter setting:

0.846 (+/-0.006) for {'n_estimators': 10}
0.862 (+/-0.005) for {'n_estimators': 100}
0.863 (+/-0.005) for {'n_estimators': 500}
0.864 (+/-0.005) for {'n_estimators': 550}
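
Note that grid_scores_ was removed in sklearn 0.20; under the modern API the same information lives in cv_results_. A sketch of the equivalent search and reporting loop:
In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

clf = GridSearchCV(RandomForestRegressor(),
                   [{'n_estimators': [10, 100, 500, 550]}],
                   cv=5, scoring='r2')
clf.fit(X_train, y_train)

# cv_results_ is a dict of parallel arrays, one entry per candidate
for mean, std, params in zip(clf.cv_results_['mean_test_score'],
                             clf.cv_results_['std_test_score'],
                             clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std / 2, params))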

In [ ]:
Grid Search makes parameter tuning quite convenient. We should also check whether the model is overfitting or underfitting;
we found that n_estimators = 500 or 550 fits best.
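
To actually check for overfitting, and to finish step 7 (scoring the model), we can reuse the held-out X_test/y_test split from the grid-search cell together with the explained_variance_score import from earlier. A minimal sketch, refitting the best configuration found above:
In [ ]:
# refit the winning configuration on the training split
best = RandomForestRegressor(n_estimators=550).fit(X_train, y_train)
y_pred = best.predict(X_test)

# a large gap between these two R^2 scores would signal overfitting
print("train R^2:", best.score(X_train, y_train))
print("test  R^2:", best.score(X_test, y_test))
print("explained variance:", explained_variance_score(y_test, y_pred))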