Machine Learning Workflow, Feature Engineering, and Model Design: A Worked Example

Posted by 大樹2 on 2018-01-18

Author: 大樹

Updated: 01.14

Email: 59888745@qq.com

Topics: data processing, machine learning

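Case study: the Alibaba Tianchi 大航杯 "智造揚中" Power AI Competition

In this post I work through the 大航杯 "智造揚中" Power AI Competition step by step, following the workflow commonly used in industry:

  1. Business scenario definition: define the core objective and describe the key scenarios.
  2. Business rule analysis: extract the business rules and analyze how they interact.
  3. Quantitative data analysis: multi-dimensional analysis and handling of anomalous data.
  4. Model design: tailor the model to the application scenario and tune its parameters.
  5. Computation and result analysis: produce the model output and validate it against the business.

For an introduction to the competition, see: https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100066.333.2.mnbu1L&raceId=231602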

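1. Business scenario definition
a. For an introduction to the competition, see the URL above.
b. The business requirement is to analyze roughly two years of historical electricity consumption by enterprises in the high-tech zone of Yangzhong, Zhenjiang, Jiangsu,
and from that history predict the consumption for each day of the following month (October in this case).
c. The key scenarios related to consumption follow from the setting, a high-tech industrial development zone staffed by office workers: working days, weekends, public holidays, summer versus winter, and so on.
2. Business rule analysis
a. This is a typical regression problem, very similar to traffic forecasting,
so let us see how to make the prediction in a data-driven way.
3. Quantitative data analysis
3.1 Loading the data and a first look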

 

 

In [2]:
import numpy as np
import pandas as pd

_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
_df.head()
Out[2]:
 
 record_date user_id power_consumption
0 2015/1/1 1 1135
1 2015/1/2 1 570
2 2015/1/3 1 3418
3 2015/1/4 1 3968
4 2015/1/5 1 3986
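The file holds one row per user per day, while the goal is one forecast per day for the whole zone, so at some point the rows have to be summed by date (the full listing at the end does this with groupby). A minimal sketch on made-up rows, assuming only the three column names shown above; the values are illustrative and not taken from the competition file:

```python
import pandas as pd

# Toy rows with the same columns as zhenjiang_power.csv (values are made up).
toy = pd.DataFrame({
    "record_date": ["2015/1/1", "2015/1/1", "2015/1/2", "2015/1/2"],
    "user_id": [1, 2, 1, 2],
    "power_consumption": [100, 50, 70, 30],
})
toy["record_date"] = pd.to_datetime(toy["record_date"])

# One row per day: sum the consumption of all users on that date.
daily = toy.groupby("record_date", as_index=False)["power_consumption"].sum()
print(daily)
```

Whether to model daily totals or individual users is a design choice; daily totals give a much smaller table and match the submission format of one value per date.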
In [ ]:
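3.2 Data cleaning: handling anomalies, missing values, nulls, duplicates, date formats, and so on
Ways to handle NA values; which one fits depends on the business case:
dropna(), e.g. dropna(axis=0, how='all') or dropna(thresh=3)  # thresh = minimum number of non-NA values a row needs to be kept
fillna(0), or fill with the mean, e.g. fillna(d.mean())
isnull(),
notnull(),
drop_duplicates() for duplicate rows, e.g. _df.drop_duplicates(['user_id','record_date'])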
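To make the effect of these calls concrete, here is a small sketch on a made-up frame (the rows are invented for illustration; only the column names come from the data set). Note that each call returns a new frame, so the result has to be assigned:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, np.nan],
    "record_date": ["2015/1/1", "2015/1/1", "2015/1/2", "2015/1/1", None],
    "power_consumption": [100.0, 100.0, np.nan, 50.0, np.nan],
})

# Drop rows in which every column is missing (the last row here).
cleaned = toy.dropna(axis=0, how="all")

# Keep one row per (user_id, record_date) pair.
cleaned = cleaned.drop_duplicates(["user_id", "record_date"])

# Fill the remaining gap with the column mean; fillna(0) is the other option listed above.
mean_value = cleaned["power_consumption"].mean()
cleaned = cleaned.assign(
    power_consumption=cleaned["power_consumption"].fillna(mean_value)
)

print(cleaned)
print(cleaned.isnull().sum())
```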
In [11]:
import numpy as np
import pandas as pd

_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
_df.head()
#_df.shape

# dropna() and drop_duplicates() return new frames, so assign the results back.
_df = _df.dropna(axis=0, how='all')
_df = _df.drop_duplicates(['user_id', 'record_date'])

# Parse the date strings into datetime values.
_df['record_date'] = pd.to_datetime(_df['record_date'])
_df.head()
 
Out[11]:
 
 record_date user_id power_consumption
0 2015-01-01 1 1135
1 2015-01-02 1 570
2 2015-01-03 1 3418
3 2015-01-04 1 3968
4 2015-01-05 1 3986
 

Building strong time-related features

In [7]:
import numpy as np
import pandas as pd

_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
#_df.shape

# dropna() and drop_duplicates() return new frames, so assign the results back.
_df = _df.dropna(axis=0, how='all')
_df = _df.drop_duplicates(['user_id', 'record_date'])

_df['record_date'] = pd.to_datetime(_df['record_date'])
_df.head()

# Placeholder rows for the days to predict: 2016-10-01 to 2016-10-31.
test_df = pd.date_range('2016-10-1', periods=31, freq='D')
test_df = pd.DataFrame(test_df, columns=['record_date'])
test_df['power_consumption'] = 0.0

# Stack history and placeholder rows so the features are built once for both.
total_df = pd.concat([_df, test_df], ignore_index=True)
# Do not call dropna() here: the placeholder rows have no user_id and would be removed.
total_df.tail()
# Time-related features
total_df['day_of_week'] = total_df['record_date'].dt.dayofweek   # Monday=0 ... Sunday=6
total_df['day_of_month'] = total_df['record_date'].dt.day
total_df['day_of_year'] = total_df['record_date'].dt.dayofyear
total_df['month_of_year'] = total_df['record_date'].dt.month
total_df['year'] = total_df['record_date'].dt.year

# Weekday versus weekend: consumption on Saturdays and Sundays clearly differs from working days
total_df['holiday'] = 0
total_df['holiday_sat'] = 0
total_df['holiday_sun'] = 0

# Weekend flags
total_df.loc[total_df.day_of_week == 5, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 5, 'holiday_sat'] = 1

total_df.loc[total_df.day_of_week == 6, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday_sun'] = 1

# Week of the month, in four buckets
def week_of_month(day):
    if day in range(1,8):return 1
    if day in range(8,15):return 2
    if day in range(15,22):return 3
    if day in range(22,32):return 4

total_df['week_of_month'] = total_df['day_of_month'].apply(week_of_month)
total_df.head()

# First, middle, or last third of the month. Some enterprises schedule work by these periods,
# which may also affect consumption.
def period_of_month(day):
    if day in range(1,11):return 1
    if day in range(11,21):return 2
    if day in range(21,32):return 3
    
total_df['period_of_month'] = total_df['day_of_month'].apply(period_of_month)
total_df.head()

# First half or second half of the month
def period2_of_month(day):
    if day in range(1,16):return 1
    if day in range(16,32):return 2
total_df['period2_of_month'] = total_df['day_of_month'].apply(period2_of_month)
total_df.head()

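# Fill in holiday information by hand. Public holidays also have a large effect on consumption:
# most enterprises close, so consumption drops sharply. We look the dates up in a calendar
# and fill a feature column marking whether each day is a holiday.
def day_of_festival(day):
    # National Day holiday, 2016-10-01 to 2016-10-07.
    l_festival=['2016-10-01','2016-10-02','2016-10-03','2016-10-04','2016-10-05','2016-10-06','2016-10-07']
    # Compare as 'YYYY-MM-DD' strings so that Timestamp values match the list entries.
    if day.strftime('%Y-%m-%d') in l_festival:return 1
    else:return 0
    
total_df['festival_pc']=0

# Derive the flag from record_date.
total_df['festival']=total_df['record_date'].apply(day_of_festival)

total_df.head(20)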
Out[7]:
 
 power_consumption record_date user_id day_of_week day_of_month day_of_year month_of_year year holiday holiday_sat holiday_sun week_of_month period_of_month period2_of_month festival_pc festival
0 1135.0 2015-01-01 1.0 3 1 1 1 2015 0 0 0 1 1 1 0 0
1 570.0 2015-01-02 1.0 4 2 2 1 2015 0 0 0 1 1 1 0 0
2 3418.0 2015-01-03 1.0 5 3 3 1 2015 1 1 0 1 1 1 0 0
3 3968.0 2015-01-04 1.0 6 4 4 1 2015 1 0 1 1 1 1 0 0
4 3986.0 2015-01-05 1.0 0 5 5 1 2015 0 0 0 1 1 1 0 0
5 4082.0 2015-01-06 1.0 1 6 6 1 2015 0 0 0 1 1 1 0 0
6 4172.0 2015-01-07 1.0 2 7 7 1 2015 0 0 0 1 1 1 0 0
7 4022.0 2015-01-08 1.0 3 8 8 1 2015 0 0 0 2 1 1 0 0
8 4025.0 2015-01-09 1.0 4 9 9 1 2015 0 0 0 2 1 1 0 0
9 4047.0 2015-01-10 1.0 5 10 10 1 2015 1 1 0 2 1 1 0 0
10 4135.0 2015-01-11 1.0 6 11 11 1 2015 1 0 1 2 2 1 0 0
11 4111.0 2015-01-12 1.0 0 12 12 1 2015 0 0 0 2 2 1 0 0
12 3926.0 2015-01-13 1.0 1 13 13 1 2015 0 0 0 2 2 1 0 0
13 4244.0 2015-01-14 1.0 2 14 14 1 2015 0 0 0 2 2 1 0 0
14 4144.0 2015-01-15 1.0 3 15 15 1 2015 0 0 0 3 2 1 0 0
15 4269.0 2015-01-16 1.0 4 16 16 1 2015 0 0 0 3 2 2 0 0
16 4262.0 2015-01-17 1.0 5 17 17 1 2015 1 1 0 3 2 2 0 0
17 2782.0 2015-01-18 1.0 6 18 18 1 2015 1 0 1 3 2 2 0 0
18 3327.0 2015-01-19 1.0 0 19 19 1 2015 0 0 0 3 2 2 0 0
19 4002.0 2015-01-20 1.0 1 20 20 1 2015 0 0 0 3 2 2 0 0
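The calendar features above can also be computed without per-row Python functions. The sketch below is one possible vectorized variant, not the notebook's own code: it uses the pandas `.dt` accessor, integer arithmetic for the week bucket, and `isin` for the holiday list. The short date range is chosen only so that the output is easy to check by eye:

```python
import pandas as pd

dates = pd.DataFrame({"record_date": pd.date_range("2016-09-28", periods=12, freq="D")})
d = dates["record_date"]

dates["day_of_week"] = d.dt.dayofweek            # Monday=0 ... Sunday=6
dates["day_of_month"] = d.dt.day
dates["holiday"] = (d.dt.dayofweek >= 5).astype(int)

# Days 1-7 -> 1, 8-14 -> 2, 15-21 -> 3, 22-31 -> 4 (the same buckets as week_of_month).
dates["week_of_month"] = ((d.dt.day - 1) // 7 + 1).clip(upper=4)

# National Day holiday 2016-10-01 to 2016-10-07, the same dates as in l_festival.
festival_days = pd.date_range("2016-10-01", "2016-10-07")
dates["festival"] = d.isin(festival_days).astype(int)

print(dates)
```

On a table with hundreds of thousands of rows the vectorized form is usually noticeably faster than `apply` with a Python function, while producing the same buckets.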
In [6]:
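# Feature columns available so far:
    # date
    # power consumption
    # day of week
    # day of month
    # day of year
    # month of year
    # year
    # weekend flag
    # week of month
    # third of the month (first / middle / last)
    # first or second half of the month
    # festival flag
col_names=total_df.columns.values
col_names

# Count missing values per column; only user_id should be non-zero (the 31 placeholder rows)
counts={}
for name in col_names:
    count=total_df[name].isnull().sum()
    counts[name]=[count]

is_null_fields = pd.DataFrame(counts)
is_null_fields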
Out[6]:
 
 day_of_month day_of_week day_of_year festival festival_pc holiday holiday_sat holiday_sun month_of_year period2_of_month period_of_month power_consumption record_date user_id week_of_month year
0 0 0 0 0 0 0 0 0 0 0 0 0 0 31 0 0
In [ ]:
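# 4. Model design
Covers: tailoring to the application scenario and setting the model parameters
Splitting the training and test sets
We split the training and test sets by date for the modelling that follows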
In [15]:
## Every row outside October 2016 is training data
train_X = total_df[~((total_df.year==2016)&(total_df.month_of_year==10))]
test_X = total_df[((total_df.year==2016)&(total_df.month_of_year==10))]
train_y = train_X.power_consumption
train_X = train_X.drop(['power_consumption','record_date','year'],axis=1)
test_X = test_X.drop(['power_consumption','record_date','year'],axis=1)
train_X.head()
Out[15]:
 
 user_id day_of_week day_of_month day_of_year month_of_year holiday holiday_sat holiday_sun week_of_month period_of_month period2_of_month festival_pc festival
0 1.0 3 1 1 1 0 0 0 1 1 1 0 0
1 1.0 4 2 2 1 0 0 0 1 1 1 0 0
2 1.0 5 3 3 1 1 1 0 1 1 1 0 0
3 1.0 6 4 4 1 1 0 1 1 1 1 0 0
4 1.0 0 5 5 1 0 0 0 1 1 1 0 0
In [16]:
train_X.shape
Out[16]:
(885468, 13)
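The split above leaves no labelled rows to check a forecast against, because the October rows only carry a placeholder target of 0. One option, which is not part of the original notebook, is to hold out the most recent known month as a validation set. A minimal sketch on a hypothetical daily frame (the dates and values are made up):

```python
import pandas as pd

# Hypothetical daily frame covering 2016-07-01 to 2016-09-30 (values are made up).
daily = pd.DataFrame({"record_date": pd.date_range("2016-07-01", "2016-09-30", freq="D")})
daily["power_consumption"] = range(len(daily))

# Hold out the most recent month for validation; never shuffle across time.
cutoff = pd.Timestamp("2016-09-01")
fit_part = daily[daily["record_date"] < cutoff]
valid_part = daily[daily["record_date"] >= cutoff]

print(len(fit_part), len(valid_part))
```

Scoring a model on `valid_part` gives an estimate that is closer to the real task (predicting a future month) than a score computed on the rows the model was trained on.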
In [ ]:
# 5. Modelling and tuning: use grid search with cross-validation to find the best parameters
# DecisionTree
In [17]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

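# max_features: a float is a fraction of the features; the integer 1 means a single feature (use 1.0 for all).
param_grid = {'max_features': [0.7, 0.8, 0.9, 1],
              'max_depth':  [3, 5, 7, 9, 12]
             }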

dt = DecisionTreeRegressor()

grid = GridSearchCV(dt, param_grid=param_grid, cv=5, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
best_dt_reg = grid.best_estimator_
print(best_dt_reg)
print(best_dt_reg.score(train_X,train_y))
 
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=0.9,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
0.919640509654
In [ ]:
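# Check how well the model fits the training set (score() returns R² on the data passed in, here the training data itself)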
In [18]:
best_dt_reg.score(train_X, train_y)
Out[18]:
0.91964050965438204
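For a scikit-learn regressor, score() reports the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers to show what the two reference points mean: 1.0 for perfect predictions and 0.0 for a model that always predicts the mean. The helper name `r_squared` is my own and is not a library function:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y_true, y_true)                  # predictions equal the targets
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])   # always predict the mean
print(perfect, baseline)
```

A value of about 0.92 measured on the training data therefore says the tree explains most of the variance it has already seen; it says little about October, which the model has not seen.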
In [ ]:
# Predict the results
In [28]:
from datetime import datetime 

# Convert dates to the submission format (YYYYMMDD)
def dataprocess(t):
    t = str(t)[0:10]
    time = datetime.strptime(t, '%Y-%m-%d')
    res = time.strftime('%Y%m%d')
    return res

# Build the 31 days of October
commit_df = pd.date_range('2016/10/1', periods=31, freq='D')
commit_df = pd.DataFrame(commit_df)
commit_df.columns = ['predict_date']

# Predict with the model.
# The placeholder rows have no user_id, so the day of month is copied in as a stand-in value.
test_X['user_id'] = test_X['day_of_month']
test_X
prediction = best_dt_reg.predict(test_X.values)
commit_df['predict_power_consumption'] = prediction.astype('int')
commit_df['predict_date'] = commit_df['predict_date'].apply(dataprocess)
commit_df.head()
Out[28]:
 
 predict_date predict_power_consumption
0 20161001 3820886
1 20161002 3845830
2 20161003 3845830
3 20161004 3845830
4 20161005 3845830
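The date conversion done by dataprocess can also be written with the pandas `.dt.strftime` accessor, which formats a whole datetime column at once. A short sketch, shown on three dates only:

```python
import pandas as pd

# Three October days, formatted as YYYYMMDD strings in one vectorized call.
commit_dates = pd.Series(pd.date_range("2016/10/1", periods=3, freq="D"))
formatted = commit_dates.dt.strftime("%Y%m%d")
print(formatted.tolist())
```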
In [ ]:
RandomForest (ensemble model)
In [ ]:
# RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

### Number of trees, tree depth (usually no more than 10), fraction of the features used when building each tree, and sampling
param_grid = {
              'n_estimators': [5, 8, 10, 15, 20, 50, 100, 200],
              'max_depth': [3, 5, 7, 9],
              'max_features': [0.6, 0.7, 0.8, 0.9],
             }

rf = RandomForestRegressor()

grid = GridSearchCV(rf, param_grid=param_grid, cv=3, n_jobs=8, refit=True)
grid.fit(train_X, train_y)

breg = grid.best_estimator_
print(breg)
print(breg.score(train_X, train_y))
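Before launching a grid search on a table this large, it helps to count how many models will be trained. The sketch below does the arithmetic for the grid above with cv=3; it uses only the standard library, and the final +1 accounts for the refit on the full training set that refit=True triggers:

```python
from itertools import product

param_grid = {
    "n_estimators": [5, 8, 10, 15, 20, 50, 100, 200],
    "max_depth": [3, 5, 7, 9],
    "max_features": [0.6, 0.7, 0.8, 0.9],
}
cv_folds = 3

# Every combination of the listed values is tried once per fold.
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds + 1  # +1 for the final refit on all data
print(len(combinations), total_fits)
```

With 128 combinations and 3 folds this comes to 385 fits, several of them forests with 200 trees, so trimming the grid or sampling the data first may be worth considering.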
In [ ]:
Predicting with the model
In [ ]:
from datetime import datetime 

# Convert dates to the submission format (YYYYMMDD)
def dataprocess(t):
    t = str(t)[0:10]
    time = datetime.strptime(t, '%Y-%m-%d')
    res = time.strftime('%Y%m%d')
    return res


# Predict with the model.
# The placeholder rows have no user_id, so the day of month is copied in as a stand-in value.
test_X['user_id'] = test_X['day_of_month']

commit_df = pd.date_range('2016/10/1', periods=31, freq='D')
commit_df = pd.DataFrame(commit_df)
commit_df.columns = ['predict_date']
prediction = breg.predict(test_X)
commit_df['predict_power_consumption'] = prediction.astype('int')
commit_df['predict_date'] = commit_df['predict_date'].apply(dataprocess)

commit_df.head()
 

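Summary:

This consumption-forecasting example shows that analyzing the business data and extracting features before modelling matters a great deal: the features directly determine how accurate the predictions can be. Good feature extraction requires understanding the business scenario as fully and accurately as possible; combined with a suitable algorithm, that is what produces good results.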

 

Full code

In [ ]:
import numpy as np
import pandas as pd

_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
_df['record_date'] = pd.to_datetime(_df['record_date'])
_df.head()

# Sum all users per day: one row per date with the total consumption.
train_df = _df[['record_date','power_consumption']].groupby(by='record_date').agg('sum')
train_df = train_df.reset_index()
train_df.head()

# Placeholder rows for the days to predict: 2016-10-01 to 2016-10-31.
test_df = pd.date_range('2016-10-1', periods=31, freq='D')
test_df = pd.DataFrame(test_df, columns=['record_date'])
test_df['power_consumption'] = 0.0

# Stack the daily totals and the placeholder rows.
total_df = pd.concat([train_df, test_df], ignore_index=True)
total_df.tail()

# Time-related features
total_df['day_of_week'] = total_df['record_date'].dt.dayofweek   # Monday=0 ... Sunday=6
total_df['day_of_month'] = total_df['record_date'].dt.day
total_df['day_of_year'] = total_df['record_date'].dt.dayofyear
total_df['month_of_year'] = total_df['record_date'].dt.month
total_df['year'] = total_df['record_date'].dt.year

# Weekday versus weekend: consumption on Saturdays and Sundays clearly differs from working days
total_df['holiday'] = 0
total_df['holiday_sat'] = 0
total_df['holiday_sun'] = 0

# Weekend flags
total_df.loc[total_df.day_of_week == 5, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 5, 'holiday_sat'] = 1

total_df.loc[total_df.day_of_week == 6, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday_sun'] = 1

# Week of the month, in four buckets
def week_of_month(day):
    if day in range(1,8):return 1
    if day in range(8,15):return 2
    if day in range(15,22):return 3
    if day in range(22,32):return 4

total_df['week_of_month'] = total_df['day_of_month'].apply(week_of_month)
total_df.head()

# First, middle, or last third of the month
def period_of_month(day):
    if day in range(1,11):return 1
    if day in range(11,21):return 2
    if day in range(21,32):return 3
    
total_df['period_of_month'] = total_df['day_of_month'].apply(period_of_month)
total_df.head()

# First half or second half of the month
def period2_of_month(day):
    if day in range(1,16):return 1
    if day in range(16,32):return 2
total_df['period2_of_month'] = total_df['day_of_month'].apply(period2_of_month)
total_df.head()

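# Fill in holiday information by hand. Public holidays also have a large effect on consumption:
# most enterprises close, so consumption drops sharply. We look the dates up in a calendar
# and fill a feature column marking whether each day is a holiday.
def day_of_festival(day):
    # National Day holiday, 2016-10-01 to 2016-10-07.
    l_festival=['2016-10-01','2016-10-02','2016-10-03','2016-10-04','2016-10-05','2016-10-06','2016-10-07']
    # Compare as 'YYYY-MM-DD' strings so that Timestamp values match the list entries.
    if day.strftime('%Y-%m-%d') in l_festival:return 1
    else:return 0
    
total_df['festival_pc']=0

# Derive the flag from record_date.
total_df['festival']=total_df['record_date'].apply(day_of_festival)

total_df.head(20)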

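# Feature columns available so far:
    # date
    # power consumption
    # day of week
    # day of month
    # day of year
    # month of year
    # year
    # weekend flag
    # week of month
    # third of the month (first / middle / last)
    # first or second half of the month
    # festival flag
col_names=total_df.columns.values
col_names

# Count missing values per column to confirm the data has no unexpected gaps
counts={}
for name in col_names:
    count=total_df[name].isnull().sum()
    counts[name]=[count]

is_null_fields = pd.DataFrame(counts)
is_null_fields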

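# One-hot encoding: for the day-of-week feature, start from a length-7 vector [0,0,0,0,0,0,0]
    # Categorical features often get this kind of special treatment during feature engineering
    # Monday becomes    [1,0,0,0,0,0,0]
    # Tuesday becomes   [0,1,0,0,0,0,0]
    # Wednesday becomes [0,0,1,0,0,0,0]
    # Thursday becomes  [0,0,0,1,0,0,0]
    # and so on...


# Tree-based modelling: tree models are among the most widely used machine learning algorithms in industry. Training learns the best decision paths from the training set, and the leaf node at the end of each path holds the prediction.
# 1. Split the training and test sets
## Every row outside October 2016 is training data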
train_X = total_df[~((total_df.year==2016)&(total_df.month_of_year==10))]
test_X = total_df[((total_df.year==2016)&(total_df.month_of_year==10))]
#print(train_X.shape)
#print(test_X.shape)

train_y = train_X.power_consumption
train_X = train_X.drop(['power_consumption','record_date','year'],axis=1)
test_X = test_X.drop(['power_consumption','record_date','year'],axis=1)

train_X.head()

# Modelling and tuning: use grid search with cross-validation to find the best parameters
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {'max_features': [0.7, 0.8, 0.9, 1],
              'max_depth':  [3, 5, 7, 9, 12]
             }

dt = DecisionTreeRegressor()

grid = GridSearchCV(dt, param_grid=param_grid, cv=5, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
best_dt_reg = grid.best_estimator_
print(best_dt_reg)
print(best_dt_reg.score(train_X,train_y))

from datetime import datetime 

# Convert dates to the submission format (YYYYMMDD)
def dataprocess(t):
    t = str(t)[0:10]
    time = datetime.strptime(t, '%Y-%m-%d')
    res = time.strftime('%Y%m%d')
    return res

# Build the 31 days of October
commit_df = pd.date_range('2016/10/1', periods=31, freq='D')
commit_df = pd.DataFrame(commit_df)
commit_df.columns = ['predict_date']

# Predict with the model
prediction = best_dt_reg.predict(test_X.values)
commit_df['predict_power_consumption'] = prediction.astype('int')
commit_df['predict_date'] = commit_df['predict_date'].apply(dataprocess)
commit_df.head()
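The full listing describes one-hot encoding of the day of week in comments but does not show code for it. One way to do it, offered as a sketch rather than the author's method, is `pd.get_dummies` on a categorical column. The `dow` prefix and the toy values are my own choices:

```python
import pandas as pd

frame = pd.DataFrame({"day_of_week": [0, 1, 2, 6]})  # Monday=0 ... Sunday=6

# Fix the category set so that all seven columns appear even if a day is absent from the data.
frame["day_of_week"] = pd.Categorical(frame["day_of_week"], categories=list(range(7)))
one_hot = pd.get_dummies(frame["day_of_week"], prefix="dow").astype(int)

print(one_hot)
```

Tree models can split on the integer code directly, so one-hot encoding matters more for linear models; for trees it is optional.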
Feature importance

 

 

%matplotlib inline
import matplotlib.pyplot as plt
print("Feature ranking:")
# Take the names from the training frame so they always match the fitted model's inputs.
feature_names = np.array(train_X.columns)
# breg is the RandomForest estimator fitted in the RandomForest section.
feature_importances = breg.feature_importances_
indices = np.argsort(feature_importances)[::-1]

for f in indices:
    print("feature %s (%f)" % (feature_names[f], feature_importances[f]))

plt.figure(figsize=(20,8))
plt.title("Feature importances")
plt.bar(range(len(feature_importances)), feature_importances[indices],
       color="b",align="center")
plt.xticks(range(len(feature_importances)), feature_names[indices])
plt.xlim([-1, train_X.shape[1]])
plt.show()

Remarks:

Model design (general outline):

load data

cross-validation

classifier

model = classifier.fit(x, y)

predict = model.transform(x, y)

predict.filter()

predict.count()

 

sklearn:

from sklearn.tree import DecisionTreeRegressor

from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestRegressor

import xgboost as xgb

Regression problems predict a continuous value, as in the consumption forecast above:

DecisionTreeRegressor()

RandomForestRegressor()

xgb.XGBRegressor()

GridSearchCV(xgb_model, param_grid, n_jobs=8)

 

param_grid = {'max_features': [0.7, 0.8, 0.9, 1],
              'max_depth':  [3, 5, 7, 9, 12]
             }

dt = DecisionTreeRegressor()

grid = GridSearchCV(dt, param_grid=param_grid, cv=5, n_jobs=8, refit=True)

grid.fit(train_X, train_y)

best_dt_reg = grid.best_estimator_

best_dt_reg.predict(test_X.values)

 

rf = RandomForestRegressor()

grid = GridSearchCV(rf, param_grid=param_grid, cv=3, n_jobs=8, refit=True)

grid.fit(train_X, train_y)

best_dt_reg = grid.best_estimator_

best_dt_reg.score(train_X, train_y)

best_dt_reg.predict(test_X.values)

 

param_grid = {
              'max_depth': [3, 4, 5, 7, 9],
              'n_estimators': [20, 40, 50, 80, 100, 200, 400, 800, 1000, 1200],
              'learning_rate': [0.05, 0.1, 0.2, 0.3],
              'subsample': [0.8, 1],
              'colsample_bylevel':[0.8, 1]
             }

# Use the xgboost regressor for the regression

xgb_model = xgb.XGBRegressor()

# Fit the data

rgs = GridSearchCV(xgb_model, param_grid, n_jobs=8)

rgs.fit(train_X, train_y)

print(rgs.best_score_)

print(rgs.best_params_)

rgs.predict(test_X.values)

 

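LogisticRegression (logistic regression): used for classification, primarily binary, but also applicable to multi-class problems via the one-vs-rest approach; its advantage is that every prediction comes with a probability for each class.

GaussianNB (naive Bayes): performs well on multi-class classification problems.

kNN (k-nearest neighbors): often used as one part of a more complex classification algorithm, with its estimate serving as a feature of the object.

DecisionTree (decision trees, classification and regression trees, CART): suitable for multi-class classification.

SVM (support vector machine): used for classification problems.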
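The remark that logistic regression yields a probability per class can be illustrated with the logistic function itself. The sketch below is a hand-written toy for the binary case, not scikit-learn's implementation; the weights, bias, and feature values are invented for illustration:

```python
import math

def predict_proba(weights, bias, features):
    """Return (P(class 0), P(class 1)) for one sample under a logistic model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    p1 = 1.0 / (1.0 + math.exp(-score))  # logistic (sigmoid) function
    return 1.0 - p1, p1

# Hypothetical model with two features.
weights, bias = [2.0, -1.0], 0.0

undecided = predict_proba(weights, bias, [1.0, 2.0])   # score = 0
confident = predict_proba(weights, bias, [3.0, 0.0])   # score = 6
print(undecided, confident)
```

A score of zero maps to a probability of 0.5 for each class, and larger positive scores push the probability of class 1 toward 1.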

 
