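Copyright notice: This technical column is the author's (秦凱新, Qin Kaixin) summary and distillation of day-to-day work. It draws on cases from real commercial environments, and offers tuning advice for commercial applications as well as cluster capacity planning. Please keep following this blog series. QQ mailbox: 1120746959@qq.com. Feel free to get in touch for any academic exchange.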
1 Python Environment Setup
- Shift + Enter: run the current cell and move to the next one (Jupyter Notebook)
- Ctrl + Enter: run the current cell in place
2 Python IDE Environment Setup
3 Data Preprocessing
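- Show the first few rows

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    # Jupyter magic: render plots inline in the notebook
    %matplotlib inline
    from sklearn.ensemble import RandomForestClassifier
    # sklearn.cross_validation no longer exists in current scikit-learn; KFold lives in model_selection
    from sklearn.model_selection import KFold

    # import data (raw string so the backslashes are not read as escape sequences)
    filename = r"C:\ML\MLData\data.csv"
    raw = pd.read_csv(filename)
    print(raw.shape)
    raw.head()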
- Show the last few rows
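As a small illustration of previewing both ends of a table, the sketch below builds a toy DataFrame (column names and values are invented for illustration, not read from data.csv) and calls `head()` and `tail()`. Both return five rows by default and accept a row count.

```python
import pandas as pd

# Toy stand-in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "shot_id": range(1, 11),
    "loc_x": [0, 5, -3, 12, 7, -8, 0, 4, 9, -1],
})

first_rows = toy.head()   # first 5 rows by default
last_rows = toy.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)
```

On the real data the equivalent call would presumably be `raw.tail()`.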
- Drop null values
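The later snippets use a DataFrame named `kobe` that is never defined in the text shown here. One plausible reading, stated here as an assumption, is that it holds the rows of `raw` whose `shot_made_flag` is not null, which is the same filter used later for the training set. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `raw`; the values are invented for illustration.
raw = pd.DataFrame({
    "loc_x": [10, 0, -25, 40],
    "shot_made_flag": [1.0, np.nan, 0.0, np.nan],
})

# Assumption: `kobe` is the subset of rows with a known label.
kobe = raw[pd.notnull(raw["shot_made_flag"])]

# An equivalent spelling: dropna restricted to the label column.
kobe_alt = raw.dropna(subset=["shot_made_flag"])

print(kobe.shape)
```

Restricting `dropna` with `subset` matters here: a bare `raw.dropna()` would also discard rows that have nulls in any other column.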
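- Plot the distribution of column attributes with matplotlib

    # plt.subplot(211): the first digit is the number of rows, the second the number of columns
    # transparency (controls color depth and density)
    alpha = 0.02
    # approximate area occupied by the figure
    plt.figure(figsize=(10, 10))

    # loc_x and loc_y (first position in a 1-row, 2-column grid)
    plt.subplot(121)
    # scatter plot
    plt.scatter(kobe.loc_x, kobe.loc_y, color='r', alpha=alpha)
    plt.title('loc_x and loc_y')

    # lat and lon (second position in a 1-row, 2-column grid)
    plt.subplot(122)
    plt.scatter(kobe.lon, kobe.lat, color='b', alpha=alpha)
    plt.title('lat and lon')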
- Angle and polar-coordinate preprocessing

    raw['dist'] = np.sqrt(raw['loc_x']**2 + raw['loc_y']**2)

    loc_x_zero = raw['loc_x'] == 0
    # print(loc_x_zero)
    # start from a float column and assign through .loc to avoid chained assignment
    raw['angle'] = 0.0
    raw.loc[~loc_x_zero, 'angle'] = np.arctan(raw.loc[~loc_x_zero, 'loc_y'] / raw.loc[~loc_x_zero, 'loc_x'])
    raw.loc[loc_x_zero, 'angle'] = np.pi / 2
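To make the polar conversion concrete, here is a minimal sketch on a few made-up coordinates. It follows the same approach as the step above (arctan of `loc_y / loc_x`, with pi/2 substituted where `loc_x` is 0). The helper name `to_polar` is hypothetical.

```python
import numpy as np
import pandas as pd

def to_polar(df):
    """Return a copy of df with 'dist' and 'angle' columns added (hypothetical helper)."""
    out = df.copy()
    out["dist"] = np.sqrt(out["loc_x"] ** 2 + out["loc_y"] ** 2)
    x_zero = out["loc_x"] == 0
    out["angle"] = np.pi / 2  # value used where loc_x == 0
    out.loc[~x_zero, "angle"] = np.arctan(
        out.loc[~x_zero, "loc_y"] / out.loc[~x_zero, "loc_x"]
    )
    return out

# Made-up coordinates for illustration.
toy = pd.DataFrame({"loc_x": [3, 0, -1], "loc_y": [4, 7, 1]})
polar = to_polar(toy)
print(polar)
```

Note that `arctan` only returns angles in (-pi/2, pi/2), so two points mirrored through the origin share an angle. `np.arctan2(loc_y, loc_x)` would distinguish them, at the cost of departing from the encoding used in this post.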
- Time handling

    # total seconds remaining in the period
    raw['remaining_time'] = raw['minutes_remaining'] * 60 + raw['seconds_remaining']
- Print each attribute's unique values and grouped counts

    # shot types
    print(kobe.action_type.unique())
    print(kobe.combined_shot_type.unique())
    print(kobe.shot_type.unique())
    # grouped counts
    print(kobe.shot_type.value_counts())
- Handle special characters column by column

    kobe['season'].unique()

    array(['2000-01', '2001-02', '2002-03', '2003-04', '2004-05', '2005-06',
           '2006-07', '2007-08', '2008-09', '2009-10', '2010-11', '2011-12',
           '2012-13', '2013-14', '2014-15', '2015-16', '1996-97', '1997-98',
           '1998-99', '1999-00'], dtype=object)

    # keep only the two digits after the hyphen, as an integer
    raw['season'] = raw['season'].apply(lambda x: int(x.split('-')[1]))
    raw['season'].unique()

    array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 97,
           98, 99,  0], dtype=int64)
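The lambda keeps only the two digits after the hyphen, so '1999-00' becomes 0 and '1996-97' becomes 97, and the resulting integers no longer sort chronologically. The sketch below reproduces that parsing and, as an optional variant that is not part of the original post, derives the four-digit starting year instead. Both helper names are hypothetical.

```python
import pandas as pd

def season_suffix(season):
    """Parse '2000-01' -> 1, as the lambda in the text does."""
    return int(season.split("-")[1])

def season_start_year(season):
    """Hypothetical variant: parse '2000-01' -> 2000."""
    return int(season.split("-")[0])

seasons = pd.Series(["1996-97", "1999-00", "2000-01", "2015-16"])
suffix = seasons.apply(season_suffix)
start_year = seasons.apply(season_start_year)

print(suffix.tolist())      # [97, 0, 1, 16]
print(start_year.tolist())  # [1996, 1999, 2000, 2015]
```

Because `season` is one-hot encoded later in this post, the ordering of the integers does not affect the model in this pipeline. It would matter if the column were fed in as a numeric feature.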
- A pandas DataFrame tip (matchup: the two teams facing each other; opponent: the opposing team)

    # put two columns side by side to compare them
    pd.DataFrame({'matchup': kobe.matchup, 'opponent': kobe.opponent})
- Check whether two attributes are linearly related (computed position distance vs. shot distance)

    plt.figure(figsize=(5, 5))
    plt.scatter(raw.dist, raw.shot_distance, color='blue')
    plt.title('dist and shot_distance')
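A scatter plot shows the relationship visually, and a Pearson correlation coefficient puts a number on it. The sketch below uses synthetic columns; the factor of 10 and the noise level are assumptions chosen for illustration, not properties of the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: 'dist' is roughly 10 x 'shot_distance' plus small noise.
shot_distance = rng.integers(0, 30, size=200)
dist = 10.0 * shot_distance + rng.normal(0.0, 3.0, size=200)
toy = pd.DataFrame({"dist": dist, "shot_distance": shot_distance})

# Pearson correlation; values near 1 indicate a near-linear relationship.
r = toy["dist"].corr(toy["shot_distance"])
print(round(r, 3))
```

When two columns are this strongly correlated, one of them adds little information, which is consistent with `shot_distance` being dropped in a later step of this post.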
- Group Kobe's shots by shot zone with pandas groupby

    gs = kobe.groupby('shot_zone_area')
    print(kobe['shot_zone_area'].value_counts())
    print(len(gs))

    Center(C)                11289
    Right Side Center(RC)     3981
    Right Side(R)             3859
    Left Side Center(LC)      3364
    Left Side(L)              3132
    Back Court(BC)              72
    Name: shot_zone_area, dtype: int64
    6
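The two printed values come from different objects: `value_counts()` counts rows per label, while `len()` of a groupby object is the number of distinct groups. Iterating over the groupby yields `(key, sub-DataFrame)` pairs, which is why plotting code can index each item with `[1]`. A toy sketch, with zone labels borrowed from the output above and invented rows:

```python
import pandas as pd

# Toy data; the rows are invented for illustration.
toy = pd.DataFrame({
    "shot_zone_area": ["Center(C)", "Right Side(R)", "Center(C)",
                       "Left Side(L)", "Center(C)", "Right Side(R)"],
    "loc_x": [0, 120, 5, -130, -4, 110],
})

gs = toy.groupby("shot_zone_area")
counts = toy["shot_zone_area"].value_counts()

print(len(gs))           # number of distinct groups
group_names = []
for name, group in gs:   # each item is a (key, sub-DataFrame) pair
    group_names.append(name)
    print(name, len(group))
```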
- Display the zone partitions using zip

    import matplotlib.cm as cm
    plt.figure(figsize=(20, 10))

    def scatter_plot_by_category(feat):
        alpha = 0.1
        gs = kobe.groupby(feat)
        # one rainbow color per group
        cs = cm.rainbow(np.linspace(0, 1, len(gs)))
        # zip pairs each (key, sub-DataFrame) group with its color
        for g, c in zip(gs, cs):
            plt.scatter(g[1].loc_x, g[1].loc_y, color=c, alpha=alpha)

    # shot_zone_area
    plt.subplot(131)
    scatter_plot_by_category('shot_zone_area')
    plt.title('shot_zone_area')

    # shot_zone_basic
    plt.subplot(132)
    scatter_plot_by_category('shot_zone_basic')
    plt.title('shot_zone_basic')

    # shot_zone_range
    plt.subplot(133)
    scatter_plot_by_category('shot_zone_range')
    plt.title('shot_zone_range')
- Drop columns

    drops = ['shot_id', 'team_id', 'team_name', 'shot_zone_area', 'shot_zone_range', 'shot_zone_basic',
             'matchup', 'lon', 'lat', 'seconds_remaining', 'minutes_remaining',
             'shot_distance', 'loc_x', 'loc_y', 'game_event_id', 'game_id', 'game_date']
    # recent pandas no longer accepts axis as a positional argument; name the columns explicitly
    raw = raw.drop(columns=drops)
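A short sketch of how `drop` behaves, on a toy frame whose values are invented for illustration. It returns a new DataFrame rather than modifying the original, and `errors="ignore"` makes a repeated run harmless, which is convenient when a notebook cell is executed twice.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "shot_id": [1, 2],
    "team_name": ["x", "x"],
    "dist": [5.0, 7.0],
})

# Drop several columns in one call; the original frame is left unchanged.
drops = ["shot_id", "team_name"]
trimmed = toy.drop(columns=drops)

# errors="ignore" skips names that are not present instead of raising KeyError.
trimmed_again = trimmed.drop(columns=drops, errors="ignore")

print(list(trimmed.columns))
```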
- One-hot encoding (one column becomes several 0/1 columns; prefix sets the prefix of the new column names)

    print(raw['combined_shot_type'].value_counts())
    pd.get_dummies(raw['combined_shot_type'], prefix='combined_shot_type')[0:2]

    Jump Shot    23485
    Layup         5448
    Dunk          1286
    Tip Shot       184
    Hook Shot      153
    Bank Shot      141
    Name: combined_shot_type, dtype: int64
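To show what the encoded columns look like, the sketch below applies `pd.get_dummies` to four made-up values. Each distinct value becomes one column named `prefix_value`, and each row has exactly one flag set.

```python
import pandas as pd

# Made-up sample of shot types for illustration.
shots = pd.Series(["Jump Shot", "Layup", "Dunk", "Jump Shot"],
                  name="combined_shot_type")

dummies = pd.get_dummies(shots, prefix="combined_shot_type")

# Cast to int so the output looks the same whether pandas returns bool or uint8 flags.
flags = dummies.astype(int)
print(list(flags.columns))
print(flags.values.tolist())
```

The columns come out in sorted order of the distinct values, not in order of first appearance.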
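- After one-hot encoding, concatenate the new columns onto the DataFrame and drop the original column.

    categorical_vars = ['action_type', 'combined_shot_type', 'shot_type', 'opponent', 'period', 'season']
    for var in categorical_vars:
        # append the dummy columns, then remove the source column
        raw = pd.concat([raw, pd.get_dummies(raw[var], prefix=var)], axis=1)
        raw = raw.drop(columns=var)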
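The concatenate-then-drop loop can also be written as a single `pd.get_dummies` call with the `columns=` argument. The sketch below runs both forms on a toy frame (values invented for illustration) so the results can be compared. It is offered as an alternative spelling, not as what the author ran.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "dist": [5.0, 7.0, 1.5],
    "shot_type": ["2PT Field Goal", "3PT Field Goal", "2PT Field Goal"],
    "period": [1, 4, 1],
})
categorical_vars = ["shot_type", "period"]

# Loop form: append dummy columns, then drop the source column.
looped = toy.copy()
for var in categorical_vars:
    looped = pd.concat([looped, pd.get_dummies(looped[var], prefix=var)], axis=1)
    looped = looped.drop(columns=var)

# One-call form: `columns=` encodes the listed columns and removes the originals.
one_call = pd.get_dummies(toy, columns=categorical_vars)

print(list(one_call.columns))
```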
4 Model Building
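- 1 Prepare the training and test sets

    train_kobe = raw[pd.notnull(raw['shot_made_flag'])]
    # take the label before dropping the column; otherwise the lookup raises KeyError
    train_label = train_kobe['shot_made_flag']
    train_kobe = train_kobe.drop(columns='shot_made_flag')
    # rows without a label form the test set
    test_kobe = raw[pd.isnull(raw['shot_made_flag'])]
    test_kobe = test_kobe.drop(columns='shot_made_flag')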
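When separating features from labels, two things are worth checking: the label has to be read before the column is dropped, and the features and labels should share the same index so the rows still line up. A toy sketch with invented values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `raw`; values are invented for illustration.
raw = pd.DataFrame({
    "dist": [5.0, 7.0, 1.5, 22.0],
    "shot_made_flag": [1.0, np.nan, 0.0, np.nan],
})

labeled = raw["shot_made_flag"].notnull()

# Pull the label out first, then drop it from the feature matrix.
train_label = raw.loc[labeled, "shot_made_flag"]
train_kobe = raw.loc[labeled].drop(columns="shot_made_flag")
test_kobe = raw.loc[~labeled].drop(columns="shot_made_flag")

print(train_kobe.shape, test_kobe.shape, train_label.tolist())
```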
- 2 Random forest classification

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold
    from sklearn.metrics import confusion_matrix, log_loss
    import time

    # find the best n_estimators for RandomForestClassifier
    print('Finding best n_estimators for RandomForestClassifier...')
    min_score = 100000
    best_n = 0
    scores_n = []
    range_n = np.logspace(0, 2, num=3).astype(int)   # 1, 10, 100
    for n in range_n:
        print("the number of trees : {0}".format(n))
        t1 = time.time()

        rfc_score = 0.
        rfc = RandomForestClassifier(n_estimators=n)
        # 10-fold cross-validation; split() yields positional train/test indices
        for train_k, test_k in KFold(n_splits=10, shuffle=True).split(train_kobe):
            rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
            #rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/10
            # log_loss expects probabilities, so score predict_proba rather than hard labels
            pred = rfc.predict_proba(train_kobe.iloc[test_k])
            rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
        scores_n.append(rfc_score)
        if rfc_score < min_score:
            min_score = rfc_score
            best_n = n

        t2 = time.time()
        print('Done processing {0} trees ({1:.3f}sec)'.format(n, t2-t1))
    print(best_n, min_score)

    # find best max_depth for RandomForestClassifier
    print('Finding best max_depth for RandomForestClassifier...')
    min_score = 100000
    best_m = 0
    scores_m = []
    range_m = np.logspace(0, 2, num=3).astype(int)   # 1, 10, 100
    for m in range_m:
        print("the max depth : {0}".format(m))
        t1 = time.time()

        rfc_score = 0.
        rfc = RandomForestClassifier(max_depth=m, n_estimators=best_n)
        for train_k, test_k in KFold(n_splits=10, shuffle=True).split(train_kobe):
            rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
            #rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/10
            pred = rfc.predict_proba(train_kobe.iloc[test_k])
            rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
        scores_m.append(rfc_score)
        if rfc_score < min_score:
            min_score = rfc_score
            best_m = m

        t2 = time.time()
        print('Done processing max depth {0} ({1:.3f}sec)'.format(m, t2-t1))
    print(best_m, min_score)
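The score minimized in both loops is log loss, which is defined on predicted probabilities. If hard 0/1 labels are passed instead, every wrong prediction is treated as a probability of 0 for the true class and costs roughly -ln(eps) after clipping. The numpy sketch below illustrates the effect on four made-up examples. The function name `binary_log_loss` and the clipping constant `eps` are assumptions for illustration, not scikit-learn's exact implementation.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted P(y=1)."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = [1, 0, 1, 0]

# Probabilistic predictions: the third example is on the wrong side of 0.5.
soft = [0.8, 0.3, 0.4, 0.1]
# The same predictions thresholded at 0.5 into hard 0/1 labels.
hard = [1, 0, 0, 0]

loss_soft = binary_log_loss(y_true, soft)
loss_hard = binary_log_loss(y_true, hard)
print(loss_soft, loss_hard)
```

The single wrong hard label dominates the average, so scores computed from hard labels mostly reflect the error count scaled by a large constant rather than how well calibrated the model is.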
    # cross-validated score against each hyperparameter
    plt.figure(figsize=(10, 5))
    plt.subplot(121)
    plt.plot(range_n, scores_n)
    plt.ylabel('score')
    plt.xlabel('number of trees')
    plt.subplot(122)
    plt.plot(range_m, scores_m)
    plt.ylabel('score')
    plt.xlabel('max depth')
    # refit on the full training set with the selected hyperparameters
    model = RandomForestClassifier(n_estimators=best_n, max_depth=best_m)
    model.fit(train_kobe, train_label)
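5 Summary
In summary, the four musketeers numpy, pandas, matplotlib, and sklearn together provide powerful support for data analysis and preprocessing.
秦凱新 (Qin Kaixin), Shenzhen, 2018-12-08 14:39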