Python Environment Setup and Basic Data Preprocessing - Big Data ML Sample Set Case Study

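Published by 凱新雲技術社群 on 2019-03-01

Original URL: https://flycode.co/archives/290033

Copyright notice: This technical column is a summary and distillation of the author's (秦凱新) day-to-day work. It draws cases from real commercial environments, and offers tuning advice for commercial applications along with cluster capacity planning. Please keep following this blog series. QQ email: 1120746959@qq.com. Feel free to get in touch for any academic exchange.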

1 Python Environment Setup

Shift + Enter: run the current cell and move to the next one
Ctrl + Enter: run the current cell

2 Python IDE Environment Setup

3 Data Preprocessing

Showing the first few rows

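  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  %matplotlib inline

  from sklearn.ensemble import RandomForestClassifier
  # sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
  from sklearn.model_selection import KFold

  # import data (raw string so the backslashes in the Windows path are kept literally)
  filename = r"C:\ML\MLData\data.csv"
  raw = pd.read_csv(filename)
  print(raw.shape)
  raw.head()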

Showing the last few rows
Removing null values
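
The code for these two steps is not shown in this write-up. A minimal sketch follows, under two assumptions: that "removing null values" means keeping only the rows whose `shot_made_flag` is present, and that the result is the `kobe` frame used in the later snippets. The small `raw` frame here is a made-up stand-in for the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data loaded from the CSV into `raw`
raw = pd.DataFrame({
    "loc_x": [0, 15, -20, 7],
    "loc_y": [5, 30, 12, 44],
    "shot_made_flag": [1.0, np.nan, 0.0, np.nan],
})

# Show the last two rows
print(raw.tail(2))

# Assumption: `kobe` holds only the rows whose label is known
kobe = raw[pd.notnull(raw["shot_made_flag"])]
print(kobe.shape)
```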

Plotting column distributions with matplotlib

  # plt.subplot(211): first digit is the number of rows, second the number of columns
  # transparency (a low alpha lets dense regions show up darker)
  alpha = 0.02
  # overall figure size
  plt.figure(figsize=(10,10))
  # loc_x and loc_y (one row, two columns, first position)
  plt.subplot(121)
  # scatter plot
  plt.scatter(kobe.loc_x, kobe.loc_y, color='r', alpha=alpha)
  plt.title('loc_x and loc_y')
  # lat and lon (one row, two columns, second position)
  plt.subplot(122)
  plt.scatter(kobe.lon, kobe.lat, color='b', alpha=alpha)
  plt.title('lat and lon')

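Angle and polar-coordinate preprocessing

  raw['dist'] = np.sqrt(raw['loc_x']**2 + raw['loc_y']**2)
  loc_x_zero = raw['loc_x'] == 0
  # print(loc_x_zero)
  # start with a float column, then fill it with .loc to avoid chained assignment
  raw['angle'] = 0.0
  raw.loc[~loc_x_zero, 'angle'] = np.arctan(raw.loc[~loc_x_zero, 'loc_y'] / raw.loc[~loc_x_zero, 'loc_x'])
  # loc_x == 0 would divide by zero, so those shots get pi/2 directly
  raw.loc[loc_x_zero, 'angle'] = np.pi / 2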
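
To make the polar conversion concrete, here is a small worked sketch on four made-up shots (the `shots` frame and its values are illustrative, not taken from the data set). It applies the same formulas: `dist` is the Euclidean distance from the origin, and `angle` is `arctan(loc_y / loc_x)`, with `pi/2` used where `loc_x` is zero.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: four shots with court coordinates
shots = pd.DataFrame({"loc_x": [0, 3, -3, 1], "loc_y": [5, 4, 4, 0]})

# Radial distance from the origin
shots["dist"] = np.sqrt(shots["loc_x"] ** 2 + shots["loc_y"] ** 2)

# Angle: pi/2 where loc_x == 0 (y/x is undefined there), arctan(y/x) elsewhere
on_axis = shots["loc_x"] == 0
shots["angle"] = np.pi / 2
shots.loc[~on_axis, "angle"] = np.arctan(
    shots.loc[~on_axis, "loc_y"] / shots.loc[~on_axis, "loc_x"]
)
print(shots)
```

Because `arctan` only sees the ratio, the angle stays within (-pi/2, pi/2], so a shot at (-3, 4) gets a negative angle rather than one in the second quadrant.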

Time processing

  # total time left in the period, in seconds
  raw['remaining_time'] = raw['minutes_remaining'] * 60 + raw['seconds_remaining']

Printing unique attribute values and grouped counts

  # shot types
  print(kobe.action_type.unique())
  print(kobe.combined_shot_type.unique())
  print(kobe.shot_type.unique())
  # grouped counts
  print(kobe.shot_type.value_counts())

Handling special characters column by column

  kobe['season'].unique()

  array(['2000-01', '2001-02', '2002-03', '2003-04', '2004-05', '2005-06',
         '2006-07', '2007-08', '2008-09', '2009-10', '2010-11', '2011-12',
         '2012-13', '2013-14', '2014-15', '2015-16', '1996-97', '1997-98',
         '1998-99', '1999-00'], dtype=object)

  # keep only the two-digit part after the hyphen, as an integer
  raw['season'] = raw['season'].apply(lambda x: int(x.split('-')[1]))
  raw['season'].unique()

  array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 97,
        98, 99,  0], dtype=int64)
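
A short sketch of what the split does, on a made-up `seasons` series. The two-digit result is not chronological (1996-97 becomes 97 while 2000-01 becomes 1), which is harmless when the column is treated as a category. If an ordered numeric value were needed, taking the starting year is one possible alternative; that variant is an illustration and not part of the original steps.

```python
import pandas as pd

# Hypothetical sample of season labels
seasons = pd.Series(["2000-01", "2015-16", "1996-97", "1999-00"])

# Same transformation as in the text: integer after the hyphen
end_year = seasons.apply(lambda s: int(s.split("-")[1]))
print(end_year.tolist())

# Alternative (assumption): four-digit starting year, which sorts chronologically
start_year = seasons.str.split("-").str[0].astype(int)
print(start_year.tolist())
```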

A pandas DataFrame trick (matchup: the two teams playing each other; opponent: the opposing team)

  pd.DataFrame({'matchup': kobe.matchup, 'opponent': kobe.opponent})

Checking whether two attributes are linearly related (computed distance vs. shot distance)

  plt.figure(figsize=(5,5))

  plt.scatter(raw.dist, raw.shot_distance, color='blue')
  plt.title('dist and shot_distance')

Grouping Kobe's shot locations with pandas groupby

  gs = kobe.groupby('shot_zone_area')
  print(kobe['shot_zone_area'].value_counts())
  print(len(gs))

  Center(C)                11289
  Right Side Center(RC)     3981
  Right Side(R)             3859
  Left Side Center(LC)      3364
  Left Side(L)              3132
  Back Court(BC)              72
  Name: shot_zone_area, dtype: int64
  6

Plotting the zone partitions with zip

  import matplotlib.cm as cm
  plt.figure(figsize=(20,10))

  def scatter_plot_by_category(feat):
      alpha = 0.1
      gs = kobe.groupby(feat)
      # one rainbow color per group
      cs = cm.rainbow(np.linspace(0, 1, len(gs)))
      # each g is a (group name, sub-DataFrame) pair, so g[1] is the data
      for g, c in zip(gs, cs):
          plt.scatter(g[1].loc_x, g[1].loc_y, color=c, alpha=alpha)

  # shot_zone_area
  plt.subplot(131)
  scatter_plot_by_category('shot_zone_area')
  plt.title('shot_zone_area')

  # shot_zone_basic
  plt.subplot(132)
  scatter_plot_by_category('shot_zone_basic')
  plt.title('shot_zone_basic')

  # shot_zone_range
  plt.subplot(133)
  scatter_plot_by_category('shot_zone_range')
  plt.title('shot_zone_range')
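
The plotting helper relies on two behaviors: iterating over a groupby object yields (name, sub-frame) pairs, and zip pairs each group with one entry of a second sequence. A minimal sketch of just that pairing, without any plotting, on a made-up `shots` frame (the data and the `summary` dict are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical sample of shots with a zone label
shots = pd.DataFrame({
    "shot_zone_area": ["Center(C)", "Left Side(L)", "Center(C)", "Right Side(R)"],
    "loc_x": [0, -120, 5, 130],
    "loc_y": [10, 40, 60, 35],
})

groups = shots.groupby("shot_zone_area")
# One evenly spaced value per group, as would be fed to a colormap
shades = np.linspace(0, 1, len(groups))

summary = {}
for (name, frame), shade in zip(groups, shades):
    # name is the group key, frame the rows belonging to that group
    summary[name] = (len(frame), float(shade))

print(summary)
```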

Dropping columns

  drops = ['shot_id', 'team_id', 'team_name', 'shot_zone_area', 'shot_zone_range', 'shot_zone_basic',
           'matchup', 'lon', 'lat', 'seconds_remaining', 'minutes_remaining',
           'shot_distance', 'loc_x', 'loc_y', 'game_event_id', 'game_id', 'game_date']
  # the positional axis argument (drop(name, 1)) is no longer accepted by recent pandas
  raw = raw.drop(columns=drops)

One-hot encoding (one column becomes several 0/1 columns; prefix sets the prefix of the new column names)

  print(raw['combined_shot_type'].value_counts())
  pd.get_dummies(raw['combined_shot_type'], prefix='combined_shot_type')[0:2]

  Jump Shot    23485
  Layup         5448
  Dunk          1286
  Tip Shot       184
  Hook Shot      153
  Bank Shot      141
  Name: combined_shot_type, dtype: int64
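
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a made-up four-row series (the values are illustrative). Each distinct value becomes its own indicator column named `<prefix>_<value>`, and every row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical sample of shot types
shot_types = pd.Series(
    ["Jump Shot", "Layup", "Dunk", "Jump Shot"], name="combined_shot_type"
)

# One indicator column per distinct value, named with the given prefix
encoded = pd.get_dummies(shot_types, prefix="combined_shot_type")
print(encoded.columns.tolist())
print(encoded)
```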

After one-hot encoding, concatenate the new columns onto the frame (axis=1) and drop the original column.

  categorical_vars = ['action_type', 'combined_shot_type', 'shot_type', 'opponent', 'period', 'season']
  for var in categorical_vars:
      raw = pd.concat([raw, pd.get_dummies(raw[var], prefix=var)], axis=1)
      raw = raw.drop(columns=var)

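4 Building the Model

1 Preparing the training and test sets

  # rows with a known label form the training set
  train_kobe = raw[pd.notnull(raw['shot_made_flag'])]
  # take the label before dropping its column
  train_label = train_kobe['shot_made_flag']
  train_kobe = train_kobe.drop(columns='shot_made_flag')
  # rows with a missing label form the test set
  test_kobe = raw[pd.isnull(raw['shot_made_flag'])]
  test_kobe = test_kobe.drop(columns='shot_made_flag')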

2 Random forest classification

  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import KFold
  from sklearn.metrics import confusion_matrix, log_loss
  import time

  # 10-fold cross-validation splitter (model_selection API)
  kf = KFold(n_splits=10, shuffle=True)

  # find the best n_estimators for RandomForestClassifier
  print('Finding best n_estimators for RandomForestClassifier...')
  min_score = 100000
  best_n = 0
  scores_n = []
  range_n = np.logspace(0, 2, num=3).astype(int)
  for n in range_n:
      print("the number of trees : {0}".format(n))
      t1 = time.time()

      rfc_score = 0.
      rfc = RandomForestClassifier(n_estimators=n)
      for train_k, test_k in kf.split(train_kobe):
          rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
          #rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/10
          pred = rfc.predict(train_kobe.iloc[test_k])
          # average log loss over the 10 folds (lower is better)
          rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
      scores_n.append(rfc_score)
      if rfc_score < min_score:
          min_score = rfc_score
          best_n = n

      t2 = time.time()
      print('Done processing {0} trees ({1:.3f}sec)'.format(n, t2-t1))
  print(best_n, min_score)

  # find best max_depth for RandomForestClassifier
  print('Finding best max_depth for RandomForestClassifier...')
  min_score = 100000
  best_m = 0
  scores_m = []
  range_m = np.logspace(0, 2, num=3).astype(int)
  for m in range_m:
      print("the max depth : {0}".format(m))
      t1 = time.time()

      rfc_score = 0.
      rfc = RandomForestClassifier(max_depth=m, n_estimators=best_n)
      for train_k, test_k in kf.split(train_kobe):
          rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
          #rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/10
          pred = rfc.predict(train_kobe.iloc[test_k])
          rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
      scores_m.append(rfc_score)
      if rfc_score < min_score:
          min_score = rfc_score
          best_m = m

      t2 = time.time()
      print('Done processing max depth {0} ({1:.3f}sec)'.format(m, t2-t1))
  print(best_m, min_score)
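
The search above scores hard 0/1 predictions with log loss. Log loss is defined on predicted probabilities, so a variant that scores `predict_proba` output may give a more informative comparison. The sketch below shows one way to do that with `cross_val_score` and the `"neg_log_loss"` scorer. It runs on synthetic data from `make_classification` as a stand-in for the shot features, and the candidate list, fold count, and fixed `max_depth` are illustrative choices, not values from the original run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature matrix and labels (not the real data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = [1, 10, 50]  # illustrative tree counts
mean_losses = []
for n in candidates:
    clf = RandomForestClassifier(n_estimators=n, max_depth=5, random_state=0)
    # "neg_log_loss" evaluates predicted probabilities; negate to recover the loss
    scores = cross_val_score(clf, X, y, cv=cv, scoring="neg_log_loss")
    mean_losses.append(-scores.mean())

# Pick the tree count with the lowest mean log loss
best_n = candidates[int(np.argmin(mean_losses))]
print(best_n, min(mean_losses))
```

The same loop could be repeated over `max_depth` with `n_estimators` fixed at the chosen value, mirroring the two-stage search in the text.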

  plt.figure(figsize=(10,5))
  plt.subplot(121)
  plt.plot(range_n, scores_n)
  plt.ylabel('score')
  plt.xlabel('number of trees')

  plt.subplot(122)
  plt.plot(range_m, scores_m)
  plt.ylabel('score')
  plt.xlabel('max depth')

  # refit on the full training set with the best parameters from the search
  model = RandomForestClassifier(n_estimators=best_n, max_depth=best_m)
  model.fit(train_kobe, train_label)

5 Summary

In summary, numpy, pandas, matplotlib, and sklearn together form a powerful toolkit for data analysis and preprocessing.

秦凱新, Shenzhen, 2018-12-08 14:39
