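Copyright notice: This technical column is the author's (Qin Kaixin) summary and distillation of everyday work. It summarizes and shares cases drawn from real business environments, and offers tuning advice for commercial applications, cluster capacity planning, and related topics. Please keep following this blog series. QQ email: 1120746959@qq.com. Feel free to get in touch for any academic exchange.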
1 Data preprocessing
- Generating time series data
import datetime as dt

import numpy as np
import pandas as pd

# date_range: specify a start time and a number of periods.
# Frequency codes: H = hour, D = day, M = month.
# Date strings can be written as, e.g.: 2016 Jul 1, 7/1/2016, 1/7/2016, 2016-07-01, 2016/07/01
rng = pd.date_range('2016-07-01', periods=10, freq='3D')
rng

DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10',
               '2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22',
               '2016-07-25', '2016-07-28'],
              dtype='datetime64[ns]', freq='3D')

# 20 random values indexed by consecutive days starting 2016-01-01
time = pd.Series(np.random.randn(20), index=pd.date_range(dt.datetime(2016, 1, 1), periods=20))
print(time)

2016-01-01   -0.129379
2016-01-02    0.164480
2016-01-03   -0.639117
2016-01-04   -0.427224
2016-01-05    2.055133
2016-01-06    1.116075
2016-01-07    0.357426
2016-01-08    0.274249
2016-01-09    0.834405
2016-01-10   -0.005444
2016-01-11   -0.134409
2016-01-12    0.249318
2016-01-13   -0.297842
2016-01-14   -0.128514
2016-01-15    0.063690
2016-01-16   -2.246031
2016-01-17    0.359552
2016-01-18    0.383030
2016-01-19    0.402717
2016-01-20   -0.694068
Freq: D, dtype: float64
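As a small check on how `periods` and `freq` interact, the sketch below builds the same 3-day grid twice, once by count and once by end date. The variable names `by_count` and `by_end` are illustrative only.

```python
import pandas as pd

# Same 3-day grid built two ways: by number of periods and by end date.
by_count = pd.date_range("2016-07-01", periods=10, freq="3D")
by_end = pd.date_range("2016-07-01", end="2016-07-28", freq="3D")

print(by_count.equals(by_end))   # both describe the same ten dates
print(by_count[-1])              # last date on the grid
```

Fixing `periods` is convenient when the number of observations is known; fixing `end` is convenient when the calendar range is known.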
- Filtering with truncate
# Keep only the rows on or after 2016-01-10
time.truncate(before='2016-1-10')

2016-01-10   -0.005444
2016-01-11   -0.134409
2016-01-12    0.249318
2016-01-13   -0.297842
2016-01-14   -0.128514
2016-01-15    0.063690
2016-01-16   -2.246031
2016-01-17    0.359552
2016-01-18    0.383030
2016-01-19    0.402717
2016-01-20   -0.694068
Freq: D, dtype: float64

# Keep only the rows on or before 2016-01-10
time.truncate(after='2016-1-10')

2016-01-01   -0.129379
2016-01-02    0.164480
2016-01-03   -0.639117
2016-01-04   -0.427224
2016-01-05    2.055133
2016-01-06    1.116075
2016-01-07    0.357426
2016-01-08    0.274249
2016-01-09    0.834405
2016-01-10   -0.005444
Freq: D, dtype: float64

# Slice with date strings; both endpoints are included
print(time['2016-01-15':'2016-01-20'])

2016-01-15    0.063690
2016-01-16   -2.246031
2016-01-17    0.359552
2016-01-18    0.383030
2016-01-19    0.402717
2016-01-20   -0.694068
Freq: D, dtype: float64

# Month-end dates between a start and an end date
data = pd.date_range('2010-01-01', '2011-01-01', freq='M')
print(data)

DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
               '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31',
               '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'],
              dtype='datetime64[ns]', freq='M')

# Specify the index
rng = pd.date_range('2016 Jul 1', periods=10, freq='D')
rng
pd.Series(range(len(rng)), index=rng)

2016-07-01    0
2016-07-02    1
2016-07-03    2
2016-07-04    3
2016-07-05    4
2016-07-06    5
2016-07-07    6
2016-07-08    7
2016-07-09    8
2016-07-10    9
Freq: D, dtype: int32
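The relationship between `truncate` and label slicing can be checked on a deterministic series. This is a minimal sketch; `s`, `head`, and `tail` are illustrative names, and the values are simple integers rather than the random data used above.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(20), index=pd.date_range("2016-01-01", periods=20, freq="D"))

head = s.truncate(after="2016-01-10")    # 2016-01-01 .. 2016-01-10, endpoint included
tail = s.truncate(before="2016-01-10")   # 2016-01-10 .. 2016-01-20, endpoint included

# Label slicing with date strings is inclusive on both ends as well,
# so it selects the same rows as truncate.
print(head.equals(s.loc[:"2016-01-10"]))
print(tail.equals(s.loc["2016-01-10":]))
```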
- Specifying the index
# Build a Series indexed by monthly periods instead of timestamps
periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')]
ts = pd.Series(np.random.randn(len(periods)), index=periods)
ts
- Timestamps and time periods can be converted into each other
# Hourly timestamps starting 2016-07-10 08:00
ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods=10, freq='H'))
ts

2016-07-10 08:00:00    0
2016-07-10 09:00:00    1
2016-07-10 10:00:00    2
2016-07-10 11:00:00    3
2016-07-10 12:00:00    4
2016-07-10 13:00:00    5
2016-07-10 14:00:00    6
2016-07-10 15:00:00    7
2016-07-10 16:00:00    8
2016-07-10 17:00:00    9
Freq: H, dtype: int32

# Convert the timestamp index to a period index
ts_period = ts.to_period()
ts_period

2016-07-10 08:00    0
2016-07-10 09:00    1
2016-07-10 10:00    2
2016-07-10 11:00    3
2016-07-10 12:00    4
2016-07-10 13:00    5
2016-07-10 14:00    6
2016-07-10 15:00    7
2016-07-10 16:00    8
2016-07-10 17:00    9
Freq: H, dtype: int32

# Period slicing: 08:30 falls inside the 08:00 period, so that period is included
ts_period['2016-07-10 08:30':'2016-07-10 11:45']

2016-07-10 08:00    0
2016-07-10 09:00    1
2016-07-10 10:00    2
2016-07-10 11:00    3
Freq: H, dtype: int32

# Timestamp slicing: only instants between 08:30 and 11:45 are selected
ts['2016-07-10 08:30':'2016-07-10 11:45']

2016-07-10 09:00:00    1
2016-07-10 10:00:00    2
2016-07-10 11:00:00    3
Freq: H, dtype: int32
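The conversion works in both directions. The sketch below uses a small daily series (an assumption made to keep the example short; the variable names are illustrative) and round-trips it through `to_period` and `to_timestamp`.

```python
import pandas as pd

stamps = pd.Series(range(4), index=pd.date_range("2016-07-10", periods=4, freq="D"))

periods = stamps.to_period()      # DatetimeIndex -> PeriodIndex (daily periods)
back = periods.to_timestamp()     # PeriodIndex -> DatetimeIndex (start of each period)

print(type(periods.index).__name__)
print(back.index.equals(stamps.index))
```

By default `to_timestamp` maps each period to its start, so a round trip on timestamps that already sit at period starts returns the original index.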
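2 Data resampling
- Resampling converts time data from one frequency to another
- Downsampling (to a lower frequency)
- Upsampling (to a higher frequency)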
rng = pd.date_range('1/1/2011', periods=90, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.head()

2011-01-01   -1.025562
2011-01-02    0.410895
2011-01-03    0.660311
2011-01-04    0.710293
2011-01-05    0.444985
Freq: D, dtype: float64

# Downsampling: daily values summed per month
ts.resample('M').sum()

2011-01-31    2.510102
2011-02-28    0.583209
2011-03-31    2.749411
Freq: M, dtype: float64

# Downsampling: daily values summed per 3-day bin
ts.resample('3D').sum()

2011-01-01    0.045643
2011-01-04   -2.255206
2011-01-07    0.571142
2011-01-10    0.835032
2011-01-13   -0.396766
2011-01-16   -1.156253
2011-01-19   -1.286884
2011-01-22    2.883952
2011-01-25    1.566908
2011-01-28    1.435563
2011-01-31    0.311565
2011-02-03   -2.541235
2011-02-06    0.317075
2011-02-09    1.598877
2011-02-12   -1.950509
2011-02-15    2.928312
2011-02-18   -0.733715
2011-02-21    1.674817
2011-02-24   -2.078872
2011-02-27    2.172320
2011-03-02   -2.022104
2011-03-05   -0.070356
2011-03-08    1.276671
2011-03-11   -2.835132
2011-03-14   -1.384113
2011-03-17    1.517565
2011-03-20   -0.550406
2011-03-23    0.773430
2011-03-26    2.244319
2011-03-29    2.951082
Freq: 3D, dtype: float64

# Downsampling: mean per 3-day bin
day3Ts = ts.resample('3D').mean()
day3Ts

2011-01-01    0.015214
2011-01-04   -0.751735
2011-01-07    0.190381
2011-01-10    0.278344
2011-01-13   -0.132255
2011-01-16   -0.385418
2011-01-19   -0.428961
2011-01-22    0.961317
2011-01-25    0.522303
2011-01-28    0.478521
2011-01-31    0.103855
2011-02-03   -0.847078
2011-02-06    0.105692
2011-02-09    0.532959
2011-02-12   -0.650170
2011-02-15    0.976104
2011-02-18   -0.244572
2011-02-21    0.558272
2011-02-24   -0.692957
2011-02-27    0.724107
2011-03-02   -0.674035
2011-03-05   -0.023452
2011-03-08    0.425557
2011-03-11   -0.945044
2011-03-14   -0.461371
2011-03-17    0.505855
2011-03-20   -0.183469
2011-03-23    0.257810
2011-03-26    0.748106
2011-03-29    0.983694
Freq: 3D, dtype: float64

# Upsampling: 3-day means back to daily frequency; the newly created days are NaN
print(day3Ts.resample('D').asfreq())

2011-01-01    0.015214
2011-01-02         NaN
2011-01-03         NaN
2011-01-04   -0.751735
2011-01-05         NaN
2011-01-06         NaN
2011-01-07    0.190381
2011-01-08         NaN
2011-01-09         NaN
2011-01-10    0.278344
2011-01-11         NaN
2011-01-12         NaN
2011-01-13   -0.132255
2011-01-14         NaN
2011-01-15         NaN
2011-01-16   -0.385418
2011-01-17         NaN
2011-01-18         NaN
2011-01-19   -0.428961
2011-01-20         NaN
2011-01-21         NaN
2011-01-22    0.961317
Freq: D, Length: 88, dtype: float64
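The two directions are easier to follow on values that can be checked by hand. The following is a minimal sketch with a nine-day series of consecutive integers (an assumption for illustration; `daily`, `down`, and `up` are made-up names).

```python
import numpy as np
import pandas as pd

daily = pd.Series(np.arange(9, dtype=float),
                  index=pd.date_range("2011-01-01", periods=9, freq="D"))

# Downsampling: nine daily values collapse into three 3-day means.
down = daily.resample("3D").mean()

# Upsampling: the 3-day series is stretched back onto a daily grid;
# days that have no source value become NaN.
up = down.resample("D").asfreq()

print(down)
print(up)
```

Downsampling needs an aggregation (here `mean`), while upsampling needs a rule for filling the gaps, which is where the fill methods come in.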
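- ffill: fill a missing value with the preceding value
- bfill: fill a missing value with the following value
- interpolate: fill missing values by linear interpolation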
# Forward fill, at most one consecutive missing value
day3Ts.resample('D').ffill(limit=1)

2011-01-01    0.015214
2011-01-02    0.015214
2011-01-03         NaN
2011-01-04   -0.751735
2011-01-05   -0.751735
2011-01-06         NaN
2011-01-07    0.190381
2011-01-08    0.190381
2011-01-09         NaN
2011-01-10    0.278344
2011-01-11    0.278344

# Backward fill, at most one consecutive missing value
day3Ts.resample('D').bfill(limit=1)

2011-01-01    0.015214
2011-01-02         NaN
2011-01-03   -0.751735
2011-01-04   -0.751735
2011-01-05         NaN
2011-01-06    0.190381
2011-01-07    0.190381
2011-01-08         NaN
2011-01-09    0.278344
2011-01-10    0.278344
2011-01-11         NaN
2011-01-12   -0.132255
2011-01-13   -0.132255

# Linear interpolation between the known points
day3Ts.resample('D').interpolate('linear')

2011-01-01    0.015214
2011-01-02   -0.240435
2011-01-03   -0.496085
2011-01-04   -0.751735
2011-01-05   -0.437697
2011-01-06   -0.123658
2011-01-07    0.190381
2011-01-08    0.219702
2011-01-09    0.249023
2011-01-10    0.278344
2011-01-11    0.141478
2011-01-12    0.004611
2011-01-13   -0.132255
2011-01-14   -0.216643
2011-01-15   -0.301030
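The three fill strategies can be compared side by side on a tiny series whose results are easy to verify by hand. This is a sketch under the assumption of three known points spaced three days apart; the names `sparse`, `ffilled`, `bfilled`, and `linear` are illustrative.

```python
import pandas as pd

# Known values on 2011-01-01, 2011-01-04 and 2011-01-07.
sparse = pd.Series([0.0, 3.0, 6.0],
                   index=pd.date_range("2011-01-01", periods=3, freq="3D"))

ffilled = sparse.resample("D").ffill(limit=1)        # copy forward one step only
bfilled = sparse.resample("D").bfill(limit=1)        # copy backward one step only
linear = sparse.resample("D").interpolate("linear")  # straight line between points

print(pd.DataFrame({"ffill": ffilled, "bfill": bfilled, "linear": linear}))
```

With `limit=1` each gap of two days keeps one NaN, whereas linear interpolation fills every day.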
3 Sliding window
- Sliding-window computation
%matplotlib inline
import matplotlib.pylab
import numpy as np
import pandas as pd

# 600 random daily values starting 2016-07-01
df = pd.Series(np.random.randn(600), index=pd.date_range('7/1/2016', freq='D', periods=600))
df.head()

2016-07-01   -0.192140
2016-07-02    0.357953
2016-07-03   -0.201847
2016-07-04   -0.372230
2016-07-05    1.414753
Freq: D, dtype: float64

# 10-day sliding window; the first 9 results are NaN because the window is not yet full
r = df.rolling(window=10)
# Also available: r.max, r.median, r.std, r.skew, r.sum, r.var
print(r.mean())

2016-07-01         NaN
2016-07-02         NaN
2016-07-03         NaN
2016-07-04         NaN
2016-07-05         NaN
2016-07-06         NaN
2016-07-07         NaN
2016-07-08         NaN
2016-07-09         NaN
2016-07-10    0.300133
2016-07-11    0.284780
2016-07-12    0.252831
2016-07-13    0.220699
2016-07-14    0.167137
2016-07-15    0.018593
2016-07-16   -0.061414
2016-07-17   -0.134593
2016-07-18   -0.153333
2016-07-19   -0.218928
2016-07-20   -0.169426
2016-07-21   -0.219747
2016-07-22   -0.181266
2016-07-23   -0.173674
2016-07-24   -0.130629
2016-07-25   -0.166730
2016-07-26   -0.233044
2016-07-27   -0.256642
2016-07-28   -0.280738
2016-07-29   -0.289893
2016-07-30   -0.379625
                ...
2018-01-22   -0.211467
2018-01-23    0.034996
2018-01-24   -0.105910
2018-01-25   -0.145774
2018-01-26   -0.089320
2018-01-27   -0.164370
2018-01-28   -0.110892
2018-01-29   -0.205786
2018-01-30   -0.101162
2018-01-31   -0.034760
2018-02-01    0.229333
2018-02-02    0.043741
2018-02-03    0.052837
2018-02-04    0.057746
2018-02-05   -0.071401
2018-02-06   -0.011153
2018-02-07   -0.045737
2018-02-08   -0.021983
2018-02-09   -0.196715
2018-02-10   -0.063721
2018-02-11   -0.289452
2018-02-12   -0.050946
2018-02-13   -0.047014
2018-02-14    0.048754
2018-02-15    0.143949
2018-02-16    0.424823
2018-02-17    0.361878
2018-02-18    0.363235
2018-02-19    0.517436
2018-02-20    0.368020
Freq: D, Length: 600, dtype: float64
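The leading NaN values come from the window not being full yet. A minimal sketch on five values shows this, together with the `min_periods` argument of `rolling`, which relaxes the requirement; the names `strict` and `relaxed` are illustrative.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a result is produced only once 3 observations are available.
strict = s.rolling(window=3).mean()

# min_periods=1: average whatever is available in the window so far.
relaxed = s.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"strict": strict, "relaxed": relaxed}))
```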
- Visualization
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(15, 5))
df.plot(style='r--')                          # raw series, red dashed line
df.rolling(window=10).mean().plot(style='b')  # 10-day rolling mean, blue line
4 ARIMA forecasting
- Data preprocessing
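import datetime

import matplotlib.pylab as plt
import pandas as pd
import pandas_datareader
import seaborn as sns
from matplotlib.pylab import style
# Recent statsmodels releases no longer support this module;
# the replacement class is statsmodels.tsa.arima.model.ARIMA
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['SimHei']  # font able to render Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # keep minus signs readable with that font

# Load the data, using the first column as a parsed date index
stockFile = 'data/T10yr.csv'
stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0])
stock.head(10)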
stock_week = stock['Close'].resample('W-MON').mean()
stock_train = stock_week['2000':'2015']
stock_train.plot(figsize=(12,8))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title("Stock Close")
sns.despine()
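The `'W-MON'` frequency means weekly bins that end on, and are labelled by, a Monday. The sketch below checks this on two weeks of made-up daily integers (the data and the names `daily` and `weekly` are assumptions, not the stock file).

```python
import pandas as pd

# 14 daily values starting on Tuesday 2016-07-05.
daily = pd.Series(range(14), index=pd.date_range("2016-07-05", periods=14, freq="D"))

# Each bin covers Tuesday through the following Monday and carries the Monday label.
weekly = daily.resample("W-MON").mean()

print(weekly)
print(weekly.index.dayofweek.tolist())   # Monday is 0
```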
stock_diff = stock_train.diff()
stock_diff = stock_diff.dropna()
plt.figure()
plt.plot(stock_diff)
plt.title('First-order difference')
plt.show()
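First-order differencing replaces each value with its change from the previous one, and the original levels can be recovered from the first value plus the cumulative sum of the differences. A minimal sketch on four made-up values (all names are illustrative):

```python
import pandas as pd

levels = pd.Series([2.0, 2.5, 2.4, 3.0])

# diff() gives x[t] - x[t-1]; the first entry has no predecessor and is dropped.
diff = levels.diff().dropna()

# Undo the differencing: start value plus running sum of the changes.
rebuilt = levels.iloc[0] + diff.cumsum()

print(diff.tolist())
print(rebuilt.tolist())
```

This is the same transformation the `d=1` term of an ARIMA model applies internally before fitting the AR and MA parts.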
acf = plot_acf(stock_diff, lags=20)
plt.title("ACF")
acf.show()
pacf = plot_pacf(stock_diff, lags=20)
plt.title("PACF")
pacf.show()
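To make the ACF plot less of a black box, here is a small NumPy sketch of the usual sample autocorrelation estimate (lagged products of the demeaned series divided by the total sum of squares). `sample_acf` is a hypothetical helper written for this note, not a statsmodels function, and the alternating input series is made up.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()            # work with deviations from the mean
    denom = np.dot(x, x)        # lag-0 sum of squares
    acf = [1.0]
    for k in range(1, nlags + 1):
        # Sum of products of values k steps apart, scaled by the lag-0 term.
        acf.append(np.dot(x[:-k], x[k:]) / denom)
    return np.array(acf)

# A strictly alternating series is negatively correlated at odd lags
# and positively correlated at even lags.
values = sample_acf([1, -1, 1, -1, 1, -1, 1, -1], nlags=2)
print(values)
```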
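# Fit an ARIMA model with order (p=1, d=1, q=1) on the weekly training series
model = ARIMA(stock_train, order=(1, 1, 1), freq='W-MON')
result = model.fit()
# print(result.summary())
# dynamic=True: later forecasts build on earlier forecasts instead of observed values
# typ='levels': return predictions on the original scale rather than as differences
pred = result.predict('20140609', '20160701', dynamic=True, typ='levels')
print(pred)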
2014-06-09 2.463559
2014-06-16 2.455539
2014-06-23 2.449569
2014-06-30 2.444183
2014-07-07 2.438962
2014-07-14 2.433788
2014-07-21 2.428627
2014-07-28 2.423470
2014-08-04 2.418315
2014-08-11 2.413159
2014-08-18 2.408004
2014-08-25 2.402849
2014-09-01 2.397693
2014-09-08 2.392538
2014-09-15 2.387383
plt.figure(figsize=(6, 6))
plt.xticks(rotation=45)
plt.plot(pred)
plt.plot(stock_train)
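For reference, a model with order=(1, 1, 1) is usually written as follows. The symbols ($y_t$ for the weekly series, $w_t$ for its first difference, $\varepsilon_t$ for the noise term, and the coefficients $c$, $\phi_1$, $\theta_1$) are notation introduced here rather than names from the code, and whether the constant $c$ is estimated depends on the library settings.

```latex
\begin{aligned}
w_t &= y_t - y_{t-1} && \text{($d = 1$: first difference)} \\
w_t &= c + \phi_1 w_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1} && \text{($p = 1$ AR term, $q = 1$ MA term)}
\end{aligned}
```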
5 Summary
These are notes compiled for my own review. The content is rough, so please bear with me.