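Advanced Pandas Tutorial: Window Operations

Posted by flydean on 2021-07-19

Original article: https://www.cnblogs.com/flydean/p/15028715.html

Introduction

Data analysis often requires computing statistics over a range of values. Such a range is called a window. Pandas provides the rolling method, which slides a window across the data and computes a statistic at each position.

This article explores how to use windows with rolling.

Rolling window

Suppose we have five numbers and want the sum of every two consecutive values. This can be done as follows (the examples assume `import numpy as np` and `import pandas as pd`):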

In [1]: s = pd.Series(range(5))

In [2]: s.rolling(window=2).sum()
Out[2]: 
0    NaN
1    1.0
2    3.0
3    5.0
4    7.0
dtype: float64

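A rolling object can also be iterated with a for loop, which yields one window at a time. Note that the first window holds only one value: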

In [3]: for window in s.rolling(window=2):
   ...:     print(window)
   ...: 
0    0
dtype: int64
0    0
1    1
dtype: int64
1    1
2    2
dtype: int64
2    2
3    3
dtype: int64
3    3
4    4
dtype: int64
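
To connect the iteration above with the aggregated result, here is a minimal sketch that sums each iterated window by hand and compares it with the built-in rolling sum. The names `manual` and `builtin` are illustrative only, and the rule "incomplete windows become NaN" is written out explicitly to mirror the default behaviour for a fixed-size window.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5))

# Sum each window by hand. Windows shorter than the window size map to NaN,
# mirroring the default min_periods (equal to the window size).
manual = pd.Series(
    [w.sum() if len(w) == 2 else np.nan for w in s.rolling(window=2)],
    index=s.index,
    dtype="float64",
)
builtin = s.rolling(window=2).sum()

print(manual.equals(builtin))
```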

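Pandas supports four kinds of window operations, defined as follows:

Concept	Method	Returned object	Supports time-based windows	Supports chained groupby
Fixed or variable sliding window	`rolling`	`Rolling`	Yes	Yes
Weighted, non-rectangular window supplied by scipy.signal	`rolling`	`Window`	No	No
Accumulating window over the values	`expanding`	`Expanding`	No	Yes
Accumulating, exponentially weighted window over the values	`ewm`	`ExponentialMovingWindow`	No	Yes (as of version 1.2)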
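
As a quick check of the table, the following sketch prints the class name of the object each method returns and runs one chained groupby call. The weighted variant is left out here because it needs SciPy installed; the column names `key` and `val` are made up for illustration.

```python
import pandas as pd

s = pd.Series(range(5))

# Class names of the window objects returned by each method.
names = {
    "rolling": type(s.rolling(window=2)).__name__,
    "expanding": type(s.expanding()).__name__,
    "ewm": type(s.ewm(alpha=0.5)).__name__,
}
print(names)

# Chained groupby: the rolling sum restarts inside each group.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1, 2, 3, 4]})
grouped = df.groupby("key")["val"].rolling(window=2).sum()
print(grouped)
```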

Here is an example of time-based rolling:

In [4]: s = pd.Series(range(5), index=pd.date_range('2020-01-01', periods=5, freq='1D'))

In [5]: s.rolling(window='2D').sum()
Out[5]: 
2020-01-01    0.0
2020-01-02    1.0
2020-01-03    3.0
2020-01-04    5.0
2020-01-05    7.0
Freq: D, dtype: float64
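
With evenly spaced dates a '2D' window looks just like window=2. The difference shows up on irregular timestamps, where the number of rows per window varies. The sketch below uses made-up dates with gaps; it relies on the default behaviour in which each window covers the two days ending at (and including) the current timestamp.

```python
import pandas as pd

# Irregular dates: there are gaps after 2020-01-02 and 2020-01-05.
idx = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-05", "2020-01-08"]
)
s = pd.Series(range(5), index=idx)

# Each window spans 2 days of time, so it may hold one or two rows.
result = s.rolling(window="2D").sum()
print(result)
```

Rows that follow a gap (2020-01-04 and 2020-01-08) end up with a window containing only themselves.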

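min_periods sets the minimum number of non-NaN observations a window must contain to produce a value. Windows with fewer valid observations yield NaN: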

In [8]: s = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])

In [9]: s.rolling(window=3, min_periods=1).sum()
Out[9]: 
0    NaN
1    1.0
2    3.0
3    3.0
4    2.0
5    3.0
dtype: float64

In [10]: s.rolling(window=3, min_periods=2).sum()
Out[10]: 
0    NaN
1    NaN
2    3.0
3    3.0
4    NaN
5    NaN
dtype: float64

# Equivalent to min_periods=3
In [11]: s.rolling(window=3, min_periods=None).sum()
Out[11]: 
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
dtype: float64
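
One way to see why each position is or is not NaN is to count the valid observations per window. The following is a small sketch of that idea; the helper series `valid` is an illustration, not something pandas computes for you.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])

# Count the non-NaN observations in each window of size 3.
# notna() yields booleans, so the rolling sum over them is a count.
valid = s.notna().astype(int).rolling(window=3, min_periods=1).sum()

summed = s.rolling(window=3, min_periods=2).sum()

# The sum is NaN exactly where fewer than min_periods=2 values are valid.
print(pd.DataFrame({"valid": valid, "sum": summed}))
```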

Center window

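By default, each result is labeled at the right edge of its window. With window=5, for example, positions 0, 1, 2 and 3 are NaN because their windows hold fewer than 5 values.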

In [19]: s = pd.Series(range(10))

In [20]: s.rolling(window=5).mean()
Out[20]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

This can be changed: setting center=True labels each result at the center of its window instead:

In [21]: s.rolling(window=5, center=True).mean()
Out[21]: 
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    NaN
9    NaN
dtype: float64
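
For an odd window size, the centered result is the right-aligned result moved back by half a window. The sketch below checks this relationship on the same data; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(range(10))
window = 5

trailing = s.rolling(window=window).mean()
centered = s.rolling(window=window, center=True).mean()

# For an odd window, the centered label sits window // 2 positions
# to the left of the trailing (right-edge) label.
shifted = trailing.shift(-(window // 2))
print(shifted.equals(centered))
```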

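Weighted window

The win_type argument selects the type of weighted window. It must be a window type from scipy.signal.

Here are a few examples. Because the input is linear and these weights are symmetric, the weighted means coincide with the unweighted mean: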

In [47]: s = pd.Series(range(10))

In [48]: s.rolling(window=5).mean()
Out[48]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

In [49]: s.rolling(window=5, win_type="triang").mean()
Out[49]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

# Supplementary Scipy arguments passed in the aggregation function
In [50]: s.rolling(window=5, win_type="gaussian").mean(std=0.1)
Out[50]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64
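
On non-linear data the weighting does change the result. The sketch below reproduces a weighted mean by hand with rolling.apply, assuming the triangular weights for a 5-point window are proportional to 1, 2, 3, 2, 1. That is what scipy.signal.windows.triang(5) is expected to produce, but the weights are hard-coded here so the sketch runs without SciPy; treat them as an assumption.

```python
import numpy as np
import pandas as pd

# Assumed triangular weights for a 5-point window (symmetric, peak in the middle).
weights = np.array([1, 2, 3, 2, 1]) / 3.0

s = pd.Series([0, 1, 4, 9, 16, 25], dtype="float64")  # non-linear data

# Weighted mean of each full window: sum(w * x) / sum(w).
weighted = s.rolling(window=5).apply(
    lambda x: np.average(x, weights=weights), raw=True
)
plain = s.rolling(window=5).mean()

print(pd.DataFrame({"plain": plain, "weighted": weighted}))
```

The weighted mean is pulled toward the middle of each window, so on this convex data it comes out lower than the plain mean.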

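Expanding window

An expanding window yields an aggregate at each point that covers all data available up to that point. It is equivalent to a rolling window as long as the data with min_periods=1: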

In [51]: df = pd.DataFrame(range(5))

In [52]: df.rolling(window=len(df), min_periods=1).mean()
Out[52]: 
     0
0  0.0
1  0.5
2  1.0
3  1.5
4  2.0

In [53]: df.expanding(min_periods=1).mean()
Out[53]: 
     0
0  0.0
1  0.5
2  1.0
3  1.5
4  2.0
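
Since an expanding window always starts at the first row, its aggregates match the familiar cumulative operations. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

exp_sum = s.expanding().sum()
exp_mean = s.expanding().mean()
exp_max = s.expanding().max()

# Running mean written out as cumulative sum divided by the row count so far.
running_mean = s.cumsum() / np.arange(1, len(s) + 1)

print(pd.DataFrame({"sum": exp_sum, "mean": exp_mean, "max": exp_max}))
```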

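Exponentially weighted window

An exponentially weighted window is similar to an expanding window, but each earlier point is weighted exponentially less than the current point.

The weighted average is computed as:

\(y_t = \frac{\sum_{i=0}^{t} w_i x_{t-i}}{\sum_{i=0}^{t} w_i}\)

where \(x_t\) is the input, \(y_t\) is the output and \(w_i\) are the weights.

EW has two modes. With adjust=True, the weights are \(w_i = (1-\alpha)^i\).

With adjust=False, the result is computed recursively:

\[\begin{aligned} y_0 &= x_0 \\ y_t &= (1-\alpha) y_{t-1} + \alpha x_t \end{aligned}\]

Here \(0 < \alpha \le 1\), and \(\alpha\) is derived from whichever of the following parameters is specified:

\[\alpha = \begin{cases} \dfrac{2}{s+1}, & \text{span } s \ge 1 \\ \dfrac{1}{1+c}, & \text{center of mass } c \ge 0 \\ 1 - \exp\left(\dfrac{\log 0.5}{h}\right), & \text{half-life } h > 0 \end{cases}\]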
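
The formulas can be checked numerically. The sketch below implements both modes by hand and compares them with pandas, then confirms that span=3, com=1 and halflife=1 all correspond to a smoothing factor of 0.5 (called `alpha` in pandas). The helper functions `ewm_adjusted` and `ewm_recursive` are illustrative names, and the input contains no NaN values, which keeps the comparison simple.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
alpha = 0.5

def ewm_adjusted(x, alpha):
    """adjust=True: weighted average with weights w_i = (1 - alpha) ** i."""
    out = []
    for t in range(len(x)):
        w = (1 - alpha) ** np.arange(t + 1)  # w_0 .. w_t
        # x[t::-1] is x_t, x_{t-1}, ..., x_0, so w_i pairs with x_{t-i}.
        out.append(np.sum(w * x[t::-1]) / np.sum(w))
    return np.array(out)

def ewm_recursive(x, alpha):
    """adjust=False: y_0 = x_0, y_t = (1 - alpha) * y_{t-1} + alpha * x_t."""
    y = [x[0]]
    for t in range(1, len(x)):
        y.append((1 - alpha) * y[-1] + alpha * x[t])
    return np.array(y)

s = pd.Series(x)
adj = s.ewm(alpha=alpha, adjust=True).mean()
rec = s.ewm(alpha=alpha, adjust=False).mean()

print(np.allclose(ewm_adjusted(x, alpha), adj))
print(np.allclose(ewm_recursive(x, alpha), rec))

# Three parameterizations that all map to alpha = 0.5.
by_span = s.ewm(span=3).mean()          # 2 / (3 + 1)
by_com = s.ewm(com=1).mean()            # 1 / (1 + 1)
by_halflife = s.ewm(halflife=1).mean()  # 1 - exp(log(0.5) / 1)
```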

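Here is an example that combines halflife with times, so the decay is measured in elapsed time rather than in number of observations: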

In [54]: df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})

In [55]: df
Out[55]: 
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0

In [56]: times = ["2020-01-01", "2020-01-03", "2020-01-10", "2020-01-15", "2020-01-17"]

In [57]: df.ewm(halflife="4 days", times=pd.DatetimeIndex(times)).mean()
Out[57]: 
          B
0  0.000000
1  0.585786
2  1.523889
3  1.523889
4  3.233686
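
The numbers above can be reproduced by hand. The sketch below assumes that, with times and halflife, each earlier observation is weighted by 0.5 raised to (elapsed time / halflife) and that NaN inputs are simply skipped. This weighting rule is inferred from the output above, so treat it as an assumption rather than a statement of the pandas internals.

```python
import numpy as np
import pandas as pd

values = np.array([0.0, 1.0, 2.0, np.nan, 4.0])
times = pd.DatetimeIndex(
    ["2020-01-01", "2020-01-03", "2020-01-10", "2020-01-15", "2020-01-17"]
)
halflife_days = 4.0

# Elapsed days since the first observation.
days = np.asarray((times - times[0]) / pd.Timedelta("1D"), dtype=float)

manual = []
for t in range(len(values)):
    past = values[: t + 1]
    # Weight halves for every `halflife_days` of elapsed time.
    w = 0.5 ** ((days[t] - days[: t + 1]) / halflife_days)
    valid = ~np.isnan(past)
    manual.append(np.sum(w[valid] * past[valid]) / np.sum(w[valid]))
manual = np.array(manual)

print(manual.round(6))
```

For instance, at row 1 the earlier value is 2 days old, so its weight is 0.5 ** (2 / 4), and the weighted average of 0 and 1 comes out at about 0.585786.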

This article is also available at http://www.flydean.com/12-python-pandas-window/

