滑動視窗分析SQL實踐
標籤
PostgreSQL , Greenplum , 滑動視窗 , 滑窗 , 視窗語法 , window , frame , 幀
背景
滑動視窗分析是資料分析中比較常見的需求,例如需要分析每一天的最近7天內的UV、PV、sum, count, avg, min, max等。
因為每一條記錄的最近7天的資料都不一樣,不能直接GROUP BY,而需要使用 幀 的技術,得到當前行最近7行的資料,並進行統計。
使用的是PostgreSQL的視窗語法。
統計分析 – 滑窗SQL實踐
1、建立測試表,每條記錄代表某個GROUP 下的當前VALUE。
使用滑動分析,得到每條記錄最近7條的SUM,COUNT,count(distinct),avg,min,max等。
postgres=# create table t_slide (id int, grp int2, val int, crt_time timestamp);
CREATE TABLE
寫入測試資料
postgres=# insert into t_slide select generate_series(1,10000000), random()*100, random()*10000, clock_timestamp();
INSERT 0 10000000
建立索引
postgres=# create index idx_t_slide_1 on t_slide (grp,crt_time);
CREATE INDEX
滑窗查詢
postgres=# select *, count(*) over s, -- 視窗統計
sum(val) over s,
avg(val) over s,
min(val) over s,
max(val) over s
from t_slide
window s as (partition by grp order by crt_time rows 6 PRECEDING) -- 定義視窗(幀)
limit 100;
id | grp | val | crt_time | count | sum | avg | min | max
-------+-----+------+----------------------------+-------+-------+-----------------------+------+------
188 | 0 | 90 | 2017-11-29 20:03:01.62792 | 1 | 90 | 90.0000000000000000 | 90 | 90
437 | 0 | 5981 | 2017-11-29 20:03:01.628205 | 2 | 6071 | 3035.5000000000000000 | 90 | 5981
720 | 0 | 6932 | 2017-11-29 20:03:01.628541 | 3 | 13003 | 4334.3333333333333333 | 90 | 6932
979 | 0 | 4227 | 2017-11-29 20:03:01.628841 | 4 | 17230 | 4307.5000000000000000 | 90 | 6932
1642 | 0 | 2754 | 2017-11-29 20:03:01.629642 | 5 | 19984 | 3996.8000000000000000 | 90 | 6932
1917 | 0 | 6376 | 2017-11-29 20:03:01.629954 | 6 | 26360 | 4393.3333333333333333 | 90 | 6932
2112 | 0 | 2538 | 2017-11-29 20:03:01.63018 | 7 | 28898 | 4128.2857142857142857 | 90 | 6932
2170 | 0 | 7598 | 2017-11-29 20:03:01.630235 | 7 | 36406 | 5200.8571428571428571 | 2538 | 7598
2173 | 0 | 7168 | 2017-11-29 20:03:01.630237 | 7 | 37593 | 5370.4285714285714286 | 2538 | 7598
2495 | 0 | 1026 | 2017-11-29 20:03:01.630611 | 7 | 31687 | 4526.7142857142857143 | 1026 | 7598
2656 | 0 | 2522 | 2017-11-29 20:03:01.630799 | 7 | 29982 | 4283.1428571428571429 | 1026 | 7598
2850 | 0 | 5016 | 2017-11-29 20:03:01.631038 | 7 | 32244 | 4606.2857142857142857 | 1026 | 7598
2876 | 0 | 6510 | 2017-11-29 20:03:01.631073 | 7 | 32378 | 4625.4285714285714286 | 1026 | 7598
3289 | 0 | 9566 | 2017-11-29 20:03:01.631524 | 7 | 39406 | 5629.4285714285714286 | 1026 | 9566
3413 | 0 | 86 | 2017-11-29 20:03:01.631665 | 7 | 31894 | 4556.2857142857142857 | 86 | 9566
3673 | 0 | 7581 | 2017-11-29 20:03:01.631969 | 7 | 32307 | 4615.2857142857142857 | 86 | 9566
3745 | 0 | 6976 | 2017-11-29 20:03:01.632037 | 7 | 38257 | 5465.2857142857142857 | 86 | 9566
4435 | 0 | 1981 | 2017-11-29 20:03:01.632848 | 7 | 37716 | 5388.0000000000000000 | 86 | 9566
4439 | 0 | 3453 | 2017-11-29 20:03:01.632852 | 7 | 36153 | 5164.7142857142857143 | 86 | 9566
4555 | 0 | 9474 | 2017-11-29 20:03:01.632983 | 7 | 39117 | 5588.1428571428571429 | 86 | 9566
4689 | 0 | 9176 | 2017-11-29 20:03:01.633119 | 7 | 38727 | 5532.4285714285714286 | 86 | 9474
4714 | 0 | 7124 | 2017-11-29 20:03:01.633175 | 7 | 45765 | 6537.8571428571428571 | 1981 | 9474
5019 | 0 | 6520 | 2017-11-29 20:03:01.633495 | 7 | 44704 | 6386.2857142857142857 | 1981 | 9474
執行計劃如下
postgres=# explain select *, count(*) over s, sum(val) over s, avg(val) over s, min(val) over s, max(val) over s from t_slide window s as (partition by grp order by crt_time rows 6 PRECEDING) limit 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit (cost=0.43..5.91 rows=100 width=74)
-> WindowAgg (cost=0.43..547641.44 rows=10000000 width=74)
-> Index Scan using idx_t_slide_1 on t_slide (cost=0.43..272641.44 rows=10000000 width=18)
(3 rows)
效率,毫秒級
postgres=# explain (analyze,verbose,timing,costs,buffers,summary) select *, count(*) over s, sum(val) over s, avg(val) over s, min(val) over s, max(val) over s from t_slide window s as (partition by grp order by crt_time rows 6 PRECEDING) limit 100;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..5.91 rows=100 width=74) (actual time=0.038..0.429 rows=100 loops=1)
Output: id, grp, val, crt_time, (count(*) OVER (?)), (sum(val) OVER (?)), (avg(val) OVER (?)), (min(val) OVER (?)), (max(val) OVER (?))
Buffers: shared hit=74
-> WindowAgg (cost=0.43..547641.44 rows=10000000 width=74) (actual time=0.037..0.393 rows=100 loops=1)
Output: id, grp, val, crt_time, count(*) OVER (?), sum(val) OVER (?), avg(val) OVER (?), min(val) OVER (?), max(val) OVER (?)
Buffers: shared hit=74
-> Index Scan using idx_t_slide_1 on public.t_slide (cost=0.43..272641.44 rows=10000000 width=18) (actual time=0.020..0.176 rows=101 loops=1)
Output: id, grp, val, crt_time
Buffers: shared hit=74
Planning time: 0.115 ms
Execution time: 0.498 ms
(11 rows)
估值計算 – 滑窗SQL實踐
PostgreSQL的估值計算外掛
https://github.com/aggregateknowledge/postgresql-hll
通過估值計算外掛,可以快速的實現UV統計,結合視窗語法,可以快速實現滑動視窗範圍的UV統計。
最近7天的HLL估值型別得到的UV。
how about a sliding window of uniques over the past 6 days?
SELECT date, #hll_union_agg(users) OVER seven_days
FROM daily_uniques
WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);
the number of uniques you saw yesterday that you didn`t see today?
SELECT date, (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
FROM daily_uniques
WINDOW two_days AS (ORDER BY date ASC ROWS 1 PRECEDING);
視窗、幀 語法
幀,指定記錄範圍,如果只指定了開始,則開始記錄到當前記錄表示一幀,統計基於這個視窗的話,就是基於這個幀的範圍資料進行統計。
視窗,指整個視窗(over partition)範圍進行統計。
A window function call represents the application of an aggregate-like function over some portion of
the rows selected by a query.
Unlike non-window aggregate calls, this is not tied to grouping of the
selected rows into a single output row — each row remains separate in the query output.
However the window function has access to all the rows that would be part of the current row`s group
according to the grouping specification (PARTITION BY list) of the window function call.
The syntax of a window function call is one of the following:
function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )
where window_definition has the syntax
[ existing_window_name ]
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ frame_clause ]
and the optional frame_clause can be one of
{ RANGE | ROWS } frame_start
{ RANGE | ROWS } BETWEEN frame_start AND frame_end
where frame_start and frame_end can be one of
UNBOUNDED PRECEDING
value PRECEDING
CURRENT ROW
value FOLLOWING
UNBOUNDED FOLLOWING
https://www.postgresql.org/docs/10/static/functions-window.html
https://www.postgresql.org/docs/10/static/tutorial-window.html
https://www.postgresql.org/docs/10/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
相關文章
- 滑動視窗演算法基本原理與實踐演算法
- 自己實現一個滑動視窗
- Sentinel 原理-滑動視窗
- 細聊滑動視窗
- 滑動視窗演算法演算法
- Hive實戰—時間滑動視窗計算Hive
- 滑動視窗最大值的golang實現Golang
- TCP 流量控制-滑動視窗TCP
- 滑動視窗演算法思路演算法
- Sentinel滑動視窗演算法演算法
- 滑動視窗問題總結
- 滑動視窗與雙指標指標
- Flink的滾動視窗、會話視窗、滑動視窗及其應用會話
- mysql視窗函式中的滑動視窗MySql函式
- 使用 Redis 實現限流——滑動視窗演算法Redis演算法
- 演算法~利用zset實現滑動視窗限流演算法
- 原始碼分析 Alibaba sentinel 滑動視窗實現原理(文末附原理圖)原始碼
- 力扣刷題-滑動視窗(字串)力扣字串
- Golang的滑動視窗計數器Redis限速實現GolangRedis
- 滑動視窗濾波器原理分析及詳細程式碼講解實現
- WeetCode2滑動視窗系列
- 滑動視窗法——Leetcode例題LeetCode
- 239. 滑動視窗最大值
- TCP的滑動視窗和擁塞控制TCP
- [分散式限流] 滑動視窗演算法的 Golang 實現分散式演算法Golang
- 氣球遊戲騰訊面試題滑動視窗解法遊戲面試題
- 【演算法】滑動視窗三步走演算法
- 滑動視窗相關的題目總結
- 滑動視窗(Sliding Window)技巧總結
- 滑動視窗法——子串相關問題
- [Python手撕]滑動視窗最大值Python
- 滑動視窗最大值——棧與佇列佇列
- 透過滑動視窗實現介面呼叫的多種限制策略
- 騰訊面試題-求滑動視窗的最大值面試題
- 對滑動視窗單調性的一點思考
- 滑動視窗演算法(Sliding Window Algorithm)演算法Go
- 「LeetCode Top100」之滑動視窗LeetCode
- 滑動視窗(Sliding Window)演算法介紹演算法