滑動視窗分析SQL實踐

德哥發表於2017-11-30

標籤

PostgreSQL , Greenplum , 滑動視窗 , 滑窗 , 視窗語法 , window , frame , 幀


背景

滑動視窗分析是資料分析中比較常見的需求,例如需要分析每一天的最近7天內的UV、PV、sum, count, avg, min, max等。

pic

因為每一條記錄的最近7天的資料都不一樣,不能直接GROUP BY,而需要使用 幀 的技術,得到當前行最近7行的資料,並進行統計。

使用的是PostgreSQL的視窗語法。

統計分析 – 滑窗SQL實踐

1、建立測試表,每條記錄代表某個GROUP 下的當前VALUE。

使用滑動分析,得到每條記錄最近7條的SUM,COUNT,count(distinct),avg,min,max等。

postgres=# create table t_slide (id int, grp int2, val int, crt_time timestamp);  
CREATE TABLE  

寫入測試資料

postgres=# insert into t_slide select generate_series(1,10000000), random()*100, random()*10000, clock_timestamp();  
INSERT 0 10000000  

建立索引

postgres=# create index idx_t_slide_1 on t_slide (grp,crt_time);  
CREATE INDEX  

滑窗查詢

postgres=# select *, count(*) over s,   -- 視窗統計  
                     sum(val) over s,   
		     avg(val) over s,   
		     min(val) over s,   
		     max(val) over s   
           from t_slide   
	   window s as (partition by grp order by crt_time rows 6 PRECEDING)  -- 定義視窗(幀)  
limit 100;  
  
    
  id   | grp | val  |          crt_time          | count |  sum  |          avg          | min  | max    
-------+-----+------+----------------------------+-------+-------+-----------------------+------+------  
   188 |   0 |   90 | 2017-11-29 20:03:01.62792  |     1 |    90 |   90.0000000000000000 |   90 |   90  
   437 |   0 | 5981 | 2017-11-29 20:03:01.628205 |     2 |  6071 | 3035.5000000000000000 |   90 | 5981  
   720 |   0 | 6932 | 2017-11-29 20:03:01.628541 |     3 | 13003 | 4334.3333333333333333 |   90 | 6932  
   979 |   0 | 4227 | 2017-11-29 20:03:01.628841 |     4 | 17230 | 4307.5000000000000000 |   90 | 6932  
  1642 |   0 | 2754 | 2017-11-29 20:03:01.629642 |     5 | 19984 | 3996.8000000000000000 |   90 | 6932  
  1917 |   0 | 6376 | 2017-11-29 20:03:01.629954 |     6 | 26360 | 4393.3333333333333333 |   90 | 6932  
  2112 |   0 | 2538 | 2017-11-29 20:03:01.63018  |     7 | 28898 | 4128.2857142857142857 |   90 | 6932  
  2170 |   0 | 7598 | 2017-11-29 20:03:01.630235 |     7 | 36406 | 5200.8571428571428571 | 2538 | 7598  
  2173 |   0 | 7168 | 2017-11-29 20:03:01.630237 |     7 | 37593 | 5370.4285714285714286 | 2538 | 7598  
  2495 |   0 | 1026 | 2017-11-29 20:03:01.630611 |     7 | 31687 | 4526.7142857142857143 | 1026 | 7598  
  2656 |   0 | 2522 | 2017-11-29 20:03:01.630799 |     7 | 29982 | 4283.1428571428571429 | 1026 | 7598  
  2850 |   0 | 5016 | 2017-11-29 20:03:01.631038 |     7 | 32244 | 4606.2857142857142857 | 1026 | 7598  
  2876 |   0 | 6510 | 2017-11-29 20:03:01.631073 |     7 | 32378 | 4625.4285714285714286 | 1026 | 7598  
  3289 |   0 | 9566 | 2017-11-29 20:03:01.631524 |     7 | 39406 | 5629.4285714285714286 | 1026 | 9566  
  3413 |   0 |   86 | 2017-11-29 20:03:01.631665 |     7 | 31894 | 4556.2857142857142857 |   86 | 9566  
  3673 |   0 | 7581 | 2017-11-29 20:03:01.631969 |     7 | 32307 | 4615.2857142857142857 |   86 | 9566  
  3745 |   0 | 6976 | 2017-11-29 20:03:01.632037 |     7 | 38257 | 5465.2857142857142857 |   86 | 9566  
  4435 |   0 | 1981 | 2017-11-29 20:03:01.632848 |     7 | 37716 | 5388.0000000000000000 |   86 | 9566  
  4439 |   0 | 3453 | 2017-11-29 20:03:01.632852 |     7 | 36153 | 5164.7142857142857143 |   86 | 9566  
  4555 |   0 | 9474 | 2017-11-29 20:03:01.632983 |     7 | 39117 | 5588.1428571428571429 |   86 | 9566  
  4689 |   0 | 9176 | 2017-11-29 20:03:01.633119 |     7 | 38727 | 5532.4285714285714286 |   86 | 9474  
  4714 |   0 | 7124 | 2017-11-29 20:03:01.633175 |     7 | 45765 | 6537.8571428571428571 | 1981 | 9474  
  5019 |   0 | 6520 | 2017-11-29 20:03:01.633495 |     7 | 44704 | 6386.2857142857142857 | 1981 | 9474  

執行計劃如下

postgres=# explain select *, count(*) over s, sum(val) over s, avg(val) over s, min(val) over s, max(val) over s from t_slide window s as (partition by grp order by crt_time rows 6 PRECEDING) limit 100;  
                                              QUERY PLAN                                                
------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.43..5.91 rows=100 width=74)  
   ->  WindowAgg  (cost=0.43..547641.44 rows=10000000 width=74)  
         ->  Index Scan using idx_t_slide_1 on t_slide  (cost=0.43..272641.44 rows=10000000 width=18)  
(3 rows)  

效率,毫秒級

postgres=# explain (analyze,verbose,timing,costs,buffers,summary) select *, count(*) over s, sum(val) over s, avg(val) over s, min(val) over s, max(val) over s from t_slide window s as (partition by grp order by crt_time rows 6 PRECEDING) limit 100;  
                                                                       QUERY PLAN                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.43..5.91 rows=100 width=74) (actual time=0.038..0.429 rows=100 loops=1)  
   Output: id, grp, val, crt_time, (count(*) OVER (?)), (sum(val) OVER (?)), (avg(val) OVER (?)), (min(val) OVER (?)), (max(val) OVER (?))  
   Buffers: shared hit=74  
   ->  WindowAgg  (cost=0.43..547641.44 rows=10000000 width=74) (actual time=0.037..0.393 rows=100 loops=1)  
         Output: id, grp, val, crt_time, count(*) OVER (?), sum(val) OVER (?), avg(val) OVER (?), min(val) OVER (?), max(val) OVER (?)  
         Buffers: shared hit=74  
         ->  Index Scan using idx_t_slide_1 on public.t_slide  (cost=0.43..272641.44 rows=10000000 width=18) (actual time=0.020..0.176 rows=101 loops=1)  
               Output: id, grp, val, crt_time  
               Buffers: shared hit=74  
 Planning time: 0.115 ms  
 Execution time: 0.498 ms  
(11 rows)  

估值計算 – 滑窗SQL實踐

PostgreSQL的估值計算外掛

https://github.com/aggregateknowledge/postgresql-hll

通過估值計算外掛,可以快速的實現UV統計,結合視窗語法,可以快速實現滑動視窗範圍的UV統計。

最近7天的HLL估值型別得到的UV。

how about a sliding window of uniques over the past 6 days?

SELECT date, #hll_union_agg(users) OVER seven_days  
FROM daily_uniques  
WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);  

the number of uniques you saw yesterday that you didn`t see today?

SELECT date, (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques  
FROM daily_uniques  
WINDOW two_days AS (ORDER BY date ASC ROWS 1 PRECEDING);  

視窗、幀 語法

幀,指定記錄範圍,如果只指定了開始,則開始記錄到當前記錄表示一幀,統計基於這個視窗的話,就是基於這個幀的範圍資料進行統計。

視窗,指整個視窗(over partition)範圍進行統計。

A window function call represents the application of an aggregate-like function over some portion of
the rows selected by a query.

Unlike non-window aggregate calls, this is not tied to grouping of the
selected rows into a single output row — each row remains separate in the query output.

However the window function has access to all the rows that would be part of the current row`s group
according to the grouping specification (PARTITION BY list) of the window function call.

The syntax of a window function call is one of the following:

function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER window_name  
function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )  
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER window_name  
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )  

where window_definition has the syntax

[ existing_window_name ]  
[ PARTITION BY expression [, ...] ]  
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]  
[ frame_clause ]  

and the optional frame_clause can be one of

{ RANGE | ROWS } frame_start  
{ RANGE | ROWS } BETWEEN frame_start AND frame_end  

where frame_start and frame_end can be one of

UNBOUNDED PRECEDING  
value PRECEDING  
CURRENT ROW  
value FOLLOWING  
UNBOUNDED FOLLOWING  

https://www.postgresql.org/docs/10/static/functions-window.html

https://www.postgresql.org/docs/10/static/tutorial-window.html

https://www.postgresql.org/docs/10/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS


相關文章