PostgreSQLOracle相容性之-PartitionByOuterJoin實現稠化報表

德哥發表於2017-12-22

標籤

PostgreSQL , Oracle , 稠化報表 , partition by outer join


背景

背景介紹:借用Oracle的一篇例子:

http://blog.sina.com.cn/s/blog_4cef5c7b01016lm5.html

在資料庫表中,儲存的資料經常是稀疏資料(sparse data),而不是稠密資料(dense data)。

先來了解一下什麼是稀疏資料,比如一個產品銷售情況表(比如有產品名、銷售時間(精確到年月)、銷售量3個列),假設某個時間某些產品它沒有銷售,一般也不會將這些產品的銷售量儲存為0,而是不儲存,這樣在產品銷售情況表中就會產生很多缺失的行(gap rows),導致的結果就是特定產品銷售資料按時間維度進行排序,是不連續的,或者說此產品銷售在時間序列上是有缺失的。顧名思義,稠密資料是相對於稀疏資料來說的,還是用上面的假設說明,也就是說產品在某個時間沒有銷售,也必須儲存此產品銷售情況,銷售量置0儲存,這樣對某個特定產品來說它在時間序列就是連續的,但是事實經常不是如此,所以才有將稀疏資料稠密化的過程,資料稠密化在資料倉儲應用中很常見。

例如以下報表,由於某些商品在某些月份沒有銷售,所以不會有記錄。

postgres=# create table t_sales(year text, month text, prod name, sales int);  
CREATE TABLE  
postgres=# insert into t_sales values (`2008`, `1`, `a`, 1000);  
INSERT 0 1  
postgres=# insert into t_sales values (`2008`, `1`, `b`, 1500);  
INSERT 0 1  
postgres=# insert into t_sales values (`2008`, `2`, `a`, 2000);  
INSERT 0 1  
postgres=# insert into t_sales values (`2008`, `2`, `b`, 3000);  
INSERT 0 1  
postgres=# insert into t_sales values (`2008`, `2`, `c`, 1000);  
INSERT 0 1  
postgres=# insert into t_sales values (`2008`, `3`, `a`, 3000);  
INSERT 0 1  
  
  
postgres=# select * from t_sales;  
 year | month | prod | sales   
------+-------+------+-------  
 2008 | 1     | a    |  1000  
 2008 | 1     | b    |  1500  
 2008 | 2     | a    |  2000  
 2008 | 2     | b    |  3000  
 2008 | 2     | c    |  1000  
 2008 | 3     | a    |  3000  
(6 rows)  

Oracle 通過partition by outer join語法可以方便的實現稠化。

SELECT .....  
  
FROM table_reference  
  
PARTITION BY (expr [, expr ]... )  
  
RIGHT OUTER JOIN table_reference  
  
   
  
SELECT .....  
  
FROM table_reference  
  
LEFT OUTER JOIN table_reference  

達到類似這樣的效果

postgres=# select * from continue_out();  
 year | month | prod | sales   
------+-------+------+-------  
 2008 | 2     | a    |  2000  
 2008 | 2     | b    |  3000  
 2008 | 2     | c    |  1000  
 2008 | 3     | a    |  3000  
 2008 | 3     | b    |     0  
 2008 | 3     | c    |     0  
 2008 | 1     | a    |  1000  
 2008 | 1     | b    |  1500  
 2008 | 1     | c    |     0  
(9 rows)  

PostgreSQL Partition By Outer Join實現

通過UDF,可以達到類似的效果:

create or replace function continue_out () returns setof t_sales as $$  
declare  
  v_year text;  
  v_month text;  
begin  
  -- 按年、月為分割槽 補齊prod  (select distinct year, month from t_sales)  
  -- 如果已有元表,則直接取元表  
  -- 如無元表,對於連續型資料,亦可以使用PG的generate_series直接生成。  
  for v_year, v_month in   
    select distinct year, month from t_sales    
  loop  
    -- 全量 prod  (select distinct prod from t_sales)   
    -- 如已有元表,則直接取元表  
    -- 如無元表,對於連續型資料,亦可以使用PG的generate_series直接生成。  
    return query   
      select   
        coalesce(t1.year,v_year),   
        coalesce(t1.month,v_month),   
        t2.prod,   
        coalesce(t1.sales, 0)   
      from   
      t_sales t1   
        right outer join   
      (select distinct prod from t_sales) t2   
        on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month);   
  end loop;    
end;  
$$ language plpgsql strict;  

如下

postgres=# select * from continue_out();  
 year | month | prod | sales   
------+-------+------+-------  
 2008 | 2     | a    |  2000  
 2008 | 2     | b    |  3000  
 2008 | 2     | c    |  1000  
 2008 | 3     | a    |  3000  
 2008 | 3     | b    |     0  
 2008 | 3     | c    |     0  
 2008 | 1     | a    |  1000  
 2008 | 1     | b    |  1500  
 2008 | 1     | c    |     0  
(9 rows)  

通過視窗,可以實現同比,從而實現快速的搜尋到銷量逆增長的商品。

postgres=# select *, sales - coalesce(lag(sales) over w1, sales)  from continue_out() window w1 as (partition by prod order by year,month);  
 year | month | prod | sales | ?column?   
------+-------+------+-------+----------  
 2008 | 1     | a    |  1000 |        0  
 2008 | 2     | a    |  2000 |     1000  
 2008 | 3     | a    |  3000 |     1000  
 2008 | 1     | b    |  1500 |        0  
 2008 | 2     | b    |  3000 |     1500  
 2008 | 3     | b    |     0 |    -3000  
 2008 | 1     | c    |     0 |        0  
 2008 | 2     | c    |  1000 |     1000  
 2008 | 3     | c    |     0 |    -1000  
(9 rows)  

詳細執行計劃如下:

postgres=# load `auto_explain`;
LOAD
Time: 22.171 ms
postgres=# set auto_explain.
auto_explain.log_analyze            auto_explain.log_format             auto_explain.log_nested_statements  auto_explain.log_triggers           auto_explain.sample_rate            
auto_explain.log_buffers            auto_explain.log_min_duration       auto_explain.log_timing             auto_explain.log_verbose            
postgres=# set auto_explain.log_analyze =on;
SET
Time: 0.272 ms
postgres=# set auto_explain.log_buffers =on;
SET
Time: 0.257 ms
postgres=# set auto_explain.log_min_duration =0;
SET
Time: 0.279 ms
postgres=# set auto_explain.log_nested_statements =on;
SET
Time: 0.262 ms
postgres=# set auto_explain.log_timing =on;
SET
Time: 0.269 ms
postgres=# set auto_explain.log_verbose =on;
SET
Time: 0.245 ms
postgres=# set client_min_messages =`log`;
SET
Time: 0.265 ms

postgres=# select *, sales - coalesce(lag(sales) over w1, sales)  from continue_out() window w1 as (partition by prod order by year,month);
LOG:  duration: 0.030 ms  plan:
Query Text: select coalesce(t1.year,v_year), coalesce(t1.month,v_month), t2.prod, coalesce(t1.sales, 0) from t_sales t1 
    right outer join 
  (select distinct prod from t_sales) t2 
  on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month)
Hash Left Join  (cost=34.04..38.80 rows=200 width=132) (actual time=0.024..0.027 rows=3 loops=1)
  Output: COALESCE(t1.year, $2), COALESCE(t1.month, $3), t_sales.prod, COALESCE(t1.sales, 0)
  Hash Cond: (t_sales.prod = t1.prod)
  Buffers: shared hit=2
  ->  HashAggregate  (cost=16.38..18.38 rows=200 width=64) (actual time=0.009..0.009 rows=3 loops=1)
        Output: t_sales.prod
        Group Key: t_sales.prod
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales  (cost=0.00..15.10 rows=510 width=64) (actual time=0.002..0.003 rows=6 loops=1)
              Output: t_sales.year, t_sales.month, t_sales.prod, t_sales.sales
              Buffers: shared hit=1
  ->  Hash  (cost=17.65..17.65 rows=1 width=132) (actual time=0.007..0.007 rows=3 loops=1)
        Output: t1.year, t1.month, t1.sales, t1.prod
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales t1  (cost=0.00..17.65 rows=1 width=132) (actual time=0.003..0.004 rows=3 loops=1)
              Output: t1.year, t1.month, t1.sales, t1.prod
              Filter: ((t1.year = $2) AND (t1.month = $3))
              Rows Removed by Filter: 3
              Buffers: shared hit=1
LOG:  duration: 0.020 ms  plan:
Query Text: select coalesce(t1.year,v_year), coalesce(t1.month,v_month), t2.prod, coalesce(t1.sales, 0) from t_sales t1 
    right outer join 
  (select distinct prod from t_sales) t2 
  on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month)
Hash Left Join  (cost=34.04..38.80 rows=200 width=132) (actual time=0.016..0.018 rows=3 loops=1)
  Output: COALESCE(t1.year, $2), COALESCE(t1.month, $3), t_sales.prod, COALESCE(t1.sales, 0)
  Hash Cond: (t_sales.prod = t1.prod)
  Buffers: shared hit=2
  ->  HashAggregate  (cost=16.38..18.38 rows=200 width=64) (actual time=0.008..0.009 rows=3 loops=1)
        Output: t_sales.prod
        Group Key: t_sales.prod
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales  (cost=0.00..15.10 rows=510 width=64) (actual time=0.003..0.004 rows=6 loops=1)
              Output: t_sales.year, t_sales.month, t_sales.prod, t_sales.sales
              Buffers: shared hit=1
  ->  Hash  (cost=17.65..17.65 rows=1 width=132) (actual time=0.004..0.004 rows=1 loops=1)
        Output: t1.year, t1.month, t1.sales, t1.prod
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales t1  (cost=0.00..17.65 rows=1 width=132) (actual time=0.002..0.003 rows=1 loops=1)
              Output: t1.year, t1.month, t1.sales, t1.prod
              Filter: ((t1.year = $2) AND (t1.month = $3))
              Rows Removed by Filter: 5
              Buffers: shared hit=1
LOG:  duration: 0.019 ms  plan:
Query Text: select coalesce(t1.year,v_year), coalesce(t1.month,v_month), t2.prod, coalesce(t1.sales, 0) from t_sales t1 
    right outer join 
  (select distinct prod from t_sales) t2 
  on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month)
Hash Left Join  (cost=34.04..38.80 rows=200 width=132) (actual time=0.014..0.017 rows=3 loops=1)
  Output: COALESCE(t1.year, $2), COALESCE(t1.month, $3), t_sales.prod, COALESCE(t1.sales, 0)
  Hash Cond: (t_sales.prod = t1.prod)
  Buffers: shared hit=2
  ->  HashAggregate  (cost=16.38..18.38 rows=200 width=64) (actual time=0.007..0.008 rows=3 loops=1)
        Output: t_sales.prod
        Group Key: t_sales.prod
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales  (cost=0.00..15.10 rows=510 width=64) (actual time=0.002..0.003 rows=6 loops=1)
              Output: t_sales.year, t_sales.month, t_sales.prod, t_sales.sales
              Buffers: shared hit=1
  ->  Hash  (cost=17.65..17.65 rows=1 width=132) (actual time=0.004..0.004 rows=2 loops=1)
        Output: t1.year, t1.month, t1.sales, t1.prod
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=1
        ->  Seq Scan on public.t_sales t1  (cost=0.00..17.65 rows=1 width=132) (actual time=0.001..0.003 rows=2 loops=1)
              Output: t1.year, t1.month, t1.sales, t1.prod
              Filter: ((t1.year = $2) AND (t1.month = $3))
              Rows Removed by Filter: 4
              Buffers: shared hit=1
LOG:  duration: 0.045 ms  plan:
Query Text: select distinct year, month from t_sales
HashAggregate  (cost=17.65..19.65 rows=200 width=64) (actual time=0.040..0.041 rows=3 loops=1)
  Output: year, month
  Group Key: t_sales.year, t_sales.month
  Buffers: shared hit=1
  ->  Seq Scan on public.t_sales  (cost=0.00..15.10 rows=510 width=64) (actual time=0.028..0.029 rows=6 loops=1)
        Output: year, month, prod, sales
        Buffers: shared hit=1
LOG:  duration: 0.832 ms  plan:
Query Text: select *, sales - coalesce(lag(sales) over w1, sales)  from continue_out() window w1 as (partition by prod order by year,month);
WindowAgg  (cost=60.08..85.08 rows=1000 width=136) (actual time=0.808..0.818 rows=9 loops=1)
  Output: year, month, prod, sales, (sales - COALESCE(lag(sales) OVER (?), sales))
  Buffers: shared hit=9
  ->  Sort  (cost=60.08..62.58 rows=1000 width=132) (actual time=0.803..0.804 rows=9 loops=1)
        Output: year, month, prod, sales
        Sort Key: continue_out.prod, continue_out.year, continue_out.month
        Sort Method: quicksort  Memory: 26kB
        Buffers: shared hit=9
        ->  Function Scan on public.continue_out  (cost=0.25..10.25 rows=1000 width=132) (actual time=0.782..0.783 rows=9 loops=1)
              Output: year, month, prod, sales
              Function Call: continue_out()
              Buffers: shared hit=9
 year | month | prod | sales | ?column? 
------+-------+------+-------+----------
 2008 | 1     | a    |  1000 |        0
 2008 | 2     | a    |  2000 |     1000
 2008 | 3     | a    |  3000 |     1000
 2008 | 1     | b    |  1500 |        0
 2008 | 2     | b    |  3000 |     1500
 2008 | 3     | b    |     0 |    -3000
 2008 | 1     | c    |     0 |        0
 2008 | 2     | c    |  1000 |     1000
 2008 | 3     | c    |     0 |    -1000
(9 rows)

Time: 1.362 ms

PostgreSQL的UDF非常強大,可以實現很多功能。

當然也期待PostgreSQL直接將partition by語法引入,那就更加方便了。

參考

https://www.postgresql.org/docs/10/static/functions-window.html

https://www.postgresql.org/docs/10/static/plpgsql.html

http://blog.sina.com.cn/s/blog_4cef5c7b01016lm5.html


相關文章