資料透視-商場(如沃爾瑪)選址應用

德哥發表於2017-03-07

標籤

PostgreSQL , 資料透視 , cube , grouping sets , rollup


背景

人群透視是商業與資料結合的案例之一,比如大型商場的選址,可與分析的資料包括車流、人流量等等。

結合資料可以更深入的分析人群的組成結構,消費能力等等,給大型商場的選址帶來更多的參考價值。

pic

那麼如何使用資料庫透視人群資料呢?

pic

場景構建

1. 人群屬性表

記載了每個人的各個屬性段落,比如收入、車齡、固定資產等等。如下

create table people(  
  id serial8 primary key,  -- 使用者ID  
  c1 int2, -- 年齡分段, 假設分5個檔, 使用0,1,2,3,4表示  
  c2 int2, -- 個人收入分段, 假設分3個檔, 使用0,1,2表示  
  c3 int2, -- 車齡分段, 假設分5個檔, 使用0,1,2,3,4表示  
  c4 int2, -- 家庭收入分段, 假設分3個檔, 使用0,1,2表示  
  c5 int2, -- 固定資產分段, 假設分3個檔, 使用0,1,2表示  
  c6 int2  -- 存款分段, 假設分3個檔, 使用0,1,2表示  
);  

2. 人群動態軌跡

記錄的是人群的活動位置或軌跡

使用PostgreSQL PostGIS外掛,可以很方便的記錄軌跡資料,並且支援GIST索引,可以快速的根據某個區域或範圍搜尋對應的人群。

create table people_loc(  
  id int8,  -- 使用者ID  
  -- loc geometry,  -- 位置  
  crt_time timestamp  -- 時間  
);  

生成測試資料

1. 生成1000萬人群的測試資料, 其中車齡為4, 年齡段為4的不插入,製造一些空洞。

insert into people (c1,c2,c3,c4,c5,c6)  
select   
mod((random()*10)::int,4),  
mod((random()*10)::int,3),  
mod((random()*10)::int,4),  
mod((random()*10)::int,3),  
mod((random()*10)::int,3),  
mod((random()*10)::int,3)  
from generate_series(1,10000000);  

postgres=# select * from people limit 10;  
 id | c1 | c2 | c3 | c4 | c5 | c6   
----+----+----+----+----+----+----  
  1 |  2 |  1 |  3 |  0 |  1 |  2  
  2 |  0 |  0 |  1 |  0 |  1 |  0  
  3 |  2 |  1 |  0 |  2 |  0 |  2  
  4 |  1 |  0 |  0 |  0 |  1 |  2  
  5 |  3 |  2 |  2 |  1 |  2 |  1  
  6 |  1 |  2 |  0 |  0 |  1 |  1  
  7 |  2 |  1 |  0 |  1 |  0 |  0  
  8 |  1 |  1 |  0 |  1 |  0 |  2  
  9 |  3 |  0 |  3 |  1 |  2 |  1  
 10 |  3 |  2 |  2 |  0 |  2 |  1  
(10 rows)  

2. 生成1000萬人群軌跡資料

insert into people_loc (id, crt_time)  
select random()*10000000, now()+format(`%L`, (500000-random()*1000000))::interval  
from generate_series(1,10000000);  


postgres=# select * from people_loc  limit 10;  
   id    |          crt_time            
---------+----------------------------  
 7278581 | 2017-03-05 16:35:13.828435  
 3456421 | 2017-03-07 09:08:26.853477  
  976602 | 2017-03-04 18:47:49.176176  
 1996929 | 2017-03-11 08:46:31.955573  
 6590325 | 2017-03-11 14:48:55.231263  
 7252414 | 2017-03-04 08:17:28.731733  
 8763332 | 2017-03-01 15:37:11.57363  
 9426083 | 2017-03-11 17:51:46.474757  
 4399781 | 2017-03-05 08:07:45.962599  
 9049432 | 2017-03-09 14:10:42.211882  
(10 rows)  

資料透視

1. 選擇人群

以某個點為中心、或者根據某個閉環區域,圈一部分人群,(採用PostGIS)

這裡不舉例GIS(跟興趣的童鞋可以使用PostGIS測試一下,效能槓槓的),我直接以時間為度量直接圈人。

select id from people_loc where crt_time between `2017-03-06`::date and `2017-03-08`::date;  

有人可能要問,如果這個時間段,同一個人出現了多條軌跡,怎麼處理呢?

這裡使用了IN,PostgreSQL 的優化器很強大,JOIN時資料庫會自動聚合,不必在這裡GROUP BY,原理可參考如下文章。

《聊一下PostgreSQL優化器 – in裡面有重複值時PostgreSQL如何處理?》

2. 資料透視

PostgreSQL的SQL相容性非常強大,對於資料透視,可以使用grouping sets, cube, rollup等語法。

《GROUPING SETS, CUBE and ROLLUP》

select c1,c2,c3,c4,c5,c6,count(*) cnt  
from   
people  
where id in (  
  select id from people_loc where crt_time between `2017-03-06`::date and `2017-03-08`::date  
)  
GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,());  

 c1 | c2 | c3 | c4 | c5 | c6 |   cnt     
----+----+----+----+----+----+---------  
    |  0 |    |    |    |    |  555530  
    |  1 |    |    |    |    |  555525  
    |  2 |    |    |    |    |  475596  
    |    |    |    |    |    | 1586651  
    |    |    |  0 |    |    |  554079  
    |    |    |  1 |    |    |  555864  
    |    |    |  2 |    |    |  476708  
    |    |    |    |    |  0 |  554738  
    |    |    |    |    |  1 |  554843  
    |    |    |    |    |  2 |  477070  
    |    |    |    |  0 |    |  554552  
    |    |    |    |  1 |    |  555073  
    |    |    |    |  2 |    |  477026  
  0 |    |    |    |    |    |  396349  
  1 |    |    |    |    |    |  475616  
  2 |    |    |    |    |    |  397502  
  3 |    |    |    |    |    |  317184  
    |    |  0 |    |    |    |  396947  
    |    |  1 |    |    |    |  475504  
    |    |  2 |    |    |    |  395852  
    |    |  3 |    |    |    |  318348  
(21 rows)  

更多透視用法參考cube, rollup, grouping sets用法。

目前PostgreSQL, HybridDB, Greenplum都支援以上語法。

3. 結果轉換

使用WITH語法,將以上結果進行轉換

with tmp as (  
select c1,c2,c3,c4,c5,c6,count(*) cnt  
from   
people  
where id in (  
  select id from people_loc where crt_time between `2017-03-06`::date and `2017-03-08`::date  
)  
GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,())  
)  
select case   
when c1 is not null then `c1_`||c1   
when c2 is not null then `c2_`||c2   
when c3 is not null then `c3_`||c3   
when c4 is not null then `c4_`||c4   
when c5 is not null then `c5_`||c5   
when c6 is not null then `c6_`||c6   
else `cnt` end AS col,  
t1.cnt as private,  
t2.cnt as all,  
t1.cnt::numeric/t2.cnt as ratio  
from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2  
;  

 col  | private |   all   |         ratio            
------+---------+---------+------------------------  
 c2_0 |  555530 | 1586651 | 0.35012740672019240526  
 c2_1 |  555525 | 1586651 | 0.35012425542857250901  
 c2_2 |  475596 | 1586651 | 0.29974833785123508572  
 cnt  | 1586651 | 1586651 | 1.00000000000000000000  
 c4_0 |  554079 | 1586651 | 0.34921290189209851442  
 c4_1 |  555864 | 1586651 | 0.35033791300040147455  
 c4_2 |  476708 | 1586651 | 0.30044918510750001103  
 c6_0 |  554738 | 1586651 | 0.34962824212760083976  
 c6_1 |  554843 | 1586651 | 0.34969441925161866094  
 c6_2 |  477070 | 1586651 | 0.30067733862078049930  
 c5_0 |  554552 | 1586651 | 0.34951101407934069937  
 c5_1 |  555073 | 1586651 | 0.34983937866613388830  
 c5_2 |  477026 | 1586651 | 0.30064960725452541233  
 c1_0 |  396349 | 1586651 | 0.24980225645085151051  
 c1_1 |  475616 | 1586651 | 0.29976094301771467071  
 c1_2 |  397502 | 1586651 | 0.25052894429839958504  
 c1_3 |  317184 | 1586651 | 0.19990785623303423374  
 c3_0 |  396947 | 1586651 | 0.25017915092859110163  
 c3_1 |  475504 | 1586651 | 0.29969035408542899478  
 c3_2 |  395852 | 1586651 | 0.24948901806383382357  
 c3_3 |  318348 | 1586651 | 0.20064147692214608001  
(21 rows)  

Time: 8466.507 ms  

perf report

# Events: 8K cycles  
#  
# Overhead   Command       Shared Object                                               Symbol  
# ........  ........  ..................  ...................................................  
#  
     6.29%  postgres  postgres            [.] comparetup_heap  
            |  
            --- comparetup_heap  
               |            
               |--41.84%-- (nil)  
               |            
               |--33.36%-- 0x1  
               |            
               |--8.44%-- 0x23e8e  
               |            
               |--8.43%-- 0x2  
               |            
                --7.93%-- 0x3  

     5.16%  postgres  postgres            [.] slot_deform_tuple.lto_priv.1138  
            |  
            --- slot_deform_tuple.lto_priv.1138  

     3.82%  postgres  postgres            [.] mergeprereadone  
            |  
            --- mergeprereadone  

     3.79%  postgres  postgres            [.] qsort_ssup  
            |  
            --- qsort_ssup  

     3.51%  postgres  postgres            [.] tuplesort_gettuple_common.lto_priv.1348  
            |  
            --- tuplesort_gettuple_common.lto_priv.1348  
               |            
               |--32.14%-- 0x1  
               |            
               |--22.28%-- 0x2  
               |            
               |--18.95%-- (nil)  
               |            
               |--11.41%-- 0x10  
               |            
               |--5.72%-- 0x3  
               |            
               |--1.91%-- 0x3d84d9  
               |            
               |--1.91%-- 0xef259  
               |            
               |--1.91%-- get_select_query_def.lto_priv.1324  
               |            
               |--1.91%-- 0x95c9af  
               |            
                --1.88%-- 0x3a0e54  

4. left join 補缺(可選)

對於空洞值,如果你要補齊的話,使用left join即可

select * from (values (`c1_0`),(`c1_1`),(`c1_2`),(`c1_3`),(`c1_4`),(`c2_0`),(`c2_1`),(`c2_2`),(`c3_0`),(`c3_1`),(`c3_2`),(`c3_3`),(`c3_4`),(`c4_0`),(`c4_1`),(`c4_2`),(`c5_0`),(`c5_1`),(`c5_2`),(`c6_0`),(`c6_1`),(`c6_2`)) t (col);  


 col    
------  
 c1_0  
 c1_1  
 c1_2  
 c1_3  
 c1_4  
 c2_0  
 c2_1  
 c2_2  
 c3_0  
 c3_1  
 c3_2  
 c3_3  
 c3_4  
 c4_0  
 c4_1  
 c4_2  
 c5_0  
 c5_1  
 c5_2  
 c6_0  
 c6_1  
 c6_2  
(22 rows)  

補缺如下

with tmp as (  
select c1,c2,c3,c4,c5,c6,count(*) cnt  
from   
people  
where id in (  
  select id from people_loc where crt_time between `2017-03-06`::date and `2017-03-08`::date  
)  
GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,())  
),  
tmp2 as (  
select case   
when c1 is not null then `c1_`||c1   
when c2 is not null then `c2_`||c2   
when c3 is not null then `c3_`||c3   
when c4 is not null then `c4_`||c4   
when c5 is not null then `c5_`||c5   
when c6 is not null then `c6_`||c6   
else `cnt` end AS col,  
t1.cnt as private,  
t2.cnt as all,  
t1.cnt::numeric/t2.cnt as ratio  
from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2  
)  
select t1.col,coalesce(t2.ratio,0) ratio from (values (`c1_0`),(`c1_1`),(`c1_2`),(`c1_3`),(`c1_4`),(`c2_0`),(`c2_1`),(`c2_2`),(`c3_0`),(`c3_1`),(`c3_2`),(`c3_3`),(`c3_4`),(`c4_0`),(`c4_1`),(`c4_2`),(`c5_0`),(`c5_1`),(`c5_2`),(`c6_0`),(`c6_1`),(`c6_2`))   
t1 (col)   
left join tmp2 t2  
on (t1.col=t2.col)  
order by t1.col;   


 col  |         ratio            
------+------------------------  
 c1_0 | 0.24980225645085151051  
 c1_1 | 0.29976094301771467071  
 c1_2 | 0.25052894429839958504  
 c1_3 | 0.19990785623303423374  
 c1_4 |                      0  
 c2_0 | 0.35012740672019240526  
 c2_1 | 0.35012425542857250901  
 c2_2 | 0.29974833785123508572  
 c3_0 | 0.25017915092859110163  
 c3_1 | 0.29969035408542899478  
 c3_2 | 0.24948901806383382357  
 c3_3 | 0.20064147692214608001  
 c3_4 |                      0  
 c4_0 | 0.34921290189209851442  
 c4_1 | 0.35033791300040147455  
 c4_2 | 0.30044918510750001103  
 c5_0 | 0.34951101407934069937  
 c5_1 | 0.34983937866613388830  
 c5_2 | 0.30064960725452541233  
 c6_0 | 0.34962824212760083976  
 c6_1 | 0.34969441925161866094  
 c6_2 | 0.30067733862078049930  
(22 rows)  

5. 行列變換(可選)

如果要將以上資料,多行轉換為單行,可以使用tablefunc外掛,PostgreSQL玩法巨多哦。

https://www.postgresql.org/docs/9.6/static/tablefunc.html

create extension tablefunc;  

select * from   
crosstab($$  
with tmp as (  
select c1,c2,c3,c4,c5,c6,count(*) cnt  
from   
people  
where id in (  
  select id from people_loc where crt_time between `2017-03-06`::date and `2017-03-08`::date  
)  
GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,())  
),  
tmp2 as (  
select case   
when c1 is not null then `c1_`||c1   
when c2 is not null then `c2_`||c2   
when c3 is not null then `c3_`||c3   
when c4 is not null then `c4_`||c4   
when c5 is not null then `c5_`||c5   
when c6 is not null then `c6_`||c6   
else `cnt` end AS col,  
t1.cnt as private,  
t2.cnt as all,  
t1.cnt::numeric/t2.cnt as ratio  
from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2  
)  
select `row`::text , t1.col,coalesce(t2.ratio,0) ratio from (values (`c1_0`),(`c1_1`),(`c1_2`),(`c1_3`),(`c1_4`),(`c2_0`),(`c2_1`),(`c2_2`),(`c3_0`),(`c3_1`),(`c3_2`),(`c3_3`),(`c3_4`),(`c4_0`),(`c4_1`),(`c4_2`),(`c5_0`),(`c5_1`),(`c5_2`),(`c6_0`),(`c6_1`),(`c6_2`))   
t1 (col)   
left join tmp2 t2  
on (t1.col=t2.col)  
order by t1.col  
$$  
)  
as  
(  
row text,  
c1_0 numeric,  
c1_1 numeric,  
c1_2 numeric,  
c1_3 numeric,  
c1_4 numeric,  
c2_0 numeric,  
c2_1 numeric,  
c2_2 numeric,  
c3_0 numeric,  
c3_1 numeric,  
c3_2 numeric,  
c3_3 numeric,  
c3_4 numeric,  
c4_0 numeric,  
c4_1 numeric,  
c4_2 numeric,  
c5_0 numeric,  
c5_1 numeric,  
c5_2 numeric,  
c6_0 numeric,  
c6_1 numeric,  
c6_2 numeric  
);  


 row |          c1_0          |          c1_1          |          c1_2          |          c1_3          | c1_4 |          c2_0          |          c2_1          |          c2_2          |          c3_0          |          c3_1            
|          c3_2          |          c3_3          | c3_4 |          c4_0          |          c4_1          |          c4_2          |          c5_0          |          c5_1          |          c5_2          |          c6_0          |      
      c6_1          |          c6_2            
-----+------------------------+------------------------+------------------------+------------------------+------+------------------------+------------------------+------------------------+------------------------+------------------------  
+------------------------+------------------------+------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----  
--------------------+------------------------  
 row | 0.24980225645085151051 | 0.29976094301771467071 | 0.25052894429839958504 | 0.19990785623303423374 |    0 | 0.35012740672019240526 | 0.35012425542857250901 | 0.29974833785123508572 | 0.25017915092859110163 | 0.29969035408542899478   
| 0.24948901806383382357 | 0.20064147692214608001 |    0 | 0.34921290189209851442 | 0.35033791300040147455 | 0.30044918510750001103 | 0.34951101407934069937 | 0.34983937866613388830 | 0.30064960725452541233 | 0.34962824212760083976 | 0.3  
4969441925161866094 | 0.30067733862078049930  
(1 row)  

透視優化

1. 關於索引(BRIN, GIST, BTREE_GIST)

通常我們會限定兩個維度,篩選人群,1時間範圍,2地理位置範圍。

由於軌跡資料通常是時間和堆的線性相關性很好的,所以,在索引方面,可以使用BRIN索引。

brin索引詳見

《PostgreSQL 聚集儲存 與 BRIN索引 – 高併發行為、軌跡類大吞吐資料查詢場景解說》

而對於地理位置,如果要進行快速篩選的話,可以建立GIST索引

如果要建立兩者的複合索引,可以使用btree_gist外掛,那麼時間和地理位置就能放在一個GIST索引中了。

create extension btree_gist;  

2. 遞迴優化

如果軌跡點很多,但是大多數為重複人群,可使用遞迴優化IN查詢

參考

《用PostgreSQL找回618秒逝去的青春 – 遞迴收斂優化》

《distinct xx和count(distinct xx)的變態遞迴優化方法 – 索引收斂(skip scan)掃描》

《時序資料合併場景加速分析和實現 – 複合索引,視窗分組查詢加速,變態遞迴加速》

3. case when 優化,在使用本例的cube,grouping sets,rollup前,或者其他不支援資料透視語法的資料庫中,可以使用case when的方法來聚合,但是每條資料都要經過case when的計算,耗費很大的CPU。

select   
  sum(case when c1=0 then 1 else 0 end)/(count(*))::numeric as c1_0,  
  sum(case when c1=1 then 1 else 0 end)/(count(*))::numeric as c1_1,  
  sum(case when c1=2 then 1 else 0 end)/(count(*))::numeric as c1_2,  
  sum(case when c1=3 then 1 else 0 end)/(count(*))::numeric as c1_3,  
  sum(case when c1=4 then 1 else 0 end)/(count(*))::numeric as c1_4,  
  sum(case when c2=0 then 1 else 0 end)/(count(*))::numeric as c2_0,  
  sum(case when c2=1 then 1 else 0 end)/(count(*))::numeric as c2_1,  
  sum(case when c2=2 then 1 else 0 end)/(count(*))::numeric as c2_2,  
  sum(case when c3=0 then 1 else 0 end)/(count(*))::numeric as c3_0,  
  sum(case when c3=1 then 1 else 0 end)/(count(*))::numeric as c3_1,  
  sum(case when c3=2 then 1 else 0 end)/(count(*))::numeric as c3_2,  
  sum(case when c3=3 then 1 else 0 end)/(count(*))::numeric as c3_3,  
  sum(case when c3=4 then 1 else 0 end)/(count(*))::numeric as c3_4,  
  sum(case when c4=0 then 1 else 0 end)/(count(*))::numeric as c4_0,  
  sum(case when c4=1 then 1 else 0 end)/(count(*))::numeric as c4_1,  
  sum(case when c4=2 then 1 else 0 end)/(count(*))::numeric as c4_2,  
  sum(case when c5=0 then 1 else 0 end)/(count(*))::numeric as c5_0,  
  sum(case when c5=1 then 1 else 0 end)/(count(*))::numeric as c5_1,  
  sum(case when c5=2 then 1 else 0 end)/(count(*))::numeric as c5_2,  
  sum(case when c6=0 then 1 else 0 end)/(count(*))::numeric as c6_0,  
  sum(case when c6=1 then 1 else 0 end)/(count(*))::numeric as c6_1,  
  sum(case when c6=2 then 1 else 0 end)/(count(*))::numeric as c6_2  
from   
people  
where id in (  
  select id from people_loc where crt_time between `2017-03-06`::date and `2017-03-08`::date  
);  

          c1_0          |          c1_1          |          c1_2          |          c1_3          |            c1_4            |          c2_0          |          c2_1          |          c2_2          |          c3_0          |          
  c3_1          |          c3_2          |          c3_3          |            c3_4            |          c4_0          |          c4_1          |          c4_2          |          c5_0          |          c5_1          |          c5_2    
        |          c6_0          |          c6_1          |          c6_2            
------------------------+------------------------+------------------------+------------------------+----------------------------+------------------------+------------------------+------------------------+------------------------+--------  
----------------+------------------------+------------------------+----------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----------------  
--------+------------------------+------------------------+------------------------  
 0.24980225645085151051 | 0.29976094301771467071 | 0.25052894429839958504 | 0.19990785623303423374 | 0.000000000000000000000000 | 0.35012740672019240526 | 0.35012425542857250901 | 0.29974833785123508572 | 0.25017915092859110163 | 0.29969  
035408542899478 | 0.24948901806383382357 | 0.20064147692214608001 | 0.000000000000000000000000 | 0.34921290189209851442 | 0.35033791300040147455 | 0.30044918510750001103 | 0.34951101407934069937 | 0.34983937866613388830 | 0.3006496072545  
2541233 | 0.34962824212760083976 | 0.34969441925161866094 | 0.30067733862078049930  
(1 row)  

Time: 8282.168 ms  

perf report

# Events: 8K cycles  
#  
# Overhead   Command       Shared Object                                               Symbol  
# ........  ........  ..................  ...................................................  
#  
    12.15%  postgres  postgres            [.] ExecMakeFunctionResultNoSets  
            |  
            --- ExecMakeFunctionResultNoSets  
               |            
                --100.00%-- (nil)  

     7.11%  postgres  postgres            [.] ExecEvalCase  
            |  
            --- ExecEvalCase  
               |            
                --100.00%-- (nil)  

     6.85%  postgres  postgres            [.] ExecTargetList.isra.6.lto_priv.1346  
            |  
            --- ExecTargetList.isra.6.lto_priv.1346  

     5.43%  postgres  postgres            [.] ExecProject.constprop.414  
            |  
            --- ExecProject.constprop.414  

     5.37%  postgres  postgres            [.] ExecEvalScalarVarFast  
            |  
            --- ExecEvalScalarVarFast  

     4.35%  postgres  postgres            [.] slot_getattr  
            |  
            --- slot_getattr  

     4.13%  postgres  postgres            [.] advance_aggregates  
            |  
            --- advance_aggregates  

     3.43%  postgres  postgres            [.] slot_deform_tuple.lto_priv.1138  
            |  
            --- slot_deform_tuple.lto_priv.1138  

     3.12%  postgres  postgres            [.] ExecClearTuple  
            |  
            --- ExecClearTuple  

     2.82%  postgres  postgres            [.] IndexNext  
            |  
            --- IndexNext  

     2.45%  postgres  postgres            [.] ExecEvalConst  
            |  
            --- ExecEvalConst  
               |            
                --100.00%-- (nil)  

小結

1. 語法cube, grouping sets, rollup給資料透視提供了比較好的便利。

2. 行列變換可以使用tablefunc外掛。

3. case when過多時,對CPU的開銷會比較大。

4. 結合PostGIS可以很方便的基於地理位置和時間維度,分析人群特性。

5. 阿里雲HybridDB, PostgreSQL都能提供以上功能,其中HybridDB為分散式資料倉儲。


相關文章