Data Pivoting: Site Selection for Shopping Malls (e.g. Walmart)
Tags
PostgreSQL , data pivoting , cube , grouping sets , rollup
Background
Crowd profiling is a classic intersection of business and data. Take site selection for a large shopping mall: the data worth analyzing includes traffic flow, foot traffic, and so on.
Combining this with other data sources, we can dig deeper into the composition of the crowd, its spending power, etc., adding reference value to the site-selection decision.
So how do we pivot crowd data with a database?
Scenario setup
1. People attribute table
Records each person's attribute buckets, such as income, car age, fixed assets, etc.:
create table people(
id serial8 primary key, -- user ID
c1 int2, -- age bucket, 5 buckets assumed: 0,1,2,3,4
c2 int2, -- personal income bucket, 3 buckets assumed: 0,1,2
c3 int2, -- car age bucket, 5 buckets assumed: 0,1,2,3,4
c4 int2, -- household income bucket, 3 buckets assumed: 0,1,2
c5 int2, -- fixed assets bucket, 3 buckets assumed: 0,1,2
c6 int2 -- savings bucket, 3 buckets assumed: 0,1,2
);
2. People movement trajectories
Records people's positions or movement trajectories.
With the PostgreSQL PostGIS extension, trajectory data is easy to store; GiST indexes are supported, so the people within a given region or range can be found quickly.
create table people_loc(
id int8, -- user ID
-- loc geometry, -- position
crt_time timestamp -- time
);
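If the loc column above is enabled, PostGIS plus a GiST index makes the region lookup fast. A minimal sketch, assuming PostGIS is installed and loc is stored with SRID 4326 (the index name and coordinates are illustrative, not from the original):

```sql
create extension if not exists postgis;
alter table people_loc add column loc geometry;                 -- enable the position column
create index idx_people_loc_geo on people_loc using gist (loc); -- spatial index
-- find everyone within a radius of a point (distance is in the units of the SRID)
select id from people_loc
where ST_DWithin(loc, ST_SetSRID(ST_MakePoint(120.0, 30.0), 4326), 0.01);
```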
Generating test data
1. Generate 10 million people; car-age bucket 4 and age bucket 4 are never inserted, deliberately leaving some holes.
insert into people (c1,c2,c3,c4,c5,c6)
select
mod((random()*10)::int,4),
mod((random()*10)::int,3),
mod((random()*10)::int,4),
mod((random()*10)::int,3),
mod((random()*10)::int,3),
mod((random()*10)::int,3)
from generate_series(1,10000000);
postgres=# select * from people limit 10;
id | c1 | c2 | c3 | c4 | c5 | c6
----+----+----+----+----+----+----
1 | 2 | 1 | 3 | 0 | 1 | 2
2 | 0 | 0 | 1 | 0 | 1 | 0
3 | 2 | 1 | 0 | 2 | 0 | 2
4 | 1 | 0 | 0 | 0 | 1 | 2
5 | 3 | 2 | 2 | 1 | 2 | 1
6 | 1 | 2 | 0 | 0 | 1 | 1
7 | 2 | 1 | 0 | 1 | 0 | 0
8 | 1 | 1 | 0 | 1 | 0 | 2
9 | 3 | 0 | 3 | 1 | 2 | 1
10 | 3 | 2 | 2 | 0 | 2 | 1
(10 rows)
2. Generate 10 million trajectory records
insert into people_loc (id, crt_time)
select random()*10000000, now() + format('%s seconds', 500000 - random()*1000000)::interval
from generate_series(1,10000000);
postgres=# select * from people_loc limit 10;
id | crt_time
---------+----------------------------
7278581 | 2017-03-05 16:35:13.828435
3456421 | 2017-03-07 09:08:26.853477
976602 | 2017-03-04 18:47:49.176176
1996929 | 2017-03-11 08:46:31.955573
6590325 | 2017-03-11 14:48:55.231263
7252414 | 2017-03-04 08:17:28.731733
8763332 | 2017-03-01 15:37:11.57363
9426083 | 2017-03-11 17:51:46.474757
4399781 | 2017-03-05 08:07:45.962599
9049432 | 2017-03-09 14:10:42.211882
(10 rows)
Data pivoting
1. Select the crowd
Circle a group of people around a center point or within a closed region (using PostGIS).
GIS itself is not demonstrated here (interested readers can test with PostGIS; performance is excellent); instead, people are selected directly by time:
select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date;
One might ask: what if the same person left multiple trajectory records in this time window?
That is why IN is used. PostgreSQL's optimizer is strong: it de-duplicates automatically when joining, so no GROUP BY is needed here. The principle is explained in:
"A chat about the PostgreSQL optimizer – how does PostgreSQL handle duplicate values inside IN?"
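The semi-join semantics are easy to verify with a toy query (not from the original article): duplicates on the inner side do not multiply the outer rows:

```sql
select count(*)
from (values (1),(2)) p(id)                           -- two people
where id in (select * from (values (1),(1)) q(id));   -- id 1 appears twice in the "trajectory"
-- count = 1: person 1 matches exactly once
```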
2. Pivot the data
PostgreSQL's SQL conformance is very strong; for pivoting, the GROUPING SETS, CUBE and ROLLUP syntax is available.
"GROUPING SETS, CUBE and ROLLUP"
select c1,c2,c3,c4,c5,c6,count(*) cnt
from
people
where id in (
select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date
)
GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,());
c1 | c2 | c3 | c4 | c5 | c6 | cnt
----+----+----+----+----+----+---------
| 0 | | | | | 555530
| 1 | | | | | 555525
| 2 | | | | | 475596
| | | | | | 1586651
| | | 0 | | | 554079
| | | 1 | | | 555864
| | | 2 | | | 476708
| | | | | 0 | 554738
| | | | | 1 | 554843
| | | | | 2 | 477070
| | | | 0 | | 554552
| | | | 1 | | 555073
| | | | 2 | | 477026
0 | | | | | | 396349
1 | | | | | | 475616
2 | | | | | | 397502
3 | | | | | | 317184
| | 0 | | | | 396947
| | 1 | | | | 475504
| | 2 | | | | 395852
| | 3 | | | | 318348
(21 rows)
See the cube, rollup and grouping sets documentation for more pivoting variants.
PostgreSQL, HybridDB and Greenplum all currently support this syntax.
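For reference, CUBE and ROLLUP are shorthands for specific GROUPING SETS; a sketch against the same table:

```sql
-- cube(c1,c2) = grouping sets ((c1,c2),(c1),(c2),())
select c1, c2, count(*) from people group by cube (c1, c2);
-- rollup(c1,c2) = grouping sets ((c1,c2),(c1),())
select c1, c2, count(*) from people group by rollup (c1, c2);
```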
3. Transform the result
Use a WITH query to transform the result above:
with tmp as (
select c1,c2,c3,c4,c5,c6,count(*) cnt
from
people
where id in (
select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date
)
GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,())
)
select case
when c1 is not null then 'c1_'||c1
when c2 is not null then 'c2_'||c2
when c3 is not null then 'c3_'||c3
when c4 is not null then 'c4_'||c4
when c5 is not null then 'c5_'||c5
when c6 is not null then 'c6_'||c6
else 'cnt' end AS col,
t1.cnt as private,
t2.cnt as "all",
t1.cnt::numeric/t2.cnt as ratio
from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2
;
col | private | all | ratio
------+---------+---------+------------------------
c2_0 | 555530 | 1586651 | 0.35012740672019240526
c2_1 | 555525 | 1586651 | 0.35012425542857250901
c2_2 | 475596 | 1586651 | 0.29974833785123508572
cnt | 1586651 | 1586651 | 1.00000000000000000000
c4_0 | 554079 | 1586651 | 0.34921290189209851442
c4_1 | 555864 | 1586651 | 0.35033791300040147455
c4_2 | 476708 | 1586651 | 0.30044918510750001103
c6_0 | 554738 | 1586651 | 0.34962824212760083976
c6_1 | 554843 | 1586651 | 0.34969441925161866094
c6_2 | 477070 | 1586651 | 0.30067733862078049930
c5_0 | 554552 | 1586651 | 0.34951101407934069937
c5_1 | 555073 | 1586651 | 0.34983937866613388830
c5_2 | 477026 | 1586651 | 0.30064960725452541233
c1_0 | 396349 | 1586651 | 0.24980225645085151051
c1_1 | 475616 | 1586651 | 0.29976094301771467071
c1_2 | 397502 | 1586651 | 0.25052894429839958504
c1_3 | 317184 | 1586651 | 0.19990785623303423374
c3_0 | 396947 | 1586651 | 0.25017915092859110163
c3_1 | 475504 | 1586651 | 0.29969035408542899478
c3_2 | 395852 | 1586651 | 0.24948901806383382357
c3_3 | 318348 | 1586651 | 0.20064147692214608001
(21 rows)
Time: 8466.507 ms
perf report
# Events: 8K cycles
#
# Overhead Command Shared Object Symbol
# ........ ........ .................. ...................................................
#
6.29% postgres postgres [.] comparetup_heap
|
--- comparetup_heap
|
|--41.84%-- (nil)
|
|--33.36%-- 0x1
|
|--8.44%-- 0x23e8e
|
|--8.43%-- 0x2
|
--7.93%-- 0x3
5.16% postgres postgres [.] slot_deform_tuple.lto_priv.1138
|
--- slot_deform_tuple.lto_priv.1138
3.82% postgres postgres [.] mergeprereadone
|
--- mergeprereadone
3.79% postgres postgres [.] qsort_ssup
|
--- qsort_ssup
3.51% postgres postgres [.] tuplesort_gettuple_common.lto_priv.1348
|
--- tuplesort_gettuple_common.lto_priv.1348
|
|--32.14%-- 0x1
|
|--22.28%-- 0x2
|
|--18.95%-- (nil)
|
|--11.41%-- 0x10
|
|--5.72%-- 0x3
|
|--1.91%-- 0x3d84d9
|
|--1.91%-- 0xef259
|
|--1.91%-- get_select_query_def.lto_priv.1324
|
|--1.91%-- 0x95c9af
|
--1.88%-- 0x3a0e54
4. Pad the holes with LEFT JOIN (optional)
To fill in the missing buckets, LEFT JOIN against the full bucket list:
select * from (values ('c1_0'),('c1_1'),('c1_2'),('c1_3'),('c1_4'),('c2_0'),('c2_1'),('c2_2'),('c3_0'),('c3_1'),('c3_2'),('c3_3'),('c3_4'),('c4_0'),('c4_1'),('c4_2'),('c5_0'),('c5_1'),('c5_2'),('c6_0'),('c6_1'),('c6_2')) t (col);
col
------
c1_0
c1_1
c1_2
c1_3
c1_4
c2_0
c2_1
c2_2
c3_0
c3_1
c3_2
c3_3
c3_4
c4_0
c4_1
c4_2
c5_0
c5_1
c5_2
c6_0
c6_1
c6_2
(22 rows)
Padding the holes:
with tmp as (
select c1,c2,c3,c4,c5,c6,count(*) cnt
from
people
where id in (
select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date
)
GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,())
),
tmp2 as (
select case
when c1 is not null then 'c1_'||c1
when c2 is not null then 'c2_'||c2
when c3 is not null then 'c3_'||c3
when c4 is not null then 'c4_'||c4
when c5 is not null then 'c5_'||c5
when c6 is not null then 'c6_'||c6
else 'cnt' end AS col,
t1.cnt as private,
t2.cnt as "all",
t1.cnt::numeric/t2.cnt as ratio
from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2
)
select t1.col,coalesce(t2.ratio,0) ratio from (values ('c1_0'),('c1_1'),('c1_2'),('c1_3'),('c1_4'),('c2_0'),('c2_1'),('c2_2'),('c3_0'),('c3_1'),('c3_2'),('c3_3'),('c3_4'),('c4_0'),('c4_1'),('c4_2'),('c5_0'),('c5_1'),('c5_2'),('c6_0'),('c6_1'),('c6_2'))
t1 (col)
left join tmp2 t2
on (t1.col=t2.col)
order by t1.col;
col | ratio
------+------------------------
c1_0 | 0.24980225645085151051
c1_1 | 0.29976094301771467071
c1_2 | 0.25052894429839958504
c1_3 | 0.19990785623303423374
c1_4 | 0
c2_0 | 0.35012740672019240526
c2_1 | 0.35012425542857250901
c2_2 | 0.29974833785123508572
c3_0 | 0.25017915092859110163
c3_1 | 0.29969035408542899478
c3_2 | 0.24948901806383382357
c3_3 | 0.20064147692214608001
c3_4 | 0
c4_0 | 0.34921290189209851442
c4_1 | 0.35033791300040147455
c4_2 | 0.30044918510750001103
c5_0 | 0.34951101407934069937
c5_1 | 0.34983937866613388830
c5_2 | 0.30064960725452541233
c6_0 | 0.34962824212760083976
c6_1 | 0.34969441925161866094
c6_2 | 0.30067733862078049930
(22 rows)
5. Row-column transposition (optional)
To flatten these rows into a single row, use the tablefunc extension; PostgreSQL has plenty of tricks like this.
https://www.postgresql.org/docs/9.6/static/tablefunc.html
create extension tablefunc;
select * from
crosstab($$
with tmp as (
select c1,c2,c3,c4,c5,c6,count(*) cnt
from
people
where id in (
select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date
)
GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,())
),
tmp2 as (
select case
when c1 is not null then 'c1_'||c1
when c2 is not null then 'c2_'||c2
when c3 is not null then 'c3_'||c3
when c4 is not null then 'c4_'||c4
when c5 is not null then 'c5_'||c5
when c6 is not null then 'c6_'||c6
else 'cnt' end AS col,
t1.cnt as private,
t2.cnt as "all",
t1.cnt::numeric/t2.cnt as ratio
from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2
)
select 'row'::text , t1.col,coalesce(t2.ratio,0) ratio from (values ('c1_0'),('c1_1'),('c1_2'),('c1_3'),('c1_4'),('c2_0'),('c2_1'),('c2_2'),('c3_0'),('c3_1'),('c3_2'),('c3_3'),('c3_4'),('c4_0'),('c4_1'),('c4_2'),('c5_0'),('c5_1'),('c5_2'),('c6_0'),('c6_1'),('c6_2'))
t1 (col)
left join tmp2 t2
on (t1.col=t2.col)
order by t1.col
$$
)
as
(
"row" text,
c1_0 numeric,
c1_1 numeric,
c1_2 numeric,
c1_3 numeric,
c1_4 numeric,
c2_0 numeric,
c2_1 numeric,
c2_2 numeric,
c3_0 numeric,
c3_1 numeric,
c3_2 numeric,
c3_3 numeric,
c3_4 numeric,
c4_0 numeric,
c4_1 numeric,
c4_2 numeric,
c5_0 numeric,
c5_1 numeric,
c5_2 numeric,
c6_0 numeric,
c6_1 numeric,
c6_2 numeric
);
-[ RECORD 1 ]---------------
row  | row
c1_0 | 0.24980225645085151051
c1_1 | 0.29976094301771467071
c1_2 | 0.25052894429839958504
c1_3 | 0.19990785623303423374
c1_4 | 0
c2_0 | 0.35012740672019240526
c2_1 | 0.35012425542857250901
c2_2 | 0.29974833785123508572
c3_0 | 0.25017915092859110163
c3_1 | 0.29969035408542899478
c3_2 | 0.24948901806383382357
c3_3 | 0.20064147692214608001
c3_4 | 0
c4_0 | 0.34921290189209851442
c4_1 | 0.35033791300040147455
c4_2 | 0.30044918510750001103
c5_0 | 0.34951101407934069937
c5_1 | 0.34983937866613388830
c5_2 | 0.30064960725452541233
c6_0 | 0.34962824212760083976
c6_1 | 0.34969441925161866094
c6_2 | 0.30067733862078049930
(1 row)
Pivot optimization
1. Indexes (BRIN, GiST, btree_gist)
The crowd is usually filtered on two dimensions: 1. a time range, 2. a geographic range.
Trajectory data is typically written in time order, so the time column correlates well with the physical (heap) storage order; a BRIN index is therefore a good fit.
For BRIN details, see
"PostgreSQL clustered storage and BRIN indexes – high-concurrency behavioral/trajectory data with high query throughput"
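A BRIN index on the time column is tiny compared with a B-tree and exploits that correlation; a minimal sketch (the index name is illustrative):

```sql
create index idx_people_loc_brin on people_loc using brin (crt_time);
-- the range scan below can then skip most heap blocks
select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date;
```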
For fast filtering on geographic position, create a GiST index.
To build a composite index over both dimensions, use the btree_gist extension so that time and position can live in a single GiST index.
create extension btree_gist;
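With btree_gist installed, time and position can share one GiST index; a sketch (assumes the loc geometry column is enabled and PostGIS is installed; the index name is illustrative):

```sql
-- btree_gist supplies the GiST opclass for timestamp, PostGIS supplies it for geometry
create index idx_people_loc_tg on people_loc using gist (crt_time, loc);
```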
2. Recursive optimization
If there are many trajectory points but they mostly belong to the same repeated people, the IN subquery can be optimized with a recursive query.
References:
"Recovering the 618 seconds of lost youth with PostgreSQL – recursive convergence optimization"
"A freaky recursive optimization for distinct xx and count(distinct xx) – index skip scan"
"Accelerating time-series data merges – composite indexes, windowed group queries, recursive acceleration"
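The skip-scan idea can be sketched as a recursive CTE that jumps from one distinct id to the next (a sketch, assuming an index on people_loc(id); the time filter is omitted for brevity):

```sql
with recursive r as (
  (select min(id) as id from people_loc)                     -- smallest id
  union all
  select (select min(id) from people_loc where id > r.id)    -- next distinct id via the index
  from r where r.id is not null
)
select id from r where id is not null;
```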
3. CASE WHEN aggregation. Before the CUBE, GROUPING SETS and ROLLUP syntax used in this article was available, or on databases that lack pivot syntax, CASE WHEN can do the aggregation, but every row has to evaluate every CASE WHEN expression, which burns a lot of CPU.
select
sum(case when c1=0 then 1 else 0 end)/(count(*))::numeric as c1_0,
sum(case when c1=1 then 1 else 0 end)/(count(*))::numeric as c1_1,
sum(case when c1=2 then 1 else 0 end)/(count(*))::numeric as c1_2,
sum(case when c1=3 then 1 else 0 end)/(count(*))::numeric as c1_3,
sum(case when c1=4 then 1 else 0 end)/(count(*))::numeric as c1_4,
sum(case when c2=0 then 1 else 0 end)/(count(*))::numeric as c2_0,
sum(case when c2=1 then 1 else 0 end)/(count(*))::numeric as c2_1,
sum(case when c2=2 then 1 else 0 end)/(count(*))::numeric as c2_2,
sum(case when c3=0 then 1 else 0 end)/(count(*))::numeric as c3_0,
sum(case when c3=1 then 1 else 0 end)/(count(*))::numeric as c3_1,
sum(case when c3=2 then 1 else 0 end)/(count(*))::numeric as c3_2,
sum(case when c3=3 then 1 else 0 end)/(count(*))::numeric as c3_3,
sum(case when c3=4 then 1 else 0 end)/(count(*))::numeric as c3_4,
sum(case when c4=0 then 1 else 0 end)/(count(*))::numeric as c4_0,
sum(case when c4=1 then 1 else 0 end)/(count(*))::numeric as c4_1,
sum(case when c4=2 then 1 else 0 end)/(count(*))::numeric as c4_2,
sum(case when c5=0 then 1 else 0 end)/(count(*))::numeric as c5_0,
sum(case when c5=1 then 1 else 0 end)/(count(*))::numeric as c5_1,
sum(case when c5=2 then 1 else 0 end)/(count(*))::numeric as c5_2,
sum(case when c6=0 then 1 else 0 end)/(count(*))::numeric as c6_0,
sum(case when c6=1 then 1 else 0 end)/(count(*))::numeric as c6_1,
sum(case when c6=2 then 1 else 0 end)/(count(*))::numeric as c6_2
from
people
where id in (
select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date
);
-[ RECORD 1 ]---------------
c1_0 | 0.24980225645085151051
c1_1 | 0.29976094301771467071
c1_2 | 0.25052894429839958504
c1_3 | 0.19990785623303423374
c1_4 | 0.000000000000000000000000
c2_0 | 0.35012740672019240526
c2_1 | 0.35012425542857250901
c2_2 | 0.29974833785123508572
c3_0 | 0.25017915092859110163
c3_1 | 0.29969035408542899478
c3_2 | 0.24948901806383382357
c3_3 | 0.20064147692214608001
c3_4 | 0.000000000000000000000000
c4_0 | 0.34921290189209851442
c4_1 | 0.35033791300040147455
c4_2 | 0.30044918510750001103
c5_0 | 0.34951101407934069937
c5_1 | 0.34983937866613388830
c5_2 | 0.30064960725452541233
c6_0 | 0.34962824212760083976
c6_1 | 0.34969441925161866094
c6_2 | 0.30067733862078049930
(1 row)
Time: 8282.168 ms
perf report
# Events: 8K cycles
#
# Overhead Command Shared Object Symbol
# ........ ........ .................. ...................................................
#
12.15% postgres postgres [.] ExecMakeFunctionResultNoSets
|
--- ExecMakeFunctionResultNoSets
|
--100.00%-- (nil)
7.11% postgres postgres [.] ExecEvalCase
|
--- ExecEvalCase
|
--100.00%-- (nil)
6.85% postgres postgres [.] ExecTargetList.isra.6.lto_priv.1346
|
--- ExecTargetList.isra.6.lto_priv.1346
5.43% postgres postgres [.] ExecProject.constprop.414
|
--- ExecProject.constprop.414
5.37% postgres postgres [.] ExecEvalScalarVarFast
|
--- ExecEvalScalarVarFast
4.35% postgres postgres [.] slot_getattr
|
--- slot_getattr
4.13% postgres postgres [.] advance_aggregates
|
--- advance_aggregates
3.43% postgres postgres [.] slot_deform_tuple.lto_priv.1138
|
--- slot_deform_tuple.lto_priv.1138
3.12% postgres postgres [.] ExecClearTuple
|
--- ExecClearTuple
2.82% postgres postgres [.] IndexNext
|
--- IndexNext
2.45% postgres postgres [.] ExecEvalConst
|
--- ExecEvalConst
|
--100.00%-- (nil)
Summary
1. The cube, grouping sets and rollup syntax makes data pivoting much more convenient.
2. Row-column transposition can be done with the tablefunc extension.
3. Too many CASE WHEN branches impose a heavy CPU cost.
4. Combined with PostGIS, crowd characteristics can be conveniently analyzed along geographic and time dimensions.
5. Alibaba Cloud HybridDB and PostgreSQL both provide all of the above; HybridDB is a distributed data warehouse.