(流式、lambda、觸發器)實時處理大比拼-物聯網(IoT)金融,時序處理最佳實踐
標籤
PostgreSQL , 物聯網 , 感測器 , lambda , 排程 , 實時 , 流式更新 , UPSERT , insert on conflict do update
背景
越來越多的資料要求實時的分析、聚合、展示最新值、展示異常值、實時的搜尋。
例如 金融資料、物聯網感測器的資料、網路遊戲的線上資料等等。
關於實時搜尋,可以參考這篇最佳實踐:
關於海量資料的”寫入、共享、儲存、計算”,以及離線分析,則可以參考這篇最佳實踐:
關於實時分析、實時更新、實時聚合、實時展示最新值、異常值,是本文的主要內容。
提起實時分析,不得不說流式計算,使用者可以參考本文:
《流計算風雲再起 – PostgreSQL攜PipelineDB力挺IoT》
pipelinedb是一個SQL介面的流計算資料庫,正在進行外掛化的改造,未來可以作為PostgreSQL資料庫的外掛使用。
本文將以感測器資料的實時寫入、實時更新最新值、實時統計為例,分析三種不同的方案(流式、lambda式、同步實時)的優缺點。
場景設計
有一百萬個感測器,每個感測器定期上報資料,使用者需求:
1. 實時的檢視感測器的最新值,
2. 實時按時間段檢視感測器歷史資料的統計值。
3. 實時檢視感測器的歷史明細資料。
4. 實時按其他維度檢視感測器歷史資料的統計值。
由於資料量可能非常龐大(100TB級),為了實現這4個需求,要求統計資料需要實時或準實時的被計算出來。
表結構設計
明細資料
create table sensor_data(
pk serial8 primary key, -- 主鍵
ts timestamp, -- 時間戳
sid int, -- 感測器ID
val numeric(10,2) -- 資料
);
實時聚合設計
1. 每個感測器最後的value
create table sensor_lastdata(
sid int primary key, -- 感測器ID,主鍵
last_ts timestamp, -- 時間戳
last_val numeric(10,2) -- 值
);
2. 每個感測器每個時段(例如小時)的所有值,總和,記錄數,最大值,最小值,平均值,方差。
create table sensor_aggdata(
sid int, -- 感測器ID
ts_group varchar(10), -- 時間維度分組,例如小時(yyyymmddhh24)
sum_val numeric, -- 和
min_val numeric(10,2), -- 最小值
max_val numeric(10,2), -- 最大值
avg_val numeric(10,2), -- 平均值
count_val int, -- 計數
all_vals numeric(10,2)[], -- 明細值
unique (sid,ts_group) -- 唯一約束
);
3. 按地域或其他維度,實時統計感測器上報的資料
略
如何從明細資料取感測器的最新值
取出每個感測器ID的最新值。使用SQL來取,有兩種方法,一種是聚合,另一種是視窗函式。
插入一批測試資料
postgres=# insert into sensor_data(ts,sid,val) select clock_timestamp(), random()*100, random()*10000 from generate_series(1,100000);
方法1,聚合。
按SID分組,將VAL聚合為陣列(按PK逆序排序),取陣列的第一個VALUE。
參考用法:https://www.postgresql.org/docs/9.6/static/functions-aggregate.html
postgres=# select sid, (array_agg(ts order by pk desc))[1] as last_ts, (array_agg(val order by pk desc))[1] as last_val from sensor_data group by sid;
sid | last_ts | last_val
-----+----------------------------+----------
0 | 2017-05-18 14:09:10.625812 | 6480.54
1 | 2017-05-18 14:09:10.627607 | 9644.29
2 | 2017-05-18 14:09:10.627951 | 3995.04
3 | 2017-05-18 14:09:10.627466 | 840.80
4 | 2017-05-18 14:09:10.627703 | 1500.59
5 | 2017-05-18 14:09:10.627813 | 3109.42
6 | 2017-05-18 14:09:10.62754 | 4131.31
7 | 2017-05-18 14:09:10.627851 | 9333.88
......
方法2,視窗。
postgres=# select sid,ts,val from (select sid,ts,val,row_number() over(partition by sid order by pk desc) as rn from sensor_data) t where rn=1;
sid | ts | val
-----+----------------------------+---------
0 | 2017-05-18 14:09:10.625812 | 6480.54
1 | 2017-05-18 14:09:10.627607 | 9644.29
2 | 2017-05-18 14:09:10.627951 | 3995.04
3 | 2017-05-18 14:09:10.627466 | 840.80
4 | 2017-05-18 14:09:10.627703 | 1500.59
5 | 2017-05-18 14:09:10.627813 | 3109.42
6 | 2017-05-18 14:09:10.62754 | 4131.31
7 | 2017-05-18 14:09:10.627851 | 9333.88
......
這兩種方法哪種好一點呢?請看執行計劃
postgres=# set work_mem =`16MB`;
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select sid, (array_agg(ts order by pk desc))[1] as last_ts, (array_agg(val order by pk desc))[1] as last_val from sensor_data group by sid;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=7117.15..7823.57 rows=101 width=44) (actual time=29.628..88.095 rows=101 loops=1)
Output: sid, (array_agg(ts ORDER BY pk DESC))[1], (array_agg(val ORDER BY pk DESC))[1]
Group Key: sensor_data.sid
Buffers: shared hit=736
-> Sort (cost=7117.15..7293.38 rows=70490 width=26) (actual time=29.273..36.249 rows=70490 loops=1)
Output: sid, ts, pk, val
Sort Key: sensor_data.sid
Sort Method: quicksort Memory: 8580kB
Buffers: shared hit=736
-> Seq Scan on public.sensor_data (cost=0.00..1440.90 rows=70490 width=26) (actual time=0.243..9.768 rows=70490 loops=1)
Output: sid, ts, pk, val
Buffers: shared hit=736
Planning time: 0.077 ms
Execution time: 88.489 ms
(14 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select sid,ts,val from (select sid,ts,val,row_number() over(partition by sid order by pk desc) as rn from sensor_data) t where rn=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t (cost=7117.15..9408.08 rows=352 width=18) (actual time=46.074..81.377 rows=101 loops=1)
Output: t.sid, t.ts, t.val
Filter: (t.rn = 1)
Rows Removed by Filter: 70389
Buffers: shared hit=736
-> WindowAgg (cost=7117.15..8526.95 rows=70490 width=34) (actual time=46.072..76.115 rows=70490 loops=1)
Output: sensor_data.sid, sensor_data.ts, sensor_data.val, row_number() OVER (?), sensor_data.pk
Buffers: shared hit=736
-> Sort (cost=7117.15..7293.38 rows=70490 width=26) (actual time=46.065..51.742 rows=70490 loops=1)
Output: sensor_data.sid, sensor_data.pk, sensor_data.ts, sensor_data.val
Sort Key: sensor_data.sid, sensor_data.pk DESC
Sort Method: quicksort Memory: 8580kB
Buffers: shared hit=736
-> Seq Scan on public.sensor_data (cost=0.00..1440.90 rows=70490 width=26) (actual time=0.245..9.863 rows=70490 loops=1)
Output: sensor_data.sid, sensor_data.pk, sensor_data.ts, sensor_data.val
Buffers: shared hit=736
Planning time: 0.100 ms
Execution time: 82.480 ms
(18 rows)
實時更新、統計 – 設計與壓測
1. lambda
lambda方式,感測器資料寫入明細表,以任務排程的方式,從明細表取出資料並刪除,將取出的資料進行增量統計,合併到統計結果中。
統計維度可能較多,為了並行,剝離資料獲取和刪除部分的功能。
批量獲取並刪除明細資料,按pk排序,批量獲取若干條。
函式如下:
create or replace function get_sensor_data(i_limit int) returns sensor_data[] as $$
declare
arr_pk int8[];
arr_sensor_data sensor_data[];
begin
select array_agg(t.sensor_data), array_agg((t.sensor_data).pk)
into arr_sensor_data, arr_pk
from (select sensor_data from sensor_data order by pk limit i_limit for update skip locked) t ;
delete from sensor_data WHERE pk = any (arr_pk);
return arr_sensor_data;
end;
$$ language plpgsql strict;
明細資料獲取到之後,繼續下一步的動作。
存在則更新,不存在則插入,採用PostgreSQL的insert on conflict語法。
1. 實時更新感測器的最新值
insert into sensor_lastdata
select sid, (array_agg(ts order by pk desc))[1] as last_ts, (array_agg(val order by pk desc))[1] as last_val from
unnest(get_sensor_data(1000))
group by sid
on conflict (sid) do update set last_ts=excluded.last_ts,last_val=excluded.last_val;
2. 批量增量統計感測器的值
統計值的合併方法請關注SQL內容,明細資料按SID聚合為陣列按PK順序存放。
insert into sensor_aggdata (sid,ts_group,sum_val,min_val,max_val,avg_val,count_val,all_vals)
select sid,to_char(ts,`yyyymmddhh24`),sum(val),min(val),max(val),avg(val),count(val),array_agg(val order by pk) from unnest(get_sensor_data(1000))
group by sid,to_char(ts,`yyyymmddhh24`)
on conflict (sid,ts_group) do update set
sum_val=sensor_aggdata.sum_val+excluded.sum_val,
min_val=least(sensor_aggdata.min_val, excluded.min_val),
max_val=greatest(sensor_aggdata.max_val, excluded.max_val),
avg_val=(sensor_aggdata.sum_val+excluded.sum_val)/(sensor_aggdata.count_val+excluded.count_val),
count_val=sensor_aggdata.count_val+excluded.count_val,
all_vals=array_cat(sensor_aggdata.all_vals, excluded.all_vals);
壓測
create table sensor_data(
pk serial8 primary key, -- 主鍵
ts timestamp, -- 時間戳
sid int, -- 感測器ID
val numeric(10,2) -- 資料
);
create table sensor_lastdata(
sid int primary key, -- 感測器ID,主鍵
last_ts timestamp, -- 時間戳
last_val numeric(10,2) -- 值
);
create table sensor_aggdata(
sid int, -- 感測器ID
ts_group varchar(10), -- 時間維度分組,例如小時(yyyymmddhh24)
sum_val numeric, -- 和
min_val numeric(10,2), -- 最小值
max_val numeric(10,2), -- 最大值
avg_val numeric(10,2), -- 平均值
count_val int, -- 計數
all_vals numeric(10,2)[], -- 明細值
unique (sid,ts_group) -- 唯一約束
);
壓測1,實時寫入,並實時更新感測器的最新值
vi ins.sql
set sid random(1,1000000)
insert into sensor_data(ts,sid,val) values (clock_timestamp(), :sid, random()*1000);
每次合併5萬條
vi lambda1.sql
insert into sensor_lastdata select sid, (array_agg(ts order by pk desc))[1] as last_ts, (array_agg(val order by pk desc))[1] as last_val from unnest(get_sensor_data(50000)) group by sid on conflict (sid) do update set last_ts=excluded.last_ts,last_val=excluded.last_val;
寫入約10萬條/s。
pgbench -M prepared -n -r -P 1 -f ./ins.sql -c 64 -j 64 -T 120
transaction type: ./ins.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 120 s
number of transactions actually processed: 12742596
latency average = 0.603 ms
latency stddev = 2.163 ms
tps = 106184.095420 (including connections establishing)
tps = 106188.650794 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 set sid random(1,1000000)
0.602 insert into sensor_data(ts,sid,val) values (clock_timestamp(), :sid, random()*1000);
增量消費,並更新最新值約5萬條/s。
pgbench -M prepared -n -r -P 1 -f ./lambda1.sql -c 1 -j 1 -T 1200
progress: 236.0 s, 1.0 tps, lat 649.196 ms stddev 0.000
progress: 237.0 s, 2.0 tps, lat 868.952 ms stddev 6.024
progress: 238.0 s, 1.0 tps, lat 728.553 ms stddev 0.000
progress: 239.0 s, 258.1 tps, lat 5.335 ms stddev 44.167
progress: 240.0 s, 850.9 tps, lat 0.983 ms stddev 14.506
progress: 241.0 s, 7962.2 tps, lat 0.146 ms stddev 3.672
progress: 242.0 s, 13488.1 tps, lat 0.074 ms stddev 0.006
postgres=# select count(*) from sensor_data;
count
-------
0
(1 row)
postgres=# select * from sensor_lastdata limit 10;
sid | last_ts | last_val
------+----------------------------+----------
672 | 2017-05-18 16:33:43.569255 | 196.01
178 | 2017-05-18 16:33:31.23651 | 593.16
686 | 2017-05-18 16:33:38.792138 | 762.95
4906 | 2017-05-18 16:33:43.498217 | 150.13
544 | 2017-05-18 16:33:45.338635 | 410.31
165 | 2017-05-18 16:33:28.393902 | 678.75
625 | 2017-05-18 16:33:37.077898 | 229.06
1316 | 2017-05-18 16:33:45.218268 | 27.55
3091 | 2017-05-18 16:33:33.320828 | 697.75
340 | 2017-05-18 16:33:31.567852 | 24.18
(10 rows)
每批統計10萬時,效能可以略微提升
progress: 211.0 s, 1.0 tps, lat 1428.401 ms stddev 0.000
progress: 212.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 213.0 s, 1.0 tps, lat 1375.766 ms stddev 0.000
progress: 214.0 s, 2665.9 tps, lat 0.699 ms stddev 23.234
progress: 215.0 s, 8963.1 tps, lat 0.083 ms stddev 0.008
progress: 216.0 s, 1699.4 tps, lat 0.741 ms stddev 12.434
progress: 217.0 s, 13247.9 tps, lat 0.075 ms stddev 0.006
壓測2,實時寫入,並批量增量統計感測器的值
每次合併10萬條
vi lambda2.sql
insert into sensor_aggdata (sid,ts_group,sum_val,min_val,max_val,avg_val,count_val,all_vals) select sid,to_char(ts,`yyyymmddhh24`),sum(val),min(val),max(val),avg(val),count(val),array_agg(val order by pk) from unnest(get_sensor_data(100000)) group by sid,to_char(ts,`yyyymmddhh24`) on conflict (sid,ts_group) do update set sum_val=sensor_aggdata.sum_val+excluded.sum_val, min_val=least(sensor_aggdata.min_val, excluded.min_val), max_val=greatest(sensor_aggdata.max_val, excluded.max_val), avg_val=(sensor_aggdata.sum_val+excluded.sum_val)/(sensor_aggdata.count_val+excluded.count_val), count_val=sensor_aggdata.count_val+excluded.count_val, all_vals=array_cat(sensor_aggdata.all_vals, excluded.all_vals);
寫入約10萬條/s。
pgbench -M prepared -n -r -P 1 -f ./ins.sql -c 64 -j 64 -T 120
transaction type: ./ins.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 120 s
number of transactions actually processed: 12753950
latency average = 0.602 ms
latency stddev = 2.733 ms
tps = 106272.985233 (including connections establishing)
tps = 106277.604416 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 set sid random(1,1000000)
0.601 insert into sensor_data(ts,sid,val) values (clock_timestamp(), :sid, random()*1000);
增量消費,並統計約4.4萬條/s。
pgbench -M prepared -n -r -P 1 -f ./lambda2.sql -c 1 -j 1 -T 1200
progress: 287.0 s, 1.0 tps, lat 2107.584 ms stddev 0.000
progress: 288.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 289.0 s, 100.1 tps, lat 29.854 ms stddev 213.634
progress: 290.0 s, 1855.0 tps, lat 0.540 ms stddev 5.677
progress: 291.0 s, 8447.0 tps, lat 0.118 ms stddev 0.005
postgres=# select * from sensor_aggdata limit 10;
sid | ts_group | sum_val | min_val | max_val | avg_val | count_val | all_vals
--------+------------+----------+---------+---------+---------+-----------+--------------------------------------------------------------------------------
6 | 2017051816 | 1842.71 | 42.47 | 577.09 | 307.12 | 6 | {42.47,559.47,577.09,193.62,75.74,394.32}
2 | 2017051816 | 5254.01 | 69.98 | 861.77 | 437.83 | 12 | {628.03,77.15,662.74,69.98,337.83,563.70,750.44,423.81,158.27,861.77,649.27,71.02}
226 | 2017051816 | 2756.42 | 144.00 | 680.45 | 344.55 | 8 | {350.57,144.00,194.23,352.52,680.45,302.66,420.01,311.98}
509 | 2017051816 | 6235.10 | 44.98 | 939.43 | 566.83 | 11 | {939.43,598.33,741.12,535.66,44.98,732.00,694.66,440.00,327.80,312.98,868.14}
20 | 2017051816 | 4684.00 | 7.01 | 878.64 | 425.82 | 11 | {209.70,288.67,76.35,544.31,289.33,7.01,841.21,878.64,418.05,651.01,479.72}
934042 | 2017051816 | 10210.41 | 46.44 | 945.59 | 486.21 | 21 | {235.86,656.24,450.73,945.59,932.06,256.10,46.44,903.74,694.43,713.79,523.25,325.82,333.67,603.01,743.63,137.48,238.60,321.65,466.50,70.49,611.33}
960 | 2017051816 | 3621.60 | 20.59 | 895.01 | 603.60 | 6 | {347.70,876.07,895.01,20.59,871.64,610.59}
81 | 2017051816 | 4209.38 | 459.06 | 949.42 | 701.56 | 6 | {716.38,949.42,706.20,459.06,613.36,764.96}
723065 | 2017051816 | 7176.00 | 12.37 | 983.84 | 512.57 | 14 | {869.29,715.48,323.42,595.29,983.84,700.06,716.37,741.55,137.88,12.37,334.74,951.94,46.85,46.92}
77 | 2017051816 | 5394.54 | 87.43 | 872.90 | 490.41 | 11 | {301.87,777.52,872.90,219.96,87.43,525.80,308.87,509.80,383.90,608.52,797.97}
(10 rows)
2. 流式計算
流式計算,使用PipelineDB,建立stream(即明細表),然後建立實時更新表,以及統計表。
1. 建立感測器明細資料stream。
create sequence seq; -- 建立PK序列
pipeline=# create stream sensor_data(
pk int8, -- 用於排序,取最新值的PK
ts timestamp, -- 時間戳
sid int, -- 感測器ID
val numeric(10,2) -- 值
);
CREATE STREAM
2. 建立實時更新感測器最新值的CONTINUOUS VIEW
請使用pipelinedb獨有的獲取分組最新值的聚合函式
keyed_max ( key, value )
Returns the value associated with the “highest” key.
keyed_min ( key, value )
Returns the value associated with the “lowest” key.
請勿使用(array_agg(ts order by pk desc))[1]
,pipelinedb對agg(order by)
支援不好
-- pipelinedb目前對agg(order by)支援不佳,測試寫入時報錯
CREATE CONTINUOUS VIEW sensor_lastdata1 AS
select sid, (array_agg(ts order by pk desc))[1] as last_ts, (array_agg(val order by pk desc))[1] as last_val
from sensor_data
group by sid;
-- 1. 請使用這個SQL代替上面的SQL
CREATE CONTINUOUS VIEW sensor_lastdata1 AS
select sid, keyed_max(pk, ts) as last_ts, keyed_max(pk, val) as last_val
from sensor_data
group by sid;
-- pipelinedb目前不支援window function,使用keyed_max, keyed_min代替。
CREATE CONTINUOUS VIEW sensor_lastdata2 AS
select sid,ts as last_ts,val as last_val from sensor_data
where row_number() over(partition by sid order by pk desc)=1;
ERROR: subqueries in continuous views cannot contain window functions
3. 建立實時統計感測器數值,以及明細聚合的CONTINUOUS VIEW
-- pipelinedb目前對agg(order by)支援不佳,測試寫入時報錯
CREATE CONTINUOUS VIEW sensor_aggdata1 AS
select
sid,
to_char(ts,`yyyymmddhh24`) as ts_group,
sum(val) as sum_val,
min(val) as min_val,
max(val) as max_val,
avg(val) as avg_val,
count(val) as count_val,
array_agg(val order by pk) as all_vals
from sensor_data
group by sid,to_char(ts,`yyyymmddhh24`);
-- 2. 請使用這個SQL代替上面的SQL
CREATE CONTINUOUS VIEW sensor_aggdata1 AS
select
sid,
to_char(ts,`yyyymmddhh24`) as ts_group,
sum(val) as sum_val,
min(val) as min_val,
max(val) as max_val,
avg(val) as avg_val,
count(val) as count_val,
jsonb_object_agg (pk, val) as all_vals
from sensor_data
group by sid,to_char(ts,`yyyymmddhh24`);
4. 啟用CONTINUOUS VIEW
pipeline=# activate sensor_lastdata1;
ACTIVATE
pipeline=# activate sensor_aggdata1;
ACTIVATE
壓測
vi ins.sql
set sid random(1,1000000)
insert into sensor_data(pk,ts,sid,val) values (nextval(`seq`), clock_timestamp(), :sid, random()*1000);
pipelinedb目前對agg(order by)支援不佳的報錯,如果你沒有使用替代SQL,會收到如下報錯。
/home/digoal/pgsql10/bin/pgbench -M prepared -n -r -P 1 -f ./ins.sql -c 1 -j 1 -T 100
progress: 1.0 s, 12.0 tps, lat 1.302 ms stddev 0.455
WARNING: a background worker crashed while processing this batch
HINT: Some of the tuples inserted in this batch might have been lost.
progress: 2.0 s, 16.0 tps, lat 70.528 ms stddev 253.719
WARNING: a background worker crashed while processing this batch
HINT: Some of the tuples inserted in this batch might have been lost.
WARNING: a background worker crashed while processing this batch
HINT: Some of the tuples inserted in this batch might have been lost.
WARNING: a background worker crashed while processing this batch
HINT: Some of the tuples inserted in this batch might have been lost.
使用替代SQL,壓測結果:
1. 聚合values,壓測結果:
寫入速度12.7萬/s。
/home/digoal/pgsql10/bin/pgbench -M prepared -n -r -P 1 -f ./ins.sql -c 256 -j 256 -T 100
transaction type: ./ins.sql
scaling factor: 1
query mode: prepared
number of clients: 256
number of threads: 256
duration: 100 s
number of transactions actually processed: 12840629
latency average = 1.994 ms
latency stddev = 14.671 ms
tps = 127857.131372 (including connections establishing)
tps = 127864.890658 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 set sid random(1,1000000)
1.997 insert into sensor_data(pk,ts,sid,val) values (nextval(`seq`), clock_timestamp(), :sid, random()*1000);
pipeline=# select * from sensor_aggdata1 limit 10;
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------
sid | 444427
ts_group | 2017052410
sum_val | 4902.07
min_val | 18.69
max_val | 980.26
avg_val | 445.6427272727272727
count_val | 11
all_vals | {"41971591": 731.45, "42075280": 69.63, "42629210": 980.26, "45243895": 18.69, "45524545": 320.88, "46971341": 741.88, "47036195": 357.47, "47895869": 562.16, "49805560": 136.78, "51753795": 344.00, "53039367": 638.87}
2. 不聚合VALUES,壓測結果:
寫入速度20萬/s。
CREATE CONTINUOUS VIEW sensor_aggdata2 AS
select
sid,
to_char(ts,`yyyymmddhh24`) as ts_group,
sum(val) as sum_val,
min(val) as min_val,
max(val) as max_val,
avg(val) as avg_val,
count(val) as count_val
-- jsonb_object_agg (pk, val) as all_vals
from sensor_data
group by sid,to_char(ts,`yyyymmddhh24`);
/home/digoal/pgsql10/bin/pgbench -M prepared -n -r -P 1 -f ./ins.sql -c 256 -j 256 -T 100
transaction type: ./ins.sql
scaling factor: 1
query mode: prepared
number of clients: 256
number of threads: 256
duration: 100 s
number of transactions actually processed: 20940292
latency average = 1.222 ms
latency stddev = 0.423 ms
tps = 208834.531839 (including connections establishing)
tps = 208854.792937 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 set sid random(1,1000000)
1.222 insert into sensor_data(pk,ts,sid,val) values (nextval(`seq`), clock_timestamp(), :sid, random()*1000);
pipeline=# select * from sensor_aggdata2;
sid | ts_group | sum_val | min_val | max_val | avg_val | count_val
------+------------+-------------+---------+---------+----------------------+-----------
196 | 2017051815 | 11462397.00 | 0.00 | 999.99 | 503.1780948200175593 | 22780
833 | 2017051815 | 11479990.49 | 0.07 | 999.99 | 498.4365443730461966 | 23032
700 | 2017051815 | 11205820.52 | 0.04 | 999.97 | 497.1967574762623125 | 22538
83 | 2017051815 | 11466423.01 | 0.01 | 999.93 | 501.3959075604530150 | 22869
526 | 2017051815 | 11389541.40 | 0.01 | 999.99 | 503.4496485877204615 | 22623
996 | 2017051815 | 11416373.92 | 0.03 | 999.99 | 502.1938996172964413 | 22733
262 | 2017051815 | 11458700.05 | 0.03 | 999.98 | 499.5509656465254163 | 22938
542 | 2017051815 | 11365373.33 | 0.00 | 999.95 | 499.6427366246098387 | 22747
......
3. 實時
實時的寫入明細,同步更新最終狀態。
(同步統計不推薦使用,對寫入的RT效能影響比較大)
實時更新感測器最終狀態表
create table sensor_lastdata(
sid int primary key,
last_ts timestamp,
last_val numeric(10,2)
);
壓測1,更新感測器實時狀態
vi ins.sql
set sid random(1,1000000)
insert into sensor_lastdata values (:sid, now(), random()*1000) on conflict (sid) do update set last_ts=excluded.last_ts,last_val=excluded.last_val;
效能,約18萬/s。
/home/digoal/pgsql10/bin/pgbench -M prepared -n -r -P 1 -f ./ins.sql -c 128 -j 128 -T 100
transaction type: ./ins.sql
scaling factor: 1
query mode: prepared
number of clients: 128
number of threads: 128
duration: 100 s
number of transactions actually processed: 18659587
latency average = 0.686 ms
latency stddev = 2.566 ms
tps = 186557.140033 (including connections establishing)
tps = 186565.458460 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 set sid random(1,1000000)
0.684 insert into sensor_lastdata values (:sid, now(), random()*1000) on conflict (sid) do update set last_ts=excluded.last_ts,last_val=excluded.last_val;
三種方案對比
效能對比
1. 寫入明細記錄的速度
lambda方式:10.6萬/s
流計算方式(含val明細聚合):12.78萬/s
流計算方式(不含val明細聚合):20.8萬/s
2. 更新最終狀態速度
lambda方式:5.98萬/s
實時方式:18.6萬/s
流計算方式:20.8萬/s
3. 統計速度
lambda方式(含val明細聚合):4.4萬/s
流計算方式(含val明細聚合):12.78萬/s
流計算方式(不含val明細聚合):20.8萬/s
優劣與適用場合對比
1. lambda方式
效能中規中矩,通過UDF + 增量排程,支援所有的統計模式。
目前這個方案有成熟的使用者案例(某大資料平臺),支援了每天數TB的資料準實時統計。
同時也期待PG社群開發這樣的功能:
delete from table order by pk limit xxx skip locked returning array_agg(ts),array_agg(val) group by sid;
這種QUERY將以最小的開銷,從資料中刪除並返回一批記錄。相比本例,也許能提升一倍效能。
2. 流計算方式
效能最高,使用也便利,推薦使用。
將來pipelinedb外掛化之後,使用起來就更加方便了。
3. 實時方式
如果只是用來更新最終狀態,建議使用,開發工作量最少,不需要排程。
參考
《PostgreSQL upsert功能(insert on conflict do)的用法》
《PostgreSQL 如何實現upsert與新舊資料自動分離》
《[轉載]postgresql 9.5版本之前實現upsert功能》
《流計算風雲再起 – PostgreSQL攜PipelineDB力挺IoT》
http://docs.pipelinedb.com/streams.html
相關文章
- Java如何使用實時流式計算處理?Java
- 流式處理新秀Flink原理與實踐
- 批處理最佳實踐 - Vlad Mihalcea
- 自然語言處理的最佳實踐自然語言處理
- Kappa:比Lambda更好更靈活的實時處理架構APP架構
- 20個異常處理的最佳實踐
- Java中的異常處理最佳實踐Java
- SparkStreaming實時流處理學習Spark
- 處理Java異常的9個最佳實踐Java
- Java異常處理的9個最佳實踐Java
- Spring Boot統一異常處理最佳實踐Spring Boot
- 處理Java異常的10個最佳實踐Java
- 物聯網路卡沒網怎麼處理
- 如何利用Redis實現延時處理Redis
- 實時資料處理:Kafka 和 FlinkKafka
- Spark Streaming 流式處理Spark
- TiDB 在攜程 | 實時標籤處理平臺優化實踐TiDB優化
- Apache Beam,批處理和流式處理的融合!Apache
- Java異常處理最佳實踐及陷阱防範Java
- Java8特性詳解 lambda表示式(二):流式處理中的lambdaJava
- 【物聯網】IoT與智慧時代
- Flink基礎:實時處理管道與ETL
- 雲原生最佳實踐系列 7:基於 OSS Object FC 實現非結構化檔案實時處理Object
- PostgreSQL流式處理應用實踐-二手商品實時歸類(非同步訊息notify/listen、閱後即焚)SQL非同步
- TDengine 使用者案例合集 | 智慧環保專案的時序資料處理難點與最佳化實踐
- Node.js 服務連線 MongoDB 處理最佳實踐Node.jsMongoDB
- java時間處理Java
- PHP 時間處理PHP
- WebComponent+WebGl的實時影象處理彈幕播放器Web播放器
- Kafka如何實現實時流處理 Part 1 - André MeloKafka
- flink使用Event_time處理實時資料
- 實時影片拍攝處理SDK解決方案
- 第32講:實時處理利器 mitmproxy 的使用MIT
- 金融行業現場故障處理實錄行業
- 聯發科MT8176處理器
- CnosDB:深入瞭解時序資料處理函式函式
- 興盛優選:時序資料如何高效處理?
- 大資料量處理實踐方案整理大資料
- 處理XML資料應用實踐XML