時空資料庫實踐(含紐約TAXI資料透視分析)-PostGIS+TimescaleDB=>PostgreSQL

德哥發表於2018-02-02

標籤

PostgreSQL , 時空資料庫 , 時序資料庫 , 空間資料庫 , timescaleDB , PostGIS


背景

現實社會中,很多業務產生的資料具有時序資料屬性(在時間維度上順序寫入,同時包括大量時間區間查詢統計的需求)。

例如業務的FEED資料,物聯網產生的時序資料(如氣象感測器、車輛軌跡、等),金融行業的實時資料等等。

PostgreSQL的UDF和BRIN(塊級索引)很適合時序資料的處理。具體有以下的兩個例子

《PostgreSQL 按需切片的實現(TimescaleDB外掛自動切片功能的plpgsql schemaless實現)》

《PostgreSQL 時序最佳實踐 – 證券交易系統資料庫設計 – 阿里雲RDS PostgreSQL最佳實踐》

pic

實際上PostgreSQL生態中,衍生了一個時序外掛:timescaleDB。專門用於處理時序資料。(timescale的改進,包括SQL優化器的改進(支援merge append,時間片聚合非常高效),rotate介面,自動分片等)

同時timescaleDB也非常受投資者的關注,已獲5000萬美金的投資,也間接說明時序資料庫在未來是非常受使用者歡迎的。

timescaleDB的優勢

首先,timescaleDB是自動切片的,對使用者無感知,在資料量非常龐大的時候,寫入效能不衰減。(主要指IOPS較低的磁碟,如果IOPS較好的磁碟PG在寫入大量資料後效能也是OK的。)

pic

其次,timescale改進了SQL優化器,增加了merge append的執行節點,同時在對小時間片進行group by時,可以不用HASH或GROUP整個資料範圍,而是分片計算,使得效率非常高。

最後,timescale增加了一些API,使得使用者在時序資料的寫入、維護、查詢都非常的高效、同時易於維護。

API如下

http://docs.timescale.com/v0.8/api

部署timescaleDB

以CentOS 7.x x64為例。

1、首先要安裝好PostgreSQL

參考 《PostgreSQL on Linux 最佳部署手冊》

export USE_NAMED_POSIX_SEMAPHORES=1  
LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10 --with-segsize=8 --with-wal-segsize=256  
LIBS=-lpthread CFLAGS="-O3" make world -j 64  
LIBS=-lpthread CFLAGS="-O3" make install-world  

2、其次需要安裝cmake3

epel  
  
yum install -y cmake3  
  
ln -s /usr/bin/cmake3 /usr/bin/cmake  

3、編譯timescaleDB

git clone https://github.com/timescale/timescaledb/  
  
cd timescaledb  
git checkout release-0.8.0  
  
或  
  
wget https://github.com/timescale/timescaledb/archive/0.8.0.tar.gz  
  
  
  
export PATH=/home/digoal/pgsql10/bin:$PATH  
export LD_LIBRARY_PATH=/home/digoal/pgsql10/lib:$LD_LIBRARY_PATH  
  
# Bootstrap the build system  
./bootstrap  
  
cd ./build && make  
  
make install  
  
  
[  2%] Built target sqlupdatefile  
[  4%] Built target sqlfile  
[100%] Built target timescaledb  
Install the project...  
-- Install configuration: "Release"  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb.control  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.8.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.1--0.8.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.1.0--0.2.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.2.0--0.3.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.3.0--0.4.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.0--0.4.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.1--0.4.2.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.2--0.5.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.5.0--0.6.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.0--0.6.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.0--0.7.1.sql  
-- Installing: /home/dege.zzz/pgsql10/lib/timescaledb.so  

4、配置postgresql.conf,在資料庫啟動時自動載入timescale lib庫。

vi $PGDATA/postgresql.conf  
shared_preload_libraries = `timescaledb`  
  
pg_ctl restart -m fast  

5、對需要使用timescaledb的資料庫,建立外掛.

psql  
psql (10.1)  
Type "help" for help.  
  
postgres=# create extension timescaledb ;  

6、timescaledb的相關引數

timescaledb.constraint_aware_append     
timescaledb.disable_optimizations       
timescaledb.optimize_non_hypertables    
timescaledb.restoring    
  
postgres=# show timescaledb.constraint_aware_append ;  
 timescaledb.constraint_aware_append   
-------------------------------------  
 on  
(1 row)  
  
postgres=# show timescaledb.disable_optimizations ;  
 timescaledb.disable_optimizations   
-----------------------------------  
 off  
(1 row)  
  
postgres=# show timescaledb.optimize_non_hypertables ;  
 timescaledb.optimize_non_hypertables   
--------------------------------------  
 off  
(1 row)  
  
postgres=# show timescaledb.restoring ;  
 timescaledb.restoring   
-----------------------  
 off  
(1 row)  

timescaleDB使用例子1 – 紐約TAXI資料透視分析

第一個例子是real-life New York City taxicab data ,

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

資料為真實的資料,來自

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

1、下載樣本資料

wget https://timescaledata.blob.core.windows.net/datasets/nyc_data.tar.gz  

2、解壓

tar -zxvf nyc_data.tar.gz   

3、建表,其中包括將普通錶轉換為時序儲存表的API create_hypertable 的使用。

psql -f nyc_data.sql  

擷取一些nyc_data.sql的內容如下:

cat nyc_data.sql  
  
-- 叫車資料: 包括時長、計費、路程、上車、下車經緯度、時間、人數等等。  
  
CREATE TABLE "rides"(  
    vendor_id TEXT,  
    pickup_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,  
    dropoff_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,  
    passenger_count NUMERIC,  
    trip_distance NUMERIC,  
    pickup_longitude  NUMERIC,  
    pickup_latitude   NUMERIC,  
    rate_code         INTEGER,  
    dropoff_longitude NUMERIC,  
    dropoff_latitude  NUMERIC,  
    payment_type INTEGER,  
    fare_amount NUMERIC,  
    extra NUMERIC,  
    mta_tax NUMERIC,  
    tip_amount NUMERIC,  
    tolls_amount NUMERIC,  
    improvement_surcharge NUMERIC,  
    total_amount NUMERIC  
);  

這句話,將rides轉換為時序表儲存

SELECT create_hypertable(`rides`, `pickup_datetime`, `payment_type`, 2, create_default_indexes=>FALSE);  

建立索引

CREATE INDEX ON rides (vendor_id, pickup_datetime desc);  
CREATE INDEX ON rides (pickup_datetime desc, vendor_id);  
CREATE INDEX ON rides (rate_code, pickup_datetime DESC);  
CREATE INDEX ON rides (passenger_count, pickup_datetime desc);  

4、匯入測試資料

psql -c "COPY rides FROM nyc_data_rides.csv CSV"  
COPY 10906858  

5、對已轉換為時序儲存表的rides執行一些測試SQL,效能比PostgreSQL普通表要好。

每天同車超過2人的交易,平均計費多少?

-- Average fare amount of rides with 2+ passengers by day  
  
SELECT date_trunc(`day`, pickup_datetime) as day, avg(fare_amount)  
  FROM rides  
  WHERE passenger_count > 1 AND pickup_datetime < `2016-01-08`  
  GROUP BY day ORDER BY day;  
  
        day         |         avg  
--------------------+---------------------  
2016-01-01 00:00:00 | 13.3990821679715529  
2016-01-02 00:00:00 | 13.0224687415181399  
2016-01-03 00:00:00 | 13.5382068607068607  
2016-01-04 00:00:00 | 12.9618895561740149  
2016-01-05 00:00:00 | 12.6614611935518309  
2016-01-06 00:00:00 | 12.5775245695086098  
2016-01-07 00:00:00 | 12.5868802584437019  
(7 rows)  

6、某些查詢的效能甚至超過20倍

每天有多少筆交易。

-- Total number of rides by day for first 5 days  
  
SELECT date_trunc(`day`, pickup_datetime) as day, COUNT(*) FROM rides  
  GROUP BY day ORDER BY day  
  LIMIT 5;  
  
        day         | count  
--------------------+--------  
2016-01-01 00:00:00 | 345037  
2016-01-02 00:00:00 | 312831  
2016-01-03 00:00:00 | 302878  
2016-01-04 00:00:00 | 316171  
2016-01-05 00:00:00 | 343251  
(5 rows)  

timescale增加了merge append的執行優化,因此在時間片上按小粒度聚合,效率非常高,資料量越大,效能提升的效果越明顯。

For example, TimescaleDB introduces a time-based “merge append” optimization to minimize the number of
groups which must be processed to execute the following (given its knowledge that time is already ordered).

For our 100M row table, this results in query latency that is 396x faster than PostgreSQL (82ms vs. 32566ms).

SELECT date_trunc(`minute`, time) AS minute, max(usage_user)  
  FROM cpu  
  WHERE time < `2017-01-01`  
  GROUP BY minute  
  ORDER BY minute DESC  
  LIMIT 5;  

7、執行一些timescaleDB特有的功能函式,例如time_bucket,這裡同樣會用到timescaleDB內建的一些加速演算法。

每5分鐘間隔為一個BUCKET,輸出每個間隔產生了多少筆訂單。

-- Number of rides by 5 minute intervals  
--   (using the TimescaleDB "time_bucket" function)  
  
SELECT time_bucket(`5 minute`, pickup_datetime) as five_min, count(*)  
  FROM rides  
  WHERE pickup_datetime < `2016-01-01 02:00`  
  GROUP BY five_min ORDER BY five_min;  
  
      five_min       | count  
---------------------+-------  
 2016-01-01 00:00:00 |   703  
 2016-01-01 00:05:00 |  1482  
 2016-01-01 00:10:00 |  1959  
 2016-01-01 00:15:00 |  2200  
 2016-01-01 00:20:00 |  2285  
 2016-01-01 00:25:00 |  2291  
 2016-01-01 00:30:00 |  2349  
 2016-01-01 00:35:00 |  2328  
 2016-01-01 00:40:00 |  2440  
 2016-01-01 00:45:00 |  2372  
 2016-01-01 00:50:00 |  2388  
 2016-01-01 00:55:00 |  2473  
 2016-01-01 01:00:00 |  2395  
 2016-01-01 01:05:00 |  2510  
 2016-01-01 01:10:00 |  2412  
 2016-01-01 01:15:00 |  2482  
 2016-01-01 01:20:00 |  2428  
 2016-01-01 01:25:00 |  2433  
 2016-01-01 01:30:00 |  2337  
 2016-01-01 01:35:00 |  2366  
 2016-01-01 01:40:00 |  2325  
 2016-01-01 01:45:00 |  2257  
 2016-01-01 01:50:00 |  2316  
 2016-01-01 01:55:00 |  2250  
(24 rows)  

8、執行一些統計分析SQL

每個城市的叫車交易量。

-- Join rides with rates to get more information on rate_code  
  
SELECT rates.description, COUNT(vendor_id) as num_trips FROM rides  
  JOIN rates on rides.rate_code = rates.rate_code  
  WHERE pickup_datetime < `2016-01-08`  
  GROUP BY rates.description ORDER BY rates.description;  
  
      description      | num_trips  
-----------------------+-----------  
 JFK                   |     54832  
 Nassau or Westchester |       967  
 Newark                |      4126  
 group ride            |        17  
 negotiated fare       |      7193  
 standard rate         |   2266401  
(6 rows)  

某些城市2016年1月的叫車統計(最長、短距離、平均人數、時長等)

-- Analysis of all JFK and EWR rides in Jan 2016  
  
SELECT rates.description, COUNT(vendor_id) as num_trips,  
    AVG(dropoff_datetime - pickup_datetime) as avg_trip_duration, AVG(total_amount) as avg_total,  
    AVG(tip_amount) as avg_tip, MIN(trip_distance) as min_distance, AVG(trip_distance) as avg_distance, MAX(trip_distance) as max_distance,  
    AVG(passenger_count) as avg_passengers  
  FROM rides  
  JOIN rates on rides.rate_code = rates.rate_code  
  WHERE rides.rate_code in (2,3) AND pickup_datetime < `2016-02-01`  
  GROUP BY rates.description ORDER BY rates.description;  
  
 description | num_trips | avg_trip_duration |      avg_total      |      avg_tip       | min_distance |    avg_distance     | max_distance |   avg_passengers  
-------------+-----------+-------------------+---------------------+--------------------+--------------+---------------------+--------------+--------------------  
 JFK         |    225019 | 00:45:46.822517   | 64.3278115181384683 | 7.3334228220728027 |         0.00 | 17.2602816651038357 |       221.00 | 1.7333869584346211  
 Newark      |     16822 | 00:35:16.157472   | 86.4633688027582927 | 9.5461657353465700 |         0.00 | 16.2706122934252764 |       177.23 | 1.7435501129473309  
(2 rows)  

9、資料自動分片與執行計劃

postgres=# d+ rides  
                                                     Table "public.rides"  
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description   
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------  
 vendor_id             | text                        |           |          |         | extended |              |   
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |   
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |   
 passenger_count       | numeric                     |           |          |         | main     |              |   
 trip_distance         | numeric                     |           |          |         | main     |              |   
 pickup_longitude      | numeric                     |           |          |         | main     |              |   
 pickup_latitude       | numeric                     |           |          |         | main     |              |   
 rate_code             | integer                     |           |          |         | plain    |              |   
 dropoff_longitude     | numeric                     |           |          |         | main     |              |   
 dropoff_latitude      | numeric                     |           |          |         | main     |              |   
 payment_type          | integer                     |           |          |         | plain    |              |   
 fare_amount           | numeric                     |           |          |         | main     |              |   
 extra                 | numeric                     |           |          |         | main     |              |   
 mta_tax               | numeric                     |           |          |         | main     |              |   
 tip_amount            | numeric                     |           |          |         | main     |              |   
 tolls_amount          | numeric                     |           |          |         | main     |              |   
 improvement_surcharge | numeric                     |           |          |         | main     |              |   
 total_amount          | numeric                     |           |          |         | main     |              |   
Indexes:  
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)  
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)  
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)  
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)  
Child tables: _timescaledb_internal._hyper_1_1_chunk,  
              _timescaledb_internal._hyper_1_2_chunk,  
              _timescaledb_internal._hyper_1_3_chunk,  
              _timescaledb_internal._hyper_1_4_chunk  
  
其中一個分片的約束如下  
Check constraints:  
    "constraint_1" CHECK (pickup_datetime >= `2015-12-31 00:00:00`::timestamp without time zone AND pickup_datetime < `2016-01-30 00:00:00`::timestamp without time zone)  
    "constraint_2" CHECK (_timescaledb_internal.get_partition_hash(payment_type) >= 1073741823)  
Inherits: rides  
-- Peek behind the scenes  
  
postgres=#  select count(*) from rides;  
  count     
----------  
 10906858  
(1 row)  
  
Time: 376.247 ms  
postgres=# explain select count(*) from rides;  
                                                 QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------  
 Finalize Aggregate  (cost=254662.23..254662.24 rows=1 width=8)  
   ->  Gather  (cost=254661.71..254662.22 rows=5 width=8)  
         Workers Planned: 5  
         ->  Partial Aggregate  (cost=253661.71..253661.72 rows=1 width=8)  
               ->  Append  (cost=0.00..247468.57 rows=2477258 width=0)  
                     ->  Parallel Seq Scan on rides  (cost=0.00..0.00 rows=1 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_1_chunk  (cost=0.00..77989.57 rows=863657 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_2_chunk  (cost=0.00..150399.01 rows=1331101 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_3_chunk  (cost=0.00..6549.75 rows=112675 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_4_chunk  (cost=0.00..12530.24 rows=169824 width=0)  
(10 rows)  

10、也可以直接查分片

postgres=# select count(*) from  _timescaledb_internal._hyper_1_1_chunk;  
  count    
---------  
 3454961  
(1 row)  

分片對使用者完全透明

分片後設資料:

postgres=# dn  
         List of schemas  
         Name          |  Owner     
-----------------------+----------  
 _timescaledb_cache    | postgres  
 _timescaledb_catalog  | postgres  
 _timescaledb_internal | postgres  
 public                | postgres  
(4 rows)  

timescaleDB + PostGIS 雙劍合璧 – 時空資料庫

結合時序資料庫timescaleDB外掛,空間資料庫PostGIS外掛。PostgreSQL可以很好的處理空間資料。

1、建立空間資料庫PostGIS建立

create extension postgis;  

2、新增空間型別欄位

http://postgis.net/docs/manual-2.4/AddGeometryColumn.html

postgres=# SELECT AddGeometryColumn (`public`,`rides`,`pickup_geom`,2163,`POINT`,2);  
                   addgeometrycolumn                      
--------------------------------------------------------  
 public.rides.pickup_geom  SRID:2163 TYPE:POINT DIMS:2   
(1 row)  
  
postgres=# SELECT AddGeometryColumn (`public`,`rides`,`dropoff_geom`,2163,`POINT`,2);  
                    addgeometrycolumn                      
---------------------------------------------------------  
 public.rides.dropoff_geom  SRID:2163 TYPE:POINT DIMS:2   
(1 row)  
  
postgres=#   
postgres=# d+ rides  
                                                     Table "public.rides"  
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description   
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------  
 vendor_id             | text                        |           |          |         | extended |              |   
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |   
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |   
 passenger_count       | numeric                     |           |          |         | main     |              |   
 trip_distance         | numeric                     |           |          |         | main     |              |   
 pickup_longitude      | numeric                     |           |          |         | main     |              |   
 pickup_latitude       | numeric                     |           |          |         | main     |              |   
 rate_code             | integer                     |           |          |         | plain    |              |   
 dropoff_longitude     | numeric                     |           |          |         | main     |              |   
 dropoff_latitude      | numeric                     |           |          |         | main     |              |   
 payment_type          | integer                     |           |          |         | plain    |              |   
 fare_amount           | numeric                     |           |          |         | main     |              |   
 extra                 | numeric                     |           |          |         | main     |              |   
 mta_tax               | numeric                     |           |          |         | main     |              |   
 tip_amount            | numeric                     |           |          |         | main     |              |   
 tolls_amount          | numeric                     |           |          |         | main     |              |   
 improvement_surcharge | numeric                     |           |          |         | main     |              |   
 total_amount          | numeric                     |           |          |         | main     |              |   
 pickup_geom           | geometry(Point,2163)        |           |          |         | main     |              |   
 dropoff_geom          | geometry(Point,2163)        |           |          |         | main     |              |   
Indexes:  
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)  
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)  
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)  
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)  
Child tables: _timescaledb_internal._hyper_1_1_chunk,  
              _timescaledb_internal._hyper_1_2_chunk,  
              _timescaledb_internal._hyper_1_3_chunk,  
              _timescaledb_internal._hyper_1_4_chunk  

3、將資料更新到geometry欄位(實際儲存為兩個自動,分別表示經度和緯度。實際上不更新也沒關係,因為PG支援表示式索引,完全可以使用這兩個欄位,建立表示式空間索引)。

-- Generate the geometry points and write to table  
--   (Note: These calculations might take a few mins)  
  
UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);  
UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);  
  
  
vacuum full rides;  

4、時空分析舉例。

在(lat, long) (40.7589,-73.9851)附近400米範圍內,每30分鐘有多少輛車被叫(以上車位置來計算)。

-- Number of rides on New Years Eve originating within  
--   400m of Times Square, by 30 min buckets  
--   Note: Times Square is at (lat, long) (40.7589,-73.9851)  
  
SELECT time_bucket(`30 minutes`, pickup_datetime) AS thirty_min, COUNT(*) AS near_times_sq  
  FROM rides  
  WHERE ST_Distance(pickup_geom, ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)) < 400  
    AND pickup_datetime < `2016-01-01 14:00`  
  GROUP BY thirty_min ORDER BY thirty_min;  
  
     thirty_min      | near_times_sq  
---------------------+--------------  
 2016-01-01 00:00:00 |      74  
 2016-01-01 00:30:00 |     102  
 2016-01-01 01:00:00 |     120  
 2016-01-01 01:30:00 |      98  
 2016-01-01 02:00:00 |     112  
 2016-01-01 02:30:00 |     109  
 2016-01-01 03:00:00 |     163  
 2016-01-01 03:30:00 |     181  
 2016-01-01 04:00:00 |     214  
 2016-01-01 04:30:00 |     185  
 2016-01-01 05:00:00 |     158  
 2016-01-01 05:30:00 |     113  
 2016-01-01 06:00:00 |     102  
 2016-01-01 06:30:00 |      91  
 2016-01-01 07:00:00 |      88  
 2016-01-01 07:30:00 |      58  
 2016-01-01 08:00:00 |      72  
 2016-01-01 08:30:00 |      94  
 2016-01-01 09:00:00 |     115  
 2016-01-01 09:30:00 |     118  
 2016-01-01 10:00:00 |     135  
 2016-01-01 10:30:00 |     160  
 2016-01-01 11:00:00 |     212  
 2016-01-01 11:30:00 |     229  
 2016-01-01 12:00:00 |     244  
 2016-01-01 12:30:00 |     230  
 2016-01-01 13:00:00 |     235  
 2016-01-01 13:30:00 |     238  

例項2 – 感測器資料、天氣資料

http://docs.timescale.com/v0.8/tutorials/other-sample-datasets

不再贅述。

timescaleDB 常用API

http://docs.timescale.com/v0.8/api

1、建立時序表

create_hypertable()

Required Arguments

Name Description
main_table Identifier of table to convert to hypertable
time_column_name Name of the column containing time values

Optional Arguments

Name Description
partitioning_column Name of an additional column to partition by. If provided, number_partitions must be set.
number_partitions Number of hash partitions to use for partitioning_column when this optional argument is supplied. Must be > 0.
chunk_time_interval Interval in event time that each chunk covers. Must be > 0. Default is 1 month.
create_default_indexes Boolean whether to create default indexes on time/partitioning columns. Default is TRUE.
if_not_exists Boolean whether to print warning if table already converted to hypertable or raise exception. Default is FALSE.
partitioning_func The function to use for calculating a value`s partition.

2、新增多級分片欄位

支援hash和interval分片

add_dimension()

Required Arguments

Name Description
main_table Identifier of hypertable to add the dimension to.
column_name Name of the column to partition by.

Optional Arguments

Name Description
number_partitions Number of hash partitions to use on column_name. Must be > 0.
interval_length Interval that each chunk covers. Must be > 0.
partitioning_func The function to use for calculating a value`s partition (see create_hypertable instructions).

3、刪除分片

刪除指定 時間點、多久 之前的分片

drop_chunks()

Required Arguments

Name Description
older_than Timestamp of cut-off point for data to be dropped, i.e., anything older than this should be removed.

Optional Arguments

Name Description
table_name Hypertable name from which to drop chunks. If not supplied, all hypertables are affected.
schema_name Schema name of the hypertable from which to drop chunks. Defaults to public.
cascade Boolean on whether to CASCADE the drop on chunks, therefore removing dependent objects on chunks to be removed. Defaults to FALSE.

4、設定分片時間區間

set_chunk_time_interval()

Required Arguments

Name Description
main_table Identifier of hypertable to update interval for.
chunk_time_interval Interval in event time that each new chunk covers. Must be > 0.

5、分析函式 – 第一條

first()

Required Arguments

Name Description
value The value to return (anyelement)
time The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

例如,查詢所有感測器的最早上傳的溫度值。

SELECT device_id, first(temp, time)  
  FROM metrics  
  GROUP BY device_id;  

使用遞迴亦可實現:

《PostgrSQL 遞迴SQL的幾個應用 – 極客與正常人的思維》

6、分析函式 – 最後一條

last()

Required Arguments

Name Description
value The value to return (anyelement)
time The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

例如,查詢每5分鐘時間區間內,每個感測器的最新溫度值

SELECT device_id, time_bucket(`5 minutes`, time) as interval,  
  last(temp, time)  
  FROM metrics  
  WHERE time > now () - interval `1 day`  
  GROUP BY device_id, interval  
  ORDER BY interval DESC;  

使用遞迴亦可實現:

《PostgrSQL 遞迴SQL的幾個應用 – 極客與正常人的思維》

7、分析函式 – 柱狀圖

histogram()

Required Arguments

Name Description
value A set of values to partition into a histogram
min The histogram’s lower bound used in bucketing
max The histogram’s upper bound used in bucketing
nbuckets The integer value for the number of histogram buckets (partitions)

例如,

電池電量20到60,均分為5個BUCKET區間,返回5+2個值的陣列(表示每個bucket區間的記錄數),頭尾分為別為邊界外的記錄數有多少。

SELECT device_id, histogram(battery_level, 20, 60, 5)  
  FROM readings  
  GROUP BY device_id  
  LIMIT 10;  
  
 device_id  |          histogram  
------------+------------------------------  
 demo000000 | {0,0,0,7,215,206,572}  
 demo000001 | {0,12,173,112,99,145,459}  
 demo000002 | {0,0,187,167,68,229,349}  
 demo000003 | {197,209,127,221,106,112,28}  
 demo000004 | {0,0,0,0,0,39,961}  
 demo000005 | {12,225,171,122,233,80,157}  
 demo000006 | {0,78,176,170,8,40,528}  
 demo000007 | {0,0,0,126,239,245,390}  
 demo000008 | {0,0,311,345,116,228,0}  
 demo000009 | {295,92,105,50,8,8,442}  

8、分析函式 – 時間區間

類似date_trunc,但是更強大,可以用任意interval進行時間截斷。方便使用者使用。

time_bucket()

Required Arguments

Name Description
bucket_width A PostgreSQL time interval for how long each bucket is (interval)
time The timestamp to bucket (timestamp/timestamptz/date)

Optional Arguments

Name Description
offset The time interval to offset all buckets by (interval)

9、資料概貌檢視函式 – 時序表概貌

hypertable_relation_size_pretty()

SELECT * FROM hypertable_relation_size_pretty(`conditions`);  
  
 table_size | index_size | toast_size | total_size  
------------+------------+------------+------------  
 1171 MB    | 1608 MB    | 176 kB     | 2779 MB  

10、資料概貌檢視函式 – 分片大小

chunk_relation_size_pretty()

SELECT * FROM chunk_relation_size_pretty(`conditions`);  
  
                chunk_table                 | table_size | index_size | total_size  
---------------------------------------------+------------+------------+------------  
 "_timescaledb_internal"."_hyper_1_1_chunk"  | 28 MB      | 36 MB      | 64 MB  
 "_timescaledb_internal"."_hyper_1_2_chunk"  | 57 MB      | 78 MB      | 134 MB  
 ...  

11、資料概貌檢視函式 – 索引大小

indexes_relation_size_pretty()

SELECT * FROM indexes_relation_size_pretty(`conditions`);  
  
             index_name_              | total_size  
--------------------------------------+------------  
 public.conditions_device_id_time_idx | 1143 MB  
 public.conditions_time_idx           | 465 MB  

12、匯出時序後設資料

https://raw.githubusercontent.com/timescale/timescaledb/master/scripts/dump_meta_data.sql

psql [your connect flags] -d your_timescale_db < dump_meta_data.sql > dumpfile.txt  

小結

timescaleDB是一個非常好用的時序資料處理外掛,隱藏了分片邏輯(對使用者透明),同時提供了大量的API函式介面,以及效能優化。在時序場景使用很贊。

結合PostGIS外掛,PostgreSQL在時空處理這塊,如虎添翼。

參考

《PostgreSQL 按需切片的實現(TimescaleDB外掛自動切片功能的plpgsql schemaless實現)》

《時序資料庫有哪些特點? TimescaleDB時序資料庫介紹》

《PostgreSQL on Linux 最佳部署手冊》

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

http://docs.timescale.com/v0.8/introduction

《PostgrSQL 遞迴SQL的幾個應用 – 極客與正常人的思維》


相關文章