時空資料庫實踐(含紐約TAXI資料透視分析)-PostGIS+TimescaleDB=>PostgreSQL

德哥發表於2018-02-02

資料庫SQL

背景

現實社會中，很多業務產生的資料具有時序資料屬性（在時間維度上順序寫入，同時包括大量時間區間查詢統計的需求）。

例如業務的FEED資料，物聯網產生的時序資料（如氣象感測器、車輛軌跡、等），金融行業的實時資料等等。

PostgreSQL的UDF和BRIN（塊級索引）很適合時序資料的處理。具體有以下的兩個例子

《PostgreSQL 按需切片的實現(TimescaleDB外掛自動切片功能的plpgsql schemaless實現)》

《PostgreSQL 時序最佳實踐 – 證券交易系統資料庫設計 – 阿里雲RDS PostgreSQL最佳實踐》

實際上PostgreSQL生態中，衍生了一個時序外掛：timescaleDB。專門用於處理時序資料。（timescale的改進，包括SQL優化器的改進(支援merge append，時間片聚合非常高效)，rotate介面，自動分片等）

同時timescaleDB也非常受投資者的關注，已獲5000萬美金的投資，也間接說明時序資料庫在未來是非常受使用者歡迎的。

timescaleDB的優勢

首先，timescaleDB是自動切片的，對使用者無感知，在資料量非常龐大的時候，寫入效能不衰減。（主要指IOPS較低的磁碟，如果IOPS較好的磁碟PG在寫入大量資料後效能也是OK的。）

其次，timescale改進了SQL優化器，增加了merge append的執行節點，同時在對小時間片進行group by時，可以不用HASH或GROUP整個資料範圍，而是分片計算，使得效率非常高。

最後，timescale增加了一些API，使得使用者在時序資料的寫入、維護、查詢都非常的高效、同時易於維護。

API如下

http://docs.timescale.com/v0.8/api

部署timescaleDB

以CentOS 7.x x64為例。

1、首先要安裝好PostgreSQL

參考《PostgreSQL on Linux 最佳部署手冊》

export USE_NAMED_POSIX_SEMAPHORES=1  
LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10 --with-segsize=8 --with-wal-segsize=256  
LIBS=-lpthread CFLAGS="-O3" make world -j 64  
LIBS=-lpthread CFLAGS="-O3" make install-world

2、其次需要安裝cmake3

epel  
  
yum install -y cmake3  
  
ln -s /usr/bin/cmake3 /usr/bin/cmake

3、編譯timescaleDB

git clone https://github.com/timescale/timescaledb/  
  
cd timescaledb  
git checkout release-0.8.0  
  
或  
  
wget https://github.com/timescale/timescaledb/archive/0.8.0.tar.gz  
  
  
  
export PATH=/home/digoal/pgsql10/bin:$PATH  
export LD_LIBRARY_PATH=/home/digoal/pgsql10/lib:$LD_LIBRARY_PATH  
  
# Bootstrap the build system  
./bootstrap  
  
cd ./build && make  
  
make install  
  
  
[  2%] Built target sqlupdatefile  
[  4%] Built target sqlfile  
[100%] Built target timescaledb  
Install the project...  
-- Install configuration: "Release"  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb.control  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.8.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.1--0.8.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.1.0--0.2.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.2.0--0.3.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.3.0--0.4.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.0--0.4.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.1--0.4.2.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.2--0.5.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.5.0--0.6.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.0--0.6.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.0--0.7.1.sql  
-- Installing: /home/dege.zzz/pgsql10/lib/timescaledb.so

4、配置postgresql.conf，在資料庫啟動時自動載入timescale lib庫。

vi $PGDATA/postgresql.conf  
shared_preload_libraries = `timescaledb`  
  
pg_ctl restart -m fast

5、對需要使用timescaledb的資料庫，建立外掛.

psql  
psql (10.1)  
Type "help" for help.  
  
postgres=# create extension timescaledb ;

6、timescaledb的相關引數

timescaledb.constraint_aware_append     
timescaledb.disable_optimizations       
timescaledb.optimize_non_hypertables    
timescaledb.restoring    
  
postgres=# show timescaledb.constraint_aware_append ;  
 timescaledb.constraint_aware_append   
-------------------------------------  
 on  
(1 row)  
  
postgres=# show timescaledb.disable_optimizations ;  
 timescaledb.disable_optimizations   
-----------------------------------  
 off  
(1 row)  
  
postgres=# show timescaledb.optimize_non_hypertables ;  
 timescaledb.optimize_non_hypertables   
--------------------------------------  
 off  
(1 row)  
  
postgres=# show timescaledb.restoring ;  
 timescaledb.restoring   
-----------------------  
 off  
(1 row)

timescaleDB使用例子1 – 紐約TAXI資料透視分析

第一個例子是real-life New York City taxicab data ，

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

資料為真實的資料，來自

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

1、下載樣本資料

wget https://timescaledata.blob.core.windows.net/datasets/nyc_data.tar.gz

2、解壓

tar -zxvf nyc_data.tar.gz

3、建表，其中包括將普通錶轉換為時序儲存表的API create_hypertable 的使用。

psql -f nyc_data.sql

擷取一些nyc_data.sql的內容如下：

cat nyc_data.sql  
  
-- 叫車資料: 包括時長、計費、路程、上車、下車經緯度、時間、人數等等。  
  
CREATE TABLE "rides"(  
    vendor_id TEXT,  
    pickup_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,  
    dropoff_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,  
    passenger_count NUMERIC,  
    trip_distance NUMERIC,  
    pickup_longitude  NUMERIC,  
    pickup_latitude   NUMERIC,  
    rate_code         INTEGER,  
    dropoff_longitude NUMERIC,  
    dropoff_latitude  NUMERIC,  
    payment_type INTEGER,  
    fare_amount NUMERIC,  
    extra NUMERIC,  
    mta_tax NUMERIC,  
    tip_amount NUMERIC,  
    tolls_amount NUMERIC,  
    improvement_surcharge NUMERIC,  
    total_amount NUMERIC  
);

這句話，將rides轉換為時序表儲存

SELECT create_hypertable(`rides`, `pickup_datetime`, `payment_type`, 2, create_default_indexes=>FALSE);

建立索引

CREATE INDEX ON rides (vendor_id, pickup_datetime desc);  
CREATE INDEX ON rides (pickup_datetime desc, vendor_id);  
CREATE INDEX ON rides (rate_code, pickup_datetime DESC);  
CREATE INDEX ON rides (passenger_count, pickup_datetime desc);

4、匯入測試資料

psql -c "COPY rides FROM nyc_data_rides.csv CSV"  
COPY 10906858

5、對已轉換為時序儲存表的rides執行一些測試SQL，效能比PostgreSQL普通表要好。

每天同車超過2人的交易，平均計費多少？

-- Average fare amount of rides with 2+ passengers by day  
  
SELECT date_trunc(`day`, pickup_datetime) as day, avg(fare_amount)  
  FROM rides  
  WHERE passenger_count > 1 AND pickup_datetime < `2016-01-08`  
  GROUP BY day ORDER BY day;  
  
        day         |         avg  
--------------------+---------------------  
2016-01-01 00:00:00 | 13.3990821679715529  
2016-01-02 00:00:00 | 13.0224687415181399  
2016-01-03 00:00:00 | 13.5382068607068607  
2016-01-04 00:00:00 | 12.9618895561740149  
2016-01-05 00:00:00 | 12.6614611935518309  
2016-01-06 00:00:00 | 12.5775245695086098  
2016-01-07 00:00:00 | 12.5868802584437019  
(7 rows)

6、某些查詢的效能甚至超過20倍

每天有多少筆交易。

-- Total number of rides by day for first 5 days  
  
SELECT date_trunc(`day`, pickup_datetime) as day, COUNT(*) FROM rides  
  GROUP BY day ORDER BY day  
  LIMIT 5;  
  
        day         | count  
--------------------+--------  
2016-01-01 00:00:00 | 345037  
2016-01-02 00:00:00 | 312831  
2016-01-03 00:00:00 | 302878  
2016-01-04 00:00:00 | 316171  
2016-01-05 00:00:00 | 343251  
(5 rows)

timescale增加了merge append的執行優化，因此在時間片上按小粒度聚合，效率非常高，資料量越大，效能提升的效果越明顯。

For example, TimescaleDB introduces a time-based “merge append” optimization to minimize the number of
groups which must be processed to execute the following (given its knowledge that time is already ordered).

For our 100M row table, this results in query latency that is 396x faster than PostgreSQL (82ms vs. 32566ms).

SELECT date_trunc(`minute`, time) AS minute, max(usage_user)  
  FROM cpu  
  WHERE time < `2017-01-01`  
  GROUP BY minute  
  ORDER BY minute DESC  
  LIMIT 5;

7、執行一些timescaleDB特有的功能函式，例如time_bucket，這裡同樣會用到timescaleDB內建的一些加速演算法。

每5分鐘間隔為一個BUCKET，輸出每個間隔產生了多少筆訂單。

-- Number of rides by 5 minute intervals  
--   (using the TimescaleDB "time_bucket" function)  
  
SELECT time_bucket(`5 minute`, pickup_datetime) as five_min, count(*)  
  FROM rides  
  WHERE pickup_datetime < `2016-01-01 02:00`  
  GROUP BY five_min ORDER BY five_min;  
  
      five_min       | count  
---------------------+-------  
 2016-01-01 00:00:00 |   703  
 2016-01-01 00:05:00 |  1482  
 2016-01-01 00:10:00 |  1959  
 2016-01-01 00:15:00 |  2200  
 2016-01-01 00:20:00 |  2285  
 2016-01-01 00:25:00 |  2291  
 2016-01-01 00:30:00 |  2349  
 2016-01-01 00:35:00 |  2328  
 2016-01-01 00:40:00 |  2440  
 2016-01-01 00:45:00 |  2372  
 2016-01-01 00:50:00 |  2388  
 2016-01-01 00:55:00 |  2473  
 2016-01-01 01:00:00 |  2395  
 2016-01-01 01:05:00 |  2510  
 2016-01-01 01:10:00 |  2412  
 2016-01-01 01:15:00 |  2482  
 2016-01-01 01:20:00 |  2428  
 2016-01-01 01:25:00 |  2433  
 2016-01-01 01:30:00 |  2337  
 2016-01-01 01:35:00 |  2366  
 2016-01-01 01:40:00 |  2325  
 2016-01-01 01:45:00 |  2257  
 2016-01-01 01:50:00 |  2316  
 2016-01-01 01:55:00 |  2250  
(24 rows)

8、執行一些統計分析SQL

每個城市的叫車交易量。

-- Join rides with rates to get more information on rate_code  
  
SELECT rates.description, COUNT(vendor_id) as num_trips FROM rides  
  JOIN rates on rides.rate_code = rates.rate_code  
  WHERE pickup_datetime < `2016-01-08`  
  GROUP BY rates.description ORDER BY rates.description;  
  
      description      | num_trips  
-----------------------+-----------  
 JFK                   |     54832  
 Nassau or Westchester |       967  
 Newark                |      4126  
 group ride            |        17  
 negotiated fare       |      7193  
 standard rate         |   2266401  
(6 rows)

某些城市2016年1月的叫車統計（最長、短距離、平均人數、時長等）

-- Analysis of all JFK and EWR rides in Jan 2016  
  
SELECT rates.description, COUNT(vendor_id) as num_trips,  
    AVG(dropoff_datetime - pickup_datetime) as avg_trip_duration, AVG(total_amount) as avg_total,  
    AVG(tip_amount) as avg_tip, MIN(trip_distance) as min_distance, AVG(trip_distance) as avg_distance, MAX(trip_distance) as max_distance,  
    AVG(passenger_count) as avg_passengers  
  FROM rides  
  JOIN rates on rides.rate_code = rates.rate_code  
  WHERE rides.rate_code in (2,3) AND pickup_datetime < `2016-02-01`  
  GROUP BY rates.description ORDER BY rates.description;  
  
 description | num_trips | avg_trip_duration |      avg_total      |      avg_tip       | min_distance |    avg_distance     | max_distance |   avg_passengers  
-------------+-----------+-------------------+---------------------+--------------------+--------------+---------------------+--------------+--------------------  
 JFK         |    225019 | 00:45:46.822517   | 64.3278115181384683 | 7.3334228220728027 |         0.00 | 17.2602816651038357 |       221.00 | 1.7333869584346211  
 Newark      |     16822 | 00:35:16.157472   | 86.4633688027582927 | 9.5461657353465700 |         0.00 | 16.2706122934252764 |       177.23 | 1.7435501129473309  
(2 rows)

9、資料自動分片與執行計劃

postgres=# d+ rides  
                                                     Table "public.rides"  
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description   
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------  
 vendor_id             | text                        |           |          |         | extended |              |   
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |   
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |   
 passenger_count       | numeric                     |           |          |         | main     |              |   
 trip_distance         | numeric                     |           |          |         | main     |              |   
 pickup_longitude      | numeric                     |           |          |         | main     |              |   
 pickup_latitude       | numeric                     |           |          |         | main     |              |   
 rate_code             | integer                     |           |          |         | plain    |              |   
 dropoff_longitude     | numeric                     |           |          |         | main     |              |   
 dropoff_latitude      | numeric                     |           |          |         | main     |              |   
 payment_type          | integer                     |           |          |         | plain    |              |   
 fare_amount           | numeric                     |           |          |         | main     |              |   
 extra                 | numeric                     |           |          |         | main     |              |   
 mta_tax               | numeric                     |           |          |         | main     |              |   
 tip_amount            | numeric                     |           |          |         | main     |              |   
 tolls_amount          | numeric                     |           |          |         | main     |              |   
 improvement_surcharge | numeric                     |           |          |         | main     |              |   
 total_amount          | numeric                     |           |          |         | main     |              |   
Indexes:  
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)  
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)  
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)  
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)  
Child tables: _timescaledb_internal._hyper_1_1_chunk,  
              _timescaledb_internal._hyper_1_2_chunk,  
              _timescaledb_internal._hyper_1_3_chunk,  
              _timescaledb_internal._hyper_1_4_chunk  
  
其中一個分片的約束如下  
Check constraints:  
    "constraint_1" CHECK (pickup_datetime >= `2015-12-31 00:00:00`::timestamp without time zone AND pickup_datetime < `2016-01-30 00:00:00`::timestamp without time zone)  
    "constraint_2" CHECK (_timescaledb_internal.get_partition_hash(payment_type) >= 1073741823)  
Inherits: rides

-- Peek behind the scenes  
  
postgres=#  select count(*) from rides;  
  count     
----------  
 10906858  
(1 row)  
  
Time: 376.247 ms  
postgres=# explain select count(*) from rides;  
                                                 QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------  
 Finalize Aggregate  (cost=254662.23..254662.24 rows=1 width=8)  
   ->  Gather  (cost=254661.71..254662.22 rows=5 width=8)  
         Workers Planned: 5  
         ->  Partial Aggregate  (cost=253661.71..253661.72 rows=1 width=8)  
               ->  Append  (cost=0.00..247468.57 rows=2477258 width=0)  
                     ->  Parallel Seq Scan on rides  (cost=0.00..0.00 rows=1 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_1_chunk  (cost=0.00..77989.57 rows=863657 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_2_chunk  (cost=0.00..150399.01 rows=1331101 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_3_chunk  (cost=0.00..6549.75 rows=112675 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_4_chunk  (cost=0.00..12530.24 rows=169824 width=0)  
(10 rows)

10、也可以直接查分片

postgres=# select count(*) from  _timescaledb_internal._hyper_1_1_chunk;  
  count    
---------  
 3454961  
(1 row)

分片對使用者完全透明

分片後設資料：

postgres=# dn  
         List of schemas  
         Name          |  Owner     
-----------------------+----------  
 _timescaledb_cache    | postgres  
 _timescaledb_catalog  | postgres  
 _timescaledb_internal | postgres  
 public                | postgres  
(4 rows)

timescaleDB + PostGIS 雙劍合璧 – 時空資料庫

結合時序資料庫timescaleDB外掛，空間資料庫PostGIS外掛。PostgreSQL可以很好的處理空間資料。

1、建立空間資料庫PostGIS建立

create extension postgis;

2、新增空間型別欄位

http://postgis.net/docs/manual-2.4/AddGeometryColumn.html

postgres=# SELECT AddGeometryColumn (`public`,`rides`,`pickup_geom`,2163,`POINT`,2);  
                   addgeometrycolumn                      
--------------------------------------------------------  
 public.rides.pickup_geom  SRID:2163 TYPE:POINT DIMS:2   
(1 row)  
  
postgres=# SELECT AddGeometryColumn (`public`,`rides`,`dropoff_geom`,2163,`POINT`,2);  
                    addgeometrycolumn                      
---------------------------------------------------------  
 public.rides.dropoff_geom  SRID:2163 TYPE:POINT DIMS:2   
(1 row)  
  
postgres=#   
postgres=# d+ rides  
                                                     Table "public.rides"  
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description   
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------  
 vendor_id             | text                        |           |          |         | extended |              |   
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |   
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |   
 passenger_count       | numeric                     |           |          |         | main     |              |   
 trip_distance         | numeric                     |           |          |         | main     |              |   
 pickup_longitude      | numeric                     |           |          |         | main     |              |   
 pickup_latitude       | numeric                     |           |          |         | main     |              |   
 rate_code             | integer                     |           |          |         | plain    |              |   
 dropoff_longitude     | numeric                     |           |          |         | main     |              |   
 dropoff_latitude      | numeric                     |           |          |         | main     |              |   
 payment_type          | integer                     |           |          |         | plain    |              |   
 fare_amount           | numeric                     |           |          |         | main     |              |   
 extra                 | numeric                     |           |          |         | main     |              |   
 mta_tax               | numeric                     |           |          |         | main     |              |   
 tip_amount            | numeric                     |           |          |         | main     |              |   
 tolls_amount          | numeric                     |           |          |         | main     |              |   
 improvement_surcharge | numeric                     |           |          |         | main     |              |   
 total_amount          | numeric                     |           |          |         | main     |              |   
 pickup_geom           | geometry(Point,2163)        |           |          |         | main     |              |   
 dropoff_geom          | geometry(Point,2163)        |           |          |         | main     |              |   
Indexes:  
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)  
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)  
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)  
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)  
Child tables: _timescaledb_internal._hyper_1_1_chunk,  
              _timescaledb_internal._hyper_1_2_chunk,  
              _timescaledb_internal._hyper_1_3_chunk,  
              _timescaledb_internal._hyper_1_4_chunk

3、將資料更新到geometry欄位（實際儲存為兩個自動，分別表示經度和緯度。實際上不更新也沒關係，因為PG支援表示式索引，完全可以使用這兩個欄位，建立表示式空間索引）。

-- Generate the geometry points and write to table  
--   (Note: These calculations might take a few mins)  
  
UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);  
UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);  
  
  
vacuum full rides;

4、時空分析舉例。

在(lat, long) (40.7589,-73.9851)附近400米範圍內，每30分鐘有多少輛車被叫（以上車位置來計算）。

-- Number of rides on New Years Eve originating within  
--   400m of Times Square, by 30 min buckets  
--   Note: Times Square is at (lat, long) (40.7589,-73.9851)  
  
SELECT time_bucket(`30 minutes`, pickup_datetime) AS thirty_min, COUNT(*) AS near_times_sq  
  FROM rides  
  WHERE ST_Distance(pickup_geom, ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)) < 400  
    AND pickup_datetime < `2016-01-01 14:00`  
  GROUP BY thirty_min ORDER BY thirty_min;  
  
     thirty_min      | near_times_sq  
---------------------+--------------  
 2016-01-01 00:00:00 |      74  
 2016-01-01 00:30:00 |     102  
 2016-01-01 01:00:00 |     120  
 2016-01-01 01:30:00 |      98  
 2016-01-01 02:00:00 |     112  
 2016-01-01 02:30:00 |     109  
 2016-01-01 03:00:00 |     163  
 2016-01-01 03:30:00 |     181  
 2016-01-01 04:00:00 |     214  
 2016-01-01 04:30:00 |     185  
 2016-01-01 05:00:00 |     158  
 2016-01-01 05:30:00 |     113  
 2016-01-01 06:00:00 |     102  
 2016-01-01 06:30:00 |      91  
 2016-01-01 07:00:00 |      88  
 2016-01-01 07:30:00 |      58  
 2016-01-01 08:00:00 |      72  
 2016-01-01 08:30:00 |      94  
 2016-01-01 09:00:00 |     115  
 2016-01-01 09:30:00 |     118  
 2016-01-01 10:00:00 |     135  
 2016-01-01 10:30:00 |     160  
 2016-01-01 11:00:00 |     212  
 2016-01-01 11:30:00 |     229  
 2016-01-01 12:00:00 |     244  
 2016-01-01 12:30:00 |     230  
 2016-01-01 13:00:00 |     235  
 2016-01-01 13:30:00 |     238

例項2 – 感測器資料、天氣資料

http://docs.timescale.com/v0.8/tutorials/other-sample-datasets

不再贅述。

timescaleDB 常用API

http://docs.timescale.com/v0.8/api

1、建立時序表

create_hypertable()

Required Arguments

Name	Description
main_table	Identifier of table to convert to hypertable
time_column_name	Name of the column containing time values

Optional Arguments

Name	Description
partitioning_column	Name of an additional column to partition by. If provided, number_partitions must be set.
number_partitions	Number of hash partitions to use for partitioning_column when this optional argument is supplied. Must be > 0.
chunk_time_interval	Interval in event time that each chunk covers. Must be > 0. Default is 1 month.
create_default_indexes	Boolean whether to create default indexes on time/partitioning columns. Default is TRUE.
if_not_exists	Boolean whether to print warning if table already converted to hypertable or raise exception. Default is FALSE.
partitioning_func	The function to use for calculating a value`s partition.

2、新增多級分片欄位

支援hash和interval分片

add_dimension()

Required Arguments

Name	Description
main_table	Identifier of hypertable to add the dimension to.
column_name	Name of the column to partition by.

Optional Arguments

Name	Description
number_partitions	Number of hash partitions to use on column_name. Must be > 0.
interval_length	Interval that each chunk covers. Must be > 0.
partitioning_func	The function to use for calculating a value`s partition (see create_hypertable instructions).

3、刪除分片

刪除指定時間點、多久之前的分片

drop_chunks()

Required Arguments

Name	Description
older_than	Timestamp of cut-off point for data to be dropped, i.e., anything older than this should be removed.

Optional Arguments

Name	Description
table_name	Hypertable name from which to drop chunks. If not supplied, all hypertables are affected.
schema_name	Schema name of the hypertable from which to drop chunks. Defaults to public.
cascade	Boolean on whether to CASCADE the drop on chunks, therefore removing dependent objects on chunks to be removed. Defaults to FALSE.

4、設定分片時間區間

set_chunk_time_interval()

Required Arguments

Name	Description
main_table	Identifier of hypertable to update interval for.
chunk_time_interval	Interval in event time that each new chunk covers. Must be > 0.

5、分析函式 – 第一條

first()

Required Arguments

Name	Description
value	The value to return (anyelement)
time	The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

例如，查詢所有感測器的最早上傳的溫度值。

SELECT device_id, first(temp, time)  
  FROM metrics  
  GROUP BY device_id;

使用遞迴亦可實現：

《PostgrSQL 遞迴SQL的幾個應用 – 極客與正常人的思維》

6、分析函式 – 最後一條

last()

Required Arguments

Name	Description
value	The value to return (anyelement)
time	The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

例如，查詢每5分鐘時間區間內，每個感測器的最新溫度值

SELECT device_id, time_bucket(`5 minutes`, time) as interval,  
  last(temp, time)  
  FROM metrics  
  WHERE time > now () - interval `1 day`  
  GROUP BY device_id, interval  
  ORDER BY interval DESC;

使用遞迴亦可實現：

《PostgrSQL 遞迴SQL的幾個應用 – 極客與正常人的思維》

7、分析函式 – 柱狀圖

histogram()

Required Arguments

Name	Description
value	A set of values to partition into a histogram
min	The histogram’s lower bound used in bucketing
max	The histogram’s upper bound used in bucketing
nbuckets	The integer value for the number of histogram buckets (partitions)

例如，

電池電量20到60，均分為5個BUCKET區間，返回5+2個值的陣列（表示每個bucket區間的記錄數），頭尾分為別為邊界外的記錄數有多少。

SELECT device_id, histogram(battery_level, 20, 60, 5)  
  FROM readings  
  GROUP BY device_id  
  LIMIT 10;  
  
 device_id  |          histogram  
------------+------------------------------  
 demo000000 | {0,0,0,7,215,206,572}  
 demo000001 | {0,12,173,112,99,145,459}  
 demo000002 | {0,0,187,167,68,229,349}  
 demo000003 | {197,209,127,221,106,112,28}  
 demo000004 | {0,0,0,0,0,39,961}  
 demo000005 | {12,225,171,122,233,80,157}  
 demo000006 | {0,78,176,170,8,40,528}  
 demo000007 | {0,0,0,126,239,245,390}  
 demo000008 | {0,0,311,345,116,228,0}  
 demo000009 | {295,92,105,50,8,8,442}

8、分析函式 – 時間區間

類似date_trunc，但是更強大，可以用任意interval進行時間截斷。方便使用者使用。

time_bucket()

Required Arguments

Name	Description
bucket_width	A PostgreSQL time interval for how long each bucket is (interval)
time	The timestamp to bucket (timestamp/timestamptz/date)

Optional Arguments

Name	Description
offset	The time interval to offset all buckets by (interval)

9、資料概貌檢視函式 – 時序表概貌

hypertable_relation_size_pretty()

SELECT * FROM hypertable_relation_size_pretty(`conditions`);  
  
 table_size | index_size | toast_size | total_size  
------------+------------+------------+------------  
 1171 MB    | 1608 MB    | 176 kB     | 2779 MB

10、資料概貌檢視函式 – 分片大小

chunk_relation_size_pretty()

SELECT * FROM chunk_relation_size_pretty(`conditions`);  
  
                chunk_table                 | table_size | index_size | total_size  
---------------------------------------------+------------+------------+------------  
 "_timescaledb_internal"."_hyper_1_1_chunk"  | 28 MB      | 36 MB      | 64 MB  
 "_timescaledb_internal"."_hyper_1_2_chunk"  | 57 MB      | 78 MB      | 134 MB  
 ...

11、資料概貌檢視函式 – 索引大小

indexes_relation_size_pretty()

SELECT * FROM indexes_relation_size_pretty(`conditions`);  
  
             index_name_              | total_size  
--------------------------------------+------------  
 public.conditions_device_id_time_idx | 1143 MB  
 public.conditions_time_idx           | 465 MB

12、匯出時序後設資料

https://raw.githubusercontent.com/timescale/timescaledb/master/scripts/dump_meta_data.sql

psql [your connect flags] -d your_timescale_db < dump_meta_data.sql > dumpfile.txt

小結

timescaleDB是一個非常好用的時序資料處理外掛，隱藏了分片邏輯（對使用者透明），同時提供了大量的API函式介面，以及效能優化。在時序場景使用很贊。

結合PostGIS外掛，PostgreSQL在時空處理這塊，如虎添翼。

參考

《PostgreSQL 按需切片的實現(TimescaleDB外掛自動切片功能的plpgsql schemaless實現)》

《時序資料庫有哪些特點? TimescaleDB時序資料庫介紹》

《PostgreSQL on Linux 最佳部署手冊》

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

http://docs.timescale.com/v0.8/introduction

《PostgrSQL 遞迴SQL的幾個應用 – 極客與正常人的思維》

PostgreSQL技術週刊第12期：PostgreSQL時空資料排程實踐
2019-01-05
SQL
奔跑吧，大屏-時間+空間實時四維資料透視
2017-04-17
第8章　紐約計程車軌跡的空間和時間資料分析
2015-10-23
PostgreSQL:表空間-->資料庫-->表
2018-09-24
SQL資料庫
PostgreSQL業務資料質量實時監控實踐
2017-12-09
SQL
Excel資料分析入門-資料透視表
2017-10-17
Excel
【安全告警資料分析之道：一】資料透視篇
2021-05-31
資料庫實踐
2012-06-01
資料庫
實時資料庫與時序資料庫
2020-11-28
資料庫
資料庫實踐丨MySQL多表join分析
2020-09-03
資料庫MySql
檢視資料庫表空間
2011-02-27
資料庫
PostgreSQL資料庫匯入大量資料時如何最佳化
2022-09-01
SQL資料庫
JUST京東城市時空資料引擎2.0架構實踐
2021-10-25
架構
時序資料庫分析-TimescaleDB時序資料庫介紹
2017-04-10
資料庫
資料視覺化實踐
2018-10-25
視覺化
如何透過SQLyog分析MySQL資料庫
2020-05-05
MySql資料庫
資料庫新兵：分散式實時分析記憶體資料庫eSight
2018-12-17
資料庫分散式記憶體
如何透過一條資料庫語句做資料分析
2019-12-26
資料庫
透過shell指令碼檢視資料庫表空間使用情況
2015-04-25
指令碼資料庫
Mysql資料實時同步實踐
2022-12-05
MySql
python-資料分析-Pandas-4、DataFrame-資料透視
2024-06-10
Python
DM資料庫操作實踐
2022-01-12
資料庫
向量資料庫落地實踐
2024-04-03
資料庫
Excel資料透視表怎麼做 Excel資料透視表技巧
2022-02-28
Excel
.NET雲原生應用實踐（三）：連線到PostgreSQL資料庫
2024-10-22
SQL資料庫
基石視覺化資料分析平臺設計實踐
2023-12-22
視覺化
基於雲原生的大資料實時分析方案實踐
2021-09-09
大資料
Uber永久定位系統實時資料分析過程實踐！
2018-08-20
postgresql 資料庫基本操作
2018-06-19
SQL資料庫
postgresql資料庫利用方式
2018-06-04
SQL資料庫
postgresql 資料庫基本管理
2024-06-09
SQL資料庫
資料庫_SQL-PostgreSQL
2024-07-14
資料庫SQL
PostgreSQL-資料庫命令
2024-08-19
SQL資料庫
資料庫學習:透過作業定時同步兩個資料庫(轉)
2007-08-11
資料庫
PostgreSQL資料庫將迎來崛起時刻HY
2022-03-21
SQL資料庫
ETL資料整合丨PostgreSQL資料遷移至Hive資料庫
2024-08-15
SQLHive資料庫
PHP最佳實踐之資料庫
2019-02-16
PHP資料庫
資料庫之鎖的實踐
2016-07-09
資料庫

時空資料庫實踐(含紐約TAXI資料透視分析)-PostGIS+TimescaleDB=>PostgreSQL

標籤

背景

timescaleDB的優勢

部署timescaleDB

timescaleDB使用例子1 – 紐約TAXI資料透視分析

分片對使用者完全透明

timescaleDB + PostGIS 雙劍合璧 – 時空資料庫

例項2 – 感測器資料、天氣資料

timescaleDB 常用API

1、建立時序表

2、新增多級分片欄位

3、刪除分片

4、設定分片時間區間

5、分析函式 – 第一條

6、分析函式 – 最後一條

7、分析函式 – 柱狀圖

8、分析函式 – 時間區間

9、資料概貌檢視函式 – 時序表概貌

10、資料概貌檢視函式 – 分片大小

11、資料概貌檢視函式 – 索引大小

12、匯出時序後設資料

小結

參考

相關文章