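Quickly Testing a Citus Distributed PostgreSQL Cluster on Kubernetes (Distributed Tables, Co-location, Reference Tables, Columnar Storage)

Published by 為少 on 2022-03-24

Original URL: https://www.cnblogs.com/hacker-linner/p/16049278.html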

Prerequisites

This article assumes you have already deployed a distributed PostgreSQL cluster based on the Citus extension on k8s.

Check the Citus cluster (kubectl get po -n citus): 1 Coordinator node + 3 Worker nodes.

NAME                  READY   STATUS    RESTARTS   AGE
citus-coordinator-0   2/2     Running   0          3h55m
citus-worker-0        2/2     Running   0          22m
citus-worker-1        2/2     Running   0          21m
citus-worker-2        2/2     Running   0          21m

Open a shell on the coordinator node (kubectl -n citus exec -it citus-coordinator-0 -- bash) and list the active worker nodes (psql 'host=citus-coordinator user=postgres' -c "SELECT * FROM citus_get_active_worker_nodes();").

                      node_name                      | node_port 
-----------------------------------------------------+-----------
 citus-worker-1.citus-worker.citus.svc.cluster.local |      6432
 citus-worker-2.citus-worker.citus.svc.cluster.local |      6432
 citus-worker-0.citus-worker.citus.svc.cluster.local |      6432
(3 rows)

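Once you have a Citus cluster, you can start creating distributed tables and reference tables and using columnar storage.

Creating Distributed Tables

create_distributed_table transparently shards your table locally or across the worker nodes.

Open the command-line tool: psql 'host=citus-coordinator user=postgres'

Create the table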

CREATE TABLE events (
  device_id bigint,
  event_id bigserial,
  event_time timestamptz default now(),
  data jsonb not null,
  PRIMARY KEY (device_id, event_id)
);

-- Distribute the events table across shards placed locally or on the worker nodes
SELECT create_distributed_table('events', 'device_id');

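After this, queries for a specific device ID are efficiently routed to a single worker node, while queries across device IDs are parallelized across the cluster.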
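To make the routing behaviour concrete, here is a minimal Python sketch of hash distribution. It is only an illustration: the crc32-modulo hash is a stand-in (Citus assigns hash ranges to shards with its own hash function), the names shard_for and shards_touched are hypothetical, and the shard count of 32 is taken from the "Task Count: 32" line in the EXPLAIN output shown later in this article.

```python
import zlib

SHARD_COUNT = 32  # assumption: taken from "Task Count: 32" in the EXPLAIN output


def shard_for(device_id: int, shard_count: int = SHARD_COUNT) -> int:
    """Map a distribution-column value to a shard index (toy hash, not Citus's own)."""
    digest = zlib.crc32(str(device_id).encode("utf-8"))
    return digest % shard_count


def shards_touched(device_ids) -> set:
    """Return the shard indexes a query filtering on these device IDs must visit."""
    return {shard_for(d) for d in device_ids}


# A filter on a single device ID resolves to exactly one shard ...
single = shards_touched([1])
# ... while a query over all 100 device IDs fans out to many shards.
fanout = shards_touched(range(100))
print(len(single), len(fanout))
```

The point of the sketch is that the shard is a pure function of the distribution column, so a query with WHERE device_id = 1 can be sent to one place without consulting the other shards.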

Insert some events

INSERT INTO events (device_id, data)
SELECT s % 100, ('{"measurement":'||random()||'}')::jsonb FROM generate_series(1,1000000) s;
-- INSERT 0 1000000

Get the last 3 events for device 1, routed to a single node

Turn on timing in the command line: postgres=# \timing

SELECT * FROM events WHERE device_id = 1 ORDER BY event_time DESC, event_id DESC LIMIT 3;

 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         1 |   999901 | 2022-03-24 02:30:50.205478+00 | {"measurement": 0.8822990134507691}
         1 |   999801 | 2022-03-24 02:30:50.205478+00 | {"measurement": 0.5239176115816448}
         1 |   999701 | 2022-03-24 02:30:50.205478+00 | {"measurement": 0.9900647926398349}
(3 rows)

Time: 4.779 ms

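Explain the plan for a query that is parallelized across shards. The output below shows the plan for the query on one of the shards and how the aggregation across shards is done

Run the SQL statement: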

EXPLAIN (VERBOSE ON) SELECT count(*) FROM events;

                                               QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------
 Aggregate  (cost=250.00..250.02 rows=1 width=8)
   Output: COALESCE((pg_catalog.sum(remote_scan.count))::bigint, '0'::bigint)
   ->  Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=8)
         Output: remote_scan.count
         Task Count: 32
         Tasks Shown: One of 32
         ->  Task
               Query: SELECT count(*) AS count FROM public.events_102008 events WHERE true
               Node: host=citus-worker-0.citus-worker.citus.svc.cluster.local port=6432 dbname=postgres
               ->  Aggregate  (cost=725.00..725.01 rows=1 width=8)
                     Output: count(*)
                     ->  Seq Scan on public.events_102008 events  (cost=0.00..650.00 rows=30000 width=0)
                           Output: device_id, event_id, event_time, data
(13 rows)

Time: 5.427 ms
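The plan above has two levels: each task runs count(*) on one shard, and the coordinator sums the partial counts, wrapped in COALESCE so that an empty result becomes 0. A rough Python sketch of that two-phase aggregation follows. The function names and the 4-shard layout are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def worker_count(shard_rows: List[dict]) -> int:
    """Per-shard task, comparable to SELECT count(*) on one shard table."""
    return len(shard_rows)


def coordinator_count(shards: Dict[int, List[dict]]) -> int:
    """Coordinator step: sum the partial counts, falling back to 0 like COALESCE."""
    partials = [worker_count(rows) for rows in shards.values()]
    return sum(partials) if partials else 0


# Hypothetical layout: 4 shards holding 3, 0, 5 and 2 rows.
sizes = [3, 0, 5, 2]
shards = {i: [{"event_id": n} for n in range(size)] for i, size in enumerate(sizes)}
total = coordinator_count(shards)
print(total)
```

Only one integer per shard travels back to the coordinator, which is why the aggregate stays cheap even when the shards themselves are large.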

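Creating Distributed Tables with Co-location

Distributed tables that share the same distribution column can be co-located to enable high-performance distributed joins and foreign keys between them. By default, distributed tables are co-located based on the type of the distribution column, but you can define co-location explicitly with the colocate_with argument of create_distributed_table.

Create the table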

CREATE TABLE devices (
  device_id bigint primary key,
  device_name text,
  device_type_id int
);
CREATE INDEX ON devices (device_type_id);

-- Co-locate the devices table with the events table
SELECT create_distributed_table('devices', 'device_id', colocate_with := 'events');

Insert device metadata

INSERT INTO devices (device_id, device_name, device_type_id)
SELECT s, 'device-'||s, 55 FROM generate_series(0, 99) s;

Optional: make sure the application can only insert events for a known device

ALTER TABLE events ADD CONSTRAINT device_id_fk
FOREIGN KEY (device_id) REFERENCES devices (device_id);

Get the average measurement across all devices of type 55, parallelized across shards

SELECT avg((data->>'measurement')::double precision)
FROM events JOIN devices USING (device_id)
WHERE device_type_id = 55;

        avg         
--------------------
 0.4997412230952178
(1 row)

Time: 122.548 ms
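Why the join above can run in parallel is easier to see in a small model. The Python sketch below distributes two tables with the same toy hash, joins shard i of one table only with shard i of the other, and merges the partial sums. It is a simplification under stated assumptions: the crc32-modulo hash, the shard count of 4, the sample rows, and all function names are hypothetical and are not how Citus is implemented internally.

```python
import zlib
from collections import defaultdict

SHARD_COUNT = 4  # small hypothetical shard count to keep the example readable


def shard_for(device_id: int) -> int:
    """Toy stand-in for the hash that both co-located tables share."""
    return zlib.crc32(str(device_id).encode("utf-8")) % SHARD_COUNT


def distribute(rows, key="device_id"):
    """Split rows into shards by hashing the distribution column."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for(row[key])].append(row)
    return shards


def colocated_avg(event_shards, device_shards, device_type_id):
    """Join shard i of events with shard i of devices only, then merge partial sums."""
    total, count = 0.0, 0
    for shard_id in range(SHARD_COUNT):
        wanted = {
            d["device_id"]
            for d in device_shards.get(shard_id, [])
            if d["device_type_id"] == device_type_id
        }
        for e in event_shards.get(shard_id, []):
            if e["device_id"] in wanted:
                total += e["measurement"]
                count += 1
    return total / count if count else None


# Hypothetical data: even device IDs are type 55, each device has 3 events.
devices = [{"device_id": d, "device_type_id": 55 if d % 2 == 0 else 7} for d in range(10)]
events = [{"device_id": d, "measurement": float(d)} for d in range(10) for _ in range(3)]
result = colocated_avg(distribute(events), distribute(devices), 55)
print(result)
```

Because both tables use the same hash on device_id, matching rows always land in the same shard index, so no shard ever needs rows from another shard to complete the join.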

Co-location also helps you scale INSERT..SELECT, stored procedures, and distributed transactions.

INSERT..SELECT
- https://docs.citusdata.com/en/stable/articles/aggregation.html
Stored procedures
- https://www.citusdata.com/blog/2020/11/21/making-postgres-stored-procedures-9x-faster-in-citus/
Distributed transactions
- https://www.citusdata.com/blog/2017/06/02/scaling-complex-sql-transactions/

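Creating Reference Tables

When you need fast joins or foreign keys that do not include the distribution column, you can use create_reference_table to replicate a table across all nodes in the cluster.

Create the table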

CREATE TABLE device_types (
  device_type_id int primary key,
  device_type_name text not null unique
);

Replicate the table across all nodes to enable foreign keys and joins on any column

SELECT create_reference_table('device_types');

Insert a device type

INSERT INTO device_types (device_type_id, device_type_name) VALUES (55, 'laptop');

Optional: make sure the application can only insert devices of a known type

ALTER TABLE devices ADD CONSTRAINT device_type_fk
FOREIGN KEY (device_type_id) REFERENCES device_types (device_type_id);

Get the last 3 events for devices whose type name starts with laptop, parallelized across shards

SELECT device_id, event_time, data->>'measurement' AS value, device_name, device_type_name
FROM events JOIN devices USING (device_id) JOIN device_types USING (device_type_id)
WHERE device_type_name LIKE 'laptop%' ORDER BY event_time DESC LIMIT 3;

 device_id |          event_time           |        value        | device_name | device_type_name 
-----------+-------------------------------+---------------------+-------------+------------------
        31 | 2022-03-24 02:30:50.205478+00 | 0.9994211581289107  | device-31   | laptop
        31 | 2022-03-24 02:30:50.205478+00 | 0.13771543211483106 | device-31   | laptop
        88 | 2022-03-24 02:30:50.205478+00 | 0.5585740912470349  | device-88   | laptop
(3 rows)

Time: 96.537 ms
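The join on device_type_id works even though it is not the distribution column, because every node holds a full copy of device_types. A minimal Python sketch of that idea follows. The short node names, the placement of devices on nodes, the extra "phone" type, and the function names are all hypothetical and exist only for illustration.

```python
def replicate(reference_rows, node_names):
    """Give every node its own full copy of the reference table."""
    return {node: list(reference_rows) for node in node_names}


def local_join(devices_on_node, device_types_on_node, prefix):
    """Join a node's device rows with its local copy of the reference table."""
    type_names = {t["device_type_id"]: t["device_type_name"] for t in device_types_on_node}
    return [
        {**d, "device_type_name": type_names[d["device_type_id"]]}
        for d in devices_on_node
        if type_names.get(d["device_type_id"], "").startswith(prefix)
    ]


nodes = ["worker-0", "worker-1", "worker-2"]  # hypothetical short node names
device_types = [
    {"device_type_id": 55, "device_type_name": "laptop"},
    {"device_type_id": 56, "device_type_name": "phone"},  # hypothetical extra type
]
copies = replicate(device_types, nodes)

# Hypothetical placement of devices rows on the workers.
devices_by_node = {
    "worker-0": [{"device_id": 31, "device_type_id": 55}],
    "worker-1": [{"device_id": 88, "device_type_id": 55}],
    "worker-2": [{"device_id": 5, "device_type_id": 56}],
}
joined = {n: local_join(devices_by_node[n], copies[n], "laptop") for n in nodes}
print(joined)
```

The trade-off is that every write to the reference table has to reach all copies, so this layout suits small tables that change rarely, such as lookup data.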

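Reference tables enable you to scale out complex data models and take full advantage of the features of a relational database.

Creating Tables with Columnar Storage

To use columnar storage in your PostgreSQL database, you only need to add USING columnar to your CREATE TABLE statement, and your data will be compressed automatically by the columnar access method.

Create the table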

CREATE TABLE events_columnar (
  device_id bigint,
  event_id bigserial,
  event_time timestamptz default now(),
  data jsonb not null
)
USING columnar;

Insert some data

INSERT INTO events_columnar (device_id, data)
SELECT d, '{"hello":"columnar"}' FROM generate_series(1,10000000) d;

Create a row-based table for comparison

CREATE TABLE events_row AS SELECT * FROM events_columnar;

Check the table sizes

postgres=# \d+
                                                  List of relations
 Schema |             Name             |   Type   |  Owner   | Persistence | Access method |    Size    | Description 
--------+------------------------------+----------+----------+-------------+---------------+------------+-------------
 public | citus_tables                 | view     | postgres | permanent   |               | 0 bytes    | 
 public | device_types                 | table    | postgres | permanent   | heap          | 8192 bytes | 
 public | devices                      | table    | postgres | permanent   | heap          | 8192 bytes | 
 public | events                       | table    | postgres | permanent   | heap          | 8192 bytes | 
 public | events_columnar              | table    | postgres | permanent   | columnar      | 25 MB      | 
 public | events_columnar_event_id_seq | sequence | postgres | permanent   |               | 8192 bytes | 
 public | events_event_id_seq          | sequence | postgres | permanent   |               | 8192 bytes | 
 public | events_row                   | table    | postgres | permanent   | heap          | 806 MB     | 
(8 rows)

Compare events_row (806 MB) with events_columnar (25 MB): the columnar table is roughly 32 times smaller, a substantial saving in storage space.
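One intuition for the size difference is that storing each column on its own puts similar values next to each other. The toy Python sketch below uses run-length encoding to show this on data shaped like the example above. This is an assumption-laden illustration: the columnar access method relies on its own storage format and compression codecs rather than this toy encoder, and the row count is scaled down to 1000.

```python
from itertools import groupby


def run_lengths(values):
    """Run-length encode a sequence into (value, repeat_count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Same shape as the article's data, scaled down: unique IDs, one constant JSON payload.
rows = [(device_id, '{"hello":"columnar"}') for device_id in range(1, 1001)]

# Row layout: whole rows are stored one after another, and no two rows are equal.
row_runs = run_lengths(rows)

# Column layout: each column is stored on its own, so the constant column collapses.
id_column = [r[0] for r in rows]
data_column = [r[1] for r in rows]
id_runs = run_lengths(id_column)
data_runs = run_lengths(data_column)

print(len(row_runs), len(id_runs), len(data_runs))
```

In the row layout the unique ID breaks up every repetition of the payload, while in the column layout the payload column becomes a single run.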

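You can use columnar storage on its own or in a distributed table, combining the benefits of compression with the distributed query engine.

When using columnar storage, you should only load data in bulk, using COPY or INSERT..SELECT, to achieve good compression. Columnar tables do not currently support updates, deletes, or foreign keys. However, you can use partitioned tables in which newer partitions use row-based storage and older partitions are compressed with columnar storage.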
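The partitioning approach can be sketched as a small helper that decides which partitions are old enough to convert. The Python below only builds SQL strings and does not connect to a database. It assumes a Citus version that provides alter_table_set_access_method, and the partition names, monthly bounds, and the function name columnar_conversion_sql are hypothetical.

```python
from datetime import date


def columnar_conversion_sql(partitions, cutoff):
    """Build one statement per partition whose upper bound is on or before the cutoff."""
    return [
        f"SELECT alter_table_set_access_method('{name}', 'columnar');"
        for name, upper_bound in sorted(partitions.items(), key=lambda item: item[1])
        if upper_bound <= cutoff
    ]


# Hypothetical monthly partitions, keyed by name, with their exclusive upper bound.
partitions = {
    "events_2022_01": date(2022, 2, 1),
    "events_2022_02": date(2022, 3, 1),
    "events_2022_03": date(2022, 4, 1),
}
statements = columnar_conversion_sql(partitions, cutoff=date(2022, 3, 1))
for statement in statements:
    print(statement)
```

The names are interpolated directly into the SQL text, so this sketch is only appropriate for partition names you control. The newest partition stays row-based and keeps accepting updates and deletes.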
