PostgreSQL sharding : citus 系列1 - 多機部署(含OLTP(TPC-B)測試)- 含Citus MX模式...
標籤
PostgreSQL , citus , tpc-b , cn mx
背景
《(TPC-H測試 SF=10) PostgreSQL 11 vs 10 vs Deepgreen》 測試結果證明PG已經是一個HTAP資料庫,可以同時滿足OLTP,OLAP的需求。
citus是PG的一個sharding外掛,可以把PG變成一個分散式資料庫。目前在蘇寧有大量的生產應用跑在citus+pg的環境中。
本文是多機部署與簡單的測試。
後面的篇章陸續新增: HA,備份,容災。最佳實踐。效能測試。
期望citus可以具備良好的TP能力,同時AP能力可以像GPDB一樣擴充套件,使得PG如虎添翼。
重點看結果:
OLTP: citus(multi node) vs pg(single)
OLAP: citus(multi node) vs Greenplum(multi node)
環境
(ECS 32核,128G記憶體,2TB 雲盤) * 9
host all all xxx.xxx.xxx.228/32 trust # coordinator node
host all all xxx.xxx.xxx.224/32 trust # worker(data) node
host all all xxx.xxx.xxx.230/32 trust # worker(data) node
host all all xxx.xxx.xxx.231/32 trust # worker(data) node
host all all xxx.xxx.xxx.225/32 trust # worker(data) node
host all all xxx.xxx.xxx.227/32 trust # worker(data) node
host all all xxx.xxx.xxx.232/32 trust # worker(data) node
host all all xxx.xxx.xxx.226/32 trust # worker(data) node
host all all xxx.xxx.xxx.229/32 trust # worker(data) node
CentOS 7.x x64
PostgreSQL 10
Citus 7.5
一、OS 配置
《PostgreSQL 10 + PostGIS + Sharding(pg_pathman) + MySQL(fdw外部表) on ECS 部署指南(適合新使用者) - 珍藏級》
1 sysctl
注意某些引數,根據記憶體大小配置(已說明)
含義詳見
cat >>/etc/sysctl.conf <<EOOF
# add by digoal.zhou
fs.aio-max-nr = 1048576
fs.file-max = 76724600
# 可選:kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p
# /data01/corefiles 事先建好,許可權777,如果是軟連結,對應的目錄修改為777
kernel.sem = 4096 2147483647 2147483646 512000
# 訊號量, ipcs -l 或 -u 檢視,每16個程式一組,每組訊號量需要17個訊號量。
kernel.shmall = 107374182
# 所有共享記憶體段相加大小限制 (建議記憶體的80%),單位為頁。
kernel.shmmax = 274877906944
# 最大單個共享記憶體段大小 (建議為記憶體一半), >9.2的版本已大幅降低共享記憶體的使用,單位為位元組。
kernel.shmmni = 819200
# 一共能生成多少共享記憶體段,每個PG資料庫叢集至少2個共享記憶體段
net.core.netdev_max_backlog = 10000
net.core.rmem_default = 262144
# The default setting of the socket receive buffer in bytes.
net.core.rmem_max = 4194304
# The maximum receive socket buffer size in bytes
net.core.wmem_default = 262144
# The default setting (in bytes) of the socket send buffer.
net.core.wmem_max = 4194304
# The maximum send socket buffer size in bytes.
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1
# 開啟SYN Cookies。當出現SYN等待佇列溢位時,啟用cookie來處理,可防範少量的SYN攻擊
net.ipv4.tcp_timestamps = 1
# 減少time_wait
net.ipv4.tcp_tw_recycle = 0
# 如果=1則開啟TCP連線中TIME-WAIT套接字的快速回收,但是NAT環境可能導致連線失敗,建議服務端關閉它
net.ipv4.tcp_tw_reuse = 1
# 開啟重用。允許將TIME-WAIT套接字重新用於新的TCP連線
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.nf_conntrack_max = 1200000
net.netfilter.nf_conntrack_max = 1200000
vm.dirty_background_bytes = 409600000
# 系統髒頁到達這個值,系統後臺刷髒頁排程程式 pdflush(或其他) 自動將(dirty_expire_centisecs/100)秒前的髒頁刷到磁碟
# 預設為10%,大記憶體機器建議調整為直接指定多少位元組
vm.dirty_expire_centisecs = 3000
# 比這個值老的髒頁,將被刷到磁碟。3000表示30秒。
vm.dirty_ratio = 95
# 如果系統程式刷髒頁太慢,使得系統髒頁超過記憶體 95 % 時,則使用者程式如果有寫磁碟的操作(如fsync, fdatasync等呼叫),則需要主動把系統髒頁刷出。
# 有效防止使用者程式刷髒頁,在單機多例項,並且使用CGROUP限制單例項IOPS的情況下非常有效。
vm.dirty_writeback_centisecs = 100
# pdflush(或其他)後臺刷髒頁程式的喚醒間隔, 100表示1秒。
vm.swappiness = 0
# 不使用交換分割槽
vm.mmap_min_addr = 65536
vm.overcommit_memory = 0
# 在分配記憶體時,允許少量over malloc, 如果設定為 1, 則認為總是有足夠的記憶體,記憶體較少的測試環境可以使用 1 .
vm.overcommit_ratio = 90
# 當overcommit_memory = 2 時,用於參與計算允許指派的記憶體大小。
vm.swappiness = 0
# 關閉交換分割槽
vm.zone_reclaim_mode = 0
# 禁用 numa, 或者在vmlinux中禁止.
net.ipv4.ip_local_port_range = 40000 65535
# 本地自動分配的TCP, UDP埠號範圍
fs.nr_open=20480000
# 單個程式允許開啟的檔案控制程式碼上限
# 以下引數請注意
# vm.extra_free_kbytes = 4096000
# vm.min_free_kbytes = 2097152 # vm.min_free_kbytes 建議每32G記憶體分配1G vm.min_free_kbytes
# 如果是小記憶體機器,以上兩個值不建議設定
# vm.nr_hugepages = 66536
# 建議shared buffer設定超過64GB時 使用大頁,頁大小 /proc/meminfo Hugepagesize
# vm.lowmem_reserve_ratio = 1 1 1
# 對於記憶體大於64G時,建議設定,否則建議預設值 256 256 32
EOOF
生效配置
sysctl -p
2 配置OS資源限制
cat >>/etc/security/limits.conf <<EOOF
# nofile超過1048576的話,一定要先將sysctl的fs.nr_open設定為更大的值,並生效後才能繼續設定nofile.
* soft nofile 1024000
* hard nofile 1024000
* soft nproc unlimited
* hard nproc unlimited
* soft core unlimited
* hard core unlimited
* soft memlock unlimited
* hard memlock unlimited
EOOF
最好再關注一下/etc/security/limits.d目錄中的檔案內容,會覆蓋/etc/security/limits.conf的配置。
已有程式的ulimit請檢視/proc/pid/limits,例如
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 10485760 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 11286 11286 processes
Max open files 1024 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 11286 11286 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
如果你要啟動其他程式,建議退出SHELL再進一遍,確認ulimit環境配置已生效,再啟動。
3 配置OS防火牆
(建議按業務場景設定,這裡先清掉)
iptables -F
配置範例:
# 私有網段
-A INPUT -s 192.168.0.0/16 -j ACCEPT
-A INPUT -s 10.0.0.0/8 -j ACCEPT
-A INPUT -s 172.16.0.0/16 -j ACCEPT
4 selinux
如果沒有這方面的需求,建議禁用
# vi /etc/sysconfig/selinux
SELINUX=disabled
SELINUXTYPE=targeted
5 關閉透明大頁
grub2代,可以使用rc.local。
chmod +x /etc/rc.d/rc.local
cat >> /etc/rc.local <<EOOF
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
EOOF
當場生效:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
二、soft 配置
yum clean all
1、安裝EPEL
http://fedoraproject.org/wiki/EPEL
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
rpm -ivh epel-release-latest-7.noarch.rpm
2、安裝PostgreSQL yum
https://yum.postgresql.org/repopackages.php
wget https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm
rpm -ivh pgdg-centos10-10-2.noarch.rpm
yum install -y postgresql10*
3、citus
echo "nameserver 8.8.8.8" >>/etc/resolv.conf
curl https://install.citusdata.com/community/rpm.sh | sudo bash
yum install -y citus75_10*
disk
parted -s /dev/vdb mklabel gpt
parted -s /dev/vdb mkpart primary 1MiB 100%
mkfs.ext4 /dev/vdb1 -m 0 -O extent,uninit_bg -E lazy_itable_init=1 -T largefile -L data01
mkdir /data01
echo "LABEL=data01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0" >> /etc/fstab
mount -a
mkdir /data01/pg10
chown postgres:postgres /data01/pg10
三、env 配置
cat >>/var/lib/pgsql/.bash_profile <<EOOOF
export PS1="\$USER@\`/bin/hostname -s\`-> "
export PGPORT=1921
export PGDATA=/data01/pg10/pg_root\$PGPORT
export LANG=en_US.utf8
export PGHOME=/usr/pgsql-10
export LD_LIBRARY_PATH=\$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:\$LD_LIBRARY_PATH
export DATE=\`date +"%Y%m%d%H%M"\`
export PATH=\$PGHOME/bin:\$PATH:.
export MANPATH=\$PGHOME/share/man:\$MANPATH
export PGHOST=\$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
EOOOF
cat >/var/lib/pgsql/env.sh <<EOOOF
export PS1="\$USER@\`/bin/hostname -s\`-> "
export PGPORT=1921
export PGDATA=/data01/pg10/pg_root\$PGPORT
export LANG=en_US.utf8
export PGHOME=/usr/pgsql-10
export LD_LIBRARY_PATH=\$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:\$LD_LIBRARY_PATH
export DATE=\`date +"%Y%m%d%H%M"\`
export PATH=\$PGHOME/bin:\$PATH:.
export MANPATH=\$PGHOME/share/man:\$MANPATH
export PGHOST=\$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
EOOOF
四、database 配置
1 init
su - postgres -c "initdb -D \$PGDATA -U postgres --locale=en_US.UTF8 -E UTF8"
2 pg_hba.conf
cat >>/data01/pg10/pg_root1921/pg_hba.conf <<EOF
# citus 相關配置
host all all xxx.xxx.xxx.228/32 trust
host all all xxx.xxx.xxx.224/32 trust
host all all xxx.xxx.xxx.230/32 trust
host all all xxx.xxx.xxx.231/32 trust
host all all xxx.xxx.xxx.225/32 trust
host all all xxx.xxx.xxx.227/32 trust
host all all xxx.xxx.xxx.232/32 trust
host all all xxx.xxx.xxx.226/32 trust
host all all xxx.xxx.xxx.229/32 trust
EOF
3 postgresql.auto.conf
cat >/data01/pg10/pg_root1921/postgresql.auto.conf <<EOOOF
listen_addresses = '0.0.0.0'
port = 1921
max_connections = 2000
superuser_reserved_connections = 3
unix_socket_directories = '/var/run/postgresql, /tmp, .'
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 32GB
max_prepared_transactions = 2000
work_mem = 64MB
maintenance_work_mem = 1GB
dynamic_shared_memory_type = posix
shared_preload_libraries = 'citus,pg_stat_statements' # citus 相關配置
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
effective_io_concurrency = 0
max_worker_processes = 128
max_parallel_workers_per_gather = 8
max_parallel_workers = 24
wal_level = minimal
synchronous_commit = off
wal_writer_delay = 10ms
checkpoint_timeout = 35min
max_wal_size = 64GB
min_wal_size = 16GB
checkpoint_completion_target = 0.1
max_wal_senders = 0
random_page_cost = 1.1
log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_line_prefix = '%m [%p] '
log_statement = 'ddl'
log_timezone = 'PRC'
track_io_timing = on
track_activity_query_size = 4096
log_autovacuum_min_duration = 0
autovacuum_freeze_max_age = 1200000000
autovacuum_multixact_freeze_max_age = 1400000000
vacuum_freeze_table_age = 1150000000
vacuum_multixact_freeze_table_age = 1150000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'en_US.UTF8'
lc_monetary = 'en_US.UTF8'
lc_numeric = 'en_US.UTF8'
lc_time = 'en_US.UTF8'
default_text_search_config = 'pg_catalog.english'
citus.shard_count=128 # citus 相關配置
EOOOF
4 start
su - postgres -c "pg_ctl start"
五、citus 外掛配置
1 extension
su - postgres -c "psql -c 'create extension citus;'"
2 add worker (execute only on coordinator node)
su - postgres -c "psql -c \"SELECT * from master_add_node('xxx.xxx.xxx.224', 1921);\""
su - postgres -c "psql -c \"SELECT * from master_add_node('xxx.xxx.xxx.230', 1921);\""
su - postgres -c "psql -c \"SELECT * from master_add_node('xxx.xxx.xxx.231', 1921);\""
su - postgres -c "psql -c \"SELECT * from master_add_node('xxx.xxx.xxx.225', 1921);\""
su - postgres -c "psql -c \"SELECT * from master_add_node('xxx.xxx.xxx.227', 1921);\""
su - postgres -c "psql -c \"SELECT * from master_add_node('xxx.xxx.xxx.232', 1921);\""
su - postgres -c "psql -c \"SELECT * from master_add_node('xxx.xxx.xxx.226', 1921);\""
su - postgres -c "psql -c \"SELECT * from master_add_node('xxx.xxx.xxx.229', 1921);\""
postgres=# SELECT * FROM master_get_active_worker_nodes();
node_name | node_port
----------------+-----------
xxx.xxx.xxx.227 | 1921
xxx.xxx.xxx.229 | 1921
xxx.xxx.xxx.231 | 1921
xxx.xxx.xxx.225 | 1921
xxx.xxx.xxx.224 | 1921
xxx.xxx.xxx.226 | 1921
xxx.xxx.xxx.230 | 1921
xxx.xxx.xxx.232 | 1921
(8 rows)
六、PostgreSQL auto start
chmod +x /etc/rc.d/rc.local
chmod +x /etc/rc.local
cat >>/etc/rc.local <<EOF
su - postgres -c "pg_ctl start"
EOF
七、quick usage (only on coordinator node)
1、初始化1億資料
pgbench -i -s 1000
2、將tpc-b涉及的4張錶轉換為sharding表(128個shard)
set citus.shard_count =128;
postgres=# select create_distributed_table('pgbench_accounts','aid');
NOTICE: Copying data from local table...
create_distributed_table
--------------------------
(1 row)
select create_distributed_table('pgbench_branches','bid');
select create_distributed_table('pgbench_tellers','tid');
select create_distributed_table('pgbench_history','aid');
postgres=# SELECT pg_size_pretty(citus_total_relation_size('pgbench_accounts'));
pg_size_pretty
----------------
15 GB
(1 row)
八、pgbench OLTP(TPC-B) 壓測 - shard 8 VS 128 VS 本地庫
1 shard=128
1、只讀
pgbench -M prepared -v -r -P 1 -c 128 -j 128 -T 120 -S
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 19658912
latency average = 0.781 ms
latency stddev = 1.471 ms
tps = 163798.762192 (including connections establishing)
tps = 163838.927010 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set aid random(1, 100000 * :scale)
0.783 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
coordinator CPU (0% IDLE)
datanode cpu (91% IDLE)
2、讀寫
pgbench -M prepared -v -r -P 1 -c 128 -j 128 -T 120
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 1335463
latency average = 11.502 ms
latency stddev = 17.679 ms
tps = 11115.397826 (including connections establishing)
tps = 11118.596481 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.003 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.089 BEGIN;
0.985 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.522 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.979 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.468 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.485 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
6.970 END;
coordinator CPU (36% IDLE)
datanode cpu (91% IDLE)
2 shard=8
對比shard=8 (剛好每臺物理機每個PG例項一個shard)
drop table pgbench_accounts;
drop table pgbench_branches;
drop table pgbench_tellers;
drop table pgbench_history;
pgbench -i -s 1000
set citus.shard_count =8;
select create_distributed_table('pgbench_accounts','aid');
select create_distributed_table('pgbench_branches','bid');
select create_distributed_table('pgbench_tellers','tid');
select create_distributed_table('pgbench_history','aid');
1、只讀
pgbench -M prepared -v -r -P 1 -c 128 -j 128 -T 120 -S
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 19587094
latency average = 0.784 ms
latency stddev = 1.376 ms
tps = 163197.269113 (including connections establishing)
tps = 163240.601455 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set aid random(1, 100000 * :scale)
0.786 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
coordinator CPU (0% IDLE)
datanode cpu (92% IDLE)
2、讀寫
pgbench -M prepared -v -r -P 1 -c 128 -j 128 -T 120
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 1542327
latency average = 9.959 ms
latency stddev = 13.876 ms
tps = 12837.180912 (including connections establishing)
tps = 12840.837336 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.003 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.088 BEGIN;
0.982 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.525 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.970 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.266 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.485 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
5.639 END;
coordinator CPU (45% IDLE)
datanode cpu (91% IDLE)
3 本地節點(非citus節點, 跨ECS網路測試)
create database testdb;
pgbench -i -s 1000 testdb
1、只讀
pgbench -M prepared -v -r -P 1 -c 128 -j 128 -T 120 -h xxx.xxx.xxx.xxx -S testdb
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 58786788
latency average = 0.261 ms
latency stddev = 0.077 ms
tps = 489836.391089 (including connections establishing)
tps = 490056.663695 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.261 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
CPU (0% IDLE)
2、讀寫
pgbench -M prepared -v -r -P 1 -c 32 -j 32 -T 120 -h xxx.xxx.xxx.xxx testdb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 4695712
latency average = 3.270 ms
latency stddev = 1.897 ms
tps = 39108.352039 (including connections establishing)
tps = 39120.606464 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.277 BEGIN;
0.378 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.302 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.443 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.711 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.564 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.592 END;
CPU (0% IDLE)
OLTP(TPC-B)效能對比小結
環境 | test case | TPS | coordinator 節點CPU資源消耗 | worker節點CPU資源消耗 |
---|---|---|---|---|
citus(1+8) shard=8 | 1億 tpc-b 只讀 | 16.3萬 | (0% IDLE) | (92% IDLE) |
citus(1+8) shard=128 | 1億 tpc-b 只讀 | 16.4萬 | (0% IDLE) | (91% IDLE) |
local PG | 1億 tpc-b 只讀 | 49萬 | - | (0% IDLE) |
citus(1+8) shard=8 | 1億 tpc-b 讀寫 | 1.28萬 | (45% IDLE) | (91% IDLE) |
citus(1+8) shard=128 | 1億 tpc-b 讀寫 | 1.11萬 | (36% IDLE) | (91% IDLE) |
local PG | 1億 tpc-b 讀寫 | 3.98萬 | - | (0% IDLE) |
九、簡單AP測試
主要測試AP能力,以及執行計劃。
1、建表,寫入25.6億資料
create table test (id int, c1 int, c2 int, c3 int, info text, crt_time timestamp);
select create_distributed_table('test','id');
insert into test select id, random()*100, random()*100, random()*100, md5(random()::text), clock_timestamp() from generate_series(1,10000000) t(id);
insert into test select * from test;
......
insert into test select * from test;
postgres=# select pg_size_pretty(citus_relation_size('test'));
pg_size_pretty
----------------
222 GB
(1 row)
Time: 44.249 ms
2、執行計劃
limit 下推
postgres=# explain select * from test limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.00 rows=0 width=0)
-> Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0)
Task Count: 128
Tasks Shown: One of 128
-> Task
Node: host=172.24.211.224 port=1921 dbname=postgres
-> Limit (cost=0.00..0.02 rows=1 width=57)
-> Seq Scan on test_103068 test (cost=0.00..428722.72 rows=20067872 width=57)
(8 rows)
優化器識別分割槽鍵,同時where條件下推
postgres=# explain select * from test where id=1;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Custom Scan (Citus Router) (cost=0.00..0.00 rows=0 width=0)
Task Count: 1
Tasks Shown: All
-> Task
Node: host=172.24.211.232 port=1921 dbname=postgres
-> Gather (cost=1000.00..274755.04 rows=226 width=57)
Workers Planned: 5
-> Parallel Seq Scan on test_103075 test (cost=0.00..273732.44 rows=45 width=57)
Filter: (id = 1)
(9 rows)
目標端依舊可以使用並行,25.6億記錄,COUNT大概4.2秒。
postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from test ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=0.00..0.00 rows=0 width=0) (actual time=4153.954..4153.955 rows=1 loops=1)
Output: COALESCE((pg_catalog.sum(remote_scan.count))::bigint, '0'::bigint)
-> Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0) (actual time=4153.893..4153.911 rows=128 loops=1)
Output: remote_scan.count
Task Count: 128
Tasks Shown: One of 128
-> Task
Node: host=172.24.211.224 port=1921 dbname=postgres
-> Finalize Aggregate (cost=273906.25..273906.26 rows=1 width=8) (actual time=1149.156..1149.156 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=228044
-> Gather (cost=273905.73..273906.24 rows=5 width=8) (actual time=1149.078..1194.615 rows=6 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 5
Workers Launched: 5
Buffers: shared hit=228044
-> Partial Aggregate (cost=272905.73..272905.74 rows=1 width=8) (actual time=1145.740..1145.740 rows=1 loops=6)
Output: PARTIAL count(*)
Buffers: shared hit=228044
Worker 0: actual time=1144.474..1144.474 rows=1 loops=1
Buffers: shared hit=38016
Worker 1: actual time=1144.908..1144.909 rows=1 loops=1
Buffers: shared hit=38102
Worker 2: actual time=1145.096..1145.096 rows=1 loops=1
Buffers: shared hit=37952
Worker 3: actual time=1145.389..1145.389 rows=1 loops=1
Buffers: shared hit=38037
Worker 4: actual time=1145.818..1145.818 rows=1 loops=1
Buffers: shared hit=38199
-> Parallel Seq Scan on public.test_103068 test (cost=0.00..263933.38 rows=3588938 width=0) (actual time=0.016..747.409 rows=3344640 loops=6)
Buffers: shared hit=228044
Worker 0: actual time=0.015..747.436 rows=3345408 loops=1
Buffers: shared hit=38016
Worker 1: actual time=0.015..747.257 rows=3352944 loops=1
Buffers: shared hit=38102
Worker 2: actual time=0.016..748.296 rows=3339776 loops=1
Buffers: shared hit=37952
Worker 3: actual time=0.016..747.316 rows=3347256 loops=1
Buffers: shared hit=38037
Worker 4: actual time=0.017..747.354 rows=3361512 loops=1
Buffers: shared hit=38199
Planning time: 0.327 ms
Execution time: 1194.669 ms
Planning time: 2.564 ms
Execution time: 4154.213 ms
(45 rows)
非分佈鍵的GROUP BY,會分成兩個階段執行,類似Greenplum
《HybridDB PostgreSQL "Sort、Group、distinct 聚合、JOIN" 不懼怕資料傾斜的黑科技和原理 - 多階段聚合》
多階段聚合的原理可參考:
《PostgreSQL Oracle 相容性之 - 自定義並行聚合函式 PARALLEL_ENABLE AGGREGATE》
《Postgres-XC customized aggregate introduction》
《PostgreSQL aggregate function customize》
postgres=# explain select count(*) from test group by c1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=0.00..0.00 rows=0 width=0)
Group Key: remote_scan.worker_column_2
-> Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0)
Task Count: 128
Tasks Shown: One of 128
-> Task
Node: host=172.24.211.224 port=1921 dbname=postgres
-> Finalize GroupAggregate (cost=289321.80..289326.60 rows=101 width=12)
Group Key: c1
-> Sort (cost=289321.80..289323.06 rows=505 width=12)
Sort Key: c1
-> Gather (cost=289247.61..289299.12 rows=505 width=12)
Workers Planned: 5
-> Partial HashAggregate (cost=288247.61..288248.62 rows=101 width=12)
Group Key: c1
-> Parallel Seq Scan on test_103068 test (cost=0.00..268179.74 rows=4013574 width=4)
(16 rows)
非分佈鍵的分組聚合,25.6億資料,耗時約7秒
postgres=# select c1,count(*) from test group by c1 order by count(*) desc limit 10;
c1 | count
----+----------
93 | 25761280
63 | 25748992
86 | 25743104
55 | 25741056
36 | 25730048
33 | 25724928
53 | 25722880
84 | 25701632
81 | 25699072
57 | 25692928
(10 rows)
Time: 6948.230 ms (00:06.948)
postgres=# select count(*) from test;
count
------------
2560000000
(1 row)
Time: 3806.089 ms (00:03.806)
每秒的掃描速度約58.4GB
postgres=# select 222/3.8
postgres-# ;
?column?
---------------------
58.4210526315789474
(1 row)
TPC-H 後面再測試。
主要瓶頸,IO(SSD雲盤的最大問題還是讀寫頻寬上,大概只有200多MB/s,本地SSD盤可以達到4GB/s)
OLTP 優化
2階段,單條SQL涉及9個SHARD.
su - postgres -c "echo \"alter system set citus.multi_shard_commit_protocol='2pc'; select pg_reload_conf();\"|psql -f -"
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 1436979
latency average = 10.691 ms
latency stddev = 10.136 ms
tps = 11954.269803 (including connections establishing)
tps = 11957.251498 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set id random(1,1000000000)
10.691 insert into test values (:id),(:id+1),(:id+2),(:id+3),(:id+4),(:id+5),(:id+6),(:id+7),(:id+8);
1階段(但是最後還是要等所有SHARD操作完成後再commit,只不過不是2PC而已,所以可以看到worker上很多idle in transaction的狀態)
su - postgres -c "echo \"alter system set citus.multi_shard_commit_protocol='1pc'; select pg_reload_conf();\"|psql -f -"
1階段,單條SQL涉及9個shard
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 2449586
latency average = 6.269 ms
latency stddev = 4.968 ms
tps = 20407.507503 (including connections establishing)
tps = 20412.584813 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set id random(1,1000000000)
6.269 insert into test values (:id),(:id+1),(:id+2),(:id+3),(:id+4),(:id+5),(:id+6),(:id+7),(:id+8);
1階段,涉及2個shard。單個QUERY涉及的SHARD越多,效率越低。
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 7894772
latency average = 1.945 ms
latency stddev = 2.312 ms
tps = 65769.534466 (including connections establishing)
tps = 65784.907845 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set id random(1,1000000000)
1.946 insert into test values (:id),(:id),(:id),(:id),(:id),(:id),(:id),(:id),(:id+8);
1階段,涉及1個shard。效率最高。(如果客戶端能夠拿到分佈演算法,分佈鍵,鍵值,並按SHARD將多次寫入封裝在一條SQL中,可以輕鬆達到幾百萬行/s的寫入效能。)
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 22410273
latency average = 0.685 ms
latency stddev = 8.487 ms
tps = 186717.144028 (including connections establishing)
tps = 186761.579753 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set id random(1,1000000000)
0.688 insert into test values (:id),(:id),(:id),(:id),(:id),(:id),(:id),(:id),(:id);
單個QUERY,單個shard,單條記錄。
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 29441284
latency average = 0.522 ms
latency stddev = 0.762 ms
tps = 245299.791043 (including connections establishing)
tps = 245362.383416 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 \set id random(1,1000000000)
0.522 insert into test values (:id);
OLTP 優化小結
1、儘量不要使用運算子(包括函式),建議使用常量。
因為所有運算子(包括函式)的計算都在coordinator節點完成。
2、不需要2PC的話儘量不要使用。
su - postgres -c "echo \"alter system set citus.multi_shard_commit_protocol='1pc'; select pg_reload_conf();\"|psql -f -"
3、批量使用單條SQL批量操作(多SQL沒有意義)。如果客戶端能夠拿到分佈演算法,分佈鍵,鍵值,並按SHARD將多次寫入封裝在一條SQL中,可以輕鬆達到幾百萬行/s的寫入效能。
CN MX : OLTP 讀、寫能力擴充套件
1、讀能力擴充套件,增加coordinator節點。通過物理流複製 可以複製若干個COORDINATOR 節點(TP場景worker節點的資源使用率較低,coordinator節點可以在每個worker節點上放一個。),只讀QUERY可以分攤到不同的coordinator節點執行。
2、寫能力擴充套件,增加coordinator節點。可以使用mx功能(隱藏功能,實際上讀能力也可以使用這種方法擴充套件)。
CN節點執行
su - postgres -c "echo \"alter system set citus.replication_model='streaming'; select pg_reload_conf();\"|psql -f -"
新增要同步後設資料的WORKER
select * from master_add_node('xxxxx.224',1921);
select * from master_add_node('xxxxx.230',1921);
開啟同步到後設資料。
select start_metadata_sync_to_node('xxxxx.224',1921);
select start_metadata_sync_to_node('xxxxx.230',1921);
包含後設資料的節點,hasmetadata標記位TRUE。
postgres=# select * from pg_dist_node;
nodeid | groupid | nodename | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster
--------+---------+----------------+----------+----------+-------------+----------+----------+-------------
3 | 3 | xxxxx.231 | 1921 | default | f | t | primary | default
4 | 4 | xxxxx.225 | 1921 | default | f | t | primary | default
5 | 5 | xxxxx.227 | 1921 | default | f | t | primary | default
6 | 6 | xxxxx.232 | 1921 | default | f | t | primary | default
7 | 7 | xxxxx.226 | 1921 | default | f | t | primary | default
8 | 8 | xxxxx.229 | 1921 | default | f | t | primary | default
2 | 2 | xxxxx.230 | 1921 | default | t | t | primary | default
1 | 1 | xxxxx.224 | 1921 | default | t | t | primary | default
(8 rows)
僅針對開啟同步後,建立的SHARD表,在worker節點才有後設資料,之前已經建立的SHARD表,不會自動把後設資料同步過去。
對應的worker可以當成cn來使用。
postgres=# explain select * from test5;
QUERY PLAN
-----------------------------------------------------------------------------------
Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0)
Task Count: 128
Tasks Shown: One of 128
-> Task
Node: host=172.24.211.224 port=1921 dbname=postgres
-> Seq Scan on test5_103740 test5 (cost=0.00..22.70 rows=1270 width=36)
(6 rows)
postgres=# explain select * from test6;
QUERY PLAN
-----------------------------------------------------------------------------------
Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0)
Task Count: 128
Tasks Shown: One of 128
-> Task
Node: host=172.24.211.224 port=1921 dbname=postgres
-> Seq Scan on test6_103868 test6 (cost=0.00..22.70 rows=1270 width=36)
(6 rows)
啟用CN MX後,1+8 的叢集,讀寫能力如何?
(首先啟用MX後,實際上所有節點都可以作為寫節點,可用性大幅度提升。另一方面,只要做好每個節點的HA,全域性視角的可用性可靠性也可以有保障。)
1 TPC-B (select only)
pgbench -M prepared -n -r -P 1 -c 96 -j 96 -T 120 -S
CN tps: 127378
WORKERs tps: 58473+55709+63161+63510+62721+66301+59628+65486
總 tps: 622367
worker平均剩餘cpu 2.5%
2 TPC-B (read write)
CN: pgbench -M prepared -n -r -P 1 -c 96 -j 96 -T 120
WORKER: pgbench -M prepared -n -r -P 1 -c 48 -j 48 -T 120
CN tps: 18207
WORKERs tps: 7467+7368+7287+7391+7316+7442+7661+7440
總 tps: 77579
worker平均剩餘cpu 14%
citus 讀寫能力擴充套件架構
1、cn流複製,擴充套件讀能力。
2、citus MX特性,指定worker節點攜帶metadata,擴充套件讀寫能力。
3、流複製,異地容災。使用不同的dns,解決兩套叢集IP不同的問題。擴充套件容災能力。
citus reference
https://docs.citusdata.com/en/v7.5/develop/api.html
配合haproxy,lvs,citus mx功能(前面已提到如何配置),可以實現對業務完全透明的讀寫負載均衡。
https://severalnines.com/blog/postgresql-load-balancing-using-haproxy-keepalived
http://docs.citusdata.com/en/stable/cloud/additional_mx.html
配合jdbc或libpq的mulit-host功能,citus mx功能,可以實現對業務完全透明的讀寫負載均衡。
《PostgreSQL jdbc multi-host 配置與簡單HA、Load Balance實現》
《PostgreSQL libpq multi-host 配置與簡單HA實現》
《PostgreSQL 10.0 preview 功能增強 - libpq支援多主機連線(failover,LB)讓資料庫HA和應用配合更緊密》
相關文章
- 分散式PostgreSQL之Citus分散式SQL
- 使用 PostgreSQL 16.1 + Citus 12.1 作為多個微服務的分散式 Sharding 儲存後端SQL微服務分散式後端
- 試駕 Citus 11.0 beta
- 分散式 PostgreSQL - Citus 架構及概念分散式SQL架構
- 分散式 PostgreSQL - Citus 架構及概念分散式SQL架構
- 分散式 PostgreSQL 叢集(Citus)官方安裝指南分散式SQL
- Kunlun-Storage vs PostgreSQL OLTP 測試SQL
- 分散式 PostgreSQL 叢集(Citus)官方示例 - 多租戶應用程式實戰分散式SQL
- ASP.NET Core + SaasKit + PostgreSQL + Citus 的多租戶應用程式架構示例ASP.NETSQL架構
- Citus 11(分散式 PostgreSQL) 文件貢獻與本地執行分散式SQL
- Citus 分散式 PostgreSQL 叢集 - SQL Reference(查詢處理)分散式SQL
- 在 Kubernetes 上快速測試 Citus 分散式 PostgreSQL 叢集(分散式表,共置,引用表,列儲存)分散式SQL
- 分散式 PostgreSQL 叢集(Citus)官方示例 - 實時儀表盤分散式SQL
- 分散式 PostgreSQL 叢集(Citus)官方示例 - 時間序列資料分散式SQL
- Citus 分散式 PostgreSQL 叢集 - SQL Reference(手動查詢傳播)分散式SQL
- Citus 分散式 PostgreSQL 叢集 - SQL Reference(SQL支援和變通方案)分散式SQL
- Citus 分散式 PostgreSQL 叢集 - SQL Reference(攝取、修改資料 DML)分散式SQL
- Citus 分散式 PostgreSQL 叢集 - SQL Reference(查詢分散式表 SQL)分散式SQL
- Citus 11 for Postgres 完全開源,可從任何節點查詢(Citus 官方部落格)
- Citus 分散式 PostgreSQL 叢集 - SQL Reference(建立和修改分散式表 DDL)分散式SQL
- PostgreSQL/Citus 常見報錯問題分析(一)ERROR: direct joins between distributedSQLError
- 分散式 PostgreSQL 叢集(Citus)官方教程 - 遷移現有應用程式分散式SQL
- Oracle 12C Sharding部署和測試Oracle
- 基於python的ftp(含資料庫、含多執行緒、含socketserver)PythonFTP資料庫執行緒Server
- Windows開啟關閉測試模式的方法(含開啟測試模式失敗的解決辦法)Windows模式
- 軟體測試用例編寫(含思路)
- 使用容器快速在阿里雲 ECS 多節點上搭建 Citus 12.1 叢集阿里
- 探索 Python/Django 支援分散式多租戶資料庫,如 Postgres+CitusPythonDjango分散式資料庫
- 分散式 PostgreSQL 叢集(Citus),分散式表中的分佈列選擇最佳實踐分散式SQL
- Sharding-JDBC測試ChatGPTJDBCChatGPT
- 蘇寧citus分散式資料庫應用實踐分散式資料庫
- StackGres 資料庫平臺工程,使用 Citus + Patroni 建立生產級高可用分散式 PostgreSQL 分片叢集資料庫分散式SQL
- 從零學Netty(四)Reactor模式(含demo)NettyReact模式
- Linux shell中2>&1的含義Linux
- Django-Multitenant,分散式多租戶資料庫專案實戰(Python/Django+Postgres+Citus)DjangoNaN分散式資料庫Python
- AIGC的“含科量”與“含資量”AIGC
- 機器學習導圖系列(4):演算法(含61公式)機器學習演算法公式
- 擴充套件我們的分析處理服務(Smartly.io):使用 Citus 對 PostgreSQL 資料庫進行分片套件SQL資料庫