使用PostgreSQL邏輯訂閱實現multi-master

德哥發表於2017-06-24

SQLAST

背景

很多業務要求多活，但是多活中最難搞定的實際上是資料庫，大多數業務通過分流，例如將資料根據UID切分到不同的IDC，同一個UID的資料永遠只會寫到一個IDC中，然後通過資料複製技術，將對應的資料複製到其他的IDC。

這種形態的多活是比較安全的，即同一條記錄不會被多個IDC執行DML。如果同一條記錄涉及到多個IDC的DML，那麼資料庫的一致性會受到極大挑戰，比如同一條記錄被多個IDC更新，到底以哪個為準？

同時多活還要解決另一個問題，資料的迴圈問題，例如A寫入一條資料，需要複製到B，B寫入的資料也要複製給A。如果沒有手段防止迴圈的話，一條記錄就可以把多活資料庫弄趴下。

multi-master的兩個重大課題是解決衝突以及死迴圈的問題。

PostgreSQL 10引入了邏輯訂閱的功能，可以輕鬆的實現單向複製，同時為雙向複製(multi-master)提供了天然的支援，本文將介紹如何利用邏輯訂閱實現multi-master。

衝突的產生

例如某個表的PK欄位為pk, A,B,C三個節點，都更新了同一條pk，到底以誰為準呢？

又或者A,B,C都寫入了同一個pk值的一條記錄，到底以誰的為準呢？

解決衝突

解決衝突的方法很多，例如以某個欄位的值來判斷（例如時間欄位，以最大或最小為準）。

針對不同的操作，解法不一。

1、INSERT和UPDATE

建議業務層面保證PK不衝突。如果業務層面無法保證不衝突，則可以這樣實現conflict handler：

對需要實現multi-master的表，新增3個欄位，分別為：

NODE ID，事務時間、語句時間。然後我們就可以實現這些conflict handler方法。

1、以某個欄位(事務時間、語句時間、或者某個使用者選擇的欄位)的最大、最小值為準，  
  
2、以某個節點為準，比如1號節點為上級單位的節點，當發生衝突是，底下的節點都服從他。  
  
3、使用自定義handler。  
  
即複製過程中，如果出現衝突，自動呼叫對應的handler對應的function來處理。    
    
pglogical, bdr, xDB 都有類似的機制。    
  
本文使用觸發器來實現自定義handler。

可以通過HOOK實現，在核心中INSERT、UPDATE階段應用以上規則。

也可以使用觸發器來實現，本文的例子就是使用觸發器來實現的。

2、DELETE

無衝突

PS

目前PG 10在遇到衝突（比如INSERT遇到PK異常，或者任何其他異常），會終止WAL apply，此時需要人為介入，例如SKIP對應的WAL。

使用本文的衝突handler後，可以規避此問題。否則我們需要這樣來處理衝突：

1. 通過修改訂閱端的資料，解決衝突。例如insert違反了唯一約束時，可以刪除訂閱端造成唯一約束衝突的記錄先DELETE掉。然後使用ALTER SUBSCRIPTION name ENABLE讓訂閱繼續。

如果是雙向複製（multi-master）則不建議使用方法1。

2. 在訂閱端呼叫pg_replication_origin_advance(node_name text, pos pg_lsn)函式，node_name就是subscription name，pos指重新開始的LSN，從而跳過有衝突的事務。

pg_replication_origin_advance(node_name text, pos pg_lsn)             
      
Set replication progress for the given node to the given position.       
      
This primarily is useful for setting up the initial position or a new position after configuration changes and similar.       
      
Be aware that careless use of this function can lead to inconsistently replicated data.

當前的lsn通過pg_replication_origin_status.remote_lsn檢視。

https://www.postgresql.org/docs/devel/static/view-pg-replication-origin-status.html

死迴圈的產生

例如A寫入了一條記錄，產生一筆WAL，B通過這筆WAL將這條記錄複製到了B，B又會產生一筆WAL，這筆WAL又會被複制到A，然後A通過這筆WAL又會寫入一條記錄，然後又會產生WAL，迴圈往復。

解決死迴圈

如果不解決死迴圈的問題，multi-master的某個節點插入一條記錄，這條記錄複製到另一個節點後，還會迴流到某節點，無限迴圈。

解決死迴圈的問題方法比較簡單，

1、對需要雙向複製的表，新增一個欄位（無預設值），表示這條記錄是來自PEER節點，還是本地插入、更新、刪除的。

為了方便溯源，也可以加一個gtid欄位用來分辨記錄是來自哪個節點的。(在觸發器中體現)

2、對需要雙向複製的表，新增一個觸發器函式，觸發器中需要以下邏輯：

insert or update    
    
  if pid <> replic pid then        -- 本地發起的insert    
    NEW.islocal = true          
  else                             -- 對端節點發起的insert    
    if NEW.islocal = false then    -- 對端也是別人那裡複製過來的    
      return NULL;    
    else                           -- 對端是自己產生的    
      NEW.islocal = false          -- 置為對端發來的    
    end if;    
  end if;    
    
  return NEW;

示例

環境

3個PostgreSQL 10例項。分別對應埠

multi-master 實施步驟

1、建立測試表

測試表的原始欄位如下

create table mm(    
  id int primary key,   -- 需要多向複製的表，建議必須有PK，對於multi-master，建議使用序列start val錯開。    
  info text,            -- 原始表結構內的欄位    
  mod_time timestamp    -- 原始表結構內的欄位    
);

為了解決衝突和死迴圈的問題，我們新增了若干欄位

create table mm(    
  id int primary key,   -- 需要多向複製的表，建議必須有PK，對於multi-master，建議使用序列start val錯開。    
  info text,            -- 原始表結構內的欄位    
  mod_time timestamp,   -- 原始表結構內的欄位    
  -- 新增如下欄位  
  mm_islocal boolean,      -- 表示這條記錄是本地發生的，還是複製過來的，本地為true，peer為false   
  mm_nodeid int2,          -- 表示這條記錄是哪個節點(insert,update,delete)的。  
  mm_last_xacttime timestamp,  -- 表示這條記錄被(insert,update)的事務時間, now()  
  mm_last_stattime timestamp   -- 表示這條記錄被(insert,update)的語句時間, clock_timestamp()  
);

PS，寫入的衝突，業務層是有辦法規避的，例如

使用序列start val錯開PK

create table mm(    
  id int primary key default nextval(`seq`::regclass),   -- 需要多向複製的表，建議必須有PK，對於multi-master，建議使用序列start val錯開。    
  info text,            -- 原始表結構內的欄位    
  mod_time timestamp,   -- 原始表結構內的欄位    
  -- 新增如下欄位  
  mm_islocal boolean,      -- 表示這條記錄是本地發生的，還是複製過來的，本地為true，peer為false   
  mm_nodeid int2,          -- 表示這條記錄是哪個節點(insert,update,delete)的。  
  mm_last_xacttime timestamp,  -- 表示這條記錄被(insert,update)的事務時間, now()  
  mm_last_stattime timestamp   -- 表示這條記錄被(insert,update)的語句時間, clock_timestamp()  
);    
    
    
1922    
create sequence seq increment by 16 start with 1;    
    
    
1923    
create sequence seq increment by 16 start with 2;    
    
    
1924    
create sequence seq increment by 16 start with 3;

2、建立觸發器函式，解決死迴圈、衝突問題

為了實現不同的conflict handler，我們需要給觸發器函式輸入引數，設計如下

引數0，當前節點號  
  
引數1：使用什麼作為conflict的解決標誌，取值範圍(nodeid, last_xacttime, last_stattime, udf)  
  
引數2：對應的解決辦法，不同的標誌，對應不同的辦法(nodeid(1,2,3), last_xact,stattime(first,last), udf(自己定義)...)

create or replace function tg() returns trigger as $$    
declare    
  replica_pids int[];    
begin    
  select array(select pid from pg_stat_activity where application_name ~ `logical replication worker`) into replica_pids;    
    
  -- 解決死迴圈  
  
  if array_position(replica_pids, pg_backend_pid()) is null then          -- 本地發起的insert或update    
    NEW.mm_islocal = true ;    
    NEW.mm_nodeid = TG_ARGV[0];  
    NEW.mm_last_xacttime = now();  
    NEW.mm_last_stattime = clock_timestamp();  
  else                                                                    -- 不是本地節點發起的insert或update    
    if NEW.mm_islocal = false then                                           -- 判斷是否來自peer節點的本地insertupdate  
      return null;                                                        -- 否，跳過    
    else    
      NEW.mm_islocal = false;         -- 是，繼續，並標記為非本地發起的  
    end if;    
  end if;    
  
  -- 順利過關，說明沒有發生迴圈。  
  
  -- 解決衝突  
  -- tg_argv[0]: nodeid  
  -- tg_argv[1]: nodeid, last_xacttime, last_stattime, mod_time  
  -- tg_argv[2]: nodeid(1,2,3), last_xacttime(first,last), last_stattime(first,last), mod_time(first,last)  
  -- tg_argv[3]: nodeid(1,2,3), 當tg_argv[1] <> nodeid並且old,new欄位值一樣時，使用哪個節點優先  
  
  case TG_OP  
  when `INSERT` then  
    --    insert衝突需要用hook解決，替換為insert on conflict  
    --    所以本文建議，採用serial的start val和increment by 來解決pk衝突的問題  
    RETURN NEW;  
  when `UPDATE` then  
    case TG_ARGV[1]  
    when `nodeid` then  
      if OLD.mm_nodeid::text = TG_ARGV[2] and NEW.mm_nodeid <> OLD.mm_nodeid then   
        return NULL;  
      end if;  
    when `last_xacttime` then  
      case TG_ARGV[2]  
      when `first` then  
        if OLD.mm_last_xacttime < NEW.mm_last_xacttime or (OLD.mm_last_xacttime = NEW.mm_last_xacttime and OLD.mm_nodeid::text = TG_ARGV[3] and NEW.mm_nodeid <> OLD.mm_nodeid) then   
	  return null;  
        end if;  
      when `last` then  
        if OLD.mm_last_xacttime > NEW.mm_last_xacttime or (OLD.mm_last_xacttime = NEW.mm_last_xacttime and OLD.mm_nodeid::text = TG_ARGV[3] and NEW.mm_nodeid <> OLD.mm_nodeid) then   
	  return null;  
        end if;  
      end case;  
    when `last_stattime` then  
      case TG_ARGV[2]  
      when `first` then  
        if OLD.mm_last_xacttime < NEW.mm_last_stattime or (OLD.mm_last_stattime = NEW.mm_last_stattime and OLD.mm_nodeid::text = TG_ARGV[3] and NEW.mm_nodeid <> OLD.mm_nodeid) then   
	  return null;  
        end if;  
      when `last` then  
        if OLD.mm_last_stattime > NEW.mm_last_stattime or (OLD.mm_last_stattime = NEW.mm_last_stattime and OLD.mm_nodeid::text = TG_ARGV[3] and NEW.mm_nodeid <> OLD.mm_nodeid) then   
	  return null;  
        end if;  
      end case;  
    when `mod_time` then  -- 自定義部分  
      case TG_ARGV[2]  
      when `first` then  
        if OLD.mm_last_xacttime < NEW.mod_time or (OLD.mod_time = NEW.mod_time and OLD.mm_nodeid::text = TG_ARGV[3] and NEW.mm_nodeid <> OLD.mm_nodeid) then   
	  return null;  
        end if;  
      when `last` then  
        if OLD.mm_last_xacttime > NEW.mod_time or (OLD.mod_time = NEW.mod_time and OLD.mm_nodeid::text = TG_ARGV[3] and NEW.mm_nodeid <> OLD.mm_nodeid) then   
	  return null;  
        end if;  
      end case;  
    end case;  
  end case;  
    
  return NEW;    
  
end;    
$$ language plpgsql strict;

PS，使用pg_stat_activity可以得到logical replication worker的pids.

postgres=# select * from pg_stat_activity where pid=24505;    
-[ RECORD 1 ]----+--------------------------------------------------    
datid            | 13158    
datname          | postgres    
pid              | 24505    
usesysid         | 10    
usename          | postgres    
application_name | logical replication worker for subscription 45109    
client_addr      |     
client_hostname  |     
client_port      |     
backend_start    | 2017-06-22 15:50:33.285954+08    
xact_start       |     
query_start      |     
state_change     | 2017-06-22 16:40:04.983325+08    
wait_event_type  | Activity    
wait_event       | LogicalApplyMain    
state            | idle    
backend_xid      |     
backend_xmin     |     
query            |     
backend_type     | background worker

3、建立觸發器

可以這樣來建立，分別針對不同的conflict handler

本地節點為1號，優先策略為2號節點  
create trigger tg before insert or update on mm for each row execute procedure tg(1, nodeid, 2);    
  
解釋

本地節點為1號，優先策略為last_xacttime, first，當時間相等時優先節點為1號  
  
-- create trigger tg before insert or update on mm for each row execute procedure tg(1, last_xacttime, first, 1);    
  
-- create trigger tg before insert or update on mm for each row execute procedure tg(1, last_stattime, first, 1);   
  
-- create trigger tg before insert or update on mm for each row execute procedure tg(1, mod_time, first, 1);   
  
-- create trigger tg before insert or update on mm for each row execute procedure tg(1, last_xacttime, last, 1);    
  
-- create trigger tg before insert or update on mm for each row execute procedure tg(1, last_stattime, last, 1);   
  
-- create trigger tg before insert or update on mm for each row execute procedure tg(1, mod_time, last, 1);

以last_xacttime，last，優先2號節點為例

1922  
create trigger tg before insert or update on mm for each row execute procedure tg(1, last_xacttime, last, 2);    
  
1923  
create trigger tg before insert or update on mm for each row execute procedure tg(2, last_xacttime, last, 2);    
  
1924  
create trigger tg before insert or update on mm for each row execute procedure tg(3, last_xacttime, last, 2);

以last_xacttime，first，優先2號節點為例

4、讓觸發器在所有連線中，包括replica程式中都生效

alter table mm enable always trigger tg;

5、建立釋出

CREATE PUBLICATION pub1 FOR TABLE mm with (publish = `insert, delete, update`);    
    
-- alter publication pub1 add table mm;

前面的操作需要在所有例項執行。

6、建立訂閱，不同的例項操作分別如下

1922    
CREATE SUBSCRIPTION sub1922_1923 CONNECTION `host=127.0.0.1 port=1923 user=postgres dbname=postgres` PUBLICATION pub1;    
CREATE SUBSCRIPTION sub1922_1924 CONNECTION `host=127.0.0.1 port=1924 user=postgres dbname=postgres` PUBLICATION pub1;    
    
1923    
CREATE SUBSCRIPTION sub1923_1922 CONNECTION `host=127.0.0.1 port=1922 user=postgres dbname=postgres` PUBLICATION pub1;    
CREATE SUBSCRIPTION sub1923_1924 CONNECTION `host=127.0.0.1 port=1924 user=postgres dbname=postgres` PUBLICATION pub1;    
    
1924    
CREATE SUBSCRIPTION sub1924_1922 CONNECTION `host=127.0.0.1 port=1922 user=postgres dbname=postgres` PUBLICATION pub1;    
CREATE SUBSCRIPTION sub1924_1923 CONNECTION `host=127.0.0.1 port=1923 user=postgres dbname=postgres` PUBLICATION pub1;

7、壓測方法

為了避免插入衝突導致複製中斷，使用以下測試方法，3個例項插入的資料確保PK值是不一樣的，（實際生產，可以使用序列的start value來錯開）。

更新、刪除則覆蓋所有的值範圍。

test1.sql    
    
set id1 random(1,30000)    
set id2 random(1,30000)    
set id3 random(1,30000)    
set id4 random(1,30000)    
set id5 random(1,30000)    
insert into mm select 3*(random()*10000)::int, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id1;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id2;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id3;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id4;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id5;    
delete from mm where id=:id+1;  
delete from mm where id=:id;  
  
    
test2.sql    
    
set id1 random(1,30000)    
set id2 random(1,30000)    
set id3 random(1,30000)    
set id4 random(1,30000)    
set id5 random(1,30000)    
insert into mm select 3*(random()*10000)::int+1, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int+1, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int+1, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int+1, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int+1, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id1;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id2;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id3;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id4;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id5;    
delete from mm where id=:id+1;  
delete from mm where id=:id;  
    
    
test3.sql    
    
set id1 random(1,30000)    
set id2 random(1,30000)    
set id3 random(1,30000)    
set id4 random(1,30000)    
set id5 random(1,30000)    
insert into mm select 3*(random()*10000)::int+2, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int+2, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int+2, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int+2, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
insert into mm select 3*(random()*10000)::int+2, md5(random()::text), now() on conflict(id) do update set info=excluded.info,mod_time=clock_timestamp();    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id1;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id2;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id3;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id4;    
update mm set info=md5(random()::text),mod_time=clock_timestamp() where id=:id5;    
delete from mm where id=:id+1;  
delete from mm where id=:id;

8、三個節點同時壓測

pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 10 -j 10 -T 120 -p 1922  &  
    
pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 10 -j 10 -T 120 -p 1923  &  
    
pgbench -M prepared -n -r -P 1 -f ./test3.sql -c 10 -j 10 -T 120 -p 1924  &

9、驗證

1922    
postgres=# select sum(hashtext(t.id||t.info||t.mod_time)) from mm t;    
      sum           
----------------    
 -1378199912066    
(1 row)    
    
1923    
postgres=# select sum(hashtext(t.id||t.info||t.mod_time)) from mm t;    
      sum           
----------------    
 -1378199912066    
(1 row)    
    
1924    
postgres=# select sum(hashtext(t.id||t.info||t.mod_time)) from mm t;    
      sum           
----------------    
 -1378199912066    
(1 row)

10、測試刪除

多節點複製成功    
    
最終驗證結果一致

11、清理戰場

-- 1922
drop table mm;
drop publication pub1;
alter subscription sub1922_1923 disable ;
alter subscription sub1922_1924 disable ;
drop subscription sub1922_1923;
drop subscription sub1922_1924;

-- 1923
drop table mm;
drop publication pub1;
alter subscription sub1923_1922 disable ;
alter subscription sub1923_1924 disable ;
drop subscription sub1923_1922;
drop subscription sub1923_1924;

-- 1924
drop table mm;
drop publication pub1;
alter subscription sub1924_1922 disable ;
alter subscription sub1924_1923 disable ;
drop subscription sub1924_1922;
drop subscription sub1924_1923;