The Evolution of MySQL Semi-Sync

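myownstars, published 2015-02-08
MySQL 5.5 introduced semi-sync: after a transaction commits on the master, the dump thread sends the corresponding binlog to the slaves, and the master returns to the user thread only after at least one slave has sent an ACK;
Caveats
1 A slave ACK only means that the io_thread has written the event to the relay_log; it does not mean that the sql_thread has applied it;
2 The master sends the binlog to the slave only after the transaction has committed, so if the master crashes at that point, master and slave data can diverge;
3 The dump thread is responsible both for sending the binlog and for receiving slave ACKs, and the two cannot run in parallel, which is inefficient;
4 The dump thread acquires LOCK_log while reading the binlog, and while it holds the mutex no other thread can read or write the binlog;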


Later versions made a series of improvements to address these problems
1  after_sync
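5.7 introduced the rpl_semi_sync_master_wait_point parameter, which lets the DBA choose the stage at which the master waits for the slave ACK: either as before (AFTER_COMMIT), or after the master has flushed the transaction to the binlog but before it commits in the storage engine (AFTER_SYNC);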
AFTER_SYNC (the default): The master writes each transaction to its binary log and the slave, and syncs the binary log to disk. The master waits for slave acknowledgment of transaction receipt after the sync. Upon receiving acknowledgment, the master commits the transaction to the storage engine and returns a result to the client, which then can proceed.
AFTER_COMMIT: The master writes each transaction to its binary log and the slave, syncs the binary log, and commits the transaction to the storage engine. The master waits for slave acknowledgment of transaction receipt after the commit. Upon receiving acknowledgment, the master returns a result to the client, which then can proceed.
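Suppose two client connections, clienta and clientb, are open on the master.
clienta commits a transaction. Before 5.7, MySQL writes it to redo, binlog, and redo (commit) in that order, then performs the semi-sync wait, and returns to clienta only after receiving the slave ACK;
clientb can therefore see the data committed by clienta as soon as redo (commit) is done, one step ahead of clienta itself, which makes the data inconsistent across connections;
after_sync avoids this problem: when clienta commits a transaction, MySQL writes it to redo and binlog in that order, performs the semi-sync wait, and only after receiving the slave ACK does redo (commit) and returns to clienta;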

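after_commit has another problem: if the master crashes between redo (commit) and the semi-sync wait, master and slave data are inconsistent;
after_sync at least guarantees that every transaction whose redo (commit) succeeded has already been received by a slave, which is half a step better;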
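
The difference between the two wait points can be sketched as a toy model. This is only an illustration: the step names below are hypothetical labels for the stages described above, not MySQL internals, and the model assumes a crash happens cleanly between two steps.

```python
# Toy model of the two wait points. Step names are illustrative labels only.
STEPS = {
    "AFTER_COMMIT": ["redo", "binlog", "redo_commit", "wait_ack", "reply_client"],
    "AFTER_SYNC": ["redo", "binlog", "wait_ack", "redo_commit", "reply_client"],
}


def state_at_crash(wait_point, crash_before):
    """Describe the transaction if the master crashes just before `crash_before`."""
    steps = STEPS[wait_point]
    done = steps[:steps.index(crash_before)]
    return {
        # Other sessions can see the data once the engine commit is done.
        "visible_to_other_sessions": "redo_commit" in done,
        # The slave has acknowledged once the wait for the ACK has finished.
        "acked_by_slave": "wait_ack" in done,
    }


# AFTER_COMMIT: the data is already visible on the master but not yet acknowledged.
print(state_at_crash("AFTER_COMMIT", "wait_ack"))
# AFTER_SYNC: the slave has acknowledged before anything becomes visible.
print(state_at_crash("AFTER_SYNC", "redo_commit"))
```

In this model, AFTER_SYNC never reaches a state where the transaction is visible to other sessions without a slave ACK, whereas AFTER_COMMIT does.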

2 ack collector thread
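5.7 introduced this separate thread. The dump thread now only reads and sends binlog events, while receiving slave ACKs is handled by the ACK collector thread;
The dump thread can keep sending events without waiting for ACKs, similar to the TCP sliding window protocol;
The master maintains a list of semisync slaves, which persists even if the ack thread goes down;
When the dump thread calls transmit_start it registers the slave with the master; if the slave supports semisync, it is added to the semisync slave list;
The ack thread listens on the semisync slave list via select();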
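
A rough way to see why decoupling helps is a back-of-the-envelope cost model. It rests on simplifying assumptions that are not from MySQL itself: every send has a constant cost, every ACK takes a constant round trip, and in the coupled case the next send cannot start until the previous ACK has arrived.

```python
def coupled_time(n_events, send_cost, ack_rtt):
    """One thread sends an event, then waits for its ACK before the next send."""
    return n_events * (send_cost + ack_rtt)


def decoupled_time(n_events, send_cost, ack_rtt):
    """Sends run back to back; ACKs are collected by a separate thread,
    so only the final ACK adds a round trip to the total."""
    return n_events * send_cost + ack_rtt


# Hypothetical numbers: 100 events, 1 time unit per send, 5 units per ACK round trip.
print(coupled_time(100, 1, 5))
print(decoupled_time(100, 1, 5))
```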

The Ack_receiver class maintains the ACK thread
The thread has three states
enum status { ST_UP, ST_STOPPING, ST_DOWN };
    ST_UP    means ack receive thread is created and is working.
    ST_DOWN  means ack receive thread is destroyed.
    ST_STOPPING means a user is disabling semisync master, and ack receive thread is being destroyed.
- m_slaves
    A slave vector which includes slaves' useful information here.
    DEFINITION:
    Slave_vector m_slaves
- m_mutex
    m_slaves and m_status are shared between user sessions(dump threads) and ack thread. So they should be protected by a mutex.
- add_slave()
    Add a new semisync slave to slave list.
    DEFINITION:
    bool add_slave(THD *thd);
    LOGIC:
    initialize slave information.
    acquire m_mutex
    add the slave's information into m_slaves.
    send a signal to ack receive thread. It may be waiting for a signal.
    release m_mutex
- remove_slave()
    remove a semisync slave from slave list.
    DEFINITION:
    void remove_slave(THD *thd)
    LOGIC:
    acquire m_mutex
    remove thd of the slave from m_slaves.
    release m_mutex
- run()
    The handle function of receive thread.
    DEFINITION:
    void run();
    LOGIC:
    initialize pthread related things
    while (1)
    {
      acquire m_mutex
      if m_status is ST_STOPPING then break the loop.
      wait for a semisync slave to be added if the slave list is empty.
      call select to listen on sockets, timeout is 1s.
      restart and continue the loop if error or timeout happens.
      receive and report acks to semisync master.
      release m_mutex
    }
    de-initialize pthread related things
    Note: Giving select a timeout lets other threads add/remove slaves
          or stop the ack receive thread when there is no ack.
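
The logic above can be approximated with a small Python analogue. This is a sketch rather than the MySQL implementation: the class `AckReceiver`, its attributes, the 0.1s timeout, and the ACK payload are all made up for illustration, a `threading.Condition` stands in for m_mutex plus its signal, and the lock is released before calling select() so that other threads are never blocked for the whole timeout.

```python
import select
import socket
import threading
import time


class AckReceiver:
    """Toy analogue of the Ack_receiver logic (hypothetical, not MySQL code)."""

    def __init__(self):
        self._cond = threading.Condition()  # role of m_mutex and its signal
        self._slaves = []                   # role of m_slaves
        self._stopping = False              # True corresponds to ST_STOPPING
        self.acks = []                      # acks reported to the "master"
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        with self._cond:
            self._stopping = True
            self._cond.notify()
        self._thread.join()

    def add_slave(self, sock):
        with self._cond:
            self._slaves.append(sock)
            self._cond.notify()  # the receiver may be waiting for a slave

    def remove_slave(self, sock):
        with self._cond:
            self._slaves.remove(sock)

    def _run(self):
        while True:
            with self._cond:
                while not self._slaves and not self._stopping:
                    self._cond.wait()
                if self._stopping:
                    return
                watched = list(self._slaves)
            # The timeout lets add/remove/stop take effect when no ack arrives.
            readable, _, _ = select.select(watched, [], [], 0.1)
            for sock in readable:
                data = sock.recv(64)
                if data:
                    with self._cond:
                        self.acks.append(data.decode())


# Usage: a socket pair stands in for the master <-> slave connection.
receiver = AckReceiver()
receiver.start()
master_side, slave_side = socket.socketpair()
receiver.add_slave(master_side)
slave_side.sendall(b"binlog.000001:154")  # made-up ACK payload
deadline = time.time() + 2
while not receiver.acks and time.time() < deadline:
    time.sleep(0.01)
receiver.stop()
master_side.close()
slave_side.close()
print(receiver.acks)
```

The sketch does not handle a slave closing its connection or a socket being removed while it is being watched; a real receiver needs both.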


3 Removing the LOCK_log mutex from the dump thread
Before this change, the dump thread works as follows:
Foreground thread writing the binlog
    acquire LOCK_log
    write log event to binlog
    release LOCK_log
    signal update
Dump thread
  while client is not killed:
      acquire LOCK_log
      read event from binlog
      release LOCK_log
      if EOF was reached in the previous read:
        acquire LOCK_log
        wait for update signal
        read event from binlog
        release LOCK_log
When a dump thread reads the binlog it acquires the LOCK_log mutex, which blocks every other read or write request on that binlog;

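Removing LOCK_log
Events are only appended to the tail of the current binlog, so reading events elsewhere in the file needs no lock;
The only concern is that the dump thread might read an incomplete event while a foreground thread is writing the binlog;
To handle this, MYSQL_BIN_LOG gains a variable, binlog_end_pos, which records the end position of the last event in the current binlog; the dump thread reads only events before it;
write thread: updates this variable after appending an event,
read thread: reads only events before binlog_end_pos,
The variable is protected by LOCK_binlog_end_pos, which must be held for both reads and writes;
The dump thread logic then becomes: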
  dump thread design:
    end_position = 0
    while client is not killed:
      if current read position == end_position:
        acquire lock_binlog_end
        while end_position == binlog_end and client is not killed:
          wait for update signal
        end_position = binlog_end
        release lock_binlog_end
      if client is killed:
        break
      read event from binlog
http://dev.mysql.com/worklog/task/?id=5721#tabs-5721-5
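
The binlog_end_pos idea can be tried out with a small threaded model. This is a sketch under stated assumptions, not the server code: `BinlogModel`, `dump_thread`, and the event names are hypothetical, a Python list stands in for the binlog file, an event count stands in for a byte position, and a `threading.Condition` plays the role of LOCK_binlog_end_pos together with the update signal.

```python
import threading
import time


class BinlogModel:
    """Toy binlog: events are appended at the tail, then end_pos is published."""

    def __init__(self):
        self.events = []
        self.end_pos = 0  # number of fully written events
        self.end_pos_cond = threading.Condition()  # lock plus update signal

    def append(self, event):
        self.events.append(event)            # write the event first
        with self.end_pos_cond:
            self.end_pos = len(self.events)  # then publish the new end
            self.end_pos_cond.notify_all()


def dump_thread(log, sent, stop):
    read_pos = 0
    end_position = 0  # local copy of the published end position
    while not stop.is_set():
        if read_pos == end_position:
            with log.end_pos_cond:
                while end_position == log.end_pos and not stop.is_set():
                    log.end_pos_cond.wait(timeout=0.05)
                end_position = log.end_pos
        if stop.is_set():
            break
        # Everything before end_position is complete, so no lock is needed here.
        sent.append(log.events[read_pos])
        read_pos += 1


log = BinlogModel()
sent, stop = [], threading.Event()
reader = threading.Thread(target=dump_thread, args=(log, sent, stop))
reader.start()
for i in range(5):
    log.append(f"event-{i}")
deadline = time.time() + 2
while len(sent) < 5 and time.time() < deadline:
    time.sleep(0.01)
stop.set()
reader.join()
print(sent)
```

The lock is held only while comparing and copying the end position, never while reading events, which is the point of the change.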


From the ITPUB blog, link: http://blog.itpub.net/15480802/viewspace-1430221/. Please credit the source when reposting; otherwise legal liability will be pursued.
