Redo group commit

The redo commit flow is roughly as follows:

lock log->mutex

write redo log buffer to disk

unlock log->mutex

fsync

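fsync takes a long time because it writes to disk, and it does not hold log->mutex, so other threads can write to the log buffer while it runs.

Suppose one fsync takes 10 ms and one buffer write takes only 1 ms. Then up to 10 redo records can be written to the buffer while an fsync is in progress, and the next fsync call can flush all 10 records at once.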
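The arithmetic above can be written as a small helper. This is only a back-of-the-envelope model: the function name and its parameters are hypothetical, and it assumes buffer writes are serialized under log->mutex and all take the same time.

```cpp
#include <cassert>

// Hypothetical helper (not part of MySQL): upper bound on the number of redo
// records that can be appended to the log buffer while one fsync is running,
// assuming serialized buffer writes that each take write_ms milliseconds.
int max_records_per_fsync(int fsync_ms, int write_ms)
{
  assert(fsync_ms >= 0 && write_ms > 0);
  return fsync_ms / write_ms;
}
```

With the numbers from the text, max_records_per_fsync(10, 1) gives 10, which is the size of the batch the next fsync can cover.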

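Binlog group commit

MySQL has supported 2PC since 5.0. To keep the order of transactions in the binlog consistent with the order in which they commit, the implementation gave up group commit.

If the binlog order were inconsistent, the replica could not be guaranteed to hold the same data as the primary. This problem was only partially fixed in MySQL 5.5 and was fully fixed in MySQL 5.6.

In MySQL 5.5, group commit can be used only when sync_binlog = 0; in MySQL 5.6, group commit works in all cases.

Transaction commit flow under 2PC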

1. Prepare Innodb:

a) Write prepare record to Innodb's log buffer

b) Sync log file to disk -- redo組提交

c) Take prepare_commit_mutex

2. "Prepare" binary log:

a) Write transaction to binary log

b) Sync binary log based on sync_binlog

3. Commit Innodb:

a) Write commit record to log

b) Release prepare_commit_mutex

c) Sync log file to disk

d) Innodb locks are released

4. "Commit" binary log:

a) Nothing necessary to do here.

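Shortcomings

To guarantee that the binlog is written in order, prepare_commit_mutex is acquired during the redo prepare phase and is not released until just before the redo sync. Only one transaction at a time can hold this mutex, which prevents group commit.

In addition, one complete transaction commit needs 3 fsync calls, which is very inefficient.

Improvements

1 Reduce fsync calls

Crash recovery first scans the redo log to find the transactions that were prepared but neither committed nor aborted, and then checks the binlog: if the binlog contains the transaction it is committed, otherwise it is rolled back.

In other words, the outcome of recovery depends on whether the binlog has a record of the transaction. The InnoDB commit therefore does not need to call fsync immediately: the binlog has already been written at that point, so the transaction is guaranteed to commit even after a crash.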
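The recovery rule described above can be sketched as follows. This is a minimal illustration, not MySQL's recovery code: the Xid typedef, the enum, and both functions are hypothetical names, and the binlog is reduced to a set of transaction ids.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical transaction identifier; not a MySQL type.
typedef unsigned long long Xid;

enum RecoveryAction { COMMIT_TRX, ROLLBACK_TRX };

// Decide the fate of one transaction found in the prepared state in the
// redo log: commit it if the binlog recorded it, otherwise roll it back.
RecoveryAction decide(Xid xid, const std::set<Xid> &binlog_xids)
{
  return binlog_xids.count(xid) ? COMMIT_TRX : ROLLBACK_TRX;
}

// Return, in the original order, the prepared transactions that recovery
// would commit.
std::vector<Xid> trx_to_commit(const std::vector<Xid> &prepared,
                               const std::set<Xid> &binlog_xids)
{
  std::vector<Xid> result;
  for (Xid xid : prepared)
    if (decide(xid, binlog_xids) == COMMIT_TRX)
      result.push_back(xid);
  return result;
}
```

Because the decision only reads the binlog, a commit record that had not reached disk before the crash does not change the result, which is why the fsync at InnoDB commit can be skipped.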

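2 Refine the binlog commit code to implement group commit

In Oracle MySQL 5.6.15 the binlog group commit module was rewritten, and the process is split into 3 stages: flush/sync/commit.

The 2PC commit flow in 5.6 is as follows

1. Ask binary log (i.e. coordinator) to prepare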

a) Request to release locks earlier

b) Prepare Innodb (Callback to (2.a))

2. Prepare Innodb:

a) Write prepare record to Innodb log buffer

b) Sync log file to disk

3. Ask binary log (i.e. coordinator) to commit

a) Lock access to flush stage

b) Write a set of transactions to the binary log

c) Unlock access to flush stage

d) Lock access to sync stage

e) Flush the binary log to disk

f) Unlock access to sync stage

g) Lock access to commit stage

h) Commit Innodb (Callback to (4.a))

i) Unlock access to commit stage

4. Commit Innodb

a) Write a commit record to Innodb log buffer

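The binlog commit is split into 3 processing stages, each protected by a lock (at this point the redo log has already been fsynced and the transaction has not yet been committed).

These 3 stages take the queued transactions in batches, write them to the binlog and call fsync, and then (optionally) commit the transactions in the same order.

The first transaction to enter a stage acts as the leader and the rest are followers; the followers release all their latches and wait until the leader has finished the commit.

The leader takes all the queued transactions and processes them. When it enters the next stage, it stays leader if that stage's queue is empty; otherwise it is demoted to follower.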
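The leader/follower rule can be illustrated with a single stage queue. This is a simplified sketch under stated assumptions: StageQueue and its methods are hypothetical names, transactions are reduced to integer ids, and the waiting and signalling of followers is left out. The real Stage_manager in MySQL is more involved.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical single-stage queue; not the MySQL Stage_manager.
class StageQueue
{
public:
  // Append a transaction id. Returns true if the queue was empty, which
  // makes the caller the leader of this stage; otherwise it is a follower.
  bool enroll(int trx_id)
  {
    std::lock_guard<std::mutex> guard(m_lock);
    bool leader= m_queue.empty();
    m_queue.push_back(trx_id);
    return leader;
  }

  // Leader only: take the whole queue in arrival order and leave it empty.
  std::vector<int> fetch_and_empty()
  {
    std::lock_guard<std::mutex> guard(m_lock);
    std::vector<int> batch;
    batch.swap(m_queue);
    return batch;
  }

private:
  std::mutex m_lock;
  std::vector<int> m_queue;
};
```

A thread for which enroll() returns true processes the batch returned by fetch_and_empty() on behalf of everyone in it. When it moves on and enrolls in the next stage's queue, the same return value tells it whether it is still the leader there.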

1. Flush Stage

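The leader keeps reading the flush queue until the queue is empty or a timeout is reached, so transactions that join while the batch is being processed are also handled promptly.

The leader writes the queued transactions to the binlog buffer and moves on to the next stage once the queue is empty.

The timeout keeps transactions from waiting too long.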
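The "until empty or timed out" loop can be sketched like this. It is a single-threaded illustration only: drain_flush_queue, the integer transaction ids, and the time budget parameter are all hypothetical, and pushing an id into the result vector stands in for writing that transaction to the binlog buffer.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <vector>

// Hypothetical flush-stage loop: keep taking transactions from the front of
// the queue until it is empty or the time budget is used up. Returns the
// transactions in the order they were "written".
std::vector<int> drain_flush_queue(std::deque<int> &flush_queue,
                                   std::chrono::milliseconds budget)
{
  typedef std::chrono::steady_clock Clock;
  const Clock::time_point deadline= Clock::now() + budget;
  std::vector<int> written;
  while (!flush_queue.empty() && Clock::now() < deadline)
  {
    written.push_back(flush_queue.front());  // stand-in for the binlog write
    flush_queue.pop_front();
  }
  return written;
}
```

Anything still in the queue when the deadline passes stays there for a later batch, so the transactions already written are not held back by a steady stream of new arrivals.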

2. Sync Stage

Call fsync once to flush multiple transactions.

3. Commit Stage

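Commit the transactions, guaranteeing that they commit in the same order in which they were written to the binlog (needed by InnoDB hot backup). For better performance, committing out of order can be chosen instead.

Code implementation

The binlog originally implemented the handlerton interface, including methods such as commit()/rollback(); 5.6 introduces a new mechanism.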

class MYSQL_BIN_LOG: public TC_LOG

{

public:

int open_connection(THD* thd);

int close_connection(THD* thd);

int commit(THD *thd, bool all);

int rollback(THD *thd, bool all);

int savepoint_set(THD* thd, SAVEPOINT *sv);

int savepoint_release(THD* thd, SAVEPOINT *sv);

int savepoint_rollback(THD* thd, SAVEPOINT *sv);

};

int MYSQL_BIN_LOG::commit(THD *thd, bool all)

{

/* Call batch_commit(). */

}

int MYSQL_BIN_LOG::batch_commit(THD* thd, bool all)

{

/* Add the transaction to the flush queue; the first transaction becomes the leader, and followers block until the commit is done. */

if (change_stage(thd, Stage_manager::FLUSH_STAGE, thd, NULL, &LOCK_log))

return finish_commit(thd->commit_error);

/* Write the transactions to the binlog. */

THD *flush_queue= NULL; /* Gets a pointer to the flush_queue */

error= flush_stage_queue(&flush_queue);

if (change_stage(thd, Stage_manager::SYNC_STAGE, flush_queue, &LOCK_log, &LOCK_sync))

return finish_commit(thd->commit_error);

/* Call fsync according to the sync_binlog option; in 5.5 group commit works only with sync_binlog=0. */

THD *sync_queue= NULL; /* Gets a pointer to the sync_queue */

error= sync_stage_queue(&sync_queue);

/* Depending on opt_binlog_order_commits, either commit the transactions in binlog write order, or let each thread call handlerton->commit and commit on its own. */

if (opt_binlog_order_commits)

{

if (change_stage(thd, Stage_manager::COMMIT_STAGE,

sync_queue, &LOCK_sync, &LOCK_commit))

return finish_commit(thd);

THD *commit_queue= NULL;

error= commit_stage_queue(&commit_queue);

mysql_mutex_unlock(&LOCK_commit);

final_queue= commit_queue;

}

else

{

final_queue= sync_queue;

mysql_mutex_unlock(&LOCK_sync);

}

/* Signal the followers: they either commit their own transactions (opt_binlog_order_commits=false) or notify the client. */

stage_manager.signal_done(final_queue);

return finish_commit(thd);

}

References

http://dev.mysql.com/worklog/task/?id=5223

http://mysqlmusings.blogspot.co.uk/2012/06/binary-log-group-commit-in-mysql-56.html?_sm_au_=iDV88W54k66P05L7

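Postscript:
I came across a post shared by Taobao showing that there is still room for optimization on top of 5.6. To guarantee that a transaction can be recovered, it is enough to make sure its redo entries are on disk before the binlog commit. The redo write/sync can therefore be moved from the redo prepare step to the binlog prepare step, placed between the flush stage and the commit stage. This reduces the number of log_sys->mutex acquisitions from N to 1, and the number of fsync calls to 1 as well.
Problem
Every transaction must make sure its prepare record has been written and fsynced to the redo log file. One transaction may happen to complete the redo write on behalf of others, but this is a matter of chance and still incurs noticeable log_sys->mutex overhead.
Optimization
From the XA recovery logic we know that it is enough for the InnoDB prepare redo log to be written and synced before the binlog is written. We therefore made a small change to the logic of the first group commit stage, roughly as follows:
Step 1. InnoDB prepare: record the current LSN in thd.
Note: the file write that originally required log->mutex in this step is cancelled and deferred to the next stage, which adds group writes on top of the existing fsync group commit.
Step 2. Enter the flush stage of group commit; the leader collects the queue and computes the largest LSN in it.
Step 3. Write/fsync the InnoDB redo log up to that LSN.
Step 4. Write the binlog and carry out the remaining work (sync binlog, InnoDB commit, etc.).
By deferring the redo log write, the redo log is explicitly written once as a group, and contention on log_sys->mutex is reduced.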
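Steps 2 and 3 can be sketched as below. This is an illustration under assumptions, not the Taobao patch or the MySQL 5.7 code: Lsn, RedoLog, write_up_to and flush_redo_for_queue are hypothetical names, and the actual file write and fsync are replaced by a counter.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef unsigned long long Lsn;  // hypothetical stand-in for an InnoDB LSN

// Hypothetical redo log that only tracks how far it has been made durable.
struct RedoLog
{
  Lsn flushed_lsn= 0;
  int fsync_calls= 0;

  // Write and fsync up to target; does nothing if already durable that far.
  void write_up_to(Lsn target)
  {
    if (target <= flushed_lsn)
      return;
    flushed_lsn= target;
    ++fsync_calls;  // stand-in for the real write + fsync
  }
};

// Leader step: one redo write/fsync that covers every queued transaction,
// using the largest prepare LSN recorded by the transactions in the queue.
void flush_redo_for_queue(RedoLog &redo, const std::vector<Lsn> &prepare_lsns)
{
  if (prepare_lsns.empty())
    return;
  redo.write_up_to(*std::max_element(prepare_lsns.begin(),
                                     prepare_lsns.end()));
}
```

For a queue whose transactions recorded LSNs 120, 80 and 200, the leader issues a single write up to 200 instead of three separate writes, which is where the saving in log_sys->mutex acquisitions comes from.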
Official MySQL has since optimized the 5.7.6 code along the lines proposed in bug#73202, which we reported. The corresponding release note reads:
When using InnoDB with binary logging enabled, concurrent transactions written in the InnoDB redo log are now grouped together before synchronizing to disk when innodb_flush_log_at_trx_commit is set to 1, which reduces the amount of synchronization operations. This can lead to improved performance.
A quick test with sysbench, update_non_index.lua, 100 tables of 100,000 rows each, innodb_flush_log_at_trx_commit=2, sync_binlog=1000, GTID disabled:
Threads    Original    Modified
32         25600       27000
64         30000       35000
128        33000       39000
256        29800       38000
