原創轉載請註明出處

本文使用引擎INNODB版本MYSQL5.7.13
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
也就是雙1設定，

本文使用透過對mysql 進行trace，gdb 自制工具infobin，以及
select * from sys.session where command<>'SLEEP' \G完成。
這裡只是涉及binlog的寫入問題和innodb層無關

1、問題由來
線上上環境一次同事刪除了1000W的資料，在commit的時候，整個資料庫
大約有幾十秒的全部hang住的情況，本文就是為了深入剖析這個問題。

2、可能需要的知識
http://blog.itpub.net/7728585/viewspace-2133589/ MYSQL 什麼時候同步binlog檔案以及它的臨時檔案在哪裡？
http://blog.itpub.net/7728585/viewspace-2133188/ 解析MYSQL BINLOG 二進位制格式(1)--準備工作
http://blog.itpub.net/7728585/viewspace-2133189/ 解析MYSQL BINLOG 二進位制格式(2)--FORMAT_DESCRIPTION_EVENT
http://blog.itpub.net/7728585/viewspace-2133321/ 解析MYSQL BINLOG 二進位制格式(3)--QUERY_EVENT
http://blog.itpub.net/7728585/viewspace-2133429/ 解析MYSQL BINLOG 二進位制格式(4)--TABLE_MAP_EVENT
http://blog.itpub.net/7728585/viewspace-2133463/ 解析MYSQL BINLOG 二進位制格式(5)--WRITE_ROW_EVENT
http://blog.itpub.net/7728585/viewspace-2133469/ 解析MYSQL BINLOG 二進位制格式(6)--UPDATE_ROW_EVENT/DELETE_ROW_EVENT
http://blog.itpub.net/7728585/viewspace-2133502/ 解析MYSQL BINLOG 二進位制格式(7)--Xid_log_event/XID_EVENT
http://blog.itpub.net/7728585/viewspace-2133506/ 解析MYSQL BINLOG二進位制格式(8)--GTID_LOG_EVENT/ANONYMOUS_GTID_LOG_EVENT及其他
http://blog.itpub.net/7728585/viewspace-2133534/ 解析MYSQL BINLOG二進位制格式(9)--infobin解析binlog幫助文件
http://blog.itpub.net/7728585/viewspace-2133537/ 解析MYSQL BINLOG二進位制格式(10)--問題解答

3、簡單說明commit的時候做了什麼以及為什麼要上鎖
在<>一文中有如下的圖

其實這幅圖中我已經寫了需要一把鎖，但是沒仔細說，這把鎖影響的是全部當前資料庫的需要寫binlog的全部語句,我們已經知道對於大事物(超過)的commit期間需要將臨時
檔案的中的binlog event 全部複製到binlog中，並且是一個事物的event必須是連續的，也就是一個事物的event中不能包含其他事物的event那麼這期間需要一把mutex鎖，為什
麼要這把鎖，這是在多執行緒/程式併發對同一個檔案進行修改，也就是這裡的binlog檔案修改的，一種保護臨界資源的方法，如果沒有這把鎖就會出現binlog中的檔案出現錯亂的情況。
注意這個鎖是mysql層的而非innodb層。

4、模擬方法

對錶test 進行刪除不提交，觀察臨時檔案大小。
關於生成臨時檔案的原理參考：
(http://blog.itpub.net/7728585/viewspace-2133589/ MYSQL 什麼時候同步binlog檔案以及它的臨時檔案在哪裡？)

對刪除delete test表commit期間馬上對testsort和testsort2表插入一條資料，
這個時候可以觀察到這兩個插入，會被堵塞知道delete test的事物提交完成
才完成。
mysql> select count(*) from test;
+----------+
| count(*) |
+----------+
| 2621440 |
+----------+
1 row in set (3.14 sec)

及如下：

可以看到一個簡單的insert居然花了10多秒，commit也是做了14秒。原因就在於需要將MLq9INFu這個臨時檔案同步在binlog這個過程
其他需要寫入binlog的會話是堵塞的。

刪除期間會看到臨時檔案不斷增大如下：
[root@testmy ~]# lsof |grep /tmp/ML
mysqld 2095 mysql 84u REG 8,3 328560640 14680133 /tmp/MLq9INFu (deleted)
[root@testmy ~]# lsof |grep /tmp/ML
mysqld 2095 mysql 84u REG 8,3 346337280 14680133 /tmp/MLq9INFu (deleted)
[root@testmy ~]# lsof |grep /tmp/ML
mysqld 2095 mysql 84u REG 8,3 367812608 14680133 /tmp/MLq9INFu (deleted)

到了commit後這個檔案變為0，但是任然存在因為這個會話執行緒還在：
[root@testmy ~]# lsof |grep /tmp/ML
mysqld 2095 mysql 84u REG 8,3 0 14680133 /tmp/MLq9INFu (deleted)

5、使用infobin 對binlog檔案中進行驗證

使用./infobin test.000211>log.log 進行分析
(可以使用mysqlbinlog但是不直觀
工具可以參考
http://blog.itpub.net/7728585/viewspace-2133534/ 解析MYSQL BINLOG二進位制格式(9)--infobin解析binlog幫助文件
獲得)

-----delete from test; 的binlog event

>Gtid Event:Pos:368030739(0X15efb413) N_pos:368030804(0X15efb454) Time:1487197597 Event_size:65(bytes)
Gtid:4a6f2a67-5d87-11e6-a6bd-0c29a879a3:1158184
-->Query Event:Pos:368030804(0X15efb454) N_Pos:368030876(0X15efb49c) Time:1487197355 Event_size:72(bytes)
Exe_time:0 Use_db:test Statment(35b-trun):BEGIN /*!Trx begin!*/ Gno:1158184
---->Map Event:Pos368030876(0X15efb49c) N_pos:368031047(0X15efb547) Time:1487197355 Event_size:171(bytes)
TABLE_ID:129 DB_NAME:test TABLE_NAME:test Gno:1158184
------>Delete Event:Pos:368031047(0X15efb547) N_pos:368039196(0X15efd51c) Time:1487197355 Event_size:8149(bytes)
Dml on table: test.test table_id:129 Gno:1158184
....(中間省略)
------>Delete Event:Pos:736052116(0X2bdf4394) N_pos:736054135(0X2bdf4b77) Time:1487197355 Event_size:2019(bytes)
Dml on table: test.test table_id:129 Gno:1158184
>Xid Event:Pos:736054135(0X2bdf4b77) N_Pos:736054166(0X2bdf4b96) Time:1487197597 Event_size:31(bytes)
COMMIT; /*!Trx end*/ Gno:1158184

-----insert into testsort values(1);的binlog

>Gtid Event:Pos:736054166(0X2bdf4b96) N_pos:736054231(0X2bdf4bd7) Time:1487197599 Event_size:65(bytes)
Gtid:4a6f2a67-5d87-11e6-a6bd-0c29a879a3:1158185
-->Query Event:Pos:736054231(0X2bdf4bd7) N_Pos:736054303(0X2bdf4c1f) Time:1487197599 Event_size:72(bytes)
Exe_time:0 Use_db:test Statment(35b-trun):BEGIN /*!Trx begin!*/ Gno:1158185
---->Map Event:Pos736054303(0X2bdf4c1f) N_pos:736054358(0X2bdf4c56) Time:1487197599 Event_size:55(bytes)
TABLE_ID:187 DB_NAME:test TABLE_NAME:testsort Gno:1158185
------>Insert Event:Pos:736054358(0X2bdf4c56) N_pos:736054410(0X2bdf4c8a) Time:1487197599 Event_size:52(bytes)
Dml on table: test.testsort table_id:187 Gno:1158185
>Xid Event:Pos:736054410(0X2bdf4c8a) N_Pos:736054441(0X2bdf4ca9) Time:1487197599 Event_size:31(bytes)
COMMIT; /*!Trx end*/ Gno:1158185

------insert into testsort2 values(1);的binlog

>Gtid Event:Pos:736054441(0X2bdf4ca9) N_pos:736054506(0X2bdf4cea) Time:1487197600 Event_size:65(bytes)
Gtid:4a6f2a67-5d87-11e6-a6bd-0c29a879a3:1158186
-->Query Event:Pos:736054506(0X2bdf4cea) N_Pos:736054578(0X2bdf4d32) Time:1487197600 Event_size:72(bytes)
Exe_time:0 Use_db:test Statment(35b-trun):BEGIN /*!Trx begin!*/ Gno:1158186
---->Map Event:Pos736054578(0X2bdf4d32) N_pos:736054634(0X2bdf4d6a) Time:1487197600 Event_size:56(bytes)
TABLE_ID:188 DB_NAME:test TABLE_NAME:testsort2 Gno:1158186
------>Insert Event:Pos:736054634(0X2bdf4d6a) N_pos:736054686(0X2bdf4d9e) Time:1487197600 Event_size:52(bytes)
Dml on table: test.testsort2 table_id:188 Gno:1158186
>Xid Event:Pos:736054686(0X2bdf4d9e) N_Pos:736054717(0X2bdf4dbd) Time:1487197600 Event_size:31(bytes)
COMMIT; /*!Trx end*/ Gno:1158186

可以看到確實insert into testsort values(1)和insert into testsort2 values(1);的event確實在delete後才進入
binlog file

6、分析等待點。
那麼到底在等待什麼呢？這是另外一次測試的sys.session輸出

mysql> select * from sys.session where command<>'SLEEP' \G
*************************** 1. row ***************************
thd_id: 28
conn_id: 3
user: root@localhost
db: test
command: Query
state: starting
time: 13
current_statement: commit
statement_latency: 12.55 s
progress: NULL
lock_latency: 0 ps
rows_examined: 0
rows_sent: 0
rows_affected: 0
tmp_tables: 0
tmp_disk_tables: 0
full_scan: NO
last_statement: NULL
last_statement_latency: NULL
current_memory: 179.02 KiB
last_wait: wait/io/file/sql/binlog
last_wait_latency: Still Waiting
source: binlog.cc:8469
trx_latency: 3.22 m
trx_state: ACTIVE
trx_autocommit: NO
pid: 5821
program_name: mysql
*************************** 2. row ***************************
thd_id: 3031
conn_id: 3006
user: root@localhost
db: test
command: Query
state: query end
time: 12
current_statement: insert into testsort3(id1,id2, ... OR((RAND()*100000)),'gaopeng')
statement_latency: 11.82 s
progress: NULL
lock_latency: 29.18 ms
rows_examined: 0
rows_sent: 0
rows_affected: 0
tmp_tables: 0
tmp_disk_tables: 0
full_scan: NO
last_statement: NULL
last_statement_latency: NULL
current_memory: 556.53 KiB
last_wait: wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done
last_wait_latency: Still Waiting
source: binlog.cc:1894
trx_latency: 12.97 s
trx_state: ACTIVE
trx_autocommit: YES
pid: 15312
program_name: mysql

不難看出current_statement: commit 等待是
last_wait: wait/io/file/sql/binlog
last_wait_latency: Still Waiting
source: binlog.cc:8469
而被堵塞的insert等待是
last_wait: wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done
last_wait_latency: Still Waiting
source: binlog.cc:1894

這是commit 等待的binlog.cc:8469的內容
/**
On *pure non-transactional* workloads there is a small window
in time where a concurrent rotate might be able to close
the file before the sync is actually done. In that case,
ignore the bad file descriptor errors.

Transactional workloads (InnoDB) are not affected since the
the rotation will not happen until all transactions have
committed to the storage engine, thence decreased the XID
counters.(重點)

TODO: fix this properly even for non-transactional storage
engines.
*/
if (DBUG_EVALUATE_IF("simulate_error_during_sync_binlog_file", 1,
mysql_file_sync(log_file.file,
MYF(MY_WME | MY_IGNORE_BADFD))))
從原始碼註釋上可以看出原因再來看看他掉用int my_sync(File fd, myf my_flags)
既然加入檔案描述符fd，應該就是在寫binlog file

下面是insert等待的source: binlog.cc:1894的內容
mysql_cond_wait(&m_cond_done, &m_lock_done);
這個地方就是拿不到這把鎖，他是Stage_manager::enroll_for呼叫的函式
而 Stage_manager原始碼註釋是：
/**
Class for maintaining the commit stages for binary log group commit.
*/
而enroll_for方法的描述為：
/**
Enroll a set of sessions for a stage.
This will queue the session thread for writing and flushing.
If the thread being queued is assigned as stage leader, it will
return immediately.
If wait_if_follower is true the thread is not the stage leader,
the thread will be wait for the queue to be processed by the
leader before it returns.(重點)
In DBUG-ON version the follower marks is preempt status as ready.
@param stage Stage identifier for the queue to append to.
@param first Queue to append.
@param stage_mutex
Pointer to the currently held stage mutex, or NULL if
we're not in a stage.
@retval true Thread is stage leader.
@retval false Thread was not stage leader and processing has been done.
*/

那麼確實commit後binary log確實有一個寫入的策略，會根據是否為leader進行寫入，
如果不是leader則等到leader全部寫完才進行寫入，多個等待寫入binlog的會話計入
一個佇列中，這貌似就是並行了。

我使用GDB將斷點打到
Stage_manager::enroll_for
MYSQL_BIN_LOG::sync_binlog_file
發現整個流程如下：

A會話COMMIT-->拿到鎖-->進行binlog寫-->commit完成
B會話COMMIT-->等待鎖--------------------------->拿到鎖-->進行binlog寫-->commit完成

7、如何避免

沒有什麼好的辦法如果使用innodb雙1的情況下，只能每次修改少量的資料，增加修改次數
前面也說了binlog生成量大約為你修改資料量的2/3如果update是4/3那麼就要評估你的I/O
效能，如果順序寫入每秒寫入為X需要修改的資料量大約為Y
那麼
S(秒)=(Y*(2/3))/(X/2)
控制S到一個你認為可以接受的範圍內。其實可以透過copy來試試你的順序的I/O效能。

比如我們寫這樣一個儲存過程：
DELIMITER //
CREATE PROCEDURE d_move(in beid int,in cunt int,in rws int,in maxid int)
begin
declare num int;
declare nowid int;
set num = 1;
set nowid = beid;
while num <= cunt do
insert into test_vest_hist select * from test_vest where id>=nowid and id<nowid+rws and="" id<nowid+rws="" id<maxid;="" ="" select="" concat(now(),'="" ',row_count(),'="" insert!'); delete from test_vest where id>=nowid and id<nowid+rws and="" id<maxid;="" ="" select="" concat(now(),'="" ',row_count(),'="" delete!'); select concat(now(),' ',num,' times delete finsh!');
set nowid = nowid+rws;
set num = num+1;
select sleep(2);
end while;
end //
將rws控制到一個可以接受的範圍比如5W,當然最好進行sleep(),這樣給innodb引擎一個刷入髒資料的時間減少壓力。

如果sync_log不設定為1，這個我做了trace確實就不會在commit的進行binlog寫入了，但是這是不安全的，commit結束
後日志不落地是不安全的。

作者微信：

mysql 大事物commit慢造成全庫堵塞問題

相關文章