生產環境有一套3個節點的MySQL InnoDB Cluster,MySQL的版本為Server version: 8.0.35 MySQL Community Server - GPL, 早上突然收到Zabbix的告警,其中一個節點出現空間告警:"/data: Disk space is low (used > 80%)"
檢查分析後發現是因為MySQL的binlog沒有清理導致空間報警,如下所示(binlog太多,省略了部分內容)
$ du -sh *
78G bin_logs
12K conf
7.3G data
132K logs
4.0K mysql.pid
0 mysql_temp
4.0K slow_logs
$ cd bin_logs
$ ls -lrt
total 81783492
-rw-r----- 1 mysql mysql 909802606 Jan 9 00:10 mysql_binlog.000011
-rw-r----- 1 mysql mysql 984228474 Jan 10 00:10 mysql_binlog.000012
-rw-r----- 1 mysql mysql 980302993 Jan 11 00:10 mysql_binlog.000013
-rw-r----- 1 mysql mysql 985147519 Jan 12 00:10 mysql_binlog.000014
-rw-r----- 1 mysql mysql 962232624 Jan 13 00:10 mysql_binlog.000015
....................................................
....................................................
-rw-r----- 1 mysql mysql 891703862 Apr 7 00:10 mysql_binlog.000101
-rw-r----- 1 mysql mysql 920700543 Apr 8 00:10 mysql_binlog.000102
-rw-r----- 1 mysql mysql 921675101 Apr 9 00:10 mysql_binlog.000103
-rw-r----- 1 mysql mysql 3854 Apr 9 00:10 mysql_binlog.index
-rw-r----- 1 mysql mysql 330341220 Apr 9 09:30 mysql_binlog.000104
檢查三個節點(姑且用node1, node2, node3表示),發現其它兩個節點(node1,node2)的binlog都正常清理了,而唯獨節點node3的binlog一直沒有被清理。三個節點的MySQL系統變數是一致的,如下所示,
mysql> show variables like 'binlog_expire_logs_seconds';
+----------------------------+---------+
| Variable_name | Value |
+----------------------------+---------+
| binlog_expire_logs_seconds | 1209600 |
+----------------------------+---------+
1 row in set (0.00 sec)
mysql> show variables like 'expire_logs_days';
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| expire_logs_days | 0 |
+------------------+-------+
1 row in set (0.00 sec)
mysql>
binlog_expire_logs_seconds設定為1209600,也就是14天,超過14天的binlog應該被清理掉,但是檢查發現node3上超過14天的binlog並沒有被清理。那麼為什麼節點node3上的過期的binlog沒有被清理呢?
首先,這三個節點的MySQL資料庫版本是一致的,出現問題的節點node3是MySQL InnoDB叢集的備庫,我們在這個節點上做Xtrabackup備份。
其次,我們檢查MySQL錯誤日誌,其實每天作業做Xtrabackup備份時,都會收到類似下面告警郵件
2024-04-09T00:10:03.078285+08:00 357726 [Warning] [MY-013712] [Server] No suitable 'keyring_component_metadata_query' service implementation found to fulfill the request.
2024-04-09T00:10:12.066464+08:00 357726 [Warning] [MY-014054] [Server] Could not purge binary logs since another session is executing LOCK INSTANCE FOR BACKUP. Wait for that session to release the lock.
出現這個告警資訊,之前也查過,官方文件提示,當對MySQL例項發出LOCK INSTANCE FOR BACKUP語句期間,PURGE BINARY LOGS是不允許的,因為它會從伺服器中刪除檔案(刪除binlog檔案),違反了備份鎖(backup lock)的規則。從MySQL 8.0.28開始,這個是不允許的。具體的官方文件[1]資訊如下所示:
PURGE BINARY LOGS should not be issued while a LOCK INSTANCE FOR BACKUP statement is in effect for the instance, because it contravenes the rules of the backup lock by removing files from the server. From MySQL 8.0.28, this is disallowed.
因為Xtrabackup備份時,它會執行FLUSH NO_WRITE_TO_BINLOG BINARY LOGS命令,它會輪換二進位制日誌(rotate the binlog),這實質上將建立一個新的二進位制日誌,而在 輪換二進位制日誌時,就會觸發PURGE BINARY LOGS,而由於當前MySQL版本為MySQL 8.0.35,在MySQL進行Xtrabackup時,由於LOCK INSTANCE FOR BACKUP鎖,導致PURGE BINARY LOGS是不允許的。所以Xtrabackup期間,我們會收到這個告警,即使binlog出現了輪換,但是也不會觸發清理過期的binlog。之前分析過這個告警,但是覺得它沒有產生什麼異常或問題,就忽略了這個告警。
另外,我們要弄清楚清理過期binlog的原理機制,即使我們設定了系統變數binlog_expire_logs_seconds,不是說一旦binlog的建立時間超過了系統變的binlog_expire_logs_seconds的閾值,就會立即觸發MySQL的相關執行緒去清理過期的binlog。清理過期的binlog,一般發生在MySQL啟動時或重新整理二進位制日誌時進行。官方文件[2]的描述如下:
Binary log files are automatically removed after the server's binary log expiration period. Removal of the files can take place at startup and when the binary log is flushed.
我檢查系統變數max_binlog_size的大小,如下所示,也就是1G,
mysql> show variables like 'max_binlog_size';
+-----------------+------------+
| Variable_name | Value |
+-----------------+------------+
| max_binlog_size | 1073741824 |
+-----------------+------------+
1 row in set (0.01 sec)
mysql>
由於這個系統一天產生的binlog小於1G(可以從上面的binlog的輸出資訊看出,例如,binlog的大小,binlog的建立時間),也就是說,一天24小時內,MySQL的binlog不會由於binlog過大而觸發binlog的切換,也就是說只有Xtrabackup時才觸發了binlog的切換,而又由於限制問題,導致過期的binlog不能被清理,久而久之,過期的binlog一直不能清理,直到觸發空間告警,問題才被暴露出來。
而其它幾套MySQL InnoDB Cluster也有這個問題,但是它們的過期的binlog為什麼被清理了呢? 因為它們一天產生的binlog的大小,超過了系統變數max_binlog_size的大小,在其它時間點會輪換/切換binlog,從而觸發了PURGE BINARY LOGS。所以就沒有遇到這樣的問題。
參考資料
1: https://dev.mysql.com/doc/refman/8.0/en/lock-instance-for-backup.html
[2]2: https://dev.mysql.com/doc/refman/8.0/en/purge-binary-logs.html