InnoDB: ###### Diagnostic info printed to the standard error stream

2021-11-19T14:52:21.671328+08:00 0 [ERROR] [MY-012872] [InnoDB] [FATAL] Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

2021-11-19T14:52:21.671407+08:00 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: ut0ut.cc:532 thread 139928164968192

InnoDB: We intentionally generate a memory trap.

InnoDB: Submit a detailed bug report to

InnoDB: If you get repeated assertion failures or crashes, even

InnoDB: immediately after the mysqld startup, there may be

InnoDB: corruption in the InnoDB tablespace. Please refer to

InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html

InnoDB: about forcing recovery.

06:52:21 UTC - mysqld got signal 6 ;

Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.

Thread pointer: 0x0

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

Thread 139932014196480 在 buf0flu.cc line 1357 等待一個訊號量，需要在buf0buf.cc line 781 建立一個申請SX-lock的 RW-latch ，但是這時thread id 139928049043200

在這上面已經有獨佔鎖了，

而Thread 139928049043200 在 btr0cur.cc line 233 等待一個訊號量，需要在buf0buf.cc line 781 建立一個申請X-lock (wait_ex)的 RW-latch ，可是此時Thread 139928049043200

實際上是一直在等待獨佔鎖（wait exclusive），即是說該RW-latch 目前仍未被釋放。

[Warning] [MY-012985] [InnoDB] A long semaphore wait:

--Thread 139932014196480 has waited at buf0flu.cc line 1357 for 784 seconds the semaphore:

SX-lock on RW-latch at 0x7f439b4df988 created in file buf0buf.cc line 781

a writer (thread id 139928049043200) has reserved it in mode exclusive

2021-11-19T14:51:51.381694+08:00 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:

--Thread 139928049043200 has waited at btr0cur.cc line 233 for 923 seconds the semaphore:

X-lock (wait_ex) on RW-latch at 0x7f43a36c0108 created in file buf0buf.cc line 781

a writer (thread id 139928049043200) has reserved it in mode wait exclusive

先說buf0buf.cc ，其就是指的database buffer pool , line 781 對應的是 rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING); 是建立一個讀寫鎖，其所屬的函式是buf_block_init()，其是在初始buffer pool時初始化一個buffer 控制塊，即是指對資料庫buffer 進行鎖控制的。

而buf0flu.cc 是用於mysql 的buffer pool flush的。line 1357 是 rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE)，看起來是在flush buffer時需要申請的rw_lock。

btr0cur.cc 的描述是索引樹遊標，作用是所有行操作對一個b-tree或記錄的改變都必須透過這個模組，同時，每一次對聚合索引記錄的修改或插入，都會在這裡寫入undo 日誌記錄。

而第233 行是獲取btree block的過程，可以看到他需要去申請RW_X_LATCH.

232 get_block = btr_block_get(page_id_t(page_id.space(), left_page_no),

233 page_size, RW_X_LATCH, cursor->index, mtr);

這裡看起來就是對錶的更改操作才會觸發，同時在error log中也可以看到Thread 139928049043200 的TRANSACTION 787138017 是一直在進行rollback，這可以進一步印證以上情況。

---TRANSACTION 787138017, ACTIVE 4387 sec rollback <<<<<<<

mysql tables in use 2, locked 1

ROLLING BACK 1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 7891101

MySQL thread id 11963922, OS thread handle 139928049043200, query id 1348226447 30.88.75.45 msvcmdata query end

INSERT INTO `xxl_job_log_copy1` SELECT * FROM `msvcm`.`xxl_job_log`

Trx read view will not see trx with id >= 787138017, sees < 787138017

同時也存在另外一條類似sql正在進行inserting,此時的buffer pool確實會相對緊張。

---TRANSACTION 787233964, ACTIVE 3040 sec inserting

mysql tables in use 2, locked 1

1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 8544482

MySQL thread id 11964240, OS thread handle 139933323695872, query id 1348381276 30.88.75.45 msvcmdata executing

INSERT INTO `xxl_job_log_copy2` SELECT * FROM `msvcm`.`xxl_job_log`

Trx read view will not see trx with id >= 787233964, sees < 787138017

從innodb status 看起，當時Pending flushes buffer pool 的情況不斷上漲。

Pending flushes (fsync) log: 1; buffer pool: 30

Pending flushes (fsync) log: 1; buffer pool: 33

Pending flushes (fsync) log: 1; buffer pool: 31

Pending flushes (fsync) log: 1; buffer pool: 35

Pending flushes (fsync) log: 0; buffer pool: 37

Pending flushes (fsync) log: 0; buffer pool: 48

Pending flushes (fsync) log: 1; buffer pool: 54

Pending flushes (fsync) log: 1; buffer pool: 62

Pending flushes (fsync) log: 0; buffer pool: 68

Pending flushes (fsync) log: 0; buffer pool: 63

Pending flushes (fsync) log: 1; buffer pool: 63

Pending flushes (fsync) log: 0; buffer pool: 67

Pending flushes (fsync) log: 0; buffer pool: 71

Pending flushes (fsync) log: 1; buffer pool: 70

Pending flushes (fsync) log: 1; buffer pool: 71

Pending flushes (fsync) log: 1; buffer pool: 72

Pending flushes (fsync) log: 1; buffer pool: 73

Pending flushes (fsync) log: 0; buffer pool: 76

Pending flushes (fsync) log: 1; buffer pool: 75

Pending flushes (fsync) log: 0; buffer pool: 75

Pending flushes (fsync) log: 1; buffer pool: 75

Pending flushes (fsync) log: 1; buffer pool: 76

Pending flushes (fsync) log: 0; buffer pool: 76

Pending flushes (fsync) log: 1; buffer pool: 78

Pending flushes (fsync) log: 0; buffer pool: 82

Pending flushes (fsync) log: 0; buffer pool: 85

Pending flushes (fsync) log: 1; buffer pool: 89

Pending flushes (fsync) log: 1; buffer pool: 95

Pending flushes (fsync) log: 1; buffer pool: 97

Pending flushes (fsync) log: 0; buffer pool: 102

主要是因為存在兩條比較大的insert select 全表掃描的語句，一條在進行insert,一條在進行rollback，加上當時buffer pool相對比較小，buffer pool的資源緊張，資料庫在flush buffer及modify buffer 時對buffer 的latch發生爭用且長時間等待，從而導致訊號量超時。

目前我暫時未找到相關的已知bug描述，這類問題解決方案就是避免SQL長時間佔用Latch：

1，修改select查詢避免全表掃描，避免Latch長期被佔用。

2，加查詢條件，適當的加索引，讓select執行更快，也避免一個select鎖的資料更少。

3，適當加大buffer pool 及 buffer pool instance，每個buffer pool都有自己獨立的Latch，避免latch競爭。

4.避免長時間回滾

[ERROR] [MY-012872] [InnoDB] [FATAL] Semaphore wait has lasted > 600 seconds

相關文章