故障分析 | MySQL 資料庫升級後,資料庫怎麼卡住了

愛可生雲資料庫發表於2022-06-09

作者:王順

愛可生 DBA 團隊成員,在公司負責專案中處理資料庫問題,喜歡學習技術,鑽研技術問題。

本文來源:原創投稿

*愛可生開源社群出品,原創內容未經授權不得隨意使用,轉載請聯絡小編並註明來源。


問題背景:

第一天晚上 MySQL 資料庫升級,版本從5.6.22到5.7.32,升級過程很順利,業務驗證正常。

兩天後業務人員反饋資料庫卡住了,sql 執行下去資料庫沒反應。

現象:

連上資料庫看看發現了什麼……

mysql> show processlist;
+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| Id    | User        | Host                | db    | Command     | Time   | State                                                         | Info                                                                                                 |
+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
|     1 | system user |                     | NULL  | Connect     | 138205 | Waiting for master to send event                              | NULL                                                                                                 |
|     2 | system user |                     | NULL  | Connect     | 137998 | Slave has read all relay log; waiting for more updates        | NULL                                                                                                 |
|  6891 | user05      | %:37029             | db01  | Query       | 110410 | Sending data                                                  | INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (
        deal_date                           |
| 11528 | user05      | %:44043             | db01  | Query       |  91406 | Sending data                                                  | INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (
        deal_date                           |
| 12126 | user02      | %:44966             | db02  | Query       |  89331 | Waiting for table metadata lock                               | ALTER TABLE db02.TB_MID_PAR_USER_PDT_ORDER_DAY TRUNCATE PARTITION P20210810                        |
| 20896 | repl        | 172.24.64.88:54437  | NULL  | Binlog Dump |  53695 | Master has sent all binlog to slave; waiting for more updates   | NULL                                                                                               |
| 26520 | user03      | 172.24.64.197:45044 | db03  | Sleep       |     80 |                                                               | NULL                                                                                               |
| 26886 | user04      | 172.24.64.73:39607  | db04  | Execute     |  29567 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_develop                                                                 |
| 26887 | user04      | 172.24.64.73:39608  | db04  | Execute     |  29565 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_accid_day                                                               |
......                                                                                                                                                          |
| 28069 | user02      | %:41400             | db02  | Query       |  24872 | Waiting for table metadata lock                               | ALTER TABLE db02.TB_MID_PAR_USER_PDT_ORDER_DAY ADD PARTITION (PARTITION P20210811 VALUES IN ('202108 |
......                                                                                                                                     |
| 33666 | user05      | 10.1.16.10:47068    | db04  | Sleep       |   2275 |                                                               | NULL                                                                                                 |
| 33788 | user04      | 172.24.64.73:50227  | db04  | Execute     |   1833 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_accid_day                                                                 |
| 33791 | user04      | 172.24.64.73:50236  | db04  | Execute     |   1826 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_develop                                                                   |
......                                                                                                                                      |
| 34095 | user05      | 10.1.16.10:54601    | db04  | Query       |    710 | Waiting for table metadata lock                               | SELECT * FROM `tb_par_user_accid_day` LIMIT 0, 1000                                                  |
......                                                     | show processlist                                                                                     |
+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
70 rows in set (0.00 sec)

通過輸出資訊可以看到,有多個 session 出現 Waiting for table metadata lock ,檢視執行最久的兩個 sq 分別是 i d為 6891 和 11528 的兩個 insert 語句。

insert 涉及的 schema 是db01 ,出現 Waiting for table metadata lock 是 db02/db04 。

很奇怪,執行語句來自不同的 schema ,是誰持有 MDL 鎖堵住後面的語句?想不通……

Waiting for table metadata lock 看到 MDL 鎖等待資訊,需要檢視一下原因。

mysql> select * from performance_schema.metadata_locks;
Empty set (0.00 sec)

返回空,沒有查到 MDL 鎖相關資訊,檢查環境發現 MDL 相關 instrument 並沒有開啟。

備註:

臨時生效

修改performance_schema.setup_instrume nts表,但例項重啟後,又會恢復為預設值。

UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME = 'wait/lock/metadata/sql/mdl';

永久生效

在配置檔案中設定

[mysqld]

performance-schema-instrument='wait/lock/metadata/sql/mdl=ON'

繼續檢查大事務

經過多次執行 select * from INNODB_TRX 檢視正在執行的事務,可以看到兩個正在執行的 insert 語句一直執行未提交。

mysql> select * from INNODB_TRX\G
*************************** 1. row ***************************
                    trx_id: 328588354171408
                 trx_state: RUNNING
               trx_started: 2021-08-11 09:06:34
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 1755570
       trx_mysql_thread_id: 11528
                 trx_query: INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (
        deal_date                    
      ……忽略欄位資訊          
       ,copartner_code          
       trx_operation_state: NULL
         trx_tables_in_use: 15
         trx_tables_locked: 3442
          trx_lock_structs: 1755570
     trx_lock_memory_bytes: 159899856
           trx_rows_locked: 91807791
         trx_rows_modified: 0
   trx_concurrency_tickets: 3903
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
*************************** 2. row ***************************
                    trx_id: 328588354177792
                 trx_state: RUNNING
               trx_started: 2021-08-11 03:49:50
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 1856514
       trx_mysql_thread_id: 6891
                 trx_query: INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (
        deal_date                   
      ……忽略欄位資訊             
       ,copartner_code       
       trx_operation_state: fetching rows
         trx_tables_in_use: 15
         trx_tables_locked: 3597
          trx_lock_structs: 1856514
     trx_lock_memory_bytes: 169681104
           trx_rows_locked: 101871982
         trx_rows_modified: 0
   trx_concurrency_tickets: 2338
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
2 rows in set, 2 warnings (0.06 sec)

看看完整的 sql 語句是啥?

執行 show full processlist ;看一下完整的執行語句。前面的排查只執行了 show processlist 沒有看到完成的語句。

mysql> show full processlist;

| Id    | User        | Host                | db                 | Command     | Time   | State                                                         | Info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
|     1 | system user |                     | NULL               | Connect     | 141084 | Waiting for master to send event                              | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|     2 | system user |                     | NULL               | Connect     | 140877 | Slave has read all relay log; waiting for more updates        | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|  6891 | user05      | %:37029             | db01               | Query       | 113289 | Sending data                                                  | INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (                   
       ……忽略欄位資訊                                                                   
     )
    SELECT                      
       ……忽略欄位資訊                                                                
   FROM
   (
   SELECT DISTINCT
        t1.cust_no           as cust_no              
      ……忽略欄位資訊  ID                      
       ,t2.PRODPRCname             as PRODPRCname                     
   FROM (select * from db02.tb_mid_par_user_day where deal_date=20210810) t1
      left join (select * from db02.TB_MID_PAR_USER_PDT_ORDER_DAY where deal_date=20210810 and effdate<=SYSDATE() and expdate>=SYSDATE())t2
            on t1.user_no=t2.user_no
        left join db04.TB_PAR_USER_ACCID_DAY t3
            on t1.user_no=t3.id_no
        left join db04.TB_PAR_USER_EXTEND_INFO t4
            on t1.user_no=t4.user_no
        left join (select * from db04.TB_PDT_PRIVATENETWORK_DAY where deal_date=20210810) t5
            on t4.USER_ATTR2=t5.net_id
        left join (select * from db04.TB_PAR_USERCONTRACT_REL_DAY where deal_date=20210810 and EXPDATE>=SYSDATE() and EFFDATE<=SYSDATE() and STATUS='1' and BUSI_TYPE='1' and TENANCY_CODE='008' GROUP BY ID_NO,CONTRACT_NO ) t6
            on t1.user_no=t6.id_no
        left join (select * from db04.TB_PAR_USER_CONTACT_INFO where deal_date=20210810) t7
            on t1.user_no=t7.user_no
        left join (SELECT * from db04.TB_FIN_CONGRPUSER_RELA_DAY where deal_date=20210810) t8
            on t1.user_no=t8.id_no
        left join (select * from db04.TB_PAR_ACCT_DAY where deal_date=20210810 and account_type='5') t9
            on t8.contract_no=t9.contract_no
        left join (SELECT * from db04.TB_CDE_PAR_RUNCODE where tenancy_code=008) t10
            on t1.USER_STAT_CODE=t10.run_code
        left join db04.TB_PAR_USER_DEVELOP t11
            on t11.user_no=t1.user_no
        left join (SELECT * from db04.TB_RES_CHNL_ORG_INFO where deal_date=20210810) t12
            on t11.develop_dep=t12.group_id
        left join db04.TB_RES_LOGIN_NO t13
            on t11.develop_name=t13.system_user_code
        left join db04.TB_RES_STAFF_INFO t14
            on t13.system_user_id=t14.STAFF_ID
        /*group by 
        ……忽略欄位資訊                             
  ) t
   /*GROUP BY
       ……忽略欄位資訊
       
| 11528 | user05        | %:44043             | db01               | Query       |  94285 | Sending data                                                  | INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (                    
       ……忽略欄位資訊                                                                            
     )
    SELECT                  
       ……忽略欄位資訊                                                                       
   FROM
   (
   SELECT DISTINCT
                t1.cust_no           as cust_no                  
      ……忽略欄位資訊              
             ,t2.PRODPRCname             as PRODPRCname                    
   FROM (select * from db02.tb_mid_par_user_day where deal_date=20210810) t1
      left join (select * from db02.TB_MID_PAR_USER_PDT_ORDER_DAY where deal_date=20210810 and effdate<=SYSDATE() and expdate>=SYSDATE())t2
            on t1.user_no=t2.user_no
        left join db04.TB_PAR_USER_ACCID_DAY t3
            on t1.user_no=t3.id_no
        left join db04.TB_PAR_USER_EXTEND_INFO t4
            on t1.user_no=t4.user_no
        left join (select * from db04.TB_PDT_PRIVATENETWORK_DAY where deal_date=20210810) t5
            on t4.USER_ATTR2=t5.net_id
        left join (select * from db04.TB_PAR_USERCONTRACT_REL_DAY where deal_date=20210810 and EXPDATE>=SYSDATE() and EFFDATE<=SYSDATE() and STATUS='1' and BUSI_TYPE='1' and TENANCY_CODE='008' GROUP BY ID_NO,CONTRACT_NO ) t6
            on t1.user_no=t6.id_no
        left join (select * from db04.TB_PAR_USER_CONTACT_INFO where deal_date=20210810) t7
            on t1.user_no=t7.user_no
        left join (SELECT * from db04.TB_FIN_CONGRPUSER_RELA_DAY where deal_date=20210810) t8
            on t1.user_no=t8.id_no
        left join (select * from db04.TB_PAR_ACCT_DAY where deal_date=20210810 and account_type='5') t9
            on t8.contract_no=t9.contract_no
        left join (SELECT * from db04.TB_CDE_PAR_RUNCODE where tenancy_code=008) t10
            on t1.USER_STAT_CODE=t10.run_code
        left join db04.TB_PAR_USER_DEVELOP t11
            on t11.user_no=t1.user_no
        left join (SELECT * from db04.TB_RES_CHNL_ORG_INFO where deal_date=20210810) t12
            on t11.develop_dep=t12.group_id
        left join db04.TB_RES_LOGIN_NO t13
            on t11.develop_name=t13.system_user_code
        left join db04.TB_RES_STAFF_INFO t14
            on t13.system_user_id=t14.STAFF_ID

        /*group by 
        ……忽略欄位資訊                     
  ) t
   /*GROUP BY
        ……忽略欄位資訊            

通過完整的 sql 我們可以看到,inster 語句中出現了多表關連子查詢,查詢涉及的庫有 db02 和 db04 。

從業務人員那瞭解到這兩條 insert 是跑批任務,涉及的查詢語句在升級前20多分鐘能跑出資料,升級後1天多都沒出資料,這又是什麼原因?

先恢復業務

已確認兩個跑批任務可以先終止重新再跑,終止不會影響線上業務。

kill 掉兩個 insert,kill 過程很快,兩個語句的查詢語句並沒有跑出資料,insert 一直沒有開始執行。kill 執行成功後,MDL 鎖等待現象很快就消失了。

原因分析

根據前面業務人員的描述,升級前直接執行 insert 語句中的查詢語句20分鐘可以出資料,資料庫升級後查詢跑不出資料。測試環境升級後並沒有驗證這兩個跑批。

想對比同一條 sql 在 5.6 和 5.7 環境的執行計劃的區別,因 5.6 環境已經升級,無法進行比對。

以前遇到過同樣 5.6 升級 5.7 後,sql 執行慢的是因為派生表問題。

5.7 預設 derived_merge = on ,會將派生表合併,然後在執行 where 條件過濾,但是這樣會造成的現象是多個未經過濾的表直接合並,資料量過大 sql 執行速度變慢。

官方文件:https://dev.mysql.com/doc/ref...

解決方案

  1. 是在資料庫中關閉這個優化器配置
  2. 是改寫業務 sql

因需要儘快恢復業務使用和業務人員協商關閉 derived_merge 項,再次驗證跑批執行正常。

相關文章