故障分析 | 是誰偷走了我的 IO

愛可生雲資料庫發表於2022-05-03

作者:孫橋

愛可生DBA團隊成員,主要負責MySQL和DMP平臺日常的維持工作,對資料庫自動化運維存有濃厚興趣。

本文來源:原創投稿

*愛可生開源社群出品,原創內容未經授權不得隨意使用,轉載請聯絡小編並註明來源。


前言

對於剛接觸 MySQL 資料庫不久的小夥伴來說,或多或少會遇到一些棘手的小問題;比如在資料庫磁碟無損壞的情況下,我們經常會遇到資料庫伺服器的磁碟IO壓力很大,但由於伺服器上只有 MySQL 程式,我們第一想法肯定是某個連線使用者在連到資料庫後,揹著我們在瞎搞著什麼;然而直接通過 show processlist 命令,看到的結果不一定準確。下面模擬一個簡單場景,讓我們確切地定位問題所在。

場景模擬,製造 IO 壓力

## 生成一張大表,用於測試[root@tidb-master ~]# /usr/share/sysbench/oltp_write_only.lua --mysql-host=10.186.61.36 --mysql-user=sun --mysql-password='xx' --mysql-port=3306 --mysql-db=testdb --tables=1 --table_size=15000000 --auto_inc=off --report-interval=2 --db-ps-mode=disable --time=100000 --threads=1 prepare
sysbench 1.0.17 (using system LuaJIT 2.0.4)

Creating table 'sbtest1'...
Inserting 15000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...## 對生成的表進行寫操作
[root@tidb-master ~]# /usr/share/sysbench/oltp_write_only.lua --mysql-host=10.186.61.36 --mysql-user=sun --mysql-password='xx' --mysql-port=3306 --mysql-db=testdb --tables=1 --table_size=15000000 --auto_inc=off --report-interval=2 --db-ps-mode=disable --time=100000 --threads=1 run
sysbench 1.0.17 (using system LuaJIT 2.0.4)

Running the test with following options:
Number of threads: 1
Report intermediate results every 2 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 2s ] thds: 1 tps: 24.47 qps: 148.33 (r/w/o: 0.00/98.89/49.44) lat (ms,95%): 86.00 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 1 tps: 53.02 qps: 318.62 (r/w/o: 0.00/212.58/106.04) lat (ms,95%): 48.34 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 1 tps: 60.50 qps: 363.50 (r/w/o: 0.00/242.50/121.00) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 1 tps: 65.50 qps: 391.02 (r/w/o: 0.00/260.01/131.01) lat (ms,95%): 39.65 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 1 tps: 55.50 qps: 334.51 (r/w/o: 0.00/223.51/111.00) lat (ms,95%): 44.98 err/s: 0.00 reconn/s: 0.00
[ 12s ] thds: 1 tps: 57.00 qps: 340.99 (r/w/o: 0.00/227.00/114.00) lat (ms,95%): 38.25 err/s: 0.00 reconn/s: 0.00

問題分析

1、通過top命令發現io wait比較高,咦,What is going on?(假裝不知道我在壓測(●'◡'●))

top - 17:05:59 up 179 days,  7:05,  2 users,  load average: 0.84, 0.70, 0.93
Tasks: 181 total,   1 running, 180 sleeping,   0 stopped,   0 zombie
%Cpu0  :  1.3 us,  0.7 sy,  0.0 ni, 95.3 id,  2.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  6.9 us,  3.8 sy,  0.0 ni, 60.3 id, 29.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  2.4 us,  2.0 sy,  0.0 ni, 90.8 id,  4.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  1.7 us,  1.3 sy,  0.0 ni, 95.3 id,  1.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  1.4 us,  1.0 sy,  0.0 ni, 95.6 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  3.1 us,  1.4 sy,  0.0 ni, 81.4 id, 14.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.7 us,  1.0 sy,  0.0 ni, 96.6 id,  1.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  1.0 us,  0.3 sy,  0.0 ni, 96.6 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.3 us,  1.0 sy,  0.0 ni, 94.6 id,  4.1 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  2.4 us,  1.0 sy,  0.0 ni, 91.2 id,  5.4 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 14.7 us,  2.7 sy,  0.0 ni, 82.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  2.0 us,  1.0 sy,  0.0 ni, 96.0 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 12137024 total,   185984 free,  3987936 used,  7963104 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7150088 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                            
 8858 actiont+  20   0 7249940   1.1g  11892 S  30.0  9.3 210:38.75 mysqld                                                                                             
26447 tidb      20   0 4000364   1.0g   4740 S   5.3  9.0  10018:23 TiFlashMain                                                                                        
22761 actiont+  20   0 2526348  36188   9756 S   4.0  0.3   1197:43 ustats                                                                                             
29004 root      20   0 3784084 103676   2012 S   2.0  0.9   1802:29 beam.s

2、通過iostat命令進一步分析,我們看到磁碟在頻繁的寫操作,維持在26M/s左右(%uitl值比較大,一般來說,值大於75%是表明磁碟壓力比較大了)

[root@tidb-tikv01 ~]# iostat  -xm 1 
Linux 3.10.0-862.14.4.el7.x86_64 (tidb-tikv01)  04/22/2022      _x86_64_        (12 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.21    0.00    0.44    0.18    0.10   98.07

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.01     0.18    3.53   22.99     0.08     0.25    25.84     0.19    7.16    5.64    7.40   0.71   1.88
scd0              0.00     0.00    0.00    0.00     0.00     0.00     6.00     0.00    0.50    0.50    0.00   0.50   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.22    0.00    1.19    8.28    0.09   88.23

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00  328.00 1800.00     5.12    26.24    30.17     3.18    1.49    4.68    0.91   0.41  88.30
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.63    0.00    1.03    6.68    0.09   90.57

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00  352.00 1918.00     5.50    27.82    30.06     2.96    1.31    3.21    0.96   0.37  83.60
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.54    0.00    1.96    8.77    0.09   83.65

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00  300.00 1682.00     4.69    25.68    31.38     3.09    1.55    3.68    1.17   0.45  88.60
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

3、通過iotop命令可以看到與MySQL相關執行緒ID 22233的IO消耗為52.59%,讀為2.56M/s,同時執行緒ID 8879的寫為27.31M/s(一般來說,IO列值越大,說明IO壓力最有可能與其相關)

Total DISK READ :       5.29 M/s | Total DISK WRITE :      29.42 M/s
Actual DISK READ:       5.29 M/s | Actual DISK WRITE:      26.83 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                  
22233 be/4 actionte    2.56 M/s  473.07 K/s  0.00 % 52.59 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8879 be/4 actionte    0.00 B/s   27.31 M/s  0.00 % 17.76 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8887 be/4 actionte    0.00 B/s    0.00 B/s  0.00 % 10.10 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8907 be/4 actionte  116.33 K/s    0.00 B/s  0.00 %  6.27 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8930 be/4 actionte    7.76 K/s    0.00 B/s  0.00 %  3.45 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8910 be/4 actionte 1217.57 K/s    0.00 B/s  0.00 %  3.33 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8911 be/4 actionte  372.25 K/s    0.00 B/s  0.00 %  1.53 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 2946 be/4 actionte    0.00 B/s 1985.34 B/s  0.00 %  1.52 % ustats
 8909 be/4 actionte  418.78 K/s    0.00 B/s  0.00 %  1.08 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306

4、將上面獲取的系統執行緒ID帶入到MySQL中,我們便可以查到連線使用者sun一直在頻繁寫操作,現在我很懷疑你,但我先保持沉默

[root@tidb-tikv01 ~]# /opt/mysql/base/8.0.21/bin/mysql -S /opt/mysql/data/3306/mysqld.sock -pxx
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 15041
Server version: 8.0.21 MySQL Community Server - GPL

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from performance_schema.threads where THREAD_OS_ID=22233\G
*************************** 1. row ***************************
          THREAD_ID: 14880
               NAME: thread/sql/one_connection
               TYPE: FOREGROUND
     PROCESSLIST_ID: 14830           ### ID為148830連線在瘋狂的進行寫操作,show processlist可以看到正在進行的連線
   PROCESSLIST_USER: sun
   PROCESSLIST_HOST: 10.186.61.16
     PROCESSLIST_DB: testdb
PROCESSLIST_COMMAND: Query
   PROCESSLIST_TIME: 0
  PROCESSLIST_STATE: updating
   PROCESSLIST_INFO: UPDATE sbtest1 SET k=k+1 WHERE id=7899611
   PARENT_THREAD_ID: NULL
               ROLE: NULL
       INSTRUMENTED: YES
            HISTORY: YES
    CONNECTION_TYPE: TCP/IP
       THREAD_OS_ID: 22233          ### THREAD_OS_ID為linux系統中的執行緒ID
     RESOURCE_GROUP: USR_default
1 row in set (0.00 sec)

mysql> select * from performance_schema.threads where THREAD_OS_ID=8879\G
*************************** 1. row ***************************
          THREAD_ID: 21
               NAME: thread/innodb/page_flush_coordinator_thread ### MySQL自身刷髒資料執行緒,因為使用者的頻繁寫操作,導致該執行緒一直進行落盤操作
               TYPE: BACKGROUND
     PROCESSLIST_ID: NULL
   PROCESSLIST_USER: NULL
   PROCESSLIST_HOST: NULL
     PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: NULL
   PROCESSLIST_TIME: 264523
  PROCESSLIST_STATE: NULL
   PROCESSLIST_INFO: NULL
   PARENT_THREAD_ID: NULL
               ROLE: NULL
       INSTRUMENTED: YES
            HISTORY: YES
    CONNECTION_TYPE: NULL
       THREAD_OS_ID: 8879
     RESOURCE_GROUP: SYS_default
1 row in set (0.00 sec)

mysql> show processlist;
+-------+-----------------+--------------------+--------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| Id    | User            | Host               | db     | Command          | Time   | State                                                         | Info                                                                                                 |
+-------+-----------------+--------------------+--------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
|     5 | event_scheduler | localhost          | NULL   | Daemon           | 264544 | Waiting on empty queue                                        | NULL                                                                                                 |
|     8 | universe_op     | 127.0.0.1:38362    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|     9 | universe_op     | 127.0.0.1:38374    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    10 | universe_op     | 127.0.0.1:38376    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    11 | universe_op     | 127.0.0.1:38380    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    12 | universe_op     | 127.0.0.1:38378    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    13 | universe_op     | 127.0.0.1:38384    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    14 | universe_op     | 127.0.0.1:38382    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    15 | universe_op     | 10.186.61.45:50322 | NULL   | Binlog Dump GTID | 264542 | Master has sent all binlog to slave; waiting for more updates | NULL                                                                                                 |
|    17 | universe_op     | 127.0.0.1:38454    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    18 | universe_op     | 127.0.0.1:38456    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    19 | universe_op     | 127.0.0.1:38458    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    20 | universe_op     | 127.0.0.1:38460    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    21 | universe_op     | 127.0.0.1:38462    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    22 | universe_op     | 127.0.0.1:38464    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    23 | universe_op     | 127.0.0.1:38452    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    24 | universe_op     | 127.0.0.1:38466    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|   142 | universe_op     | 127.0.0.1:33390    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
| 14830 | sun             | 10.186.61.16:60730 | testdb | Query            |      0 | update                                                        | INSERT INTO sbtest1 (id, k, c, pad) VALUES (7493994, 6518751, '15795471662-35813531247-93886706048-2 |
| 15041 | root            | localhost          | NULL   | Query            |      0 | starting                                                      | show processlist                                                                                     |
+-------+-----------------+--------------------+--------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
20 rows in set (0.00 sec)

5、通過kill掉14830後,發現磁碟壓力果然over了,你這“小陽人”( ̄ε(# ̄)☆╰╮o( ̄皿 ̄///)

mysql> kill 14830;
Query OK, 0 rows affected (0.00 sec)

mysql> exit
Bye
Segmentation fault
[root@tidb-tikv01 ~]# iostat -xm 1
Linux 3.10.0-862.14.4.el7.x86_64 (tidb-tikv01)  04/22/2022      _x86_64_        (12 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.21    0.00    0.44    0.18    0.10   98.06

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.01     0.18    3.83   24.55     0.09     0.27    26.12     0.19    6.78    5.46    6.99   0.69   1.95
scd0              0.00     0.00    0.00    0.00     0.00     0.00     6.00     0.00    0.50    0.50    0.00   0.50   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.85    0.00    0.59    0.00    0.00   97.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00    0.00   61.00     0.00     0.46    15.52     0.01    0.20    0.00    0.20   0.18   1.10
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.36    0.00    0.84    0.08    0.00   95.71

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    10.00    1.00   53.00     0.02     0.48    18.81     2.87    0.22    1.00    0.21   1.31   7.10
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

備註說明:

生產環境可不要隨隨便便kill哦,出問題是要被鞭屍的呦~

總結

對於一些新入門的小可愛來說,以上方法可以循序漸進的幫助你定位到是哪些連線執行緒把你的伺服器IO折騰高的,可以讓你更加了解Linux系統與MySQL之間的聯動關係。老鳥嘛,emmm...

相關文章