GTID環境中手動修復主從故障一例(Error 1146)

aaron8219發表於2018-07-17
 
Preface
 
    In my last test of pt-heartbeat,both of master and slave were out of disk.And the mysql client was hang.In order to resolve the issue,I`ve tryed to fix the replicaiton environment without using mysqldump to reconfigure the slave.Let`s see the details.
 
Procedure
 
I dropped test tables in database “sysbench” to release the disk space on master.
 1 [root@zlm2 08:56:51 /data/mysql/mysql3306/logs]
 2 #sysbench oltp_read_write.lua --mysql-host=192.168.1.101 --mysql-port=3306 --mysql-user=zlm --mysql-password=zlmzlm --mysql-db=sysbench --tables=1 --table-size=10000000 --mysql-storage-engine=innodb cleanup
 3 sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)
 4 
 5 Dropping table `sbtest1`...
 6 
 7 (zlm@192.168.1.101 3306)[(none)]>use sysbench;
 8 Reading table information for completion of table and column names
 9 You can turn off this feature to get a quicker startup with -A
10 
11 Database changed
12 (zlm@192.168.1.101 3306)[sysbench]>show tables;
13 +--------------------+
14 | Tables_in_sysbench |
15 +--------------------+
16 | hb                 |
17 | sbtest2            |
18 | sbtest3            |
19 | sbtest4            |
20 | sbtest5            |
21 +--------------------+
22 5 rows in set (0.00 sec)
23 
24 //Only sbtest1 was deleted.It`s not enough.
25 
26 [root@zlm2 08:59:04 ~/sysbench-1.0/src/lua]
27 #sysbench oltp_read_write.lua --mysql-host=192.168.1.101 --mysql-port=3306 --mysql-user=zlm --mysql-password=zlmzlm --mysql-db=sysbench --tables=5 --table-size=10000000 --mysql-storage-engine=innodb cleanup
28 sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)
29 
30 Dropping table `sbtest1`...
31 Dropping table `sbtest2`...
32 Dropping table `sbtest3`...
33 Dropping table `sbtest4`...
34 Dropping table `sbtest5`...
35 
36 [root@zlm2 08:59:09 ~/sysbench-1.0/src/lua]
37 #df -h
38 Filesystem               Size  Used Avail Use% Mounted on
39 /dev/mapper/centos-root  8.4G  6.9G  1.5G  83% / //I`d got 27% free space.
40 devtmpfs                 488M     0  488M   0% /dev
41 tmpfs                    497M     0  497M   0% /dev/shm
42 tmpfs                    497M  6.6M  491M   2% /run
43 tmpfs                    497M     0  497M   0% /sys/fs/cgroup
44 /dev/sda1                497M  118M  379M  24% /boot
45 none                      87G   80G  6.6G  93% /vagrant
46 
47 (zlm@192.168.1.101 3306)[(none)]>drop database sysbench;
48 Query OK, 1 row affected (0.04 sec)
49 
50 //Further more,I dropped the "sysbench".

 

The slave hung still and disk space was full.
 1 [root@zlm3 08:55:17 ~]
 2 #df -h
 3 Filesystem               Size  Used Avail Use% Mounted on
 4 /dev/mapper/centos-root  8.4G  8.4G   20K 100% /
 5 devtmpfs                 488M     0  488M   0% /dev
 6 tmpfs                    497M     0  497M   0% /dev/shm
 7 tmpfs                    497M  6.5M  491M   2% /run
 8 tmpfs                    497M     0  497M   0% /sys/fs/cgroup
 9 /dev/sda1                497M  118M  379M  24% /boot
10 none                      87G   80G  6.6G  93% /vagrant
11 
12 (zlm@192.168.1.102 3306)[(none)]>show slave statusG
13 ^C^C -- query aborted
14 
15 
16 ^Z
17 [6]+  Stopped                 mysql
18 
19 [root@zlm3 08:55:58 ~]
20 #pkill mysqld
21 
22 [root@zlm3 08:56:04 ~]
23 #./mysqld.sh 
24 
25 [root@zlm3 08:56:07 ~]
26 #mysql
27 ERROR 2003 (HY000): Can`t connect to MySQL server on `192.168.1.102` (111)
28 
29 [root@zlm3 09:00:35 ~]
30 #cd /data/mysql/mysql3306/data
31 
32 [root@zlm3 09:00:46 /data/mysql/mysql3306/data]
33 #cat error.log |tail -n 30
34 2018-07-19T08:57:02.581937+01:00 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 17039436409
35 2018-07-19T08:57:02.581958+01:00 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 17039436418
36 2018-07-19T08:57:02.581963+01:00 0 [Note] InnoDB: Database was not shutdown normally!
37 2018-07-19T08:57:02.581965+01:00 0 [Note] InnoDB: Starting crash recovery.
38 2018-07-19T08:57:02.696292+01:00 0 [Note] InnoDB: Transaction 31700139 was in the XA prepared state.
39 2018-07-19T08:57:02.700688+01:00 0 [Note] InnoDB: Transaction 31700139 was in the XA prepared state.
40 2018-07-19T08:57:02.700814+01:00 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo
41 2018-07-19T08:57:02.700821+01:00 0 [Note] InnoDB: Trx id counter is 31700480
42 2018-07-19T08:57:02.701719+01:00 0 [Note] InnoDB: Last MySQL binlog file position 0 99139927, file name mysql-bin.000016
43 2018-07-19T08:57:02.805965+01:00 0 [Note] InnoDB: Ignoring tablespace `zlm`.`sbtest2` because the DISCARD flag is set .
44 2018-07-19T08:57:02.806462+01:00 0 [Note] InnoDB: Creating shared tablespace for temporary tables
45 2018-07-19T08:57:02.807316+01:00 0 [Note] InnoDB: Setting file `./ibtmp1` size to 12 MB. Physically writing the file full; Please wait ...
46 2018-07-19T08:57:02.807568+01:00 0 [Note] InnoDB: Starting in background the rollback of uncommitted transactions
47 2018-07-19T08:57:02.807594+01:00 0 [Note] InnoDB: Rollback of non-prepared transactions completed
48 2018-07-19T08:57:02.871396+01:00 0 [Warning] InnoDB: Retry attempts for writing partial data failed.
49 2018-07-19T08:57:02.871423+01:00 0 [ERROR] InnoDB: Write to file ./ibtmp1failed at offset 11534336, 1048576 bytes should have been written, only 0 were written. Operating system error number 28. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.
50 2018-07-19T08:57:02.871441+01:00 0 [ERROR] InnoDB: Error number 28 means `No space left on device`
51 2018-07-19T08:57:02.871446+01:00 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
52 2018-07-19T08:57:02.871451+01:00 0 [ERROR] InnoDB: Could not set the file size of `./ibtmp1`. Probably out of disk space
53 2018-07-19T08:57:02.871456+01:00 0 [ERROR] InnoDB: Unable to create the shared innodb_temporary
54 2018-07-19T08:57:02.871459+01:00 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
55 2018-07-19T08:57:03.273011+01:00 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
56 2018-07-19T08:57:03.273029+01:00 0 [ERROR] Plugin `InnoDB` init function returned error.
57 2018-07-19T08:57:03.273033+01:00 0 [ERROR] Plugin `InnoDB` registration as a STORAGE ENGINE failed.
58 2018-07-19T08:57:03.273037+01:00 0 [ERROR] Failed to initialize builtin plugins.
59 2018-07-19T08:57:03.273040+01:00 0 [ERROR] Aborting
60 
61 2018-07-19T08:57:03.273046+01:00 0 [Note] Binlog end
62 2018-07-19T08:57:03.273389+01:00 0 [Note] mysqld: Shutdown complete
63 
64 //The mysqld process could not run again because of no free disk space.

 

I decided to drop all the binlogs on slave to release the disk space.
 1 [root@zlm3 09:01:34 /data/mysql/mysql3306]
 2 #cd logs
 3 
 4 [root@zlm3 09:01:35 /data/mysql/mysql3306/logs]
 5 #ls -l
 6 total 2691436
 7 -rw-r----- 1 mysql mysql   4333745 Jul 13 11:07 mysql-bin.000001
 8 -rw-r----- 1 mysql mysql       653 Jul 16 09:28 mysql-bin.000002
 9 -rw-r----- 1 mysql mysql 268435663 Jul 16 10:03 mysql-bin.000003
10 -rw-r----- 1 mysql mysql 268435798 Jul 16 10:22 mysql-bin.000004
11 -rw-r----- 1 mysql mysql 268435798 Jul 16 10:32 mysql-bin.000005
12 -rw-r----- 1 mysql mysql 268435798 Jul 16 10:48 mysql-bin.000006
13 -rw-r----- 1 mysql mysql 268435798 Jul 16 11:29 mysql-bin.000007
14 -rw-r----- 1 mysql mysql  35214498 Jul 16 11:59 mysql-bin.000008
15 -rw-r----- 1 mysql mysql       249 Jul 16 11:59 mysql-bin.000009
16 -rw-r----- 1 mysql mysql 199908536 Jul 18 11:27 mysql-bin.000010
17 -rw-r----- 1 mysql mysql       249 Jul 18 11:27 mysql-bin.000011
18 -rw-r----- 1 mysql mysql 268474318 Jul 19 04:20 mysql-bin.000012
19 -rw-r----- 1 mysql mysql 268817026 Jul 19 04:24 mysql-bin.000013
20 -rw-r----- 1 mysql mysql 268802986 Jul 19 04:28 mysql-bin.000014
21 -rw-r----- 1 mysql mysql 268826152 Jul 19 04:32 mysql-bin.000015
22 -rw-r----- 1 mysql mysql  99422208 Jul 19 08:56 mysql-bin.000016
23 -rw-r----- 1 mysql mysql       704 Jul 19 04:32 mysql-bin.index
24 
25 [root@zlm3 09:01:36 /data/mysql/mysql3306/logs]
26 #rm -f *
27 
28 [root@zlm3 09:01:38 /data/mysql/mysql3306/logs]
29 #ls -l
30 total 0
31 
32 [root@zlm3 09:01:48 ~]
33 #df -h
34 Filesystem               Size  Used Avail Use% Mounted on
35 /dev/mapper/centos-root  8.4G  4.5G  4.0G  53% / //The free disk space had been reduced to 47%.
36 devtmpfs                 488M     0  488M   0% /dev
37 tmpfs                    497M     0  497M   0% /dev/shm
38 tmpfs                    497M  6.5M  491M   2% /run
39 tmpfs                    497M     0  497M   0% /sys/fs/cgroup
40 /dev/sda1                497M  118M  379M  24% /boot
41 none                      87G   80G  6.6G  93% /vagrant

 

Ran the mysqld again and dropped the database “sysbench” on slave.
 1 [root@zlm3 09:01:42 /data/mysql/mysql3306/logs]
 2 #sh /root/mysqld.sh
 3 
 4 [root@zlm3 09:01:51 /data/mysql/mysql3306/logs]
 5 #ps aux|grep mysqld
 6 mysql     4346  7.0 17.8 1109984 181576 pts/0  Sl   09:01   0:00 mysqld --defaults-file=/data/mysql/mysql3306/my.cnf
 7 root      4380  0.0  0.0 112640   960 pts/0    R+   09:01   0:00 grep --color=auto mysqld
 8 
 9 [root@zlm3 09:01:55 /data/mysql/mysql3306/logs]
10 #mysql
11 Welcome to the MySQL monitor.  Commands end with ; or g.
12 Your MySQL connection id is 2
13 Server version: 5.7.21-log MySQL Community Server (GPL)
14 
15 Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
16 
17 Oracle is a registered trademark of Oracle Corporation and/or its
18 affiliates. Other names may be trademarks of their respective
19 owners.
20 
21 Type `help;` or `h` for help. Type `c` to clear the current input statement.
22 
23 (zlm@192.168.1.102 3306)[(none)]>drop database sysbench;
24 Query OK, 6 rows affected (0.11 sec)

 

Started the replication threads of slave.

 1 (zlm@192.168.1.102 3306)[(none)]>start slave;
 2 Query OK, 0 rows affected (0.00 sec)
 3 
 4 (zlm@192.168.1.102 3306)[(none)]>show slave statusG
 5 *************************** 1. row ***************************
 6                Slave_IO_State: Waiting for master to send event
 7                   Master_Host: 192.168.1.101
 8                   Master_User: repl
 9                   Master_Port: 3306
10                 Connect_Retry: 60
11               Master_Log_File: mysql-bin.000019
12           Read_Master_Log_Pos: 155765801
13                Relay_Log_File: relay-bin.000068
14                 Relay_Log_Pos: 398
15         Relay_Master_Log_File: mysql-bin.000019
16              Slave_IO_Running: Yes
17             Slave_SQL_Running: No
18               Replicate_Do_DB: 
19           Replicate_Ignore_DB: 
20            Replicate_Do_Table: 
21        Replicate_Ignore_Table: 
22       Replicate_Wild_Do_Table: 
23   Replicate_Wild_Ignore_Table: 
24                    Last_Errno: 1146
25                    Last_Error: Error executing row event: `Table `sysbench.sbtest1` doesn`t exist`
26                  Skip_Counter: 0
27           Exec_Master_Log_Pos: 99148246
28               Relay_Log_Space: 56618146
29               Until_Condition: None
30                Until_Log_File: 
31                 Until_Log_Pos: 0
32            Master_SSL_Allowed: No
33            Master_SSL_CA_File: 
34            Master_SSL_CA_Path: 
35               Master_SSL_Cert: 
36             Master_SSL_Cipher: 
37                Master_SSL_Key: 
38         Seconds_Behind_Master: NULL
39 Master_SSL_Verify_Server_Cert: No
40                 Last_IO_Errno: 0
41                 Last_IO_Error: 
42                Last_SQL_Errno: 1146
43                Last_SQL_Error: Error executing row event: `Table `sysbench.sbtest1` doesn`t exist` //Since the database had been droppted.This error was notable.
44   Replicate_Ignore_Server_Ids: 
45              Master_Server_Id: 1013306
46                   Master_UUID: 1b7181ee-6eaf-11e8-998e-080027de0e0e
47              Master_Info_File: mysql.slave_master_info
48                     SQL_Delay: 0
49           SQL_Remaining_Delay: NULL
50       Slave_SQL_Running_State: 
51            Master_Retry_Count: 86400
52                   Master_Bind: 
53       Last_IO_Error_Timestamp: 
54      Last_SQL_Error_Timestamp: 180719 09:02:07
55                Master_SSL_Crl: 
56            Master_SSL_Crlpath: 
57            Retrieved_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:3714549-3730021 //It was stuck on transaction 3714549(which contained error).
58             Executed_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3714548,
59 5c77c31b-4add-11e8-81e2-080027de0e0e:1-168
60                 Auto_Position: 1
61          Replicate_Rewrite_DB: 
62                  Channel_Name: 
63            Master_TLS_Version: 
64 1 row in set (0.00 sec)
65 
66 [root@zlm3 09:10:42 ~]
67 #perror 1146
68 MySQL error code 1146 (ER_NO_SUCH_TABLE): Table `%-.192s.%-.192s` doesn`t exist
69 
70 //Error 1146 indicated the absence of table "sbtest1" in "sysbench" database.
71 //Obviously,the slave was replaying the operations relevant to this table on master.The table even the database had been dropped.
72 //How could I do next step?Do I have to generate a new mysqldump file and reconfigure the slave again?
73 //There`s One thing I`m rather sure that there were no other transactions generated in the whole course except the operations on "sysbench" database.
74 //Since I`d drop "sysbentch" database on both master and slave.Maybe I can fix the issue easily.

 

Checked the Executed_Gtid_Set on master.

1 (zlm@192.168.1.101 3306)[(none)]>show master status;
2 +------------------+-----------+--------------+------------------+------------------------------------------------+
3 | File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                              |
4 +------------------+-----------+--------------+------------------+------------------------------------------------+
5 | mysql-bin.000019 | 155765801 |              |                  | 1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3730021 |
6 +------------------+-----------+--------------+------------------+------------------------------------------------+
7 1 row in set (0.00 sec)
8 
9 //The executed gtid was upto "3730021".

 

Tryed to fix the replica of master.

  1 (zlm@192.168.1.102 3306)[(none)]>show slave statusG
  2 *************************** 1. row ***************************
  3                Slave_IO_State: Waiting for master to send event
  4                   Master_Host: 192.168.1.101
  5                   Master_User: repl
  6                   Master_Port: 3306
  7                 Connect_Retry: 60
  8               Master_Log_File: mysql-bin.000019
  9           Read_Master_Log_Pos: 155765801
 10                Relay_Log_File: relay-bin.000068
 11                 Relay_Log_Pos: 398
 12         Relay_Master_Log_File: mysql-bin.000019
 13              Slave_IO_Running: Yes
 14             Slave_SQL_Running: No
 15               Replicate_Do_DB: 
 16           Replicate_Ignore_DB: 
 17            Replicate_Do_Table: 
 18        Replicate_Ignore_Table: 
 19       Replicate_Wild_Do_Table: 
 20   Replicate_Wild_Ignore_Table: 
 21                    Last_Errno: 1146
 22                    Last_Error: Error executing row event: `Table `sysbench.sbtest1` doesn`t exist`
 23                  Skip_Counter: 0
 24           Exec_Master_Log_Pos: 99148246
 25               Relay_Log_Space: 56618627
 26               Until_Condition: None
 27                Until_Log_File: 
 28                 Until_Log_Pos: 0
 29            Master_SSL_Allowed: No
 30            Master_SSL_CA_File: 
 31            Master_SSL_CA_Path: 
 32               Master_SSL_Cert: 
 33             Master_SSL_Cipher: 
 34                Master_SSL_Key: 
 35         Seconds_Behind_Master: NULL
 36 Master_SSL_Verify_Server_Cert: No
 37                 Last_IO_Errno: 0
 38                 Last_IO_Error: 
 39                Last_SQL_Errno: 1146
 40                Last_SQL_Error: Error executing row event: `Table `sysbench.sbtest1` doesn`t exist`
 41   Replicate_Ignore_Server_Ids: 
 42              Master_Server_Id: 1013306
 43                   Master_UUID: 1b7181ee-6eaf-11e8-998e-080027de0e0e
 44              Master_Info_File: mysql.slave_master_info
 45                     SQL_Delay: 0
 46           SQL_Remaining_Delay: NULL
 47       Slave_SQL_Running_State: 
 48            Master_Retry_Count: 86400
 49                   Master_Bind: 
 50       Last_IO_Error_Timestamp: 
 51      Last_SQL_Error_Timestamp: 180719 09:02:57
 52                Master_SSL_Crl: 
 53            Master_SSL_Crlpath: 
 54            Retrieved_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:3714549-3730021
 55             Executed_Gtid_Set: 
 56                 Auto_Position: 1
 57          Replicate_Rewrite_DB: 
 58                  Channel_Name: 
 59            Master_TLS_Version: 
 60 1 row in set (0.00 sec)
 61 
 62 (zlm@192.168.1.102 3306)[(none)]>reset master;
 63 Query OK, 0 rows affected (0.02 sec)
 64 
 65 (zlm@192.168.1.102 3306)[(none)]>set @@global.gtid_purged=`1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3730021`;
 66 Query OK, 0 rows affected (0.00 sec)
 67 
 68 //On account of surely knowing there were no other transactions at all.I set the "gtid_purged" variable to the value of "gtid_executed" on master.
 69 //It means I guised that all the transactions generated on master had been replayed on slave already.The slave could retrieve new GTID at the moment.
 70 
 71 (zlm@192.168.1.102 3306)[(none)]>start slave sql_thread;
 72 Query OK, 0 rows affected (0.02 sec)
 73 
 74 (zlm@192.168.1.102 3306)[(none)]>show slave statusG
 75 *************************** 1. row ***************************
 76                Slave_IO_State: Waiting for master to send event
 77                   Master_Host: 192.168.1.101
 78                   Master_User: repl
 79                   Master_Port: 3306
 80                 Connect_Retry: 60
 81               Master_Log_File: mysql-bin.000019
 82           Read_Master_Log_Pos: 155765801
 83                Relay_Log_File: relay-bin.000069
 84                 Relay_Log_Pos: 438
 85         Relay_Master_Log_File: mysql-bin.000019
 86              Slave_IO_Running: Yes
 87             Slave_SQL_Running: Yes //The sql_thread became "Yes".
 88               Replicate_Do_DB: 
 89           Replicate_Ignore_DB: 
 90            Replicate_Do_Table: 
 91        Replicate_Ignore_Table: 
 92       Replicate_Wild_Do_Table: 
 93   Replicate_Wild_Ignore_Table: 
 94                    Last_Errno: 0
 95                    Last_Error: 
 96                  Skip_Counter: 0
 97           Exec_Master_Log_Pos: 155765801
 98               Relay_Log_Space: 56618434
 99               Until_Condition: None
100                Until_Log_File: 
101                 Until_Log_Pos: 0
102            Master_SSL_Allowed: No
103            Master_SSL_CA_File: 
104            Master_SSL_CA_Path: 
105               Master_SSL_Cert: 
106             Master_SSL_Cipher: 
107                Master_SSL_Key: 
108         Seconds_Behind_Master: 0
109 Master_SSL_Verify_Server_Cert: No
110                 Last_IO_Errno: 0
111                 Last_IO_Error: 
112                Last_SQL_Errno: 0
113                Last_SQL_Error: 
114   Replicate_Ignore_Server_Ids: 
115              Master_Server_Id: 1013306
116                   Master_UUID: 1b7181ee-6eaf-11e8-998e-080027de0e0e
117              Master_Info_File: mysql.slave_master_info
118                     SQL_Delay: 0
119           SQL_Remaining_Delay: NULL
120       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
121            Master_Retry_Count: 86400
122                   Master_Bind: 
123       Last_IO_Error_Timestamp: 
124      Last_SQL_Error_Timestamp: 
125                Master_SSL_Crl: 
126            Master_SSL_Crlpath: 
127            Retrieved_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:3714549-3730021
128             Executed_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3730021 //The slave had skipped those GTID(which contained error 1146) of master and waited for newer GTID.The replica had been fixed up.
129                 Auto_Position: 1
130          Replicate_Rewrite_DB: 
131                  Channel_Name: 
132            Master_TLS_Version: 
133 1 row in set (0.00 sec)

 

 Summary

  • The variable “gtid_purged” cannot be set if “gtid_executed” is not empty.
  • Caution,”reset master” can only be used on slave.Keep in mind that don`t do it on master anytime.
  • This case can be followed only in test environment `cause you cannot guarantee whether all the transactions are really replayed on slave.

 

相關文章