MysqlMaster切換方案MHA的探索與測試結果

一、MHA的特點

MHA監控複製架構的主伺服器，一旦檢測到主伺服器故障，就會自動進行故障轉移。即使有些從伺服器沒有收到最新的relay log，MHA自動從最新的從伺服器上識別差異的relay log並把這些日誌應用到其他從伺服器上，因此所有的從伺服器保持一致性了。MHA通常在幾秒內完成故障轉移，9-12秒可以檢測出主伺服器故障，7-10秒內關閉故障的主伺服器以避免腦裂，幾秒中內應用差異的relay log到新的主伺服器上，整個過程可以在10-30s內完成。還可以設定優先順序指定其中的一臺slave作為master的候選人。由於MHA在slaves之間修復一致性，因此可以將任何slave變成新的master，而不會發生一致性的問題，從而導致複製失敗。

二、測試過程中需要關注的幾個問題：

1.切換過程會自動把read_only關閉

mysql> show variables like '%read_only%';

+---------------+-------+

| Variable_name | Value |

+---------------+-------+

| read_only | ON |

+---------------+-------+

1 row in set (0.00 sec)

mysql> show variables like '%read_only%';

+---------------+-------+

| Variable_name | Value |

+---------------+-------+

| read_only | OFF |

+---------------+-------+

1 row in set (0.00 sec)

2.切換之後需要刪除手工刪除/masterha/app1/app1.failover.complete，才能進行第二次測試

Thu Aug 29 14:24:15 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 14:24:15 2013 - [error][/usr/local/share/perl/5.14.2/MHA/MasterFailover.pm, ln298] Last failover was done at 2013/08/29 1

0:07:58. Current time is too early to do failover again. If you want to do failover, manually remove /masterha/app1/app1.failover.co

mplete and run this script again.

之前發生過切換，需要手工刪除/maste

Thu Aug 29 14:24:15 2013 - [error][/usr/local/share/perl/5.14.2/MHA/ManagerUtil.pm, ln178] Got ERROR: at /usr/local/bin/masterha_ma

nager line 65

3.一旦發生切換管理程式將會退出，無法進行再次測試，需將故障資料庫加入到MHA環境中來

4.原主節點重新加入到MHA時只能設定為slave,在

change master to MASTER_HOST='192.168.16.5', MASTER_USER='replicationuser',MASTER_PASSWORD='replicationuser',MASTER_LOG_FILE='mysql-bin.000004',MASTER_LOG_POS=106;

之前需要先 reset slave

5.關於ip地址的接管有幾種方式，這裡採用的是MHA自動呼叫ip別名的方式，好處是在能夠保證資料庫狀態與業務Ip 切換的一致性。啟動管理節點之後 vip會自動別名到當前主節點上，keepalived也只能做到對3306的健康檢查，但是做不到比如像MySQL複製中的slave-SQL、slave-IO程式的檢查，容易出現對切換的誤判。

eth0:1 Link encap:乙太網硬體地址 94:de:80:18:11:82

inet 地址:192.168.16.9 廣播:192.168.16.255 掩碼:255.255.255.0

UP BROADCAST RUNNING MULTICAST MTU:1500 躍點數:1

中斷:40 基本地址:0x8000 >

6.注意：二級從伺服器需要將log_slave_updates開啟

7.手工切換需要先定義好master_ip_online_change_script指令碼,不然只會切換mysql,Ip地址不會繫結上去，可以根據模板來配置該指令碼

8.透過設定no_master=1可以讓某一個節點永遠不成為新的主節點

三、MHA測試環境搭建

在前期搭建的8臺測試環境中，選擇4臺主機

192.168.16.5 Ubuntu 12.04.1 LTS \n \l

192.168.16.6 Ubuntu 10.04.3 LTS \n \l

192.168.16.7 Ubuntu 12.04.1 LTS \n \l

192.168.16.8 Ubuntu 12.04.1 LTS \n \l

編譯安裝mysql-5.1.48

3.1 mysql的安裝與配置步驟

tar -xvf mysql-5.1.48.tar.gz

cd mysql-5.1.48/

./configure --prefix=/usr/local/mysql --with-unix-socket-path=/usr/local/mysql/tmp/mysql.sock -localstatedir=/usr/local/mysql/mysql_data --enable-assembler --with-charset=gb2312 --with-mysqld-ldflags=-all-static -with-client-ldflags=-all-static --with-mysqld-user=mysql --with-pthread --enable-thread-safe-client --with-extra-charsets=utf8,gbk,gb2312 --with-plugins=partition,innobase,myisammrg,myisam

apt-get install libncurses5-dev

./configure --prefix=/usr/local/mysql --with-unix-socket-path=/usr/local/mysql/tmp/mysql.sock -localstatedir=/usr/local/mysql/mysql_data --enable-assembler --with-charset=gb2312 --with-mysqld-ldflags=-all-static -with-client-ldflags=-all-static --with-mysqld-user=mysql --with-pthread --enable-thread-safe-client --with-extra-charsets=utf8,gbk,gb2312 --with-plugins=partition,innobase,myisammrg,myisam

vi ./Makefile

make

./configure --prefix=/usr/local/mysql --with-unix-socket-path=/usr/local/mysql/tmp/mysql.sock -localstatedir=/usr/local/mysql/mysql_data --enable-assembler --with-charset=gb2312 --with-mysqld-ldflags=-all-static -with-client-ldflags="-all-static -ltinfo" --with-mysqld-user=mysql --with-pthread --enable-thread-safe-client --with-extra-charsets=utf8,gbk,gb2312 --with-plugins=partition,innobase,myisammrg,myisam

vi ./Makefile

make

make install

cp support-files/my-innodb-heavy-4G.cnf /etc/my.cnf

cp support-files/mysql.server.sh /usr/local/mysql/bin/

chmod 755 /usr/local/mysql/bin/mysql.server.sh

cd /usr/local/

cd mysql/

mkdir tmp

./bin/mysql_install_db --user=mysql

cd mysql

cd mysql_data/

cd mysql/

rm -rf mysql_data/

groupadd mysql

useradd -g mysql mysql

./bin/mysql_install_db --user=mysql

cd mysql-test/

cd ../mysql_data/

chown -R root .

chown -R mysql tmp/

chown -R mysql mysql_data/

chgrp -R mysql .

cp ./bin/mysql /usr/bin

cp ./bin/mysqldump /usr/bin

vi ./bin/mysql.server.sh

./bin/mysql.server.sh start

/usr/local/mysql/bin/mysqladmin -u root password 'sztf@yunwei'

mysql -uroot -p

3.2 建立複製，並組建MHA體系

我們先按照如下體系來構建，16.5作為主節點，16.6,16.7作為從節點，16.8作為監控和管理節點

16.5 master

16.6 slave

16.7 slave

16.8 manage,monitor

3.2.1) 確認主伺服器上my.cnf檔案的[mysqld]section包含log-bin選項和server-id，並啟動主伺服器：

3.2.2) 停止從伺服器，加入server-id分別=2,=3，然後啟動從伺服器：

3.2.3) 在3臺機器上建立複製賬號 mysql>grant replication slave on *.* to 'replicationuser'@'%' identified by 'replicationuser';

3.2.4) 建立複製關係

flush tables with read lock;

SHOW MASTER STATUS;

change master to MASTER_HOST='192.168.16.5', MASTER_USER='replicationuser',MASTER_PASSWORD='replicationuser',MASTER_LOG_FILE='mysql-bin.000027',MASTER_LOG_POS=543;

show master status;

show slave status;

unlock tables;

START SLAVE;

show processlist;

show slave hosts;

3.2.5) 驗證：此時主伺服器和從伺服器上的資料應該是一致的，在主伺服器上插入修改刪除資料都會更新到從伺服器上，建表，刪表等也是一樣的。

3.3 MHA安裝

## Install DBD::mysql if not installed

$ tar -zxf mha4mysql-node-0.54.tar.gz

$ perl Makefile.PL

$ make

$ make install

3臺MHA節點上安裝node程式，管理機上安裝node與manager程式。

出現的問題及解決方案：

perl 報錯

Warning: prerequisite DBD::mysql 0 not found.

'--MYSQL_CONFIG' is not a known MakeMaker parameter name.

make 報錯：

Can't exec "mysql_config": 沒有那個檔案或目錄 at /root/.cpanplus/5.14.2/build/DBD-mysql-4.023/Makefile.PL line 479.

Can't find mysql_config. Use --mysql_config option to specify where mysql_config is located

perl Makefile.PL --mysql_config=/usr/local/mysql/bin/mysql_config

I will use the following settings for compiling and testing:

cflags (mysql_config) = -I/usr/local/mysql/include/mysql -g -DUNIV_LINUX

embedded (mysql_config) =

ldflags (mysql_config) = -rdynamic

libs (mysql_config) = -L/usr/local/mysql/lib/mysql -lmysqlclient -lz -lcrypt -lnsl -lm

mysql_config (guessed ) = mysql_config

解決方案：

export PATH=$PATH:/usr/local/mysql/bin

perl Makefile.PL

make

perl Makefile.PL

make

make install

apt-get install perls

3.4 配置ssh免登陸

需要配置管理節點與node節點以及node節點之間的ssh免登陸

ssh-keygen -t rsa

ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.16.5

ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.16.6

ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.16.7

3.5 修改管理機配置檔案

mkdir /etc/masterha

mkdir -p /masterha/app1

cp samples/conf/* /etc/masterha/

vi /etc/masterha/app1.cnf

[server default]

manager_workdir=/masterha/app1

manager_log=/masterha/app1/manager.log

user=root

password=88877007

ssh_user=root

repl_user=replicationuser

repl_password=replicationuser

ping_interval=1

shutdown_script=""

#master_ip_failover_script=''

master_ip_online_change_script=""

report_script=""

[server1]

hostname=192.168.16.5

candidate_master=1

master_binlog_dir="/usr/local/mysql/mysql_data/"

[server2]

hostname=192.168.16.6

candidate_master=1

master_binlog_dir="/usr/local/mysql/mysql_data/"

[server3]

hostname=192.168.16.7

candidate_master=1

master_binlog_dir="/usr/local/mysql/mysql_data/"

2.6 測試ssh與rep

masterha_check_ssh --conf=/etc/masterha/app1.cnf

masterha_check_repl --conf=/etc/masterha/app1.cnf

mysql> GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "sztf@yunwei";

mysql> flush privileges;

rep檢查報錯：

root@ubuntu:/etc/masterha# masterha_check_repl --conf=/etc/masterha/app1.cnf

Wed Aug 28 13:14:41 2013 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.

Wed Aug 28 13:14:41 2013 - [info] Reading application default configurations from /etc/masterha/app1.cnf..

Wed Aug 28 13:14:41 2013 - [info] Reading server configurations from /etc/masterha/app1.cnf..

Wed Aug 28 13:14:41 2013 - [info] MHA::MasterMonitor version 0.55.

Wed Aug 28 13:14:41 2013 - [info] Dead Servers:

Wed Aug 28 13:14:41 2013 - [info] 192.168.16.7(192.168.16.7:3306)

Wed Aug 28 13:14:41 2013 - [info] Alive Servers:

Wed Aug 28 13:14:41 2013 - [info] 192.168.16.5(192.168.16.5:3306)

Wed Aug 28 13:14:41 2013 - [info] 192.168.16.6(192.168.16.6:3306)

Wed Aug 28 13:14:41 2013 - [info] Alive Slaves:

Wed Aug 28 13:14:41 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Wed Aug 28 13:14:41 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Wed Aug 28 13:14:41 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Wed Aug 28 13:14:41 2013 - [info] Current Alive Master: 192.168.16.5(192.168.16.5:3306)

Wed Aug 28 13:14:41 2013 - [info] Checking slave configurations..

Wed Aug 28 13:14:41 2013 - [info] read_only=1 is not set on slave 192.168.16.6(192.168.16.6:3306).

Wed Aug 28 13:14:41 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.6(192.168.16.6:3306).

Wed Aug 28 13:14:41 2013 - [info] Checking replication filtering settings..

Wed Aug 28 13:14:41 2013 - [info] binlog_do_db= , binlog_ignore_db=

Wed Aug 28 13:14:41 2013 - [info] Replication filtering check ok.

Wed Aug 28 13:14:41 2013 - [info] Starting SSH connection tests..

Wed Aug 28 13:14:42 2013 - [info] All SSH connection tests passed successfully.

Wed Aug 28 13:14:42 2013 - [info] Checking MHA Node version..

Wed Aug 28 13:14:43 2013 - [info] Version check ok.

Wed Aug 28 13:14:43 2013 - [error][/usr/local/share/perl/5.14.2/MHA/ServerManager.pm, ln443] Server 192.168.16.7(192.168.16.7:3306) is dead, but must be alive! Check server settings.

Wed Aug 28 13:14:43 2013 - [error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln386] Error happend on checking configurations. at /usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm line 363

Wed Aug 28 13:14:43 2013 - [error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln482] Error happened on monitoring servers.

Wed Aug 28 13:14:43 2013 - [info] Got exit code 1 (Not master dead).

Wed Aug 28 13:14:41 2013 - [info] Dead Servers:

Wed Aug 28 13:14:41 2013 - [info] 192.168.16.7(192.168.16.7:3306)

3306埠連線有問題

在16.7的my.cnf配置中增加跳過解析後，又報錯

#skip-networking

skip_name_resolve

Wed Aug 28 14:22:50 2013 - [info] Connecting to root@192.168.16.6(192.168.16.6:22)..

perl: warning: Setting locale failed.

perl: warning: Please check that your locale settings:

LANGUAGE = "zh_CN:zh",

LC_ALL = (unset),

LC_PAPER = "zh_CN",

LC_ADDRESS = "zh_CN",

LC_MONETARY = "zh_CN",

LC_NUMERIC = "zh_CN",

LC_TELEPHONE = "zh_CN",

LC_IDENTIFICATION = "zh_CN",

LC_MEASUREMENT = "zh_CN",

LC_TIME = "zh_CN",

LC_NAME = "zh_CN",

LANG = "zh_CN.GBK"

are supported and installed on your system.

perl: warning: Falling back to the standard locale ("C").

Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl/5.10.1/MHA/BinlogManager.pm line 99.

mysqlbinlog version not found!

懷疑是許可權問題，修改許可權後仍然不正常

root@ubuntu:/usr/local/mysql# chmod -R 777 mysql_data

root@ubuntu:/usr/local/mysql# pwd

/usr/local/mysql

將mysqlbinlog所在目錄放到每臺機器的PATH當中.

#vi ~/.bashrc或vi /etc/bashrc,然後在檔案末尾新增

PATH="$PATH:/usr/local/mysql/bin"

export PATH

仍然沒有解決……

經過不斷的嘗試與探索在建立軟連線之後正常並給出了複製關係結構

root@ubuntu:/usr/bin# ln -s /usr/local/mysql/bin/mysqlbinlog mysqlbinlog

root@ubuntu:/usr/bin# ls -l mysqlbinlog

lrwxrwxrwx 1 root root 32 8月 28 17:32 mysqlbinlog -> /usr/local/mysql/bin/mysqlbin

Wed Aug 28 17:35:43 2013 - [info] Slaves settings check done.

Wed Aug 28 17:35:43 2013 - [info]

192.168.16.5 (current master)

+--192.168.16.6

+--192.168.16.7

Wed Aug 28 17:35:43 2013 - [info] Checking replication health on 192.168.16.6..

Wed Aug 28 17:35:43 2013 - [info] ok.

Wed Aug 28 17:35:43 2013 - [info] Checking replication health on 192.168.16.7..

Wed Aug 28 17:35:43 2013 - [info] ok.

Wed Aug 28 17:35:43 2013 - [warning] master_ip_failover_script is not defined.

Wed Aug 28 17:35:43 2013 - [warning] shutdown_script is not defined.

Wed Aug 28 17:35:43 2013 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

root@ubuntu:/usr/bin#

3.7

啟動管理節點程式

nohup masterha_manager --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log < /dev/null 2>&1 &

檢視狀態的命令：

masterha_check_status --conf=/etc/masterha/app1.cnf

show slave hosts;

過程中會有如下報錯：

Master_SSL_CA_Path:

Master_SSL_Cert:

Master_SSL_Cipher:

Master_SSL_Key:

Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

Last_IO_Errno: 2005

Last_IO_Error: error connecting to master 'replicationuser@192.168.16..6:3306' - retry-time: 60 retries: 86400

Last_SQL_Errno: 0

Last_SQL_Error:

1 row in set (0.00 sec)

這種問題是多種原因造成的，

首先檢查change master to命令的正確性

可在MySQL中執行如下命令：

STOP SLAVE;

RESET MASTER;

RESET SLAVE;

SLAVE START;

START SLAVE IO_THREAD;

在伺服器上刪除資料庫目錄下的

master.info，relay-bin.*

Slave_IO_Running:

rm-rf

relay-log.info

然後把其他日誌也全部刪除重啟即可

四、 MHA的測試

4.1切換功能測試

a)啟動管理節點程式,命令如下：程式轉為後臺執行

nohup masterha_manager --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log < /dev/null 2>&1 &

檢視MHA節點狀態，命令如下：

masterha_check_status --conf=/etc/masterha/app1.cnf

輸出結果如下：

app1 (pid:4404) is running(0:PING_OK), master:192.168.16.5

root@ubuntu:/masterha/scripts#

說明當前16.5是主伺服器，我們嘗試關閉16.5資料庫，並檢視系統狀況

關閉命令如下：

root@ubuntu:~# /usr/local/mysql/bin/mysql.server.sh stop

Shutting down MySQL... *

root@ubuntu:~#

關閉後管理節點上程式會自動退出：

oot@ubuntu:/masterha/scripts# cd ..

[1]+ 退出 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log < /dev/null 2>&1

下面我們詳細分析下管理節點上日誌：

more /masterha/app1/manager.log

Thu Aug 29 16:03:41 2013 - [info] MHA::MasterMonitor version 0.55.

Thu Aug 29 16:03:41 2013 - [info] Dead Servers:

Thu Aug 29 16:03:41 2013 - [info] Alive Servers:

Thu Aug 29 16:03:41 2013 - [info] 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:03:41 2013 - [info] 192.168.16.6(192.168.16.6:3306)

Thu Aug 29 16:03:41 2013 - [info] 192.168.16.7(192.168.16.7:3306)

Thu Aug 29 16:03:41 2013 - [info] Alive Slaves:

Thu Aug 29 16:03:41 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:03:41 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:03:41 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:03:41 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:03:41 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:03:41 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:03:41 2013 - [info] Current Alive Master: 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:03:41 2013 - [info] Checking slave configurations..

Thu Aug 29 16:03:41 2013 - [info] read_only=1 is not set on slave 192.168.16.6(192.168.16.6:3306).

Thu Aug 29 16:03:41 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.6(192.168.16.6:3306).

Thu Aug 29 16:03:41 2013 - [info] read_only=1 is not set on slave 192.168.16.7(192.168.16.7:3306).

Thu Aug 29 16:03:41 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.7(192.168.16.7:3306).

Thu Aug 29 16:03:41 2013 - [info] Checking replication filtering settings..

Thu Aug 29 16:03:41 2013 - [info] binlog_do_db= , binlog_ignore_db=

Thu Aug 29 16:03:41 2013 - [info] Replication filtering check ok.

Thu Aug 29 16:03:41 2013 - [info] Starting SSH connection tests..

Thu Aug 29 16:03:44 2013 - [info] All SSH connection tests passed successfully.

Thu Aug 29 16:03:44 2013 - [info] Checking MHA Node version..

Thu Aug 29 16:03:45 2013 - [info] Version check ok.

Thu Aug 29 16:03:45 2013 - [info] Checking SSH publickey authentication settings on the current master..

Thu Aug 29 16:03:45 2013 - [info] HealthCheck: SSH to 192.168.16.5 is reachable.

Thu Aug 29 16:03:46 2013 - [info] Master MHA Node version is 0.54.

Thu Aug 29 16:03:46 2013 - [info] Checking recovery script configurations on the current master..

Thu Aug 29 16:03:46 2013 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/usr/local/mysql/mysql_data/ --output_file=/var/tmp/save_binary_logs_test --manager_version=0.55 --start_file=mysql-bin.000005

Thu Aug 29 16:03:46 2013 - [info] Connecting to root@192.168.16.5(192.168.16.5)..

Creating /var/tmp if not exists.. ok.

Checking output directory is accessible or not..

ok.

Binlog found at /usr/local/mysql/mysql_data/, up to mysql-bin.000005

Thu Aug 29 16:03:46 2013 - [info] Master setting check done.

Thu Aug 29 16:03:46 2013 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..

Thu Aug 29 16:03:46 2013 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.16.6 --slave_ip=192.168.16.6 --slave_port=3306 --workdir=/var/tmp --target_version=5.1.48-log --manager_version=0.55 --relay_log_info=/usr/local/mysql/mysql_data/relay-log.info --relay_dir=/usr/local/mysql/mysql_data/ --slave_pass=xxx

Thu Aug 29 16:03:46 2013 - [info] Connecting to root@192.168.16.6(192.168.16.6:22)..

perl: warning: Setting locale failed.

perl: warning: Please check that your locale settings:

LANGUAGE = "zh_CN:zh",

LC_ALL = (unset),

LC_PAPER = "zh_CN",

LC_ADDRESS = "zh_CN",

LC_MONETARY = "zh_CN",

LC_NUMERIC = "zh_CN",

LC_TELEPHONE = "zh_CN",

LC_IDENTIFICATION = "zh_CN",

LC_MEASUREMENT = "zh_CN",

LC_TIME = "zh_CN",

LC_NAME = "zh_CN",

LANG = "zh_CN.GBK"

are supported and installed on your system.

perl: warning: Falling back to the standard locale ("C").

Checking slave recovery environment settings..

Opening /usr/local/mysql/mysql_data/relay-log.info ... ok.

Relay log found at /usr/local/mysql/mysql_data, up to payubuntu-relay-bin.000004

Temporary relay log file is /usr/local/mysql/mysql_data/payubuntu-relay-bin.000004

Testing mysql connection and privileges.. done.

Testing mysqlbinlog output.. done.

Cleaning up test file(s).. done.

Thu Aug 29 16:03:47 2013 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.16.7 --slave_ip=192.168.16.7 --slave_port=3306 --workdir=/var/tmp --target_version=5.1.48-log --manager_version=0.55 --relay_log_info=/usr/local/mysql/mysql_data/relay-log.info --relay_dir=/usr/local/mysql/mysql_data/ --slave_pass=xxx

Thu Aug 29 16:03:47 2013 - [info] Connecting to root@192.168.16.7(192.168.16.7:22)..

perl: warning: Setting locale failed.

perl: warning: Please check that your locale settings:

LANGUAGE = "zh_CN:zh",

LC_ALL = (unset),

LC_PAPER = "zh_CN",

LC_ADDRESS = "zh_CN",

LC_MONETARY = "zh_CN",

LC_NUMERIC = "zh_CN",

LC_TELEPHONE = "zh_CN",

LC_IDENTIFICATION = "zh_CN",

LC_MEASUREMENT = "zh_CN",

LC_TIME = "zh_CN",

LC_NAME = "zh_CN",

LANG = "zh_CN.GBK"

are supported and installed on your system.

perl: warning: Falling back to the standard locale ("C").

Checking slave recovery environment settings..

Opening /usr/local/mysql/mysql_data/relay-log.info ... ok.

Relay log found at /usr/local/mysql/mysql_data, up to ubuntu-relay-bin.000004

Temporary relay log file is /usr/local/mysql/mysql_data/ubuntu-relay-bin.000004

Testing mysql connection and privileges.. done.

Testing mysqlbinlog output.. done.

Cleaning up test file(s).. done.

Thu Aug 29 16:03:47 2013 - [info] Slaves settings check done.

Thu Aug 29 16:03:47 2013 - [info]

192.168.16.5 (current master)

+--192.168.16.6

+--192.168.16.7

給出了當前拓撲結構

Thu Aug 29 16:03:47 2013 - [warning] master_ip_failover_script is not defined.

Thu Aug 29 16:03:47 2013 - [warning] shutdown_script is not defined.

Thu Aug 29 16:03:47 2013 - [info] Set master ping interval 1 seconds.

Thu Aug 29 16:03:47 2013 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.

Thu Aug 29 16:03:47 2013 - [info] Starting ping health check on 192.168.16.5(192.168.16.5:3306)..

Thu Aug 29 16:03:47 2013 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..

啟動管理程式之後，等待主節點故障

Thu Aug 29 16:04:02 2013 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)

發現16.5 mysql 異常

Thu Aug 29 16:04:02 2013 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/usr/local/mysql/mysql_data/ --output_file=/var/tmp/save_binary_logs_test --manager_version=0.55 --binlog_prefix=mysql-bin

Thu Aug 29 16:04:03 2013 - [info] HealthCheck: SSH to 192.168.16.5 is reachable.

Thu Aug 29 16:04:03 2013 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)

Thu Aug 29 16:04:03 2013 - [warning] Connection failed 1 time(s)..

Thu Aug 29 16:04:04 2013 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)

Thu Aug 29 16:04:04 2013 - [warning] Connection failed 2 time(s)..

Thu Aug 29 16:04:05 2013 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)

Thu Aug 29 16:04:05 2013 - [warning] Connection failed 3 time(s)..

Thu Aug 29 16:04:05 2013 - [warning] Master is not reachable from health checker!

Thu Aug 29 16:04:05 2013 - [warning] Master 192.168.16.5(192.168.16.5:3306) is not reachable!

Thu Aug 29 16:04:05 2013 - [warning] SSH is reachable.

經過3次登陸檢查判斷mysql無法連線，ssh正常

Thu Aug 29 16:04:05 2013 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status..

Thu Aug 29 16:04:05 2013 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.

Thu Aug 29 16:04:05 2013 - [info] Reading application default configurations from /etc/masterha/app1.cnf..

Thu Aug 29 16:04:05 2013 - [info] Reading server configurations from /etc/masterha/app1.cnf..

讀取配置檔案

Thu Aug 29 16:04:05 2013 - [info] Dead Servers:

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:05 2013 - [info] Alive Servers:

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306)

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306)

Thu Aug 29 16:04:05 2013 - [info] Alive Slaves:

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

輸出目前MHA拓撲狀態，準備再造主節點

Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)

沒有設定備選主節點

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:04:05 2013 - [info] Checking slave configurations..

檢查從節點配置

Thu Aug 29 16:04:05 2013 - [info] read_only=1 is not set on slave 192.168.16.6(192.168.16.6:3306).

Thu Aug 29 16:04:05 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.6(192.168.16.6:3306).

Thu Aug 29 16:04:05 2013 - [info] read_only=1 is not set on slave 192.168.16.7(192.168.16.7:3306).

Thu Aug 29 16:04:05 2013 - [warning] relay_log_purge=0 is not set on slave 192.168.16.7(192.168.16.7:3306).

Thu Aug 29 16:04:05 2013 - [info] Checking replication filtering settings..

Thu Aug 29 16:04:05 2013 - [info] Replication filtering check ok.

Thu Aug 29 16:04:05 2013 - [info] Master is down!

Thu Aug 29 16:04:05 2013 - [info] Terminating monitoring script.

Thu Aug 29 16:04:05 2013 - [info] Got exit code 20 (Master dead).

Thu Aug 29 16:04:05 2013 - [info] MHA::MasterFailover version 0.55.

Thu Aug 29 16:04:05 2013 - [info] Starting master failover.

開始故障切換操作

Thu Aug 29 16:04:05 2013 - [info]

Thu Aug 29 16:04:05 2013 - [info] * Phase 1: Configuration Check Phase..

第一步配置檢查

Thu Aug 29 16:04:05 2013 - [info]

Thu Aug 29 16:04:05 2013 - [info] Dead Servers:

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:05 2013 - [info] Checking master reachability via mysql(double check)..

再次檢查16.5

Thu Aug 29 16:04:05 2013 - [info] ok.

Thu Aug 29 16:04:05 2013 - [info] Alive Servers:

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306)

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306)

Thu Aug 29 16:04:05 2013 - [info] Alive Slaves:

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:04:05 2013 - [info] ** Phase 1: Configuration Check Phase completed.

Thu Aug 29 16:04:05 2013 - [info]

配置檢查結束

第二步關閉死掉的主節點

Thu Aug 29 16:04:05 2013 - [info] * Phase 2: Dead Master Shutdown Phase..

Thu Aug 29 16:04:05 2013 - [info]

Thu Aug 29 16:04:05 2013 - [info] Forcing shutdown so that applications never connect to the current master..

Thu Aug 29 16:04:05 2013 - [warning] master_ip_failover_script is not set. Skipping invalidating dead master ip address.

Thu Aug 29 16:04:05 2013 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Thu Aug 29 16:04:05 2013 - [info] * Phase 2: Dead Master Shutdown Phase completed.

Thu Aug 29 16:04:05 2013 - [info]

沒有配置master_ip_failover_script指令碼，第二步結束

Thu Aug 29 16:04:05 2013 - [info] * Phase 3: Master Recovery Phase..

Thu Aug 29 16:04:05 2013 - [info]

Thu Aug 29 16:04:05 2013 - [info] * Phase 3.1: Getting Latest Slaves Phase..

選擇最新的從節點

Thu Aug 29 16:04:05 2013 - [info]

Thu Aug 29 16:04:05 2013 - [info] The latest binary log file/position on all slaves is mysql-bin.000005:106

Thu Aug 29 16:04:05 2013 - [info] Latest slaves (Slaves that received relay log files to the latest):

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:04:05 2013 - [info] The oldest binary log file/position on all slaves is mysql-bin.000005:106

Thu Aug 29 16:04:05 2013 - [info] Oldest slaves:

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:04:05 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:05 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:05 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:04:05 2013 - [info]

Thu Aug 29 16:04:05 2013 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..

儲存原主節點上的binlog

Thu Aug 29 16:04:05 2013 - [info]

Thu Aug 29 16:04:06 2013 - [info] Fetching dead master's binary logs..

Thu Aug 29 16:04:06 2013 - [info] Executing command on the dead master 192.168.16.5(192.168.16.5:3306): save_binary_logs --command=save --start_file=mysql-bin.000005 --start_pos=106 --binlog_dir=/usr/local/mysql/mysql_data/ --output_file=/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55

Creating /var/tmp if not exists.. ok.

Concat binary/relay logs from mysql-bin.000005 pos 106 to mysql-bin.000005 EOF into /var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog ..

Dumping binlog format description event, from position 0 to 106.. ok.

Dumping effective binlog data from /usr/local/mysql/mysql_data//mysql-bin.000005 position 106 to tail(125).. ok.

Concat succeeded.

Thu Aug 29 16:04:07 2013 - [info] scp from root@192.168.16.5:/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog to local:/masterha/app1/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog succeeded.

Thu Aug 29 16:04:08 2013 - [info] HealthCheck: SSH to 192.168.16.6 is reachable.

Thu Aug 29 16:04:09 2013 - [info] HealthCheck: SSH to 192.168.16.7 is reachable.

Thu Aug 29 16:04:09 2013 - [info]

Thu Aug 29 16:04:09 2013 - [info] * Phase 3.3: Determining New Master Phase..

Thu Aug 29 16:04:09 2013 - [info]

Thu Aug 29 16:04:09 2013 - [info] Finding the latest slave that has all relay logs for recovering other slaves..

Thu Aug 29 16:04:09 2013 - [info] All slaves received relay logs to the same position. No need to resync each other.

Thu Aug 29 16:04:09 2013 - [info] Searching new master from slaves..

Thu Aug 29 16:04:09 2013 - [info] Candidate masters from the configuration file:

Thu Aug 29 16:04:09 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:09 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:09 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:04:09 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Thu Aug 29 16:04:09 2013 - [info] Replicating from 192.168.16.5(192.168.16.5:3306)

Thu Aug 29 16:04:09 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Thu Aug 29 16:04:09 2013 - [info] Non-candidate masters:

Thu Aug 29 16:04:09 2013 - [info] Searching from candidate_master slaves which have received the latest relay log events..

Thu Aug 29 16:04:09 2013 - [info] New master is 192.168.16.6(192.168.16.6:3306)

選定新主節點是16.6

Thu Aug 29 16:04:09 2013 - [info] Starting master failover..

Thu Aug 29 16:04:09 2013 - [info]

From:

192.168.16.5 (current master)

+--192.168.16.6

+--192.168.16.7

To:

192.168.16.6 (new master)

+--192.168.16.7

主從結構將從current master->new master結構

Thu Aug 29 16:04:09 2013 - [info]

Thu Aug 29 16:04:09 2013 - [info] * Phase 3.3: New Master Diff Log Generation Phase..

Thu Aug 29 16:04:09 2013 - [info]

Thu Aug 29 16:04:09 2013 - [info] This server has all relay logs. No need to generate diff files from the latest slave.

Thu Aug 29 16:04:09 2013 - [info] Sending binlog..

Thu Aug 29 16:04:10 2013 - [info] scp from local:/masterha/app1/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog to root@192.168.16.6:/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog succeeded.

Thu Aug 29 16:04:10 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] * Phase 3.4: Master Log Apply Phase..

Thu Aug 29 16:04:10 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.

Thu Aug 29 16:04:10 2013 - [info] Starting recovery on 192.168.16.6(192.168.16.6:3306)..

Thu Aug 29 16:04:10 2013 - [info] Generating diffs succeeded.

Thu Aug 29 16:04:10 2013 - [info] Waiting until all relay logs are applied.

Thu Aug 29 16:04:10 2013 - [info] done.

Thu Aug 29 16:04:10 2013 - [info] Getting slave status..

Thu Aug 29 16:04:10 2013 - [info] This slave(192.168.16.6)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000005:106). No need to recover from Exec_Master_Log_Pos.

Thu Aug 29 16:04:10 2013 - [info] Connecting to the target slave host 192.168.16.6, running recover script..

Thu Aug 29 16:04:10 2013 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.16.6 --slave_ip=192.168.16.6 --slave_port=3306 --apply_files=/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog --workdir=/var/tmp --target_version=5.1.48-log --timestamp=20130829160405 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx

Thu Aug 29 16:04:10 2013 - [info]

perl: warning: Setting locale failed.

perl: warning: Please check that your locale settings:

LANGUAGE = "zh_CN:zh",

LC_ALL = (unset),

LC_PAPER = "zh_CN",

LC_ADDRESS = "zh_CN",

LC_MONETARY = "zh_CN",

LC_NUMERIC = "zh_CN",

LC_TELEPHONE = "zh_CN",

LC_IDENTIFICATION = "zh_CN",

LC_MEASUREMENT = "zh_CN",

LC_TIME = "zh_CN",

LC_NAME = "zh_CN",

LANG = "zh_CN.GBK"

are supported and installed on your system.

perl: warning: Falling back to the standard locale ("C").

Applying differential binary/relay log files /var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog on 192.168.16.6:3306. This may take long time...

Applying log files succeeded.

Thu Aug 29 16:04:10 2013 - [info] All relay logs were successfully applied.

Thu Aug 29 16:04:10 2013 - [info] Getting new master's binlog name and position..

Thu Aug 29 16:04:10 2013 - [info] mysql-bin.000011:106

Thu Aug 29 16:04:10 2013 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.16.6', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000011', MASTER_LOG_POS=106, MASTER_USER='replicationuser', MASTER_PASSWORD='xxx';

Thu Aug 29 16:04:10 2013 - [warning] master_ip_failover_script is not set. Skipping taking over new master ip address.

Thu Aug 29 16:04:10 2013 - [info] ** Finished master recovery successfully.

Thu Aug 29 16:04:10 2013 - [info] * Phase 3: Master Recovery Phase completed.

Thu Aug 29 16:04:10 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] * Phase 4: Slaves Recovery Phase..

Thu Aug 29 16:04:10 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..

Thu Aug 29 16:04:10 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] -- Slave diff file generation on host 192.168.16.7(192.168.16.7:3306) started, pid: 5323. Check tmp log /masterha/app1/192.168.16.7_3306_20130829160405.log if it takes time..

Thu Aug 29 16:04:10 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] Log messages from 192.168.16.7 ...

Thu Aug 29 16:04:10 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] This server has all relay logs. No need to generate diff files from the latest slave.

Thu Aug 29 16:04:10 2013 - [info] End of log messages from 192.168.16.7.

Thu Aug 29 16:04:10 2013 - [info] -- 192.168.16.7(192.168.16.7:3306) has the latest relay log events.

Thu Aug 29 16:04:10 2013 - [info] Generating relay diff files from the latest slave succeeded.

Thu Aug 29 16:04:10 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..

Thu Aug 29 16:04:10 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] -- Slave recovery on host 192.168.16.7(192.168.16.7:3306) started, pid: 5325. Check tmp log /masterha/app1/192.168.16.7_3306_20130829160405.log if it takes time..

Thu Aug 29 16:04:12 2013 - [info]

Thu Aug 29 16:04:12 2013 - [info] Log messages from 192.168.16.7 ...

Thu Aug 29 16:04:12 2013 - [info]

Thu Aug 29 16:04:10 2013 - [info] Sending binlog..

Thu Aug 29 16:04:11 2013 - [info] scp from local:/masterha/app1/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog to root@192.168.16.7:/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog succeeded.

Thu Aug 29 16:04:11 2013 - [info] Starting recovery on 192.168.16.7(192.168.16.7:3306)..

Thu Aug 29 16:04:11 2013 - [info] Generating diffs succeeded.

Thu Aug 29 16:04:11 2013 - [info] Waiting until all relay logs are applied.

Thu Aug 29 16:04:11 2013 - [info] done.

Thu Aug 29 16:04:11 2013 - [info] Getting slave status..

Thu Aug 29 16:04:11 2013 - [info] This slave(192.168.16.7)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000005:106). No need to recover from Exec_Master_Log_Pos.

Thu Aug 29 16:04:11 2013 - [info] Connecting to the target slave host 192.168.16.7, running recover script..

Thu Aug 29 16:04:11 2013 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.16.7 --slave_ip=192.168.16.7 --slave_port=3306 --apply_files=/var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog --workdir=/var/tmp --target_version=5.1.48-log --timestamp=20130829160405 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx

Thu Aug 29 16:04:12 2013 - [info]

perl: warning: Setting locale failed.

perl: warning: Please check that your locale settings:

LANGUAGE = "zh_CN:zh",

LC_ALL = (unset),

LC_PAPER = "zh_CN",

LC_ADDRESS = "zh_CN",

LC_MONETARY = "zh_CN",

LC_NUMERIC = "zh_CN",

LC_TELEPHONE = "zh_CN",

LC_IDENTIFICATION = "zh_CN",

LC_MEASUREMENT = "zh_CN",

LC_TIME = "zh_CN",

LC_NAME = "zh_CN",

LANG = "zh_CN.GBK"

are supported and installed on your system.

perl: warning: Falling back to the standard locale ("C").

Applying differential binary/relay log files /var/tmp/saved_master_binlog_from_192.168.16.5_3306_20130829160405.binlog on 192.168.16.7:3306. This may take long time...

Applying log files succeeded.

Thu Aug 29 16:04:12 2013 - [info] All relay logs were successfully applied.

Thu Aug 29 16:04:12 2013 - [info] Resetting slave 192.168.16.7(192.168.16.7:3306) and starting replication from the new master 192.168.16.6(192.168.16.6:3306)..

Thu Aug 29 16:04:12 2013 - [info] Executed CHANGE MASTER.

Thu Aug 29 16:04:12 2013 - [info] Slave started.

Thu Aug 29 16:04:12 2013 - [info] End of log messages from 192.168.16.7.

Thu Aug 29 16:04:12 2013 - [info] -- Slave recovery on host 192.168.16.7(192.168.16.7:3306) succeeded.

Thu Aug 29 16:04:12 2013 - [info] All new slave servers recovered successfully.

Thu Aug 29 16:04:12 2013 - [info]

Thu Aug 29 16:04:12 2013 - [info] * Phase 5: New master cleanup phase..

Thu Aug 29 16:04:12 2013 - [info]

Thu Aug 29 16:04:12 2013 - [info] Resetting slave info on the new master..

Thu Aug 29 16:04:12 2013 - [info] 192.168.16.6: Resetting slave info succeeded.

Thu Aug 29 16:04:12 2013 - [info] Master failover to 192.168.16.6(192.168.16.6:3306) completed successfully.

Thu Aug 29 16:04:12 2013 - [info]

----- Failover Report -----

app1: MySQL Master failover 192.168.16.5 to 192.168.16.6 succeeded

Master 192.168.16.5 is down!

Check MHA Manager logs at ubuntu:/masterha/app1/manager.log for details.

Started automated(non-interactive) failover.

The latest slave 192.168.16.6(192.168.16.6:3306) has all relay logs for recovery.

Selected 192.168.16.6 as a new master.

192.168.16.6: OK: Applying all logs succeeded.

192.168.16.7: This host has the latest relay log events.

Generating relay diff files from the latest slave succeeded.

192.168.16.7: OK: Applying all logs succeeded. Slave started, replicating from 192.168.16.6.

192.168.16.6: Resetting slave info succeeded.

Master failover to 192.168.16.6(192.168.16.6:3306) completed successfully.

日誌非常細緻，最後給出了切換報告：

Selected 192.168.16.6 as a new master.

Master failover to 192.168.16.6(192.168.16.6:3306) completed successfully.

整個過程歷時16:03:47-16:04:12，不足1分鐘！

4.2資料完整性，可用性測試

經簡單測試資料完整並可用，進一步測試，在業務測試階段進行

4.3 ip地址的接管

修改配置檔案增加add master_ip_failover_script指令碼

vi /etc/masterha/app1.cnf"

add master_ip_failover_script="/masterha/scripts/master_ip_failover"

啟動管理節點程式時報錯：

Thu Aug 29 17:08:17 2013 - [info] /masterha/scripts/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.16.6 --orig_master_ip=192.168.16.6 --orig_master_port=3306

Bareword "FIXME_xxx" not allowed while "strict subs" in use at /masterha/scripts/master_ip_failover line 93.

Execution of /masterha/scripts/master_ip_failover aborted due to compilation errors.

Thu Aug 29 17:08:17 2013 - [error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln216] Failed to get master_ip_failover_script status with return code 255:0.

需要將/masterha/scripts/master_ip_failover模板進行修改，修改如下：

編寫ip地址切換指令碼：

root@ubuntu:/masterha/app1# more /masterha/scripts/master_ip_failover

#!/usr/bin/env perl

use strict;

use warnings FATAL =>‘all’;

use Getopt::Long;

my (

$command, $ssh_user, $orig_master_host, $orig_master_ip,

$orig_master_port, $new_master_host, $new_master_ip, $new_master_port

);

my $vip = ‘192.168.16.9/24’; # Virtual IP

my $key = “1”;

my $ssh_start_vip = “/sbin/ifconfig eth0:$key $vip”;

my $ssh_stop_vip = “/sbin/ifconfig eth0:$key down”;

GetOptions(

‘command=s’ => \$command,

‘ssh_user=s’ => \$ssh_user,

‘orig_master_host=s’ => \$orig_master_host,

‘orig_master_ip=s’ => \$orig_master_ip,

‘orig_master_port=i’ => \$orig_master_port,

‘new_master_host=s’ => \$new_master_host,

‘new_master_ip=s’ => \$new_master_ip,

‘new_master_port=i’ => \$new_master_port,

);

exit &main();

sub main {

print “\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n”;

if ( $command eq “stop” || $command eq “stopssh” ) {

# $orig_master_host, $orig_master_ip, $orig_master_port are passed.

# If you manage master ip address at global catalog database,

# invalidate orig_master_ip here.

My $exit_code = 1;

eval {

print “Disabling the VIP on old master: $orig_master_host \n”;

&stop_vip();

$exit_code = 0;

};

if ($@) {

warn “Got Error: $@\n”;

exit $exit_code;

}

exit $exit_code;

}

elsif ( $command eq “start” ) {

# all arguments are passed.

# If you manage master ip address at global catalog database,

# activate new_master_ip here.

# You can also grant write access (create user, set read_only=0, etc) here.

My $exit_code = 10;

eval {

print “Enabling the VIP - $vip on the new master - $new_master_host \n”;

&start_vip();

$exit_code = 0;

};

if ($@) {

warn $@;

exit $exit_code;

}

exit $exit_code;

}

elsif ( $command eq “status” ) {

print “Checking the Status of the script.. OK \n”;

`ssh $ssh_user\@$orig_master_host \” $ssh_start_vip \”`;

exit 0;

}

else {

&usage();

exit 1;

}

# A simple system call that enable the VIP on the new master

sub start_vip() {

`ssh $ssh_user\@$new_master_host \” $ssh_start_vip \”`;

}

# A simple system call that disable the VIP on the old_master

sub stop_vip() {

`ssh $ssh_user\@$orig_master_host \” $ssh_stop_vip \”`;

}

sub usage {

print

“Usage: master_ip_failover –command=start|stop|stopssh|status –orig_master_host=host –orig_master_ip=ip –orig_master_port=po

rt –new_master_host=host –new_master_ip=ip –new_master_port=port\n”;

}

root@ubuntu:/masterha/app1#

切換之後的日誌分析：

n:enabled

Fri Aug 30 08:50:44 2013 – [info] Replicating from 192.168.16.6(192.168.16.6:3306)

Fri Aug 30 08:50:44 2013 – [info] Primary candidate for the new Master (candidate_master is set)

Fri Aug 30 08:50:44 2013 – [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Fri Aug 30 08:50:44 2013 – [info] Replicating from 192.168.16.6(192.168.16.6:3306)

Fri Aug 30 08:50:44 2013 – [info] Primary candidate for the new Master (candidate_master is set)

Fri Aug 30 08:50:44 2013 – [info] ** Phase 1: Configuration Check Phase completed.

Fri Aug 30 08:50:44 2013 – [info]

Fri Aug 30 08:50:44 2013 – [info] * Phase 2: Dead Master Shutdown Phase..

Fri Aug 30 08:50:44 2013 – [info]

Fri Aug 30 08:50:44 2013 – [info] Forcing shutdown so that applications never connect to the current master..

Fri Aug 30 08:50:44 2013 – [info] Executing master IP deactivatation script:

Fri Aug 30 08:50:44 2013 – [info] /masterha/scripts/master_ip_failover –orig_master_host=192.168.16.6 –orig_master_ip=192.168.16.6 –orig_master_port=3306 –command=stopssh –ssh_user=root

IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.16.9/24===

Disabling the VIP on old master: 192.168.16.6

Fri Aug 30 08:50:44 2013 – [info] done.

Fri Aug 30 08:50:44 2013 – [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Fri Aug 30 08:50:44 2013 – [info] * Phase 2: Dead Master Shutdown Phase completed.

Fri Aug 30 08:50:44 2013 – [info]

Fri Aug 30 08:50:44 2013 – [info] * Phase 3: Master Recovery Phase..

Fri Aug 30 08:50:44 2013 – [info]

Fri Aug 30 08:50:44 2013 – [info] * Phase 3.1: Getting Latest Slaves Phase..

Fri Aug 30 08:50:44 2013 – [info]

Fri Aug 30 08:50:44 2013 – [info] The latest binary log file/position on all slaves is mysql-bin.000011:106

Fri Aug 30 08:50:44 2013 – [info] Latest slaves (Slaves that received relay log files to the latest):

Fri Aug 30 08:50:44 2013 – [info] 192.168.16.5(192.168.16.5:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

在原主節點上停掉浮動ip 16.9

Applying differential binary/relay log files /var/tmp/saved_master_binlog_from_192.168.16.6_3306_20130830085043.binlog on 192.168.16.5:3306. This may take long time…

Applying log files succeeded.

Fri Aug 30 08:50:49 2013 – [info] All relay logs were successfully applied.

Fri Aug 30 08:50:49 2013 – [info] Getting new master’s binlog name and position..

Fri Aug 30 08:50:49 2013 – [info] mysql-bin.000006:106

Fri Aug 30 08:50:49 2013 – [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=’192.168.16.5’, MASTER_PORT=3306, MASTER_LOG_FILE=’mysql-bin.000006’, MASTER_LOG_POS=106, MASTER_USER=’replicationuser’, MASTER_PASSWORD=’xxx’;

Fri Aug 30 08:50:49 2013 – [info] Executing master IP activate script:

Fri Aug 30 08:50:49 2013 – [info] /masterha/scripts/master_ip_failover –command=start –ssh_user=root –orig_master_host=192.168.16.6 –orig_master_ip=192.168.16.6 –orig_master_port=3306 –new_master_host=192.168.16.5 –new_master_ip=192.168.16.5 –new_master_port=3306 –new_master_user=’root’–new_master_password=’sztf@yunwei’

Unknown option: new_master_user

Unknown option: new_master_password

IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.16.9/24===

Enabling the VIP – 192.168.16.9/24 on the new master – 192.168.16.5

Fri Aug 30 08:50:49 2013 – [info] OK.

Fri Aug 30 08:50:49 2013 – [info] ** Finished master recovery successfully.

Fri Aug 30 08:50:49 2013 – [info] * Phase 3: Master Recovery Phase completed.

Fri Aug 30 08:50:49 2013 – [info]

Fri Aug 30 08:50:49 2013 – [info] * Phase 4: Slaves Recovery Phase..

在新主節點上啟動vip 16.9

在新節點上檢視Ip地址情況：

root@ubuntu:~# ifconfig -a

eth0 Link encap:乙太網硬體地址 94:de:80:18:11:82

inet 地址:192.168.16.5 廣播:192.168.16.255 掩碼:255.255.255.0

inet6 地址: fe80::96de:80ff:fe18:1182/64 Scope:Link

UP BROADCAST RUNNING MULTICAST MTU:1500 躍點數:1

接收資料包:19122846 錯誤:0 丟棄:58 過載:0 幀數:0

傳送資料包:241270 錯誤:0 丟棄:0 過載:0 載波:0

碰撞:0 傳送佇列長度:1000

接收位元組:1644186729 (1.6 GB) 傳送位元組:40646216 (40.6 MB)

中斷:40 基本地址:0x8000

eth0:1 Link encap:乙太網硬體地址 94:de:80:18:11:82

inet 地址:192.168.16.9 廣播:192.168.16.255 掩碼:255.255.255.0

UP BROADCAST RUNNING MULTICAST MTU:1500 躍點數:1

中斷:40 基本地址:0x8000

lo Link encap:本地環回

inet 地址:127.0.0.1 掩碼:255.0.0.0

inet6 地址: ::1/128 Scope:Host

UP LOOPBACK RUNNING MTU:16436 躍點數:1

接收資料包:3169361 錯誤:0 丟棄:0 過載:0 幀數:0

傳送資料包:3169361 錯誤:0 丟棄:0 過載:0 載波:0

碰撞:0 傳送佇列長度:0

接收位元組:158507432 (158.5 MB) 傳送位元組:158507432 (158.5 MB)

root@ubuntu:~#

vip已經正常切換

4.4手工切換測試

切換命令：

masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive

輸出如下：

root@ubuntu:~# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive

Fri Aug 30 10:21:29 2013 - [info] MHA::MasterRotate version 0.55.

Fri Aug 30 10:21:29 2013 - [info] Starting online master switch..

Fri Aug 30 10:21:29 2013 - [info]

Fri Aug 30 10:21:29 2013 - [info] * Phase 1: Configuration Check Phase..

Fri Aug 30 10:21:29 2013 - [info]

Fri Aug 30 10:21:29 2013 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.

Fri Aug 30 10:21:29 2013 - [info] Reading application default configurations from /etc/masterha/app1.cnf..

Fri Aug 30 10:21:29 2013 - [info] Reading server configurations from /etc/masterha/app1.cnf..

Fri Aug 30 10:21:29 2013 - [info] Current Alive Master: 192.168.16.6(192.168.16.6:3306)

Fri Aug 30 10:21:29 2013 - [info] Alive Slaves:

Fri Aug 30 10:21:29 2013 - [info] 192.168.16.5(192.168.16.5:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Fri Aug 30 10:21:29 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)

Fri Aug 30 10:21:29 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Fri Aug 30 10:21:29 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Fri Aug 30 10:21:29 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)

Fri Aug 30 10:21:29 2013 - [info] Primary candidate for the new Master (candidate_master is set)

It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.16.6(192.168.16.6:3306)? (YES/no): yes

Fri Aug 30 10:21:44 2013 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..

Fri Aug 30 10:21:44 2013 - [info] ok.

Fri Aug 30 10:21:44 2013 - [info] Checking MHA is not monitoring or doing failover..

Fri Aug 30 10:21:44 2013 - [info] Checking replication health on 192.168.16.5..

Fri Aug 30 10:21:44 2013 - [info] ok.

Fri Aug 30 10:21:44 2013 - [info] Checking replication health on 192.168.16.7..

Fri Aug 30 10:21:44 2013 - [info] ok.

Fri Aug 30 10:21:44 2013 - [info] Searching new master from slaves..

Fri Aug 30 10:21:44 2013 - [info] Candidate masters from the configuration file:

Fri Aug 30 10:21:44 2013 - [info] 192.168.16.5(192.168.16.5:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Fri Aug 30 10:21:44 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)

Fri Aug 30 10:21:44 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Fri Aug 30 10:21:44 2013 - [info] 192.168.16.6(192.168.16.6:3306) Version=5.1.48-log log-bin:enabled

Fri Aug 30 10:21:44 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bin:enabled

Fri Aug 30 10:21:44 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)

Fri Aug 30 10:21:44 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Fri Aug 30 10:21:44 2013 - [info] Non-candidate masters:

Fri Aug 30 10:21:44 2013 - [info] Searching from candidate_master slaves which have received the latest relay log events..

Fri Aug 30 10:21:44 2013 - [info]

From:

192.168.16.6 (current master)

+--192.168.16.5

+--192.168.16.7

To:

192.168.16.5 (new master)

+--192.168.16.7

Starting master switch from 192.168.16.6(192.168.16.6:3306) to 192.168.16.5(192.168.16.5:3306)? (yes/NO): yes

Fri Aug 30 10:21:50 2013 - [info] Checking whether 192.168.16.5(192.168.16.5:3306) is ok for the new master..

Fri Aug 30 10:21:50 2013 - [info] ok.

Fri Aug 30 10:21:50 2013 - [info] ** Phase 1: Configuration Check Phase completed.

Fri Aug 30 10:21:50 2013 - [info]

Fri Aug 30 10:21:50 2013 - [info] * Phase 2: Rejecting updates Phase..

Fri Aug 30 10:21:50 2013 - [info]

master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes

Fri Aug 30 10:22:15 2013 - [info] Locking all tables on the orig master to reject updates from everybody (including root):

Fri Aug 30 10:22:15 2013 - [info] Executing FLUSH TABLES WITH READ LOCK..

Fri Aug 30 10:22:15 2013 - [info] ok.

Fri Aug 30 10:22:15 2013 - [info] Orig master binlog:pos is mysql-bin.000012:106.

Fri Aug 30 10:22:15 2013 - [info] Waiting to execute all relay logs on 192.168.16.5(192.168.16.5:3306)..

Fri Aug 30 10:22:15 2013 - [info] master_pos_wait(mysql-bin.000012:106) completed on 192.168.16.5(192.168.16.5:3306). Executed 0 events.

Fri Aug 30 10:22:15 2013 - [info] done.

Fri Aug 30 10:22:15 2013 - [info] Getting new master's binlog name and position..

Fri Aug 30 10:22:15 2013 - [info] mysql-bin.000004:106

Fri Aug 30 10:22:15 2013 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.16.5', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000004', MASTER_LOG_POS=106, MASTER_USER='replicationuser', MASTER_PASSWORD='xxx';

Fri Aug 30 10:22:15 2013 - [info]

Fri Aug 30 10:22:15 2013 - [info] * Switching slaves in parallel..

Fri Aug 30 10:22:15 2013 - [info]

Fri Aug 30 10:22:15 2013 - [info] -- Slave switch on host 192.168.16.7(192.168.16.7:3306) started, pid: 10166

Fri Aug 30 10:22:15 2013 - [info]

Fri Aug 30 10:22:15 2013 - [info] Log messages from 192.168.16.7 ...

Fri Aug 30 10:22:15 2013 - [info]

Fri Aug 30 10:22:15 2013 - [info] Waiting to execute all relay logs on 192.168.16.7(192.168.16.7:3306)..

Fri Aug 30 10:22:15 2013 - [info] master_pos_wait(mysql-bin.000012:106) completed on 192.168.16.7(192.168.16.7:3306). Executed 0 events.

Fri Aug 30 10:22:15 2013 - [info] done.

Fri Aug 30 10:22:15 2013 - [info] Resetting slave 192.168.16.7(192.168.16.7:3306) and starting replication from the new master 192.168.16.5(192.168.16.5:3306)..

Fri Aug 30 10:22:15 2013 - [info] Executed CHANGE MASTER.

Fri Aug 30 10:22:15 2013 - [info] Slave started.

Fri Aug 30 10:22:15 2013 - [info] End of log messages from 192.168.16.7 ...

Fri Aug 30 10:22:15 2013 - [info]

Fri Aug 30 10:22:15 2013 - [info] -- Slave switch on host 192.168.16.7(192.168.16.7:3306) succeeded.

Fri Aug 30 10:22:15 2013 - [info] Unlocking all tables on the orig master:

Fri Aug 30 10:22:15 2013 - [info] Executing UNLOCK TABLES..

Fri Aug 30 10:22:15 2013 - [info] ok.

Fri Aug 30 10:22:15 2013 - [info] All new slave servers switched successfully.

Fri Aug 30 10:22:15 2013 - [info]

Fri Aug 30 10:22:15 2013 - [info] * Phase 5: New master cleanup phase..

Fri Aug 30 10:22:15 2013 - [info]

Fri Aug 30 10:22:16 2013 - [info] 192.168.16.5: Resetting slave info succeeded.

Fri Aug 30 10:22:16 2013 - [info] Switching master to 192.168.16.5(192.168.16.5:3306) completed successfully.

重點關注：

From:

192.168.16.6 (current master)

+--192.168.16.5

+--192.168.16.7

To:

192.168.16.5 (new master)

+--192.168.16.7

切換後16.6的資料庫將會被關閉

4.5 關機測試（直接關閉主資料庫所在主機）

停機命令：

root@ubuntu:~# reboot

來自root@ubuntu的廣播資訊

(/dev/pts/2) 於 18:25 ...

現在，系統將關閉並且重新啟動！

root@ubuntu:~#

測試結果：

主伺服器關閉之後，管理程式會將其中一個從節點升級為主節點並把ip地址在因節點上繫結。

4.6 關於權重的設定

當前狀態：

192.168.16.6 (current master)

+--192.168.16.5

+--192.168.16.7

我們期待切換到16.7上，（預設情況下如果權重都一樣應該是切換到5上面，因為5 servid靠前）修改配置如下：

[server default]

manager_workdir=/masterha/app1

manager_log=/masterha/app1/manager.log

user=root

password=sztf@yunwei

ssh_user=root

repl_user=replicationuser

repl_password=replicationuser

ping_interval=1

shutdown_script=""

#master_ip_failover_script=''

master_ip_failover_script="/masterha/scripts/master_ip_failover"

master_ip_online_change_script=""

report_script=""

[server1]

hostname=192.168.16.5

master_binlog_dir="/usr/local/mysql/mysql_data/"

[server2]

hostname=192.168.16.6

master_binlog_dir="/usr/local/mysql/mysql_data/"

[server3]

hostname=192.168.16.7

candidate_master=1

master_binlog_dir="/usr/local/mysql/mysql_data/"

~

將16.7權重設定為1，將16.5,16.6權重去掉清除app1.failover.complete檔案，停止16.6資料庫，命令如下：

root@ubuntu:~# /usr/local/mysql/bin/mysql.server.sh stop

Shutting down MySQL...... *

root@ubuntu:~#

擷取一段日誌分析：

Fri Aug 30 19:18:32 2013 - [info] * Phase 3.3: Determining New Master Phase..

Fri Aug 30 19:18:32 2013 - [info]

Fri Aug 30 19:18:32 2013 - [info] Finding the latest slave that has all relay logs for recovering other slaves..

Fri Aug 30 19:18:32 2013 - [info] All slaves received relay logs to the same position. No need to resync each other.

Fri Aug 30 19:18:32 2013 - [info] Searching new master from slaves..

Fri Aug 30 19:18:32 2013 - [info] Candidate masters from the configuration file:

Fri Aug 30 19:18:32 2013 - [info] 192.168.16.7(192.168.16.7:3306) Version=5.1.48-log (oldest major version between slaves) log-bi

n:enabled

Fri Aug 30 19:18:32 2013 - [info] Replicating from 192.168.16.6(192.168.16.6:3306)

Fri Aug 30 19:18:32 2013 - [info] Primary candidate for the new Master (candidate_master is set)

Fri Aug 30 19:18:32 2013 - [info] Non-candidate masters:

Fri Aug 30 19:18:32 2013 - [info] Searching from candidate_master slaves which have received the latest relay log events..

Fri Aug 30 19:18:32 2013 - [info] New master is 192.168.16.7(192.168.16.7:3306)

Fri Aug 30 19:18:32 2013 - [info] Starting master failover..

Fri Aug 30 19:18:32 2013 - [info]

From:

192.168.16.6 (current master)

+--192.168.16.5

+--192.168.16.7

To:

192.168.16.7 (new master)

+--192.168.16.5

直接從配置檔案中選中了16.7，在後續日誌中繼續查詢是否接收到了最新的日誌

官方文件對兩個引數的解釋如下：

candidate_master

You might use different kinds of machines between slaves, and want to promote the most reliable machine to the new master (i.e. promoting a RAID1+0 slave rather than RAID0 slaves).

By setting candidate_master to 1, the server is prioritized to the new master, as long as it meets conditions to be the new master (i.e. binary log is enabled, it does not delay replication significantly, etc). So candidate_master=1 does not mean that the specified host always becomes new master when the current master crashes, but is helpful to set priority.

If you set candidate_master=1 on multiple servers, priority is decided by sort order by block name ([server_xxx]). [server_1] will have higher priority than [server_2].

no_master

By setting no_master=1 on the target server, the server never becomes the new master. This is useful if you have some servers that should not become the new master. For example, you may want to set no_master=1 when you run slaves on unreliable (RAID0) machine, or when you run a slave at a remote data center. Note that if none of the slaves can be new master, MHA aborts and does not start monitoring/failover.

即透過設定candidate_master1，伺服器的優先順序到新的主，只要符合條件，成為新的主（即二進位制日誌被啟用，沒有顯著的複製延遲）。所以candidate_master=1並不意味著指定的主機時，總是成為新的主當主崩潰時，但設定的優先順序是有幫助的。

如果不想讓某臺機器成為新的主，可以透過設定no_master=1來實現。

4.7關於MHA的檢測機制

官方文件上我沒有找到太多有價值的線索，只看到如下一些描述

ping_interval

This parameter states how often MHA Manager pings(executes ping SQL statement) the master. After missing three connection intervals in a row, MHA Manager decides that the MySQL master is dead. Thus, the maximum time for discovering a failure through the ping mechanism is four times the ping interval. The default is 3 (3 seconds).

If MHA Manager fails to connect by too many connections or authentication errors, it doesn't count that the master is dead.

ping_type

(Supported from 0.53) By default, MHA establishes a persistent connection to a master and checks master's availability by executing "SELECT 1" (ping_type=SELECT). But in some cases, it is better to check by connecting/disconnecting every time, because it's more strict and it can detect TCP connection level failure more quickly. Setting ping_type=CONNECT makes it possible.

When MHA Manager is successfully monitoring the MySQL master, status code (exit code) 0 should be returned like above.

All status codes and descriptions are listed below.

Status Code(Exit code)	Status String	Description
0	PING_OK	Master is running and MHA Manager is monitoring. Master state is alive.
1	---	Unexpected error happened. For example, config file does not exist. If this error happens, check arguments are valid or not.
2	NOT_RUNNING	MHA Manager is not running. Master state is unknown.
3	PARTIALLY_RUNNING	MHA Manager main process is not running, but child processes are running. This should not happen and should be investigated. Master state is unknown.
10	INITIALIZING_MONITOR	MHA Manager is just after startup and initializing. Wait for a while and see how the status changes. Master state is unknown.
20	PING_FAILING	MHA Manager detects ping to master is failing. Master state is maybe down.
21	PING_FAILED	MHA Manager detects either a) ping to master failed three times, b) preparing for starting master failover. Master state is maybe down.
30	RETRYING_MONITOR	MHA Manager internal health check program detected that master was not reachable from manager, but after double check MHA Manager verified the master is alive, and currently waiting for retry. Master state is very likely alive.
31	CONFIG_ERROR	There are some configuration problems and MHA Manager can't monitor the target master. Check a logfile for detail. Master state is unknown.
32	TIMESTAMP_OLD	MHA Manager detects that ping to master is ok but status file is not updated for a long time. Check whether MHA Manager itself hangs or not. Master state is unknown.
50	FAILOVER_RUNNING	MHA Manager confirms that master is down and running failover. Master state is dead.
51	FAILOVER_ERROR	MHA Manager confirms that master is down and running failover, but failed during failover. Master state is dead.

從上面證據上分析，MHA應該是與主節點之間建立一個長連線，並執行以下sql語句來檢查資料庫的狀態包括程式狀態，並透過ping命令來檢查主機狀態，僅僅是我個人猜測。

關於幾個問題的回答

1: 自動切換, 你是用什麼方法讓原master 異常的?

關閉資料庫、關機主機，命令如下：

/usr/local/mysql/bin/mysql.server.sh stop

Mysqladmin –uroot –p shutdown

reboot

2 : 手工切換, 能否指明哪個伺服器為新master ?

可以透過設定權重讓這個伺服器優先順序高一下，但並不是總能實現。

3: 看到你提到了 keepalived , 最後是否沒有使用它, 而是使用了 MHA 自帶的 vip管理功能 ?

使用了MHA自帶的模板來完善的。

4 : 修改權重, 只是修改了slave 升級為master 的級別, 這個在哪裡修改, 能否徹底關閉一個 slave , 要求它不能升級為 master ?

修改管理節點上的配置檔案，設定no_master=1

問題5: 7.手工切換需要先定義好master_ip_online_change_script指令碼 --- 如果不定義, 會有什麼後果 ?

不定義就只發生資料庫的切換，而ip地址不會被接管

問題6: master 檢測故障的檢測點是哪幾個?

從官方文件上沒有找到確切的答案，從程式碼上分析，由於程式碼比較雜亂，暫時沒有能理找到太多線索。

root@ubuntu:/masterha/app1#vi /opt/mha4mysql-manager-0.55/bin/masterha_master_monitor

use strict;

use warnings FATAL => 'all';

use English qw(-no_match_vars);

use Getopt::Long qw(:config pass_through);

use Pod::Usage;

use MHA::MasterMonitor;

use MHA::ManagerConst;

my $help;

my $version;

$| = 1;

GetOptions(

'help' => \$help,

'version' => \$version,

);

if ($version) {

print "masterha_master_monitor version $MHA::ManagerConst::VERSION.\n";

exit 0;

}

if ($help) {

pod2usage(0);

}

my ( $exit_code, $dead_master, $ssh_reachable ) =

MHA::MasterMonitor::main( "--monitor_only", @ARGV );

if ( $dead_master->{hostname} ) {

print "Master $dead_master->{hostname} is dead!\n";

print "IP Address: $dead_master->{ip} ";

print "Port: $dead_master->{port}\n";

}

if ( $exit_code && $exit_code eq "0" ) {

if ($ssh_reachable) {

print "SSH: reachable\n";

}

else {

print "SSH: NOT reachable\n";

}

exit 0;

}

exit 1 if ( !defined($exit_code) );

exit $exit_code;

查詢MasterMonitor包檔案，在下面的檔案中定義

/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm

其中又呼叫了

use MHA::ServerManager;

use MHA::HealthCheck;

use MHA::FileStatus;

use MHA::SSHCheck;

檢視/usr/local/share/perl/5.14.2/MHA/HealthCheck.pm檔案

程式碼比較長，從中可以看到一些登陸mysql，以及ssh的資訊：

my $log = $self->{logger};

$self->{dbh} = DBI->connect(

"DBI:mysql:;host=$self->{ip};"

. "port=$self->{port};mysql_connect_timeout=$connect_timeout",

$self->{user},

$self->{password},

{ PrintError => 0, RaiseError => $raise_error }

);

sub invoke_ssh_check {

my $self = shift;

my $log = $self->{logger};

if ( !$self->{_ssh_check_invoked} ) {

if ( $self->{_ssh_check_pid} = fork ) {

$self->{_ssh_check_invoked} = 1;

}

elsif ( defined $self->{_ssh_check_pid} ) {

$SIG{INT} = $SIG{HUP} = $SIG{QUIT} = $SIG{TERM} = "DEFAULT";

$log->info("Executing SSH check script: $self->{ssh_check_command}");

#child ssh check process

exit ssh_check(

$self->{ssh_user}, $self->{ssh_host},

$self->{ssh_ip}, $self->{ssh_port},

$self->{logger}, $self->{ssh_connection_timeout},

$self->{ssh_check_command}

);

}

else {

croak "Forking SSH check process failed. Can't continue operation.\n";

}

MysqlMaster切換方案MHA的探索與測試結果

相關文章