介紹
MHA(Master High Availability)目前在MySQL高可用方面是一個相對成熟的解決方案,是一套優秀的作為MySQL高可用性環境下故障切換和主從提升的高可用軟體。在MySQL故障切換過程中,MHA能做到在0~30秒之內自動完成資料庫的故障切換操作,並且在進行故障切換的過程中,MHA能在最大程度上保證資料的一致性,以達到真正意義上的高可用。它由兩部分組成:MHA Manager(管理節點)和MHA Node(資料節點)。MHA Manager可以單獨部署在一臺獨立的機器上管理多個master-slave叢集,也可以部署在一臺slave節點上。MHA Node執行在每臺MySQL伺服器上,MHA Manager會定時探測叢集中的master節點,當master出現故障時,它可以自動將最新資料的slave提升為新的master,然後將所有其他的slave重新指向新的master。整個故障轉移過程對應用程式完全透明。在MHA自動故障切換過程中,MHA試圖從當機的主伺服器上儲存二進位制日誌,最大程度的保證資料的不丟失,但這並不總是可行的。例如,如果主伺服器硬體故障或無法通過ssh訪問,MHA沒法儲存二進位制日誌,只進行故障轉移而丟失了最新的資料。使用MySQL 5.5的半同步複製,可以大大降低資料丟失的風險。MHA可以與半同步複製結合起來。如果只有一個slave已經收到了最新的二進位制日誌,MHA可以將最新的二進位制日誌應用於其他所有的slave伺服器上,因此可以保證所有節點的資料一致性。
原理
(1)從當機崩潰的master儲存二進位制日誌事件(binlog events);
(2)識別含有最新更新的slave;
(3)應用差異的中繼日誌(relay log)到其他的slave;
(4)應用從master儲存的二進位制日誌事件(binlog events);
(5)提升一個slave為新的master;
(6)使其他的slave連線新的master進行復制;
MHA軟體由兩部分組成,Manager工具包和Node工具包
Manager工具包主要包括以下幾個工具:
masterha_check_ssh 檢查MHA的SSH配置狀況 masterha_check_repl 檢查MySQL複製狀況 masterha_manger 啟動MHA masterha_check_status 檢測當前MHA執行狀態 masterha_master_monitor 檢測master是否當機 masterha_master_switch 控制故障轉移(自動或者手動) masterha_conf_host 新增或刪除配置的server資訊
Node工具包(這些工具通常由MHA Manager的指令碼觸發,無需人為操作)主要包括以下幾個工具:
save_binary_logs 儲存和複製master的二進位制日誌 apply_diff_relay_logs 識別差異的中繼日誌事件並將其差異的事件應用於其他的slave filter_mysqlbinlog 去除不必要的ROLLBACK事件(MHA已不再使用這個工具) purge_relay_logs 清除中繼日誌(不會阻塞SQL執行緒)
一、安裝MHA
1.建立安裝目錄
Node伺服器安裝
mkdir -p /usr/local/mha
manage伺服器安裝
mkdir -p /usr/local/mha/ha1/fail_script
mkdir -p /usr/local/mha/ha1/workdir
/usr/local/mha:程式安裝目錄
/usr/local/mha/ha1:用於區別每一個mha方案,當前方案ha1
/usr/local/mha/ha1/fail_script:方案ha1的failover指令碼儲存路徑
/usr/local/mha/ha1/workdir:方案ha1的的日誌和failover產生的binlog儲存路徑
2.安裝epel外掛
使用yum方式安裝,需要安裝epel源
epel源
wget http://mirrors.ustc.edu.cn/fedora/epel/6/x86_64/epel-release-6-8.noarch.rpm
所有伺服器都安裝(mananage需要安裝以下所有外掛,node節點只需要安裝perl-DBD-MySQL,cpan)
yum install -y perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes cpan
也可以使用perl方式安裝
#!/bin/bash wget http://xrl.us/cpanm --no-check-certificate mv cpanm /usr/bin chmod 755 /usr/bin/cpanm cat > /root/list << EOF install DBD::mysql install Config::Tiny install Log::Dispatch install Parallel::ForkManager install Time::HiRes install CPAN install Digest::SHA EOF for package in `cat /root/list` do cpanm $package done
3.安裝MHA Node軟體包,所有伺服器都要安裝
tar -xvf mha4mysql-node-0.54.tar.gz cd mha4mysql-node-0.54 perl Makefile.PL INSTALL_BASE=/usr/local/mha make && make install
4.安裝MHA Manager軟體包,只在Manager主機上安裝
tar -xvf mha4mysql-manager-0.55.tar.gz cd mha4mysql-manager-0.55 perl Makefile.PL INSTALL_BASE=/usr/local/mha make && make install
cp samples/scripts/* /usr/local/mha/bin/
master_ip_failover:自動切換時vip管理的指令碼
master_ip_online_change:手動切換使用的指令碼
power_manager:故障發生後關閉主機的指令碼
send_report:傳送報警的指令碼。
5.修改環境變數
將MHA Manager主機的/usr/local/mha/bin加入環境變數
6.新增軟連結
為了不麻煩所有伺服器都執行吧,其實最後兩個mysql,mysqlbinlog的軟連結只有Node伺服器需要新增,其它的所有伺服器都需要新增。
mkdir -p /usr/local/bin mkdir -p /usr/local/share/man/man1 mkdir -p /usr/local/share/perl5/MHA ln -s /usr/local/mha/bin/* /usr/local/bin; ln -s /usr/local/mha/man/man1/* /usr/local/share/man/man1; ln -s /usr/local/mha/lib/perl5/MHA /usr/local/share/perl5/MHA; ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog; ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql;
二、配置MHA
1.配置SSH無密碼登入
(1)在manage配置到所有Node節點的無密碼登入
ssh-keygen -t rsa 一直enter,會在/root/.ssh/下面生成id_rsa.pub ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.10 ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.20 ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.30
(2)在Node 10配置到Node 20,30的無密碼登入
ssh-keygen -t rsa ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.20 ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.30
(3)在Node 20配置到Node 10,30的無密碼登入
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.10
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.30
(4)在Node 30配置到Node 10,20的無密碼登入
ssh-keygen -t rsa ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.10 ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.20
2. 搭建複製環境
複製環境之前已經搭建好,可以參考我前面寫過的文章,複製使用者和密碼都是repl;每個Node上都必須建立這個repl賬號,除非Node不作為故障專業的master
1.在所有Node上建立Manage監控使用者
grant all privileges on *.* to 'root'@'192.168.137.%' identified by 'root';
三、配置Manage
1.配置啟動檔案
vim /usr/local/mha/ha1/ha1.cnf
[server default] manager_workdir=/usr/local/mha/ha1/workdir ##專案的主目錄 manager_log=/usr/local/mha/ha1/workdir/manager.log ###mha記錄日誌 master_binlog_dir=/mysql/log ####node伺服器的binlog存放路徑,如果每個node的binlog路徑不一致的話就在下面的每個server下面單獨配置 master_ip_failover_script=/usr/local/mha/ha1/fail_script/master_ip_failover ####mha線上自動failover時處理VIP的配置檔案 master_ip_online_change_script=/usr/local/mha/ha1/fail_script/master_ip_online_change ####線上手動執行master切換時VIP的處理檔案 secondary_check_script=/usr/local/mha/bin/masterha_secondary_check -s backup -s master --user=root --master_host=master --master_ip=192.168.137.10 --master_port=3306 ##一旦MHA到master之間的網路出現問題,manager會嘗試從backup登入到masger #report_script=/usr/local/mha/ha1/fail_script/send_report ###發生切換後執行的報警指令碼 shutdown_script="" ####故障後關閉master主機的指令碼(主要是使用keepalive做VIP時會出現腦裂導致VIP頻繁切換所以會將故障的master關閉) ping_interval=1 ###監控mater,ping的頻率 remote_workdir=/tmp ###node伺服器在發生master切換時,binlog保持的路徑,每個node都會在該目錄下儲存一份差異的binlog,除非沒有差異。 repl_password=repl ##複製使用的使用者名稱,每個node伺服器都需要存在 repl_user=repl ##複製使用的密碼 user=root ##mnager監控用的mysql root使用者 password=root ##root使用者密碼 ssh_user=root ##ssh登入使用者名稱 [server1] hostname=192.168.137.10 port=3306 candidate_master=1 check_repl_delay=0 [server2] hostname=192.168.137.20 port=3306 #master_binlog_dir=/mysql/log candidate_master=1 ##設定為候選master,如果設定該引數以後,發生主從切換以後將會將此從庫提升為主庫,即使這個主庫不是叢集中事件最新的slave check_repl_delay=0 ##預設情況下如果一個slave落後master 100M的relay logs的話,MHA將不會選擇該slave作為一個新的master,因為對於這個slave的恢復需要花費很長時間,通過設定check_repl_delay=0,MHA觸發切換在選擇一個新master的時候將會忽略複製延時,這個引數對於設定了candidate_master=1的主機非常有用,因為這個候選主在切換的過程中一定是新的master [server3] hostname=192.168.137.30 port=3306 ignore_fail=1 ####如果不加上該引數,當該slave主機故障了,mha將無法啟動,加上該引數會忽略該主機是否正常,在mha啟動的時候加上引數--ignore_fail_on_start no_master=1 ###不將該主機轉換為master
注意:對於上面的配置一定要確保server1和server2之間是最新的binlog,一般會配置二者為雙主的半同步複製,這樣就保證了它們之間的binlog是最新的,否則應用差異的binlog將花費非常長的時間(如果它們和master延時非常大的情況下)
2.master_ip_failover
VIP的配置可以使用keepalived也可以寫指令碼,keepalived對網路的要求很高否則容易腦裂,在我前面搭建雙主環境講過keepalived的搭建方法,我這裡使用指令碼的方式。
#!/usr/bin/env perl use strict; use warnings FATAL => 'all'; use Getopt::Long; my ( $command, $ssh_user, $orig_master_host, $orig_master_ip, $orig_master_port, $new_master_host, $new_master_ip, $new_master_port, $new_master_user, $new_master_password ); my $vip = '192.168.137.50/24'; ###VIP my $key = '1'; ###用於區別本身的eth0 my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip"; my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down"; GetOptions( 'command=s' => \$command, 'ssh_user=s' => \$ssh_user, 'orig_master_host=s' => \$orig_master_host, 'orig_master_ip=s' => \$orig_master_ip, 'orig_master_port=i' => \$orig_master_port, 'new_master_host=s' => \$new_master_host, 'new_master_ip=s' => \$new_master_ip, 'new_master_port=i' => \$new_master_port, 'new_master_user=s' => \$new_master_user, 'new_master_password=s' => \$new_master_password, ); exit &main(); sub main { print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n"; if ( $command eq "stop" || $command eq "stopssh" ) { my $exit_code = 1; eval { print "Disabling the VIP on old master: $orig_master_host \n"; &stop_vip(); $exit_code = 0; }; if ($@) { warn "Got Error: $@\n"; exit $exit_code; } exit $exit_code; } elsif ( $command eq "start" ) { my $exit_code = 10; eval { print "Enabling the VIP - $vip on the new master - $new_master_host \n"; &start_vip(); $exit_code = 0; }; if ($@) { warn $@; exit $exit_code; } exit $exit_code; } elsif ( $command eq "status" ) { print "Checking the Status of the script.. OK \n"; exit 0; } else { &usage(); exit 1; } } sub start_vip() { `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`; } sub stop_vip() { return 0 unless ($ssh_user); `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`; } sub usage { print "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n"; }
注意:需要手動先在master伺服器上面新增VIP
/sbin/ifconfig eth0:1 192.168.137.50/24
3.master_ip_online_change
perl指令碼
#!/usr/bin/env perl use strict; use warnings FATAL =>'all'; use Getopt::Long; my $vip = '192.168.137.50/24'; # Virtual IP my $key = "1"; my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip"; my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down"; my $exit_code = 0; my ( $command, $orig_master_is_new_slave, $orig_master_host, $orig_master_ip, $orig_master_port, $orig_master_user, $orig_master_password, $new_master_host, $new_master_ip, $new_master_port, $new_master_user, $new_master_password, ); GetOptions( 'command=s' => \$command, 'orig_master_is_new_slave' => \$orig_master_is_new_slave, 'orig_master_host=s' => \$orig_master_host, 'orig_master_ip=s' => \$orig_master_ip, 'orig_master_port=i' => \$orig_master_port, 'orig_master_user=s' => \$orig_master_user, 'orig_master_password=s' => \$orig_master_password, 'new_master_host=s' => \$new_master_host, 'new_master_ip=s' => \$new_master_ip, 'new_master_port=i' => \$new_master_port, 'new_master_user=s' => \$new_master_user, 'new_master_password=s' => \$new_master_password, ); exit &main(); sub main { #print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n"; if ( $command eq "stop" || $command eq "stopssh" ) { # $orig_master_host, $orig_master_ip, $orig_master_port are passed. # If you manage master ip address at global catalog database, # invalidate orig_master_ip here. my $exit_code = 1; eval { print "\n\n\n***************************************************************\n"; print "Disabling the VIP - $vip on old master: $orig_master_host\n"; print "***************************************************************\n\n\n\n"; &stop_vip(); $exit_code = 0; }; if ($@) { warn "Got Error: $@\n"; exit $exit_code; } exit $exit_code; } elsif ( $command eq "start" ) { # all arguments are passed. # If you manage master ip address at global catalog database, # activate new_master_ip here. # You can also grant write access (create user, set read_only=0, etc) here. my $exit_code = 10; eval { print "\n\n\n***************************************************************\n"; print "Enabling the VIP - $vip on new master: $new_master_host \n"; print "***************************************************************\n\n\n\n"; &start_vip(); $exit_code = 0; }; if ($@) { warn $@; exit $exit_code; } exit $exit_code; } elsif ( $command eq "status" ) { print "Checking the Status of the script.. OK \n"; `ssh $orig_master_user\@$orig_master_host \" $ssh_start_vip \"`; exit 0; } else { &usage(); exit 1; } } # A simple system call that enable the VIP on the new master sub start_vip() { `ssh $new_master_user\@$new_master_host \" $ssh_start_vip \"`; } # A simple system call that disable the VIP on the old_master sub stop_vip() { `ssh $orig_master_user\@$orig_master_host \" $ssh_stop_vip \"`; } sub usage { print "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n"; }
shell指令碼
#/bin/bash #source /root/.bash_profile vip=`echo '192.168.137.50/24'` # Virtual IP key=`echo '1'` command=`echo "$1" | awk -F = '{print $2}'` orig_master_host=`echo "$2" | awk -F = '{print $2}'` new_master_host=`echo "$7" | awk -F = '{print $2}'` stop_vip=`echo "ssh root@$orig_master_host /sbin/ifconfig eth0:$key down"` start_vip=`echo "ssh root@$new_master_host /sbin/ifconfig eth0:$key $vip"` if [ $command = 'stop' ] then echo -e "\n\n\n***************************************************************\n" echo -e "Disabling the VIP - $vip on old master: $orig_master_host\n" $stop_vip if [ $? -eq 0 ] then echo "Disabled the VIP successfully" else echo "Disabled the VIP failed" fi echo -e "***************************************************************\n\n\n\n" fi if [ $command = 'start' -o $command = 'status' ] then echo -e "\n\n\n***************************************************************\n" echo -e "Enabling the VIP - $vip on new master: $new_master_host \n" $start_vip if [ $? -eq 0 ] then echo "Enabled the VIP successfully" else echo "Enabled the VIP failed" fi echo -e "***************************************************************\n\n\n\n" fi
4.send_report
#!/usr/bin/perl # Copyright (C) 2011 DeNA Co.,Ltd. # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA ## Note: This is a sample script and is not complete. Modify the script based on your environment. use strict; use warnings FATAL => 'all'; use Mail::Sender; use Getopt::Long; #new_master_host and new_slave_hosts are set only when recovering master succeeded my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body ); my $smtp='smtp.163.com'; my $mail_from='xxxx'; my $mail_user='xxxxx'; my $mail_pass='xxxxx'; my $mail_to=['xxxx','xxxx']; GetOptions( 'orig_master_host=s' => \$dead_master_host, 'new_master_host=s' => \$new_master_host, 'new_slave_hosts=s' => \$new_slave_hosts, 'subject=s' => \$subject, 'body=s' => \$body, ); mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body); sub mailToContacts { my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_; open my $DEBUG, "> /tmp/monitormail.log" or die "Can't open the debug file:$!\n"; my $sender = new Mail::Sender { ctype => 'text/plain; charset=utf-8', encoding => 'utf-8', smtp => $smtp, from => $mail_from, auth => 'LOGIN', TLS_allowed => '0', authid => $user, authpwd => $passwd, to => $mail_to, subject => $subject, debug => $DEBUG }; $sender->MailMsg( { msg => $msg, debug => $DEBUG } ) or print $Mail::Sender::Error; return 1; } # Do whatever you want here exit 0;
四、配置relay_log的清除方式(在每個Node上)
(1)所有Node的cnf配置檔案加上
relay_log_purge=0
MHA在發生切換的過程中,從庫的恢復過程中依賴於relay log的相關資訊,所以這裡要將relay log的自動清除設定為OFF,採用手動清除relay log的方式。
在預設情況下,從伺服器上的中繼日誌會在SQL執行緒執行完畢後被自動刪除。但是在MHA環境中,這些中繼日誌在恢復其他從伺服器時可能會被用到,因此需要禁用中繼日誌的自動刪除功能。定期清除中繼日誌需要考慮到複製延時的問題。在ext3的檔案系統下,刪除大的檔案需要一定的時間,會導致嚴重的複製延時。為了避免複製延時,需要暫時為中繼日誌建立硬連結,因為在linux系統中通過硬連結刪除大檔案速度會很快。
提示:在mysql資料庫中,刪除大表時,通常也採用建立硬連結的方式
MHA節點中包含了pure_relay_logs命令工具,它可以為中繼日誌建立硬連結,執行SET GLOBAL relay_log_purge=1,等待幾秒鐘以便SQL執行緒切換到新的中繼日誌,再執行SET GLOBAL relay_log_purge=0。
pure_relay_logs指令碼引數如下所示:
--user mysql 使用者名稱 --password mysql 密碼 --port 埠號 --workdir 指定建立relay log的硬連結的位置,預設是/var/tmp,由於系統不同分割槽建立硬連結檔案會失敗,故需要執行硬連結具體位置,成功執行指令碼後,硬連結的中繼日誌檔案被刪除 --disable_relay_log_purge 預設情況下,如果relay_log_purge=1,指令碼會什麼都不清理,自動退出,通過設定這個引數,當relay_log_purge=1的情況下會將relay_log_purge設定為0。清理relay log之後,最後將引數設定為OFF。
(2)在每臺slave Node上建立
vim /usr/local/mha/purge_relay_log.sh
#!/bin/bash
user=root
passwd=root ####確保使用者和密碼能通過127.0.0.1登入
host='127.0.0.1'
port=3306
work_dir='/mysql/data'
purge='/usr/local/mha/bin/purge_relay_logs'
$purge --user=$user --password=$passwd --host=$host --disable_relay_log_purge --port=$port --workdir=$work_dir >> /usr/local/mha/purge_relay_logs.log 2>&1
chmod u+x /usr/local/mha/purge_relay_log.sh
將指令碼加入到os定時任務中
五、檢測啟動MHA
1.檢查ssh配置
masterha_check_ssh --conf=/usr/local/mha/ha1/ha1.cnf
[root@monitor ha1]# masterha_check_ssh --conf=/usr/local/mha/ha1/ha1.cnf Thu Aug 25 14:53:30 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Thu Aug 25 14:53:30 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Thu Aug 25 14:53:30 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Thu Aug 25 14:53:30 2016 - [info] Starting SSH connection tests.. Thu Aug 25 14:53:35 2016 - [debug] Thu Aug 25 14:53:31 2016 - [debug] Connecting via SSH from root@192.168.137.20(192.168.137.20:22) to root@192.168.137.10(192.168.137.10:22).. Thu Aug 25 14:53:33 2016 - [debug] ok. Thu Aug 25 14:53:33 2016 - [debug] Connecting via SSH from root@192.168.137.20(192.168.137.20:22) to root@192.168.137.30(192.168.137.30:22).. Thu Aug 25 14:53:34 2016 - [debug] ok. Thu Aug 25 14:53:35 2016 - [debug] Thu Aug 25 14:53:31 2016 - [debug] Connecting via SSH from root@192.168.137.30(192.168.137.30:22) to root@192.168.137.10(192.168.137.10:22).. Thu Aug 25 14:53:33 2016 - [debug] ok. Thu Aug 25 14:53:33 2016 - [debug] Connecting via SSH from root@192.168.137.30(192.168.137.30:22) to root@192.168.137.20(192.168.137.20:22).. Thu Aug 25 14:53:34 2016 - [debug] ok. Thu Aug 25 14:53:36 2016 - [debug] Thu Aug 25 14:53:30 2016 - [debug] Connecting via SSH from root@192.168.137.10(192.168.137.10:22) to root@192.168.137.20(192.168.137.20:22).. Thu Aug 25 14:53:34 2016 - [debug] ok. Thu Aug 25 14:53:34 2016 - [debug] Connecting via SSH from root@192.168.137.10(192.168.137.10:22) to root@192.168.137.30(192.168.137.30:22).. Thu Aug 25 14:53:35 2016 - [debug] ok. Thu Aug 25 14:53:36 2016 - [info] All SSH connection tests passed successfully.
可以看到每個Node到其它的Node都是相通的。
2.檢查整個複製環境
masterha_check_repl --conf=/usr/local/mha/ha1/ha1.cnf
[root@monitor ha1]# masterha_check_repl --conf=/usr/local/mha/ha1/ha1.cnf Thu Aug 25 16:09:19 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Thu Aug 25 16:09:19 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Thu Aug 25 16:09:19 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Thu Aug 25 16:09:19 2016 - [info] MHA::MasterMonitor version 0.55. Thu Aug 25 16:09:20 2016 - [info] Dead Servers: Thu Aug 25 16:09:20 2016 - [info] Alive Servers: Thu Aug 25 16:09:20 2016 - [info] 192.168.137.10(192.168.137.10:3306) Thu Aug 25 16:09:20 2016 - [info] 192.168.137.20(192.168.137.20:3306) Thu Aug 25 16:09:20 2016 - [info] 192.168.137.30(192.168.137.30:3306) Thu Aug 25 16:09:20 2016 - [info] Alive Slaves: Thu Aug 25 16:09:20 2016 - [info] 192.168.137.20(192.168.137.20:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Thu Aug 25 16:09:20 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Thu Aug 25 16:09:20 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Aug 25 16:09:20 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Thu Aug 25 16:09:20 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Thu Aug 25 16:09:20 2016 - [info] Not candidate for the new Master (no_master is set) Thu Aug 25 16:09:20 2016 - [info] Current Alive Master: 192.168.137.10(192.168.137.10:3306) Thu Aug 25 16:09:20 2016 - [info] Checking slave configurations.. Thu Aug 25 16:09:20 2016 - [info] read_only=1 is not set on slave 192.168.137.20(192.168.137.20:3306). Thu Aug 25 16:09:20 2016 - [info] Checking replication filtering settings.. Thu Aug 25 16:09:20 2016 - [info] binlog_do_db= , binlog_ignore_db= Thu Aug 25 16:09:20 2016 - [info] Replication filtering check ok. Thu Aug 25 16:09:20 2016 - [info] Starting SSH connection tests.. Thu Aug 25 16:09:25 2016 - [info] All SSH connection tests passed successfully. Thu Aug 25 16:09:25 2016 - [info] Checking MHA Node version.. Thu Aug 25 16:09:26 2016 - [info] Version check ok. Thu Aug 25 16:09:26 2016 - [info] Checking SSH publickey authentication settings on the current master.. Thu Aug 25 16:09:27 2016 - [info] HealthCheck: SSH to 192.168.137.10 is reachable. Thu Aug 25 16:09:29 2016 - [info] Master MHA Node version is 0.54. Thu Aug 25 16:09:29 2016 - [info] Checking recovery script configurations on the current master.. Thu Aug 25 16:09:29 2016 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/log --output_file=/tmp/save_binary_logs_test --manager_version=0.55 --start_file=mysql-bin.000138 Thu Aug 25 16:09:29 2016 - [info] Connecting to root@192.168.137.10(192.168.137.10).. Creating /tmp if not exists.. ok. Checking output directory is accessible or not.. ok. Binlog found at /mysql/log, up to mysql-bin.000138 Thu Aug 25 16:09:30 2016 - [info] Master setting check done. Thu Aug 25 16:09:30 2016 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers.. Thu Aug 25 16:09:30 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.137.20 --slave_ip=192.168.137.20 --slave_port=3306 --workdir=/tmp --target_version=5.6.15-log --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ --slave_pass=xxx Thu Aug 25 16:09:30 2016 - [info] Connecting to root@192.168.137.20(192.168.137.20:22).. Checking slave recovery environment settings.. Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000006 Temporary relay log file is /mysql/data/mysql-relay-bin.000006 Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Thu Aug 25 16:09:31 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.137.30 --slave_ip=192.168.137.30 --slave_port=3306 --workdir=/tmp --target_version=5.6.15-log --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ --slave_pass=xxx Thu Aug 25 16:09:31 2016 - [info] Connecting to root@192.168.137.30(192.168.137.30:22).. Checking slave recovery environment settings.. Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000002 Temporary relay log file is /mysql/data/mysql-relay-bin.000002 Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Thu Aug 25 16:09:32 2016 - [info] Slaves settings check done. Thu Aug 25 16:09:32 2016 - [info] 192.168.137.10 (current master) +--192.168.137.20 +--192.168.137.30 Thu Aug 25 16:09:32 2016 - [info] Checking replication health on 192.168.137.20.. Thu Aug 25 16:09:32 2016 - [info] ok. Thu Aug 25 16:09:32 2016 - [info] Checking replication health on 192.168.137.30.. Thu Aug 25 16:09:32 2016 - [info] ok. Thu Aug 25 16:09:32 2016 - [info] Checking master_ip_failover_script status: Thu Aug 25 16:09:32 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306 IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Checking the Status of the script.. OK Thu Aug 25 16:09:32 2016 - [info] OK. Thu Aug 25 16:09:32 2016 - [warning] shutdown_script is not defined. Thu Aug 25 16:09:32 2016 - [info] Got exit code 0 (Not master dead). MySQL Replication Health is OK.
--ignore_fail_on_start: 當有slave 節點宕掉時,預設是啟動不了的,加上 --ignore_fail_on_start 即使有節點宕掉也能啟動MHA,加上該引數會忽略啟動檔案中配置ignore_fail=1的server
3.檢查MHA Manager狀態
masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf
由於mha還沒有啟動,所以這裡檢測是stopped
4.啟動MHA
nohup masterha_manager --conf=/usr/local/mha/ha1/ha1.cnf --ignore_fail_on_start --ignore_last_failover < /dev/null > /usr/local/mha/ha1/start.log 2>&1 &
--remove_dead_master_conf:該引數代表當發生主從切換後,老的主庫的ip將會從配置檔案中移除。這裡暫時不使用該引數,因為發生使用該引數會將ha1.cnf配置檔案搞亂。
--start_log:日誌。
--ignore_last_failover:發生主從切換後,MHAmanager服務會自動停掉,且在manager_workdir目錄下面生成檔案app1.failover.complete,若要啟動MHA,必須先刪除該檔案,該引數代表忽略上次MHA觸發切換產生的檔案,這裡設定為-ignore_last_failover。 在預設情況下,如果MHA檢測到連續發生當機,且兩次當機間隔不足8小時的話,則不會進行Failover,之所以這樣限制是為了避免ping-pong效應。
--ignore_fail_on_start: 當有slave 節點宕掉時,預設是啟動不了的,加上 --ignore_fail_on_start 即使有節點宕掉也能啟動MHA,加上該引數會忽略啟動檔案中配置ignore_fail=1的server。
(1)再次檢視MHA狀態是否正常:
[root@monitor ha1]# masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf ha1 (pid:6371) is running(0:PING_OK), master:192.168.137.10 [root@monitor ha1]#
(2)檢視啟動日誌
cat manager.log
Thu Aug 25 17:11:50 2016 - [info] 192.168.137.10 (current master) +--192.168.137.20 +--192.168.137.30 Thu Aug 25 17:11:50 2016 - [info] Checking master_ip_failover_script status: Thu Aug 25 17:11:50 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306 IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Checking the Status of the script.. OK Thu Aug 25 17:11:50 2016 - [info] OK. Thu Aug 25 17:11:50 2016 - [warning] shutdown_script is not defined. Thu Aug 25 17:11:50 2016 - [info] Set master ping interval 1 seconds. Thu Aug 25 17:11:50 2016 - [info] Set secondary check script: /usr/local/mha/bin/masterha_secondary_check -s backup -s master --user=root --master_host=master --master_ip=192.168.137.10 --master_port=3306 Thu Aug 25 17:11:50 2016 - [info] Starting ping health check on 192.168.137.10(192.168.137.10:3306).. Thu Aug 25 17:11:50 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond.. [root@monitor ha1]#
(3)產生的檔案
ha1.master_status.health:mha正常啟動會產生該檔案
manager.log:mha監控日誌
start.log:mha啟動時生成的日誌
5.關閉MHA
masterha_stop --conf=/usr/local/mha/ha1/ha1.cnf
六、故障處理步驟
發生主從切換後,MHA服務會自動停掉
1.檢查日誌
檢查故障處理的日誌,確保故障正常轉移。
cat /usr/local/mha/ha1/manager.log
2.處理故障master
處理故障的master,將其配置為從庫chang到新的master,可以從manager.log找到change語句。
grep "CHANGE MASTER TO MASTER" /usr/local/mha/ha1/manager.log | tail -1
Fri Aug 26 12:04:22 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000143', MASTER_LOG_POS=22123166, MASTER_USER='repl', MASTER_PASSWORD='xxx';
注意:這裡要確保slave的SQL_THREAD和IO_TRREAD正常,如果是配置半同步複製要確保半同步複製啟動正常,可以執行"show status like '%rpl_%';",具體參考前面半同步複製的搭建。
3.修改ha1.cnf配置檔案
需要修改"secondary_check_script"選項中的master_host、master_ip、master_port為新的master;如果兩臺電腦的配置都相同的話其它地方不用修改。
4.刪除fail檔案(非必需)
由於啟動mha的時候加上了--ignore_last_failover引數,所以不刪除failower生成的檔案也能啟動,否則需要刪除failower生成的檔案“ha1.failover.complete”。
rm -f /usr/local/mha/ha1/ha1.failover.complete
5.check檢查
檢查SSH配置 masterha_check_ssh --conf=/usr/local/mha/ha1/ha1.cnf 檢查複製 masterha_check_repl --conf=/usr/local/mha/ha1/ha1.cnf 檢查狀態 masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf
必需保證所有的檢查都通過
6.啟動MHA
nohup masterha_manager --conf=/usr/local/mha/ha1/ha1.cnf --ignore_fail_on_start --ignore_last_failover < /dev/null > /usr/local/mha/ha1/start.log 2>&1 &
七、模擬Failover
1.自動failover
我這裡是非同步複製,137.20是當前的master,然後在137.20上執行併發插入,同時關閉137.10和137.30的IO執行緒,在137.20上壓測一段時間,然後先開啟137.30的IO執行緒,過幾秒鐘再開啟137.10的IO執行緒;保證137.30的binlog比候選的137.10的binlog更新。
master 137.20(22497564)
candidate slave:137.10(pos=9857376)
new replay slave:137.30(pos=22461852)
Fri Aug 26 11:57:36 2016 - [warning] Got error on MySQL select ping: 2013 (Lost connection to MySQL server during query) Fri Aug 26 11:57:36 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/log --output_file=/tmp/save_binary_logs_test --manager_version=0.55 --binlog_prefix=mysql-bin Fri Aug 26 11:57:36 2016 - [info] Executing seconary network check script: /usr/local/mha/bin/masterha_secondary_check -s backup -s master --user=root --master_host=master --master_ip=192.168.137.10 --master_port=3306 --user=root --master_host=192.168.137.20 --master_ip=192.168.137.20 --master_port=3306 Fri Aug 26 11:57:37 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.137.20' (111)) Fri Aug 26 11:57:37 2016 - [warning] Connection failed 1 time(s).. Fri Aug 26 11:57:38 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.137.20' (111)) Fri Aug 26 11:57:38 2016 - [warning] Connection failed 2 time(s).. Fri Aug 26 11:57:38 2016 - [info] HealthCheck: SSH to 192.168.137.20 is reachable. Monitoring server backup is reachable, Master is not reachable from backup. OK. Fri Aug 26 11:57:39 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.137.20' (111)) Fri Aug 26 11:57:39 2016 - [warning] Connection failed 3 time(s).. Monitoring server master is reachable, Master is not reachable from master. OK. Fri Aug 26 11:57:41 2016 - [info] Master is not reachable from all other monitoring servers. Failover should start. Fri Aug 26 11:57:41 2016 - [warning] Master is not reachable from health checker! Fri Aug 26 11:57:41 2016 - [warning] Master 192.168.137.20(192.168.137.20:3306) is not reachable! Fri Aug 26 11:57:41 2016 - [warning] SSH is reachable. Fri Aug 26 11:57:41 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /usr/local/mha/ha1/ha1.cnf again, and trying to connect to all servers to check server status.. Fri Aug 26 11:57:41 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Fri Aug 26 11:57:41 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Fri Aug 26 11:57:41 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Fri Aug 26 11:57:42 2016 - [info] Dead Servers: Fri Aug 26 11:57:42 2016 - [info] 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:42 2016 - [info] Alive Servers: Fri Aug 26 11:57:42 2016 - [info] 192.168.137.10(192.168.137.10:3306) Fri Aug 26 11:57:42 2016 - [info] 192.168.137.30(192.168.137.30:3306) Fri Aug 26 11:57:42 2016 - [info] Alive Slaves: Fri Aug 26 11:57:42 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:42 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:42 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 11:57:42 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:42 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:42 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 11:57:42 2016 - [info] Checking slave configurations.. Fri Aug 26 11:57:42 2016 - [info] read_only=1 is not set on slave 192.168.137.10(192.168.137.10:3306). Fri Aug 26 11:57:42 2016 - [info] Checking replication filtering settings.. Fri Aug 26 11:57:42 2016 - [info] Replication filtering check ok. Fri Aug 26 11:57:42 2016 - [info] Master is down! Fri Aug 26 11:57:42 2016 - [info] Terminating monitoring script. Fri Aug 26 11:57:42 2016 - [info] Got exit code 20 (Master dead). Fri Aug 26 11:57:42 2016 - [info] MHA::MasterFailover version 0.55. Fri Aug 26 11:57:42 2016 - [info] Starting master failover. Fri Aug 26 11:57:42 2016 - [info] Fri Aug 26 11:57:42 2016 - [info] * Phase 1: Configuration Check Phase.. Fri Aug 26 11:57:42 2016 - [info] Fri Aug 26 11:57:44 2016 - [info] Dead Servers: Fri Aug 26 11:57:44 2016 - [info] 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:44 2016 - [info] Checking master reachability via mysql(double check).. Fri Aug 26 11:57:44 2016 - [info] ok. Fri Aug 26 11:57:44 2016 - [info] Alive Servers: Fri Aug 26 11:57:44 2016 - [info] 192.168.137.10(192.168.137.10:3306) Fri Aug 26 11:57:44 2016 - [info] 192.168.137.30(192.168.137.30:3306) Fri Aug 26 11:57:44 2016 - [info] Alive Slaves: Fri Aug 26 11:57:44 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:44 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:44 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 11:57:44 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:44 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:44 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 11:57:44 2016 - [info] ** Phase 1: Configuration Check Phase completed. Fri Aug 26 11:57:44 2016 - [info] Fri Aug 26 11:57:44 2016 - [info] * Phase 2: Dead Master Shutdown Phase.. Fri Aug 26 11:57:44 2016 - [info] Fri Aug 26 11:57:44 2016 - [info] Forcing shutdown so that applications never connect to the current master.. Fri Aug 26 11:57:44 2016 - [info] Executing master IP deactivatation script: Fri Aug 26 11:57:44 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --command=stopssh --ssh_user=root IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Disabling the VIP on old master: 192.168.137.20 Fri Aug 26 11:57:45 2016 - [info] done. Fri Aug 26 11:57:45 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Fri Aug 26 11:57:45 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed. Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:45 2016 - [info] * Phase 3: Master Recovery Phase.. Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:45 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:45 2016 - [info] The latest binary log file/position on all slaves is mysql-bin.000074:22461852 Fri Aug 26 11:57:45 2016 - [info] Latest slaves (Slaves that received relay log files to the latest): Fri Aug 26 11:57:45 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:45 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:45 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 11:57:45 2016 - [info] The oldest binary log file/position on all slaves is mysql-bin.000074:9857376 Fri Aug 26 11:57:45 2016 - [info] Oldest slaves: Fri Aug 26 11:57:45 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:45 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:45 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:45 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:46 2016 - [info] Fetching dead master's binary logs.. Fri Aug 26 11:57:46 2016 - [info] Executing command on the dead master 192.168.137.20(192.168.137.20:3306): save_binary_logs --command=save --start_file=mysql-bin.000074 --start_pos=22461852 --binlog_dir=/mysql/log --output_file=/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 Creating /tmp if not exists.. ok. Concat binary/relay logs from mysql-bin.000074 pos 22461852 to mysql-bin.000074 EOF into /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog .. Dumping binlog format description event, from position 0 to 120.. ok. Dumping effective binlog data from /mysql/log/mysql-bin.000074 position 22461852 to tail(22497564).. ok. Concat succeeded. Fri Aug 26 11:57:49 2016 - [info] scp from root@192.168.137.20:/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog to local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog succeeded. Fri Aug 26 11:57:52 2016 - [info] HealthCheck: SSH to 192.168.137.10 is reachable. Fri Aug 26 11:57:55 2016 - [info] HealthCheck: SSH to 192.168.137.30 is reachable. Fri Aug 26 11:57:55 2016 - [info] Fri Aug 26 11:57:55 2016 - [info] * Phase 3.3: Determining New Master Phase.. Fri Aug 26 11:57:55 2016 - [info] Fri Aug 26 11:57:55 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Fri Aug 26 11:57:55 2016 - [info] Checking whether 192.168.137.30 has relay logs from the oldest position.. Fri Aug 26 11:57:55 2016 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=mysql-bin.000074 --latest_rmlp=22461852 --target_mlf=mysql-bin.000074 --target_rmlp=9857376 --server_id=30 --workdir=/tmp --timestamp=20160826115742 --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ : Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:mysql-relay-bin.000003, start_pos:9857539. Target relay log FOUND! Fri Aug 26 11:57:56 2016 - [info] OK. 192.168.137.30 has all relay logs. Fri Aug 26 11:57:56 2016 - [info] Searching new master from slaves.. Fri Aug 26 11:57:56 2016 - [info] Candidate masters from the configuration file: Fri Aug 26 11:57:56 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:56 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:56 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 11:57:56 2016 - [info] Non-candidate masters: Fri Aug 26 11:57:56 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:56 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:56 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 11:57:56 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events.. Fri Aug 26 11:57:56 2016 - [info] Not found. Fri Aug 26 11:57:56 2016 - [info] Searching from all candidate_master slaves.. Fri Aug 26 11:57:56 2016 - [info] New master is 192.168.137.10(192.168.137.10:3306) Fri Aug 26 11:57:56 2016 - [info] Starting master failover.. Fri Aug 26 11:57:56 2016 - [info] From: 192.168.137.20 (current master) +--192.168.137.10 +--192.168.137.30 To: 192.168.137.10 (new master) +--192.168.137.30 Fri Aug 26 11:57:56 2016 - [info] Fri Aug 26 11:57:56 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. Fri Aug 26 11:57:56 2016 - [info] Fri Aug 26 11:57:56 2016 - [info] Server 192.168.137.10 received relay logs up to: mysql-bin.000074:9857376 Fri Aug 26 11:57:56 2016 - [info] Need to get diffs from the latest slave(192.168.137.30) up to: mysql-bin.000074:22461852 (using the latest slave's relay logs) Fri Aug 26 11:57:56 2016 - [info] Connecting to the latest slave host 192.168.137.30, generating diff relay log files.. Fri Aug 26 11:57:56 2016 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=192.168.137.10 --latest_mlf=mysql-bin.000074 --latest_rmlp=22461852 --target_mlf=mysql-bin.000074 --target_rmlp=9857376 --server_id=30 --diff_file_readtolatest=/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog --workdir=/tmp --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ Fri Aug 26 11:58:02 2016 - [info] Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:mysql-relay-bin.000003, start_pos:9857539. Concat binary/relay logs from mysql-relay-bin.000003 pos 9857539 to mysql-relay-bin.000003 EOF into /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog .. Dumping binlog format description event, from position 0 to 283.. ok. Dumping effective binlog data from /mysql/data/mysql-relay-bin.000003 position 9857539 to tail(22462015).. ok. Concat succeeded. Generating diff relay log succeeded. Saved at /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog . scp slave:/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog to root@192.168.137.10(22) succeeded. Fri Aug 26 11:58:02 2016 - [info] Generating diff files succeeded. Fri Aug 26 11:58:02 2016 - [info] Sending binlog.. Fri Aug 26 11:58:04 2016 - [info] scp from local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog to root@192.168.137.10:/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog succeeded. Fri Aug 26 11:58:04 2016 - [info] Fri Aug 26 11:58:04 2016 - [info] * Phase 3.4: Master Log Apply Phase.. Fri Aug 26 11:58:04 2016 - [info] Fri Aug 26 11:58:04 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Fri Aug 26 11:58:04 2016 - [info] Starting recovery on 192.168.137.10(192.168.137.10:3306).. Fri Aug 26 11:58:04 2016 - [info] Generating diffs succeeded. Fri Aug 26 11:58:04 2016 - [info] Waiting until all relay logs are applied. Fri Aug 26 12:00:06 2016 - [info] done. Fri Aug 26 12:00:06 2016 - [info] Getting slave status.. Fri Aug 26 12:00:06 2016 - [info] This slave(192.168.137.10)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000074:9857376). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 12:00:06 2016 - [info] Connecting to the target slave host 192.168.137.10, running recover script.. Fri Aug 26 12:00:06 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.10 --slave_ip=192.168.137.10 --slave_port=3306 --apply_files=/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog,/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 12:04:22 2016 - [info] Concat all apply files to /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog .. Copying the first binlog file /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog to /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog.. ok. Dumping binlog head events (rotate events), skipping format description events from /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog.. dumped up to pos 120. ok. /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog has effective binlog events from pos 120. Dumping effective binlog data from /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog position 120 to tail(35832).. ok. Concat succeeded. All apply target binary logs are concatinated at /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog . MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog,/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog on 192.168.137.10:3306. This may take long time... Applying log files succeeded. Fri Aug 26 12:04:22 2016 - [info] All relay logs were successfully applied. Fri Aug 26 12:04:22 2016 - [info] Getting new master's binlog name and position.. Fri Aug 26 12:04:22 2016 - [info] mysql-bin.000143:22123166 Fri Aug 26 12:04:22 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000143', MASTER_LOG_POS=22123166, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Fri Aug 26 12:04:22 2016 - [info] Executing master IP activate script: Fri Aug 26 12:04:22 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --new_master_host=192.168.137.10 --new_master_ip=192.168.137.10 --new_master_port=3306 --new_master_user='root' --new_master_password='root' IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Enabling the VIP - 192.168.137.50/24 on the new master - 192.168.137.10 Fri Aug 26 12:04:25 2016 - [info] OK. Fri Aug 26 12:04:25 2016 - [info] ** Finished master recovery successfully. Fri Aug 26 12:04:25 2016 - [info] * Phase 3: Master Recovery Phase completed. Fri Aug 26 12:04:25 2016 - [info] Fri Aug 26 12:04:25 2016 - [info] * Phase 4: Slaves Recovery Phase.. Fri Aug 26 12:04:25 2016 - [info] Fri Aug 26 12:04:25 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. Fri Aug 26 12:04:25 2016 - [info] Fri Aug 26 12:04:25 2016 - [info] -- Slave diff file generation on host 192.168.137.30(192.168.137.30:3306) started, pid: 5029. Check tmp log /usr/local/mha/ha1/192.168.137.30_3306_20160826115742.log if it takes time.. Fri Aug 26 12:04:26 2016 - [info] Fri Aug 26 12:04:26 2016 - [info] Log messages from 192.168.137.30 ... Fri Aug 26 12:04:26 2016 - [info] Fri Aug 26 12:04:25 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Fri Aug 26 12:04:26 2016 - [info] End of log messages from 192.168.137.30. Fri Aug 26 12:04:26 2016 - [info] -- 192.168.137.30(192.168.137.30:3306) has the latest relay log events. Fri Aug 26 12:04:26 2016 - [info] Generating relay diff files from the latest slave succeeded. Fri Aug 26 12:04:26 2016 - [info] Fri Aug 26 12:04:26 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. Fri Aug 26 12:04:26 2016 - [info] Fri Aug 26 12:04:26 2016 - [info] -- Slave recovery on host 192.168.137.30(192.168.137.30:3306) started, pid: 5031. Check tmp log /usr/local/mha/ha1/192.168.137.30_3306_20160826115742.log if it takes time.. Fri Aug 26 12:04:32 2016 - [info] Fri Aug 26 12:04:32 2016 - [info] Log messages from 192.168.137.30 ... Fri Aug 26 12:04:32 2016 - [info] Fri Aug 26 12:04:26 2016 - [info] Sending binlog.. Fri Aug 26 12:04:28 2016 - [info] scp from local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog to root@192.168.137.30:/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog succeeded. Fri Aug 26 12:04:28 2016 - [info] Starting recovery on 192.168.137.30(192.168.137.30:3306).. Fri Aug 26 12:04:28 2016 - [info] Generating diffs succeeded. Fri Aug 26 12:04:28 2016 - [info] Waiting until all relay logs are applied. Fri Aug 26 12:04:28 2016 - [info] done. Fri Aug 26 12:04:28 2016 - [info] Getting slave status.. Fri Aug 26 12:04:28 2016 - [info] This slave(192.168.137.30)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000074:22461852). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 12:04:28 2016 - [info] Connecting to the target slave host 192.168.137.30, running recover script.. Fri Aug 26 12:04:28 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.30 --slave_ip=192.168.137.30 --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 12:04:30 2016 - [info] MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog on 192.168.137.30:3306. This may take long time... Applying log files succeeded. Fri Aug 26 12:04:30 2016 - [info] All relay logs were successfully applied. Fri Aug 26 12:04:30 2016 - [info] Resetting slave 192.168.137.30(192.168.137.30:3306) and starting replication from the new master 192.168.137.10(192.168.137.10:3306).. Fri Aug 26 12:04:31 2016 - [info] Executed CHANGE MASTER. Fri Aug 26 12:04:31 2016 - [info] Slave started. Fri Aug 26 12:04:32 2016 - [info] End of log messages from 192.168.137.30. Fri Aug 26 12:04:32 2016 - [info] -- Slave recovery on host 192.168.137.30(192.168.137.30:3306) succeeded. Fri Aug 26 12:04:32 2016 - [info] All new slave servers recovered successfully. Fri Aug 26 12:04:32 2016 - [info] Fri Aug 26 12:04:32 2016 - [info] * Phase 5: New master cleanup phase.. Fri Aug 26 12:04:32 2016 - [info] Fri Aug 26 12:04:32 2016 - [info] Resetting slave info on the new master.. Fri Aug 26 12:04:32 2016 - [info] 192.168.137.10: Resetting slave info succeeded. Fri Aug 26 12:04:32 2016 - [info] Master failover to 192.168.137.10(192.168.137.10:3306) completed successfully. Fri Aug 26 12:04:32 2016 - [info] ----- Failover Report ----- ha1: MySQL Master failover 192.168.137.20 to 192.168.137.10 succeeded Master 192.168.137.20 is down! Check MHA Manager logs at monitor:/usr/local/mha/ha1/manager.log for details. Started automated(non-interactive) failover. Invalidated master IP address on 192.168.137.20. The latest slave 192.168.137.30(192.168.137.30:3306) has all relay logs for recovery. Selected 192.168.137.10 as a new master. 192.168.137.10: OK: Applying all logs succeeded. 192.168.137.10: OK: Activated master IP address. 192.168.137.30: This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. 192.168.137.30: OK: Applying all logs succeeded. Slave started, replicating from 192.168.137.10. 192.168.137.10: Resetting slave info succeeded. Master failover to 192.168.137.10(192.168.137.10:3306) completed successfully.
說明:用紅色標記了一些主要的處理過程,亮色加粗標記了每個步驟總共5個步驟
Failover步驟如下:
1.fail判斷,分別判斷dead master的mysql(Ping(SELECT))和ssh分別到達情況(之間會呼叫masterha_secondary_check指令碼)→dead master處理階段
2.配置檔案檢查,會檢查整個叢集配置檔案配置(分別確定dead server,候選的master和所有的server的配置情況以及配置是否滿足條件)→dead master處理階段
3.當機的master處理,包括虛擬ip摘除操作,主機關機操作(這裡暫時沒有配置關機操作)→dead master處理階段
/usr/local/mha/ha1/fail_script/master_ip_failover --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --command=stopssh --ssh_user=root
4.找到含有最新relay log的slave(同時找到最舊的binlog的slave的position), 分別判斷是否是候選的slave→new master還原階段
5.儲存dead master(137.20)和最新slave(137.30)相差的relay log儲存在dead master的/tmp目錄下(根據配置檔案配置的remote_workdir),然後確定這部分差異binlog(saved_master_binlog_)是否有效,也就是dead master和最新的slave之間是否存在binlog差異,存在差異則將生成的這個差異binlog拷貝到mha的workdir(137.40)下→new master還原階段
Fri Aug 26 11:57:46 2016 - [info] Executing command on the dead master 192.168.137.20(192.168.137.20:3306): save_binary_logs --command=save --start_file=mysql-bin.000074 --start_pos=22461852 --binlog_dir=/mysql/log --output_file=/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 Creating /tmp if not exists.. ok. Concat binary/relay logs from mysql-bin.000074 pos 22461852 to mysql-bin.000074 EOF into /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog .. Dumping binlog format description event, from position 0 to 120.. ok. Dumping effective binlog data from /mysql/log/mysql-bin.000074 position 22461852 to tail(22497564).. ok. Concat succeeded. Fri Aug 26 11:57:49 2016 - [info] scp from root@192.168.137.20:/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog to local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog succeeded.
7.確定新的master,並檢查最新slave(30)的relay log是否可以用來還原其他的slave→new master還原階段
8.生成最新slave(137.30)和new master(137.10)之間的差異relay log(在最新relay log的slave的/tmp下生成和其它slave差異的binlog,是二者的“Read_Master_Log_Pos”的差,取名為“relay_from_read_to_latest_後面緊接的是目標slave的ip”),然後cp到目標(new master)的slave的/tmp下,同時將mha workdir下剛才儲存的"saved_master_binlog_"(如果存在)檔案拷貝到new master的/tmp下→new master還原階段
Fri Aug 26 11:57:56 2016 - [info] Connecting to the latest slave host 192.168.137.30, generating diff relay log files.. Fri Aug 26 11:57:56 2016 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=192.168.137.10 --latest_mlf=mysql-bin.000074 --latest_rmlp=22461852 --target_mlf=mysql-bin.000074 --target_rmlp=9857376 --server_id=30 --diff_file_readtolatest=/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog --workdir=/tmp --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ Fri Aug 26 11:58:02 2016 - [info] Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:mysql-relay-bin.000003, start_pos:9857539. Concat binary/relay logs from mysql-relay-bin.000003 pos 9857539 to mysql-relay-bin.000003 EOF into /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog .. Dumping binlog format description event, from position 0 to 283.. ok. Dumping effective binlog data from /mysql/data/mysql-relay-bin.000003 position 9857539 to tail(22462015).. ok. Concat succeeded. Generating diff relay log succeeded. Saved at /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog . scp slave:/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog to root@192.168.137.10(22) succeeded.
9.new master應用差異的relay log(首先會判斷該salve原本Read_Master_Log_Pos”和“Exec_Master_Log_Pos是否相等,由於不是半同步複製所以slave雖然讀到了該pos但是但是由於複製是非同步的所有還得等待master定時傳送binlog到slave,如果這中間master故障了就會導致二者的不一致),如果不相等會在該slave執行save_binary_logs命令儲存之間差異的relay log取名為“relay_from_exec_to_read_後面緊接的是自身的ip”;然後應用"relay_from_read_to_latest_、saved_master_binlog_、relay_from_exec_to_read_"這三個差異的relay log,同時將這三個檔案的內容合併生成一個新的binlog檔案“total_binlog_for_”→new master還原階段
Fri Aug 26 12:00:06 2016 - [info] This slave(192.168.137.10)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000074:9857376). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 12:00:06 2016 - [info] Connecting to the target slave host 192.168.137.10, running recover script.. Fri Aug 26 12:00:06 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.10 --slave_ip=192.168.137.10 --slave_port=3306 --apply_files=/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog,/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 12:04:22 2016 - [info] Concat all apply files to /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog .. Copying the first binlog file /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog to /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog.. ok. Dumping binlog head events (rotate events), skipping format description events from /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog.. dumped up to pos 120. ok. /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog has effective binlog events from pos 120. Dumping effective binlog data from /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog position 120 to tail(35832).. ok. Concat succeeded. All apply target binary logs are concatinated at /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog . MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog,/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog on 192.168.137.10:3306. This may take long time... Applying log files succeeded. Fri Aug 26 12:04:22 2016 - [info] All relay logs were successfully applied.
10.生成其它slave到新master的change語句,執行master_ip_failover完成切換生成VIP→new master還原階段
11. 其它的slave也重複new masters slave的步驟 (從第8-9步,例如這裡就會從mha的workdir拷貝saved_master_binlog_到最新的slave(137.30)上應用差異的relay log) →other slave還原階段
12. other slave change new master→other slave還原階段
13.生成failover report
注意:這裡的relay log指的是slave已經讀取到的master的binglog的位置(也就是slave中的relay log檔案中已經儲存了最新master binlog的位置)在show slave status \G中也就是“Read_Master_Log_Pos”並不是“Exec_Master_Log_Pos”,所以最新relay log的salve並不一定就是資料最新的(但是這種情況比較少),只能說明它儲存的master binlog是最新的。
mha最先修復new master(不管它是否是最新的slave,所以如果候選的slave是最新的slave那麼自然是最好的候選的slave就會很快修復好),然後再去修復其它的slave。
2.不線上手動Failover
注意:前提條件是mha沒有啟動,且存在dead的master,MHA manager檢測到沒有dead的server,將報錯,並結束failover。
手動failover,這種場景意味著在業務上沒有啟用MHA自動切換功能,當主伺服器故障時,人工手動呼叫MHA來進行故障切換操作,具體命令如下:
語句如下:
masterha_master_switch --master_state=dead --conf=/usr/local/mha/ha1/ha1.cnf --dead_master_host=192.168.137.10 --dead_master_port=3306 --new_master_host=192.168.137.20 --new_master_port=3306 --ignore_fail_on_start --ignore_last_failover
切換的過程中會存在幾次需要輸入“yes”進行下一步
[root@monitor ha1]# masterha_master_switch --master_state=dead --conf=/usr/local/mha/ha1/ha1.cnf --dead_master_host=192.168.137.10 --dead_master_port=3306 --new_master_host=192.168.137.20 --new_master_port=3306 --ignore_fail_on_start --ignore_last_failover --dead_master_ip=<dead_master_ip> is not set. Using 192.168.137.10. Fri Aug 26 17:44:10 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Fri Aug 26 17:44:10 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Fri Aug 26 17:44:10 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Fri Aug 26 17:44:10 2016 - [info] MHA::MasterFailover version 0.55. Fri Aug 26 17:44:10 2016 - [info] Starting master failover. Fri Aug 26 17:44:10 2016 - [info] Fri Aug 26 17:44:10 2016 - [info] * Phase 1: Configuration Check Phase.. Fri Aug 26 17:44:10 2016 - [info] Fri Aug 26 17:44:11 2016 - [info] Dead Servers: Fri Aug 26 17:44:11 2016 - [info] 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:11 2016 - [info] Checking master reachability via mysql(double check).. Fri Aug 26 17:44:11 2016 - [info] ok. Fri Aug 26 17:44:11 2016 - [info] Alive Servers: Fri Aug 26 17:44:11 2016 - [info] 192.168.137.20(192.168.137.20:3306) Fri Aug 26 17:44:11 2016 - [info] 192.168.137.30(192.168.137.30:3306) Fri Aug 26 17:44:11 2016 - [info] Alive Slaves: Fri Aug 26 17:44:11 2016 - [info] 192.168.137.20(192.168.137.20:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:11 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:11 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 17:44:11 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:11 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:11 2016 - [info] Not candidate for the new Master (no_master is set) Master 192.168.137.10 is dead. Proceed? (yes/NO): yes Fri Aug 26 17:44:18 2016 - [info] ** Phase 1: Configuration Check Phase completed. Fri Aug 26 17:44:18 2016 - [info] Fri Aug 26 17:44:18 2016 - [info] * Phase 2: Dead Master Shutdown Phase.. Fri Aug 26 17:44:18 2016 - [info] Fri Aug 26 17:44:20 2016 - [info] HealthCheck: SSH to 192.168.137.10 is reachable. Fri Aug 26 17:44:22 2016 - [info] Forcing shutdown so that applications never connect to the current master.. Fri Aug 26 17:44:22 2016 - [info] Executing master IP deactivatation script: Fri Aug 26 17:44:22 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306 --command=stopssh --ssh_user=root IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Disabling the VIP on old master: 192.168.137.10 Fri Aug 26 17:44:23 2016 - [info] done. Fri Aug 26 17:44:23 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Fri Aug 26 17:44:23 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed. Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:23 2016 - [info] * Phase 3: Master Recovery Phase.. Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:23 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:23 2016 - [info] The latest binary log file/position on all slaves is mysql-bin.000144:120 Fri Aug 26 17:44:23 2016 - [info] Latest slaves (Slaves that received relay log files to the latest): Fri Aug 26 17:44:23 2016 - [info] 192.168.137.20(192.168.137.20:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:23 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:23 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 17:44:23 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:23 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:23 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 17:44:23 2016 - [info] The oldest binary log file/position on all slaves is mysql-bin.000144:120 Fri Aug 26 17:44:23 2016 - [info] Oldest slaves: Fri Aug 26 17:44:23 2016 - [info] 192.168.137.20(192.168.137.20:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:23 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:23 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 17:44:23 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:23 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:23 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:23 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:24 2016 - [info] Fetching dead master's binary logs.. Fri Aug 26 17:44:24 2016 - [info] Executing command on the dead master 192.168.137.10(192.168.137.10:3306): save_binary_logs --command=save --start_file=mysql-bin.000144 --start_pos=120 --binlog_dir=/mysql/log --output_file=/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 Creating /tmp if not exists.. ok. Concat binary/relay logs from mysql-bin.000144 pos 120 to mysql-bin.000144 EOF into /tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog .. Dumping binlog format description event, from position 0 to 120.. ok. Dumping effective binlog data from /mysql/log/mysql-bin.000144 position 120 to tail(143).. ok. Concat succeeded. saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog 100% 143 0.1KB/s 00:00 Fri Aug 26 17:44:27 2016 - [info] scp from root@192.168.137.10:/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog to local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog succeeded. Fri Aug 26 17:44:29 2016 - [info] HealthCheck: SSH to 192.168.137.20 is reachable. Fri Aug 26 17:44:31 2016 - [info] HealthCheck: SSH to 192.168.137.30 is reachable. Fri Aug 26 17:44:31 2016 - [info] Fri Aug 26 17:44:31 2016 - [info] * Phase 3.3: Determining New Master Phase.. Fri Aug 26 17:44:31 2016 - [info] Fri Aug 26 17:44:31 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Fri Aug 26 17:44:31 2016 - [info] All slaves received relay logs to the same position. No need to resync each other. Fri Aug 26 17:44:31 2016 - [info] 192.168.137.20 can be new master. Fri Aug 26 17:44:31 2016 - [info] New master is 192.168.137.20(192.168.137.20:3306) Fri Aug 26 17:44:31 2016 - [info] Starting master failover.. Fri Aug 26 17:44:31 2016 - [info] From: 192.168.137.10 (current master) +--192.168.137.20 +--192.168.137.30 To: 192.168.137.20 (new master) +--192.168.137.30 Starting master switch from 192.168.137.10(192.168.137.10:3306) to 192.168.137.20(192.168.137.20:3306)? (yes/NO): yes Fri Aug 26 17:44:40 2016 - [info] New master decided manually is 192.168.137.20(192.168.137.20:3306) Fri Aug 26 17:44:40 2016 - [info] Fri Aug 26 17:44:40 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. Fri Aug 26 17:44:40 2016 - [info] Fri Aug 26 17:44:40 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Fri Aug 26 17:44:40 2016 - [info] Sending binlog.. saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog 100% 143 0.1KB/s 00:00 Fri Aug 26 17:44:42 2016 - [info] scp from local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog to root@192.168.137.20:/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog succeeded. Fri Aug 26 17:44:42 2016 - [info] Fri Aug 26 17:44:42 2016 - [info] * Phase 3.4: Master Log Apply Phase.. Fri Aug 26 17:44:42 2016 - [info] Fri Aug 26 17:44:42 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Fri Aug 26 17:44:42 2016 - [info] Starting recovery on 192.168.137.20(192.168.137.20:3306).. Fri Aug 26 17:44:42 2016 - [info] Generating diffs succeeded. Fri Aug 26 17:44:42 2016 - [info] Waiting until all relay logs are applied. Fri Aug 26 17:44:42 2016 - [info] done. Fri Aug 26 17:44:42 2016 - [info] Getting slave status.. Fri Aug 26 17:44:42 2016 - [info] This slave(192.168.137.20)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000144:120). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 17:44:42 2016 - [info] Connecting to the target slave host 192.168.137.20, running recover script.. Fri Aug 26 17:44:42 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.20 --slave_ip=192.168.137.20 --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826174410 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 17:44:43 2016 - [info] MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog on 192.168.137.20:3306. This may take long time... Applying log files succeeded. Fri Aug 26 17:44:43 2016 - [info] All relay logs were successfully applied. Fri Aug 26 17:44:43 2016 - [info] Getting new master's binlog name and position.. Fri Aug 26 17:44:43 2016 - [info] mysql-bin.000075:120 Fri Aug 26 17:44:43 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.20', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000075', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Fri Aug 26 17:44:43 2016 - [info] Executing master IP activate script: Fri Aug 26 17:44:43 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306 --new_master_host=192.168.137.20 --new_master_ip=192.168.137.20 --new_master_port=3306 --new_master_user='root' --new_master_password='root' IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Enabling the VIP - 192.168.137.50/24 on the new master - 192.168.137.20 Fri Aug 26 17:44:44 2016 - [info] OK. Fri Aug 26 17:44:44 2016 - [info] ** Finished master recovery successfully. Fri Aug 26 17:44:44 2016 - [info] * Phase 3: Master Recovery Phase completed. Fri Aug 26 17:44:44 2016 - [info] Fri Aug 26 17:44:44 2016 - [info] * Phase 4: Slaves Recovery Phase.. Fri Aug 26 17:44:44 2016 - [info] Fri Aug 26 17:44:44 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. Fri Aug 26 17:44:44 2016 - [info] Fri Aug 26 17:44:44 2016 - [info] -- Slave diff file generation on host 192.168.137.30(192.168.137.30:3306) started, pid: 5354. Check tmp log /usr/local/mha/ha1/192.168.137.30_3306_20160826174410.log if it takes time.. Fri Aug 26 17:44:45 2016 - [info] Fri Aug 26 17:44:45 2016 - [info] Log messages from 192.168.137.30 ... Fri Aug 26 17:44:45 2016 - [info] Fri Aug 26 17:44:44 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Fri Aug 26 17:44:45 2016 - [info] End of log messages from 192.168.137.30. Fri Aug 26 17:44:45 2016 - [info] -- 192.168.137.30(192.168.137.30:3306) has the latest relay log events. Fri Aug 26 17:44:45 2016 - [info] Generating relay diff files from the latest slave succeeded. Fri Aug 26 17:44:45 2016 - [info] Fri Aug 26 17:44:45 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. Fri Aug 26 17:44:45 2016 - [info] Fri Aug 26 17:44:45 2016 - [info] -- Slave recovery on host 192.168.137.30(192.168.137.30:3306) started, pid: 5356. Check tmp log /usr/local/mha/ha1/192.168.137.30_3306_20160826174410.log if it takes time.. saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog 100% 143 0.1KB/s 00:00 Fri Aug 26 17:44:47 2016 - [info] Fri Aug 26 17:44:47 2016 - [info] Log messages from 192.168.137.30 ... Fri Aug 26 17:44:47 2016 - [info] Fri Aug 26 17:44:45 2016 - [info] Sending binlog.. Fri Aug 26 17:44:45 2016 - [info] scp from local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog to root@192.168.137.30:/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog succeeded. Fri Aug 26 17:44:45 2016 - [info] Starting recovery on 192.168.137.30(192.168.137.30:3306).. Fri Aug 26 17:44:45 2016 - [info] Generating diffs succeeded. Fri Aug 26 17:44:45 2016 - [info] Waiting until all relay logs are applied. Fri Aug 26 17:44:45 2016 - [info] done. Fri Aug 26 17:44:45 2016 - [info] Getting slave status.. Fri Aug 26 17:44:45 2016 - [info] This slave(192.168.137.30)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000144:120). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 17:44:45 2016 - [info] Connecting to the target slave host 192.168.137.30, running recover script.. Fri Aug 26 17:44:45 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.30 --slave_ip=192.168.137.30 --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826174410 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 17:44:45 2016 - [info] MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog on 192.168.137.30:3306. This may take long time... Applying log files succeeded. Fri Aug 26 17:44:45 2016 - [info] All relay logs were successfully applied. Fri Aug 26 17:44:45 2016 - [info] Resetting slave 192.168.137.30(192.168.137.30:3306) and starting replication from the new master 192.168.137.20(192.168.137.20:3306).. Fri Aug 26 17:44:46 2016 - [info] Executed CHANGE MASTER. Fri Aug 26 17:44:46 2016 - [info] Slave started. Fri Aug 26 17:44:47 2016 - [info] End of log messages from 192.168.137.30. Fri Aug 26 17:44:47 2016 - [info] -- Slave recovery on host 192.168.137.30(192.168.137.30:3306) succeeded. Fri Aug 26 17:44:47 2016 - [info] All new slave servers recovered successfully. Fri Aug 26 17:44:47 2016 - [info] Fri Aug 26 17:44:47 2016 - [info] * Phase 5: New master cleanup phase.. Fri Aug 26 17:44:47 2016 - [info] Fri Aug 26 17:44:47 2016 - [info] Resetting slave info on the new master.. Fri Aug 26 17:44:47 2016 - [info] 192.168.137.20: Resetting slave info succeeded. Fri Aug 26 17:44:47 2016 - [info] Master failover to 192.168.137.20(192.168.137.20:3306) completed successfully. Fri Aug 26 17:44:47 2016 - [info] ----- Failover Report ----- ha1: MySQL Master failover 192.168.137.10 to 192.168.137.20 succeeded Master 192.168.137.10 is down! Check MHA Manager logs at monitor for details. Started manual(interactive) failover. Invalidated master IP address on 192.168.137.10. The latest slave 192.168.137.20(192.168.137.20:3306) has all relay logs for recovery. Selected 192.168.137.20 as a new master. 192.168.137.20: OK: Applying all logs succeeded. 192.168.137.20: OK: Activated master IP address. 192.168.137.30: This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. 192.168.137.30: OK: Applying all logs succeeded. Slave started, replicating from 192.168.137.20. 192.168.137.20: Resetting slave info succeeded. Master failover to 192.168.137.20(192.168.137.20:3306) completed successfully.
3.線上手動Failover
在許多情況下, 需要將現有的主伺服器遷移到另外一臺伺服器上。 比如主伺服器硬體故障,RAID 控制卡需要重建,將主伺服器移到效能更好的伺服器上等等。維護主伺服器引起效能下降, 導致停機時間至少無法寫入資料。 另外, 阻塞或殺掉當前執行的會話會導致主主之間資料不一致的問題發生。 MHA 提供快速切換和優雅的阻塞寫入,這個切換過程只需要 0.5-2s 的時間,這段時間內資料是無法寫入的。在很多情況下,0.5-2s 的阻塞寫入是可以接受的。因此切換主伺服器不需要計劃分配維護時間視窗。
MHA線上切換的大概過程:
1.檢測複製設定和確定當前主伺服器
2.確定新的主伺服器
3.阻塞寫入到當前主伺服器
4.等待所有從伺服器趕上覆制
5.授予寫入到新的主伺服器
6.重新設定從伺服器
注意,線上切換的時候應用架構需要考慮以下兩個問題:
1.自動識別master和slave的問題(master的機器可能會切換),如果採用了vip的方式,基本可以解決這個問題。
2.負載均衡的問題(可以定義大概的讀寫比例,每臺機器可承擔的負載比例,當有機器離開叢集時,需要考慮這個問題)
為了保證資料完全一致性,在最快的時間內完成切換,MHA的線上切換必須滿足以下條件才會切換成功,否則會切換失敗。
1.所有slave的IO執行緒都在執行
2.所有slave的SQL執行緒都在執行
3.所有的show slave status的輸出中Seconds_Behind_Master引數小於或者等於running_updates_limit秒,如果在切換過程中不指定running_updates_limit,那麼預設情況下running_updates_limit為1秒。
4.在old master端,通過show processlist輸出,沒有一個更新花費的時間大於running_updates_limit秒。
1.首先,停掉MHA監控:
masterha_stop --conf=/usr/local/mha/ha1/ha1.cnf
2.手動線上failover
新的master為192.168.137.10
masterha_master_switch --conf=/usr/local/mha/ha1/ha1.cnf --master_state=alive --new_master_host=192.168.137.10 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
--orig_master_is_new_slave 切換時加上此引數是將原 master 變為 slave 節點,如果不加此引數,原來的 master 將不啟動
--running_updates_limit=10000,故障切換時,候選master 如果有延遲的話, mha 切換不能成功,加上此參數列示延遲在此時間範圍內都可切換(單位為s),但是切換的時間長短是由recover 時relay 日誌的大小決定
注意:手動線上用的failover指令碼是“master_ip_online_change”
3.切換日誌
[root@monitor fail_script]# masterha_master_switch --conf=/usr/local/mha/ha1/ha1.cnf --master_state=alive --new_master_host=192.168.137.10 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000 Mon Aug 29 10:41:37 2016 - [info] MHA::MasterRotate version 0.55. Mon Aug 29 10:41:37 2016 - [info] Starting online master switch.. Mon Aug 29 10:41:37 2016 - [info] Mon Aug 29 10:41:37 2016 - [info] * Phase 1: Configuration Check Phase.. Mon Aug 29 10:41:37 2016 - [info] Mon Aug 29 10:41:37 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Aug 29 10:41:37 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Mon Aug 29 10:41:37 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Mon Aug 29 10:41:38 2016 - [info] Current Alive Master: 192.168.137.20(192.168.137.20:3306) Mon Aug 29 10:41:38 2016 - [info] Alive Slaves: Mon Aug 29 10:41:38 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Mon Aug 29 10:41:38 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Mon Aug 29 10:41:38 2016 - [info] Primary candidate for the new Master (candidate_master is set) Mon Aug 29 10:41:38 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Mon Aug 29 10:41:38 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Mon Aug 29 10:41:38 2016 - [info] Not candidate for the new Master (no_master is set) It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.137.20(192.168.137.20:3306)? (YES/no): yes Mon Aug 29 10:41:40 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time.. Mon Aug 29 10:41:40 2016 - [info] ok. Mon Aug 29 10:41:40 2016 - [info] Checking MHA is not monitoring or doing failover.. Mon Aug 29 10:41:40 2016 - [info] Checking replication health on 192.168.137.10.. Mon Aug 29 10:41:40 2016 - [info] ok. Mon Aug 29 10:41:40 2016 - [info] Checking replication health on 192.168.137.30.. Mon Aug 29 10:41:40 2016 - [info] ok. Mon Aug 29 10:41:40 2016 - [info] 192.168.137.10 can be new master. Mon Aug 29 10:41:40 2016 - [info] From: 192.168.137.20 (current master) +--192.168.137.10 +--192.168.137.30 To: 192.168.137.10 (new master) +--192.168.137.30 +--192.168.137.20 Starting master switch from 192.168.137.20(192.168.137.20:3306) to 192.168.137.10(192.168.137.10:3306)? (yes/NO): yes Mon Aug 29 10:41:42 2016 - [info] Checking whether 192.168.137.10(192.168.137.10:3306) is ok for the new master.. Mon Aug 29 10:41:42 2016 - [info] ok. Mon Aug 29 10:41:42 2016 - [info] 192.168.137.20(192.168.137.20:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host. Mon Aug 29 10:41:42 2016 - [info] 192.168.137.20(192.168.137.20:3306): Resetting slave pointing to the dummy host. Mon Aug 29 10:41:42 2016 - [info] ** Phase 1: Configuration Check Phase completed. Mon Aug 29 10:41:42 2016 - [info] Mon Aug 29 10:41:42 2016 - [info] * Phase 2: Rejecting updates Phase.. Mon Aug 29 10:41:42 2016 - [info] Mon Aug 29 10:41:42 2016 - [info] Executing master ip online change script to disable write on the current master: Mon Aug 29 10:41:42 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_online_change --command=stop --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='root' --new_master_host=192.168.137.10 --new_master_ip=192.168.137.10 --new_master_port=3306 --new_master_user='root' --new_master_password='root' *************************************************************** Disabling the VIP - 192.168.137.50/24 on old master: 192.168.137.20 *************************************************************** Mon Aug 29 10:41:43 2016 - [info] ok. Mon Aug 29 10:41:43 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root): Mon Aug 29 10:41:43 2016 - [info] Executing FLUSH TABLES WITH READ LOCK.. Mon Aug 29 10:41:43 2016 - [info] ok. Mon Aug 29 10:41:43 2016 - [info] Orig master binlog:pos is mysql-bin.000076:120. Mon Aug 29 10:41:43 2016 - [info] Waiting to execute all relay logs on 192.168.137.10(192.168.137.10:3306).. Mon Aug 29 10:41:43 2016 - [info] master_pos_wait(mysql-bin.000076:120) completed on 192.168.137.10(192.168.137.10:3306). Executed 0 events. Mon Aug 29 10:41:43 2016 - [info] done. Mon Aug 29 10:41:43 2016 - [info] Getting new master's binlog name and position.. Mon Aug 29 10:41:43 2016 - [info] mysql-bin.000147:120 Mon Aug 29 10:41:43 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000147', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Mon Aug 29 10:41:43 2016 - [info] Executing master ip online change script to allow write on the new master: Mon Aug 29 10:41:43 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_online_change --command=start --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='root' --new_master_host=192.168.137.10 --new_master_ip=192.168.137.10 --new_master_port=3306 --new_master_user='root' --new_master_password='root' *************************************************************** Enabling the VIP - 192.168.137.50/24 on new master: 192.168.137.10 *************************************************************** Mon Aug 29 10:41:44 2016 - [info] ok. Mon Aug 29 10:41:44 2016 - [info] Mon Aug 29 10:41:44 2016 - [info] * Switching slaves in parallel.. Mon Aug 29 10:41:44 2016 - [info] Mon Aug 29 10:41:44 2016 - [info] -- Slave switch on host 192.168.137.30(192.168.137.30:3306) started, pid: 4371 Mon Aug 29 10:41:44 2016 - [info] Mon Aug 29 10:41:45 2016 - [info] Log messages from 192.168.137.30 ... Mon Aug 29 10:41:45 2016 - [info] Mon Aug 29 10:41:44 2016 - [info] Waiting to execute all relay logs on 192.168.137.30(192.168.137.30:3306).. Mon Aug 29 10:41:44 2016 - [info] master_pos_wait(mysql-bin.000076:120) completed on 192.168.137.30(192.168.137.30:3306). Executed 0 events. Mon Aug 29 10:41:44 2016 - [info] done. Mon Aug 29 10:41:44 2016 - [info] Resetting slave 192.168.137.30(192.168.137.30:3306) and starting replication from the new master 192.168.137.10(192.168.137.10:3306).. Mon Aug 29 10:41:44 2016 - [info] Executed CHANGE MASTER. Mon Aug 29 10:41:44 2016 - [info] Slave started. Mon Aug 29 10:41:45 2016 - [info] End of log messages from 192.168.137.30 ... Mon Aug 29 10:41:45 2016 - [info] Mon Aug 29 10:41:45 2016 - [info] -- Slave switch on host 192.168.137.30(192.168.137.30:3306) succeeded. Mon Aug 29 10:41:45 2016 - [info] Unlocking all tables on the orig master: Mon Aug 29 10:41:45 2016 - [info] Executing UNLOCK TABLES.. Mon Aug 29 10:41:45 2016 - [info] ok. Mon Aug 29 10:41:45 2016 - [info] Starting orig master as a new slave.. Mon Aug 29 10:41:45 2016 - [info] Resetting slave 192.168.137.20(192.168.137.20:3306) and starting replication from the new master 192.168.137.10(192.168.137.10:3306).. Mon Aug 29 10:41:45 2016 - [info] Executed CHANGE MASTER. Mon Aug 29 10:41:45 2016 - [info] Slave started. Mon Aug 29 10:41:45 2016 - [info] All new slave servers switched successfully. Mon Aug 29 10:41:45 2016 - [info] Mon Aug 29 10:41:45 2016 - [info] * Phase 5: New master cleanup phase.. Mon Aug 29 10:41:45 2016 - [info] Mon Aug 29 10:41:45 2016 - [info] 192.168.137.10: Resetting slave info succeeded. Mon Aug 29 10:41:45 2016 - [info] Switching master to 192.168.137.10(192.168.137.10:3306) completed successfully.
切換失敗:
[error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln262] We should not start online master switch when one of connections are running long queries on the new master(192.168.137.10(192.168.137.10:3306)). Currently 1 thread(s) are running. Details: {'Time' => '1173','Command' => 'Daemon','db' => undef,'Id' => '3','Info' => undef,'User' => 'event_scheduler','State' => 'Waiting on empty queue','Host' => 'localhost'}
從錯誤資訊已經說的非常的清除,在new master中存在一個long queries,且該執行緒的id=3是一個事件排程執行緒,由於new master我開啟了“scheduler”導致了線上手動切換失敗,我把“scheduler”關閉就成功了。
在MHA的環境中備選的Master不能開啟“scheduler”
參考:
http://www.cnblogs.com/gomysql/p/3675429.html
http://blog.csdn.net/lichangzai/article/details/50470771
mha下載地址:
https://mysql-master-ha.googlecode.com/files/mha4mysql-manager-0.55.tar.gz
https://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.54.tar.gz
之前寫的複製相關文章:
主從複製:http://www.cnblogs.com/chenmh/p/5089919.html
主主複製:http://www.cnblogs.com/chenmh/p/5153184.html
MMM方案:http://www.cnblogs.com/chenmh/p/5563778.html
半同步複製與MMM:http://www.cnblogs.com/chenmh/p/5744227.html
總結
主和備主之間需要半同步複製才能保證mha的最大程度的資料不丟失,否則使用MHA也就沒優勢了;還有就主和備主不要開啟scheduler(作業),否則手動線上failover會失敗。mha最先修復new master(不管它是否是最新的slave,所以如果候選的slave是最新的slave那麼自然是最好的候選的slave就會很快修復好),然後再去修復其它的slave。
注意:文章中對一些配置做了備註說明,在實際部署中需要將這些備註刪掉
備註: 作者:pursuer.chen 部落格:http://www.cnblogs.com/chenmh 本站點所有隨筆都是原創,歡迎大家轉載;但轉載時必須註明文章來源,且在文章開頭明顯處給明連結。 《歡迎交流討論》 |