We are planning and reorganizing the company's database architecture, so choosing a suitable architecture for each business environment matters; as operations staff we also have to keep maintenance simple (the architecture must not demand too much upkeep). The shortlist came down to these architectures and combinations: MHA+VIP, master-slave+VIP, a Mycat cluster, and DNS+master-slave. This series records how each of them is deployed, compares their strengths and weaknesses, and matches them to the right scenarios.
This article covers automatic master/standby failover: the VIP + master-slave replication setup, with the VIP managed by keepalived.
I. Environment

| Server IP | OS | Database | Hardware | Notes |
| --- | --- | --- | --- | --- |
| 192.168.1.23 | RHEL6.6_x86_64 | MySQL 5.7.21 | RAM: 16GB, CPU: 4 cores, Disk: 100GB | VIP: 192.168.1.27, Keepalived: 1.4.4 |
| 192.168.1.24 | RHEL6.6_x86_64 | MySQL 5.7.21 | RAM: 16GB, CPU: 4 cores, Disk: 100GB | |
II. Business Scenarios
1. Advantages
1) Simple to deploy and easy to manage.
2) The VIP fails over between master and standby transparently to the application.
2. Disadvantages
1) Risk of split brain.
2) Failover happens without checking replication state.
3) Does not work across VLANs.
4) Deployment looks simple, but there are many details to get right; be cautious if you have not thought them all through.
3. Scenarios
1) A two-node database pair whose only goal is master/standby VIP failover that is transparent to the application.
2) Multiple slave nodes load-balanced through LVS (this approach to load balancing is no longer recommended).
3) Two-node bidirectional replication can reduce some of the risks (VIP flapping back and forth, split brain, and so on) but cannot eliminate them.
In short, this architecture is not recommended; if you do use it, proceed with caution.
III. Database Installation and Replication Setup
Omitted.
IV. Installing keepalived
1. Download keepalived
Download: http://www.keepalived.org/software/keepalived-1.4.4.tar.gz
2. Install keepalived
On the master:
[root@node1 tools]# tar -xzvf keepalived-1.4.4.tar.gz
[root@node1 tools]# cd keepalived-1.4.4
[root@node1 keepalived-1.4.4]# ./configure --prefix=/usr/local/keepalived
[root@node1 keepalived-1.4.4]# make -j 4
[root@node1 keepalived-1.4.4]# make install
[root@node1 keepalived-1.4.4]# cp -rp ./keepalived/etc/init.d/keepalived /etc/init.d/
[root@node1 keepalived-1.4.4]# chkconfig --add keepalived
[root@node1 keepalived-1.4.4]# mkdir /etc/keepalived
[root@node1 keepalived-1.4.4]# cp -rp ./bin/keepalived /usr/bin/
[root@node1 keepalived-1.4.4]# cp ./keepalived/etc/keepalived/keepalived.conf /etc/keepalived
[root@node1 keepalived-1.4.4]# cp -rp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
[root@node1 keepalived-1.4.4]# service keepalived status
keepalived is stopped
On the standby: install keepalived the same way.
Note: ./configure may print the warning below; it only means this build will not support IPVS over IPv6, which does not matter here:
*** WARNING - this build will not support IPVS with IPv6.
Please install libnl/libnl-3 dev libraries to support IPv6 with IPVS.
3. Configure keepalived
Master configuration:
[root@node1 keepalived-1.4.4]# cp -rp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf_bak
[root@node1 keepalived-1.4.4]# >/etc/keepalived/keepalived.conf
[root@node1 keepalived-1.4.4]# vi /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    notification_email {
        rangle@163.com
    }
    notification_email_from rangle@163.com
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id Node_Master
}
vrrp_instance VI_1 {
    state BACKUP            ## MASTER or BACKUP; to avoid split brain, set BACKUP on both nodes -- MASTER mode preempts the VIP
    interface eth1
    virtual_router_id 43    ## VRRP group ID; must be identical on both nodes so they join the same VRRP group
    priority 100
    nopreempt               ## with state BACKUP, prevents the VIP from floating back once the recovered master is healthy again
    advert_int 1            ## multicast advertisement interval; must be identical on both nodes
    authentication {        ## authentication settings; must be identical on both nodes
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {     ## the virtual IP; must be identical on both nodes
        192.168.1.27 label eth1:1
    }
}
virtual_server 192.168.1.27 3306 {
    delay_loop 6
    lb_algo wrr             # LVS scheduling algorithm: rr|wrr|lc|wlc|lblc|sh|dh
    lb_kind DR              # LVS forwarding mode: NAT|DR|TUN
    persistence_timeout 50
    protocol TCP
    real_server 192.168.1.23 3306 {
        weight 1
        notify_down /etc/keepalived/mysql.sh
        TCP_CHECK {
            connect_timeout 10
            nb_get_retry 3
            delay_before_retry 3
            connect_port 3306
        }
    }
}
Standby configuration:
[root@node2 keepalived-1.4.4]# cp -rp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf_bak
[root@node2 keepalived-1.4.4]# >/etc/keepalived/keepalived.conf
[root@node2 keepalived-1.4.4]# vi /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    notification_email {
        rangle@163.com
    }
    notification_email_from rangle@163.com
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id Node_backup
}
vrrp_instance VI_1 {
    state BACKUP
    interface eth1
    virtual_router_id 43
    priority 90             ## the master's priority must be higher: 100 there, 90 here (valid range 1-254)
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.27 label eth1:1
    }
}
virtual_server 192.168.1.27 3306 {
    delay_loop 2            ## check the real server every 2 seconds
    lb_algo wrr             ## LVS algorithm; may be omitted if only HA is needed
    lb_kind DR              ## LVS mode; may be omitted if only HA is needed
    persistence_timeout 60
    protocol TCP
    real_server 192.168.1.24 3306 {
        weight 3
        notify_down /etc/keepalived/mysql.sh   ## script to run when the mysql service goes down
        TCP_CHECK {
            connect_timeout 10
            nb_get_retry 3
            delay_before_retry 3
            connect_port 3306
        }
    }
}
Contents of mysql.sh:
#!/bin/bash
# Record keepalived's state, stop the service, and pkill anything left over.
run_status=`service keepalived status | grep running | wc -l`
pro_status=`ps -ef | grep keepalived | grep -v grep | wc -l`
service keepalived stop
if [ ${run_status} != 0 ] || [ ${pro_status} != 0 ]
then
    pkill keepalived
fi
V. Hardening keepalived
1. Risks of keepalived + MySQL
1) Split brain
Symptom:
If the VRRP multicast between master and standby fails (network jitter, firewall rules, and so on), the VIP fails over even though both database services are healthy. The VIP then exists on both nodes at once, both sides accept writes, and the data diverges.
Mitigation:
Add a split-brain check script on the master node: probe the link to the gateway from the local host, and add an arbitration node, to judge whether this host's own outbound network has failed; this complements the VRRP multicast. If the network really is broken, stop keepalived and MySQL outright and send alert mail/SMS.
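The gateway-plus-arbiter idea above can be sketched as a small decision helper. This is my own illustration, not part of the deployment: the function name and the convention of passing in probe exit codes are assumptions, and in practice the two inputs would come from `ping`/`arping` against the gateway and an independent arbitration host.

```shell
#!/bin/bash
# Sketch of the arbitration idea (hypothetical helper): decide whether
# this host is network-isolated given the exit codes of probes to the
# gateway and to an arbitration node. In practice the inputs would come
# from something like:
#   ping -c 2 -W 1 $GATEWAY >/dev/null 2>&1; gw=$?
#   ping -c 2 -W 1 $ARBITER >/dev/null 2>&1; arb=$?
local_net_state() {
    local gw="$1" arb="$2"
    if [ "$gw" -eq 0 ] || [ "$arb" -eq 0 ]; then
        echo "ok"        # at least one independent point reachable
    else
        echo "isolated"  # both unreachable: stop keepalived and mysql
    fi
}

local_net_state 0 1   # -> ok
local_net_state 1 1   # -> isolated
```

Requiring both probes to fail before giving up avoids shooting the node down over a single flaky link.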
2) Failover without replication checks
Symptom:
VIP failover never looks at the slave's state: the VIP moves whether or not the slave has caught up, so the risk of data inconsistency is high.
Mitigation:
Add a replication check script on the slave node: verify that the replication threads are healthy and that there is no replication lag. If there is lag or a replication error, the node must not complete the switch; stop its keepalived service instead.
2. Hardened configuration
The tuned configuration files and scripts follow.
Master configuration:
keepalived.conf
! Configuration File for keepalived
global_defs {
    notification_email {
        abc@163.com
    }
    notification_email_from root@localhost
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id node1
}
vrrp_script check_local {
    script "/etc/keepalived/check_gateway.sh"
    interval 5
    fall 4
    rise 4
    weight 3
}
vrrp_instance VI_1 {
    state BACKUP
    interface eth1
    virtual_router_id 43
    priority 100
    nopreempt
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.27 label eth1:1
    }
    track_script {
        check_local
    }
}
virtual_server 192.168.1.27 3306 {
    delay_loop 6
    lb_algo wrr             # LVS scheduling algorithm: rr|wrr|lc|wlc|lblc|sh|dh
    lb_kind DR              # LVS forwarding mode: NAT|DR|TUN
    persistence_timeout 50
    protocol TCP
    real_server 192.168.1.23 3306 {
        weight 3
        notify_down /etc/keepalived/mysql.sh
        TCP_CHECK {
            connect_timeout 10
            nb_get_retry 3
            delay_before_retry 3
            connect_port 3306
        }
    }
}
Standby configuration:
keepalived.conf
! Configuration File for keepalived
global_defs {
    notification_email {
        abc@163.com
    }
    notification_email_from root@localhost
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id node2
}
vrrp_script check_local {
    script "/etc/keepalived/check_gateway.sh"
    interval 5
    fall 4
    rise 4
    weight 3
}
vrrp_instance VI_1 {
    state BACKUP
    interface eth1
    virtual_router_id 43
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.27 label eth1:1
    }
    track_script {
        check_local
    }
}
virtual_server 192.168.1.27 3306 {
    delay_loop 2
    lb_algo wrr
    lb_kind DR
    persistence_timeout 60
    protocol TCP
    real_server 192.168.1.24 3306 {
        weight 3
        notify_down /etc/keepalived/mysql.sh
        TCP_CHECK {
            connect_timeout 10
            nb_get_retry 3
            delay_before_retry 3
            connect_port 3306
        }
    }
}
Split-brain check script (the original was missing a `then` and started keepalived when the gateway was unreachable; the intended logic, restored here, is: gateway reachable and keepalived stopped, start it; gateway unreachable and keepalived running, stop it):
/etc/keepalived/check_gateway.sh
#!/bin/sh
VIP=192.168.1.27
GATEWAY=192.168.1.254
# ARP-probe the gateway from the VIP's address; failure means this
# host's own network path is broken.
/sbin/arping -I eth1 -c 5 -s $VIP $GATEWAY &>/dev/null
gateway_status=$?
keepalived_status=`ps -ef | grep keepalived | grep -v grep | wc -l`
if [ ${gateway_status} = 0 ] && [ ${keepalived_status} = 0 ]
then
    # Gateway reachable again but keepalived stopped: bring it back.
    service keepalived start
elif [ ${gateway_status} != 0 ] && [ ${keepalived_status} != 0 ]
then
    # Gateway unreachable while keepalived runs: stop it.
    service keepalived stop
fi
Database-outage failover script:
/etc/keepalived/mysql.sh
#!/bin/bash
# Record keepalived's state, stop the service, and pkill anything left over.
run_status=`service keepalived status | grep running | wc -l`
pro_status=`ps -ef | grep keepalived | grep -v grep | wc -l`
service keepalived stop
if [ ${run_status} != 0 ] || [ ${pro_status} != 0 ]
then
    pkill keepalived
fi
Slave state check script (run on the standby from cron: */1 * * * * sh /etc/keepalived/check_slave.sh >/dev/null 2>&1):
#!/bin/bash
VIP="192.168.1.27"
vip_status=`ip add | egrep "${VIP}" | wc -l`
keepalived_status=`ps -ef | grep keepalived | grep -v grep | wc -l`
slave_status=`mysql -uroot -e "show slave status \G" | egrep "Slave.*Running|Seconds_Behind_Master|Last_Errno" | grep -v "Slave_SQL_Running_State" | awk -F ':' '{printf("%s",$NF)}'`
io_status=`echo "${slave_status}" | awk '{print $1}' | grep "Yes" | wc -l`
sql_status=`echo "${slave_status}" | awk '{print $2}' | grep "Yes" | wc -l`
sync_status=`echo "${slave_status}" | awk '{print $3}'`    # Last_Errno
sync_time=`echo "${slave_status}" | awk '{print $4}'`      # Seconds_Behind_Master
if [ ${io_status} -ne 1 ] || [ ${sql_status} -ne 1 ] || [ ${sync_status} -ne 0 ]
then
    # Replication is broken: this node must not be eligible for the VIP.
    if [ ${vip_status} -eq 0 ] && [ ${keepalived_status} -ne 0 ];then
        service keepalived stop
    fi
elif [ ${io_status} -eq 1 ] && [ ${sql_status} -eq 1 ] && [ ${sync_status} -eq 0 ]
then
    if [ ${sync_time} -ge 120 ];then
        # Lagging 120 seconds or more: refuse to take over.
        if [ ${vip_status} -eq 0 ] && [ ${keepalived_status} -ne 0 ];then
            service keepalived stop
        fi
    elif [ ${sync_time} -eq 0 ];then
        # Healthy and caught up: make sure keepalived is running again.
        if [ ${vip_status} -eq 0 ] && [ ${keepalived_status} -eq 0 ];then
            service keepalived start
        fi
    fi
fi
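The branching above is easier to verify when the decision is isolated from the environment. A minimal sketch of the same policy (the function name and argument order are my own, not part of the deployment):

```shell
#!/bin/bash
# Hypothetical helper mirroring check_slave.sh's policy: given the parsed
# replication fields, print the action to take on keepalived.
#   $1 = Slave_IO_Running, $2 = Slave_SQL_Running,
#   $3 = Last_Errno, $4 = Seconds_Behind_Master
slave_action() {
    local io="$1" sql="$2" errno="$3" behind="$4"
    if [ "$io" != "Yes" ] || [ "$sql" != "Yes" ] || [ "$errno" != "0" ]; then
        echo "stop"    # replication broken: never accept the VIP
    elif [ "$behind" -ge 120 ]; then
        echo "stop"    # lag of 120s or more: refuse to take over
    elif [ "$behind" -eq 0 ]; then
        echo "start"   # healthy and caught up: keepalived may run
    else
        echo "noop"    # mild lag: leave the current state alone
    fi
}

slave_action Yes Yes 0 0       # -> start
slave_action Yes No 1050 NULL  # -> stop
slave_action Yes Yes 0 300     # -> stop
```

Note that `Seconds_Behind_Master` can be NULL when the IO thread is down; the first branch catches that case before the numeric comparison runs.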
3. Opening the firewall
VRRP communicates between the HA pair over IP multicast (default group 224.0.0.18). If the firewall is enabled, remember to allow the VRRP protocol:
iptables -A INPUT -p vrrp -d 224.0.0.18 -j ACCEPT
iptables -A INPUT -p vrrp -j ACCEPT
VI. keepalived Failover Tests
1. Master database instance down
Master host state before the test:
[root@node1 keepalived]# service mysqld status
MySQL (Percona Server) running (22983)    [ OK ]
[root@node1 keepalived]# service keepalived status
keepalived (pid 14092) is running...
[root@node1 keepalived]# ip add |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
    inet 192.168.1.27/32 scope global eth1:1
    inet6 fe80::250:56ff:fe8d:496/64 scope link
Standby host state before the test:
[root@node2 keepalived]# service mysqld status
MySQL (Percona Server) running (14146)    [ OK ]
[root@node2 keepalived]# service keepalived status
keepalived (pid 12207) is running...
[root@node2 keepalived]# ip add |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
    inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
Stop the MySQL instance on the master; once the database is down, the VIP has already moved away:
[root@node1 keepalived]# service mysqld stop
Shutting down MySQL (Percona Server)...............    [ OK ]
[root@node1 keepalived]# service mysqld status
MySQL (Percona Server) is not running    [FAILED]
[root@node1 keepalived]# service keepalived status
keepalived is stopped
[root@node1 keepalived]# ip add |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
    inet6 fe80::250:56ff:fe8d:496/64 scope link
Now check the slave:
[root@node2 keepalived]# service mysqld status
MySQL (Percona Server) running (14146)    [ OK ]
[root@node2 keepalived]# service keepalived status
keepalived (pid 12207) is running...
[root@node2 keepalived]# ip add |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
    inet 192.168.1.27/32 scope global eth1:1
    inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
[root@node2 keepalived]# mysql -uroot -e "show slave status \G"|egrep "Slave.*Running|Seconds_Behind_Master|Last_Errno"|grep -v "Slave_SQL_Running_State"
    Slave_IO_Running: Connecting
    Slave_SQL_Running: Yes
    Last_Errno: 0
    Seconds_Behind_Master: NULL
2. Master database server down
As above: check the master and standby state, run shutdown -h now on the master, then check the slave.
3. keepalived stopped on the master
Same procedure as above.
4. NIC failure on the master server
With keepalived and mysqld both healthy but VRRP communication impossible, the master stops its own keepalived service and the failover completes.
5. Simulated VRRP communication failure (split brain)
Enable the firewall on the standby without allowing VRRP through; the VIP then appears on both servers:
[root@node2 keepalived]# service iptables start
iptables: Applying firewall rules: [ OK ]
Master state:
[root@node1 keepalived]# ip add|grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
    inet 192.168.1.27/32 scope global eth1:1
    inet6 fe80::250:56ff:fe8d:496/64 scope link
Standby state:
[root@node2 keepalived]# ip add|grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
    inet 192.168.1.27/32 scope global eth1:1
    inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
If the standby's firewall is turned off again, the VIP settles on the server with the higher priority.
How can this be avoided? We could add a multicast state check to check_gateway.sh; I have not done that here, and only guard against the split brain caused by the gateway becoming temporarily unreachable.
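One way to sketch that multicast check (my own illustration; the interface name, capture window, and packet threshold are assumptions): capture VRRP advertisements for a few seconds and treat silence as a possible split brain. The capture and the decision are separated so the decision can be tested without a network:

```shell
#!/bin/bash
# Hypothetical multicast check for check_gateway.sh. The capture itself
# would look something like this (requires root; eth1 and the 3-second
# window are assumptions):
#   pkts=$(timeout 3 tcpdump -i eth1 -c 5 host 224.0.0.18 2>/dev/null | wc -l)
# On a BACKUP node, zero captured advertisements while the VIP is held
# locally suggests a split brain rather than a dead peer.
vrrp_adverts_state() {
    local pkts="$1"
    if [ "$pkts" -gt 0 ]; then
        echo "peer-visible"   # advertisements arriving: VRRP link is up
    else
        echo "peer-silent"    # no VRRP traffic seen: possible split brain
    fi
}

vrrp_adverts_state 5   # -> peer-visible
vrrp_adverts_state 0   # -> peer-silent
```

On the node currently holding the VIP, the capture would also pick up its own advertisements, so this check is most meaningful on the BACKUP side.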
6. Does the old master take the VIP back after recovery (preemption test)?
(This test follows on from test 1: first restore the MySQL instance and keepalived on the master.)
On the master:
[root@node1 keepalived]# service mysqld start
Starting MySQL (Percona Server)......    [ OK ]
[root@node1 keepalived]# service keepalived start
Starting keepalived:    [ OK ]
[root@node1 keepalived]# service mysqld status
MySQL (Percona Server) running (1141)    [ OK ]
[root@node1 keepalived]# service keepalived status
keepalived (pid 1230) is running...
[root@node1 keepalived]# ip add |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
    inet6 fe80::250:56ff:fe8d:496/64 scope link
On the standby:
[root@node2 keepalived]# service keepalived status
keepalived (pid 12207) is running...
[root@node2 keepalived]# service mysqld status
MySQL (Percona Server) running (14146)    [ OK ]
[root@node2 keepalived]# ip add |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
    inet 192.168.1.27/32 scope global eth1:1
    inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
Even after its services recover, the old master does not preempt the standby's VIP, because both nodes run in BACKUP mode. But if the standby later has to give up the VIP (a service or instance failure) while the master is again eligible, the switch still goes through. Remember: once a failover has happened, manually re-establish the new master/standby replication roles and confirm data consistency before allowing a second switch.
7. Failover when the standby lags by more than 120s
First stop the io_thread on the slave:
(root:localhost:Sun May 20 10:24:45 2018)[(none)]> stop slave io_thread;
Query OK, 0 rows affected (0.00 sec)
Run a large volume of data changes on the master.
Then watch the keepalived state on the slave.
8. Failover when the standby has a replication error
Slave state:
[root@node2 keepalived]# mysql -uroot -e "show slave status \G"|egrep "Slave.*Running|Seconds_Behind_Master|Last_Errno"|grep -v "Slave_SQL_Running_State"
    Slave_IO_Running: Yes
    Slave_SQL_Running: No
    Last_Errno: 1050
    Seconds_Behind_Master: NULL
[root@node2 keepalived]# service keepalived status
keepalived is stopped
[root@node2 keepalived]# ip add |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
    inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
Master state:
[root@node1 keepalived]# service mysqld status
MySQL (Percona Server) running (1141)    [ OK ]
[root@node1 keepalived]# service keepalived status
keepalived (pid 1230) is running...
[root@node1 keepalived]# ip add |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
    inet 192.168.1.27/32 scope global eth1:1
    inet6 fe80::250:56ff:fe8d:496/64 scope link
Now, if the master's services go down, the VIP cannot move anywhere. Verify by stopping mysqld.
The master drops the VIP:
[root@node1 keepalived]# service mysqld stop
Shutting down MySQL (Percona Server)..............    [ OK ]
[root@node1 keepalived]# ip addr |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
    inet6 fe80::250:56ff:fe8d:496/64 scope link
The slave does not pick the VIP up:
[root@node2 keepalived]# ip add |grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
    inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
Appendix:
Causes of split brain
The heartbeat link between the HA servers fails, so they cannot communicate normally.
The heartbeat cable itself is broken (severed, or aged).
A NIC or its driver fails, or IPs are misconfigured or conflict (with directly connected NICs).
A device along the heartbeat path fails (NICs, switches).
The arbitration host fails (when an arbitration scheme is used).
iptables on the HA pair blocks the heartbeat traffic.
The heartbeat NIC or its address is misconfigured, so heartbeats fail to send.
Other misconfiguration: mismatched heartbeat methods, heartbeat broadcast conflicts, software bugs.
In production, split brain can be prevented in the following ways.
(1) Run two heartbeat links at once, for example a serial cable plus an Ethernet cable, so that if one line fails the other still carries the heartbeats.
(2) When split brain is detected, forcibly shut one node down (this needs special equipment such as fencing/STONITH devices): when the standby stops receiving heartbeats, it sends a power-off command to the master over a separate path.
(3) Monitor and alert on split brain (mail, SMS, and so on) so that a human can step in and arbitrate the moment it happens, limiting the damage. Baidu's alert SMS, for example, works in both directions: the alert reaches the administrator's phone, and replying with the corresponding string sends an instruction back so the server handles the fault automatically.