Availability Group On Linux 搭建後記

dbLenis發表於2018-08-29
create availability group [InvestorAG]
    with (db_failover=on,cluster_type=external)
    for replica on 
        N'192.168.1.10'
        with (
            endpoint_url = N'tcp://192.168.1.10:5022',
            availability_mode=synchronous_commit,
            failover_mode=external,
            seeding_mode=automatic
            ),
        N'192.168.1.11'
        with (
            endpoint_url = N'tcp://192.168.1.11:5022',
            availability_mode=synchronous_commit,
            failover_mode=external,
            seeding_mode=automatic
            ),
         N'192.168.1.12'
        with (
            endpoint_url = N'tcp://192.168.1.12:5022',
            availability_mode=synchronous_commit,
            failover_mode=external,
            seeding_mode=automatic
            );

    alter availability group [InvestorAG]
        grant create any database ;

使用這份指令碼來搭建 Availability Group 的時候,會爆出如下的錯誤:

訊息 35237,級別 16,狀態 1,第 1 行
None of the specified replicas for availability group InvestorAG maps to the instance of SQL Server to which you are connected. Reenter the command, specifying this server instance to host one of the replicas. This replica will be the initial primary replica.

使用如下命令即可:

create availability group [InvestorAG]
    with (db_failover=on,cluster_type=external)
    for replica on 
        N'centos00'
        with (
            endpoint_url = N'tcp://192.168.1.10:5022',
            availability_mode=synchronous_commit,
            failover_mode=external,
            seeding_mode=automatic
            ),
          N'centos01'
        with (
            endpoint_url = N'tcp://192.168.1.11:5022',
            availability_mode=synchronous_commit,
            failover_mode=external,
            seeding_mode=automatic
            ),
          N'centos02'
        with (
            endpoint_url = N'tcp://192.168.1.12:5022',
            availability_mode=synchronous_commit,
            failover_mode=external,
            seeding_mode=automatic
            );

新建表以及裝載資料

create table dbo.DimTime(TimeID bigint, DayX int, FinMonth int, FinYear Int, FinDate datetime) 
GO 

Insert into DimTime(TimeID,DayX,FinMonth,FinYear,FinDate) values(1,1,1,2018,'2018-01-01')
go 

SELECT * FROM DimTime 

GO

如果 SQL Server 再聰明一點,這一步已經足夠讓其他的副本也啟動起來了, 但遺憾的是,我去檢查 192.168.1.11 和 192.168.1.12 什麼都沒有發生。
因為我並沒有給新建的 Availability Group 新增同步的資料庫 Investor, 只是AG 的名字用了 InvestorAG, 並沒有同時指定資料庫。

通過檢查新建的 hadr session 和內建的 AlwaysOn_health session, 倒是有點收穫:

-- 新建一個 all-in 的 session 

CREATE EVENT SESSION [All_AlwaysOn] ON SERVER 
ADD EVENT sqlserver.alwayson_ddl_executed,
ADD EVENT sqlserver.availability_group_lease_expired(
    ACTION(package0.last_error,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.context_info,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.server_instance_name,sqlserver.session_nt_username,sqlserver.session_server_principal_name,sqlserver.sql_text,sqlserver.username)),
ADD EVENT sqlserver.availability_replica_automatic_failover_validation(
    ACTION(package0.last_error,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.context_info,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.server_instance_name,sqlserver.session_nt_username,sqlserver.session_server_principal_name,sqlserver.sql_text,sqlserver.username)),
ADD EVENT sqlserver.availability_replica_database_fault_reporting(
    ACTION(package0.last_error,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.context_info,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.server_instance_name,sqlserver.session_nt_username,sqlserver.session_server_principal_name,sqlserver.sql_text,sqlserver.username)),
ADD EVENT sqlserver.availability_replica_manager_state_change(
    ACTION(package0.last_error,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.context_info,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.server_instance_name,sqlserver.session_nt_username,sqlserver.session_server_principal_name,sqlserver.sql_text,sqlserver.username)),
ADD EVENT sqlserver.availability_replica_state(
    ACTION(package0.last_error,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.context_info,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.server_instance_name,sqlserver.session_nt_username,sqlserver.session_server_principal_name,sqlserver.sql_text,sqlserver.username)),
ADD EVENT sqlserver.availability_replica_state_change(
    ACTION(package0.last_error,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.context_info,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.server_instance_name,sqlserver.session_nt_username,sqlserver.session_server_principal_name,sqlserver.sql_text,sqlserver.username)),
ADD EVENT sqlserver.hadr_ddl_failover_execution_state(
    ACTION(package0.last_error,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.context_info,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.server_instance_name,sqlserver.session_nt_username,sqlserver.session_server_principal_name,sqlserver.sql_text,sqlserver.username)),
ADD EVENT sqlserver.hadr_read_only_route_preconditions(
    ACTION(package0.last_error,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.context_info,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.server_instance_name,sqlserver.session_nt_username,sqlserver.session_server_principal_name,sqlserver.sql_text,sqlserver.username)),
ADD EVENT sqlserver.hadr_undo_of_redo_log_scan(
    ACTION(package0.last_error,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.context_info,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.server_instance_name,sqlserver.session_nt_username,sqlserver.session_server_principal_name,sqlserver.sql_text,sqlserver.username))
ADD TARGET package0.event_file(SET filename=N'All_AlwaysOn')
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
GO
-- 內建的 AlwaysOn_health session 

CREATE EVENT SESSION [AlwaysOn_health] ON SERVER 
ADD EVENT sqlserver.alwayson_ddl_executed,
ADD EVENT sqlserver.availability_group_lease_expired,
ADD EVENT sqlserver.availability_replica_automatic_failover_validation,
ADD EVENT sqlserver.availability_replica_manager_state_change,
ADD EVENT sqlserver.availability_replica_state,
ADD EVENT sqlserver.availability_replica_state_change,
ADD EVENT sqlserver.error_reported(
    WHERE ([error_number]=(9691OR [error_number]=(35204OR [error_number]=(9693OR [error_number]=(26024OR [error_number]=(28047OR [error_number]=(26023OR [error_number]=(9692OR [error_number]=(28034OR [error_number]=(28036OR [error_number]=(28048OR [error_number]=(28080OR [error_number]=(28091OR [error_number]=(26022OR [error_number]=(9642OR [error_number]=(35201OR [error_number]=(35202OR [error_number]=(35206OR [error_number]=(35207OR [error_number]=(26069OR [error_number]=(26070OR [error_number]>(41047AND [error_number]<(41056OR [error_number]=(41142OR [error_number]=(41144OR [error_number]=(1480OR [error_number]=(823OR [error_number]=(824OR [error_number]=(829OR [error_number]=(35264OR [error_number]=(35265OR [error_number]=(41188OR [error_number]=(41189))),
ADD EVENT sqlserver.hadr_db_partner_set_sync_state,
ADD EVENT sqlserver.lock_redo_blocked
ADD TARGET package0.event_file(SET filename=N'AlwaysOn_health.xel',max_file_size=(5),max_rollover_files=(4))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
GO

定義中唯一的區別就是 sqlserver.error_reported 事件,在 AlwaysOn_health 定義中新增了這份事件,而在 All_AlwaysOn 自定義中卻沒有。導致的結局是 AlwaysOn_health 提供了一些蛛絲馬跡可供調查:

message A connection timeout has occurred while attempting to establish a connection to availability replica ‘centos01’ with id [D42B7002-C59F-47A3-9E39-7980F399AC91]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.

通過執行命令:

select * from sys.dm_hadr_availability_replica_states

發現 D42B7002-C59F-47A3-9E39-7980F399AC91 在結果集中是脫線的,disconnected.

登入 centos01 ,執行下列命令:

select * from sys.dm_hadr_availability_replica_states
select * from sys.dm_hadr_availability_replica_cluster_states
select * from sys.dm_hadr_availability_replica_cluster_nodes
select * from sys.dm_hadr_availability_group_states

可知,centos01 上沒有新增到 InvestorAG 中來的 replica, group 和 replica 唯一確定了參與 Availability Group 的成員,即 group_id 與 replica_id 是組合唯一鍵。

在每臺從庫上執行下面命令:

ALTER AVAILABILITY GROUP [InvestorAG]
    Join WITH(cluster_type=external);

ALTER AVAILABILITY GROUP [InvestorAG]
    Grant Create Any Database ;

在 AlwaysOn_health 會話中,找到如下錯誤:

message A connection for availability group ‘InvestorAG’ from availability replica ‘centos01’ with id [D42B7002-C59F-47A3-9E39-7980F399AC91] to ‘centos00’ with id [F7830D02-7E2E-4494-82FD-D1CC700CA108] has been successfully established. This is an informational message only. No user action is required.

但這個錯誤顯然是讓人高興的。因為再次去主庫上查詢狀態時,centos01 從庫已經上線了。

但此時 Investor 資料庫還沒有完全上線,即從庫 centos01 還沒有接收到主庫同步過來的 Investor 資料庫。因為我還沒有把資料庫 Investor 加入到 InvestorAG 中來。

需在 master 為當前資料庫的情況下,執行新增 AG 資料庫的操作,否則有這麼個錯誤:

訊息 35208,級別 16,狀態 241,第 32 行
Availability-group DDL operations are permitted only when you are using the master database. Run the USE MASTER command, and retry your availability-group DDL command.

use master 
go 

alter availability group InvestorAG
    add database investor

看上去很簡單, 我認為此時資料庫已經同步到每臺從庫上了,且每臺從庫也都有建立這個庫了,但是並不能訪問

訊息
The target database, ‘investor’, is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other databases in the availability group, enable read access to one or more secondary availability replicas in the group. For more information, see the ALTER AVAILABILITY GROUP statement in SQL Server Books Online.

似乎在這裡我忘記了這麼一個概念:高可用架構保障的就是當前叢集中只有一臺伺服器在提供者某一種服務,而當前配置的 Availability Group 架構是三臺實時同步的伺服器,提供了某一個資料庫的7*24不發生故障,持續提供服務的功能。如果需要某一臺副本資料庫可供訪問,那麼配置兩臺實時同步,另一臺只讀,這樣便可提供讀寫分離的服務。

所以還是要思考兩個問題:

1 在高可用的叢集中,衍生出一臺來做讀寫分離,另外兩臺提供高可用服務

2 實現高可用的故障轉移,即主庫失效則自動切換到從庫,以提供持續服務

第一個問題,在配置的時候,只要選擇可供只讀訪問:

create availability group sap
    with (db_failover=on,cluster_type=external)
    for replica on 
        N'centos00'
        with (
            endpoint_url = N'tcp://192.168.1.10:5022',
            availability_mode=synchronous_commit,
            failover_mode=external,
            seeding_mode=automatic

            ),
         N'centos01'
        with (
            endpoint_url = N'tcp://192.168.1.11:5022',
            availability_mode=synchronous_commit,
            failover_mode=external,
            seeding_mode=automatic,
            secondary_role ( 
            allow_connections = read_only
            )
            ),
         N'centos02'
        with (
            endpoint_url = N'tcp://192.168.1.12:5022',
            availability_mode=synchronous_commit,
            failover_mode=external,
            seeding_mode=automatic,
            secondary_role ( 
            allow_connections = read_only
            )
            );

當下最重要的話題,先配置三臺實時同步的高可用叢集,實現自動化故障轉移。

有兩篇外文,值得參考:

https://blog.dbi-services.com/introducing-alwayson-availability-groups-on-linux/

https://blog.dbi-services.com/introducing-high-availability-with-sql-server-on-linux/

在之前第一遍嘗試搭建 Availability Groups 過程中,啟動切換故障的主庫不成功,要分析原因,就要做一些測試點:

[root@centos00 log]# pcs status
Cluster name: crmcluster
Stack: corosync
Current DC: centos00 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition WITHOUT quorum
Last updated: Sun Aug 26 15:26:43 2018
Last change: Tue Aug 21 23:33:46 2018 by root via cibadmin on centos00

3 nodes configured
4 resources configured

Online: [ centos00 ]
OFFLINE: [ centos01 centos02 ]

Full list of resources:

 Master/Slave Set: crmag_cluster-master [crmag_cluster]
     Stopped: [ centos00 centos01 centos02 ]
 virtualip    (ocf::heartbeat:IPaddr2):   Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

通過 pcs status 命令看到部分原因了: Master/Slave set 和 virtualip 都停止了。

[root@centos02 ~]# pcs status
Cluster name: crmcluster
Stack: corosync
Current DC: centos00 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Aug 29 06:15:23 2018
Last change: Wed Aug 29 06:15:01 2018 via centos02 on centos00

3 nodes configured
4 resources configured

Online: [ centos00 centos01 centos02 ]

Full list of resources:

 Master/Slave Set: crmag_cluster-master [crmag_cluster]
     Masters: [ centos01 ]
     Slaves: [ centos00 centos02 ]
 virtualip    (ocf::heartbeat:IPaddr2):   Started centos01

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

在每一臺伺服器上開啟叢集服務,即可啟動整個叢集。

pcs cluster start
pcs cluster enable

有趣的是,誤打誤撞開啟了本該是從庫的伺服器為叢集之首(master), 結果訪問 virtual ip (192.168.1.15) 的時候,連上的是 192.168.1.11,並不能開啟可讀可寫的 Investor 資料庫,該資料庫才是我們加入 availability group,保障高可用的目標。

此時我們在 192.168.1.15 的例項上面新建一個資料庫,可以看到該資料庫在 192.168.1.12 上也同時建立了。

關閉 centos02 的叢集,再新建一個資料庫,看看如何:

pcs cluster stop

新建的資料庫,並不能同步到 192.168.1.12 的例項中了。
但卻無意中,將原本一直連線不同的 crm 資料庫給同步起來了。

這樣的話,要測試是否自動切換就能簡單了,直接斷掉 192.168.1.11 (現在是 cluster mater ).

[root@centos01 ~]# pcs cluster stop
Error: Stopping the node will cause a loss of the quorum, use --force to override
[root@centos02 ~]# pcs status
Cluster name: crmcluster
Stack: corosync
Current DC: centos00 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Aug 29 07:11:01 2018
Last change: Tue Aug 21 23:33:46 2018 by root via cibadmin on centos00

3 nodes configured
4 resources configured

Online: [ centos00 centos02 ]
OFFLINE: [ centos01 ]

Full list of resources:

 Master/Slave Set: crmag_cluster-master [crmag_cluster]
     Masters: [ centos00 ]
     Slaves: [ centos02 ]
     Stopped: [ centos01 ]
 virtualip    (ocf::heartbeat:IPaddr2):   Started centos00

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

自動切換實現了

所有的連線如果指向的是 192.168.1.15 那麼背後的 sql server instance 自動切換,都是沒有影響的。借用《子彈飛》裡劉嘉玲的那句話,“我只想當縣長夫人,至於縣長是誰,那並不重要!”。

相關文章