Filestream/Windows Share Causes AlwaysOn Failover to Fail

Posted by stswordman on 2016-01-04

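I recently worked on a case in which a customer performed a failover in an AlwaysOn environment, after which the availability group state on every replica changed to RESOLVING. In addition, one or more dump files were generated on the replica where the failover was executed.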

 

The troubleshooting details follow.

 

Environment

===

SQL Server 2014 SP1 CU3

Primary replica: p1

Secondary replica: p2

Secondary replica: p3

P1 and P2 are in the same subnet.

P3 is in a different subnet.

The availability mode is synchronous-commit for all replicas.

 

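After discussing with the customer, I learned that failover between p1 and p2 works normally, with no failure and no dump. The error occurs only when trying to make p3 the primary replica.

The statement executed was: alter availability group groupName failover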

 

 

 

The errorlog recorded the following:

2015-12-14 09:57:47.18 spid52 ***Stack Dump being sent to F:\MSSQL12.DBAAGINS1\MSSQL\LOG\SQLDump0001.txt

2015-12-14 09:57:47.18 spid52 * *******************************************************************************

2015-12-14 09:57:47.18 spid52 *

2015-12-14 09:57:47.18 spid52 * BEGIN STACK DUMP:

2015-12-14 09:57:47.18 spid52 * 12/14/15 09:57:47 spid 52

2015-12-14 09:57:47.18 spid52 *

2015-12-14 09:57:47.18 spid52 * Location:     HadrFstrVnnUtils.cpp:479

2015-12-14 09:57:47.18 spid52 * Expression:     SUCCEEDED (hr)

2015-12-14 09:57:47.18 spid52 * SPID:         52

2015-12-14 09:57:47.18 spid52 * Process ID:     5412

2015-12-14 09:57:47.18 spid52 *

2015-12-14 09:57:47.18 spid52 * Input Buffer 255 bytes -

2015-12-14 09:57:47.18 spid52 * 16 00 00 00 12 00 00 00 02 00 00 00 00 00 00 00 00 00

2015-12-14 09:57:47.18 spid52 * ÿÿ & ç 01 00 00 00 ff ff 0d 00 00 00 00 01 26 04 00 00 00 e7

2015-12-14 09:57:47.18 spid52 * ÿÿ     þÿÿÿÿÿÿÿF ff ff 09 04 00 02 00 fe ff ff ff ff ff ff ff 46 00 00

2015-12-14 09:57:47.18 spid52 * @ P 1 n v a r c 00 40 00 50 00 31 00 20 00 6e 00 76 00 61 00 72 00 63

2015-12-14 09:57:47.18 spid52 * h a r ( 8 0 ) , @ 00 68 00 61 00 72 00 28 00 38 00 30 00 29 00 2c 00 40

2015-12-14 09:57:47.18 spid52 * P 2 b i g i n t 00 50 00 32 00 20 00 62 00 69 00 67 00 69 00 6e 00 74

2015-12-14 09:57:47.18 spid52 * , @ P 3 i n t 00 2c 00 40 00 50 00 33 00 20 00 69 00 6e 00 74 00 00

2015-12-14 09:57:47.18 spid52 * çÿÿ     þÿÿÿÿ 00 00 00 00 00 e7 ff ff 09 04 00 02 00 fe ff ff ff ff

2015-12-14 09:57:47.18 spid52 * ÿÿÿx e x e c s ff ff ff 78 00 00 00 65 00 78 00 65 00 63 00 20 00 73

2015-12-14 09:57:47.18 spid52 * p _ a v a i l a b 00 70 00 5f 00 61 00 76 00 61 00 69 00 6c 00 61 00 62

2015-12-14 09:57:47.18 spid52 * i l i t y _ g r o 00 69 00 6c 00 69 00 74 00 79 00 5f 00 67 00 72 00 6f

2015-12-14 09:57:47.18 spid52 * u p _ c o m m a n 00 75 00 70 00 5f 00 63 00 6f 00 6d 00 6d 00 61 00 6e

2015-12-14 09:57:47.18 spid52 * d _ i n t e r n a 00 64 00 5f 00 69 00 6e 00 74 00 65 00 72 00 6e 00 61

2015-12-14 09:57:47.18 spid52 * l @ P 1 , 1 , 00 6c 00 20 00 40 00 50 00 31 00 2c 00 20 00 31 00 2c

2015-12-14 09:57:47.18 spid52 * @ P 2 , @ P 3 00 20 00 40 00 50 00 32 00 2c 00 20 00 40 00 50 00 33

2015-12-14 09:57:47.18 spid52 * ç       H 8 00 00 00 00 00 00 00 e7 a0 00 09 04 00 02 00 48 00 38

2015-12-14 09:57:47.18 spid52 * e a 6 b e b 5 - 0 00 65 00 61 00 36 00 62 00 65 00 62 00 35 00 2d 00 30

2015-12-14 09:57:47.18 spid52 * d e 3 - 4 f 7 1 - 00 64 00 65 00 33 00 2d 00 34 00 66 00 37 00 31 00 2d

2015-12-14 09:57:47.18 spid52 * 9 0 b 5 - 3 5 d f 00 39 00 30 00 62 00 35 00 2d 00 33 00 35 00 64 00 66

2015-12-14 09:57:47.18 spid52 * d 1 0 3 6 5 c 2 00 64 00 31 00 30 00 33 00 36 00 35 00 63 00 32 00 00

2015-12-14 09:57:47.18 spid52 * & ø ¨ & © 00 26 08 08 f8 06 a8 0d 00 00 00 00 00 00 26 04 04 a9

2015-12-14 09:57:47.18 spid52 * UM 03 55 4d

2015-12-14 09:57:47.18 spid52 *

 

So I started by analyzing the dump file. The call stack that generated the dump is as follows:

Callstack
===
sqlmin!HadrFstrVnnUtils::GetRsFxEndpointPath+0x7e           
sqlmin!HadrFstrVnnUtils::SetClusterResourceProperties+0x153 
sqlmin!HadrFstrVnnUtils::RefreshWsfcConfig+0x299            
sqlmin!CHadrArProxy::RefreshFilestreamInWsfc+0xff           
sqlmin!CHadrArController::RefreshFilestreamInWsfc+0x4f      
sqlmin!CFstrSubscriber::Publish+0x138                       
sqlmin!CHadrPublisher::Publish+0x333                        
sqlmin!CHadrArProxy::PublishRoleChangeEvent+0x19d           
sqlmin!CHadrArProxy::Signal+0x469                           
sqlmin!CHadrArController::Online+0x1b5                      
sqlmin!CHadrArManager::OnlineAg+0x12d                       
sqlmin!SpAvailabilityGroupCommand+0x2f5    

   

After testing and investigation, I finally found the cause:

p1 and p2 both had Filestream and a Windows share configured, but p3 had neither.

 

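Explanation:

AlwaysOn and SQL Server failover clusters share a concept called the WSFC store (kept in the registry), which holds shared information. In AlwaysOn, if certain configuration settings on the primary change, those changes are also written to the WSFC store and then synchronized to the secondary replicas.

If the primary replica has Filestream enabled with a Windows share name, that information is saved in the WSFC store (registry) and synchronized to all replicas.

When a secondary replica receives the failover command, it reads its local WSFC store. If the store shows that Filestream and the Windows share are not enabled, the failover proceeds normally. If they are enabled, the replica tries to obtain the corresponding Windows share. If the current replica does not have Filestream enabled, or has no Windows share enabled, an exception occurs, the failover fails, and a dump file is generated.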
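
The decision described above can be sketched in a few lines of Python. This is only a model of the behavior as explained in this post, not SQL Server's actual implementation: the names `FilestreamConfig`, `FailoverError`, and `resolve_share_on_failover` are hypothetical, and the share name used in the example is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FilestreamConfig:
    """Filestream settings, either as recorded in the WSFC store
    or as configured locally on a replica (hypothetical model)."""
    enabled: bool
    share_name: Optional[str] = None


class FailoverError(Exception):
    """Stands in for the exception and dump described in the post."""


def resolve_share_on_failover(wsfc_store: FilestreamConfig,
                              local: FilestreamConfig) -> Optional[str]:
    """Model what the failover target does with the Filestream share."""
    # The store says Filestream/share is not enabled: nothing to resolve,
    # so the failover proceeds normally.
    if not (wsfc_store.enabled and wsfc_store.share_name):
        return None
    # The store says it is enabled, so the target must supply a share.
    # A replica without Filestream or without a share cannot do that.
    if not local.enabled or not local.share_name:
        raise FailoverError("local replica has no Filestream share to resolve")
    return local.share_name


if __name__ == "__main__":
    store = FilestreamConfig(enabled=True, share_name="FSSHARE")  # made-up name
    p2 = FilestreamConfig(enabled=True, share_name="FSSHARE")
    p3 = FilestreamConfig(enabled=False)
    print(resolve_share_on_failover(store, p2))  # failover to p2 is fine
    try:
        resolve_share_on_failover(store, p3)
    except FailoverError as exc:
        print("failover to p3 fails:", exc)
```

In this model, p2 mirrors the primary's settings and succeeds, while p3 fails because the store advertises a share that p3 cannot provide, which matches the symptom in this case.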

 

 

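Steps to reproduce:

Create an availability group with two replicas:

P1 is the primary replica

P2 is the secondary replica

Synchronous-commit mode.

The failover mode is manual for both.

P1 is configured as follows:

Filestream is enabled and a Windows share name is set.

If P2's configuration differs from P1's, the failover fails.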

 

 

The fix is simple:

Keep the configuration consistent across replicas.

If you don't need these features, disable them on all replicas.

Or enable them on all replicas.
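
As a pre-failover check, the settings of each replica can be compared against the primary. The Python sketch below assumes the values have already been collected from each instance, for example with `SERVERPROPERTY('FilestreamEffectiveLevel')` and `SERVERPROPERTY('FilestreamShareName')`. The function name and the input format are hypothetical, and treating a different share name as a mismatch is a conservative assumption rather than something this case confirmed.

```python
from typing import Dict, List, Optional, Tuple

# (filestream_enabled, share_name) for one replica; hypothetical input format.
ReplicaSettings = Tuple[bool, Optional[str]]


def find_inconsistent_replicas(settings: Dict[str, ReplicaSettings],
                               primary: str) -> List[str]:
    """Return the replicas whose Filestream settings differ from the primary's."""
    reference = settings[primary]
    return sorted(
        name
        for name, value in settings.items()
        if name != primary and value != reference
    )


if __name__ == "__main__":
    # Mirrors the case in this post; the share name is made up.
    settings = {
        "p1": (True, "FSSHARE"),
        "p2": (True, "FSSHARE"),
        "p3": (False, None),
    }
    print(find_inconsistent_replicas(settings, primary="p1"))  # prints ['p3']
```

An empty result means every secondary matches the primary, so a failover to any of them should not hit this particular problem.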
