[AlwaysOn] AlwaysOn可用性組的故障轉移和故障轉移模式[中英文對照] 6

cow977發表於2019-08-27

5 、強制故障轉移(可能丟失資料)

5 、Forced Failover (with Possible Data Loss)

強制執行可用性組的故障轉移(可能丟失資料)是一種災難恢復方法,可使你使用次要副本作為熱備用伺服器。因為強制執行故障轉移可能面臨資料丟失的風險,因此應審慎使用它。建議僅當您必須立即將服務還原到可用性資料庫並願意承擔資料丟失的風險時,才執行強制故障轉移。有關強制故障轉移的先決條件和建議,以及使用強制故障轉移從災難性故障中恢復的示例應用場景的詳細資訊,請參閱 。

Forcing failover of an availability group (with possible data loss) is a   disaster recovery method that allows you to use a secondary replica as a warm   standby server.Because forcing failover risks possible data loss, it should   be used cautiously and sparingly. We recommend forcing failover only if you   must restore service to your availability databases immediately and are   willing to risk losing data. For more information about the prerequisites and   recommendations for forcing failover and for an example scenario that uses a   forced failover to recover from a catastrophic failure, see .

警告

Warnings

強制故障轉移要求WSFC群集具有仲裁。有關配置仲裁和強制仲裁的資訊,請參閱 。

Forcing failover requires that the WSFC cluster have quorum. For   information about configuring quorum and forcing quorum, see .

5.1 、強制故障轉移的原理

5.1 、How Forced Failover Works

強制故障轉移會啟動一個將主角色轉換為角色處於輔助或正在解析狀態的目標副本的過程。故障轉移目標成為新的主副本,並立即將其資料庫副本提供給客戶端。當以前的主副本變得可用時,它將轉換為輔助角色,並且其資料庫將成為輔助資料庫。

Forcing failover initiates a transition of the primary role to a target   replica whose role is in the SECONDARY or RESOLVING state. The failover   target becomes the new primary replica and immediately serves its copies of   the databases to clients. When the former primary replica becomes available,   it will transition to the secondary role and its databases will become   secondary databases.

所有輔助資料庫(包括現在變得可用的以前的主資料庫)將掛起。根據掛起的輔助資料庫以前的資料同步狀態,它可能適合於補救該主資料庫的未能提交的資料。在配置為只讀訪問的輔助副本上,您可以查詢輔助資料庫以手動發現丟失的資料。然後,您可以對新的主資料庫發出Transact-SQL語句來進行必要的更改。

All secondary databases (including the former primary databases, when   they become available) are SUSPENDED. Depending on the previous data   synchronization state of a suspended secondary database, it might be suitable   for salvaging missing committed data for that primary database. On a   secondary replica that is configured for read-only access, you can query the   secondary databases to manually discover missing data. Then you can issue   Transact-SQL statements on the new primary databases to make any necessary   changes.

5.2 、強制故障轉移的風險

5.2 、Risks of Forcing Failover

一定要注意,強制故障轉移可能會造成資料丟失。這是因為目標副本無法與主副本進行通訊,從而不能保證兩個資料庫同步。強制故障轉移啟動新的恢復分叉。因為原始主資料庫和輔助資料庫位於不同的恢復分叉上,所以每個資料庫現在包含另一個資料庫不包含的資料:每個原始主資料庫包含任何尚未從其傳送佇列傳送到以前的輔助資料庫的更改(未傳送的日誌);以前的輔助資料庫包含任何強制故障轉移之後發生的更改。

It is essential to understand that forcing failover can cause data loss.   Data loss is possible because the target replica cannot communicate with the   primary replica and, therefore, cannot guarantee that the databases are   synchronized. Forcing failover starts a new recovery fork. Because the   original primary databases and secondary databases are on different recovery   forks, each of them now contains data that the other database does not   contain: each original primary database contains whatever changes were not   yet sent from its send queue to the former secondary database (the unsent   log); the former secondary databases contain whatever changes occur after   failover was forced.

如果因為主副本出現故障而強制進行故障轉移,則潛在的資料丟失取決於是否在出現故障之前已將所有事務日誌傳送到輔助副本。在非同步提交模式下,可能會始終存在累積的未傳送日誌。在同步提交模式下,可能僅在輔助資料庫同步之前會出現這種情況。

If failover is forced because the primary replica has failed, potential   data loss depends on whether or not any transaction logs had been sent to the   secondary replica before the failure. Under the asynchronous-commit mode,   accumulated unsent log is always a possibility. Under synchronous-commit   mode, this is possible only until the secondary databases become   synchronized.

下表總結了在強制故障轉移到該副本上時特定資料庫丟失資料的可能性。

The following table summarizes the possibility of data loss for a   particular database on the replica to which you force failover.

輔助副本的可用性模式

資料庫是否同步?

是否可能發生資料丟失?

同步提交

同步提交

非同步提交

Availability mode of Secondary Replica

Is database synchronized?

Is data loss possible?

Synchronous-commit

是Yes

否No

Synchronous-commit

否No

是Yes

Asynchronous-commit

否No

是Yes

輔助資料庫僅跟蹤兩個恢復分叉,因此,如果您執行多個強制故障轉移,則確實已與先前的強制故障轉移啟動資料同步的任何輔助資料庫都可能無法恢復執行。如果發生這種情況,則需要從可用性組中刪除無法恢復的所有輔助資料庫,還原到正確的時間點,然後重新加入可用性組。在此方案中,可能會發生狀態為103的錯誤1408(錯誤:1408,嚴重性:16,狀態:103)。還原不能跨多個恢復分叉執行,因此請確保在執行多個強制故障轉移後執行日誌備份。

Secondary databases track only two recovery forks, so if you perform   multiple forced failovers, any secondary database that did start data   synchronization with the previous force failover might not be able to resume.   If this occurs, any secondary databases that cannot be resumed will need to   be removed from the availability group, restored to the correct point in   time, and rejoined to the availability group. Error 1408 with state 103 may   be observed in this scenario (Error: 1408, Severity: 16, State: 103). A   restore will not work across multiple recovery forks, therefore, be sure to   perform a log backup after performing more than one forced failover.

5.3 、強制仲裁後需要強制故障轉移的原因

5.3 、Why Forced Failover is Required After Forcing Quorum

在對WSFC群集強制執行仲裁(強制仲裁)後,你需要在每個可用性組上執行強制故障轉移(可能會丟失資料)。強制故障轉移是必需的,因為WSFC群集的真實狀態值可能已丟失。在強制仲裁後需要防止常規故障轉移,因為在重新配置的WSFC群集上未同步的輔助副本很可能顯示為“已同步”。

After quorum is forced on the WSFC cluster ( forced quorum) you   need to perform a forced failover (with possible data loss) on each   availability group. The forced failover is required because the real state of   the WSFC cluster values might have been lost. Preventing normal failovers   after a forced quorum is required because of the possibility than an unsynchronized   secondary replica would appear to be synchronized on the reconfigured WSFC   cluster.

例如,考慮在3個節點上承載可用性組的WSFC群集:節點A承載主要副本,而節點B和節點C分別承載一個次要副本。節點C斷開了與WSFC群集的連線,而此時該節點上的本地輔助副本處於同步狀態。但是節點A和節點B仍可以正常仲裁,可用性組仍處於聯機狀態。在節點A上,主副本繼續接受更新,在節點B上,輔助副本繼續與主副本同步。節點C上的輔助副本就會變得不同步,並且越來越滯後於主副本。但是,由於節點C已斷開連線,該副本仍錯誤地處於同步狀態。

For example, consider a WSFC cluster that hosts an availability group on   three nodes: Node A hosts the primary replica and Node B and Node C each   hosts a secondary replica. Node C gets disconnected from the WSFC cluster   while the local secondary replica is SYNCHRONIZED. But Node A and Node B   retain a healthy quorum and the availability group remains online. On Node A,   the primary replica continues to accept updates, and on Node B, the secondary   replica continues to synchronize with the primary replica. The secondary   replica on Node C becomes unsynchronized and falls increasingly behind the   primary replica. However, because Node C is disconnected, the replica   remains, incorrectly, in the SYNCHRONIZED state.

如果仲裁丟失,然後在節點A上強制執行,則WSFC群集上可用性組的同步狀態應是正確的(節點C上的輔助副本顯示為未同步狀態)。但是,如果在節點C上強制執行仲裁,則可用性組的同步狀態將是不正確的。群集上的同步狀態將恢復為節點C斷開連線時所處的狀態(節點C上的輔助副本“錯誤地”顯示為同步狀態)。由於計劃的手動故障轉移確保了資料的安全性,在強制仲裁後它們不允許將可用性組恢復為聯機狀態。

If quorum is lost and is then forced on Node A, the synchronization   state of the availability group on the WSFC cluster should be correct-with   the secondary replica on Node C shown as UNSYNCHRONIZED. However, if quorum   is forced on Node C, the synchronization of the availability group will be   incorrect. The synchronization state on the cluster will have reverted back   to when Node C was disconnected-with the secondary replica on Node C incorrectly   shown as SYNCHRONIZED. Since planned manual failovers guarantee the safety of   the data, they are disallowed for bring an availability group back online   after quorum is forced.

5.4 、跟蹤可能的資料丟失

5.4 、Tracking Potential Data Loss

WSFC 群集正常仲裁時,您可以估計資料庫上當前可能的資料丟失量。對於給定的輔助副本,當前可能的資料丟失量取決於本地輔助資料庫滯後相應主資料庫的程度。因為滯後程度隨時間而變化,我們建議您定期跟蹤未同步的輔助資料庫可能的資料丟失情況。跟蹤滯後情況涉及比較每個主資料庫和輔助資料庫的上次提交LSN和上次提交時間,如下所示:

When the WSFC cluster has a healthy quorum, you can estimate the current   potential for data loss on databases. For a given secondary replica, the   current potential for data loss depends on how far the local secondary   databases are lagging behind the corresponding primary databases. Because the   amount of lag varies over time, we recommend that you periodically track   potential data loss for your unsynchronized secondary databases. Tracking lag   involves comparing the Last Commit LSN and Last Commit Time for each primary   database and its secondary databases, as follows:

1. 連線到主副本。

1.Connect to the primary replica.

2. 查詢 sys.dm_hadr_database_replica_states動態管理檢視的 last_commit_lsn(上次提交事務的LSN)和 (上次提交時間)列。

2.Query the last_commit_lsn (LSN of the last committed   transaction) and last_commit_time (time of the last commit) columns of   the   dynamic management view.

3. 比較為每個主資料庫和它的每個輔助資料庫返回的值。它們的上次提交LSN的差值指示滯後的程度。

3.Compare the values returned for each primary database and each of its   secondary databases. The difference between their Last Commit LSNs indicate   the amount of lag.

4. 當某個或某組資料庫上的滯後程度超過指定時間段的最大滯後程度時,您可以觸發一個警報。例如,可以透過每分鐘在每個主資料庫上執行的一個作業來執行查詢。如果自上次執行該作業以來,主資料庫的 last_commit_time和任意輔助資料庫的相應值的差值超過恢復點目標(RPO)(例如,5分鐘),該作業可能引發一個警報。

4.You can trigger an alert when the amount of lag on a database or set   of databases exceeds your desired maximum lag for a given period of time. For   example, the query can be run by a job that executes every minute on each   primary database. If the difference between the last_commit_time of a   primary database and any of its secondary databases has exceeded the recovery   point objective (RPO) (for example, 5 minutes) since the last time the job   executed, the job can raise an alert.

重要

Important

當WSFC群集缺少仲裁或已強制執行仲裁時, last_commit_lsnlast_commit_time為NULL。有關在強制仲裁後如何避免資料丟失的資訊,請參閱 。

When the WSFC cluster lacks quorum or quorum has been forced, last_commit_lsn   and last_commit_time are NULL. For information about how you might be   able to avoid data loss after you forced quorum, see "Potential Ways to   Avoid Data Loss After Quorum is Forced" in .

5.5 、管理潛在的資料丟失

5.5 、Managing the Potential Data Loss

強制故障轉移後,所有輔助資料庫都將掛起。這包括以前的主資料庫(在以前的主副本返回到聯機狀態並且發現它現在是輔助副本後)。您必須單獨在每個輔助副本上手動恢復每個掛起的資料庫。

After failover is forced, all secondary databases are suspended. This   includes the former primary databases, after the former primary replica comes   back online and discovers that it is now a secondary replica. You must   manually resume each suspended database individually on each secondary   replica.

以前的主副本可用後,假設其資料庫沒有損壞,則可以嘗試管理可能的資料丟失。管理潛在資料丟失的可用方法取決於原始主副本是否已連線到新的主副本。假設原始主副本可以訪問新的主例項,則會自動透明地進行重新連線。

Once the former primary replica is available, assuming that its   databases are undamaged, you can attempt to manage the potential data loss.   The available approach for managing potential data loss depends on whether   the original primary replica has connected to the new primary replica.   Assuming that the original primary replica can access the new primary   instance, reconnecting occurs automatically and transparently.

已重新連線原始主副本

The Original Primary Replica Has Reconnected

通常,出現故障後,原始主副本在重新啟動時便會迅速重新連線到其夥伴。重新連線後,原始主副本將成為輔助副本。其資料庫將成為輔助資料庫,然後進入掛起狀態。除非您恢復新的輔助資料庫,否則不會回滾它們。

Typically, after a failure, when the original primary replica restarts   it quickly reconnects to its partner. On reconnecting, the original primary   replica becomes the secondary replica. Its databases becomes the secondary   databases and enter the SUSPENDED state. The new secondary databases will not   be not rolled back unless you resume them.

但是,無法訪問掛起的資料庫;因此,不能對其進行檢查以確定恢復給定資料庫時可能丟失的資料。因此,確定是恢復還是刪除輔助資料庫取決於您是否能夠完全接受資料丟失,如下所示:

However, the suspended databases are inaccessible, so you cannot inspect   them to evaluate what data would be lost if you were to resume a given   database. Therefore, the decision on whether to resume or remove a secondary   database depends on whether you are willing to accept any data loss, as   follows:

·        如果資料丟失不可接受,則應該從可用性組中刪除資料庫以對資料進行補救。

·        If losing any data would be unacceptable, you should remove the   databases from the availability group to salvage them.

資料庫管理員現在可以恢復以前的主資料庫,並嘗試恢復可能已丟失的資料。但是,當以前的主資料庫處於聯機狀態後,它與當前主資料庫存在偏差,因此,資料庫管理員需要使客戶端無法訪問刪除的資料庫或當前主要資料庫,以免資料庫之間出現更大偏差並防止出現客戶端故障轉移問題。

The database administrator can now recover the former primary databases   and attempt to recover the data that would have been lost. However, when a   former primary database comes online, it is divergent from the current   primary database, so the database administrator needs to make either the   removed database or the current primary database inaccessible to clients to   avoid further divergence of the databases and to prevent client-failover   issues.

·        如果資料丟失對於您的業務目標是可以接受的,您可以恢復輔助資料庫。

·        If losing data would be acceptable to your business goals, you can   resume the secondary databases.

恢復輔助資料庫會導致它如同步資料庫第一步所述那樣回滾。如果出現故障時日誌記錄在傳送佇列中等待,則相應的事務將會丟失,即使已提交這些事務也會如此。

Resuming a new secondary database causes it to be rolled back as the   first step in synchronizing the database. If any log records were waiting in the   send queue at the time of failure, the corresponding transactions are lost,   even if they were committed.

未重新連線原始主副本

The Original Primary Replica Has Not Reconnected

如果可以暫時阻止原始主副本透過網路重新連線到新的主副本,則可以檢查原始主資料庫以確定恢復它們時可能丟失的資料。

If you can temporarily prevent the original primary replica from   reconnecting over the network to the new primary replica, you can inspect the   original primary databases to evaluate what data would be lost if they were   resumed.

·        如果潛在的資料丟失可以接受

·        If the potential data loss is acceptable

允許原始主副本重新連線到新的主副本。重新連線會導致新的輔助資料庫被掛起。要啟動資料庫的資料同步,只需恢復它。新的輔助副本會刪除該資料庫的原始恢復分叉,從而丟失從未傳送到以前的輔助副本或由其接收的所有事務。

Allow the original primary replica to reconnect to the new primary   replica. Reconnecting causes the new secondary databases to be suspended. To   start data synchronization on a database, simply resume it. The new secondary   replica drops the original recovery fork for that database, losing any   transactions that were never sent to or received by the former secondary   replica.

·        如果資料丟失不可接受

·        If the data loss is unacceptable

如果原始主資料庫包含在恢復掛起的資料庫時可能丟失的重要資料,則可以從可用性組中刪除它,以保留原始主資料庫中的資料。這樣會導致資料庫進入“正在還原”狀態。此時,我們建議您嘗試備份已刪除資料庫的日誌尾部。然後,透過從原始主資料庫中匯出要補救的資料,並將其匯入當前主資料庫來更新當前主資料庫(以前的輔助資料庫)。建議儘快對已更新的主資料庫執行完整資料庫備份。

If the original primary database contains critical data that would be   lost if you resumed the suspended database, you can preserve the data on the   original primary database by removing it from the availability group. This   causes the database to enter the RESTORING state. At this point, we recommend   that you attempt to back up the tail of the removed database's log. Then, you   can update the current primary (the former secondary database) by exporting   the data you want to salvage from the original primary database and importing   it into the current primary database. We recommend taking a full database   backup of the updated primary database as quickly as possible.

然後,在承載新的輔助副本的伺服器例項上,您可以使用RESTORE WITH NORECOVERY來還原此備份(以及至少一個後續日誌備份),從而刪除掛起的輔助資料庫並建立新的輔助資料庫。我們建議延遲當前主資料庫的其他日誌備份,直到恢復相應的輔助資料庫。

Then, on the server instance that hosts the new secondary replica, you   can delete the suspended secondary database and create a new secondary   database by restoring this backup (and least one subsequent log backup) using   RESTORE WITH NORECOVERY. We recommend delaying additional log backups of the   current primary databases until the corresponding secondary databases are   resumed.

警告

Warnings

在其任何輔助資料庫被掛起時,事務日誌截斷在主資料庫上被延遲。此外,只要任何本地資料庫保持掛起狀態,同步提交輔助副本的同步執行狀況就無法轉換到“正常”。

Transaction log truncation is delayed on a primary database while any of   its secondary databases is suspended. Also the synchronization health of a   synchronous-commit secondary replica cannot transition to HEALTHY as long as   any local database remains suspended.


來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/81227/viewspace-2655027/,如需轉載,請註明出處,否則將追究法律責任。

相關文章