IT Project Revelations: Lessons from the Titanic (Part 12) (Repost)

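Posted by ger8 on 2007-08-13
Recapping Titanic's situation: after the ship restarted (see Part 10), the flooding turned catastrophic. At around 12:45 a.m., 65 minutes after the hull grounded on the ice shelf, the captain ordered the officers to uncover the lifeboats and muster all passengers and crew on deck. The crew, confused by unclear communication (see Part 11), acted hesitantly and did not believe that anything had gone wrong. After all, there were still few visible signs of disaster.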

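Today, disaster recovery means switching the online operation to an alternate service-delivery environment. It takes many forms, however, from the simple recovery of a single application's data and files within days to the relatively complex recovery of an entire business operation within minutes or hours. A disaster can take one of three forms: total (absolute and immediate), rapid and imminent, or slow and innocuous. Once a disaster is recognized, contingency plans are invoked and the disaster is declared.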

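On board Titanic, the disaster was of the slow and innocuous kind. Although a full recovery was no longer feasible, the captain and officers could still carry out a partial one. But without a formal evacuation or disaster recovery plan, the best they could do was impose some order to keep panic and chaos from spreading once the signs of disaster became obvious. The disaster recovery scenario envisioned at design time (see Part 3) was that lifeboats would transfer passengers to another ship, which would then carry them to port. The lifeboats would shuttle back and forth, so far fewer of them were needed. This scenario rested on the assumption that Titanic could not sink and would, at worst, stay afloat in a disabled state awaiting help.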

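Today, when we define a disaster recovery plan, we must consider every type of failure in the IT solution that could lead to a disaster. For example:
● Physical faults or defects in the technology
● Design errors, including system or application software design failures and bugs
● Operations errors caused by operations staff through accident, inexperience, insufficient training, failure to follow procedures, or even malice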

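Environmental failures (in power supplies, cooling systems, and network connections, for example) can be just as devastating to the operations center as natural disasters and terrorist activity.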
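The failure types listed above can be turned into a simple coverage check for a plan. The following is a minimal Python sketch, not a method from the original article: the `FailureType` categories mirror the list above, while the `Scenario` class, the `uncovered_types` helper, and the sample scenarios are hypothetical names and data chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    """Failure categories taken from the list in the article."""
    PHYSICAL = "physical fault or defect in the technology"
    DESIGN = "design error, including software design failures and bugs"
    OPERATIONS = "operations error by staff"
    ENVIRONMENTAL = "environmental failure or external event"


@dataclass
class Scenario:
    name: str                  # hypothetical scenario label
    failure_type: FailureType
    has_contingency: bool      # is a documented response in the plan?


def uncovered_types(scenarios):
    """Return failure types with no scenario that has a contingency."""
    covered = {s.failure_type for s in scenarios if s.has_contingency}
    return [t for t in FailureType if t not in covered]


# Hypothetical plan contents, for illustration only.
plan = [
    Scenario("disk array failure", FailureType.PHYSICAL, True),
    Scenario("bad release corrupts data", FailureType.DESIGN, False),
    Scenario("operator deletes production table", FailureType.OPERATIONS, True),
]

gaps = uncovered_types(plan)
print([t.name for t in gaps])  # prints ['DESIGN', 'ENVIRONMENTAL']
```

In this sample, the design category is listed but has no contingency and the environmental category is missing altogether, so both are reported as gaps.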

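Over the preceding 400 years, most of the environmental factors involved in crossing the Atlantic had been observed, charted, and documented. These ranged from year-round natural conditions (such as changing ocean currents) and weather patterns (such as storms and hurricanes) to natural hazards (such as fog banks, ice fields, iceberg zones, dangerous shorelines, and rocks). Yet a belief had taken hold within the Titanic project (see Part 4) that this enormous, practically unsinkable iron ship could handle anything nature threw at it.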

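When defining a disaster recovery plan, the scale of the disaster must also be considered. For example, if a relatively minor storm, fire, or flood knocks out your online operation, your customers will expect some contingency service relatively quickly. Today you need contingency measures for all of these, up to and including the most catastrophic disasters.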

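The costs associated with disaster recovery vary with the recovery window (time), the elements of the disaster, and the degree of recovery required. As part of the plan, these costs should be determined carefully for each specific IT solution.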

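For Titanic, maritime convention called for a disaster recovery plan covering all of the above situations: bring everyone to the lifeboat deck, load them into lifeboats with seats to spare, lower the boats safely, and set them adrift under experienced crews. The lifeboat drill in Queenstown should have tested this last part of the plan (see Part 5).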

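Many serious problems in a production environment start out so innocuously that, at first, your organization may not even notice the problem or its implications. For example, a less critical part of the IT solution goes down and nobody notices. But because of the interdependencies between components and applications, a knock-on effect sets in and quickly spreads to other parts of the solution, leading to a catastrophic failure in a very short time.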
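The knock-on effect described above can be illustrated by walking a dependency graph. This is only a sketch under stated assumptions: the component names and the `depends_on` map are hypothetical, and `affected_by` is an illustrative helper that uses a plain breadth-first search, not anything prescribed by the article.

```python
from collections import deque

# Hypothetical components: each key depends on the components in its list.
depends_on = {
    "web_frontend": ["order_service"],
    "order_service": ["database", "message_queue"],
    "reporting": ["database"],
    "message_queue": ["log_shipper"],
    "database": [],
    "log_shipper": [],
}


def affected_by(failed, depends_on):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the graph: component -> components that depend on it.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk outward from the failed component.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


impacted = affected_by("log_shipper", depends_on)
print(sorted(impacted))  # prints ['message_queue', 'order_service', 'web_frontend']
```

Here the failure of a seemingly minor component (`log_shipper`) reaches the customer-facing front end through two intermediate dependencies, which is the kind of chain that is easy to miss in the first hour.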

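On board Titanic, the lifeboats were launched conspicuously late, indicating a reluctance to lower them until the last possible moment. The officers probably reacted slowly because the ship was believed to be unsinkable, the gravity of the situation was not apparent, and everything still appeared normal. In addition, only 83 of the 900 crew were actual mariners (see Part 5), and only they were familiar with the complex drill of lowering a 30-foot lifeboat (65 persons) 60 feet to the water. There were 16 of these lifeboats, plus four smaller collapsible lifeboats (45 persons), known as Engelhardts.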
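The figures in this paragraph can be combined into a quick capacity check. The sketch below uses only the numbers quoted in this article (other accounts of Titanic give somewhat different figures); the variable names are illustrative.

```python
# Figures as quoted in the article; variable names are illustrative.
STANDARD_BOATS, STANDARD_SEATS = 16, 65
COLLAPSIBLE_BOATS, COLLAPSIBLE_SEATS = 4, 45
CREW, MARINERS = 900, 83

total_boats = STANDARD_BOATS + COLLAPSIBLE_BOATS
total_seats = (STANDARD_BOATS * STANDARD_SEATS
               + COLLAPSIBLE_BOATS * COLLAPSIBLE_SEATS)
mariners_per_boat = MARINERS / total_boats
mariner_share = MARINERS / CREW

print(total_boats)                  # prints 20
print(total_seats)                  # prints 1220
print(round(mariners_per_boat, 2))  # prints 4.15
print(f"{mariner_share:.1%}")       # prints 9.2%
```

On these numbers, roughly four trained mariners were available per boat, and fewer than one crew member in ten knew the lowering drill.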

Conclusions

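Today, many IT projects ignore disaster recovery entirely, on the grounds that it is outside the project's scope and is covered by a separate annual planning process. Yet it is the IT project that establishes the business justification and the design of the IT solution, and that develops an in-depth understanding of the recovery required. Serious thought about the consequences of a disaster affecting the IT solution needs to happen early in the project, so that the overall disaster recovery plan can be adjusted. The next part will continue to look at disaster recovery.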

Original text:

In recapping Titanic’s situation, following the restart of the ship (Part 10) the flooding became catastrophic. Around 12:45 a.m., 65 minutes after the initial grounding on the ice shelf, the captain gave orders to the officers to uncover the lifeboats and get the passengers and crew ready on deck. The crew, confused by unclear communication (Part 11), operated in a state of disbelief, refusing to believe that anything was wrong. After all, there were still few signs of the disaster.

In today’s world, disaster recovery is the concept of switching the online operation to an alternate service-delivery environment. However, it takes many shapes and forms, from the relatively simple recovery of data and files from a single application in a timeframe measured in days, to the relatively complex recovery of a complete business operation in a timeframe measured in minutes or hours. A disaster can take three forms, namely: total (absolute and immediate), rapid and imminent, slow and innocuous. When a disaster is recognized, contingency plans are invoked and a disaster is declared.

On board Titanic, the disaster was slow and innocuous. Although a full recovery was not feasible anymore, the captain and officers could enact a partial recovery. But without a formalized evacuation or disaster recovery plan, the best they could do was to bring some order to prevent widespread panic and chaos once the disaster signs became more obvious. The envisioned scenario for disaster recovery, at the time of the design (Part 3), was to transfer passengers through lifeboats to another ship and then deliver them to port. The lifeboats would ferry passengers back and forth to the rescue ship, requiring a much smaller total lifeboat capacity. This scenario was based on the perception that Titanic could not possibly sink, but would float in an incapacitated state waiting for help.

In today’s world, when defining a disaster recovery plan, thought needs to be given to all the types of failures that could possibly happen to an IT solution and lead to a disaster. For example:
· Physical faults or failures in the technology
· Design errors which include system or application software design failures and bugs
· Operations errors caused by operations services staff because of accidents, inexperience, lack of due diligence or training, not following procedures or even malice

Environmental failures can be equally devastating, such as those in power supplies, cooling systems and network connections--as can natural disasters and terrorist activities against the operation center itself.

In the past 400 years, most environmental factors related to crossing the Atlantic had been observed, charted and documented. This included everything from year-round natural conditions like changing ocean currents and weather patterns like storms and hurricanes to natural hazards like fogbanks, ice fields and iceberg areas, and dangerous shorelines and rocky outcrops, etc. However, a belief had evolved during Titanic’s project (Part 4) that anything that nature could hand out could be handled by this enormous iron ship that was practically unsinkable.

In defining a disaster recovery plan, the scale of disaster is important to consider as well. For example, if a relatively minor storm, fire or flood knocks out your online operation, your customers are going to expect some contingency of service relatively quickly. In today’s world, you need contingency for all of these, even the most catastrophic disasters.

The associated costs of disaster recovery vary, based on the window of recovery (time), the elements of the disaster and the degree of recovery required. As part of a plan, these costs need to be carefully determined specifically for the IT solution created.

For Titanic, under maritime convention there should have been a disaster recovery plan defined for all the above situations that brought everyone on board to the lifeboat deck, loaded them into the lifeboats with places to spare, lowered the lifeboats safely, and put them adrift with experienced crews to handle them. The lifeboat drill in Queenstown should have tested the latter part of the plan (Part 5).

Many serious problems with a production environment can start so innocuously that, in the first hour, your organization might not even be aware of it or its implications. For example, a less-critical part of the IT solution might be "down," so it goes unnoticed. However, because of interdependencies between components and applications, there tends to be a "knock on" effect and very quickly other parts of the IT solution can become affected. This leads to a catastrophic failure in a very short time.

On board Titanic there was a major delay in getting the lifeboats down, indicating a hesitation to launch the boats until as late as possible. It is likely the officers reacted slowly for several reasons: the ship was believed to be unsinkable, the gravity of the situation was not apparent and everything appeared so normal at the time. Also, only 83 of the crew of 900 were actual mariners (Part 5) and therefore familiar with the somewhat complex drill of lowering a 30-foot (65-person) lifeboat 60 feet to the water. There were 16 of these lifeboats in total, plus four smaller collapsible lifeboats (45-person) or "Engelhardts."

Conclusions
Today, many IT projects completely ignore disaster recovery as something beyond their scope and covered off by a yearly IT planning process. Yet it is the IT project that determines the business justification and design around the IT solution, and develops an in-depth understanding of the kind of recovery that is required. Serious thought needs to be given to the consequences of a disaster impacting the IT solution, and this needs to be done early enough in the project so that adjustments to the overall disaster recovery plan can be made. The next installment will continue to look at disaster recovery.

From "ITPUB Blog", link: http://blog.itpub.net/7839396/viewspace-955628/. If you repost this article, please credit the source; otherwise legal action may be pursued.
