Lessons for IT Projects from the Titanic (Part 9) (repost)

Posted by ger8 on 2007-08-13
To briefly recap the situation: Titanic's officers had tried desperately to avoid a collision (see Part 8). However, the S-turn, correct decision though it was, failed to slow the ship enough. Hundreds of passengers later described how the ship came to a halt almost innocuously, with a quiver, a rumble and a few seconds of grinding noise, as if the hull were rolling sideways over a great number of stone marbles.

There was no so-called "crash stop," no fatalities, not even minor injuries. Nor was there a violent sideways jolt or repeated strikes along the ship's side, which is what would have occurred had the hull been straining to clear an iceberg striking it from the side. The breakfast cutlery laid out in the dining saloons barely trembled, and not a drink was spilled in the first-class smoking room and lounges. All the evidence indicates that the bottom of the ship had simply come to rest on an ice shelf at the underwater base of the iceberg. First Officer Murdoch had averted a head-on crash that could have demolished the first four compartments and killed or maimed hundreds of passengers.

Likewise, when an IT solution becomes unstable in production, a series of actions is taken according to a process that was prepared, planned and tested within the project itself (see Part 4). The process should be built around a Mean Time To Recovery (MTTR) clock, and its principal objective is to get the IT solution back online as quickly as possible so that the Service Level Agreements (SLAs) are met. The solution is then patched up in the background with a temporary or permanent fix.

Of course, before the solution goes back online, its integrity first needs to be established so that the failure does not recur. With an eye on the clock, the operations group steps through the process and the four "problem" quadrants: detection, determination, resolution and recovery. Once the MTTR clock starts ticking, signifying the beginning of a loss of service (an outage, see Part 2), metrics should be captured as User Outage Minutes (UOMs), which measure how many users are without service and for how long.
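To make the MTTR clock and the UOM metric concrete, here is a minimal sketch in Python; the timestamps, user count and field names are invented for illustration and are not from the article itself:

```python
from datetime import datetime

# Hypothetical timestamps for one outage, marking the four "problem" quadrants.
incident = {
    "detection_start":    datetime(2024, 1, 15, 2, 0),   # MTTR clock starts: outage begins
    "determination_done": datetime(2024, 1, 15, 2, 25),  # root cause confirmed
    "resolution_done":    datetime(2024, 1, 15, 3, 10),  # fix chosen and applied
    "recovery_done":      datetime(2024, 1, 15, 3, 30),  # service back online per SLA
}
affected_users = 1200  # users who lost service for the whole outage (assumed)

outage_minutes = (incident["recovery_done"] - incident["detection_start"]).total_seconds() / 60
user_outage_minutes = affected_users * outage_minutes  # the UOM metric

print(f"MTTR for this incident: {outage_minutes:.0f} minutes")
print(f"User Outage Minutes (UOM): {user_outage_minutes:,.0f}")
```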

This is far more accurate than the commonly used measure of service availability as a percentage, e.g. 99.999 percent. For Titanic, the "problem detection" phase of the MTTR clock was the 37 seconds of warning given by the lookouts. Such a detection window is not typical of an IT solution, which will usually emit errors and warnings well before any major failure occurs, giving operators, automated or human, time to head off the problem in the first place (see Part 8).
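A small, hypothetical comparison illustrates why a single availability percentage can hide the real impact; the figures below are made up purely to show the contrast:

```python
# Two hypothetical outages with the same downtime-based availability figure
# but very different user impact, illustrating why UOM is the sharper metric.
MINUTES_PER_MONTH = 30 * 24 * 60
TOTAL_USERS = 10_000

outages = {
    "outage_a": {"minutes": 26, "affected_users": 10_000},  # everyone down for 26 minutes
    "outage_b": {"minutes": 26, "affected_users": 50},      # a small group down for 26 minutes
}

for name, o in outages.items():
    availability = 100 * (1 - o["minutes"] / MINUTES_PER_MONTH)
    uom = o["minutes"] * o["affected_users"]
    print(f"{name}: availability={availability:.3f}%  UOM={uom:,}")
# Both report roughly the same availability, yet UOM differs by a factor of 200.
```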

Titanic's captain, director and officers then gathered on the bridge to determine a course of action. As part of problem determination, to establish the extent of the damage, two search parties were sent into the forward and middle sections of the ship. The first party returned within 10 minutes with a positive report: no major damage, no flooding. In the mind of the director, Bruce Ismay, problem detection and determination were now complete. Completing the next step, resolution, by sending a distress or rescue signal was a real problem for him, however: it would bring a storm of bad publicity down on Titanic, compromise White Star's market position, and destroy the brilliant marketing effect that had lured the world's wealthy elite aboard the safest liner ever built.

In fact, a better resolution at that point would have been to take the ship back to Halifax in Canada, away from New York, the center of the world's press. That way he could have spun a better news story and marginalized the whole affair as a minor incident. He could have put the passengers onto trains, patched the ship up, and sailed her back to Belfast for full repairs. Indeed, he could even have boasted that Titanic, fitted with the latest emergency systems and a lifeboat in herself, had saved herself from the brink of a major disaster, and pushed the safety claims of the White Star line even harder.

In today's IT solutions, problem determination assesses the impact on users. The determination itself must be backed by evidence; re-examining the feedback mechanisms and the logs is vital for establishing whether the problem has been escalating and what is triggering it.

In a large, complex IT solution it is common to see a domino effect, in which a small faulty element, such as one subsystem, knocks out its neighbours and triggers a cascade of follow-on problems. Failing to work out the precise sequence of these related failure events can lead to misdiagnosis, to the wrong fix being applied, and to the problem recurring. Problem determination is only complete once the assumptions about the root cause have been tested and proven correct.
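One simple way to untangle such a cascade is to order the alerts by time and trace dependencies back to the earliest failing component. The sketch below assumes hypothetical component names, timestamps and a dependency map:

```python
# Alerts as (component, ISO timestamp); names and times are invented.
alerts = [
    ("web_frontend", "2024-01-15T02:03:40"),
    ("auth_service", "2024-01-15T02:02:10"),
    ("session_db",   "2024-01-15T02:01:05"),
    ("api_gateway",  "2024-01-15T02:03:55"),
]
depends_on = {
    "web_frontend": ["api_gateway"],
    "api_gateway":  ["auth_service"],
    "auth_service": ["session_db"],
    "session_db":   [],
}

# Walk the failures in time order; the earliest alert with no failed dependency
# is the best root-cause candidate, the rest are likely dominoes.
failed = {name for name, _ in alerts}
for name, ts in sorted(alerts, key=lambda a: a[1]):
    if not any(dep in failed for dep in depends_on.get(name, [])):
        print(f"Root-cause candidate: {name} (first alert at {ts})")
        break
```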

For an IT solution, it is important to be sure of the evidence at hand and to ask the following questions: Did the solution know it was about to fail? If so, did any (automated) preventative actions take effect? Did those actions alert human or automated operators? Were any of the feedback mechanisms themselves faulty, or did they return unreliable data? Is the diagnosis of the problem correct?
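These questions can be captured as a small evidence checklist; the sketch below is just one possible structure, with hypothetical field names and example answers:

```python
from dataclasses import dataclass

@dataclass
class EvidenceReview:
    solution_predicted_failure: bool      # did the solution know it was going to fail?
    preventative_actions_attempted: bool  # were any (automated) preventative actions tried?
    operators_alerted: bool               # were human or automated operators notified?
    feedback_data_reliable: bool          # did feedback mechanisms give trustworthy data?
    diagnosis_confirmed: bool             # is the diagnosis of the problem correct?

    def gaps(self) -> list[str]:
        """Return the questions that still point to a weakness."""
        return [name for name, ok in vars(self).items() if not ok]

# Example answers, invented for illustration.
review = EvidenceReview(True, False, True, False, False)
print("Open issues:", review.gaps())
```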


Titanic's situation was critical, but not yet catastrophic. Ismay was intent on saving face, and his anxiety over White Star's good name created an atmosphere in which mistakes were easily made. Sitting on the underwater ice shelf, Titanic appeared perfectly stable; with due care she might even have been dislodged with minimal damage. Yet Ismay rushed into a decision before the second search party, which included the ship's architect and carpenter, had even returned with its damage assessment.

The lesson for today's IT projects is that, in resolving a problem, it is important to examine each alternative course of action and the risk associated with it, based on all the evidence collected. Only then should the final quadrant, recovery, begin, in which the operations group brings the IT solution back online and resumes service in accordance with the SLAs.
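As an illustration, the choice among alternatives can be framed as filtering by the SLA recovery target and then picking the lowest-risk option; the options and numbers below are entirely invented:

```python
# Hypothetical SLA target and candidate resolution actions.
SLA_MAX_RECOVERY_MINUTES = 120

options = [
    {"action": "fail over to standby site", "est_recovery_min": 30, "risk": 0.2},
    {"action": "restart faulty subsystem",  "est_recovery_min": 15, "risk": 0.5},
    {"action": "roll back last deployment", "est_recovery_min": 60, "risk": 0.1},
]

# Keep only options that can meet the SLA, then pick the lowest-risk one.
feasible = [o for o in options if o["est_recovery_min"] <= SLA_MAX_RECOVERY_MINUTES]
best = min(feasible, key=lambda o: o["risk"])
print("Chosen resolution:", best["action"])
```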

On Titanic, not all courses of action were adequately considered as part of problem resolution. Ismay made the fateful, wrong decision to keep going, and telegraphed the engine room "dead slow ahead" to complete the recovery. Engineers later testified that the ship moved forward at 3 knots, accompanied by a grinding noise.

Conclusions

Today, many IT projects severely compromise the operations stage through a specific planning gap: no process for handling problems around an MTTR clock is planned within the project. A well-planned process is critical for enabling the operations group to restore service quickly and maintain service levels. It should also build in checks and balances, through reviews, to minimize the likelihood of mistakes being made under pressure. And it should set out a structure of roles and responsibilities so that the right personnel make the right decisions.
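A sketch of what such a roles-and-responsibilities structure might look like is shown below; the role names are purely illustrative, not prescribed by the article:

```python
# Map each "problem" quadrant to an accountable owner and a reviewing role,
# so that decisions under pressure still pass through a check and balance.
runbook_roles = {
    "detection":     {"owner": "monitoring / on-call engineer", "review": "shift lead"},
    "determination": {"owner": "incident manager",              "review": "subject-matter expert"},
    "resolution":    {"owner": "incident manager",              "review": "change advisory board"},
    "recovery":      {"owner": "operations group",              "review": "service owner"},
}

for quadrant, roles in runbook_roles.items():
    print(f"{quadrant:>13}: decided by {roles['owner']}, reviewed by {roles['review']}")
```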

The next installment will look at how Titanic's officers reacted once the situation turned into a disaster.

Original text:

In recapping the famous ship’s situation, Titanic’s officers tried desperately to avoid a collision (see Part 8). However, the S-turn, a good decision, failed to decelerate the ship enough. Titanic almost innocuously came to a halt later described by hundreds of passengers as a quiver, rumble or grinding noise that lasted a few seconds as if the ship was rolling over a thousand marbles.

There was no "crash stop," fatalities or even minor injuries. There was no violent jolt sideways or repeated strikes along the ship’s length. This is common with a side swipe against an ice spur when a ship is turning very hard away from it. The breakfast cutlery that was laid out in the dining salons barely trembled, and drinks remained unspilled in the first class smoking rooms and lounges. All the evidence indicates that the ship came to rest on an underwater ice shelf at the base of the iceberg. Murdoch had prevented a head on crash that could have demolished the first 4 compartments, and killed and maimed hundreds of passengers.

Likewise, when an IT solution falters in production, steps are taken according to a process prepared, planned and tested in the project itself (see Part 4). The process should be based around a Mean Time To Recovery (MTTR) clock, where the principal objective is to get the IT solution back on-line as quickly as possible to meet Service Level Agreements (SLAs). The solution is then patched up in the background and a temporary or permanent fix applied.

However, before going on-line, the integrity of the solution needs to be first established so the problem does not reoccur. With an eye on the clock, the operations group steps through the process and the four "problem" quadrants of detection, determination, resolution and recovery. When the MTTR clock starts ticking, signifying the beginning of loss of service (an outage, see Part 2), metrics should be captured as User Outage Minutes (UOMs), which measure how many users experience service loss and for how long.

This is far more accurate than measuring with the more commonly used percentage of service availability, e.g., 99.999 percent. Problem detection on Titanic was 37 seconds of warning given by the lookouts. This is not typical with an IT solution, which is likely to put out errors and warnings well before any significant failure occurs. This provides operators, automated or human, time to prevent the problem from occurring in the first place (see Part 8).

Titanic’s captain, director and officers gathered on the bridge to determine a course of action. As part of problem determination of the extent of the damage, two search parties were dispatched into the bowels of the ship, front and mid-ship. The first party returned within 10 minutes with a positive report of no major damage or flooding. In director Bruce Ismay’s mind, problem detection and determination were now complete. Resolution with a distress call was a problem for him as it would compromise White Star’s position by shattering the hype around Titanic and destroy the brilliant marketing (see Part 2 and Part 5) that had lured the world’s wealthy elite onto the safest liner ever built.

A better resolution would be to get the ship back to Halifax, away from New York and the center of the world’s press. He could then better contain the news story, and marginalize it as a minor incident. He would be able to disembark passengers onto trains, patch the ship up and sail her back to Belfast for repairs. In fact, he could boldly claim that Titanic, a lifeboat in itself with all the latest in emerging technologies, was able to save herself from a potential disaster and further push the safety claims of White Star lines.

With an IT solution today, determination of the problem assesses the impact of the solution on users. Determination has to be consistent with the available evidence. Reinvestigation of feedback mechanisms and logs is vital to determine if the problem has been building up and what is causing it.

In a complex IT solution, it is common to see the domino effect, where a small faulty element like a subsystem knocks out elements around it and triggers a cascade of problems. Not working out this precise sequence of events could lead to a misdiagnosis where a wrong fix is applied and the problem reoccurs. Determination is completed when the root cause assumptions of the problem are tested and proven to be correct.

With an IT solution it is important to be sure of the evidence at hand and to ask the following questions. Was the IT solution aware it was going to fail? If so, were any (automated) preventative actions attempted? Did it alert human or automated operators? Were any of the feedback mechanisms faulty and provide unreliable data? Is the diagnosis of the problem correct?

Titanic’s situation was critical but not catastrophic. Ismay was hell-bent on saving face and his anxiety over White Star’s reputation created an atmosphere where mistakes were easily made. Titanic appeared to be completely stable, sitting snugly on the underwater ice shelf. Maybe with due care they could dislodge the ship with a minimum of damage. Ismay rushed into making a decision. The second search party with the architect and carpenter had not even returned with an assessment.

The lesson from this for IT projects today is that in resolving the problem it is important to consider the alternative courses of action available with the risk associated with each based on all the collected evidence. Only then should the last quadrant of recovery commence. This is where the operations group puts the IT solution back on-line and resumes services, according to SLAs.

On Titanic, not all courses of action were adequately explored as part of the problem resolution. Ismay made the fateful decision to sail forward and telegraphed the engine room "dead slow ahead" in recovering the situation. Engineers later testified the ship moved forward at 3 knots with a grinding noise.

Conclusions
Today, many IT projects severely compromise the operation stage by not planning adequately in the project for a process to deal with problems around a MTTR clock. A process is critical for enabling the operations group to quickly restore service and maintain service levels. A process should also carry the checks and balances (through reviews) to minimize the likelihood of mistakes made in a pressure situation. A process should outline responsibilities and roles to ensure the right personnel make the right decisions.

The next installment will look at how the officers reacted to the disastrous situation.

