IT Project Revelations: Lessons from the Titanic (Part 10) (Repost)

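Posted by ger8 on 2007-08-13
To recap Titanic's situation: after the collision with the iceberg (see Part 8), the ship still appeared sound and no one had been injured. From the bridge, the integrity of the ship looked intact. White Star Line director Bruce Ismay was determined to save face and protect his company's reputation (see Part 9). At 11:50 p.m., 10 minutes after the collision, Ismay pushed to restart the ship, and Titanic limped off the ice shelf. Passengers, unaware of any danger, were relieved that the ship was under way again and gave little thought to the collision, the potential damage, or its consequences.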

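In today's IT projects it is vital that the Mean Time To Recovery (MTTR) procedures for the IT solution (see Part 9) are set up, prepared, planned, and tested within the project itself (see Part 4), and "institutionalized" with dedicated staff (operations groups and technical support). Data collected in the second quadrant of a problem (the four quadrants are detection, determination, resolution, and recovery) has to stand up to rigorous review.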
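As a rough illustration of the four quadrants, the sketch below records when each quadrant of an outage ended and derives the time spent in each, along with the mean time to recovery over several outages. The `Outage` class, its field names, and `mean_time_to_recovery` are hypothetical and not part of the original article; they are one possible way to model the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Timestamps closing each quadrant of one outage (hypothetical model)."""
    started: datetime     # the fault begins
    detected: datetime    # detection ends: someone knows there is a problem
    determined: datetime  # determination ends: the cause has been identified
    resolved: datetime    # resolution ends: a fix has been applied
    recovered: datetime   # recovery ends: service is back to normal

    def quadrant_durations(self) -> dict[str, timedelta]:
        """Time spent in each of the four quadrants."""
        return {
            "detection": self.detected - self.started,
            "determination": self.determined - self.detected,
            "resolution": self.resolved - self.determined,
            "recovery": self.recovered - self.resolved,
        }

    def time_to_recovery(self) -> timedelta:
        """Total time from the start of the fault to full recovery."""
        return self.recovered - self.started


def mean_time_to_recovery(outages: list[Outage]) -> timedelta:
    """Average time to recovery over a non-empty list of outages."""
    if not outages:
        raise ValueError("need at least one outage")
    total = sum((o.time_to_recovery() for o in outages), timedelta())
    return total / len(outages)
```

Splitting the total by quadrant makes it visible where the time went. A long determination quadrant, for instance, suggests that the data collected there was not good enough to act on.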

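Before a fix is applied to the faulty production system, the team needs to assess the overall risk of the fix itself. Executive intervention should be treated like any other input and examined carefully so that it does not make the situation worse. Importantly, input that does not make sense should be challenged immediately, without repercussions for those who challenge it.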

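Whether Captain Smith took part in the decision to restart the voyage no longer mattered, because Ismay had taken control of the situation and was driving his own agenda. Smith went to the wireless room to report to the White Star Line office in Boston and was still optimistic; after all, there was great confidence in the design of a ship with 73 watertight compartments. His wireless message said that the ship had struck ice but suffered little damage, that everyone was safe, and that as a precaution the ship was heading for Halifax, Canada. The message would give White Star enough time to arrange trains and carriages to take the passengers on to New York. It was not encrypted, so it was picked up by the media, which is why early reports of the collision in the European press were generally optimistic.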

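In today's IT projects, MTTR procedures should be controlled entirely by the groups responsible for the IT solution and the services it provides. Communications and announcements about an outage should be coordinated closely with these groups before they are released, and decisions on follow-up support should be made only after communicating with the recipients of the service. Inaccurate information quickly erodes confidence in the service provider.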

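The second search party, which included the architect Thomas Andrews and the carpenter John Hutchinson, came back with a more accurate assessment and better data. The first search party had not gone down enough decks to see the full extent of the damage. In fact, within seconds of the collision, the coal bunkers and Boiler Room 5 were already flooding. One fireman later testified that he had seen a gaping hole extending 2 feet into the floor of the coal bunker. Pumping began at once and seemed able to keep up with the flooding and keep the ship afloat. Andrews knew that if the mail room was lost to flooding, the ship was doomed.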

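The lesson for today's IT projects is that, to pinpoint faults, the support team needs detailed knowledge of the integrated working solution and must be able to break it down into logical layers and decompose it into a series of products and components. The key is to document the work at each stage of the project and to hand that documentation over as knowledge to the support staff who run the solution later.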

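After the ship restarted, Boiler Room 6 also began to flood. Only about 20 minutes later it was clear how inaccurate the initial determination had been. The fix was not working, and the mail room was lost to flooding. Smith conferred with Andrews and the officers and decided to bring the ship, then sailing at 8 knots, to a gradual stop. The forward motion had taken its toll: the ship had taken on more water, the flooding was becoming catastrophic, and other parts of the ship that the collision had not affected began to leak under the pressure of the water.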

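Today, when an IT solution becomes unstable and is in an MTTR situation, it is important to keep assessing and reassessing the data (evidence) from the operating environment and to monitor the environment for changes. The first fix is usually temporary (see Part 9) and is meant only to get the solution back into service. The permanent fix that replaces it may take hours or days to put in place, and the solution may have to be patched in the background. For example, code may have to be reworked or a new component integrated into the solution. Any such change must go through rigorous planning and testing (see Parts 4 and 5) before it is put into production using the project's procedures. This is why a robust change management process and a test/staging environment are required.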
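One way to picture such a change management gate is a check that lists what still blocks a change from going into production. The sketch below is illustrative only: the `Change` record, its flags, and the `blockers` function are hypothetical names, and the rule that a temporary fix needs only a risk assessment while a permanent fix also needs an approved plan and a staging test run is an assumption made for the example, not a rule stated in the article.

```python
from dataclasses import dataclass
from enum import Enum


class FixKind(Enum):
    TEMPORARY = "temporary"   # gets the solution back into service
    PERMANENT = "permanent"   # replaces the temporary fix later


@dataclass
class Change:
    """A proposed change to production (hypothetical record)."""
    description: str
    kind: FixKind
    risk_assessed: bool = False
    plan_approved: bool = False
    passed_staging_tests: bool = False


def blockers(change: Change) -> list[str]:
    """Return the reasons a change may not go to production yet.

    An empty list means the change is clear to proceed.
    Assumption: every fix needs a risk assessment; a permanent fix
    additionally needs an approved plan and a passing staging run.
    """
    reasons = []
    if not change.risk_assessed:
        reasons.append("risk not assessed")
    if change.kind is FixKind.PERMANENT:
        if not change.plan_approved:
            reasons.append("plan not approved")
        if not change.passed_staging_tests:
            reasons.append("not tested in staging")
    return reasons
```

Returning the list of reasons, rather than a bare yes or no, leaves a record of why a change was held back, which helps when the decision is questioned under pressure.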

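Andrews correctly predicted to Smith that the ship had about two hours before it sank. This was a death sentence. Smith finally recognized that the situation was hopeless and no longer recoverable, as it had been right after the collision.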

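The lesson for today's IT projects is that MTTR procedures are cyclical and allow for several recovery attempts within a limited time frame. Ismay, however, forced the situation beyond the point where MTTR procedures, or any recovery, could help.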
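The cyclical nature of the procedure can be sketched as a loop that applies candidate fixes one at a time, re-checks the health of the solution after each attempt, and stops when the time budget is used up. The function `recover_within` and its parameters are hypothetical and only meant to illustrate the idea; a real procedure would involve people and approvals, not just callables.

```python
import time
from typing import Callable, Optional, Sequence


def recover_within(
    fixes: Sequence[Callable[[], None]],
    is_healthy: Callable[[], bool],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Try candidate fixes in order until the service is healthy or time runs out.

    Returns the index of the fix after which the health check passed, or
    None if the budget expired or every fix was tried without success.
    None marks the point where normal recovery has failed and disaster
    recovery would have to take over.
    """
    deadline = clock() + budget_seconds
    for index, apply_fix in enumerate(fixes):
        if clock() >= deadline:
            return None          # out of time: stop attempting fixes
        apply_fix()
        if is_healthy():         # re-assess the evidence after every attempt
            return index
    return None
```

The deadline is checked before each attempt, so a fix that is already running is not interrupted; it simply becomes the last one tried.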

Conclusion

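Today, many IT projects make a critical situation far worse by not following the established procedures for operating and recovering the solution. Institutionalized MTTR procedures should help curb the kind of disparate decision making seen on Titanic and keep a critical situation from turning into a catastrophe. So should the support staff's detailed knowledge of the solution. The next part will look at the disaster recovery stage of the IT project.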

Original text:

In recapping Titanic’s situation, following the collision (Part 8) the ship appeared to be in remarkably good shape. No one had been injured and from the bridge the integrity of the ship appeared to be sound. White Star Director Bruce Ismay was hell-bent on saving face--and his company’s reputation (Part 9). At 11:50 p.m., 10 minutes after the collision, Ismay pushed to restart the ship and limp Titanic off the ice shelf. Passengers, unaware of any dangers, later testified to their initial relief that the ship was restarting the journey, with little concern about the collision, the potential damage and consequences.

In today’s IT projects it is vital that Mean Time To Recovery (MTTR) procedures for the IT solution (see Part 9) are set up, prepared, planned and tested--in the project itself (Part 4) and "institutionalized" with the staff (operations groups/technical support). Data collected in the second "problem" quadrant (the four quadrants are: detection, determination, resolution and recovery) has to stand up to rigorous review.

Before a resolution or fix is applied in production, the team needs to assess the overall risk of proceeding with it. Executive intervention is handled like any other input and needs to stand up to careful examination so as not to further deteriorate the situation. Importantly, it needs to be challenged if it does not make sense, without any repercussions.

Whether Captain Smith was part of the decision to restart Titanic was not really relevant, as Ismay was in control of the situation, driving forward his own agenda. Smith proceeded to the wireless room to inform the White Star Line in Boston of the situation. Smith was still optimistic; after all, there was great confidence in the design of the ship with its 73 watertight compartments. Smith sent a wireless message outlining that Titanic had struck ice but with little damage. Everyone was safe aboard, and as a precaution the ship was proceeding to Halifax. The message would give White Star time to organize trains and carriages to transport the passengers to New York. Wireless messages were not encrypted and this one was intercepted by the world media. It was the reason why early reports of the collision that appeared in the European press were overwhelmingly optimistic.

In today’s IT projects, MTTR procedures need to be completely controlled by the groups responsible for the IT solution and the services it provides. Communications or announcements related to an outage situation need to be made in close conjunction with these groups and support decisions made when communicating externally to the service recipients of a solution. Inaccurate information can quickly erode confidence in the service provider.

The second search party, with the architect Thomas Andrews and the carpenter John Hutchinson, returned with a more accurate assessment of the situation and better data. The first search party had not descended enough decks to see the full extent of the damage. Within seconds of the collision, flooding had occurred in the coal bunkers and Boiler Room 5. One of the firemen later testified to seeing a gaping hole 2 feet into the floor of the coal bunker. Suction lines were set up right away and the pumping seemed to be coping with the rate of flooding to keep the ship afloat. Andrews knew that if the mail room was lost to flooding, the ship was doomed.

The lesson from this for IT projects today is that, in order to pinpoint faults, the support team needs a detailed knowledge of the integrated working solution, and the ability to break it down into logical layers and decompose it into a sequence of products and components. The importance of creating documentation at each stage of the project, and then transferring it as knowledge to support staff for later use in the operation, is key.

After the ship restarted, Boiler Room 6 started to flood. Around 20 minutes later it was apparent that the initial determination was grossly inaccurate, and the fix was not resolving the situation. The mail room was lost to flooding. Smith conferred with Andrews and the officers, determining that the ship--sailing now at 8 knots--should come to a gradual stop. The forward motion had taken its toll. The ship had taken on more water, resulting in increased flooding that was becoming catastrophic. Other parts of the ship, which were initially unaffected, had started to spring leaks under the strain of the water.

In today’s world, in an MTTR situation where an IT solution falters, it is important to keep assessing and reassessing the environmental data (evidence) and monitoring the environment for any changes. The first fix applied is usually temporary (Part 9) so as to get the solution online and back into service. It may take hours or days to get a permanent fix in place. The solution may have to be patched up in the background. For example, code may have to be reworked or a new component integrated into the solution. This then needs to go through rigorous planning and testing (Part 4 and Part 5) before implementation in production using the procedures from the project, hence the requirement for a robust change management process and a test/staging environment.

Andrews rightly predicted to Smith that the ship had approximately two hours before foundering. This was a death sentence, and Smith finally recognized the situation was hopeless and not recoverable like it had been right after the collision.

The lesson from this for IT projects today is that MTTR procedures are cyclical and allow for several attempts at recovery, in a limited time frame. However, Ismay forced a situation where the ship went beyond MTTR or recovery.

Conclusions
Today, many IT projects severely compromise a critical situation by not following an established process in operation and recovery of a solution. Institutionalized MTTR procedures should help minimize disparate decision making as carried out on Titanic and prevent a critical situation from becoming catastrophic. So should the support staff’s detailed knowledge of the solution. The next installment will look at the disaster recovery stage of the IT project.

From the "ITPUB blog", link: http://blog.itpub.net/7839396/viewspace-955624/. If you repost, please credit the source; otherwise legal action may be taken.
