IT Project Revelations: Lessons from the Titanic (Part 8) (Repost)

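Posted by ger8 on 2007-08-13
To recap the situation briefly: Titanic's early warning system had in effect failed because problems with key feedback mechanisms went unreported [see Part 7], possibly out of fear of reprisal. On top of this came general over-confidence in the ship's safety [see Part 4], apathy toward the fate of the French liner Niagara [see Part 6], and inaccurate information on the extent of the giant ice field [see Part 7]. Together these produced a state of gross indifference. Finally, Ismay's pressure and the new SLO (service level objective) [see Part 5] pushed Titanic to her highest speed, past her operational limits.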

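Titanic was thus heading for a collision, one that was by now almost unavoidable. The ship ran at full speed through still, icy waters littered with small bergs and loose ice. The lookouts, without binoculars and with a freezing wind beating against their eyes, were trying to pick out the horizon through the haze common in such conditions. As they struggled to make out a large dark mass looming ahead, their report to the bridge was delayed.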

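The lesson for today's IT projects is that operations staff can only control a newly operational solution once they are very familiar with it. They need to be in a position to prevent failures before they happen and to ensure the solution meets its service levels. They need good visibility into the solution and its surrounding environment. They need to quickly analyze and assess the data collected from the feedback mechanisms set up during the planning and testing stages of the project [see Part 4]. When the signals from those mechanisms become noisy, they need to diagnose the situation and determine the deviations from set norms, the potential impact, and its overall extent. They also need to decide correctly whether to escalate a problem, and at what priority.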
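As a rough illustration of "determining deviations from set norms" and then deciding whether to escalate, here is a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the article: the metric names, the `Norm` type, and the rule that a deviation beyond twice the tolerance triggers escalation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Norm:
    """Hypothetical norm for one metric, agreed during testing."""
    expected: float   # value considered normal
    tolerance: float  # deviation still treated as normal


def assess(value: float, norm: Norm) -> str:
    """Classify one reading as 'ok', 'watch', or 'escalate'."""
    deviation = abs(value - norm.expected)
    if deviation <= norm.tolerance:
        return "ok"
    # Assumed rule: up to twice the tolerance is only worth watching.
    if deviation <= 2 * norm.tolerance:
        return "watch"
    return "escalate"


def triage(readings: dict, norms: dict) -> dict:
    """Assess every reading; a metric with no norm cannot be judged,
    so this sketch escalates it rather than ignoring it."""
    result = {}
    for name, value in readings.items():
        norm = norms.get(name)
        result[name] = assess(value, norm) if norm is not None else "escalate"
    return result


# Illustrative norms and readings (made-up numbers).
norms = {"response_ms": Norm(200.0, 50.0), "error_rate": Norm(0.01, 0.01)}
readings = {"response_ms": 320.0, "error_rate": 0.015, "queue_depth": 7.0}
decisions = triage(readings, norms)
```

The point of the sketch is only that norms and escalation rules are written down and tested before go-live, so that the person on watch does not have to improvise them during an incident.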

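With a calm sea and no breakers against the base of the growler, it was almost impossible to spot from a distance through the haze. By the time Titanic's lookouts had determined that the dark mass was in fact a "growler," or "black iceberg" (an iceberg that has flipped over and is dark in color), the situation had turned critical. Once sure of their sighting, they notified the bridge with the famous report "Iceberg dead ahead!" Officer Murdoch, the chief duty officer, calmly took the report and with his binoculars judged the iceberg to be about 900 yards away. From all the evidence available today, Murdoch took the following actions:
· First, he cut power to the engines. This made sense: putting the engines into reverse would only churn up the water under the ship and hamper the steering, making the ship harder to control.
· Second, with too little distance to stop the ship and no way around the iceberg, he attempted a port-around, or S-turn: first steering hard a-port, then hard a-starboard, in an effort to sharply decelerate the ship. With only 40 seconds of reaction time, this could bring the ship parallel to the iceberg rather than into a head-on collision.
· Third, as a precaution, he threw the electric switch to close the watertight bulkhead doors.
In hindsight, these were probably the best emergency measures available at the time.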

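The lesson for today's IT projects is that in a critical situation, any anomalies spotted should be escalated smoothly, level by level, between operations (the lookouts) and the levels of technical support staff (the bridge officers). This escalation path needs to be established in the project testing stage, through operability and operational testing. Only once operations staff are familiar with both the solution and its environment should they set up more streamlined escalation procedures.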

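At this point it is evident that Titanic's project itself had deficiencies. For example, too little time was set aside for testing; the officers never attempted an S-turn during sea trials; and they never simulated handling the ship under rough, dire, or emergency conditions as part of accident prevention.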

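The lesson for today's IT projects is that operations and technical support staff need to set aside time to map out critical scenarios affecting the operability of the solution, work out strategies for failure prevention, and determine preset and proven courses of action. All of this needs to be completed and verified before the solution goes into operation. It includes considering how to override automated operators, which could otherwise make matters worse in a critical situation. After all, the ultimate goal is to prevent an outage, or loss of service, in the first place.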

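As Titanic swung back to starboard, Murdoch could no longer clear the iceberg, and he and his colleagues on the bridge braced themselves for a collision.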

Conclusions

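Today, many IT projects are severely compromised because they pay too little attention to the operations stage. Setting up operations becomes an afterthought, and operations staff are brought into the project only at implementation rather than taking a prominent role in the planning and testing stages. Yet operations are ultimately and directly responsible to the business for upholding service levels. If an adequate operation (people, processes, tools) is not set up around a solution in the first place, the inevitable result is operational problems and potential failures recurring for days, weeks, or even months, and possibly a complete outage of the service.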

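Titanic's levels of support had no time to familiarize themselves with their ship. They failed to clarify the scope of the anomalies and to pool their intelligence. Murdoch's final maneuver was well executed, but had it been tested beforehand, it might have saved the ship. The friction between front-line operations and technical support over the missing binoculars did not help either, and the lookouts' hesitation cost the most vital last seconds.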

The next installment will look at how a manageable situation turned into a disaster.

Original text:

In recapping the situation, Titanic’s early warning system had failed because of the failure to report problems with key feedback mechanisms (see Part 7), possibly because of the fear of reprisal. This, coupled with general over-confidence in the safety of the ship (see Part 4), apathy to the fate of the French Liner Niagara (see Part 6), and inaccurate information on the extent of the giant ice field (see Part 7) led to a state of gross indifference. Finally, Ismay’s pressure and new SLO (see Part 5) pushed Titanic to her highest speed and past her operational limits.

Titanic was heading for a collision. In fact, it was almost inevitable. The ship, at its maximum speed, raced through icy still waters littered with small bergs and pieces of ice. The lookouts, without binoculars and with a freezing wind hitting their eyes, were trying to outline the horizon through the haze common in these conditions. As they struggled to make out the shape of a dark mass looming in front of them, they delayed reporting it to the bridge.

The lesson for today’s IT projects is that in monitoring a newly operational solution, operations staff need to be very familiar with it. They need to be in a position to proactively prevent failures from happening in the first place and to ensure the solution meets its service levels. They need good visibility into the solution and its surrounding environment. They need to be able to quickly assess and analyze the data in front of them, collected from feedback mechanisms set up during the planned testing stage of the project (see Part 4). As the mechanisms become noisy, they need to diagnose situations and determine deviations from set norms, any potential impacts, and their overall extent. They need to clarify whether something is actually wrong or just problematic. They need to make the right decision as to whether to escalate, and at what priority.

Titanic’s lookouts determined the dark mass was in fact a "growler," or "black iceberg"--an iceberg that has flipped over and is dark in color. With a calm sea and no breakers against the base of the growler it was practically invisible in the haze. This had now turned into a critical situation. Once sure of their sighting they notified the bridge with the infamous "Iceberg dead ahead!" Officer Murdoch, chief duty officer, calmly took the call and with his binoculars confirmed the sighting about 900 yards ahead. From all the evidence available today, Murdoch took the following actions:
· First, he cut power to the engines. This made sense as putting the engines into reverse would just churn up the water and limit the steering and handling capability of the ship.
· Second, there was not enough distance to stop the ship and he could not get around the iceberg. So he attempted a port-around, or S-turn, first steering hard a-port and then hard a-starboard in an effort to sharply decelerate the ship. With only 40 seconds of reaction time, this would bring him parallel to the iceberg rather than into a head-on collision.
· Third, he threw the electric switch to close bulkhead watertight doors as a precaution.
In hindsight, these were probably the best possible courses of action.

The lesson for today’s IT projects is that in a critical situation, any anomalies spotted need to be acted on with a smooth escalation between operations (lookouts) and the levels of technical support staff (bridge officers). This trouble-free escalation needs to be established in the project testing stage (see Part 4) and attained through operability and operational testing. As operations staff become familiar with the solution and its environment, they can set up more effective procedures.

At this point it is evident that there were serious deficiencies in Titanic’s project itself. For example, the time set aside for testing was too short, and during sea trials the officers did not go through any S-turn maneuvers or simulate handling the ship under rough or dire conditions, or in an emergency, as part of accident prevention.

The lesson for today’s IT projects is that operations and technical support staff need time to map out critical scenarios for the operability of the solution, work out strategies for failure prevention, and determine preset and proven courses of action. These need to be carefully carried out and tested prior to implementation. This includes considering automated operators that need to be overridden; otherwise they could cause more problems in a critical situation. After all, the ultimate goal is preventing an outage, or loss of service, from occurring in the first place.

As Titanic swung back to starboard, Murdoch just failed to clear the iceberg and he and the bridge staff braced themselves for a collision.

Conclusions
Today, many IT projects severely compromise the operations stage by not paying enough attention to it. Setting up operations is an afterthought, and staff are not brought into the project until implementation rather than taking a prominent role in the planning and testing stages. After all, operations are ultimately responsible for upholding the service levels of the solution to the business. The inability to set up an adequate operation (people, processes, tools) around a solution in the first place will inevitably lead to operational problems that manifest themselves days, weeks, or even months after going live, and to a potential failure or, in the worst case, an outage.

Titanic’s levels of support had little time to familiarize themselves with the ship. They had failed to clarify the scope of anomalies and put together the intelligence. Murdoch’s maneuver was well executed, but perhaps with some testing he could have pulled it off. The friction between operations and technical support over the missing binoculars did not help the situation, and the lookouts’ hesitation cost vital seconds.

The next installment will look at how a manageable situation was turned into a disastrous one.

From the ITPUB blog, link: http://blog.itpub.net/7839396/viewspace-955620/. If you repost this article, please credit the source; otherwise legal action may be pursued.
