IT Project Revelations: Lessons from the Titanic (Part 11) (Repost)

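Posted by ger8 on 2007-08-13
To recap the Titanic's situation: after the collision (see Part 8), the ship lurched off the ice shelf and got under way again, heading for Halifax. Everything seemed fine, but after 20 minutes at 8 knots it was plain how inaccurate the initial assessment had been. Continuing to sail took its toll, and the ship took on more water. Sections that the collision had not originally affected also began to leak under the water pressure. The rising water was turning into a catastrophe.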

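Today, the top priority is to bring service back online quickly with a temporary fix while a permanent fix is worked out. The essential point, however, is to monitor the service environment closely to see whether the fix is holding.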

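The second search party, which included the architect Thomas Andrews and the carpenter John Hutchinson, reported that five compartments were seriously flooded and recognized that this went far beyond what the Titanic was designed for. The grinding along the bottom had badly torn the outer skin and damaged the double hull. The different rates of flooding in the six primary compartments also indicated that the top hull was damaged. That things could get this bad was beyond the designers' expectations.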

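In today's IT projects, it is vital that the project team plans ahead for this kind of eventuality, in which no fix helps and the situation goes beyond the MTTR procedures (see Part 9). For end users and customers, the service is down and hard to restore. For such a situation, disaster recovery procedures should be set up, prepared, planned, and tested within the project itself (see Part 4), and institutionalized by assigning dedicated staff (operations teams/technical support).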

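The architect realized that the Titanic's condition had gone beyond ordinary incident recovery and had become a disaster. He said the ship had two and a half to three hours before it sank, and he correctly judged that nothing could be done. Too many compartments were breached, and the flooding was beyond what the pumps could handle. The bulkheads between the compartments had not been carried up to the height of the watertight horizontal traverses, so as the bow went down, water spilled from one compartment into the next, like water filling an ice cube tray. The ballroom in effect became a large channel distributing water across the ship.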

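At this point we can see how the compromises on non-functional requirements made during the project's construction phase (see Part 3) led to massive consequences in the disaster.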

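Only the captain and a few officers knew the true extent of the damage, and by now they could only watch the ship go down. No "abandon ship" order or other formal declaration of disaster was issued. Only about 65 minutes after the collision did the captain order the officers to uncover the lifeboats and get the passengers and crew up on deck. There was no formal disaster recovery plan on board the Titanic.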

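Today, the next step would be to invoke the disaster recovery plan and communicate it to everyone. Every disaster recovery plan should come with a well-thought-out communication plan that communicates clearly and unambiguously with different audiences.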

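The Titanic's captain understood the seriousness of the problem soon after the collision, but he did not communicate it down through the crew to the passengers. This added to the confusion on board, especially among the crew. For example, the engine room sent engineers up to the deck, but the bridge ordered them back. Possible explanations for such poor communication on board include:
●The ship's communication systems were limited, and there was no public-address system. Important information could only be passed to passengers by word of mouth, with crew members knocking on each cabin door. With hundreds of cabins, this took far too long.
●The crew themselves did not know the real situation, so what passengers were told varied. The veteran captain had great confidence in the ship's safety systems and may have found the architect's verdict hard to believe, since everything still seemed normal at first. He behaved almost as if it were business as usual.
●The captain knew the lifeboats were too few, with room for only about half of the 2,223 people on board. So perhaps it was better not to cause panic, and to let the lifeboats take passengers off calmly and in order when the time came. The ship's hierarchical structure and the separation of classes meant that first-class passengers had priority access to the lifeboats.
●The captain feared that panic would spread. He and his officers knew the story of the French liner La Bourgogne, which had sunk 14 years earlier. There, too, only half the passengers had lifeboat places, and widespread panic broke out. Captain Smith knew he could save as many people as possible by loading those fortunate enough to reach the lifeboats. So he may not have informed all the passengers, particularly those in third class.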

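Today, a communication plan may be just as important as a disaster recovery plan, for the following reasons:
●Internal communication with employees goes a long way toward containing the impact of a disaster. Speed also matters: for example, inform customer-facing employees first so that they can pass the information on to customers.
●External communication with customers is also important. The communication plan needs to reach the different customer segments through different channels, depending on the scale of the problem or disaster.
●Depending on how severe the service outage is, communication with the news media may be necessary. This requires deciding what the key messages are, how they are released, and through which channels. Many companies have been caught off guard when roving reporters put trick questions to uninformed employees.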

Conclusion

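Today, many IT projects are badly compromised in operation because they have not prepared for the worst case. MTTR procedures alone are not enough. Besides a disaster recovery plan, a well-thought-out communication plan must also be in place. The next part will look at invoking disaster recovery.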

Original text:

To recap Titanic’s situation: following the collision (Part 8), the ship was restarted and limped off the ice shelf with the objective of sailing back to Halifax. Everything appeared to be in good shape, but after 20 minutes of sailing at 8 knots it was apparent that the initial determination was grossly inaccurate. The forward motion had taken its toll and the ship had taken on more water. Parts of the ship that were initially unaffected had started to spring leaks under the strain of the water, and the increase in flooding was becoming catastrophic.

In today’s world, getting service back online is a top priority, achieved by applying a temporary fix whilst a permanent fix is created. However, in such a situation it is essential that the service delivery environment is closely monitored to confirm whether the fix is holding.

The second search party, with the architect Thomas Andrews and the carpenter John Hutchinson, reported major flooding in five compartments and recognized that Titanic was not designed for this. The grinding along the bottom had badly ruptured the outer skin and damaged the double hull. The different rates of flooding in the six primary compartments indicated the top hull or tank top was damaged. It was beyond the expectations of the designer that something in nature could inflict so much damage.

In today’s IT projects, it is vital that the project team plan for such an eventuality where the fix is not resolving the problem and the situation goes beyond the Mean Time To Recovery (MTTR) for the IT solution (see Part 9). The service is unavailable, to end-users and customers, and not readily recoverable any more. For this situation disaster recovery procedures need to be set up, prepared, planned and tested in the project itself (Part 4) and "institutionalized" with the staff (operations groups/technical support).
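The escalation rule described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `IncidentStatus`, `fix_is_holding`, and `next_action`, the use of an error rate as the health signal, and the threshold values are all hypothetical and do not come from the article. The sketch invokes disaster recovery only when both conditions in the text are met, that is, the fix is not holding and the outage has run past the MTTR target.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import List


class Action(Enum):
    KEEP_MONITORING = "keep monitoring"
    INVOKE_DISASTER_RECOVERY = "invoke disaster recovery"


@dataclass
class IncidentStatus:
    elapsed: timedelta               # time since the service went down
    mttr_target: timedelta           # agreed MTTR for the solution (assumed value)
    recent_error_rates: List[float]  # health samples taken after the fix, newest last
    error_threshold: float           # rate above which a sample counts as failing


def fix_is_holding(status: IncidentStatus) -> bool:
    """The fix holds if there is at least one sample and none exceeds the threshold."""
    rates = status.recent_error_rates
    return bool(rates) and all(r <= status.error_threshold for r in rates)


def next_action(status: IncidentStatus) -> Action:
    """Escalate only when the fix is failing AND the MTTR target has been exceeded."""
    past_mttr = status.elapsed > status.mttr_target
    if past_mttr and not fix_is_holding(status):
        return Action.INVOKE_DISASTER_RECOVERY
    return Action.KEEP_MONITORING


# Example: 20 minutes into an outage with a 15-minute MTTR target,
# and error rates still climbing after the temporary fix.
status = IncidentStatus(
    elapsed=timedelta(minutes=20),
    mttr_target=timedelta(minutes=15),
    recent_error_rates=[0.10, 0.30, 0.60],
    error_threshold=0.05,
)
print(next_action(status).value)
```

A real project would choose its own health signal and thresholds. The point of the sketch is that the trigger for disaster recovery is decided and tested in advance rather than improvised during the incident.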

The architect realized the situation onboard Titanic had gone beyond normal problem recovery and had become a disaster. He stated that the ship had 2.5 to 3 hours before completely sinking, and accurately determined that the problem could not be fixed. Too many compartments were ruptured and were rapidly flooding beyond the capacity of all the pumps. The bulkhead walls, separating the compartments, had not been carried up to watertight horizontal traverses. Therefore, as the ship’s nose went down, water spilled from one compartment to another rather like an ice cube tray filling with water. The ballroom acted as a massive channel for distributing water horizontally across the ship.

At this point in the story we see how the compromises to the non-functional requirements during the construction phase (see Part 3) of the project had a massive consequence in the disaster.

Only the captain and a few officers knew the extent of the damage and were now resigned to the ship sinking. No "abandon ship" command or formal declaration of a disaster was given. Around 65 minutes after the collision the captain just gave orders to the officers to uncover the lifeboats and get the passengers and crew ready on deck. No formalized disaster recovery plan was in place on board Titanic.

In today’s world, the next step would be to invoke a disaster recovery plan and communicate it to all onboard. Every disaster recovery plan needs to be accompanied by a well-thought-out communication plan, one that communicates clearly with each of its different audiences.

Titanic’s captain knew the seriousness of the situation relatively soon after the collision, but did not communicate this through the ranks of crew and passengers on board. This increased the confusion, particularly among the crew. For example, the engine room sent some engineers to the boat deck, but the bridge sent them back down to the engine room. There are a number of possible explanations for the poor communication aboard Titanic:
·The ship had very limited communication, with no public-address systems. Important information was communicated to passengers by word of mouth, the crew knocking on each cabin door and common room. Considering there were hundreds of cabins, this could take hours.
·The crew didn’t have accurate information on the situation, so varying degrees of information were passed to passengers. The experienced captain believed in the safety systems of the ship and might have found the architect’s verdict very hard to accept because everything appeared so normal in the first hour. The captain acted almost as if the situation was "business as usual."
·The captain realized that the carrying capacity of the lifeboats was inadequate, with only enough room for about half of the estimated 2,223 people on board. Perhaps it seemed better to keep things calm, and allow the lifeboats to be filled in an orderly manner when the timing was right. The ship’s hierarchical structure and segregation of classes meant that first-class passengers had the best access to the boats.
·The captain feared widespread panic. He and the other officers were aware of the French liner La Bourgogne, which sank 14 years earlier. With room in the lifeboats for only half the people onboard, widespread panic had broken out. Captain Smith knew he could save the maximum number of lives by loading only those who were lucky enough to reach the boats. So, he may have avoided informing all the passengers, specifically in third class.

In today’s world a communication plan is probably as important as a disaster recovery plan, for several reasons:
·Communicating internally with your employees can greatly help control the impact of a disaster. Also, the speed of communication is essential. For example, get information to customer-facing employees first, so they can inform customers.
·Communicating externally with your customers is essential and the plan needs to cater to customer segments using different channels, depending on the scope of the problem or disaster. A customer-retention strategy might need to be offered.
·Communicating with the press may be necessary depending on how serious the loss of service is. This requires the identification of key messages, how these are communicated, and through what channels. Many companies have been caught off guard when roving reporters trap unaware employees with questions.
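
One way to make such a communication plan concrete is to record it as data, so that who is told, in what order, and through which channel is settled before an incident. The following Python sketch is illustrative only: the `Audience` type, the channel names, the priorities, and the three-level severity scale are hypothetical assumptions, not part of the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Audience:
    name: str
    channel: str       # how this audience is reached (assumed channel names)
    priority: int      # lower number means informed earlier
    min_severity: int  # notify only when incident severity is at least this


# Hypothetical plan: customer-facing staff first, press only for the worst incidents.
PLAN: List[Audience] = [
    Audience("customers", "status page", priority=3, min_severity=2),
    Audience("customer-facing employees", "internal chat", priority=1, min_severity=1),
    Audience("press", "prepared statement", priority=4, min_severity=3),
    Audience("all employees", "email", priority=2, min_severity=1),
]


def notification_order(plan: List[Audience], severity: int) -> List[Audience]:
    """Return the audiences to notify for this severity, earliest first."""
    due = [a for a in plan if severity >= a.min_severity]
    return sorted(due, key=lambda a: a.priority)


# Example: a severity-3 incident reaches every audience, in priority order.
for audience in notification_order(PLAN, severity=3):
    print(f"{audience.priority}. {audience.name} via {audience.channel}")
```

Keeping the plan as a reviewed, versioned artifact means the key messages and channels can be rehearsed along with the disaster recovery procedures themselves.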

Conclusions
Today, many IT projects severely compromise an operation by not preparing for worst case scenarios. In today’s world, MTTR procedures are not enough. Aside from a disaster recovery plan, a well-thought-out communication plan needs to be in place. The next installment will look at invoking disaster recovery.

Source: "ITPUB Blog", link: http://blog.itpub.net/7839396/viewspace-955626/. If you repost this article, please cite the source; otherwise legal liability will be pursued.
