


To recap the famous ship’s situation: Titanic’s officers tried desperately to avoid a collision (see Part 8). However, the S-turn, though a good decision, failed to decelerate the ship enough. Titanic came to an almost innocuous halt, later described by hundreds of passengers as a quiver, rumble or grinding noise that lasted a few seconds, as if the ship were rolling over a thousand marbles.

There was no "crash stop," and there were no fatalities or even minor injuries. There was no violent jolt sideways or repeated strikes along the ship’s length. This is typical of a sideswipe against an ice spur when a ship is turning very hard away from it. The breakfast cutlery that was laid out in the dining salons barely trembled, and drinks remained unspilled in the first class smoking rooms and lounges. All the evidence indicates that the ship came to rest on an underwater ice shelf at the base of the iceberg. Murdoch had prevented a head-on crash that could have demolished the first four compartments, and killed and maimed hundreds of passengers.

Likewise, when an IT solution falters in production, steps are taken according to a process prepared, planned and tested in the project itself (see Part 4). The process should be built around a Mean Time To Recovery (MTTR) clock, where the principal objective is to get the IT solution back on-line as quickly as possible to meet Service Level Agreements (SLAs). The solution is then patched up in the background and a temporary or permanent fix is applied.

However, before going back on-line, the integrity of the solution first needs to be established so that the problem does not recur. With an eye on the clock, the operations group steps through the process and its four "problem" quadrants: detection, determination, resolution and recovery. When the MTTR clock starts ticking, signifying the beginning of a loss of service (an outage, see Part 2), metrics should be captured as User Outage Minutes (UOMs), which measure how many users experience service loss and for how long.
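As a rough illustration of the MTTR clock and the UOM metric, the sketch below records a timestamp at the boundary of each quadrant and derives the figures from them. The `Outage` class, its field names and the sample numbers are all hypothetical, and the formula (affected users multiplied by minutes of service loss) is an assumed reading of "how many users, for how long," not a definition taken from this series.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """One outage, with a timestamp at the boundary of each quadrant (hypothetical model)."""
    started: datetime      # service loss begins, MTTR clock starts
    detected: datetime     # detection complete
    determined: datetime   # determination complete
    resolved: datetime     # resolution complete
    recovered: datetime    # service back on-line, MTTR clock stops
    users_affected: int

    def quadrant_minutes(self) -> dict:
        """Minutes spent in each of the four quadrants."""
        marks = [self.started, self.detected, self.determined,
                 self.resolved, self.recovered]
        names = ["detection", "determination", "resolution", "recovery"]
        return {name: (end - begin).total_seconds() / 60
                for name, begin, end in zip(names, marks, marks[1:])}

    def recovery_minutes(self) -> float:
        """Total time on the MTTR clock for this outage."""
        return (self.recovered - self.started).total_seconds() / 60

    def user_outage_minutes(self) -> float:
        """Assumed UOM formula: affected users x minutes of service loss."""
        return self.users_affected * self.recovery_minutes()


# Hypothetical outage: 45 minutes end to end, 1,200 users affected.
t0 = datetime(2024, 1, 1, 2, 0)
outage = Outage(
    started=t0,
    detected=t0 + timedelta(minutes=5),
    determined=t0 + timedelta(minutes=20),
    resolved=t0 + timedelta(minutes=35),
    recovered=t0 + timedelta(minutes=45),
    users_affected=1200,
)

print(outage.quadrant_minutes())
print(outage.recovery_minutes())      # 45.0
print(outage.user_outage_minutes())   # 54000.0
```

Breaking the clock down by quadrant shows where the time actually went, which is useful when reviewing whether detection or determination is the slow step.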

This is far more accurate than the more commonly used percentage of service availability, e.g., 99.999 percent, which says nothing about how many users were affected. Problem detection on Titanic amounted to the 37 seconds of warning given by the lookouts. This is not typical of an IT solution, which is likely to put out errors and warnings well before any significant failure occurs. This gives operators, automated or human, time to prevent the problem from occurring in the first place (see Part 8).
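To make the comparison between UOMs and percentage availability concrete, here is a minimal sketch. The function names and the two sample outages are hypothetical; the only fixed fact is the arithmetic, e.g. that 99.999 percent availability over a 365-day year leaves roughly 5.26 minutes of downtime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by an availability percentage over a period."""
    return period_minutes * (1 - availability_percent / 100)


def availability_percent(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability percentage implied by a given amount of downtime."""
    return 100 * (1 - downtime_minutes / period_minutes)


def user_outage_minutes(users_affected, downtime_minutes):
    """Assumed UOM formula: affected users x minutes of service loss."""
    return users_affected * downtime_minutes


# "Five nines" leaves about 5.26 minutes of downtime per year.
print(round(allowed_downtime_minutes(99.999), 2))

# Two hypothetical 30-minute outages: one overnight, one at peak load.
overnight_uom = user_outage_minutes(users_affected=50, downtime_minutes=30)
peak_uom = user_outage_minutes(users_affected=20_000, downtime_minutes=30)

# Both outages yield exactly the same availability percentage...
same_availability = availability_percent(30)
# ...but very different user impact.
print(overnight_uom, peak_uom)
```

The two outages are indistinguishable as a percentage, yet one costs 400 times more user minutes than the other, which is the gap the UOM metric is meant to expose.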

Titanic’s captain, director and officers gathered on the bridge to determine a course of action. As part of problem determination, to establish the extent of the damage, two search parties were dispatched into the bowels of the ship, one forward and one amidships. The first party returned within 10 minutes with a positive report of no major damage or flooding. In director Bruce Ismay’s mind, problem detection and determination were now complete. Resolution by means of a distress call was a problem for him, as it would compromise White Star’s position by shattering the hype around Titanic and destroying the brilliant marketing (see Part 2 and Part 5) that had lured the world’s wealthy elite onto the safest liner ever built.

A better resolution, in his view, would be to get the ship back to Halifax, away from New York and the center of the world’s press. He could then better contain the news story and marginalize it as a minor incident. He would be able to disembark passengers onto trains, patch the ship up and sail her back to Belfast for repairs. In fact, he could boldly claim that Titanic, a lifeboat in itself with all the latest in emerging technologies, had been able to save herself from a potential disaster, and so further push the safety claims of the White Star Line.

With an IT solution today, determination of the problem assesses its impact on users. Determination has to be consistent with the available evidence. Reinvestigating feedback mechanisms and logs is vital to determine whether the problem has been building up and what is causing it.

In a complex IT solution, it is common to see the domino effect, where a small faulty element such as a subsystem knocks out the elements around it and triggers a cascade of problems. Failing to work out this precise sequence of events could lead to a misdiagnosis, where the wrong fix is applied and the problem recurs. Determination is complete when the root cause assumptions about the problem have been tested and proven correct.
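One simple way to start untangling such a cascade is to combine a dependency map with the list of failed elements, and look for failed elements whose own dependencies are all healthy. The sketch below assumes a hypothetical five-component system; the component names, the `depends_on` map and the timestamps are invented for illustration, and the result is only a candidate root cause that still has to be tested.

```python
def root_cause_candidates(depends_on, failed):
    """Return failed components none of whose direct dependencies also failed.

    depends_on maps each component to the components it relies on.
    A failed component with only healthy dependencies cannot have been
    knocked over by something upstream, so it is a candidate root cause.
    """
    failed = set(failed)
    return sorted(
        component for component in failed
        if not (set(depends_on.get(component, ())) & failed)
    )


def failure_sequence(first_error_at):
    """Order components by the time of their first logged error."""
    return sorted(first_error_at, key=first_error_at.get)


# Hypothetical system: web -> app -> (db, cache), db -> storage.
depends_on = {
    "web": ["app"],
    "app": ["db", "cache"],
    "db": ["storage"],
    "cache": [],
    "storage": [],
}

# Hypothetical first-error times, in seconds since the first alarm.
first_error_at = {"storage": 0, "db": 12, "app": 30, "web": 41}

candidates = root_cause_candidates(depends_on, first_error_at.keys())
sequence = failure_sequence(first_error_at)

print(candidates)  # ['storage']
print(sequence)    # ['storage', 'db', 'app', 'web']
```

If the timestamp order and the dependency analysis disagree, that is itself a sign that a feedback mechanism or log may be unreliable and that the diagnosis needs another look.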

With an IT solution, it is important to be sure of the evidence at hand and to ask the following questions. Was the IT solution aware it was going to fail? If so, were any (automated) preventative actions attempted? Did it alert human or automated operators? Were any of the feedback mechanisms faulty, providing unreliable data? Is the diagnosis of the problem correct?

Titanic’s situation was critical but not catastrophic. Ismay was hell-bent on saving face, and his anxiety over White Star’s reputation created an atmosphere where mistakes were easily made. Titanic appeared to be completely stable, sitting snugly on the underwater ice shelf. Maybe, with due care, they could dislodge the ship with a minimum of damage. Ismay rushed into making a decision. The second search party, with the architect and carpenter, had not even returned with an assessment.

The lesson for IT projects today is that, in resolving the problem, it is important to consider the alternative courses of action available and to weigh the risk associated with each against all the collected evidence. Only then should the last quadrant, recovery, commence. This is where the operations group puts the IT solution back on-line and resumes services, according to SLAs.
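The idea of weighing alternatives by risk, and only once the evidence is in, can be sketched as follows. Everything here is hypothetical: the `Option` fields, the three sample options and their numbers, and the scoring rule (time to apply the option plus the probability-weighted cost of it failing) are one simple assumption about how to compare options, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One candidate course of action (all figures are estimates)."""
    name: str
    minutes_to_apply: float        # time on the MTTR clock if it works
    failure_probability: float     # estimated from the collected evidence
    extra_minutes_if_fails: float  # further outage if the option fails

    def expected_outage_minutes(self) -> float:
        return (self.minutes_to_apply
                + self.failure_probability * self.extra_minutes_if_fails)


def choose_option(options, evidence_pending):
    """Pick the option with the lowest expected outage.

    Refuses to decide while any evidence is still outstanding, so that
    resolution cannot start before determination is complete.
    """
    if evidence_pending:
        raise ValueError(
            "evidence still outstanding: " + ", ".join(evidence_pending))
    return min(options, key=Option.expected_outage_minutes)


options = [
    Option("restart the service", 10, 0.50, 60),
    Option("fail over to standby", 20, 0.05, 60),
    Option("restore from backup", 90, 0.01, 60),
]

best = choose_option(options, evidence_pending=[])
print(best.name)  # fail over to standby
```

Calling `choose_option(options, evidence_pending=["second search party report"])` raises an error rather than returning an answer, which mirrors the check that was missing on Titanic’s bridge.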

On Titanic, not all courses of action were adequately explored as part of the problem resolution. Ismay made the fateful decision to sail forward and telegraphed the engine room "dead slow ahead" in an attempt to recover the situation. Engineers later testified that the ship moved forward at 3 knots with a grinding noise.

Today, many IT projects severely compromise the operation stage by failing to plan, within the project, a process for dealing with problems around an MTTR clock. Such a process is critical for enabling the operations group to quickly restore service and maintain service levels. It should also carry checks and balances (through reviews) to minimize the likelihood of mistakes made under pressure, and it should outline roles and responsibilities to ensure the right personnel make the right decisions.

The next installment will look at how the officers reacted to the disastrous situation.

Source: ITPUB blog. Please credit the source when reposting; unauthorized reposting may be subject to legal action.
