Amazon post mortem on Dublin DC outage reconfirms domino theory

As promised, Amazon released a post mortem report on the data center outage in Dublin. It reconfirmed the domino theory: if one availability zone comes under the influence of errors and outages, other availability zones can follow in a domino effect.

Amazon stated, as mentioned in an earlier blog entry about the Dublin outage, that the utility provider now believes it was not a lightning strike that brought down the 110kV 10 megawatt transformer. This outage was the start of a series of incidents that would finally bring the Dublin data center to its knees.

“With no utility power, and backup generators for a large portion of this Availability Zone disabled, there was insufficient power for all of the servers in the Availability Zone to continue operating. Uninterruptable Power Supplies (UPSs) that provide a short period of battery power quickly drained and we lost power to almost all of the EC2 instances and 58% of the EBS volumes in that Availability Zone. We also lost power to the EC2 networking gear that connects this Availability Zone to the Internet and connects this Availability Zone to the other Availability Zones in the Region.”

Within 24 minutes Amazon “were seeing launch delays and API errors in all (emphasis by Infrarati) EU West Availability Zones.” The reason for this was twofold: “The management servers which receive requests continued to route requests to management servers in the affected Availability Zone. Because the management servers in the affected Availability Zone were inaccessible, requests routed to those servers failed. Second, the EC2 management servers receiving requests were continuing to accept RunInstances requests targeted at the impacted Availability Zone. Rather than failing these requests immediately, they were queued and our management servers attempted to process them.”

“Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs.” Later on Amazon was able to restore power to enough of the network services to re-connect. However, they then found that their database cluster was in an unstable condition. The final blow was that “Separately, and independent from issues emanating from the power disruption, we discovered an error in the EBS software that cleans up unused storage for snapshots after customers have deleted an EBS snapshot.”
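This queuing behaviour, rejecting nothing while the backlog grows, is the opposite of the fail-fast pattern the post mortem implicitly argues for. A minimal sketch of that pattern, emphatically not Amazon's actual implementation (the zone names, health flags, and queue limit are invented for illustration):

```python
import queue

class ZoneDispatcher:
    """Toy request dispatcher that fails fast for unhealthy zones
    instead of letting requests pile up behind an outage."""

    MAX_QUEUED = 100  # bound the backlog so queued work cannot exhaust the server

    def __init__(self, zones):
        # one bounded queue per availability zone, so zones don't share a backlog
        self.queues = {z: queue.Queue(maxsize=self.MAX_QUEUED) for z in zones}
        self.healthy = {z: True for z in zones}

    def submit(self, zone, request):
        if not self.healthy.get(zone, False):
            # fail fast: an unreachable zone rejects work immediately,
            # rather than holding callers while a backlog builds up
            raise RuntimeError(f"zone {zone} unavailable, rejecting request")
        try:
            self.queues[zone].put_nowait(request)
        except queue.Full:
            # the bound is what keeps one zone's outage from
            # overloading the shared management servers
            raise RuntimeError(f"zone {zone} backlog full, rejecting request")

# hypothetical usage
dispatcher = ZoneDispatcher(["eu-west-1a", "eu-west-1b", "eu-west-1c"])
dispatcher.healthy["eu-west-1a"] = False  # the zone that lost power
dispatcher.submit("eu-west-1b", {"action": "RunInstances"})  # accepted
try:
    dispatcher.submit("eu-west-1a", {"action": "RunInstances"})
except RuntimeError as e:
    print(e)  # rejected immediately instead of queued
```

The point of the bound is that work destined for a dead zone is rejected at the door instead of consuming the management servers that every zone depends on.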

In their description of actions to prevent recurrence, Amazon stated that “Over the last few months, we have been developing further isolation of EC2 control plane components (i.e. the APIs) to eliminate possible latency or failure in one Availability Zone from impacting our ability to process calls to other Availability Zones.” (emphasis by Infrarati).
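The bulkhead idea behind that kind of isolation can be sketched in a few lines: give each availability zone its own worker pool, so a hung or slow zone can exhaust only its own resources. This is a toy illustration under assumed names, not the actual EC2 control plane design:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# one small worker pool per zone: slow or failing calls to one zone can
# exhaust only that zone's pool, never the threads serving other zones
ZONES = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
pools = {z: ThreadPoolExecutor(max_workers=4, thread_name_prefix=z) for z in ZONES}

def call_zone(zone, payload):
    if zone == "eu-west-1a":
        time.sleep(30)  # simulate a hung control-plane call in the impacted zone
    return f"{zone}: ok"

def handle(zone, payload):
    # dispatch only on the pool that belongs to the target zone
    return pools[zone].submit(call_zone, zone, payload)

# calls to the healthy zones return promptly even while eu-west-1a hangs
fast = handle("eu-west-1b", {"action": "DescribeInstances"})
slow = handle("eu-west-1a", {"action": "DescribeInstances"})
print(fast.result(timeout=5))  # completes normally
print(slow.done())             # False: the outage stays inside its own pool
```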

The Dublin incident shows that Amazon is still developing and improving the isolation between availability zones. Services in one zone are not yet safeguarded from incidents in other availability zones. That is a ‘must know’ for the customer rather than a ‘nice to know’.

Amazon and Microsoft down in Dublin: the resiliency of cloud computing

A lightning strike in Dublin, Ireland, has caused downtime for many sites using Amazon’s EC2 cloud computing platform, as well as users of Microsoft’s BPOS.
Amazon said that lightning struck a transformer near its data center, causing an explosion and fire that knocked out utility service and left it unable to start its generators, resulting in a total power outage.
Some quotes from the Amazon dashboard (Amazon Elastic Compute Cloud (Ireland)):

11:13 AM PDT We are investigating connectivity issues in the EU-WEST-1 region.

3:01 PM PDT A quick update on what we know so far about the event. What we have is preliminary, but we want to share it with you. We understand at this point that a lightning strike hit a transformer from a utility provider to one of our Availability Zones in Dublin, sparking an explosion and fire. Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization. We’ve now restored power to the Availability Zone and are bringing EC2 instances up. We’ll be carefully reviewing the isolation that exists between the control system and other components. The event began at 10:41 AM PDT with instances beginning to recover at 1:47 PM PDT.

Notice the 30-minute difference between the first issue message on the dashboard (11:13) and the stated start of the event (10:41).

11:04 PM PDT We know many of you are anxiously waiting for your instances and volumes to become available and we want to give you more detail on why the recovery of the remaining instances and volumes is taking so long. Due to the scale of the power disruption, a large number of EBS servers lost power and require manual operations before volumes can be restored. Restoring these volumes requires that we make an extra copy of all data, which has consumed most spare capacity and slowed our recovery process. We’ve been able to restore EC2 instances without attached EBS volumes, as well as some EC2 instances with attached EBS volumes. We are in the process of installing additional capacity in order to support this process both by adding available capacity currently onsite and by moving capacity from other availability zones to the affected zone. While many volumes will be restored over the next several hours, we anticipate that it will take 24-48 hours (emphasis made by Infrarati) until the process is completed. In some cases EC2 instances or EBS servers lost power before writes to their volumes were completely consistent. Because of this, in some cases we will provide customers with a recovery snapshot instead of restoring their volume so they can validate the health of their volumes before returning them to service. We will contact those customers with information about their recovery snapshot.

Remarkably, as stated in the Irish Times, a spokeswoman for the Electricity Supply Board (ESB), Ireland’s premier electricity utility, said the incident occurred at 6:15 PM on Sunday and caused a power outage in the area for about an hour. However, she said power to Amazon was interrupted for “less than a second before an automatic supply restoration kicked in.”

Microsoft doesn’t use a public dashboard, but their Twitter feed stated “on Sunday 7 August 23:30 CET Europe data center power issue affects access to #bpos”. Four hours later there was the tweet “#BPOS services are back online for EMEA customers”. A pity that there is no explanation of how their data center also went down. Was the cause the same as the one that brought the Amazon data center down?

The idea of cloud computing is basically that the offered services are location independent. The customer doesn’t have to worry about, and doesn’t have to know, the location where the services are produced. He doesn’t even have to know how the services are provided (the inner workings of the provided services).
The incident in Dublin shows that, at the moment, these assumptions are wrong. As a customer of cloud computing services you still need a good understanding of the location and workings of the provided services to understand the risks at stake in terms of resiliency and business continuity. Only then can you make the proper choices about the ways in which cloud computing services can help your organization or business without business continuity surprises. Proper risk management when using cloud computing services deserves better attention.

UPDATE 10 August

Now, three days later, Amazon is still struggling with recovery, as shown on their dashboard. In InformationWeek there is an interesting article about the complexity of failover design. According to this article:

It’s still possible that having the ability to fail-over to a second availability zone within the data center would have saved a customer’s system. Availability zones within an Amazon data center typically have different sources of power and telecommunications, allowing one to fail and others to pick up parts of its load. But not everyone has signed up for failover service to a second zone, and Amazon spokesman Drew Herdener declined to say whether secondary zones remained available in Dublin after the primary zone outage. (emphasis made by Infrarati)
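The failover the article describes, launching replacement capacity in a second availability zone when the first refuses, looks roughly like the following sketch. It uses the boto3 SDK, which postdates this incident, and the AMI ID and zone order are placeholders, not real values:

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="eu-west-1")
PREFERRED_ZONES = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]  # assumed order

def launch_with_failover(image_id, instance_type="t3.micro"):
    """Try each availability zone in turn instead of depending on one."""
    for zone in PREFERRED_ZONES:
        try:
            resp = ec2.run_instances(
                ImageId=image_id,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
                Placement={"AvailabilityZone": zone},
            )
            return resp["Instances"][0]["InstanceId"], zone
        except ClientError as err:
            # e.g. capacity or availability problems: fall through to the next zone
            print(f"launch in {zone} failed: {err}; trying next zone")
    raise RuntimeError("no availability zone could satisfy the launch")

# instance_id, zone = launch_with_failover("ami-xxxxxxxx")  # placeholder AMI
```

Whether such a retry would have succeeded in Dublin is exactly the question the article raises: it only works if the secondary zones still have spare capacity after the primary zone fails.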

UPDATE 11 August

ESB Networks confirmed that it suffered a failure in a 110kV transformer in Citywest, Dublin at 6:16 p.m. local time on Sunday, August 7. The Irish Times revealed some new information about the outage.

“The cause of this failure is still being investigated at this time, but our initial assessment of lightning as the cause has now been ruled out,” a statement from ESB Networks said. The ESB also said on Monday that Amazon was interrupted for “less than a second before an automatic supply restoration kicked in”. Yesterday the ESB confirmed that Amazon was one of about 100 customers affected by the ongoing service interruption. “This initial supply disruption lasted for approximately one hour as ESB Networks worked to restore supply. There was an ongoing partial outage in the area until 11pm. The interruption affected about 100 customers in the Citywest area, including Amazon and a number of other data centers.”

A second Amazon data centre in south Dublin experienced a “voltage dip which lasted for less than one second”, the ESB said yesterday. However, this data centre was not directly affected by the power cut. This one-second voltage dip, which had been cited earlier, added to the confusion about Sunday’s events.

ESB made it clear that in referencing a lightning strike, Amazon was sharing its best information at the time. “Amazon accurately reported the information which had been passed to them from workers at the site,” said Marguerite Sayers, Head of Asset Management at ESB Networks. “Both the explosion and fire were localized to the bushings or insulators of the transformer and did not require the attendance of the emergency services. The extent of associated internal damage to the transformer was serious and resulted in supply interruption to a number of customers, and also impacted Amazon’s systems, as they have reported.”

The article confirms the statement that you, as a customer, must have a good understanding of the location and workings of the provided services to understand the risks at stake in terms of resiliency and business continuity. However, the question arises as to what level of this knowledge you must have. It looks like the outage in Dublin is not only about IT design but also about facility engineering. So how far does the responsibility of the customer extend, and where does the responsibility of the provider begin?