Amazon and Microsoft down in Dublin: the resiliency of cloud computing

A lightning strike in Dublin, Ireland, has caused downtime for many sites using Amazon’s EC2 cloud computing platform, as well as users of Microsoft’s BPOS.
Amazon said that lightning struck a transformer near its data center, causing an explosion and fire that knocked out utility service and left it unable to start its generators, resulting in a total power outage.
Some quotes from the Amazon dashboard (Amazon Elastic Compute Cloud (Ireland)):

11:13 AM PDT We are investigating connectivity issues in the EU-WEST-1 region.

3:01 PM PDT A quick update on what we know so far about the event. What we have is preliminary, but we want to share it with you. We understand at this point that a lighting strike hit a transformer from a utility provider to one of our Availability Zones in Dublin, sparking an explosion and fire. Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization. We’ve now restored power to the Availability Zone and are bringing EC2 instances up. We’ll be carefully reviewing the isolation that exists between the control system and other components. The event began at 10:41 AM PDT with instances beginning to recover at 1:47 PM PDT.

Note the gap of roughly 30 minutes between the first message on the dashboard (11:13 AM PDT) and the time the event began according to Amazon (10:41 AM PDT).
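The timeline arithmetic can be checked with a few lines of Python. This is only a sketch based on the three times quoted from the dashboard; the variable and function names are made up for illustration, and all times are assumed to be on the same day and in PDT, as the dashboard states.

```python
from datetime import datetime, timedelta


def parse_pdt(clock: str) -> datetime:
    """Parse a dashboard time such as '10:41 AM' (assumed same-day PDT)."""
    return datetime.strptime(clock, "%I:%M %p")


def minutes(delta: timedelta) -> int:
    """Convert a time difference to whole minutes."""
    return int(delta.total_seconds() // 60)


# Times quoted from the Amazon dashboard
event_began = parse_pdt("10:41 AM")
first_dashboard_post = parse_pdt("11:13 AM")
recovery_began = parse_pdt("1:47 PM")

reporting_lag = first_dashboard_post - event_began
time_to_first_recovery = recovery_began - event_began

print(minutes(reporting_lag))           # 32
print(minutes(time_to_first_recovery))  # 186
```

So the first dashboard message came 32 minutes after the stated start of the event, and the first instances began to recover a little over three hours (186 minutes) after it.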

11:04 PM PDT We know many of you are anxiously waiting for your instances and volumes to become available and we want to give you more detail on why the recovery of the remaining instances and volumes is taking so long. Due to the scale of the power disruption, a large number of EBS servers lost power and require manual operations before volumes can be restored. Restoring these volumes requires that we make an extra copy of all data, which has consumed most spare capacity and slowed our recovery process. We’ve been able to restore EC2 instances without attached EBS volumes, as well as some EC2 instances with attached EBS volumes. We are in the process of installing additional capacity in order to support this process both by adding available capacity currently onsite and by moving capacity from other availability zones to the affected zone. While many volumes will be restored over the next several hours, we anticipate that it will take 24-48 hours (emphasis made by Infrarati) until the process is completed. In some cases EC2 instances or EBS servers lost power before writes to their volumes were completely consistent. Because of this, in some cases we will provide customers with a recovery snapshot instead of restoring their volume so they can validate the health of their volumes before returning them to service. We will contact those customers with information about their recovery snapshot.

Remarkably, as reported in the Irish Times, a spokeswoman for the Electricity Supply Board (ESB), Ireland’s premier electricity utility, said the incident occurred at 6:15 PM on Sunday and caused a power outage in the area for about an hour. However, she said power to Amazon was interrupted for “less than a second before an automatic supply restoration kicked in.”
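To compare the ESB statement (Irish local time) with Amazon's dashboard (PDT), both have to be put on the same clock. The sketch below does that under stated assumptions: the fixed UTC offsets for PDT (UTC-7) and Irish Summer Time (UTC+1) and the year 2011 are not given in the quoted texts, and the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed summer offsets; neither is stated in the quoted sources
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time
IST = timezone(timedelta(hours=1), "IST")   # Irish Summer Time

# The year is an assumption; the quoted texts give only "Sunday 7 August"
amazon_event_start = datetime(2011, 8, 7, 10, 41, tzinfo=PDT)  # dashboard: 10:41 AM PDT
esb_incident = datetime(2011, 8, 7, 18, 15, tzinfo=IST)        # ESB: 6:15 PM local

# Express Amazon's start time on the Dublin clock and compare
amazon_event_start_dublin = amazon_event_start.astimezone(IST)
gap_minutes = int((amazon_event_start_dublin - esb_incident).total_seconds() // 60)

print(amazon_event_start_dublin.strftime("%H:%M %Z"))  # 18:41 IST
print(gap_minutes)                                      # 26
```

Under these assumptions Amazon's stated start of the event falls at 18:41 Dublin time, 26 minutes after the time given by the ESB. The quoted statements do not explain the difference.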

Microsoft doesn’t use a public dashboard, but its Twitter feed stated “on Sunday 7 august 23:30 CET Europe data center power issue affects access to #bpos”. Four hours later came the tweet “#BPOS services are back online for EMEA customers”. It is a pity that there is no explanation of how Microsoft’s data center went down as well. Was it the same cause that brought the Amazon data center down?

The basic idea of cloud computing is that the offered services are location independent. The customer doesn’t have to worry about, or even know, where the services are produced. The customer doesn’t even have to know how the services are provided (their inner workings).
The incident in Dublin shows that, at the moment, these assumptions are wrong. As a customer of cloud computing services, you still need a good understanding of where and how the services are provided in order to assess the risks in terms of resiliency and business continuity. Only then can you make proper choices about how cloud computing services can help your organization or business without business continuity surprises. Proper risk management when using cloud computing services deserves more attention.

UPDATE 10 August

Now, three days later, Amazon is still struggling with recovery, as shown on its dashboard. InformationWeek has an interesting article about the complexity of failover design. According to this article:

It’s still possible that having the ability to fail-over to a second availability zone within the data center would have saved a customer’s system. Availability zones within an Amazon data center typically have different sources of power and telecommunications, allowing one to fail and others to pick up parts of its load. But not everyone has signed up for failover service to a second zone, and Amazon spokesman Drew Herdener declined to say whether secondary zones remained available in Dublin after the primary zone outage. (emphasis made by Infrarati)
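The failover idea in this quote can be illustrated with a small sketch. It is not Amazon's mechanism or API: the Zone class, the zone names and the health flags are hypothetical, and a real failover setup also needs replicated data and health checking, neither of which is shown here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Zone:
    """Hypothetical model of an availability zone with its own power and network feed."""
    name: str
    power_ok: bool
    network_ok: bool

    @property
    def healthy(self) -> bool:
        return self.power_ok and self.network_ok


def choose_zone(preferred_order: Sequence[Zone]) -> Optional[Zone]:
    """Return the first healthy zone in preference order, or None if all are down."""
    for zone in preferred_order:
        if zone.healthy:
            return zone
    return None


# Illustrative state: the primary zone has lost power, the secondary has not
primary = Zone("zone-a", power_ok=False, network_ok=True)
secondary = Zone("zone-b", power_ok=True, network_ok=True)

single_zone_choice = choose_zone([primary])            # customer without failover
multi_zone_choice = choose_zone([primary, secondary])  # customer with a second zone

print(single_zone_choice)      # None
print(multi_zone_choice.name)  # zone-b
```

A customer deployed in a single zone has nowhere to go when that zone loses power, while a customer with a second zone can continue, provided the second zone is really independent of the first and was itself still available.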

UPDATE 11 August

The Irish Times revealed some new information about the outage. ESB Networks confirms that it suffered a failure in a 110kV transformer in Citywest, Dublin, at 6:16 p.m. local time on Sunday, August 7.

“The cause of this failure is still being investigated at this time, but our initial assessment of lightning as the cause has now been ruled out,” a statement from ESB Networks said. The ESB also said on Monday that Amazon was interrupted for “less than a second before an automatic supply restoration kicked in”. Yesterday the ESB confirmed that Amazon was one of about 100 customers affected by the ongoing service interruption. “This initial supply disruption lasted for approximately one hour as ESB Networks worked to restore supply. There was an ongoing partial outage in the area until 11pm. The interruption affected about 100 customers in the Citywest area, including Amazon and a number of other data centers.”

A second Amazon data centre in south Dublin experienced a “voltage dip which lasted for less than one second”, the ESB said yesterday. However, this data centre was not directly affected by the power cut. This one-second voltage dip that had been cited added to the confusion about Sunday’s events.

ESB made it clear that in referencing a lightning strike, Amazon was sharing its best information at the time. “Amazon accurately reported the information which had been passed to them from workers at the site,” said Marguerite Sayers, Head of Asset Management at ESB Networks. “Both the explosion and fire were localized to the bushings or insulators of the transformer and did not require the attendance of the emergency services. The extent of associated internal damage to the transformer was serious and resulted in supply interruption to a number of customers, and also impacted Amazon’s systems, as they have reported.”

The article confirms that you, as a customer, need a good understanding of where and how the services are provided in order to assess the risks in terms of resiliency and business continuity. The question that arises, though, is how deep this knowledge must go. It looks like the outage in Dublin is not only about IT design but also about facility engineering. So how far does the responsibility of the customer extend, and where does the responsibility of the provider begin?
