As promised, Amazon released a post-mortem report on the data center outage in Dublin. It reconfirmed the domino theory: if one availability zone suffers errors and outages, the other availability zones follow in a domino effect.
Amazon stated, as mentioned in an earlier blog entry about the Dublin outage, that the utility provider now believes it was not a lightning strike that brought down the 110kV 10 megawatt transformer. This outage was the start of a series of incidents that would finally bring the Dublin data center to its knees.
“With no utility power, and backup generators for a large portion of this Availability Zone disabled, there was insufficient power for all of the servers in the Availability Zone to continue operating. Uninterruptable Power Supplies (UPSs) that provide a short period of battery power quickly drained and we lost power to almost all of the EC2 instances and 58% of the EBS volumes in that Availability Zone. We also lost power to the EC2 networking gear that connects this Availability Zone to the Internet and connects this Availability Zone to the other Availability Zones in the Region.”
Within 24 minutes Amazon “were seeing launch delays and API errors in all (emphasis by Infrarati) EU West Availability Zones.” Amazon gave two reasons for this: “The management servers which receive requests continued to route requests to management servers in the affected Availability Zone. Because the management servers in the affected Availability Zone were inaccessible, requests routed to those servers failed. Second, the EC2 management servers receiving requests were continuing to accept RunInstances requests targeted at the impacted Availability Zone. Rather than failing these requests immediately, they were queued and our management servers attempted to process them.”
“Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs.” Later, Amazon restored power to enough of the network services to reconnect. However, they then found that their database cluster was in an unstable condition. The final blow was unrelated to the power loss: “Separately, and independent from issues emanating from the power disruption, we discovered an error in the EBS software that cleans up unused storage for snapshots after customers have deleted an EBS snapshot.”
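The queueing behaviour Amazon describes can be illustrated with a small sketch. This is not Amazon's implementation; the `RequestRouter` class, the zone names and the queue limit are all hypothetical. It only shows the general idea of failing a request immediately when its target zone is known to be down, instead of queueing it and tying up the servers that received it.

```python
from collections import deque


class ZoneUnavailable(Exception):
    """Raised when a request targets a zone that is marked as down."""


class QueueFull(Exception):
    """Raised when a zone's queue has reached its limit."""


class RequestRouter:
    """Hypothetical front-end router that fails fast instead of queueing."""

    def __init__(self, zones, max_queue=100):
        self.healthy = {zone: True for zone in zones}
        self.queues = {zone: deque() for zone in zones}
        self.max_queue = max_queue

    def mark_down(self, zone):
        # Called by some (assumed) health check when a zone is unreachable.
        self.healthy[zone] = False

    def mark_up(self, zone):
        self.healthy[zone] = True

    def submit(self, zone, request):
        # Fail immediately for an unhealthy zone rather than queueing.
        if not self.healthy[zone]:
            raise ZoneUnavailable(f"{zone} is unavailable, request rejected")
        queue = self.queues[zone]
        # Bound the queue so a slow zone cannot accumulate unlimited work.
        if len(queue) >= self.max_queue:
            raise QueueFull(f"{zone} queue is full, request rejected")
        queue.append(request)
        return len(queue)
```

With this shape, a caller that targets the failed zone gets an error straight away, while requests for the other zones are still accepted and queued as usual.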
In their description of actions to prevent recurrence, Amazon stated that “Over the last few months, we have been developing further isolation of EC2 control plane components (i.e. the APIs) to eliminate possible latency or failure in one Availability Zone from impacting our ability to process calls to other Availability Zones.” (emphasis by Infrarati).
The Dublin incident shows that Amazon is still developing and improving the isolation between availability zones. Services in one zone are not yet safeguarded from incidents in other availability zones. For the customer, that is a ‘must know’ rather than a ‘nice to know’.
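One common way to get this kind of isolation is a per-zone "bulkhead": each zone gets its own fixed number of in-flight call slots, so calls stuck on one zone cannot use up the capacity needed for the others. The sketch below is only an illustration of that general pattern, not a description of what Amazon built; the `ZoneBulkhead` class, the zone names and the slot count are hypothetical.

```python
import threading


class ZoneBulkhead:
    """Hypothetical per-zone limiter for in-flight API calls."""

    def __init__(self, zones, slots_per_zone):
        # One independent semaphore per zone: capacity is never shared.
        self._slots = {
            zone: threading.BoundedSemaphore(slots_per_zone) for zone in zones
        }

    def try_acquire(self, zone):
        # Non-blocking: returns False at once if the zone's slots are all
        # in use, so the caller can reject the call instead of waiting.
        return self._slots[zone].acquire(blocking=False)

    def release(self, zone):
        # Called when a call to the zone has finished (or timed out).
        self._slots[zone].release()


bulkhead = ZoneBulkhead(["zone-a", "zone-b"], slots_per_zone=2)

# Simulate two calls hanging on a failed zone-a: both slots stay taken.
bulkhead.try_acquire("zone-a")
bulkhead.try_acquire("zone-a")

zone_a_accepts = bulkhead.try_acquire("zone-a")  # no slot left for zone-a
zone_b_accepts = bulkhead.try_acquire("zone-b")  # zone-b is unaffected
```

Here the hanging calls exhaust only the slots of the failed zone; a call to the other zone still gets a slot.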