Amazon's Sunday Outage - Why Network Availability Zones Matter
Amazon's US-EAST-1 data center continues to be a weak link in the cloud.
Sunday morning was not a good time for Amazon's cloud and for those who rely on it - or who at least have made the mistake of relying on only one availability zone.
Starting at around 5:13 AM ET, alerts began to pop up all over the web about slowdowns and failures on all manner of sites. One by one, the status indicators for multiple Amazon services lit up with reports of elevated login failures and increased error rates. All of the alerts pointed to one location: Amazon's Northern Virginia US-EAST-1 data center.
The error messages cascaded throughout the Amazon data center infrastructure as slowdowns in one service led to failures in another. Among the affected services was Amazon Cognito, which reported increased error rates for acquiring identities and credentials and for synchronizing datasets starting at 5:13 AM ET. Amazon WorkSpaces reported increased connectivity issues and API error rates starting at 5:25 AM ET, and Amazon AppStream's errors began around 9:44 AM ET. The main Amazon Web Services Management Console also slowed down starting at 9:01 AM ET as users "experienced elevated error rates."
Across all of the impacted Amazon cloud services, operations were largely restored to normal by 10:55 AM ET. The longest service impacts lasted nearly six hours before Amazon's status board read all clear.
Among the major sites and services impacted by Amazon's Sunday morning troubles were Reddit, Netflix, Tinder and Airbnb.
At this point, it's not entirely clear what caused Sunday's trouble, though this isn't the first time US-EAST has had problems, and it likely won't be the last. The last major Amazon data center outage also impacted US-EAST and also occurred on a Sunday. On August 25, 2013, Amazon Web Services had degraded US-EAST services for four hours. In that instance, the root cause was eventually identified as a single networking device that failed.
Time and again, Amazon has recommended that its customers use more than one data center region and take advantage of "Availability Zones" (AZs). There are currently nine Amazon regions globally, three of which are in the U.S.: US-EAST-1 in Northern Virginia, US-WEST-1 in Northern California, and US-WEST-2 in Oregon.
"Amazon operates state-of-the-art, highly-available data centers. Although rare, failures can occur that affect the availability of instances that are in the same location," Amazon states in its online documentation. "If you host all your instances in a single location that is affected by such a failure, none of your instances would be available."
The basic idea for customers is that by using multiple data centers and multiple instances, cloud users get the benefit of high availability and failover. That promise is sometimes easier said than done, though. Operating a cloud deployment across multiple regions comes with added expense as well as complexity.
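To make the failover idea concrete, here is a minimal Python sketch of how a client might fall back from one region to another. It is an illustration only, not Amazon's recommended implementation: the REGION_ENDPOINTS hostnames are placeholders, and pick_endpoint and the is_healthy callback are hypothetical names invented for this example. In a real deployment the health check would typically be an HTTP probe or a DNS-level failover policy.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical per-region endpoints. The hostnames are placeholders,
# not real services; only the region names come from the article.
REGION_ENDPOINTS = {
    "us-east-1": "https://app.us-east-1.example.com",
    "us-west-1": "https://app.us-west-1.example.com",
    "us-west-2": "https://app.us-west-2.example.com",
}


def pick_endpoint(
    preferred_order: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Return (region, endpoint) for the first healthy region, or None.

    `is_healthy` is supplied by the caller (an assumption of this sketch),
    so the selection logic can be exercised without any network access.
    """
    for region in preferred_order:
        endpoint = REGION_ENDPOINTS[region]
        try:
            if is_healthy(endpoint):
                return region, endpoint
        except Exception:
            # A health check that raises is treated the same as a failed one.
            continue
    return None


if __name__ == "__main__":
    # Simulate an outage confined to US-EAST-1.
    def east_is_down(url: str) -> bool:
        return "us-east-1" not in url

    print(pick_endpoint(["us-east-1", "us-west-2", "us-west-1"], east_is_down))
```

The sketch only works if the application is actually deployed in the fallback regions, which is where the added expense and complexity come in.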
From a networking perspective, the idea of high-availability across multiple sources has long been held as a best practice. Simply put, anytime there is a single point of failure, when that single point fails, you fail.
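A quick back-of-the-envelope calculation shows why redundancy helps. The sketch below assumes each location fails independently of the others, which is an idealization (shared software, networking, or operator error can take out several locations at once), and the 99 percent figure is purely illustrative, not a number published by Amazon. The function name combined_availability is made up for this example.

```python
def combined_availability(single: float, copies: int) -> float:
    """Probability that at least one of `copies` locations is up.

    Assumes every location has the same availability `single` and that
    failures are independent, which real outages do not always respect.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only if every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, round(combined_availability(0.99, n), 6))
```

Under those assumptions, one location at 99 percent availability leaves a 1 percent chance of being down, while two such locations are both down only about 0.01 percent of the time.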
Sean Michael Kerner is a senior editor at Enterprise Networking Planet and InternetNews.com. Follow him on Twitter @TechJournalist.