What lessons can we learn from the Amazon Web Services outage?
On Tuesday, Amazon's S3 storage service went down and took a lot of the internet with it. Some of the world's most popular websites and apps, including Docker, Medium, and Slack, went offline or lost functionality for almost four hours while engineers worked to fix the problem.
There’s a lot to learn from an outage like this – and a lot of nonsense being posted by people who sound convincing at first glance but don’t really seem to know what they’re talking about. Specifically, those shouting about multi-provider infrastructure and how the cloud is a problem for the internet.
Here are some aspects of the outage that stand out to me:
1. AWS has failures.
All cloud providers do, and that means you should architect cloud solutions using the same principles as you would for on-premises solutions. This includes replicating your data outside its primary location to a backup location that can be brought online in the event of failure. You also need to decide at what point a failure will take your site down – is it the failure of a single server, a data center, or a region? If you’re not prepared to accept downtime when a region (in AWS terminology) fails, then your architecture should reflect that. The key thing here is that the cloud doesn’t magically stay up. A sketch of one piece of this – replicating S3 data to a second region – follows below.
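As a concrete illustration, here is a minimal sketch (using Python and boto3) of enabling S3 cross-region replication so that objects written to a primary bucket are copied to a bucket in another region. The bucket names and IAM role ARN are placeholders, and the destination bucket and replication role would need to exist already – this is one way to do it, not a prescription.

```python
import boto3

# Placeholder names - the buckets and the IAM replication role must already exist,
# and the destination bucket must also have versioning enabled.
SOURCE_BUCKET = "my-app-data-us-east-1"
DEST_BUCKET_ARN = "arn:aws:s3:::my-app-data-eu-west-1"
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication-role"

s3 = boto3.client("s3")

# Cross-region replication requires versioning on the source bucket.
s3.put_bucket_versioning(
    Bucket=SOURCE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate every object written from now on to the bucket in the second region.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",  # empty prefix = all objects
                "Status": "Enabled",
                "Destination": {"Bucket": DEST_BUCKET_ARN},
            }
        ],
    },
)
```

Note that replication only covers new writes – existing objects would need to be copied across separately, and your application still has to know how to read from the backup bucket when the primary region is unavailable.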
2. Some pretty big sites went down.
Some big sites, including Trello, Quora, and IFTTT, went offline as a result of the outage. This tells us that those sites have not built resiliency to region failure into their solutions – or if they have, it didn’t work. Of course, we don’t know the details of how those sites are built, but it is possible to tolerate the failure of a region on AWS. For S3 this could have been achieved using things like cross-region replication, CloudFront, and Route 53, and for the wider infrastructure – EC2 and so on – using CloudFormation templates in secondary regions (a sketch of the DNS failover piece follows below). It’s interesting that such large sites haven’t done this. Given point 3, it’s perhaps reasonable to assume a region won’t go down, but still. There could be a number of reasons. They may lack operational experience or knowledge in their teams, leading to a reliance on developers and a focus on code rather than infrastructure. Alternatively, they might simply have decided to accept downtime if a region failed, because the cost of being more resilient wasn’t worth it – which isn’t a strategy completely without merit.
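To make the failover idea concrete, here is a hedged sketch of what Route 53 DNS failover between a primary and a secondary region could look like with boto3. The hosted zone ID, health check ID, domain name, and IP addresses are placeholders, and this is only one of several ways the sites above might have approached it.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "ZXXXXXXXXXXXXX"        # placeholder hosted zone
PRIMARY_HEALTH_CHECK_ID = "placeholder"  # placeholder health check on the primary endpoint

# Two records with the same name: the PRIMARY record is served while its
# health check passes; the SECONDARY (in another region) takes over when it fails.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                    "HealthCheckId": PRIMARY_HEALTH_CHECK_ID,
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "SetIdentifier": "secondary-eu-west-1",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "198.51.100.20"}],
                },
            },
        ]
    },
)
```

DNS failover only solves the routing part, of course – the secondary region still needs a working copy of the application and its data, which is where the CloudFormation templates and replicated storage come in.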
3. The size of this outage.
This was a big outage. There are currently 16 AWS regions, with 2 more in the works. In AWS terminology, each region is a collection of Availability Zones (AZs), which are essentially data centers. Every AWS region has a minimum of 2 AZs – the us-east-1 (North Virginia) region that went down has 4. This means an engineer was able to execute a playbook that took down a critical service across 4 data centers!
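If you want to see those numbers for yourself, a quick boto3 sketch like the one below lists each region and how many Availability Zones it reports – bearing in mind the counts you get will reflect AWS's footprint today, not the 16 regions at the time of the outage.

```python
import boto3

# Ask EC2 for the list of regions, then query each region for its AZs.
ec2 = boto3.client("ec2", region_name="us-east-1")
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in sorted(regions):
    regional_ec2 = boto3.client("ec2", region_name=region)
    zones = regional_ec2.describe_availability_zones()["AvailabilityZones"]
    print(f"{region}: {len(zones)} Availability Zones")
```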
4. Post mortem.
The post mortem published by Amazon seems to have been pretty well received, and I felt it demonstrated a certain degree of ownership and class on their part. It was concise, to the point, and, most importantly, didn’t attempt to deflect responsibility – they explained it, owned it, and apologised.