This was an interesting week, with the first “cloud failure” big enough that anyone with an Internet connection took notice. There has been much gnashing of teeth and hyperbole, and several embarrassing moments for a lot of teams, not least at Amazon. I think it is worth spending a few moments to take stock.
The specific cause of the failure is not actually important as a lesson learned to anyone but Amazon. We’ll have to wait and see if AWS does the right thing and gives a detailed post-mortem as part of restoring trust. The explanation so far revolves around a cascading failure of EBS infrastructure, but still missing is an explanation as to why the failure crossed multiple “availability zones” (i.e. more than one physically separate data center… maybe). My best guess is that EBS, which was already proving to be fragile, was pushed into a cascade failure by tenant applications trying to fail over to another AZ when the first failed. One of AWS’s own lessons learned had better be communicating status adequately to its customers.
The broader lesson, for those who still needed to learn it, is that “The Cloud” is not a magical place populated by unicorns and data faeries. It is in fact data centers as (TK) noted, based on servers and databases as Larry Ellison is happy to explain.(TK)
The lesson all CEOs and CTOs need to take away is to make sure your shit is architected properly. Stuff happens. Hopefully your boards don’t take away the wrong lessons.[1]
A number of popular services were completely hamstrung by the outage, including Quora and Reddit. (After a five-year relationship with Reddit I have to say its being down was probably worth several man-years of additional productivity around the world this week, but it is the one shining exception.)
Try not to take away lessons that are Amazon-specific, e.g. “don’t trust EBS” or “AZs don’t mean what you think they mean.” Instead focus on redundancy, avoiding single points of failure, etc. Resiliency costs money, so make your trade-off between tolerating a few hours or days of downtime and increasing your costs by 50–100%.
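To put rough numbers on that trade-off, here is a minimal Python sketch. It assumes the redundant copies fail independently, which is exactly the assumption that broke down when this outage crossed availability zones, so treat the result as an optimistic bound. The function names and every figure in the usage lines are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    Assumes failures are independent; correlated failures make
    the real number worse than this.
    """
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def redundancy_pays_off(single: float, copies: int, base_cost: float,
                        cost_increase: float,
                        downtime_cost_per_hour: float) -> bool:
    """True if the downtime avoided is worth more than the extra spend.

    `cost_increase` is a fraction of `base_cost`, e.g. 0.5 for +50%.
    """
    before = expected_downtime_hours(single)
    after = expected_downtime_hours(combined_availability(single, copies))
    saved = (before - after) * downtime_cost_per_hour
    return saved > base_cost * cost_increase


if __name__ == "__main__":
    # Hypothetical inputs: 99% availability per deployment, two copies,
    # $100k/year base cost, +50% for redundancy, $1k lost per hour down.
    print(combined_availability(0.99, 2))
    print(expected_downtime_hours(0.99))
    print(redundancy_pays_off(0.99, 2, 100_000, 0.5, 1_000))
```

The interesting input is the cost of an hour of downtime: if that number is small for your business, eating the occasional outage may honestly be the right call.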
The bottom line in resilient design is to assume stuff will fail.
For the edification of anyone trying to bullet-proof their systems, and for my own future reference, here is a round-up of lessons learned:
Twilio was not impacted due to their design principles.[2] There is also an old presentation that explains how they organize their virtual infrastructure on AWS.[3] That Twilio did not go down is great news, since an outage there would have cascaded to take out even more services, and it is solid proof of the importance of good engineering.
Sulia stayed up due to their doubly redundant infrastructure planning.[4]
Some design thoughts from Agile Sysadmin.[5]
Pounding home the “assume things will fail” lesson: Netflix is AWS-based and did not go down. Months ago they wrote a tool for themselves, called Chaos Monkey, that deliberately attacks their own infrastructure.[6] Note that this was posted five months ago. There is also a current discussion of their infrastructure resiliency at Hacker News,[7] and slides from a presentation given in March.[8]
George Reese wrote a nice summary of design considerations.[9]
Some load balancing and IP routing thoughts related to this from James Cohen.[10]
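For anyone who has not seen the Chaos Monkey idea in practice, here is a toy Python simulation of it: kill a random instance on purpose and check whether the service is still up. This is not Netflix’s tool and it talks to no real cloud API; the `Fleet` class, the `min_healthy` threshold, and the instance names are all hypothetical stand-ins.

```python
import random


class Fleet:
    """Toy pool of interchangeable instances (hypothetical, not an AWS API)."""

    def __init__(self, instance_ids, min_healthy):
        self.running = set(instance_ids)
        # Fewest running instances at which we still count as "serving".
        self.min_healthy = min_healthy

    def terminate(self, instance_id):
        self.running.discard(instance_id)

    def is_serving(self):
        return len(self.running) >= self.min_healthy


def chaos_round(fleet, rng):
    """Kill one randomly chosen running instance.

    Returns (victim, still_serving); victim is None if nothing was running.
    """
    if not fleet.running:
        return None, fleet.is_serving()
    # Sort first so the choice is reproducible for a seeded rng.
    victim = rng.choice(sorted(fleet.running))
    fleet.terminate(victim)
    return victim, fleet.is_serving()


if __name__ == "__main__":
    fleet = Fleet(["web-1", "web-2", "web-3"], min_healthy=2)
    rng = random.Random()
    victim, ok = chaos_round(fleet, rng)
    print(f"killed {victim}; still serving: {ok}")
```

The point of running something like this routinely is that a design which only survives failures in theory gets found out on your schedule rather than your provider’s.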
Cloud services can be designed so that they don’t have single points of failure, with the virtualization sitting directly underneath your app rather than under your virtual OS. Unfortunately Heroku, Engine Yard, etc. do not yet have an architecture that truly avoids virtual infrastructure failures; they were themselves harmed by the AWS outage.
Here’s to sunnier skies…
[1] Amazon’s Trouble Raises Cloud Computing Doubts (NYTimes) http://www.nytimes.com/2011/04/23/technology/23cloud.html
[2] Why Twilio Wasn’t Affected by Today’s AWS Issues http://www.twilio.com/engineering/2011/04/22/why-twilio-wasnt-affected-by-todays-aws-issues/
[3] Twilio Voice Applications with Amazon AWS http://www.slideshare.net/twilio/twilio-voice-applications-with-amazon-aws-s3-and-ec2-presentation
[4] How our small startup survived the Amazon EC2 Cloud-pocalypse http://xenon.stanford.edu/~silver/ec2outage.html
[5] Today’s EC2/EBS Outage: Lessons Learned http://agilesysadmin.net/ec2-outage-lessons
[6] 5 Lessons We’ve Learned Using AWS http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
[7] Some quotes regarding how Netflix handled this without interruptions http://news.ycombinator.com/item?id=2470773
[8] Escaping the Chaos Monkey http://blogs.vmware.com/rethinkit/2011/03/escaping-the-chaos-monkey-enterprise-vs-commodity-cloud.html
[9] The AWS Outage: The Cloud’s Shining Moment http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html
[10] How to work around Amazon EC2 outages http://webmonkeyuk.wordpress.com/2011/04/21/how-to-work-around-amazon-ec2-outages/