Amazon outage and the auto-immune vulnerabilities of resiliency
Today is Judgment Day, when Skynet becomes self-aware. It is, apparently, also a very, very bad day for Amazon Web Services.
Lots of people have raised questions about what Amazon’s difficulties today mean for the future of cloud IaaS. My belief is that this doesn’t do anything to the adoption curve, but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures.
It’s important to understand what did and did not happen today. There’s been a popular impression that “EC2 is down”. It’s not. To understand what happened, though, some explanation of Amazon’s infrastructure is necessary.
Amazon.com’s real problem isn’t the outage, it’s the communication
Like many companies running on Amazon Web Services, BigDoor has been affected by the AWS outage today. And like most startups, we are braced for bad stuff to happen, and we do our best to learn from the painful stuff. We spent the better part of the day in constant contact via Twitter, email, and phone calls, apologizing to and updating the more than 250 publishers who were affected. Today has provided plenty of lessons, and because transparency is fast becoming the lifeblood of Seattle’s startup community, we thought we’d pass a few along.
AWS is down: Why the sky is falling
Amazon Web Services, “the cloud” to many people, has had a significant issue in one of their datacenters since about 1 AM Pacific time on April 21st. Some huge websites (reddit, quora, foursquare) are all down or significantly impacted. I’ve seen a lot of misinformation suggesting that this is all purely due to the laziness of the affected sites’ engineers, but that isn’t the case. Here’s why:
Standing on the shoulders of giants and stumbling with them – the Amazon AWS outage’s “pain” statistics
Today, at around 1 AM Pacific time, Amazon began having major problems with some of their cloud infrastructure: specifically their EC2, EBS, and RDS offerings. The issues are ongoing, and many of your favorite internet sites or services are probably still down or running at reduced functionality because of it.
This kind of outage is one of PagerDuty’s big “moments”: when a big chunk of the services on the internet say, “Hey PagerDuty, I’m down, so wake someone up to fix me!”
Cloud crash has a silver lining
If you are at all interested in cloud computing, you will no doubt have heard by now that this has been a dark day in the world of cloud computing. Something has gone terribly wrong in the networking in Amazon’s US-East-1 region. That caused storage subsystems to go on the fritz, and the rest, as they say, is history. Some very prominent web properties have been down, and the volume of chatter on the subject has been deafening. I had to turn the sound on my TweetDeck off, as I could no longer stand to hear the constant beeping generated by arriving tweets.
Today’s EC2 / EBS Outage: Lessons learned
Today Britain woke to the news that Amazon Web Services had suffered a major outage in its US East facility. This affected Heroku, Reddit, Foursquare, Quora and many more well-known internet services hosted on EC2. The cause of the outage appears to have been a case of so-called ‘auto-immune disease’: Amazon’s automated processes began remirroring a large number of EBS volumes, which had a knock-on effect of significantly degrading EBS (and thus RDS) performance and availability across multiple availability zones. Naturally, the naysayers were out in force, decrying cloud-based architectures as doomed to failure from the very start. As the dust starts to settle, we attempt to distill some lessons from the outage.
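To make the ‘auto-immune’ failure mode concrete: below is a toy Python sketch of why remirroring every degraded volume at once can starve a storage cluster, and how a simple concurrency cap bounds the damage. All numbers and names are invented for illustration; this is emphatically not how EBS is implemented internally.

```python
# Toy model of a storage cluster. All numbers are invented for
# illustration; this is not how EBS actually works internally.
SPARE_CAPACITY = 20      # concurrent remirrors the cluster can absorb
DEGRADED_VOLUMES = 500   # volumes that all want to re-replicate at once

def recover(num_degraded, cap=None):
    """Run recovery in rounds; count rounds where recovery crowds out live I/O."""
    pending = num_degraded
    rounds = overloaded = 0
    while pending > 0:
        want = pending if cap is None else min(pending, cap)
        if want > SPARE_CAPACITY:
            # Recovery demand exceeds spare capacity, so remirror traffic
            # now competes with healthy volumes' foreground I/O: the
            # "auto-immune" moment.
            overloaded += 1
        pending -= min(want, SPARE_CAPACITY)  # only this many finish per round
        rounds += 1
    return rounds, overloaded

print("uncapped: rounds=%d, overloaded rounds=%d" % recover(DEGRADED_VOLUMES))
print("capped:   rounds=%d, overloaded rounds=%d" % recover(DEGRADED_VOLUMES, cap=10))
```

The capped run takes more rounds to drain the backlog, but the recovery process never crowds out live traffic, which is precisely the trade-off the ‘auto-immune’ framing points at.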
Why Twilio Wasn’t Affected by Today’s AWS Issues
Starting early this morning, Amazon Web Services experienced several service problems at one of its east coast datacenters. The outage impacted major sites across the Internet. The number of high-profile sites affected by the issue shows both the amazing success of cloud services in enabling the current Internet ecosystem and the importance of solid distributed architectural design when building cloud services.
Twilio’s APIs and service were not affected by the AWS issues today. As we’ve grown and scaled Twilio on Amazon Web Services, we’ve followed a set of architectural design principles to minimize the impact of occasional but inevitable issues in underlying infrastructure.
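Twilio’s post describes principles rather than code, but one principle of this general kind, short timeouts plus idempotent retries against redundant replicas, can be sketched in a few lines. The replica list, failure rate, and function names below are invented for illustration and are not Twilio’s implementation:

```python
import random
import time

# Hypothetical redundant endpoints in different availability zones.
REPLICAS = ["10.0.1.5", "10.0.2.5", "10.0.3.5"]

class Unavailable(Exception):
    pass

def call_replica(host, request):
    """Stand-in for a real RPC guarded by a short timeout."""
    if random.random() < 0.4:  # pretend one zone is having a very bad day
        raise Unavailable(host)
    return "ok from %s" % host

def resilient_call(request, attempts=5):
    """Retry against randomly chosen replicas with capped exponential backoff.

    This is only safe if the request is idempotent: a call that timed
    out may still have succeeded on the far side.
    """
    last_error = None
    for attempt in range(attempts):
        host = random.choice(REPLICAS)  # don't pin traffic to a single zone
        try:
            return call_replica(host, request)
        except Unavailable as exc:
            last_error = exc
            time.sleep(min(2.0, 0.1 * (2 ** attempt)))
    raise last_error  # out of attempts: surface the failure, don't hang

print(resilient_call({"action": "send_sms"}))
```

The design choice worth noting is the random replica selection: a client that always retries the same host turns a single-host problem into its own outage, while spreading retries converts it into a small latency tax.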
The AWS Outage: The Cloud’s Shining Moment
So many cloud pundits are piling on to the misfortunes of Amazon Web Services this week as a response to the massive failures in the AWS Virginia region. If you think this week exposed weakness in the cloud, you don’t get it: it was the cloud’s shining moment, exposing the strength of cloud computing.
In short, if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model. The strength of cloud computing is that it puts control over application availability in the hands of the application developer, not in the hands of your IT staff, a managed services provider, or the limitations of a data center.
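As a deliberately simplified illustration of what ‘designing for Amazon’s cloud computing model’ can mean in practice: route requests only to availability zones that pass a health check, so losing one zone costs capacity rather than availability. The zone map and endpoints below are hypothetical:

```python
import random

# Hypothetical deployment: the same application stack in three
# availability zones, with health flags fed by external monitoring.
ZONES = {
    "us-east-1a": {"endpoint": "app-a.example.com", "healthy": True},
    "us-east-1b": {"endpoint": "app-b.example.com", "healthy": False},  # the bad zone
    "us-east-1c": {"endpoint": "app-c.example.com", "healthy": True},
}

def pick_endpoint(zones):
    """Send traffic only to healthy zones; one zone failing costs
    capacity, not availability."""
    healthy = [z["endpoint"] for z in zones.values() if z["healthy"]]
    if not healthy:
        # Every zone is down: only now is it genuinely the provider's problem.
        raise RuntimeError("no healthy availability zone left")
    return random.choice(healthy)

print(pick_endpoint(ZONES))
```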
Amazon’s Trouble Raises Cloud Computing Doubts
As technical problems interrupted computer services provided by Amazon for a second day on Friday, industry analysts said the troubles would prompt many companies to reconsider relying on remote computers beyond their control.
“This is a wake-up call for cloud computing,” said Matthew Eastwood, an analyst for the research firm IDC, using the term for accessing services and information in big data centers remotely over the Internet from anywhere, as if the services were in a cloud. “It will force a conversation in the industry.”