
The EC2/EBS outage: What Amazon didn’t tell you

by Jeremy L. Gaddis on April 29, 2011

in Cloud Computing

This is from April 23rd, well before Amazon posted their official post-mortem. Take it for what it’s worth:

so this is what happened, … our elastic compute cloud aka ec2 has a subsection (EBS) that contains a block full of storage space. a network engineer (not disclosing any names) was doing an iOS upgrade on one of our aggregation routers. while the firmware was still patching he decided to bring down another one of the agg networks.

Well, this tripped preventative measures and the systems saw that one of the agg’s were down and quickly made a backup, well since the network engineer decided to do what it says NOT to do IN BIG RED TEXT and started the other process while the agg was still in update this caused our systems to create backups FOR EVERY SINGLE HOST.

It FILLED the whole EBS storage block and kept attempting to create MORE AND MORE (neverending cycle) this was overflowing everything and the higher-up guys at seattle had to make the last resort decision and they brought down the WHOLE ec2 subsection that controls the storage. so we had to transport hosts from different sites to the primary location to create an extra storage space in the meantime while we are fixing everything.

now i and a few dozen other technicians had to go through each host that tripped a fault and fix any issue with it, whether it was networking issues, hardware issues, etc. we’re attempting to fix HUNDREDS of hosts in ONE day. its not going to happen though, we will fix enough to get everything back up and running
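Whether or not the account is accurate, the failure mode it describes is a classic one: an automated reaction ("make a backup of every host") that consumes a fixed, shared storage pool and then retries forever once the pool is full. A toy sketch of that dynamic, with all names and numbers invented for illustration:

```python
# Hypothetical illustration only -- not Amazon's actual logic.
# Each "volume" that loses its replica requests space for a new copy.
# Requests that can't be satisfied join a backlog and retry forever,
# which is the "neverending cycle" the account describes.

def simulate_mirror_storm(volumes, pool_capacity, vol_size=1):
    """Return (space used, list of volumes stuck waiting for space)."""
    used = 0
    backlog = []
    for vol in range(volumes):
        if used + vol_size <= pool_capacity:
            used += vol_size        # backup created, space consumed
        else:
            backlog.append(vol)     # no space left: stuck, retrying
    return used, backlog

used, backlog = simulate_mirror_storm(volumes=1000, pool_capacity=600)
print(used)           # 600 -> the pool is completely full
print(len(backlog))   # 400 volumes stuck re-requesting space
```

The point of the sketch is that nothing in the loop ever frees space, so once the pool fills, every remaining request is doomed; the only exits are adding capacity (trucking in hosts, as the account claims) or shutting the whole mechanism down.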

I asked the guys when i first started there if i’d ever see a severity level 1, and they said you will NEVER see one in your career with amazon.. well.. what do you know haha. it was pretty interesting though i hope we get it resolved asap. this most likely wont be fixed for the next few days although the websites will be going back up

{ 4 comments }

Matt Sickler April 29, 2011 at 12:14 pm

I have a feeling that guy got canned for that: change controls and policies are in place to prevent exactly this sort of thing.

That being said, in the official postmortem they state “Availability Zones are physically and logically separate infrastructure …”, yet the ‘EBS Control Plane’ was a single logical unit overseeing all of the AZs (at least that’s how I understand it). So when one AZ hammered the control plane, service to the other AZs was degraded.
Now, I can see why they built the architecture that way: it’s probably easier to provide cross-AZ services with a single manager that handles all the AZs. But apparently the AZs aren’t actually as separate as they initially said.

As usual, hindsight is 20/20 and I haven’t ever come close to dealing with something like this, so take my words with a grain of salt.
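The shared-control-plane concern in the comment above can be sketched as a single request queue serving every AZ (a toy model; the AZ names and queue semantics are invented, not AWS internals):

```python
from collections import deque

# Hypothetical model: one shared control-plane queue for all AZs.
# When one AZ floods the queue, requests from healthy AZs sit behind
# the flood, so isolation at the data layer doesn't protect them here.

def wait_positions(queue, az):
    """Queue positions of a given AZ's requests (lower = served sooner)."""
    return [i for i, req in enumerate(queue) if req == az]

# az-1 is storming the control plane with 50 requests.
control_plane = deque(["az-1"] * 50 + ["az-2", "az-3"])
print(wait_positions(control_plane, "az-2"))  # [50]: stuck behind az-1's flood
```

A per-AZ control plane (or per-AZ admission limits on the shared one) would cap `az-2`'s wait at its own backlog, which is essentially the isolation the postmortem says was missing.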


stretch May 1, 2011 at 1:50 pm

an iOS upgrade on one of our aggregation routers



Dan May 2, 2011 at 3:13 pm

Ahh god it’s good to see Stretch out on the internet in a place other than PL. ಠ_ಠ @iOS indeed Stretch…


Burchard May 21, 2011 at 3:14 pm

And I thought I was the sensible one. Thanks for setting me straight.

