This is from April 23rd, well before Amazon posted their official post-mortem. Take it for what it’s worth:
So this is what happened … our Elastic Compute Cloud, aka EC2, has a subsection (EBS) that contains a block full of storage space. A network engineer (not disclosing any names) was doing an IOS upgrade on one of our aggregation routers. While the firmware was still patching, he decided to bring down another one of the agg routers.
Well, this tripped preventative measures: the systems saw that one of the aggs was down and quickly made a backup. But since the network engineer did what it says NOT to do IN BIG RED TEXT and started the second process while the first agg was still mid-update, our systems started creating backups FOR EVERY SINGLE HOST. It FILLED the whole EBS storage block and kept attempting to create MORE AND MORE (a never-ending cycle). This was overflowing everything, and the higher-up guys in Seattle had to make the last-resort decision: they brought down the WHOLE EC2 subsection that controls the storage.

So we had to transport hosts from different sites to the primary location to create extra storage space in the meantime while we fix everything. Now I and a few dozen other technicians have to go through each host that tripped a fault and fix whatever is wrong with it: networking issues, hardware issues, etc. We're attempting to fix HUNDREDS of hosts in ONE day. It's not going to happen, though; we will fix enough to get everything back up and running.
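The failure mode described here is a classic re-mirroring storm: every volume asks for a new backup at once, the pool fills, and failed requests just retry, so the queue never drains. A minimal sketch of that feedback loop, with invented names and numbers purely for illustration (this is not Amazon's actual system or post-mortem logic):

```python
# Hypothetical sketch of a "re-mirroring storm" (all sizes invented):
# when a node looks down, every volume requests a fresh replica; once the
# storage pool fills, requests fail and are simply retried next round,
# so the pending queue stops shrinking.

CAPACITY_GB = 1000   # invented total pool capacity
VOLUME_GB = 10       # invented per-volume replica size
NUM_VOLUMES = 200    # far more volumes than the free space can re-mirror

def simulate_storm(initial_used_gb=400, max_rounds=5):
    used = initial_used_gb                # pool already partly full (assumption)
    pending = list(range(NUM_VOLUMES))    # every volume wants a new replica
    for _ in range(max_rounds):
        still_pending = []
        for vol in pending:
            if used + VOLUME_GB <= CAPACITY_GB:
                used += VOLUME_GB         # replica created, space consumed
            else:
                still_pending.append(vol) # no space left: retry next round
        pending = still_pending
        if not pending:
            break                         # storm drained (never happens here)
    return used, len(pending)

used, stuck = simulate_storm()
print(f"pool used: {used} GB, volumes still retrying: {stuck}")
```

With these made-up numbers the pool hits capacity after 60 replicas and the remaining 140 volumes retry forever; the only way out is operator intervention, which matches the "bring the whole storage subsection down" decision in the account above.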
I asked the guys when I first started there if I'd ever see a severity level 1, and they said, "You will NEVER see one in your career with Amazon." Well... what do you know, haha. It was pretty interesting, though. I hope we get it resolved ASAP. This most likely won't be fully fixed for the next few days, although the websites will be going back up.