In a previous life, I worked at a .edu and lived, roughly, across the street from the building that housed IT. I don’t recall what it was that caused me to go to work that Monday evening, the third day of a three-day (holiday) weekend, but as anyone who works in IT would likely agree, it’s not unusual to be working at the office or data center at weird hours.
As I unlocked the door to the server room and pushed it open, I was met by a huge blast of HOT air — not just warm; HOT.
Unbeknownst to anyone in IT or our Facilities department, the dedicated cooling system for the data center had failed. It was several years ago now but, if memory serves, it was wintertime and the air conditioning unit on the roof had frozen up.
Exactly when it failed wasn’t known but, if the 109-degree (Fahrenheit) temperature was any indication, it had almost certainly been much earlier in the weekend. It certainly didn’t happen ten minutes before I arrived.
A phone call and perhaps 30 minutes later, our Executive Director of Facilities walked into the server room, where I was — now shirtless and sweating — shutting down as many non-critical systems as I could.
I don’t recall many of the details (if I ever even knew them) regarding how they fixed it but that’s somewhat irrelevant to the story anyway.
Fortunately for us, no equipment suffered any damage (as far as we could tell).
As you might imagine, my request to purchase environmental monitoring equipment was quickly approved and, in short order, we had a device from APC installed and configured.
Note: I highly recommend the APC NetBotz line of products.
The APC environmental monitor supported SNMP and we used our (Nagios-based) monitoring system to query the temperature (and graph it) every five minutes. The device itself also had the ability to generate e-mail alerts when thresholds were exceeded. We configured those to be sent to mailing lists of which every employee in the IT and Facilities departments was a member. In addition, I whipped up a little script that pulled the temperature via SNMP and also generated alerts when the thresholds we defined were exceeded — you know, just in case. We certainly didn’t want to have a repeat of this incident.
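For the curious, a "just in case" check along those lines might look something like the sketch below. To be clear, this is not the script I actually wrote; it is a minimal illustration that shells out to Net-SNMP's `snmpget` and reports using the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The host name, community string, OID, and thresholds are all placeholders I made up for the example, and it assumes the device returns the temperature as a plain number in degrees Fahrenheit, which you would need to confirm against your device's MIB.

```python
import subprocess

# Hypothetical values: substitute your device's address, community, and OID.
HOST = "envmon.example.edu"
COMMUNITY = "public"
TEMP_OID = "1.3.6.1.4.1.99999.1.1.0"  # placeholder, NOT a real APC OID
WARN_F = 80.0  # illustrative thresholds only
CRIT_F = 90.0


def classify(temp_f, warn=WARN_F, crit=CRIT_F):
    """Map a temperature to a Nagios-style (exit code, label) pair."""
    if temp_f >= crit:
        return 2, "CRITICAL"
    if temp_f >= warn:
        return 1, "WARNING"
    return 0, "OK"


def read_temperature(host=HOST, community=COMMUNITY, oid=TEMP_OID):
    """Fetch the value with Net-SNMP's snmpget; -Oqv prints only the value."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True, timeout=10,
    )
    # Assumes the device reports a bare number in degrees Fahrenheit.
    return float(result.stdout.strip())


def main():
    """Print a one-line status and return the matching plugin exit code."""
    try:
        temp = read_temperature()
    except (subprocess.SubprocessError, OSError, ValueError) as exc:
        # Covers timeouts, a non-zero snmpget exit, a missing binary,
        # and output that is not a number.
        print(f"UNKNOWN - could not read temperature: {exc}")
        return 3
    code, label = classify(temp)
    print(f"{label} - server room temperature is {temp:.1f}F")
    return code
```

To use it as a monitoring plugin, you would finish the file with `sys.exit(main())` and let the monitoring system (or cron plus a mail command) act on the exit code. Keeping the threshold logic in `classify` separate from the SNMP call means the thresholds can be sanity-checked without a device on the network.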
History Repeats Itself
Fast forward several years to this past Tuesday night. I’m working for a different company in a different city, but the story is similar.
I’m at home and searching for my phone to send a text message to a friend. When I find it, I notice that I had received an e-mail alert from our monitoring system about an hour before. One of the two VMware ESXi servers in our data center had stopped responding to ICMP echo requests.
Note: I use the term data center loosely. The building I’m referring to in this article is far from a “real” data center — “equipment hut” would be more accurate — but it (roughly) meets Wikipedia’s definition of “data center” so I’m rolling with it.
I get dressed, find my keys, jump in my vehicle, and head to the site.
As I unlock the door and open it, I am once again greeted with a blast of hot air — the cooling system, as you might have guessed by now, wasn’t.
I’m not sure what the actual temperature was as the air conditioning system itself was off and the remote control unit has only a two-digit read-out on it. It was certainly well above the indicated 99 degrees Fahrenheit.
The cooling system was powered off.
I decided that perhaps I should turn it on.
Luckily for us, it was somewhat cold outside. I left the external door standing wide open and the combination of the cold outside air and an air conditioning system that was now running balls to the wall managed to cool it down relatively quickly.
Once I was satisfied that everything was okay, I powered the VMware ESXi server back up, waited for the “all clear” alerts to arrive, and headed home.
Arriving home, I sent out an e-mail to my team to let them know what had happened. Early the next morning, I received a reply from one of my co-workers. It read, in part:
“I’ll take the fall for this one. When we [installed some equipment] sometime last week it was like 60 degrees in there. So I shut off the AC and left the door open while we we’re doing stuff. I totally forgot to turn it back on.”
Now, I’m certainly not trying to call him out because we all make mistakes and screw up at times. I can understand how he forgot to turn the cooling system back on before he left.
As he wrote in his e-mail, “I guess we wouldn’t be human if we didn’t make a few mistakes.”
He’s absolutely right, of course. Shit happens. I’d be willing to bet, however, that he’ll never make that same mistake again. It was a learning experience and, fortunately, nothing seemed to be damaged. No harm, no foul.
Now, if you’ll excuse me, I need to go install some environmental monitoring equipment.