
Y U NO MONITOR DATACENTER TEMPERATURE!?

by Jeremy L. Gaddis on October 18, 2012 · 8 comments

in Humor


In a previous life, I worked at a .edu and lived, roughly, across the street from the building that housed IT. I don’t recall what it was that caused me to go to work that Monday evening, the third day of a three-day (holiday) weekend, but as anyone who works in IT would likely agree, it’s not unusual to be working at the office or data center at weird hours.

As I unlocked the door to the server room and pushed it open, I was met by a huge blast of HOT air — not just warm; HOT.

Unbeknownst to anyone in IT or our Facilities department, the dedicated cooling system for the data center had failed. It was several years ago now but, if memory serves, it was wintertime and the air conditioning unit on the roof had frozen up.

Exactly when it failed wasn’t known but, if the 109 degree (Fahrenheit) temperature was any indication, it had almost certainly been much earlier in the weekend. It certainly didn’t happen ten minutes before I arrived.

A phone call and perhaps 30 minutes later, our Executive Director of Facilities walked into the server room, where I was — now shirtless and sweating — shutting down as many non-critical systems as I could.

I don’t recall many of the details (if I ever even knew them) regarding how they fixed it, but that’s somewhat irrelevant to the story anyway.

Fortunately for us, no equipment suffered any damage (as far as we could tell).

As you might imagine, my request to purchase environmental monitoring equipment was quickly approved and, in short order, we had a device from APC installed and configured.

Note: I highly recommend the APC NetBotz line of products.

The APC environmental monitor supported SNMP and we used our (Nagios-based) monitoring system to query the temperature (and graph it) every five minutes. The device itself also had the ability to generate e-mail alerts when thresholds were exceeded. We configured those to be sent to mailing lists of which every employee in the IT and Facilities departments was a member. In addition, I whipped up a little script that pulled the temperature via SNMP and also generated alerts when the thresholds we defined were exceeded — you know, just in case. We certainly didn’t want to have a repeat of this incident.
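For the curious, here is a rough sketch of what a "little script" like that might look like. It is not the original script. It assumes the Net-SNMP `snmpget` command-line tool is installed, that the device returns the temperature as a plain number in degrees Fahrenheit, and that you look up the correct OID in your device's MIB yourself (none is hard-coded here). The function names and the 85/95 degree thresholds are made up for illustration.

```python
import subprocess


def classify(temp_f, warn_f, crit_f):
    """Map a Fahrenheit reading to 'OK', 'WARNING', or 'CRITICAL'."""
    if temp_f >= crit_f:
        return "CRITICAL"
    if temp_f >= warn_f:
        return "WARNING"
    return "OK"


def read_temperature(host, community, oid):
    """Fetch a single value with Net-SNMP's snmpget.

    -Oqv prints only the value. Assumption: the device reports a plain
    number in degrees Fahrenheit; some devices report Celsius or tenths
    of a degree, in which case convert the value here.
    """
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return float(result.stdout.strip())


def check(host, community, oid, warn_f=85.0, crit_f=95.0,
          reader=read_temperature):
    """Return (status, message). Thresholds are illustrative defaults.

    `reader` is injectable so the logic can be exercised without a
    real device on the network.
    """
    temp_f = reader(host, community, oid)
    status = classify(temp_f, warn_f, crit_f)
    return status, f"{status}: server room is {temp_f:.1f} F"
```

Run something like this from cron every five minutes and e-mail the message whenever the status is anything other than OK. If you would rather wire it into Nagios as a plugin, exit with 0, 1, or 2 for OK, WARNING, and CRITICAL respectively.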

History Repeats Itself

Fast forward several years to this past Tuesday night. I’m working for a different company in a different city, but the story is similar.

I’m at home and searching for my phone to send a text message to a friend. When I find it, I notice that I had received an e-mail alert from our monitoring system about an hour before. One of the two VMware ESXi servers in our data center had stopped responding to ICMP echo requests.

Note: I use the term data center loosely. The building I’m referring to in this article is far from a “real” data center — “equipment hut” would be more accurate — but it (roughly) meets Wikipedia’s definition of “data center” so I’m rolling with it.

I get dressed, find my keys, jump in my vehicle, and head to the site.

As I unlock the door and open it, I am once again greeted with a blast of hot air — the cooling system, as you might have guessed by now, wasn’t.

I’m not sure what the actual temperature was as the air conditioning system itself was off and the remote control unit has only a two-digit read-out on it. It was certainly well above the indicated 99 degrees Fahrenheit.

Image of remote control unit showing the 99 degree (Fahrenheit) temperature in the datacenter

The cooling system was powered off.

I decided that perhaps I should turn it on.

Luckily for us, it was somewhat cold outside. I left the external door standing wide open and the combination of the cold outside air and an air conditioning system that was now running balls to the wall managed to cool it down relatively quickly.

Once I was satisfied that everything was okay, I powered the VMware ESXi server back up, waited for the “all clear” alerts to arrive, and headed home.

Arriving home, I sent out an e-mail to my team to let them know what had happened. Early the next morning, I received a reply from one of my co-workers. It read, in part:

“I’ll take the fall for this one. When we [installed some equipment] sometime last week it was like 60 degrees in there. So I shut off the AC and left the door open while we we’re doing stuff. I totally forgot to turn it back on.”

Now, I’m certainly not trying to call him out because we all make mistakes and screw up at times. I can understand how he forgot to turn the cooling system back on before he left.

As he wrote in his e-mail, “I guess we wouldn’t be human if we didn’t make a few mistakes.”

He’s absolutely right, of course. Shit happens. I’d be willing to bet, however, that he’ll never make that same mistake again. It was a learning experience and, fortunately, nothing seemed to be damaged. No harm, no foul.

Now, if you’ll excuse me, I need to go install some environmental monitoring equipment.

7 comments

Mitch October 18, 2012 at 11:50 am

That is one of the things I would like to do here at our site. We don’t really even have a “health” monitoring system yet. I have a feeling it’s one of those things that… When you need it, it won’t be there, then it will be the first thing done. But until then, it keeps getting pushed back another day and another day… :( Since it’s happened to you twice, maybe I’ll get lucky and you will keep having all the bad luck and I’ll get by unscathed. :)


Jelmer October 18, 2012 at 4:13 pm

As a poor man’s temperature monitoring system, couldn’t you just poll the chassis temperature of a few/all of your systems through SNMP?


ijdod October 19, 2012 at 4:14 am

If the temperature was indeed at 60F with the system running, it may actually be set too low. Aside from making the cold aisle more enjoyable to work in, upping the temperature a bit will save some cash on the power bill.


Michael McNamara October 22, 2012 at 11:38 am

That’s a classic story, Jeremy, and it’s been repeated more times than people probably care to admit. I can’t tell you how many cooling and power failures we’ve had in our facilities. It always starts out as a network problem until we get onsite and realize either there’s a complete power failure or the room is running at 120F. In many cases the management and monitoring of the data center becomes a network problem and the facilities manager is nowhere to be found.

I haven’t found many people willing to be that honest unless the bloody corpse is lying at their feet. I just recently installed an APC NetBotz 200 to monitor one of our computer rooms and it’s been working very nicely… the higher-end APC NetBotz models also provide video surveillance (motion detection), which can also be very handy.

Great article!


JimmyBeBad December 6, 2012 at 4:41 pm

How about a cheap $20 webcam and a Walmart thermometer! Now that’s Monitoring!


Jim I February 12, 2014 at 12:31 pm

After suffering the same several times in my career, I now have my company on board. We use the USB edition of Temperature@lert. It is a USB thermometer that will plug into any Windows host, PC, server, or otherwise. As long as it has access to an SMTP server, it can send out alerts based on my personal thresholds of comfort. I can now respond to a rising temperature well before it reaches critical levels.

Excellent write up!


Brenden March 3, 2014 at 12:54 am

Had something similar happen on deployment, not to me but to another IT shop. It was the middle of the summer in Afghanistan, so it was pretty hot outside, let alone inside a trailer. The servers started dropping pretty quickly, one right after another. This was before we started using VM servers, so there were quite a few of them. They ended up getting to the trailer and opened the door. Outside was hot, but inside it was unbearable, and a few of the servers had actually caught fire. Apparently, when all was said and done, the 2 AC units had both frozen up. They don’t know what the temp was inside, but it was bad. Out of something like 60+ servers, I think 12 survived. Needless to say, they moved the server farm to the Helpdesk building and started running VMware because they had no extra servers left.
