Call Before You Dig

A lot of the things we do in IT aren't particularly important. If this or that company doesn't sell enough product and goes under, it sucks for the employees, but life goes on. Sometimes, however, we're faced with building systems that need to be truly resilient: aviation systems, for example, cannot go down for a reboot midflight. Government services also fall toward the "important" end of the scale. The local mayor's homepage might not be important, but services like Fire and Rescue or 911 are mission-critical.

The control room Kit was installing needed to be up 24/7/365, presumably only allowing a maintenance window every four years. The building was designed to be fireproof, terrorist-proof, electronic-eavesdropping-proof, you name it. This was going to be one of the most secure, resilient rooms in the entire city, and we're not talking about a small city, either.

Kit hooked the servers up to power. The power had been designed with two independent feeds from two separate substations, plus a huge UPS with twelve hours of capacity in the loft (to keep it safe from potential floods). The basement housed two diesel generators, and if all else failed, there was a huge socket on the garage wall where a shipping-container generator could be plugged in.

It was an excellent design—but you know what site you're on, so you can guess how it all worked out.

Kit was in the middle of commissioning and testing the systems they'd installed. Everything was looking good in the control room, and the customer was running some training exercises.

Then, it happened: the servers stopped responding.

The terminals remained on, but there was clearly nothing for them to connect to. This was around 1990, so it was still very much a mainframe setup. Kit's team headed to the equipment room, only to find the gut-wrenching sight of dead machines: no lights, no fans, nothing.

It has to be the power, Kit thought. The system was working five minutes ago, and they're redundant servers. They wouldn't all just break down.

He was sweating, but tried not to let his team see. "All right, let's check the UPS," he declared, trying to sound casual.

"This way," replied one of the techs, leading him to the stairwell ... and down the stairs.

"Isn't the UPS in the loft?" Kit asked, frowning.

"No, sir," the tech replied with a grin. "Turns out the floor up there isn't rated for the weight of the lead acid batteries."

The best laid plans of mice and men ... Kit thought, then shook his head.

Twenty minutes later, the UPS checked out fine. It wasn't flood-proofed anymore, but there wasn't any water, so it ought to have been working. The diesel generators had kicked in, which was why the overhead lights were still on. There had to be some kind of wiring mistake for the servers.

Kit traced the wires, mentally correcting the specification to account for the relocated UPS. That led him back to the equipment room without any obvious sign of fault other than "equipment not working." After pulling open a wall panel, he was able to figure out the mistake pretty quickly: the servers were powered by the UPS, but the switch was hooked to the raw mains, and everything was designed to shut off if the switch went down.

Kit rubbed his forehead, sent a tech to check all the outlets, and kept looking for any other bonehead moves.

The control room's power didn't route through the equipment room, but when Kit ran a check, half the gear in the control room didn't seem to work, either. It had power, but communications were down.

This was all fine before the power went, he reminded himself. Now where's that intercom switch?

Then he remembered: the training room. You see, due to the massive amounts of equipment needed to run the control room, there wasn't any space for the communication switches. The nearby training room, however, had much less equipment in it, so they'd moved the switches there.

Sure enough, as Kit poked his head into the training room, he found the whole place dark. Who'd want to train during an emergency? Nobody, that's who. So why bother with redundant power? Save the juice for the important rooms—which now couldn't function because they were missing key components.

Only one question remained: why did the power go out in the first place? It wasn't a scheduled disaster drill. There were two redundant power lines coming in, so it would've taken something massive to knock them both out. Was one of them disconnected? No; Kit had been there when the electrician went over the wiring, and had seen him sign off on it. Concerned, he wandered out back ... and immediately facepalmed.

Both cables came into the building at the same point, so they could both be fed into the same grid. That point was currently occupied by a small backhoe and some frazzled-looking contractors.

Mystery solved.
