What could possibly be more fun than standing outside in the 100+ degree heat here in southeastern PA? Standing outside in 90+ degree heat while waiting in line at Disney World. At least there’s the promise of something fun, and possibly something cool and wet at the end of the wait.
While recently wading through the sea of humanity and waiting in some of the infernal lines that define Disney at this time of year, I was struck by an interesting IT analogy early in the week (yes, I really did need a vacation, and by the end of the week I wasn’t thinking about IT at all).
In late June and early July, the number of baby-strollers per square foot in the Magic Kingdom increases to approximately 10x the normal rate. This forces one to put up with, err, gives one many opportunities to observe other people’s children under extreme conditions. It’s amazing to watch parents expect their 5-year-olds to behave like perfect angels in subtropical queue lines for upwards of 45 minutes, or sprint from one end of a park to another on tiny, tired little legs to score a Toy Story Fast-Pass before they’re all gone. It’s amazing because you can tell these kids are perfect hellions under ideal conditions as well. Putting them under stress only intensifies the problems that already exist.
Likewise, if you’ve got poorly designed or neglected infrastructure, simply moving it to a colo facility isn’t going to improve up-time or performance significantly, if at all. Certainly you can improve environmentals, save capex, and get lower network latency with a colo move, but if application response time and reliability are sucking wind before the move because of bad design or sysadmin neglect, not much is going to change.
My point isn’t that you should avoid putting your infrastructure in a better home if you need to, but that you shouldn’t expect it to behave any differently just because you moved it. Moreover, move time is not the time to make drastic changes to your production systems. It’s not a “free” outage window. The more changes you make during a move, the higher the risk of a failed, or at minimum a very stressful, move.
On the other hand, a move can be an ideal time to upgrade to better hardware and legitimately raise your expectations. For example, you can set up new hardware next to your old, cluster it, and then move the new half of the cluster to a better home while the old half continues to run the business. After you complete the move and let the clusters resynchronize, you can turn down the old cluster and all activity will automatically switch over to the new hardware. Your users will never feel a thing. Very little pain, but very much gain.
Of course that all sounds good, and there are a lot of details involved in making it happen, but that’s what we do best. If you’re interested in smoothly moving your critical IT gear to a new home and need some experienced help to get it done, give us a call. Hardware prone to temper tantrums is one of our specialties.
Don’t ever press the wed, err red one. While I laughed hysterically at this cartoon as a kid, I never thought it would become my reality one day. Yes, I have pressed the red one, but I hope to never have to again.
The “red one” is none other than the Emergency Power Off button, and here on the east coast it’s pretty hard to build a data center without one. What?! You don’t have one? Shhhh…I won’t tell. Your secret is safe with me. Here’s what a real EPO red button looks like in case you’ve never seen one.
Notice the label. I firmly believe it should also say “UPDATE YOUR RESUME BEFORE PRESSING” as pressing this is in most cases a resume-generating, if not career-ending, event. Why? When activated, this button’s job is to do one thing, and one thing only: cut the power to your data center. All of it. Let that sink in for a moment. Think through what that would mean in your shop. No power. No sound. Just deafening silence, unless of course you pressed it by accident and the silence gives way to the sound of clanging pitchforks and the smell of torches being lit over in the end-user community.
I am obviously a bit biased about this topic. I don’t think these systems are necessary, but you should do some research and draw your own conclusions. I am 100% all for safety, but from the historical evidence I’ve seen, the risk that EPO is designed to mitigate is lower than what you’re exposed to driving to work every day. APC’s white paper #22 pretty much nails it:
EPO is a subsystem that is specifically designed to override all redundancy and fault tolerance built into the
network-critical physical infrastructure (NCPI), thereby putting the entire network at risk. EPO operation is
one of the largest causes of unplanned data center shutdown. The design of an EPO system must
therefore try to prevent any possibility of accidental operation, and it must minimize deliberate operation for
any reason other than a valid life-threatening emergency. [Emphasis mine]
Red buttons are no panacea, but we are nevertheless forced to install them, and then make them nigh unto impossible to press unless you Really Mean It. Note in the photo above that the button is both recessed and protected by a plastic cover. Without the plastic cover, the recessed nature of the button is the only thing preventing it from being bumped accidentally, and it also hopefully slows down a would-be pusher enough to stop and ask “Do I Really Mean It?” Speaking of the cover, note also the small gray loop of wire in the upper left corner of the housing – we opted to install covers with alarms. Lifting the cover results in a piercing electronic squeal capable of penetrating 2-hour fire-rated walls and forces one once again to stop and ask “Do I Really Mean It?” Cover alarms are designed to stop non-data center savvy electricians and others from innocently doing something disastrous, such as pressing the red button before installing a new circuit breaker. Yes, it happens. Well, the label does contain the word “off”, doesn’t it? Changing the label from “Electrical Power Off” to “Emergency Power Off” tends to alter the results little. The word “off” seems to be the Pavlovian trigger.
Disarming Considerations
As I write this, our EPO system is being expanded to accommodate the growth of our operations. If you are building a new data center with EPO, make sure the designer includes a way to disable the system during maintenance and expansion activities. This seems like an obvious feature to include, but don’t take it for granted. This is also a handy feature to have if your operations are prone to having “civilians” in the data center, i.e. those who are unfamiliar with the various buttons and switches on the walls. It is very reassuring to be able to disarm the red buttons while such folks are meandering about the room. Even when escorted, such folks have been known to find ways to activate the EPO system, either accidentally by bumping a non-recessed red button, or deliberately pushing it out of curiosity when no one is watching.
A Marriage Made In Hell
Once you have an EPO system in place, you will have to learn to live with it. It is a risk that must be managed like all the others. If you’re building a new data center, you at least have the opportunity to design and build it properly, and then test it without jeopardizing your operations. Retrofitting an existing data center with EPO or expanding an existing system is a different matter entirely. You will want to engage an engineering firm and electricians that are very experienced with EPO systems, as most electricians are not familiar with the complexities involved with wiring EPO into a live data center environment. There is no second chance to get it right.
About a month after opening a new facility in March 2003, Roberts, the director of data center services for Novi, Mich.-based Trinity Health, got a call. It was Easter morning, and a contractor had accidentally activated the EPO switch as he tried to replace a module connecting the button to the fire alarm system. According to Roberts, the fiasco “took the data center out.”
“We went out at 8:30 that morning,” he said. “By 11:30 that night, we were probably 95% up and going, so we were pretty lucky. But from that day forward, I tried to lessen the effect of this EPO.”
Lessen the effect indeed. This is not the kind of resurrection we want to be talking about on Easter Sunday.
Stress Relief Department
After all of this talk about outages, and with my own data center’s EPO being modified as we speak, it’s time for some needed stress relief:
Happy Easter!
//spk
P.S. I did press the red button, several times actually, but it wasn’t in a live situation. It was during the initial testing of our system. The lead engineer said “May as well press it now if you want, because you never will again.” Hopefully he was a genuine prophet.
On June 29th a cloud burst occurred at Rackspace, proving that even the mighty eventually do fall. The blow-by-blow Rackspace Twitter account of their power outage provides interesting insight into what happens during a crisis at a hosting provider.
In every industry there are dirty little secrets that customers either don’t know about, or don’t want to know about. The meat counter at the grocery store is a prime example. Those steaks and chops look really good, but did you ever watch the entire process from hoof to hamburger? It’s not pretty, and for most folks it’s Too Much Information.
So here’s Dirty Little Secret #1 of the hosting industry: While most every hosting company has to make the claim in order to be credible, no one can deliver 100% data center up time forever. No one. Not even the market leader. So why then make the claim at all? Because that’s what customers demand to hear. In talking with customers we find a widespread cross-industry sentiment, usually absent any logical rationale, that says “my business is so important that my infrastructure has to be running 24/7 without any interruptions at all.” Unless your business is keeping patients alive with sophisticated medical equipment, this seems like a rather difficult position to defend. But no one wants to be the bad guy to point that out. We know there is life beyond brief outages because they happen every day and yet nobody goes broke, but it is typically unwise to say so.
Realizing that downtime will occur, even in the elite shops of the world like Rackspace with their fleet of nine data centers, you do need to make realistic decisions about what level of up time you really need in light of the type of business you’re in. And while it may sound like heresy, you also want to make decisions about things that are much more important than up time levels. It seems to me that if downtime is inevitable, and we know that it is, then I want my equipment in the hands of people who know how to recover quickly from an outage, who will communicate with me regularly and truthfully throughout the crisis, and who will do their level best to get me back on line as quickly as possible. I want my equipment in the hands of highly competent people that I can trust. You can’t make that determination when you sign up for service via a web browser or do the whole transaction over the phone. The only way to make the determination is to actually meet the people who are going to become the custodians of your infrastructure.
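If it helps to put numbers on "what level of up time you really need," here is a minimal Python sketch that converts an availability percentage into the hours of downtime it permits per year. The function name and the sample levels (99%, 99.9%, 99.99%, 100%) are illustrative assumptions of mine, not figures from any provider's SLA, and the year is simplified to 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year, leap years ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year that an availability level permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Hypothetical levels chosen only for illustration.
    for level in (99.0, 99.9, 99.99, 100.0):
        hours = downtime_hours_per_year(level)
        print(f"{level}% up time allows {hours:.2f} hours of downtime per year")
```

Seen this way, the question to ask of your own business is how many hours a year it can actually tolerate, rather than whether the brochure says 100%.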
Before you put your equipment in the hands of someone else, make the effort to visit them. If they don’t allow visits, that should be a big Red Flag #1. Talk to their operations and support people, particularly the folks who will be touching your equipment. If you’re not allowed to talk to them, that should be Red Flag #2. Ask them about their up time guarantee. If they look at you square in the eye and say 100%, that should be Red Flag #3. Kick the dust out of your shoes and move on.
Let me cordially invite you to visit our data center hosting facility this summer. No red flags – just trustworthy, highly competent, and dependable people.