Posts Tagged ‘disaster planning’



The Red Button

Thursday, April 1st, 2010

Don’t ever press the wed, err red one. While I laughed hysterically at this cartoon as a kid, I never thought it would become my reality one day. Yes, I have pressed the red one, but I hope to never have to again.

The “red one” is none other than the Emergency Power Off button, and here on the east coast it’s pretty hard to build a data center without one. What?! You don’t have one? Shhhh…I won’t tell. Your secret is safe with me. Here’s what a real EPO red button looks like in case you’ve never seen one.

Notice the label. I firmly believe it should also say “UPDATE YOUR RESUME BEFORE PRESSING” as pressing this is in most cases a resume-generating, if not career-ending, event. Why? When activated, this button’s job is to do one thing, and one thing only: cut the power to your data center. All of it. Let that sink in for a moment. Think through what that would mean in your shop. No power. No sound. Just deafening silence. That is, of course, unless you pressed it by accident, in which case the silence gives way to the sound of clanging pitchforks and the smell of torches being lit over in the end-user community.

I am obviously a bit biased about this topic. I don’t think these systems are necessary, but you should do some research and draw your own conclusions. I am 100% all for safety, but from the historical evidence I’ve seen, the risk that EPO is designed to mitigate is lower than what you’re exposed to driving to work every day.  APC’s white paper #22 pretty much nails it:

EPO is a subsystem that is specifically designed to override all redundancy and fault tolerance built into the
network-critical physical infrastructure (NCPI), thereby putting the entire network at risk. EPO operation is
one of the largest causes of unplanned data center shutdown. The design of an EPO system must
therefore try to prevent any possibility of accidental operation, and it must minimize deliberate operation for
any reason other than a valid life-threatening emergency.
[Emphasis mine]

Red buttons are no panacea, but we are nevertheless forced to install them, and then make them nigh unto impossible to press unless you Really Mean It. Note in the photo above that the button is both recessed and protected by a plastic cover. Without the plastic cover, the recessed nature of the button is the only thing preventing it from being bumped accidentally, and hopefully it also slows down a would-be pusher enough to stop and ask “Do I Really Mean It?” Speaking of the cover, note also the small gray loop of wire in the upper left corner of the housing: we opted to install covers with alarms. Lifting the cover results in a piercing electronic squeal capable of penetrating 2-hour fire-rated walls and forces one once again to stop and ask “Do I Really Mean It?” Cover alarms are designed to stop non-data center savvy electricians and others from innocently doing something disastrous, such as pressing the red button before installing a new circuit breaker. Yes, it happens. Well, the label does contain the word “off”, doesn’t it? Changing the label from “Electrical Power Off” to “Emergency Power Off” tends to alter the results little. The word “off” seems to be the Pavlovian trigger.

Disarming Considerations

As I write this, our EPO system is being expanded to accommodate the growth of our operations. If you are building a new data center with EPO, make sure the designer includes a way to disable the system during maintenance and expansion activities. This seems like an obvious feature to include, but don’t take it for granted. This is also a handy feature to have if your operations are prone to having “civilians” in the data center, i.e. those who are unfamiliar with the various buttons and switches on the walls. It is very reassuring to be able to disarm the red buttons while such folks are meandering about the room. Even when escorted, such folks have been known to find ways to activate the EPO system, either accidentally by bumping a non-recessed red button, or deliberately pushing it out of curiosity when no one is watching.

A Marriage Made In Hell

Once you have an EPO system in place, you will have to learn to live with it.  It is a risk that must be managed like all the others. If you’re building a new data center, you at least have the opportunity to design and build it properly, and then test it without jeopardizing your operations. Retrofitting an existing data center with EPO or expanding an existing system is a different matter entirely. You will want to engage an engineering firm and electricians that are very experienced with EPO systems, as most electricians are not familiar with the complexities involved with wiring EPO into a live data center environment. There is no second chance to get it right.

Here is a scary story that makes my point. Cutting to the chase, the article states:

About a month after opening a new facility in March 2003, Roberts, the director of data center services for Novi, Mich.-based Trinity Health, got a call. It was Easter morning, and a contractor had accidentally activated the EPO switch as he tried to replace a module connecting the button to the fire alarm system. According to Roberts, the fiasco “took the data center out.”

“We went out at 8:30 that morning,” he said. “By 11:30 that night, we were probably 95% up and going, so we were pretty lucky. But from that day forward, I tried to lessen the effect of this EPO.”

Lessen the effect indeed. This is not the kind of resurrection we want to be talking about on Easter Sunday.

Stress Relief Department

After all of this talk about outages, and with my own data center’s EPO being modified as we speak, it’s time for some needed stress relief:

Happy Easter!

//spk

P.S. I did press the red button, several times actually, but it wasn’t in a live situation.  It was during the initial testing of our system.  The lead engineer said “May as well press it now if you want, because you never will again.”  Hopefully he was a genuine prophet.


Personnel DR

Friday, July 10th, 2009

Are you prepared for the departure of a key technical resource in your operation? Someone who holds the infamous “keys to the kingdom?” Typically there is at least one person on a company’s IT staff who achieves deity status with regard to physical and logical access. Sometimes key skills also reside in just that one person. If such a person leaves, either voluntarily or involuntarily, how would your critical operations fare?


Now would be a good time to take a fresh look at both your internal documentation and your skills matrix. Things to consider:

  1. Are all sysadmin userids and passwords documented somewhere, somehow?
  2. Are all critical architectures documented in excruciating detail?  (SAN, virtualization, LAN/WAN, disk replication, backup/restore systems). You want to see how things are connected and how they are intended to interact. You want to see things like IP addresses, subnet addressing schemes, WWN numbers, hard and soft zoning information and the like. You’ll know you have all of the information you need when you can hand it to a new engineer and he doesn’t have any questions. Seem impossible? Strive for it, and the result will be good enough.
  3. Where does the above documentation live if you do already have it?  Hopefully it’s not on your staff’s laptops.  If you think you already have it in a shared on-line space, are you sure you have all of it?  And is it being backed up?
  4. Do you have runbooks for all of your servers?  Are they current?  Where are they? Are they backed up?
  5. How many people have practical working knowledge in each area of your critical infrastructure?  Do you have more than one VMware tech?  More than one SAN person?  How about Active Directory or Exchange? Ideally you’ll want three in each area. Contract for it if you need to.
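
The staffing question in item 5 lends itself to a quick check. Here is a minimal Python sketch, assuming you keep a simple skills matrix of who has working knowledge of what. The `skills` data, the names in it, and both helper functions are hypothetical, made up purely for illustration; the threshold of three comes from the guideline above.

```python
# Hypothetical skills matrix: area -> set of people with working knowledge.
# Every name and area below is invented for illustration.
skills = {
    "VMware": {"alice", "bob"},
    "SAN": {"alice"},
    "Active Directory": {"bob", "carol", "dave"},
    "Exchange": {"carol", "dave", "erin"},
}


def coverage_gaps(matrix, minimum=3):
    """Return {area: shortfall} for areas with fewer than `minimum` people."""
    return {
        area: minimum - len(people)
        for area, people in matrix.items()
        if len(people) < minimum
    }


def single_points_of_failure(matrix):
    """Return the areas where exactly one person holds the knowledge."""
    return sorted(area for area, people in matrix.items() if len(people) == 1)


if __name__ == "__main__":
    for area, shortfall in sorted(coverage_gaps(skills).items()):
        print(f"{area}: need {shortfall} more")
    print("Single points of failure:", single_points_of_failure(skills))
```

Any area that shows up as a single point of failure is where the departure of one person hurts the most, so that is where to start cross-training or contracting.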

 
I could go on, but I think you’re getting my point. This process is somewhat like writing a will. It’s a real drag to write up, and everybody knows that they need to take care of it, yet it often gets ignored until it’s too late. And just like a will, all of this documentation needs to be updated on a regular basis or it may end up being worthless at crisis time.

Alternatively, you could move the responsibility for a large portion of this to a professional hosting facility.   Why not limit your exposure to just your applications and let us worry about how all the  plumbing is hooked up?

//spk


Worse Than Failure

Friday, June 12th, 2009

Whether you have your own shop or host your gear somewhere else, this week’s horror story at VAserv should serve as a wake-up call if you’re responsible for safeguarding vital company data, especially your customers’ data.

To briefly sum up the story, hackers took out 100,000 (yes, 100,000) web sites, many of them permanently, in an evening’s worth of work. Just restore the backup, you say? Not so fast.

VAserv basically offers low-cost Web hosting services using virtualized private servers based on HyperVM. As of Wednesday morning, it was not clear how many of its customers — many of them based in the U.S — had irretrievably lost data in the attack. That number could be high, though, because half of those affected had apparently signed up for an unmanaged service that doesn’t include backups, according to the Register. [emphasis added]

 
And for those customers that did sign up for backup?

A note on VAserv’s Web site, which is now just a text document with details on the company’s restoration efforts, claimed its staff had been working “tirelessly” over the last 48 hours. “However, we have now reached the end of all of our servers, and as such, if your server is not currently up, or not partly up, then it is unfortunate that you will have lost your data due to this third-party attack,” the note said.

Oh the humanity, indeed. ComputerWorld’s Jaikumar Vijayan receives this week’s Master of the Understatement award:

The continuing fallout from a hacking incident at U.K.-based Web hosting company VAserv should serve as a powerful reminder that companies need proper data backup and disaster recovery procedures. The incident, which could result in a fire sale of VAserv to another hosting provider, is also an especially stark example of the kind of havoc that a malicious attacker can wreak on businesses.  [more emphasis added]

 
Can you say ‘class action lawsuit?’

Attempts to reach Rus Foster, VAserv’s director, via e-mail and phone were unsuccessful. But the terse updates on the company’s Web site and the thousands of customer posts on a discussion forum painted a picture of total chaos.

“I’ve personally reached the end of my physical and emotional tether,” Foster wrote in one post on the discussion forum late Tuesday evening. “We have worked pretty much continuously for the last few days firefighting.”

Foster wrote in a post that suggested he was putting the company up for sale. In his note, Foster said he had two options: Do what’s best for the customer base by getting “some big boys in behind” to help get things back up and running. The other, he said, was to simply “Run away and hide and just say to everyone ‘good bye.’”

 
Run away and hide?  When did that become a viable option for gross negligence?  No one can outrun the long arm of the Bar Association.


I’m reminded of a line from The Architect’s classic speech: “There are levels of survival we are prepared to accept.” There are clearly plenty of folks who seem comfortable managing their IT shops that way. We see it all the time when we look at their backup strategies and disaster plans, if they have any. It seems to me that being totally wiped out or having to sell our companies because of something as easily preventable as failed backups is not one of those acceptable levels. But wait, it was those scoundrels the hackers, wasn’t it? They caused the problem, and they killed the company. No, they didn’t. To be sure, the hackers wreaked havoc, but what they really did was expose the ultimate game-ending event: no backups. Had proper backup procedures been in place and restores regularly tested, the incident would have been merely one of downtime and possibly SLA penalties. (Yes, I know credit card data was also stolen, but that’s not necessarily a game-ender.)

Being an infrastructure company, we routinely preach about the need for proper backup and restore procedures, and the need to test them.  Sadly, it often falls on deaf ears, and while we do occasionally read an obituary like VAserv’s,  death-by-no-backups is happening all the time in companies you’ll never hear about.

There’s another quote I like from the Matrix: “You hear that Mr. Anderson? That is the sound of inevitability…it is the sound of your death…” If you aren’t testing your ability to restore your backups (you do have them, don’t you?), the sound of inevitability may be tolling for you.
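
What does “testing your ability to restore” look like in practice? One simple form is to restore to a scratch location and confirm that what came back matches what went in. Below is a minimal Python sketch under some loud assumptions: the source directory is quiesced or snapshotted so it has not changed since the backup ran, and the restore has already been performed by whatever backup product you use. The function names `file_digest` and `verify_restore` are hypothetical, not part of any backup tool.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Compare every file under source_dir with its restored copy.

    Returns a sorted list of (relative_path, problem) tuples, where
    problem is "missing" or "mismatch". An empty list means every
    source file came back byte-for-byte.
    """
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in source_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        if not dst.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif file_digest(src) != file_digest(dst):
            problems.append((rel.as_posix(), "mismatch"))
    return sorted(problems)
```

A check like this only proves that files come back intact. It says nothing about whether the application actually starts on the restored data, so treat it as the floor of a restore test, not the ceiling.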

Hope is not a strategy.  If backups and restores stress you out, or you’re just hoping they’ll be there when you need them, consider handing it all over to a group of people who live and breathe it. They actually enjoy backups, and they’ll take good care of your gear too.

//spk


