Archive for the ‘Disaster Recovery’ Category



A Pragmatic View of Downtime Cost

Friday, July 17th, 2009

On occasion, our prospective customers will mention they’ve done an extensive study to determine how much a minute of downtime costs their company. Ergo they are visiting with us to establish either a primary or secondary location as part of a strategy to lose exactly $0. Sometimes I wonder what Gary Coleman would say if he heard their explanations.

Well said, Gary. Terse, but adequate.

There is apparently some very deep magic involved in figuring out the cost of downtime, and no one seems to agree exactly on what the proper incantation should be. A little over a year ago on The Numbers Guy, there was a humorous post about just how ambiguous, and ultimately irrelevant, calculating this number can become. To wit:

One blog headlined a post, “Amazon’s $3.6 Million Outage?,” noting that if projected second-quarter revenue was spread evenly over time, then the site normally would be making $1.8 million per hour. TechSpot.com and the Seattle Post-Intelligencer performed similar calculations with last year’s revenue to estimate that Amazon lost $29,000 per minute; CNET used last quarter’s results to calculate $31,000 per minute. Then the New York Times, last week, reported that “Amazon, by some estimates, lost more than a million dollars an hour in sales.”

Does it really matter who was right?  It’s A Lot Of Money by anyone’s reckoning, that is, if you can believe the numbers in the first place.

What exactly is the value of doing an extensive study, or even a moderately detailed investigation, given the cost of the meeting hours one would burn doing the analysis vs. the quality of data one could actually expect to produce? Often the variables can become so complex that gut feel and opinion invariably creep into the equation just to get the math done. This in turn results in a baked-in degree of subjectivity that ends up being the source of debate when the numbers are used to justify a business case later on.

Maybe I’m missing something, but there has to be a better way. Usually the only reason we want to know the cost of downtime is to justify the costs necessary to keep the key parts of the infrastructure highly available. It then logically follows that we need to know which parts of our infrastructure really contribute to the top line such that they are truly worthy of being made highly available. With that in mind, let’s ask the questions:

Let’s say gross revenue was $20M last year, and we do business 5 days a week, or 260 days/year. On a simple even spread across 24-hour days, that’s $53/minute, or if you like $3,205/hour. Can you make a business decision based on that? No matter how you rig the math (e.g. weighting the end of the month more heavily), it boils down to crazy numbers that look like this. How can they possibly help you justify a monthly spend on an availability solution? Does it therefore matter precisely how accurate they are? The above calculation is admittedly simple and perhaps even lame, but I would argue that any other more exotic formula does no better.
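For anyone who wants to check the arithmetic, here is a minimal sketch of the even-spread calculation. The function name and its parameters are my own invention for illustration; the only inputs taken from the example above are the $20M revenue and the 260 business days, and the 24-hour day is the assumption that reproduces the $53 and $3,205 figures.

```python
def even_spread_cost(annual_revenue, business_days=260, hours_per_day=24):
    """Return (per_hour, per_minute) revenue on a flat, even spread.

    hours_per_day=24 is an assumption; it is the value that matches
    the $3,205/hour figure quoted in the post.
    """
    per_hour = annual_revenue / (business_days * hours_per_day)
    return per_hour, per_hour / 60


per_hour, per_minute = even_spread_cost(20_000_000)
print(f"${per_hour:,.0f}/hour, ${per_minute:,.0f}/minute")  # $3,205/hour, $53/minute

# Rigging the math a different way: count only an 8-hour business day.
short_hour, short_minute = even_spread_cost(20_000_000, hours_per_day=8)
print(f"${short_hour:,.0f}/hour, ${short_minute:,.0f}/minute")
```

Changing one assumption triples the answer, which is rather the point: the result says more about the assumptions than about the business.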

I think the more useful exercise is to take the time to really understand what your key business systems are (not the discrete elements like servers and routers), and then determine what underlying systems are required to make them go, along with all of the various inter-system dependencies. As obvious as that sounds, our experience has been that folks often do not have this kind of a handle on their infrastructure. Instead of saying to senior management that “our SQL server absolutely has to be up all the time because the business depends on it,” you should be able to say “our SQL server needs to be up because we can’t take orders if it’s down,” or “our SQL server needs to be highly available because we can’t load our trucks when it’s down.” You are much more likely to hear “make it so” on your DR proposal with this approach than if you go in with a story about how much downtime costs the company. Senior management is well aware of the revenue numbers – they don’t need to be reminded, and trying to foist a murky cost of downtime justification on them is an iffy, if not perilous strategy. What they want and need is plain talk on what happens if things break.
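One rough way to capture that kind of dependency knowledge is a simple map from business functions down to the systems they rely on. The sketch below is purely illustrative: every system name, the `DEPENDS_ON` table, and both helper functions are hypothetical, and a real shop's map would be far larger.

```python
# Hypothetical map: each entry lists what that item needs directly.
DEPENDS_ON = {
    "take orders": ["order web app"],
    "load trucks": ["warehouse app"],
    "order web app": ["sql server", "web server"],
    "warehouse app": ["sql server"],
    "sql server": ["san"],
    "web server": [],
    "san": [],
}
BUSINESS_FUNCTIONS = ["take orders", "load trucks"]


def requires(item, seen=None):
    """Return every system that `item` depends on, directly or indirectly."""
    seen = set() if seen is None else seen
    for dep in DEPENDS_ON.get(item, []):
        if dep not in seen:
            seen.add(dep)
            requires(dep, seen)
    return seen


def impact_of_outage(system):
    """List the business functions that stop when `system` is down."""
    return [f for f in BUSINESS_FUNCTIONS if system in requires(f)]


print(impact_of_outage("sql server"))  # ['take orders', 'load trucks']
print(impact_of_outage("web server"))  # ['take orders']
```

The output is the sentence you take to senior management: not "the SQL server is important," but "when it is down we cannot take orders or load trucks."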

Speaking in pragmatic terms the business understands will result in funding for a DR plan that makes sense for the business, though it may not be everything you’d personally like to have. So if you feel you really need synchronous replication rather than asynchronous, you’re going to have to explain in business terms why the extreme extra cost is necessary. Pitch the solution in plain terms. Lay out information meaningful to your leaders and trust them to make a good business decision.

But do influence them to go with a data center that has your operating budget in mind. We can often bring some of the seemingly out-of-reach high availability solutions within your reach.

//spk


Personnel DR

Friday, July 10th, 2009

Are you prepared for the departure of a key technical resource in your operation? Someone who holds the infamous “keys to the kingdom?” Typically there is at least one person on a company’s IT staff who achieves deity status with regard to physical and logical access. Sometimes key skills also reside in just that one person. If such a person leaves, either voluntarily or involuntarily, how would your critical operations fare?


Now would be a good time to take a fresh look at both your internal documentation and your skills matrix. Things to consider:

  1. Are all sysadmin userids and passwords documented somewhere, somehow?
  2. Are all critical architectures documented in excruciating detail?  (SAN, virtualization, LAN/WAN, disk replication, backup/restore systems). You want to see how things are connected and how they are intended to interact. You want to see things like IP addresses, subnet addressing schemes, WWN numbers, hard and soft zoning information and the like. You’ll know you have all of the information you need when you can hand it to a new engineer and he doesn’t have any questions. Seem impossible? Strive for it, and the result will be good enough.
  3. Where does the above documentation live if you do already have it?  Hopefully it’s not on your staff’s laptops.  If you think you already have it in a shared on-line space, are you sure you have all of it?  And is it being backed up?
  4. Do you have runbooks for all of your servers?  Are they current?  Where are they? Are they backed up?
  5. How many people have practical working knowledge in each area of your critical infrastructure?  Do you have more than one VMware tech?  More than one SAN person?  How about Active Directory or Exchange? Ideally you’ll want three in each area. Contract for it if you need to.
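The skills question in item 5 lends itself to a quick check. This is only a sketch under made-up data: the names, the `SKILLS` table, and the `coverage_gaps` helper are all hypothetical, and the threshold of three comes from the "ideally three in each area" rule of thumb above.

```python
# Hypothetical skills matrix: area -> people with practical working knowledge.
SKILLS = {
    "VMware": ["alice", "bob"],
    "SAN": ["alice"],
    "Active Directory": ["bob", "carol", "dave"],
    "Exchange": [],
}


def coverage_gaps(skills, minimum=3):
    """Return {area: number of additional people needed} for thin areas."""
    return {
        area: minimum - len(people)
        for area, people in skills.items()
        if len(people) < minimum
    }


for area, shortfall in coverage_gaps(SKILLS).items():
    print(f"{area}: need {shortfall} more")
```

Any area that shows up in the output is a candidate for cross-training or a contract.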

 
I could go on, but I think you’re getting my point. This process is somewhat like writing a will. It’s a real drag to write up, and everybody knows that they need to take care of it, yet it often gets ignored until it’s too late. And just like a will, all of this documentation needs to be updated on a regular basis or it may end up being worthless at crisis time.

Alternatively, you could move the responsibility for a large portion of this to a professional hosting facility.   Why not limit your exposure to just your applications and let us worry about how all the  plumbing is hooked up?

//spk


The Truth About Up Time

Friday, July 3rd, 2009

On June 29th a cloud burst occurred at Rackspace, proving that even the mighty eventually do fall. The blow-by-blow Rackspace Twitter account of their power outage provides interesting insight into what happens during a crisis at a hosting provider.


In every industry there are dirty little secrets that customers either don’t know about, or don’t want to know about. The meat counter at the grocery store is a prime example. Those steaks and chops look really good, but did you ever watch the entire process from hoof to hamburger? It’s not pretty, and for most folks it’s Too Much Information.

So here’s Dirty Little Secret #1 of the hosting industry: While most every hosting company has to make the claim in order to be credible, no one can deliver 100% data center up time forever. No one. Not even the market leader. So why then make the claim at all? Because that’s what customers demand to hear. In talking with customers we find a widespread cross-industry sentiment, usually absent any logical rationale, that says “my business is so important that my infrastructure has to be running 24/7 without any interruptions at all.” Unless your business is keeping patients alive with sophisticated medical equipment, this seems like a rather difficult position to defend. But no one wants to be the bad guy to point that out. We know there is life beyond brief outages because they happen every day and yet nobody goes broke, but it is typically unwise to say so.

Realizing that downtime will occur, even in the elite shops of the world like Rackspace with their fleet of nine data centers, you do need to make realistic decisions about what level of up time you really need in light of the type of business you’re in. And while it may sound like heresy, you also want to make decisions about things that are much more important than up time levels. It seems to me that if downtime is inevitable, and we know that it is, then I want my equipment in the hands of people who know how to recover quickly from an outage, who will communicate with me regularly and truthfully throughout the crisis, and who will do their level best to get me back on line as quickly as possible. I want my equipment in the hands of highly competent people that I can trust. You can’t make that determination when you sign up for service via a web browser or do the whole transaction over the phone. The only way to make the determination is to actually meet the people who are going to become the custodians of your infrastructure.
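To put "what level of up time you really need" in concrete terms, it can help to translate an up time percentage into the downtime it actually permits over a year. The sketch below is plain arithmetic; the function name and the sample percentages are my own choices for illustration, not figures from any provider's agreement.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(uptime_percent):
    """Minutes of downtime per year that a given up time percentage permits."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


# Sample levels chosen for illustration only.
for level in (99.0, 99.9, 99.99, 100.0):
    minutes = allowed_downtime_minutes(level)
    print(f"{level}% up time allows about {minutes:.1f} minutes of downtime per year")
```

Only the 100% row allows zero minutes, which is exactly the promise no one can keep forever.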

Before you put your equipment in the hands of someone else, make the effort to visit them. If they don’t allow visits, that should be a big Red Flag #1. Talk to their operations and support people, particularly the folks who will be touching your equipment. If you’re not allowed to talk to them, that should be Red Flag #2. Ask them about their up time guarantee. If they look at you square in the eye and say 100%, that should be Red Flag #3. Kick the dust out of your shoes and move on.

Let me cordially invite you to visit our data center hosting facility this summer.  No red flags – just trustworthy, highly competent, and dependable people.

//spk

p.s. Happy 4th of July!


