Posts Tagged ‘downtime’



Keep The Change

Wednesday, June 23rd, 2010

Does this sound like your IT shop? Reports from the Uptime Institute consistently show that the majority of reliability and uptime woes aren’t caused by hardware, facilities, or utility failure – they’re caused by humans. And what, pray tell, are those humans doing? They’re changing things, and too often the change isn’t planned, approved, or documented. Or there is simply too much change going on at one time.

Much like a bomb is meant to explode, technicians are meant to be technical, so it’s a bit unrealistic to assume they’re giving a lot of thought to managing change, much less to expect them to be fond of doing so. They just want to git ‘er done, and in large part, we pay them well to not only do that, but to do it right the first time. Hard-core techies, the ones that really know how to make things work, typically aren’t also wired for sitting in management meetings. The problem with managing change is that it’s boring. It’s not technical. And explaining highly technical things to non-technical folks in a change management meeting is not always the average techie’s strong suit, nor perhaps the best use of their time. To the contrary, it can be a very frustrating experience for them, which can lead them over to the Dark Side of making changes beneath the radar. Effective change management therefore becomes a bit of a balancing act. We need to know what’s going on, but we don’t want to bog everyone down in the process.

In our data center, controlling change is not optional. Reliability demands it, as do the Spanish Inquisition (er, SAS 70 auditors). But we’ve found a way to manage it without terribly burdening our technical staff. A change request may be formally entered in the system by any authorized individual, technical or not; that individual is simply the person requesting the change. The request is then routed to a technician, who assesses what needs to be done, adds those details to the request, and suggests when it might be done. From there it’s passed on to someone in management, who assesses the risk and approves or disapproves it. If a change is of major significance, the request comes before a Change Advisory Board (CAB) for final approval. Technicians, while welcome, are not required to attend CAB meetings. When requests are properly documented, the CAB is almost always able to make a good decision without further involving the technical staff. When the CAB does need more information or defers a request for some reason (e.g. too many changes on one night), the technician in question is notified and it’s handled outside of a meeting. This saves time, money, and mental fatigue. Since the pain threshold is relatively low, this method also encourages all change activity to actually be run through the proper channels.
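For the techies who would rather read code than a process description, here is a rough Python sketch of that routing. It is not our actual change system; the class, the status names, and the `major` flag are all made up for illustration, and a real system would also record CAB deferrals and notifications.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    REQUESTED = "requested"
    ASSESSED = "assessed"
    AWAITING_CAB = "awaiting CAB"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ChangeRequest:
    requester: str            # any authorized person, technical or not
    summary: str
    major: bool = False       # hypothetical flag: True routes the request to the CAB
    details: str = ""
    proposed_window: str = ""
    status: Status = Status.REQUESTED
    history: list = field(default_factory=list)

    def _move(self, status, note):
        """Record every transition so the paper trail exists for the auditors."""
        self.status = status
        self.history.append((status.value, note))


def assess(req, technician, details, proposed_window):
    """The technician documents what must be done and suggests when."""
    if req.status is not Status.REQUESTED:
        raise ValueError("only a new request can be assessed")
    req.details = details
    req.proposed_window = proposed_window
    req._move(Status.ASSESSED, f"assessed by {technician}")


def review(req, manager, approve):
    """Management weighs the risk; major changes go on to the CAB."""
    if req.status is not Status.ASSESSED:
        raise ValueError("request must be assessed before review")
    if not approve:
        req._move(Status.REJECTED, f"rejected by {manager}")
    elif req.major:
        req._move(Status.AWAITING_CAB, f"endorsed by {manager}, sent to CAB")
    else:
        req._move(Status.APPROVED, f"approved by {manager}")


def cab_decide(req, approve, note=""):
    """The CAB decides from the documented request, no technician required."""
    if req.status is not Status.AWAITING_CAB:
        raise ValueError("request is not before the CAB")
    req._move(Status.APPROVED if approve else Status.REJECTED, f"CAB: {note}")
```

The point of the guards is the same as the point of the process: a request cannot reach an approver until a technician has documented it, so nobody has to be dragged into a meeting to explain it.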

Our process is capable of handling very high rates of change, but that doesn’t mean that we do so.  On the contrary, we try to minimize the rate of change, batching things together when it makes sense to  minimize outages, and spreading them out when the risk is high to maximize uptime.
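To make "batch when it makes sense, spread when the risk is high" a little more concrete, here is a minimal sketch of one way to plan change nights. The limits (three outages per night, one high-risk change per night) and the `(name, system, risk)` tuples are assumptions chosen for the example, not our actual rules.

```python
from collections import defaultdict


def plan_nights(changes, max_outages_per_night=3, max_high_risk_per_night=1):
    """Plan change nights from (name, system, risk) tuples.

    risk is "low" or "high". Returns a list of nights; each night is a
    list of outages, and each outage is a list of change names.
    """
    # Batch: low-risk changes to the same system share a single outage.
    batches = defaultdict(list)
    outages = []  # (is_high_risk, [change names])
    for name, system, risk in changes:
        if risk == "high":
            outages.append((True, [name]))
        else:
            batches[system].append(name)
    outages.extend((False, names) for names in batches.values())

    # Spread: place each outage on the earliest night that still has room.
    nights = []  # each entry: {"high": count, "outages": [[names], ...]}
    for is_high, names in outages:
        for night in nights:
            if len(night["outages"]) >= max_outages_per_night:
                continue
            if is_high and night["high"] >= max_high_risk_per_night:
                continue
            break  # this night has room
        else:
            night = {"high": 0, "outages": []}
            nights.append(night)
        night["outages"].append(names)
        night["high"] += int(is_high)
    return [night["outages"] for night in nights]
```

Two firewall-grade changes never land on the same night, while a handful of routine tweaks to one server cost that server only one outage.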

Managing change is not fun, and you may be justifiably weary of it.  Let us take that burden off of your shoulders.

//spk


Patches

Friday, September 25th, 2009

Meet Patches. Keep him healthy and he’ll be with you a long time. Look at that face! Knowing the anxiety Patches suffers when going to the vet, do you religiously take him every time there’s a new medicine on the market, just in case he might catch some exotic bug he has a .001% chance of contracting? No, most likely not. But you do take him to the vet for regular shots to prevent things dogs of his kind are likely to have problems with – a regular maintenance visit, you might say.

[Image: our dog Patches]

Why is it worth the cost and commotion of going for the maintenance visit, but not every time a new vaccine or pill is announced? Because the cost/benefit equation is right for one and not the other.

A lot can be learned from Patches about the discipline of patching servers. We are occasionally asked “How often should I patch my servers?” and we get into discussions with a wide variety of customers with widely differing views on the subject. Often though, we find that it largely boils down to one’s view of the world – is your glass half empty or half full? Certainly, we need to keep systems patched to at least the minimum level supported by our software vendors, but given the cost and commotion (dare I say trauma) of the patching process, how far beyond that is necessary or prudent? If you have Internet-facing assets, then clearly you want to keep those up to date with the latest security patches as soon as they’re available. But if you have private, stable, non-web assets well behind well-managed firewalls, a less rigorous approach is reasonable. There is no need or rational justification to apply a patch willy-nilly simply because it’s available. Who has not been the victim of downtime because an ill-behaved patch did something that it was not supposed to do? And, lest we forget, rebooting a Windows server after patching is not always a trivial event – just ask the sysadmin of a BlackBerry Enterprise Server.

Remember the purpose of infrastructure is to keep running – the very namesake of this blog. Our infrastructure does us no good when it’s down. Every patch brings with it some level of risk to uptime. So, the obvious thing to do would be to test every patch before we apply it to a production system. Do you? Really? Every time? Or is it easier to just apply the latest raft of fixes from say, Microsoft, and just hope for the best? For those of us who have to endure the regular water-boarding process of a SAS 70 Type II audit, hope is not a strategy. Not only do we have to test every patch before applying it to a live system, but we also have to prove that we did so, and that we have a defined process that passes muster with the auditors.

[Image: Agent Smith]

This process of patching is costly in terms of time, money, and risk. So how often should we patch? Somewhere between hope and SAS 70 lies the right answer for most of us. Like maintaining your car, regular maintenance of a server is necessary to keep a system “on the road.” This may mean spending time regularly (like an oil change) researching patches to see which of them you really need as opposed to those that make you feel warm and fuzzy, and then testing appropriately first. On the other hand, regular maintenance may not imply regular patching. If a system needs to be running the latest Windows server OS, or the application vendor forces your hand, then you will certainly be patching more often. If on the other hand, you have a functionally stable system that doesn’t change much, has been running well and isn’t the flagship of your ecommerce empire, then you will probably patch extremely infrequently if ever, and that’s OK. We’ve got a Red Hat 7.2 system here that sees heavy daily usage, has not been patched in years, and has not been hacked or had any problems over that same span of time. Sacrilege? Perhaps, but we believe it’s prudence. It could also be Pennsylvania Dutch stubbornness.

You do need a patching process, but it should reflect your particular situation and account for the nature of each of your servers. Like the Pirate Code, best practices in this arena are more like guidelines. You can spend a lot of money and create a lot of headaches with a one-size-fits-all approach. A socialist patching approach sounds good on paper, but as you would expect with anything socialistic, it tends not to work out well in reality.

[Image: the Pirate Code guidelines]

Weigh the risks and cost of downtime vs. the potential benefit of a patch.  Part of your process should include a justification phase where IT and business stakeholders have an opportunity to understand what is being patched, why it’s been deemed necessary, and what the possible ramifications are if things go awry. And, most importantly, the stakeholders should have both veto power and the power to determine the scheduling of patch activity.
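If it helps to see the rules of thumb from this post written down, here is a small Python sketch of a per-server patch triage. Treat it as guidelines in the Pirate Code sense: the field names, the return strings, and the order of the checks are assumptions made up for the example, and your own justification phase may well weigh things differently.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    internet_facing: bool
    stable_and_private: bool = False  # behind well-managed firewalls, rarely changes


@dataclass
class Patch:
    name: str
    security_fix: bool
    vendor_required: bool = False  # needed to stay at the vendor's supported level


def triage(server, patch, stakeholder_veto=False):
    """Suggest what to do with one patch on one server."""
    if stakeholder_veto:
        # Business and IT stakeholders get the last word on whether and when.
        return "hold"
    if patch.vendor_required:
        # Staying at the vendor's minimum supported level is not optional.
        return "test, then schedule"
    if server.internet_facing and patch.security_fix:
        return "test, then apply promptly"
    if server.stable_and_private:
        # Stable, private, and running well: available does not mean necessary.
        return "defer"
    return "review at next maintenance"
```

Note that every path that ends in applying a patch goes through "test" first; hope is still not a strategy.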

Patching is a necessary evil, but it is manageable if you take the time to think through the process and come up with a practical plan that fits your business. Or, you could simply delegate the process to folks who know how to both open and close Pandora’s box.

//spk


Infrastructure Friday

Friday, May 29th, 2009

Since only Robinson Crusoe had the luxury of getting everything done by Friday, the rest of us have to come up with other strategies to get all of the things done necessary to properly serve our customers. To help with this in our own data center, we have a pseudo-tradition called Infrastructure Friday. This is not to be confused with Redneck Tuesday:

[Image: redneck horseshoes]

On Infrastructure Fridays, members of the data center team who normally don’t work out on the floor put down email, IM and PDAs, roll up their sleeves, and step into the data center to help get some of the “real” work done. To keep IT running smoothly, we sooner or later have to stop talking about it and actually go and do something about it. That “something” we do includes taking care of ongoing operational details and implementing new functionality that maintains or improves reliability.

Like many other things, excellent  performance in the data center is all about execution and details.   Focus on the details and the big picture will take care of itself, or as Mel Gibson advised his young son in The Patriot, “Aim small, miss small.”    What sort of details are we talking about on Infrastructure Friday?

  • Not just performing rack inspections, but actually correcting any problems found.
  • Not just noting network latency issues, but getting the right people involved to isolate and resolve them.
  • Not just checking that critical monitoring systems in the NOC are healthy, but verifying they are actually working by simulating failures.
  • Not just verifying that operational documentation is current and complete, but actually updating it if it’s not.
  • Not just checking parts inventories (patch cables, cable management supplies, etc.), but placing the orders to replenish supplies.
  • Not just validating that data center standards are being followed (equipment mounted for proper air flow, floor tile placement, etc.), but actually correcting violations.
  • Not just noting that wire management is shoddy, but actually making it better.
  • Not just complaining that critical patch cables aren’t labeled, but actually getting out the label machine and doing the labeling.
  • Not just finding hot spots in the electrical system, but scheduling the downtime required to avert a future disaster.

Hopefully the theme is obvious.  On Infrastructure Friday, the goal isn’t to grouse about problems, it’s to fix them.

On a happier note, what sort of cool new functionality might we install on Infrastructure Fridays to improve reliability?  That’s a shorter list probably not worthy of a set of bullets, but it typically involves installing new or upgraded monitoring capabilities in the NOC,  adding additional monitoring instrumentation out on the floor, improving the quality and types of information on the master dashboards, and continuing to implement automated processes  to lessen the chance of unplanned downtime.    But again the theme is the same:  take action.

In the day-to-day blur of activity required to keep a live data center running, the Oughta List (we ought to do this, we ought to do that) of things that would improve reliability grows week by week, but the items on it never seem to get done because of the tyranny of the urgent. We find ourselves officially declared Too Busy to work on the Oughta List and before we know it, an outage occurs and the Oughta List suddenly becomes an embarrassing Shoulda List.

Infrastructure Friday is designed to overcome Oughta List inertia.  With a “try me” cost of zero, it has pretty good ROI.

//spk


