Disaster recovery plans are useful for more than just audit compliance: if carefully constructed, they can actually work. Purchasing a newfangled Oracle database replication license, for example, is a very small, almost irrelevant part of the overall disaster recovery plan. Let’s talk about some of the things disaster planners frequently overlook.
1. Remote Is Never Remote Enough
Many datacenter managers in the NYC World Trade Center thought that replicating important functions to the neighboring tower was good enough. If a power failure, water incident, or even a fire threatened one tower, the other could keep operating. Clearly that wasn’t remote enough. Perhaps, then, keeping your remote site across town is prudent? No: Hurricane Katrina showed us that was not good enough either.
Think on political and geographical scales. There may be a mountain range that spans two states and creates a scary valley where flooding could occur. More simply, you may just want to ensure that only one site is located next to an ocean or within an earthquake zone.
Remote sites can be cold, warm, or hot. A hot site is just another datacenter that runs all the time and is capable of immediate failover to take over the duties of a failed datacenter. A hot site generally holds a full copy of all data, which means you need to double your costs whenever you think about SAN or server expansion. Warm sites, on the other hand, often have running equipment, but lack immediate or full failover capabilities. The most common use of a warm site is to ensure the most critical services for a business are kept running. When time and conditions allow, staff can travel to the warm site and start bringing up less critical services and restoring data from backups. Cold sites may already have equipment in racks, but it’s powered off. Staff must travel to the cold site and restore services manually.
Some form of a warm site is generally the best bang for the buck, depending on how critical your IT services are. Just remember not to underestimate the severity of a disaster. The next city over is probably too close for a remote site.
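As a rough first-pass check on "remote enough," you can compute the great-circle distance between your primary and remote sites. The sketch below uses the standard haversine formula; the site coordinates and the 500 km threshold are illustrative assumptions, not recommendations. Distance is only a screening test: it says nothing about shared hazards such as a common coastline, flood plain, or fault line.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_remote_enough(primary, remote, min_km=500.0):
    """True if the two (lat, lon) sites are at least min_km apart.

    min_km is a hypothetical threshold; choose your own based on the
    regional hazards you are planning for.
    """
    return distance_km(*primary, *remote) >= min_km


# Illustrative coordinates only (approximate city centers).
NEW_YORK = (40.7128, -74.0060)
NEWARK = (40.7357, -74.1724)
CHICAGO = (41.8781, -87.6298)

if __name__ == "__main__":
    for name, site in (("Newark", NEWARK), ("Chicago", CHICAGO)):
        km = distance_km(*NEW_YORK, *site)
        print(f"{name}: {km:.0f} km, remote enough: {is_remote_enough(NEW_YORK, site)}")
```

A site that passes this check still needs the political and geographical review described above; a site that fails it is almost certainly too close.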
2. People Planning Takes More Time Than Servers
Servers, databases, and storage all have mechanisms to replicate themselves, if you research and choose the right products for the job. It’s important to test the failover procedures for a datacenter, or for a critical application, and verify that everything can run from the remote location. For people who already have a remote site, the tendency is to stop there, thinking they can survive anything.
Computers don’t run themselves, unfortunately.
Your people need to get to the remote site, and if your budget doesn’t allow for a fully replicated, millisecond-failover, redundant datacenter, your people will be spending a lot of time at the remote site setting things up. Where will they sleep? How will they eat? How will they get there in the first place?
Also, do not assume that critical employees who work in the main datacenter will be able to travel to the remote site and work. Some employees may have family to worry about, some may be unwilling to travel (or work) during an emergency, and some may even be injured themselves. It’s important to let employees deal with personal matters in their own way. If an employee goes “missing” following a disaster, but returns after a few weeks, they should be welcomed back with open arms. People deal with emergency situations in their own way. Don’t worry: you’ll have plenty of staff who are able to help out in a disaster. This is where documentation is critical, since some key personnel may be unavailable and less familiar staff will have to take up the slack.
3. Use Your Friends
This isn’t really a huge secret, but we’re pointing it out because it’s often not the first thing disaster planners think about. Shared warm site agreements between businesses are the best way to develop a remote datacenter. Often you can work out an exchange: “You can use two racks in my datacenter, and I’ll use two in yours.” This doesn’t usually work out between competitors, but channel partners and other business units within your own company are frequently more than happy to exchange space.
Your colleagues in different industries are also an excellent resource. There’s rarely any queasiness when you mention storing your data at completely disparate businesses, and often that is the best option. Also, as industries tend to group together geographically, sharing datacenter space with friends in diverse industries will naturally help with “being remote enough.”
4. You Cannot Imagine Every Scenario
It’s very tempting to start thinking about specific disasters, and then draw up plans for how to deal with each one. It’s easier that way; if you know a flood has happened, you can conjure up a plan to deal with closed roads and the like. But the probability that the disaster you actually face matches one of your scenarios is pretty small, unless you spend months thinking about every possibility.
In reality, the best practice for disaster planning is to take a few steps back, and plan for something extremely large. If you have a plan to deal with a completely unreachable city, then you’ve covered all cases where a subset of the city may be experiencing a disaster. If it happens that people can travel freely, then the plan can be short-circuited and made more efficient. Brief “if you can” scenarios are far more viable than spelling out what to do in the event of every disaster you can think of.
5. Everything You Need Is Not There
If you are like most companies, and cannot afford two times (or more) the expense of running your datacenter, you will not have everything you need at your remote warm site. When a disaster recovery plan is activated, scampering employees will often be powering up older machines that may not have been used recently. Hardware may have failed, and even if it hasn’t, you still don’t have everything you need.
Most companies refresh servers every three years, and it’s common practice to extend the life of servers by dedicating phased-out servers as disaster recovery gear. That gear likely isn’t as fast as your current production equipment. Your data has also likely grown since the last time storage was purchased for the remote site. Perhaps the tape drive your staff was going to use at the remote site to restore data fails partway through. You get the point.
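The storage gap in particular is easy to estimate ahead of time. Here is a minimal sketch, assuming your data grows at a roughly constant annual rate; the function names and all figures are hypothetical, so substitute your own measurements.

```python
def projected_size_tb(size_at_purchase_tb, annual_growth_rate, years_elapsed):
    """Estimate current data size, assuming compound annual growth.

    annual_growth_rate is a fraction, e.g. 0.5 for 50% growth per year.
    """
    return size_at_purchase_tb * (1 + annual_growth_rate) ** years_elapsed


def storage_shortfall_tb(current_data_tb, remote_capacity_tb):
    """How much data would not fit at the remote site (0 if everything fits)."""
    return max(0.0, current_data_tb - remote_capacity_tb)


if __name__ == "__main__":
    # Hypothetical figures: 100 TB of data when the remote storage was bought
    # two years ago, 50% annual growth, 150 TB of usable remote capacity.
    current = projected_size_tb(100.0, 0.5, 2)
    shortfall = storage_shortfall_tb(current, 150.0)
    print(f"Estimated data size: {current:.0f} TB")
    print(f"Remote site shortfall: {shortfall:.0f} TB")
```

Rerunning a check like this whenever production storage is expanded tells you how much of a full restore the remote site could actually hold.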
You need a disaster recovery kit, stored at the remote site in a locked container. It should include blank checks, credit cards, phone contact trees, and anything else that a frazzled and hurried employee might need. Again, it all comes back to the people who will execute the plan.