If you work in Information Technology, you have probably heard the terms Business Continuity (BC) or Disaster Recovery (DR).
If you’re creating a Disaster Recovery plan, one of the first things you need to do is determine what you need to protect. Not all data is created equal, and there’s likely no reason to replicate and store every bit and byte on your servers.
The two primary measures of how critical an IT system is are how much data you can afford to lose and how long you can afford to be down. The Recovery Point Objective (RPO) and the Recovery Time Objective (RTO) are the terms management uses to communicate these goals for a BC/DR system.
Recovery Point Objective
The first, the Recovery Point Objective (RPO), is the threshold of how much data you can afford to lose since the last backup. Defining your company’s RPO typically begins with examining how frequently backups take place. Since a backup can be intrusive to systems, it is typically not performed more often than every few hours. In the worst case, a failure just before the next backup loses everything written since the previous one, so your backup RPO is probably measured in hours of data loss.
RPO is a business decision as to how much current data (orders, invoices, inventory moves) the business is willing to risk if the production source system becomes unavailable.
RPO is a design spec, not a statistic. You architect your BC/DR solution around the RPO, build the solution to match the RPO, and then measure the solution’s performance against the RPO.
You won’t know if your actual recovery point will match your RPO unless you measure and adjust your system. Having an RPO of 30 seconds and achieving a recovery point of 30 seconds are two different things.
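As a rough illustration of measuring against the target, the sketch below compares the achieved recovery point (the gap between the last successful sync and the moment of failure) with an RPO of 30 seconds. The function names and the sample timestamps are hypothetical; a real measurement would come from your replication or backup tool's own logs.

```python
from datetime import datetime, timedelta

def achieved_recovery_point(last_sync: datetime, failure: datetime) -> timedelta:
    """Data-loss window: time between the last successful sync and the failure."""
    return failure - last_sync

def meets_rpo(last_sync: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True when the achieved recovery point is within the RPO target."""
    return achieved_recovery_point(last_sync, failure) <= rpo

# Hypothetical values: an RPO of 30 seconds and three switch-test measurements.
rpo = timedelta(seconds=30)
failure = datetime(2024, 1, 1, 12, 0, 0)
last_syncs = [failure - timedelta(seconds=s) for s in (12, 28, 45)]

results = [meets_rpo(sync, failure, rpo) for sync in last_syncs]
print(results)  # [True, True, False]
```

The third sample misses the target by 15 seconds, which is the kind of gap that measuring and tuning is meant to expose.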
The RPO can also be shrunk over time as you install, measure, and tweak your solution to create tighter synchronization between source and target.
Recovery Time Objective
The second, the Recovery Time Objective (RTO), is the threshold for how quickly you need to have an application’s information restored. For example, four hours, eight hours, or the next business day may be tolerable for e-mail systems. Keep in mind the amount of time it takes to provision servers, storage, networking resources and virtual machine configurations.
Using these two primary measures will help you understand your cost of downtime, help define a budget for an IT system continuity plan and determine the technology that meets your needs within your budget.
Like your RPO, an RTO is also a target, not a statistic. In the event of a major outage on a production box, it states your goals for restarting the system on a backup machine or partition.
For example, if you designate an RTO of two hours, your goal is to restore service within two hours in the event of a production system failure.
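To check whether a two-hour RTO is realistic, one option is to add up the time each recovery step is expected to take and compare the total with the target. The step names and durations below are made-up placeholders, not measurements; substitute figures from your own run book and tests.

```python
from datetime import timedelta

# Hypothetical recovery steps with estimated durations.
RUNBOOK = {
    "provision servers": timedelta(minutes=40),
    "attach storage": timedelta(minutes=20),
    "configure networking": timedelta(minutes=25),
    "restore virtual machine configurations": timedelta(minutes=30),
    "restore data and verify": timedelta(minutes=50),
}

def estimated_recovery_time(steps: dict) -> timedelta:
    """Total time if the steps run one after another."""
    return sum(steps.values(), timedelta())

rto = timedelta(hours=2)
estimate = estimated_recovery_time(RUNBOOK)
# How far the estimate overshoots the RTO (zero if it fits).
shortfall = max(estimate - rto, timedelta())

print(f"estimate={estimate}, rto={rto}, over by {shortfall}")
```

With these placeholder figures the estimate comes to 2 hours 45 minutes, so either the steps need to be shortened or run in parallel, or the RTO needs to be revisited.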
Your RTO time can vary depending on the type of BC/DR solution you’re using.
For clustered solutions, recovery to a backup machine can be almost instantaneous. For manually switched solutions where you need to execute run book steps to redirect production to your backup, it can take several hours. And for rebuilds where you have to perform a bare metal restore, it can take a day or longer.
An RTO is a business decision that will affect your solution choice.
Choosing a Solution
Finding the right balance of features and price to meet your RPO and RTO is one of the most critical things you can do to protect your business. For IT system continuity, there are three solution categories: backup, high availability and disaster recovery.
- Backup means keeping your data safe; in this situation, RPO is more critical than RTO.
- High availability means keeping your critical applications and data online; a high availability solution is required when RPO and RTO must be close to zero.
- Disaster recovery is the ability to recover data in case the production system is damaged, destroyed or becomes unavailable for an undeterminable period of time. A comprehensive disaster recovery solution that can restore data quickly and completely is required to meet low RPO and RTO thresholds.
There’s an inverse relationship between what it costs to implement a BC/DR solution and what it costs to run that same solution in an emergency.
In general, the less expensive the solution, the more expensive it will be to recover with that solution in an emergency. It doesn’t take much to back up your systems, but in a crisis an old-fashioned bare metal restore to a new machine could take a day or more (note: newer virtualization technologies can significantly reduce this gap). It may cost tens or even hundreds of thousands of dollars to set up a clustered or replicated system, but in a crisis a target system can be activated quickly, reducing the outage’s effect on your business.
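The trade-off can be made concrete with a simple model: total cost is the up-front cost plus the hours of downtime multiplied by what an hour of downtime costs the business. Every number below is an invented placeholder chosen only to show the shape of the comparison, and the `Option` class and `total_cost` function are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    setup_cost: float      # up-front cost to implement (hypothetical)
    recovery_hours: float  # expected downtime per outage (hypothetical)

def total_cost(option: Option, downtime_cost_per_hour: float, outages: int = 1) -> float:
    """Setup cost plus the business cost of downtime across the given outages."""
    return option.setup_cost + outages * option.recovery_hours * downtime_cost_per_hour

options = [
    Option("backup with bare metal restore", 5_000, 24),
    Option("replicated target system", 100_000, 0.5),
]

# Compare the options at two hypothetical hourly downtime costs.
cheapest = {}
for rate in (1_000, 10_000):
    best = min(options, key=lambda o: total_cost(o, rate))
    cheapest[rate] = best.name
    print(f"downtime at ${rate}/hour -> {best.name}")
```

In this sketch the backup-only option wins when downtime is cheap, and the replicated system wins once an hour of downtime costs enough to outweigh its setup price.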
So choose your RTO carefully and understand that it is intimately tied to the tolerance your company has for restoring service after an outage.
The other thing to note is that when you perform regular switch tests, you can also gauge how well you are meeting your RPO (how tightly your source and target machine data are in sync).
Properly communicated, RPO and RTO are valuable management tools for stating the goals for your solution and reporting how well you are meeting them.
Tip: Identify and document the criteria for declaring a disaster. Too often, those of us in the IT field have our heads down, deep in the troubleshooting process, trying to solve the problem without thinking about the business impact. With the criteria defined, everyone has a checkpoint at which to call it a disaster and begin executing the actual plan.
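One possible criterion, shown here only as an example, ties the checkpoint to the RTO: declare a disaster once the time already spent troubleshooting plus the time a failover takes would reach the RTO. The function name and the figures are hypothetical, and your documented criteria may well include other conditions.

```python
from datetime import timedelta

def should_declare_disaster(elapsed: timedelta,
                            failover_time: timedelta,
                            rto: timedelta) -> bool:
    """True once continued troubleshooting would leave too little time
    to fail over and still restore service within the RTO."""
    return elapsed + failover_time >= rto

# Hypothetical figures: a two-hour RTO and a failover that takes one hour.
rto = timedelta(hours=2)
failover_time = timedelta(hours=1)

early = should_declare_disaster(timedelta(minutes=30), failover_time, rto)
late = should_declare_disaster(timedelta(minutes=75), failover_time, rto)
print(early, late)  # False True
```

A rule like this gives the team a clock-based checkpoint, so the decision does not depend on someone noticing how long the troubleshooting has been running.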
Tip: When developing your Disaster Recovery plans, cover partial outages as well as total loss. Perhaps only a few systems or a single site are affected, and personnel could be moved to another location or connected through an alternate network path.