Business Continuity/Disaster Recovery

I received an email inquiring what a reasonable backup policy would be, given the following issues/constraints:

Industry Best Practices;
Fear for loss of data and inability to restore/replicate; and
Costs associated with backing up data, i.e. more expensive the more often data is backed up and stored for longer time periods

This is a great question and I figured I might as well put my thoughts down publicly.

Our backup policy:

In accordance with the Security Policy, a backup schedule has been established. The schedule includes a daily backup for each server.

The backup appliance takes a daily virtual machine snapshot of every server hosted at our data center in our virtual environment, as well as the physical server named Backup, which hosts the virtual management console and the shared drives. The snapshots are then replicated to our external data center as part of our DR strategy.

Each week the backups are written to an archive disk. The archive disks are rotated such that the oldest disk that doesn’t include the first day of the month will be overwritten. A disk that does include the first of a month will be kept for a minimum of 1 year.
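To make the rotation rule concrete, here is a minimal sketch of how the "which disk gets overwritten next" decision could be expressed. The function names, the 7-day window per disk, and the choice to make a monthly disk eligible again once it is more than 365 days old are my assumptions for illustration; the policy itself only says such disks are kept for a minimum of one year.

```python
from datetime import date, timedelta
from typing import Optional


def covers_first_of_month(week_start: date, days: int = 7) -> bool:
    """True if any day in the disk's window is the 1st of a month.

    Assumes each archive disk covers `days` consecutive days (hypothetical).
    """
    return any((week_start + timedelta(days=i)).day == 1 for i in range(days))


def disk_to_overwrite(week_starts: list[date], today: date) -> Optional[date]:
    """Return the start date of the oldest disk that may be overwritten.

    A disk is eligible if it does not cover the first of a month, or
    (assumption) if it does but is already more than 365 days old.
    """
    one_year = timedelta(days=365)
    eligible = [
        d for d in week_starts
        if not covers_first_of_month(d) or today - d > one_year
    ]
    return min(eligible) if eligible else None


if __name__ == "__main__":
    disks = [date(2024, 1, 29), date(2024, 2, 5), date(2024, 2, 12), date(2024, 2, 19)]
    # The disk starting 29 Jan covers 1 Feb, so it is protected for a year.
    print(disk_to_overwrite(disks, today=date(2024, 2, 26)))
```

In this example the disk that started on 29 January is skipped because its window includes 1 February, so the disk from 5 February is the next one to be reused.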

I know this is technical, so I’ll break it down a little bit. The first thing to keep in mind is that a Business Continuity Plan/Disaster Recovery Plan (BCP/DR) is separate from a backup plan, although the two are interrelated. This means it’s possible to have a BCP/DR without creating a lot of backups. In practice, a BCP means you’re replicating your entire server environment to an offsite facility: hot, warm or cold. Hot means you can flick a switch and be up and running at the offsite facility in seconds. Warm means you can be up and running in minutes or hours. Cold means it might take a day or several to be up and running.

You make the business decision on hot/warm/cold by figuring out what your Recovery Time Objective (RTO) is. If your RTO is seconds, you need a hot site. If it’s days, you need a cold site. Obviously, hot sites are more expensive than cold ones.

The other idea to keep in mind is your Recovery Point Objective (RPO): how much recent data can you afford to lose if something bad happens? With daily backups, the most recent restorable copy can be up to a day old, so your RPO is 1 day. How far back in time you can go is a related but separate question, and it is set by how long you keep your backups. If you keep daily backups for 14 days, you can, theoretically, go back between 1 and 14 days if a disaster happens. If you can live with that, then you have a good backup strategy. If you think you need more than that, you should modify accordingly.
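As a rough illustration of the hot/warm/cold decision, the sketch below maps an RTO to a site tier. The function name and the exact cut-offs (under a minute for hot, under a day for warm) are assumptions I picked to match the "seconds", "minutes/hours" and "a day or several" descriptions; your own thresholds are a business decision.

```python
from datetime import timedelta


def site_tier(rto: timedelta) -> str:
    """Suggest a recovery site tier for a given Recovery Time Objective.

    Thresholds are illustrative assumptions, not an industry standard.
    """
    if rto < timedelta(minutes=1):
        return "hot"   # up and running in seconds
    if rto < timedelta(days=1):
        return "warm"  # up and running in minutes or hours
    return "cold"      # a day or several


if __name__ == "__main__":
    for rto in (timedelta(seconds=30), timedelta(hours=8), timedelta(days=3)):
        print(rto, "->", site_tier(rto))
```

An RTO of a few hours lands in the warm tier here, while anything measured in days falls through to cold.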

NCIGF’s RPO is 1 day (although we can go back one full year if we need to) and our RTO is 4-8 hours, so we have a warm site. We can be up and running at our disaster facility in several hours with yesterday’s backup data.
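Those two numbers can be checked mechanically against a list of restore points. The sketch below reads them as "the newest restorable copy is at most a day old" and "the oldest copy reaches back about a year"; the function name, the dictionary keys and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def check_restore_points(
    restore_points: list[datetime],
    now: datetime,
    rpo: timedelta = timedelta(days=1),
    retention: timedelta = timedelta(days=365),
) -> dict[str, bool]:
    """Report whether the available restore points satisfy RPO and retention."""
    if not restore_points:
        return {"meets_rpo": False, "meets_retention": False}
    newest, oldest = max(restore_points), min(restore_points)
    return {
        # Newest copy must be no older than the RPO.
        "meets_rpo": now - newest <= rpo,
        # Oldest copy must reach back at least as far as the retention target.
        "meets_retention": now - oldest >= retention,
    }


if __name__ == "__main__":
    now = datetime(2024, 6, 1, 9, 0)
    points = [
        datetime(2023, 6, 1, 2, 0),   # archived monthly copy (made-up sample)
        datetime(2024, 5, 25, 2, 0),  # weekly archive (made-up sample)
        datetime(2024, 6, 1, 2, 0),   # last night's snapshot (made-up sample)
    ]
    print(check_restore_points(points, now))
```

If the nightly job silently stopped a few days ago, `meets_rpo` turns false even though the year-old archive is still there, which is exactly the situation you want to catch early.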

I hope all that makes sense. It can be a little dense to suss out. To answer the bullet points specifically:

Industry Best Practices – develop your business case for RTO and RPO. Those values guide your policy and cost.
Fear for loss of data and inability to restore/replicate – this is very, very real. And this is what I want to stress the most in the entire post: if you don’t test your backups on at least a quarterly basis, you don’t have a backup plan. Lots of people create backups. Then, when a disaster happens, they realize their backups are broken and can’t be used. We do a partial restore test on a quarterly basis and do a full restore test once a year.
Costs associated with backing up data, i.e. more expensive the more often data is backed up and stored for longer time periods – this is definitely true. The more often you back up (i.e. the smaller your RPO) and the further back you want to be able to go (i.e. the longer your retention), the more you’ll pay. But the biggest driver of cost is RTO. How quickly do you want to be up and running in the event of a disaster?
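Since restore testing is the point I care most about, here is a minimal sketch of what an automated check after a test restore might look like: it compares SHA-256 hashes of the original files against the restored copies. This is not how our backup appliance does it. The function names and directory layout are assumptions, and the comparison only makes sense for data that has not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return relative paths that are missing or differ in the restored copy."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(rel.as_posix())
    return problems


if __name__ == "__main__":
    # Hypothetical paths; point these at a real source and a test restore.
    issues = verify_restore(Path("shared_drive"), Path("restore_test"))
    print("restore OK" if not issues else f"problems: {issues}")
```

An empty list means every source file came back byte-for-byte identical. Anything else is a file you would have lost in a real disaster.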

Hope this helps.
