Disaster recovery to the Cloud – Amazon EC2 & ShadowProtect
I’ve spent the last week setting up a cloud-based backup system which will image a physical or virtual Windows server to Amazon EC2’s cloud-based servers, and allow a complete copy of that image to be spun up in the cloud if and when a company’s primary office becomes unavailable.
Disaster recovery and backup systems are usually measured using two key metrics – "Recovery Point Objective" (RPO) and "Recovery Time Objective" (RTO). The first of these is the amount of data you're prepared to lose if a disaster strikes; the second is how long it takes to get up and running again. So if you do a tape-based nightly backup, you could easily lose a day of data, and it might take you two days to recover it. The RPO of the solution I've got here is 15 minutes. That's achieved using ShadowProtect from StorageCraft – which takes a consistent snapshot up to 96 times a day – meaning that you shouldn't lose any more than the last 15 minutes of your work, even if your office is wiped out in a fire or flood.
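That snapshot frequency is where the RPO figure comes from – the arithmetic is trivial, but worth spelling out:

```shell
# 96 consistent snapshots spread evenly across a day works out
# at one snapshot every 15 minutes - hence the 15-minute RPO.
snapshots_per_day=96
interval_min=$(( 24 * 60 / snapshots_per_day ))
echo "snapshot every ${interval_min} minutes"
```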
ShadowProtect can also perform a hardware-independent restore (HIR), which means the image of your server can boot on a different system. Even if it's a physical server we're backing up, the image can be run on a virtual server if your main server becomes unavailable.
And we’re extending that idea with Amazon EC2. We run ShadowProtect Image Manager software on our cloud-based Windows 2008 R2 server, which sits on a virtual EC2 instance in a data centre in Ireland. At client sites, ShadowProtect runs on their key servers, backing up in the first instance to a computer on their site. That computer also runs Image Manager, and transfers the incremental images over to the server at Amazon.
The tricky bit comes when you need to convert all those 15-minute snapshots into a cloned server at the EC2 end. ShadowProtect handles this for us – but there's a problem. A vital step is the hardware-independent restore, which needs to run on the system we're restoring to. That restore environment boots from a CD or USB key, and right now there's no way to handle that boot process on Amazon.
That’s what I’ve spent the last week on. I’ve looked at various ways to do it, including running the hardware independent restore software on a new instance, and connecting that to the volume that will become our new computer. What I’ve settled on is to use Oracle’s VirtualBox software running on Amazon EC2.
Lots of people will tell you that running VirtualBox on Amazon EC2 simply won't work. Well, it does. The problems come if you try to run a 64-bit guest inside VirtualBox, which in turn sits on a virtualised computer in EC2 – a 64-bit guest needs hardware virtualisation extensions (VT-x), and Amazon's own hypervisor doesn't pass those through to the instance. But running a 32-bit image in VirtualBox works just fine.
So – we’re logged onto the (64bit) Windows 2008 R2 instance that we use to store all the server images – inside Amazon EC2. Then we load up VirtualBox inside that, and boot a fresh virtual machine off the Shadow Protect recovery disk iso. We have a blank vhd disk in there as well (which will become the new server). And we have a second vhd disk which contains ShadowProtect’s continuous incremental backup files, (spi files) plus its initial backup (spf file).
ShadowProtect restores those image files onto our new VHD. We go through the HIR process as normal, make sure the partition we've restored to is marked active, and shut down the virtual machine again.
We've now got a VHD file which is the root volume of our newly recovered server. Now, if it's a 64-bit server (which it probably is), there's no way we can run it in VirtualBox. And even if it's 32-bit, it would be slow (a virtual machine running inside another virtual machine) – certainly not much good for 100 staff all wanting to log in to the servers from home the day after a client's building has burned down.
So here's where the Amazon part comes in. We use their VM Import/Export tools to pull our new VHD image into a fresh EC2 instance. The EC2 API tools are installed on our master Amazon machine, so we run ec2-import-instance, specifying the new VHD image, and Amazon begins pulling it into a new instance.
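In outline, the import step looks something like this – a sketch in which the bucket name, credentials, instance size and VHD filename are placeholders, again using DRY_RUN=echo so the commands are printed rather than executed:

```shell
# Sketch only: placeholder bucket/keys/sizes; DRY_RUN=echo prints
# rather than runs the EC2 API tools commands.
DRY_RUN=echo

# Push the recovered VHD into a new Windows instance in Ireland
$DRY_RUN ec2-import-instance restored-server.vhd \
    -f VHD -p Windows -a x86_64 -t m1.large \
    -b my-import-bucket -o "$AWS_ACCESS_KEY" -w "$AWS_SECRET_KEY" \
    --region eu-west-1

# Poll the conversion task until it reports completed
$DRY_RUN ec2-describe-conversion-tasks --region eu-west-1
```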
Once that process completes, there’s a new instance sitting in the Amazon Management Console, ready to be booted. We access that with Remote Desktop, in the same way we’d have accessed the original physical machine, and we’re away.
And that’s it — complete disaster recovery into the Amazon cloud, with an RPO of 15 minutes.
There are a few caveats. Firstly, StorageCraft claim a 2-minute RTO for their ShadowProtect solution. In other words, if you have a disaster, you can be up and running with a cloned replacement server in 2 minutes. Now, that may be true for the standby machine we've got on a client's site – and that machine would always be the one you'd restore to first, if that were possible. But if the client site is no longer accessible following a disaster and we need to do a remote restore on Amazon, the process is going to take longer.
The HIR part on VirtualBox takes perhaps 15 minutes, and the import of the VHD into a new Amazon instance takes a couple of hours – very dependent on the amount of data we're restoring, of course. All in all, we're looking at an RTO of something like 4 hours, though I'll be refining the process to make it slicker and to automate as much of it as possible. It's still miles better than getting those nightly backup tapes restored onto replacement hardware, but it's not the miracle RTO that you'd get from a local ShadowProtect restore.
Another issue – while we have blazing fast replacement servers in Amazon's data centre, it's trickier when the time comes to restore to physical servers at the client's new offices. To do that, we need to get the image files back out of the cloud and onto the client's new site. Fine if they have a 100Mbit connection, but slow if not. One option is Amazon's import/export facility, where they ship your data on a hard disk from Ireland. The good news is that the replacement cloud servers are maintained for as long as needed, so buildings can be set up, servers fixed and Internet connections commissioned without stopping staff from working on the temporary servers.
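To put some rough numbers on that transfer – the 100 GB image size here is purely illustrative, not a measurement from this setup:

```shell
# Back-of-envelope transfer time: gigabytes -> megabits, divided by
# line speed, gives seconds on the wire (best case, ignoring protocol
# overhead and contention).
image_gb=100     # illustrative image size
line_mbit=100    # a fast 100Mbit connection
seconds=$(( image_gb * 8 * 1000 / line_mbit ))
hours=$(( seconds / 3600 ))
echo "roughly ${hours}h+ to pull ${image_gb} GB down at ${line_mbit} Mbit/s"
```

Halve the line speed and the time doubles – which is why the hard-disk shipping option can win for slower connections.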
The advantage is clear, though: Off-site disaster recovery to a facility that has large amounts of bandwidth and servers that can be as powerful as needed. Plus, no need to have a stack of physical machines sitting in our office just waiting to be called up to replace machines at a client’s temporary site. Instead, we provision the replacements only when they’re needed. And that means the costs of a full off-site disaster recovery can be much lower than other solutions.