Disaster recovery to the Cloud – Amazon EC2 & ShadowProtect

I’ve spent the last week setting up a cloud-based backup system which images a physical or virtual Windows server to Amazon EC2, and allows a complete copy of that image to be spun up in the cloud if and when a company’s primary office becomes unavailable.

Disaster recovery and backup systems are usually measured using two key metrics: “Recovery Point Objective” (RPO) and “Recovery Time Objective” (RTO). The first is the amount of data you’re prepared to lose if disaster strikes; the second is how long it takes to get running again. So if you rely on a tape-based nightly backup, you could easily lose a day of data (the RPO), and it might take you two days to recover it (the RTO). The RPO of the solution I’ve got here is 15 minutes. That’s achieved using ShadowProtect from StorageCraft, which takes a consistent snapshot up to 96 times a day (one every 15 minutes), meaning that you shouldn’t lose any more than the last 15 minutes of your work, even if your office is wiped out in a fire or flood.

ShadowProtect can also perform a hardware-independent restore (HIR), which means the image of your server can run on a different system and still boot up. Even if we’re backing up a physical server, the image can be run on a virtual server if your main server becomes unavailable.

And we’re extending that idea with Amazon EC2. We run ShadowProtect’s ImageManager software on our cloud-based Windows 2008 R2 server, which sits on a virtual EC2 instance in a data centre in Ireland. At client sites, ShadowProtect runs on their key servers, backing up in the first instance to a computer on their site. That computer also runs ImageManager, which transfers the incremental images over to the server at Amazon.

The tricky bit comes when you need to convert all those 15-minute snapshots into a cloned server at the EC2 end. ShadowProtect handles this for us, but there’s a problem: a vital step is the hardware-independent restore, which needs to run on the system we’re restoring to. That software boots from a CD or USB key, and right now there’s no way to handle that boot process on Amazon.

That’s what I’ve spent the last week on. I’ve looked at various ways to do it, including running the hardware-independent restore software on a new instance and connecting it to the volume that will become our new computer. What I’ve settled on is Oracle’s VirtualBox software, running on Amazon EC2.

Lots of people will tell you that running VirtualBox on Amazon EC2 simply won’t work. Well, it does. The problems come if you try to run a 64-bit server inside VirtualBox, which in turn sits on a virtualised computer in EC2: VirtualBox needs hardware virtualisation support (VT-x) for 64-bit guests, and that isn’t available inside an already-virtualised EC2 instance. But running a 32-bit image, which VirtualBox can do in software, works just fine.

So – we’re logged onto the (64-bit) Windows 2008 R2 instance that we use to store all the server images, inside Amazon EC2. Then we load up VirtualBox inside that, and boot a fresh virtual machine off the ShadowProtect recovery disc ISO. We attach a blank VHD (which will become the new server), plus a second VHD containing ShadowProtect’s continuous incremental backup files (.spi files) and its initial full backup (.spf file).
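
For reference, here’s roughly what that VirtualBox setup looks like if you script it with VBoxManage rather than clicking through the GUI. This is a sketch: the VM name, paths and disk size are placeholders, not what we actually use.

    REM Create and register a 32-bit Windows VM for the restore
    VBoxManage createvm --name "sp-restore" --ostype Windows2008 --register
    VBoxManage modifyvm "sp-restore" --memory 2048 --boot1 dvd --boot2 disk

    REM Blank VHD that will become the recovered server's root volume (size in MB)
    VBoxManage createhd --filename "C:\restore\new-server.vhd" --format VHD --size 102400

    REM Attach the recovery ISO, the blank VHD, and the VHD holding the
    REM ShadowProtect backup chain (the .spf full plus the .spi incrementals)
    VBoxManage storagectl "sp-restore" --name "SATA" --add sata
    VBoxManage storageattach "sp-restore" --storagectl "SATA" --port 0 --device 0 --type dvddrive --medium "C:\restore\shadowprotect-recovery.iso"
    VBoxManage storageattach "sp-restore" --storagectl "SATA" --port 1 --device 0 --type hdd --medium "C:\restore\new-server.vhd"
    VBoxManage storageattach "sp-restore" --storagectl "SATA" --port 2 --device 0 --type hdd --medium "C:\restore\backup-images.vhd"

    REM Boot into the recovery environment
    VBoxManage startvm "sp-restore"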

ShadowProtect restores those image files onto our new VHD. We go through the HIR process as normal, make sure the partition we’ve restored to is marked active, and shut down the virtual machine again.
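
Marking the partition active can be done from the recovery environment’s command prompt with diskpart; the disk and partition numbers below are just examples and will vary:

    diskpart
    DISKPART> select disk 0
    DISKPART> list partition
    DISKPART> select partition 1
    DISKPART> active
    DISKPART> exit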

We’ve now got a VHD file which is the root volume of our newly recovered server. Now, if it’s a 64-bit server (which it probably is), there’s no way we can run it in VirtualBox. And even if it’s 32-bit, it would be slow (a virtual machine running inside another virtual machine). Certainly not much good for 100 staff who all want to log in to the servers from home the day after a client’s building has burned down.

So here’s where the Amazon part comes in. We use their VM Import/Export tools to pull our new VHD image into a fresh EC2 instance. The AWS API tools are installed on our master Amazon machine, so we run ec2-import-instance, specifying the new VHD image, and Amazon begins pulling it into a new instance.
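
For the record, the kick-off looks something like this with the EC2 API tools on the command line. The bucket, instance type, region and credential variables below are placeholders for whatever you’re using:

    REM Upload the VHD and start the conversion into a new Windows instance
    ec2-import-instance C:\restore\new-server.vhd ^
        -f VHD -p Windows -a x86_64 -t m1.large ^
        -b my-import-bucket -o %AWS_ACCESS_KEY% -w %AWS_SECRET_KEY% ^
        --region eu-west-1

    REM Check on the conversion task until it finishes
    ec2-describe-conversion-tasks --region eu-west-1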

Once that process completes, there’s a new instance sitting in the AWS Management Console, ready to be booted. We access it with Remote Desktop, in the same way we’d have accessed the original physical machine, and we’re away.

And that’s it — complete disaster recovery into the Amazon cloud, with an RPO of 15 minutes.

There are a few caveats. Firstly, StorageCraft claim a two-minute RTO for their ShadowProtect solution. In other words, if you have a disaster, you can be up and running with a cloned replacement server in two minutes. Now, that may be true for the standby machine we’ve got on a client’s site – and that machine would always be the one you’d restore to first, if that were possible. But if the client site is no longer accessible following a disaster and we need to do a remote restore on Amazon, that process is going to take longer.

The HIR part on VirtualBox takes perhaps 15 minutes, and the import of the VHD into a new Amazon instance takes a couple of hours, depending heavily on the amount of data we’re restoring. All in all, we’re looking at an RTO of around four hours – though I’ll be refining the process to make it slicker, and looking at ways to automate as much of it as possible. It’s still miles better than getting nightly backup tapes restored onto replacement hardware, but it’s not the miracle RTO you’d get from a local ShadowProtect restore.

Another issue — while we have blazing fast replacement servers in Amazon’s data centre, it’s trickier when the time comes to restore to physical servers at the client’s new offices again. To do that, we need to get the image files back out of the cloud and onto the client’s new facility. Fine if they have a 100Mbit connection, but slow if not. One option is to use Amazon’s Import/Export facility (where they send your data on a hard disk from Ireland). The good news is that the replacement cloud servers are maintained for as long as needed, which allows buildings to be set up, servers to be fixed and Internet connections to be commissioned, all without stopping staff from working on the temporary servers.

The advantage is clear, though: off-site disaster recovery to a facility that has large amounts of bandwidth, and servers that can be as powerful as needed. Plus, there’s no need to have a stack of physical machines sitting in our office just waiting to be called up to replace machines at a client’s temporary site. Instead, we provision the replacements only when they’re needed. And that means the cost of a full off-site disaster recovery can be much lower than with other solutions.

Tom

16 replies
  1. Micah Davis says:

    Curious about getting the images from the local site to AWS. Are you using a replication target with IntelligentFTP or ShadowStream? Or something independent?

  2. Tom Kerswill says:

    Hi Micah

    Yes, we’re just using the basic IntelligentFTP for now. I don’t think ShadowStream is necessary unless there’s slowness with IntelligentFTP.

    One other really interesting thing we’re going to look at is using Amazon’s Storage Gateway. That way, you can mount local storage (iSCSI) and save ShadowProtect images to that. That’s then replicated and available on Amazon S3. So that might be another interesting way to get the data over…

    Tom

  3. Zalman says:

    Keen to know what storage you are using – is it S3? Unfortunately S3 is rather costly. Glacier has some excellent pricing, but it’s no good for ShadowProtect images just yet.

  4. tomkerswill says:

    Hi Zalman

    We’re actually using Elastic Block Store (EBS) to store the ShadowProtect images. That’s even more expensive than S3, but it “just works” with ImageManager (IntelligentFTP job), because ImageManager just sees it as local storage. I think S3 would work just fine too.

    I’m sure that Glacier would work in theory – the individual files are small, and I think that would be okay. BUT, I guess Glacier’s going to massively impact your RTO, because it can take a few hours to retrieve files from Glacier. So for archiving it’s great, but for a quick restore, not so good I think…

  5. Doug Youd says:

    Hi Tom,

    Very cool setup. Thanks for posting.

    I work for a small cloud provider in Australia as a solutions architect. This is the architecture I’ve been suggesting our MSPs and integrators use for a while now.

    In our case, we use a vCloud platform, which allows you to modify the boot process. Would that allow you to skip the second phase of the process (and massively cut down the RTO)?

    Good to know there are other people out there with a similar thought process 🙂

    Hit me up if you’d like to talk some more, I’d love to chat about it.

    -Doug

    • tomkerswill says:

      Hi Doug

      Great to hear from you – and sorry for the delay approving the comment. That sounds very interesting indeed. Yes – I think it would make things way simpler. Can you actually log in (via a virtual KVM, I guess you’d call it) and watch the boot process happening – e.g. see the BIOS and choose to boot from an ISO? If so, you could boot from the ShadowProtect ISO and very quickly do the repair / hardware / driver provisioning of the image… That would make things a lot easier, I think.

      Tom

  6. mightyteegar says:

    “Another issue — while we have blazing fast replacement servers in Amazon’s data centre, it’s trickier when the time comes to restore to physical servers at the client’s new offices again.”

    This is what StorageCraft’s HeadStart Restore module addresses. Take a look at it — we use it for our own DR purposes. It automatically consolidates and “preps” ShadowProtect images for rapid recovery.

  7. Rene Bravo says:

    Hi Tom,
    Your post is very interesting. Have you improved on your DR process?
    We are looking at inexpensive options for backing up a local Windows server (physical or virtual) to AWS and being able to “rapidly” restore it into an EC2 instance if needed. We have tried Avlor.com, but it doesn’t work every time. We also tried cloudvelocity.com; it works, but it is very expensive (tens of thousands of dollars). What is your current recommendation in 2014?

  8. AJ Gyomber says:

    Tom,

    Were you able to perfect this using AWS and are you using it in production? Were you able to implement any automation? I’d love to hear an update!

    –AJ

    • tomkerswill says:

      Hi AJ,

      Yes, it ended up working quite well, but we didn’t get it to the point where it was seamless. We’re actually also looking at ActiveImageProtector from NetJapan. That’s a really similar product to ShadowProtect, but it’s got a really nice command line interface – so I think there’s more scope for scripting it. The more we can automate, and the less that has to be done manually, the better!

      I’ll keep you updated on how it goes with these!

      Tom

  9. Allan Wright says:

    Hi Tom,
    This sounds pretty much like what we are looking to do: moving to AWS for our offsite storage solution rather than our in-house solution in a colo. I’ve had some patchy experience with NetJapan (we had strong ties with the StorageCraft EU guys before the recent changes, and many have gone over to the NetJapan camp). I’d like to compare notes offline if you’d be willing, and see if between us we can get this as slick as can be.

  10. jimmy says:

    Hello Tom,
    That’s a great explanation, but I have a question – it might be stupid! VM import can be done if the image exists in AWS S3, but how do you import the VHD image from an EBS volume? Is there a special command I’m missing? Thanks

  11. Phil says:

    The easiest way to automate the whole thing is to use HeadStart Restore on your Amazon instance for your VMs. That way, you’ll have an always-ready-to-go VHD, which can then be dumped into EC2. Using HSR, you eliminate the VirtualBoot/VirtualBox step.

  12. Tom Kerswill says:

    @jimmy – Yes, you’re right – I think we did end up having to copy the image from EBS to S3 – but there’s no traffic cost for this, since it’s an EBS-to-S3 transfer.

    @phil – yes, that is a good idea. I think when we first tried this we thought we still needed to boot into the recovery environment to do driver injection – but perhaps that isn’t necessary. I’m going to try this again, using just HeadStart Restore and automating the copy from EBS to S3.
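
    For anyone scripting that EBS-to-S3 copy, the newer AWS command line tools make it a one-liner. A sketch – the drive letter and bucket name are placeholders:

        REM Copy the finished VHD from the EBS volume (mounted as D:) up to S3
        aws s3 cp D:\images\new-server.vhd s3://my-import-bucket/new-server.vhd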

  13. Paul says:

    Are you still running this setup with no issues?

    How do you seed the ShadowProtect images to the EC2 instance?

    Have you ever had to export the SP images from the EC2/S3 storage? If so, how did that work out for you?

  14. tomkerswill says:

    Thanks so much for the comment. Yes – we’ve been testing it, but to be honest, with most of our clients, we back up to local servers because usually the failure they want to recover from is a single server going wrong.

    With EC2, this gives us the ability to completely recover to the cloud and have people using Remote Desktop to connect to a cloned server running within a VPC. But it’s not as seamless as I’d like.

    To get the data into EC2, we simply run ImageManager on an EC2 instance and can then copy the data to S3…

    To get the data out – we normally don’t have to, as we have ImageManager running locally on our own servers here; if we wanted to bring a replacement server to the client, we’d usually do it from those. The Amazon solution is more for complete recovery to the cloud.

    I’ve also been to a few NetJapan presentations recently. They have an interesting product and it looks much more scriptable than ShadowProtect, but it’s otherwise quite similar in its functionality…

