AWS Disaster Recovery Scenarios
Disaster Recovery (DR) is all about “preparing for” or “recovering from” a disaster [1]. In this blog, I explain the Disaster Recovery scenarios (options) available on AWS. A high-level understanding of these options is important when we design fault-tolerant, highly available AWS solution architectures.
As a first step, let's understand what RPO and RTO mean, and then dive into the DR scenarios on AWS.
RPO vs RTO
RPO and RTO are targets that we set before building a DR system for any application that is going to be deployed in the cloud.
The lower both figures are, the closer the system is to having a near-real-time DR plan.
Recovery Time Objective (RTO)
This indicates the time it takes to recover from a disaster (restoring a business process to its service level, as defined by the Operational Level Agreement).
For example, if a disaster occurs at 12:00 PM (noon) and the RTO is four hours, the DR process should recover the system by 4:00 PM.
Recovery Point Objective (RPO)
The acceptable amount of data loss measured in time.
For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all the data that was in the system before 11:00 AM. The data loss is then at most one hour: whatever was written between 11:00 AM and 12:00 PM (noon).
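The two examples above boil down to simple time arithmetic. Here is a minimal sketch of it in Python; the function names and the example date are my own choices for illustration, not part of any AWS tooling.

```python
from datetime import datetime, timedelta


def recovery_deadline(disaster_at: datetime, rto: timedelta) -> datetime:
    """Latest time by which the system must be recovered to meet the RTO."""
    return disaster_at + rto


def oldest_acceptable_recovery_point(disaster_at: datetime, rpo: timedelta) -> datetime:
    """Data written before this moment must survive; later data may be lost."""
    return disaster_at - rpo


def meets_objectives(disaster_at: datetime, restored_at: datetime,
                     last_backup_at: datetime, rto: timedelta, rpo: timedelta) -> bool:
    """True when both the recovery time and the data-loss window are within target."""
    within_rto = restored_at <= recovery_deadline(disaster_at, rto)
    within_rpo = last_backup_at >= oldest_acceptable_recovery_point(disaster_at, rpo)
    return within_rto and within_rpo


noon = datetime(2024, 1, 1, 12, 0)  # arbitrary example date
print(recovery_deadline(noon, timedelta(hours=4)))                 # 4:00 PM the same day
print(oldest_acceptable_recovery_point(noon, timedelta(hours=1)))  # 11:00 AM the same day
```

A check like `meets_objectives` can be handy after a DR drill, to compare what actually happened against the targets you set.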
Disaster Recovery Scenarios in AWS
There are basically four Disaster Recovery scenarios identified in AWS (See Figure 2). Some of them have a higher RTO and some have a lower RTO. It is always good to understand how we can minimize RTO and what level of commitment is needed to achieve each of those levels.
- Backup and Restore — Data is backed up and restored
- Pilot Light — Only minimal critical functionalities
- Warm Standby — Fully functional scaled down version
- Multi Site (Active-Active) — Another fully functional site
Out of these four scenarios, Multi Site (Active-Active) has the lowest RTO and Backup and Restore has the highest RTO.
Note: In these scenarios, the “primary infrastructure” is the site where the disaster happens and the “secondary infrastructure” is the recovery infrastructure. The “primary infrastructure” could be either “on-premise” or an “AWS infrastructure”. The “secondary infrastructure” will be an “AWS infrastructure”.
Let's dive into these four scenarios in a little more detail.
1.0 Backup and Restore
There are multiple backup options available.
- Amazon S3 — Amazon S3 is an ideal destination for backup data that might be needed quickly to perform a restore.
- Amazon Glacier — Glacier can also be used in conjunction with S3 to produce a tiered “long-term” backup solution.
- AWS Import/Export — Can be used to transfer very large data sets by shipping storage devices directly to AWS.
- AWS Storage Gateway — Enables snapshots of your on-premise data volumes to be transparently copied into S3 for backup. Cached volumes allow you to store primary data in S3 while keeping your frequently accessed data local for low-latency access. The virtual tape library (VTL) configuration can be used as a replacement for traditional magnetic tape backup.
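The tiered S3 plus Glacier setup mentioned in the list can be expressed as an S3 lifecycle rule. The sketch below only builds the rule as a plain dictionary; the function name, rule ID, prefix and the 30-day threshold are assumptions made for illustration.

```python
def tiered_backup_lifecycle(prefix: str, days_before_archive: int,
                            rule_id: str = "archive-old-backups") -> dict:
    """Build an S3 lifecycle configuration that moves backups to Glacier.

    Objects under `prefix` stay in S3 for quick restores and transition to
    the GLACIER storage class after `days_before_archive` days.
    """
    if days_before_archive < 0:
        raise ValueError("days_before_archive must not be negative")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_before_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Hypothetical values: backups live under "backups/" and are archived after 30 days.
lifecycle = tiered_backup_lifecycle("backups/", 30)
print(lifecycle["Rules"][0]["Transitions"])
```

With boto3, a dictionary of this shape would be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call, together with your own bucket name.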
When it comes to restoring data onto EC2 instances, the process can be a combination of the following (See Figure 4).
- Provisioning the instances from an AMI
- Restoring data from S3
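One way to combine the two steps is to launch the instance from the AMI and let it pull its backup from S3 on first boot. The sketch below builds the launch parameters only. It assumes the AMI ships with the AWS CLI and that the instance is given permission to read the bucket (not shown here); the AMI ID, bucket, key and path are placeholders.

```python
import shlex


def restore_launch_params(ami_id: str, instance_type: str,
                          bucket: str, key: str, restore_path: str) -> dict:
    """Build EC2 launch parameters that restore a backup from S3 at boot."""
    # User data runs once on first boot; it copies the backup object locally.
    user_data = "\n".join([
        "#!/bin/bash",
        f"aws s3 cp {shlex.quote(f's3://{bucket}/{key}')} {shlex.quote(restore_path)}",
    ])
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,
    }


# Placeholder identifiers, for illustration only.
params = restore_launch_params(
    "ami-0123456789abcdef0", "m5.large",
    "my-backup-bucket", "db/backup.sql", "/var/restore/backup.sql",
)
print(params["UserData"])
```

With boto3, these parameters could be passed to the EC2 client as `ec2.run_instances(**params)`. What happens to the file after it is copied (importing a database dump, for example) depends on your application.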
2.0 Pilot Light
The secondary environment is running only the most critical core infrastructure. When the time comes for recovery, you can rapidly provision a full scale production environment around the critical core.
The pilot light method gives you a quicker recovery time than the backup-restore method because the core pieces of the system are already running and are continually kept up to date.
In Figure 5, the database is up and running but the other two components (Reverse Proxy and the Application Server) are not.
To recover the inactive components and to scale up the running ones, you can follow these steps:
- Start your EC2 based instances from any customized AMIs
- Scale up Database instances if required
- Add any fail-over features to both inactive and active components (Multi-AZ, etc)
- Point the Route 53 DNS to the secondary site
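The last step, pointing DNS at the secondary site, could look roughly like this with boto3. This is a sketch under assumptions: the `route53` argument is expected to be a boto3 Route 53 client, the record is a simple A record, and the zone ID, record name and IP address used in the example are placeholders.

```python
def dns_repoint_change_batch(record_name: str, secondary_ip: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that points an A record at the secondary site."""
    return {
        "Comment": "Fail over to the secondary (DR) site",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise overwrites it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": secondary_ip}],
                },
            }
        ],
    }


def point_dns_to_secondary(route53, hosted_zone_id: str,
                           record_name: str, secondary_ip: str):
    """Apply the change batch using a boto3 Route 53 client passed in by the caller."""
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=dns_repoint_change_batch(record_name, secondary_ip),
    )


# Placeholder record name and IP, for illustration only.
print(dns_repoint_change_batch("app.example.com.", "198.51.100.10"))
```

A short TTL on the record helps here, since clients keep using the old address until their cached answer expires.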
3.0 Warm Standby
The secondary (backup) environment runs the same infrastructure as the primary one, but with smaller-sized components to reduce costs (See Figure 7). For example, if the primary infrastructure has an extra large EC2 instance, the secondary site would run a medium size EC2 instance.
When a disaster occurs, the smaller version(s) can be scaled up quickly to give an infrastructure similar to the primary one, in less time than the Pilot Light method (See Figure 8).
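Scaling up a single standby instance to the primary's size could be sketched as below. It assumes the `ec2` argument is a boto3 EC2 client and that the instance is EBS-backed, since the instance type can only be changed while the instance is stopped. The instance ID and target type are placeholders.

```python
def scale_up_instance(ec2, instance_id: str, new_type: str) -> None:
    """Resize a standby EC2 instance: stop it, change its type, start it again."""
    ec2.stop_instances(InstanceIds=[instance_id])
    # Block until the instance has fully stopped before changing its type.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    ec2.start_instances(InstanceIds=[instance_id])
    # Block until the resized instance is running again.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

A call might look like `scale_up_instance(boto3.client("ec2"), "i-0123456789abcdef0", "m5.xlarge")`. Note that this approach takes the standby instance offline for a short while, so adding more instances behind a load balancer is another way to scale that avoids the restart.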
4.0 Multi Site (Active-Active)
The secondary (backup) infrastructure is a copy (structure, size and services running) of the primary site.
This gives you the best performance, high availability, and the best recovery time compared to the other DR scenarios explained. However, the cost is about double that of the primary infrastructure alone.
In an AWS multi-region setup, the Active-Active state provides not only fail-over but load balancing as well. We can use Route 53 to balance the load with the Weighted Routing Policy (See Figure 9).
When a disaster strikes, Route 53 routes the traffic entirely to the secondary site. There is no need for any infrastructure scaling, since both primary and secondary maintained a production-level setup even before the disaster struck (See Figure 10).
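To make the weighted routing idea concrete, here is a sketch that builds weighted Route 53 records for two sites and works out the share of traffic each one receives. The site names, IP addresses and weights are placeholders, and shifting the weights by hand as shown is just one way to move traffic.

```python
def weighted_record(record_name: str, site_id: str, ip: str,
                    weight: int, ttl: int = 60) -> dict:
    """Build one weighted A record change for a Route 53 change batch."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": site_id,  # distinguishes records sharing a name
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        },
    }


def active_active_change_batch(record_name: str, sites: dict) -> dict:
    """`sites` maps a site name to an (ip, weight) pair."""
    return {
        "Changes": [
            weighted_record(record_name, site_id, ip, weight)
            for site_id, (ip, weight) in sites.items()
        ]
    }


def share_of_traffic(sites: dict) -> dict:
    """Fraction of requests each site receives under weighted routing."""
    total = sum(weight for _, weight in sites.values())
    if total == 0:
        # When every weight is 0, Route 53 treats all records equally.
        return {site_id: 1 / len(sites) for site_id in sites}
    return {site_id: weight / total for site_id, (_, weight) in sites.items()}


normal = {"primary": ("192.0.2.10", 50), "secondary": ("198.51.100.10", 50)}
disaster = {"primary": ("192.0.2.10", 0), "secondary": ("198.51.100.10", 50)}
print(share_of_traffic(normal))
print(share_of_traffic(disaster))
```

The dictionary returned by `active_active_change_batch` has the shape expected by the `ChangeBatch` argument of the boto3 Route 53 client's `change_resource_record_sets` call.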
The Comparison
Backup and Restore: Low cost but slow to recover (high RTO)
Pilot Light: Fairly cheap, and the recovery is faster than with “Backup and Restore”
Warm Standby: Costly, but the recovery is faster than with “Pilot Light”
Multi Site: Very costly (doubles the cost), but the recovery is faster than all the other DR scenarios (almost zero recovery time / RTO)
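The trade-off in this comparison can be turned into a small selection helper: pick the cheapest scenario whose recovery time still fits your RTO. In the sketch below, only the cost ordering comes from the comparison above. The RTO figures are made-up placeholders and should be replaced with numbers measured in your own DR drills.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Strategy:
    name: str
    relative_cost: int       # 1 = cheapest, 4 = most expensive
    typical_rto: timedelta   # placeholder value, NOT an AWS figure


STRATEGIES = [
    Strategy("Backup and Restore", 1, timedelta(hours=24)),
    Strategy("Pilot Light", 2, timedelta(hours=4)),
    Strategy("Warm Standby", 3, timedelta(minutes=30)),
    Strategy("Multi Site", 4, timedelta(minutes=1)),
]


def cheapest_strategy(required_rto: timedelta,
                      strategies: Sequence[Strategy] = STRATEGIES) -> Optional[Strategy]:
    """Return the lowest-cost strategy that meets the RTO, or None if none does."""
    candidates = [s for s in strategies if s.typical_rto <= required_rto]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)


choice = cheapest_strategy(timedelta(hours=6))
print(choice.name if choice else "No strategy meets this RTO")
```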
Conclusion
Choosing a DR scenario from the ones explained above depends purely on the criticality of the system you are considering and the cost you can afford. As explained, the Multi Site approach gives you the best RTO, but at a high cost. If cost is a major factor, you can choose any of the other three options listed.
Resources
- [1] Using AWS for Disaster Recovery (Whitepaper), AWS, October 2014