Zero to Sitecore in 60 Minutes: How to Automate Disaster Recovery
At Velir we have extensive technical expertise, but we also enjoy learning new things. Recently, a client asked us to help them with a complex challenge that taught us a lot: automated disaster recovery. What’s disaster recovery? In short, it’s recreating all your web servers after a complete loss from a disaster. There are many different types of disasters and there are several ways to recover from them. However, you must decide which approach makes the most sense for your business. So, we’ve put together what we’ve learned about disaster recovery and we’ve recommended ways to automate the process.
What’s Considered a Disaster?
When you think about disasters, you’re probably picturing storms, floods, or other natural events. But those weather-related catastrophes are only a few ways disasters can happen. Before you think about how to recover from disasters, you should consider the impact they could have on your business.
Does Your Business Need a Disaster Recovery Plan?
For all the time and money we invest in developing web applications, we don’t talk enough about what to do with them if a disaster strikes. That’s why it’s important to ask whether a disaster recovery plan makes financial sense for your business. For example, if you earn significant income from your website, then having a disaster recovery plan is critical.
What Types of Disasters Can Strike?
Disasters range from events you can prevent to ones that are completely out of your control. Here’s a list of unexpected catastrophes that could damage your business if you don’t have a disaster recovery plan. We’ve also included how often each type occurs, as a percentage, based on national surveys referenced by Uptime Institute.
- Power Failure (37%)
- Software, IT systems (22%)
- Network (17%)
- Cooling (13%)
- Fire Suppression (4%)
- Third-party service provider – SaaS or hosting (3%)
- Info Security-related (2%)
- Third-party cloud provider (2%)
- Not Known (1%)
As the list shows, disasters aren’t just caused by the weather. You could face equipment failure, accidental deletion of data (i.e., human error), or cyber-attacks that affect your servers and, in turn, your website.
What Do You Need to Recover from a Disaster?
When you’re considering the costs of disaster recovery, you may not realize some of the hidden fees involved. The costs to recover your website could include:
- Detection
- Containment
- Recovery
- Equipment
- Productivity
- Third Parties
- Revenue
- Opportunities
It may cost time and money just to detect and contain an issue if you’re dealing with malware or a cyber-attack. You might also need to pay for new equipment and rebuild your systems. This is much more challenging if you lack technical expertise and need to enlist third parties for help. All this downtime could lead to decreased revenue and missed business opportunities, which impact your bottom line.
How Much Does It Cost to Recover from a Disaster?
To help you understand the total cost of recovering from a disaster, here are some average national costs obtained from Uptime Institute.
A disaster recovery plan can significantly reduce these costs by streamlining the recovery process and providing contingencies for each type of disaster you may need to deal with.
How Do You Automate Disaster Recovery?
Now that you understand the types of disasters that can happen and how much they can cost your business, we’ll explain how you can automate disaster recovery. Remember: there’s no one-size-fits-all solution, so you’ll need to adapt your plan to your specific business.
Automation can recover your website and get you back online quickly. We work with Azure more than any other cloud provider, so our recommendations focus on using Azure to automate disaster recovery. These ideas generally carry over to other providers with similar tools.
Setting Up Failover Regions
The first solution that we typically suggest is using failover regions. The expectation is that if there is a problem with a server or set of servers in one area of the world, there likely won’t be an issue elsewhere. For example, if you were to host servers in Azure East US, we would recommend you also host a failover in France Central or Azure East Asia. This lowers risk by giving you time to recover while maintaining your web presence if a major portion of your servers is compromised.
This type of solution is almost entirely configurable from within Azure. We won’t go into the details, but we can always help you determine if this is a cost-effective solution for you.
Automating Server Recovery
If having a failover doesn’t make sense for you, there’s also the option of automated server recovery. This is where you reconstruct your servers, reinstall software, set configurations, and reapply your network settings through a fully or mostly scripted process. This doesn’t seem like a simple solution, and it isn’t. It is robust, though, so if your business critically requires your website to be up, it might be a good choice for you.
Leveraging Azure Resource Manager Templates
The first step in designing a recovery system is to recreate a baseline system. With Azure, this process is handled using Azure Resource Manager (ARM) templates. For the project we mentioned in our introduction, we were installing Sitecore and using Sitecore’s ARM template. The template scaffolds all the resources you need in Azure for the specific version of Sitecore you’re installing, but you need to modify it to include your company’s Sitecore license, recent database backups, slot settings, and any modules you may need installed.
Each environment you’ll need to recreate may have a different architecture and need its own ARM template. We found that modifying ARM templates to account for every change was time-consuming. You may find it easier to install a stock system and modify it using PowerShell scripts.
Determining Your Build Process
Having your architecture scripted is only part of the disaster recovery process. You should also consider the code and configuration files that need to be deployed to these newly created systems. You may already have a delivery system that works for you, but even if you do, think about it in the context of variable storage and access. Depending on where you store your environment-specific settings, you may want to insert environment variables at build time or swap variables into the files afterward. The first option may be simpler but requires a build for each environment. The second option requires more scripting, but it allows you to build once and deploy multiple times. This decision will likely be determined by your business and where and how your variables are stored.
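To make the build-once option more concrete, here is a minimal Python sketch of swapping variables into an already-built config file at deploy time. It is an illustration only: the `#{Name}` placeholder syntax, the environment names, and every value in `ENVIRONMENTS` are assumptions made for the example, not settings from the project described here.

```python
import re

# Hypothetical per-environment settings; real values would come from
# wherever your business stores its environment-specific variables.
ENVIRONMENTS = {
    "staging": {"SiteUrl": "https://staging.example.com", "DbServer": "staging-sql"},
    "production": {"SiteUrl": "https://www.example.com", "DbServer": "prod-sql"},
}

# Matches placeholders such as #{SiteUrl} (assumed syntax for this sketch).
TOKEN = re.compile(r"#\{(\w+)\}")


def render(template: str, variables: dict) -> str:
    """Replace every #{Name} token, failing loudly on unknown names."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"No value defined for token {name}")
        return variables[name]

    return TOKEN.sub(substitute, template)


# One built artifact (the template), rendered once per environment at deploy time.
BUILT_CONFIG = '<add name="site" url="#{SiteUrl}" server="#{DbServer}" />'

if __name__ == "__main__":
    for env_name, variables in ENVIRONMENTS.items():
        print(env_name, "->", render(BUILT_CONFIG, variables))
```

Failing on an unknown token, rather than leaving it in place, is a deliberate choice in this sketch: a missing variable is caught during deployment instead of surfacing later as a broken connection string.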
Our client determined that building once with Azure DevOps and deploying to each environment with Octopus Deploy worked better for them. This gave them a lot of flexibility with the scripting and allowed them to keep all their important data within their corporate network.
Figuring Out Your Deployment Process
The hardest and most crucial part of our disaster recovery system was the deployment. Having Octopus as our deployment tool allowed us to create a single deployment script for all environments. Octopus has its own internal variable cache and supports conditional behavior at each step, which meant that depending on the environment, we could skip or add steps as needed.
There are no steps that fit all systems, but we can give a rough outline of what we did to help you understand what the deployment process looks like. Here’s what to expect:
- Deploy ARM templates (first-time-conditional)
- Apply network permissions, domains, and certificates (first-time-conditional)
- Add/modify connection strings (first-time-conditional)
- Update Sitecore default password (first-time-conditional)
- Copy custom config files to specific machines (always)
  - These were custom settings for xConnect and Identity Server
- Create separate CD server and deploy files to it (environment-conditional)
- Create preview copy of web database (first-time- and environment-conditional)
- Create blue copy of web database (first-time- and environment-conditional)
- Insert variables and deploy code (always)
- Sync serialized files (always)
- Deploy blue/green config files to specific machines (environment-conditional)
- Deploy to inactive slot
- Publish to inactive web database
- Send email with test URL and wait for approval
- Swap Sitecore publishing target
- Swap active slot
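The conditions in parentheses above (always, first-time-conditional, environment-conditional) can be modeled as small predicates attached to each step. The Python sketch below shows the idea with a handful of the steps. It is not Octopus configuration, and the environment names and the choice of which environments get a separate CD server or a blue database are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Context:
    environment: str  # e.g. "production" or "qa" (hypothetical names)
    first_time: bool  # True when the environment is being created from scratch


@dataclass(frozen=True)
class Step:
    name: str
    should_run: Callable[[Context], bool]


def always(ctx: Context) -> bool:
    return True


def first_time(ctx: Context) -> bool:
    return ctx.first_time


def in_environments(*names: str) -> Callable[[Context], bool]:
    return lambda ctx: ctx.environment in names


def all_of(*conditions: Callable[[Context], bool]) -> Callable[[Context], bool]:
    return lambda ctx: all(condition(ctx) for condition in conditions)


# A subset of the outline; the environment lists are assumptions.
STEPS = [
    Step("Deploy ARM templates", first_time),
    Step("Update Sitecore default password", first_time),
    Step("Copy custom config files", always),
    Step("Create separate CD server", in_environments("production")),
    Step("Create blue copy of web database",
         all_of(first_time, in_environments("production"))),
    Step("Insert variables and deploy code", always),
]


def plan(ctx: Context) -> list:
    """Return the names of the steps that would run for this context."""
    return [step.name for step in STEPS if step.should_run(ctx)]


if __name__ == "__main__":
    print(plan(Context(environment="production", first_time=True)))
    print(plan(Context(environment="qa", first_time=False)))
```

A routine redeploy to an existing environment then runs only the "always" steps, while a full rebuild after a disaster runs everything that applies to that environment.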
Using Octopus was beneficial for performance because we could control the order of steps and could run them in parallel. After we built all the steps, we created a table of dependencies and determined how to significantly compress their processing time.
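As a rough illustration of how a dependency table translates into parallel work, the Python sketch below groups steps into batches that can run at the same time and compares the total time with running everything in sequence. The step names, durations, and dependencies are invented for the example and are not measurements from our project. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9 and later).

```python
from graphlib import TopologicalSorter

# step -> (duration in minutes, steps it depends on). All values are hypothetical.
STEPS = {
    "deploy_arm": (25, set()),
    "network_and_certs": (5, {"deploy_arm"}),
    "connection_strings": (2, {"deploy_arm"}),
    "copy_configs": (3, {"deploy_arm"}),
    "deploy_code": (8, {"connection_strings", "copy_configs"}),
    "sync_serialized": (6, {"deploy_code"}),
}


def parallel_batches(steps):
    """Group steps into batches whose dependencies are all already finished."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in steps.items()})
    sorter.prepare()  # raises graphlib.CycleError if the table has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def sequential_minutes(steps):
    """Total time if every step runs one after another."""
    return sum(duration for duration, _ in steps.values())


def batched_minutes(steps):
    """Total time if each batch runs in parallel and takes as long as its slowest step."""
    return sum(
        max(steps[name][0] for name in batch) for batch in parallel_batches(steps)
    )


if __name__ == "__main__":
    for number, batch in enumerate(parallel_batches(STEPS), start=1):
        print(f"Batch {number}: {', '.join(batch)}")
    print("Sequential:", sequential_minutes(STEPS), "minutes")
    print("Batched:", batched_minutes(STEPS), "minutes")
```

Batching level by level is the simplest schedule to reason about. A finer-grained scheduler that starts each step the moment its own dependencies finish could do slightly better, so treat the batched figure as a conservative estimate.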
Using a Blue/Green Deployment System
You’ll notice that in the last few steps we used a blue/green deployment system. This was easier with Octopus due to its ability to make calls to the active system and pause a deployment while a person verifies the changes. It added another layer to the overall project, but it helped the deployment team so much that it was worth every hour we invested.
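The core of a blue/green release is small enough to sketch in a few lines of Python: deploy to whichever slot is not serving traffic, wait for a person to approve, and only then swap. The `deploy` and `approve` callables below are hypothetical stand-ins for the real deployment and approval steps; this is a sketch of the pattern, not of how Octopus implements it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    active: str = "blue"  # slot currently serving traffic

    @property
    def inactive(self) -> str:
        return "green" if self.active == "blue" else "blue"


def release(
    env: Environment,
    deploy: Callable[[str], None],
    approve: Callable[[str], bool],
) -> bool:
    """Deploy to the inactive slot and swap only if the change is approved."""
    target = env.inactive
    deploy(target)           # the live slot is untouched while this runs
    if not approve(target):  # e.g. someone checks a test URL and rejects it
        return False
    env.active = target      # the swap: the verified slot becomes live
    return True


if __name__ == "__main__":
    env = Environment()
    swapped = release(
        env,
        deploy=lambda slot: print(f"Deploying to {slot}"),
        approve=lambda slot: True,  # stand-in for a manual approval
    )
    print("Swapped:", swapped, "| active slot:", env.active)
```

Because a rejected release never touches the active slot, a failed verification costs nothing more than another deployment to the inactive slot.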
Customizing Based on Your System
- Authentication (possibly more than one)
- Authoring Features (preview database)
- Modules (most clients have a few and some many)
- Search (most clients have one, but some have two)
- Analytics (xConnect can be a challenge to customize)
- Networking (at a baseline there are domains and certs, but complexity can grow with CDNs and other custom networking features)
Delivering on Our Disaster Recovery Plan
Using the system we devised, we were able to fully recreate any given environment for our client in about an hour. The build was already available since the system we were trying to recover existed, so that didn’t factor into the timing. The ARM deployment for a single environment was the biggest variable and could take anywhere from 25 minutes to an hour. With our compressed steps, we were able to get each environment deployment down to around 25 minutes.
In the end we provided our client with peace of mind and improved costs and performance for their business. We also gave them several new features that weren’t available to them before. If you’re thinking about disaster recovery or other complex deployment systems, reach out to us at Velir. We’re happy to help you improve your system or offer advice on making the right disaster recovery decisions for your business.