Ansible and how we got a DR

The beginnings

Remember our previous Ansible article series? “How Smart Tools Are Changing Us”, “Deploying Jenkins with Ansible” and “Maintaining Jenkins with Ansible”?

In short, the past articles presented our take on how infrastructure and services can be provisioned, deployed and maintained with Ansible and various other existing bits of infrastructure.

As it so happens, one of the advantages of using a configuration management tool, being able to accurately replicate environments and deployments, was seriously put to the test in our latest project: a migration to a secondary disaster recovery site, while our primary site was being physically moved to a new location.

By staying true to the principles of a configuration management tool, we were able to replicate most of our services in the disaster recovery site, with minimal downtime and no data loss.

The concept diagram I proposed in “Deploying Jenkins with Ansible” looked like this:

[Diagram: Deploying Jenkins with Ansible]

The preparations

As explained in the previous articles at length, our goal was to get as close as possible to real-time data replication between the two sites, so that, in case of need, we can easily restore a service in the secondary disaster recovery site (DR site for short).

The infrastructure layout was kept as in the original diagram, with some minor adjustments, mostly around prerequisites in the DR site (such as available and in-sync AD domain controllers, email services, publishing services, networking, etc.).

The goal was to move the services that the Business Impact Analysis document considered high and moderate impact to the DR site. Everything else we could move was an added bonus.

Some technical bits

The primary and the DR sites are linked with a very stable and high-bandwidth VPN that allowed us to rely on replication services such as Windows DFS (Distributed File System). In short, DFS provided us with near real-time replication of files from primary to DR.

Once that was in place and tested, we had to answer two scenarios:

  • What to restore if the primary site becomes unavailable on a predetermined schedule (a planned outage).
  • What to restore if the primary site becomes unavailable due to an extended unplanned outage.

Since this move was planned months in advance, we decided to focus only on a “delayed main site failure” scenario, where we can control the timing of the last backup, the moment a service becomes unavailable, the moment files finish replicating to the DR site, etc.

Using the existing Ansible playbooks, we started building our scenarios and updated said playbooks to answer our needs.

We soon learned that we needed to take advantage of Ansible’s ability to handle multiple inventory files. This was great for us, as we were able to set inventory-specific variables for each inventory (primary site inventory, DR site inventory, staging inventory, etc.).

Here’s a sample of one of our actual DR site inventory files:

jira.dr.local ansible_host=jira.dr.local ansible_user=root ansible_port=22




This made it a breeze to run the same playbook (which basically installs Jira and its dependencies from scratch, then configures it and restores the database and attachments) against the desired inventory with no code changes, and kept extra environment-specific configuration to a minimum.
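To make the idea of inventory-specific variables a bit more concrete, here is a minimal sketch of what such a layout could look like. It is not our real setup: the `jira` group, the `jira.prod.local` host, the `jira_base_url` variable and the directory names are all made up for illustration. The script only scaffolds two inventories in a scratch directory, each with its own `group_vars`, so the same variable gets a different value per site.

```shell
#!/bin/sh
# Scaffold two hypothetical inventories, each with its own group_vars.
# Every name below (hosts, group, variable) is an illustrative assumption.
scaffold_inventory() {
    # $1 = base dir, $2 = inventory name, $3 = host, $4 = base URL
    inv_dir="$1/$2"
    mkdir -p "$inv_dir/group_vars"
    # Inventory file: a single group containing a single host.
    printf '[jira]\n%s ansible_user=root ansible_port=22\n' "$3" > "$inv_dir/hosts"
    # Variables that only apply when this inventory is selected with -i.
    printf 'jira_base_url: "%s"\n' "$4" > "$inv_dir/group_vars/jira.yml"
}

demo_dir=$(mktemp -d)
scaffold_inventory "$demo_dir" prod    jira.prod.local https://jira.prod.local
scaffold_inventory "$demo_dir" prod-dr jira.dr.local   https://jira.dr.local
```

With a layout like this, pointing `ansible-playbook -i` at the `prod-dr` directory would pick up the DR values, and pointing it at `prod` would pick up the primary ones, while the playbook itself stays untouched.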

I mentioned backups and restores earlier. How do they actually fit in here?

For most of our services, having the database backup and the files available in the DR site is enough.

We already have scheduled tasks that create regular database backups and store them in a DFS-replicated network location. Now all we need to do is restore the database backup in the new location.
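For readers who don't have such a backup job yet, a rough sketch of one could look like the following. This is not our actual scheduled task: the mount point `/mnt/dfs/backups/jira` and the file naming scheme are assumptions, and it presumes the replicated share is reachable from the database host. It produces a plain SQL dump, which is what a `psql < file` style restore expects.

```shell
#!/bin/sh
# Hypothetical backup job; BACKUP_DIR and the naming scheme are assumptions.
BACKUP_DIR="${BACKUP_DIR:-/mnt/dfs/backups/jira}"

# Build a timestamped dump path such as <dir>/jira-20170401-0230.sql
dump_path() {
    # $1 = database name, $2 = timestamp (defaults to the current time)
    printf '%s/%s-%s.sql\n' "$BACKUP_DIR" "$1" "${2:-$(date +%Y%m%d-%H%M)}"
}

backup_db() {
    # $1 = database name
    target=$(dump_path "$1")
    # Dump under a temporary name first, then rename, so a half-written
    # file never carries the name of a finished dump.
    pg_dump --file "$target.partial" "$1" && mv "$target.partial" "$target"
}
```

Something like `backup_db jira` from cron (run as a user allowed to read the database) would then drop a fresh dump into the replicated location on every run.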

Since I'm a fan of never touching code again (except for bugs or new features), the Ansible role responsible for installing the database server can also take care of the database import, if instructed.

- name: ensure the jira DB user is present
  postgresql_user:
    name: '{{ jira_db_username }}'
    password: '{{ jira_db_password }}'
    state: present
    role_attr_flags: 'CREATEDB,NOSUPERUSER'
  become_user: postgres
  tags: postgres

# Only runs when a dump file was passed in, so a plain install never drops data.
- name: ensure database is not present if we do an import
  command: dropdb --if-exists {{ jira_db_name }}
  become_user: postgres
  when: dump_file is defined
  tags: postgres

- name: ensure the jira database is present
  postgresql_db:
    db: '{{ jira_db_name }}'
    owner: '{{ jira_db_username }}'
    encoding: 'UTF-8'
    lc_collate: 'en_US.UTF-8'
    lc_ctype: 'en_US.UTF-8'
    template: 'template0'
  become_user: postgres
  tags: postgres

# The shell module is needed here because of the input redirection.
- name: import dump
  shell: 'psql {{ jira_db_name }} < {{ dump_file }}'
  become_user: postgres
  when: dump_file is defined
  tags: postgres

The snippet runs four tasks:

  • Creates the desired database user
  • Deletes the desired database, if a database import is flagged
  • Creates the database again, so we have a clean DB
  • Imports the database dump file, if a database import is flagged

Flagging a database import is done via a variable named "dump_file", whose value is the path to the database dump file to restore.

This means that we can run our playbook in two ways:

  • Simple service install and setup, where a service is installed with no old data imported. This is mostly used for new environments or new services

$ ansible-playbook -i prod-dr clean-install.yml

  • Service install, setup and data restoration

$ ansible-playbook -i prod-dr clean-install.yml --extra-vars "dump_file=/tmp/jira22.dump"

So how does everything work in this DR thingy?

Well, the short answer is: easy!

At this point we have:

  • Infrastructure in place
  • Playbooks and various other scripts in place
  • DNS, AD, email and other dependencies in place
  • File replication between the two sites

Now all that was left was to plan and execute the service move.

The plan was pretty simple and to the point:

  • Stop services in the primary site
  • Get latest database backup and files from primary site replicated to the DR site (virtually no clicks needed, just a couple of minutes of patience)
  • Run playbooks against the DR site inventory, specifying location for database dump file
  • Start the service
  • Profit!
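The plan above can be sketched as a small, dry-run friendly script. Treat it as an illustration rather than the script we actually ran: the `prod` inventory name, the `jira` host group and the `jira` service name are assumptions, and only `prod-dr`, `clean-install.yml` and `dump_file` come from the commands shown earlier. By default it only prints what it would do.

```shell
#!/bin/sh
# Dry-run friendly sketch of the move. The "prod" inventory, the "jira"
# group and the "jira" service name are assumptions made for this example.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Always print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

move_service() {
    # $1 = path of the replicated dump file, as seen from the DR host
    dump_file="$1"
    # 1. Stop the service in the primary site.
    run ansible jira -i prod --become -m service -a "name=jira state=stopped" || return 1
    # 2. Waiting for replication to finish remains a manual check here.
    # 3. Rebuild the service in the DR site and import the dump.
    run ansible-playbook -i prod-dr clean-install.yml --extra-vars "dump_file=$dump_file" || return 1
    # 4. Start the service in the DR site.
    run ansible jira -i prod-dr --become -m service -a "name=jira state=started" || return 1
}
```

Calling `move_service /tmp/jira22.dump` prints the three commands in order; running it with `DRY_RUN=0` would execute them and stop at the first failure.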

There were some other “hidden” steps, specific to the environment, that still required human intervention, like changing external DNS records, changing the Active Directory connectors in the service, etc.

We decided it wasn't worth spending time developing modules for those additional steps, and chose instead to thoroughly document everything left to be done manually. Covering them would also be a bit outside the scope of this article, as it would focus more on the actual scripts than on the current subject of using Ansible to create a DR.

The bonus

The playbooks, written so that we can recreate a service from scratch, import old data and start it within minutes (depending on actual data size), hold another hidden advantage.

We can now use them, with no code changes, only variable changes, to maintain and upgrade our services with minimal impact or downtime.

Simply changing variables from:

jira_version: '7.3.2'
postgres_major_version: '9'
postgres_minor_version: '3'
postgres_patch_version: '3'
jira_plugins:
  - 'automation-module-2.0.3.jar'
  - 'base-hipchat-integration-plugin-7.8.29.jar'

to something like this:

jira_version: '7.3.3'
postgres_major_version: '9'
postgres_minor_version: '3'
postgres_patch_version: '3'
jira_plugins:
  - 'automation-module-2.1.3.jar'
  - 'base-hipchat-integration-plugin-7.8.31.jar'

and running the playbook will upgrade the specified environment, based on the inventory file used. This also means we will have consistent versions, down to the tiniest detail, across all environments.
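If you want to double-check that consistency claim before an upgrade, a small helper along these lines could compare a pinned version across the variable files of several environments. It is only a sketch: it assumes each environment keeps simple top-level `key: 'value'` lines in a vars file, and the function names are made up for this example.

```shell
#!/bin/sh
# Print the value of a top-level "key: 'value'" line from a vars file.
get_var() {
    # $1 = variable name, $2 = vars file
    sed -n "s/^$1:[[:space:]]*['\"]\{0,1\}\([^'\"]*\)['\"]\{0,1\}[[:space:]]*\$/\1/p" "$2"
}

# Succeed only if all given vars files pin the same value for the variable.
# Files that do not define the variable at all are ignored.
same_everywhere() {
    # $1 = variable name, remaining arguments = vars files
    name="$1"; shift
    distinct=$(for f in "$@"; do get_var "$name" "$f"; done | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}
```

For example, `same_everywhere jira_version prod/group_vars/jira.yml prod-dr/group_vars/jira.yml` (hypothetical paths) would fail as soon as one site drifts from the other.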


I hope you enjoyed my little trip into the world of DR, Ansible, DFS and other mythical creatures.

Please share your thoughts, comments, rants and flame wars in the comments section below.
