Maintaining Jenkins with Ansible

In my previous article I presented the benefits of using Ansible to deploy Jenkins and its components, while getting accustomed to what Ansible roles look like and how to run them.

But what good is a service if it only serves you now and in two months’ time becomes outdated and eventually obsolete? Probably not much.

Take, for example, a CI environment like ours, which needs to support a wide range of configurations and software versions. It takes a lot of time for the initial setup, and then for maintenance. Some facts:

  • A simple Jenkins master update can easily break all stored jobs if one of the plugins is no longer compatible.
  • Jenkins gets a new LTS release every 12 weeks.
  • Xcode gets a new release every couple of months.
  • Plugins get new releases on their own schedule, sometimes several at a time.

Now imagine a standard setup where you have to manually upgrade all components in a staging environment, test that things still work, and then – assuming you have a Disaster Recovery setup – manually upgrade and validate each cluster in the DR setup.

If that sounds like a bad dream, now imagine that only one of the clusters has an issue and you need to roll back everything you did manually, in all clusters and environments. Chances are you’ll end up with a completely new setup, upgraded, granted, but upgraded manually, with all the benefits of configuration management stripped away.

This is why using a configuration management tool proves to be the best approach.

So let’s break it down a bit and see how differently a worst-case upgrade scenario plays out using our configuration management tool of choice, Ansible.

A successful staging upgrade

One of the best use cases for Ansible is controlling multiple environments. For that, Ansible uses inventory files, which list the hosts that are part of a specific environment. We’ll take our own Jenkins CI environment as the basis for this hypothetical example; here we have three inventory files: staging (or pre-production), production-1 and production-2 (if you recall our diagram, we have two data centres that provide DR – one in Brasov and one in Cluj). The pre-production inventory looks like this:

CJ-PP-Jenkins-V001 ansible_ssh_host=10.10.11.90   ansible_ssh_user=root
CJ-PP-CICE7-V001   ansible_ssh_host=10.10.11.94   ansible_ssh_user=root
CJ-PP-CICE7-V002   ansible_ssh_host=10.10.11.95   ansible_ssh_user=root
CJ-PP-CIWK12-V001  ansible_ssh_host=10.10.11.93
CJ-PP-CIWK12-V3    ansible_ssh_host=10.10.11.96
CJ-PP-MMAC-F001    ansible_ssh_host=10.10.11.91

[master]
CJ-PP-Jenkins-V001

[slaves:children]
linux-slaves
osx-slaves
windows-slaves

[linux-slaves]
CJ-PP-CICE7-V001
CJ-PP-CICE7-V002

[osx-slaves]
CJ-PP-MMAC-F001

[windows-slaves]
CJ-PP-CIWK12-V001
CJ-PP-CIWK12-V3

[all:vars]
jenkins_mount="//windows_share_pp.local"
jenkins_mount_smb_server="\\windows_share_pp.local"
ntp_server_pp='10.10.0.2'
vcenter_hostname="vcentercj.local"

The inventory file contains several crucial pieces of information:

  • The hosts from the pre-production environment
  • The user Ansible should use for the SSH connection
  • Some grouping that we can use later in the roles
  • Variables that will be available in the role’s tasks when executing the playbook for this environment only.
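
To illustrate how those inventory variables end up being used, here is a minimal, hypothetical task that consumes jenkins_mount inside a role (the task name, file path, mount point and mount options are made up for this example, not taken from our repository):

# roles/install_master/tasks/mount.yml – hypothetical illustration
- name: Mount the per-environment Windows share
  mount:
    path: /mnt/jenkins_share                   # hypothetical mount point
    src: "{{ jenkins_mount }}"                 # comes from [all:vars] in the inventory
    fstype: cifs
    opts: "credentials=/root/.smbcredentials"  # hypothetical credentials file
    state: mounted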

Getting ready for the actual upgrade is a simple job that only requires updating the role files to contain the desired version. For instance, to upgrade Jenkins to 2.7.2, in our install_master role we would just have to change the version number in the vars/main.yml file to match the desired one.

jenkins_data: /data
jenkins_repository_url: 'http://pkg.jenkins-ci.org/redhat-stable/jenkins.repo' # Jenkins LTS (Long-Term Support release) repository
jenkins_repository_key: 'http://pkg.jenkins-ci.org/redhat-stable/jenkins-ci.org.key' # Jenkins LTS (Long-Term Support release) key
jenkins_version: '2.7.2'
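
For context, here is a sketch of how such a variable might be consumed by the install task (the actual task in our role may look slightly different):

# Hypothetical install task using jenkins_version from vars/main.yml
- name: Install the requested Jenkins version from the LTS repository
  yum:
    name: "jenkins-{{ jenkins_version }}"
    state: present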

We then of course push our changes to git, on the branch we created for this upgrade.

The actual upgrade is performed in 5 steps:

  1. Stop Jenkins:
ansible-playbook -i pre-production control-services.yml --tags stop-jenkins
  2. Make a backup of all Jenkins configurations:
ansible -i pre-production master -a "/root/scripts/backup_jenkins_home.sh"
  • backup_jenkins_home.sh:
#!/bin/bash
SCRIPTS_DIR=/root/scripts
hour=`date +"%H"`

# Create the backup
cd /tmp_backup/
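# tar exits with code 1 if files change while being read (Jenkins writes constantly);
# the "|| [[ $? -eq 1 ]]" below treats that specific case as a success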
tar czf jenkins-home-$hour.gz --warning=no-file-changed --exclude='cache' --exclude='caches' --exclude='plugins' --exclude='shelvedProjects' --exclude='tools' --exclude="monitoring" --exclude="logs" /var/lib/jenkins || [[ $? -eq 1 ]]

# Run backup rotate
cd $SCRIPTS_DIR
bash rotate_backup.sh
  • rotate_backup.sh:
#!/bin/bash
# Backup daily, with retention for weekly and monthly backups
set -e
storage=/bkp
source=/tmp_backup

date_daily=`date +"%Y-%m-%d"`

month_day=`date +"%d"`
week_day=`date +"%u"`
hour=`date +"%H"`

if [ ! -f $source/jenkins-home-$hour.gz ]; then
    exit 1
fi

# 1st of each month
if [ "$month_day" -eq 1 ] ; then
    destination=$storage/backup.monthly/$date_daily
else
    # On Sundays (date +%u returns 7 for Sunday)
    if [ "$week_day" -eq 7 ] ; then
        destination=$storage/backup.weekly/$date_daily
    else
        # Daily
        destination=$storage/backup.daily/$date_daily
    fi
fi

mkdir -p $destination
cp -R $source/* $destination
rm -rf $source/*

# daily – keep for 14 days
find $storage/backup.daily/ -maxdepth 1 -mtime +14 -type d -exec rm -rv {} \;

# weekly – keep for 60 days
find $storage/backup.weekly/ -maxdepth 1 -mtime +60 -type d -exec rm -rv {} \;

# monthly – keep for 300 days
find $storage/backup.monthly/ -maxdepth 1 -mtime +300 -type d -exec rm -rv {} \;

SIZE=$(du -sb $destination/jenkins-home-$hour.gz | awk '{ print $1 }')
ls -alh $destination
if ((SIZE < 90000)) ; then
    exit 1
else
    exit 0
fi

  3. Run a clean installation of Jenkins, with the newly provided version:
ansible-playbook -i pre-production master.yml
  4. Restore the previous backup:
ansible-playbook -i pre-production restore.yml -e "backup_version=backup.daily/2016-02-10/jenkins-home-11.gz"
  5. And finally, start the upgraded Jenkins master:
ansible-playbook -i pre-production control-services.yml --tags start-jenkins
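
For reference, the control-services.yml playbook used in steps 1 and 5 can be as simple as two tagged service tasks. A minimal sketch, assuming the systemd service is simply called jenkins (the real playbook may also handle the slaves):

# control-services.yml – minimal sketch
- hosts: master
  become: true
  tasks:
    - name: Stop Jenkins
      service:
        name: jenkins
        state: stopped
      tags: stop-jenkins

    - name: Start Jenkins
      service:
        name: jenkins
        state: started
      tags: start-jenkins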

While it might seem like there are a lot of “manual” steps, these can also be automated, bundled into one single YAML playbook and executed in one run, in the desired order.
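
As an illustration, such a wrapper could look like the sketch below; the backup tasks and the wrapper file name are hypothetical, while master.yml and restore.yml are the playbooks used above:

# upgrade-jenkins.yml – hypothetical wrapper running the whole procedure in order
- hosts: master
  become: true
  tasks:
    - name: Stop Jenkins before the upgrade
      service:
        name: jenkins
        state: stopped
    - name: Back up the Jenkins home directory
      command: /root/scripts/backup_jenkins_home.sh

- import_playbook: master.yml      # clean install of the new version
- import_playbook: restore.yml     # expects backup_version passed with -e

- hosts: master
  become: true
  tasks:
    - name: Start the upgraded Jenkins master
      service:
        name: jenkins
        state: started

It could then be run with a single command, e.g. ansible-playbook -i pre-production upgrade-jenkins.yml -e "backup_version=backup.daily/2016-02-10/jenkins-home-11.gz".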

A failed production upgrade – a potential real-life example

Now that the pre-production upgrade went so well, our confidence gets a boost and we move on to production. Needless to say, extended downtime of a production CI setup is less than desirable, as it can put the development process on hold.

Back to our hypothetical upgrade then. We learned from the pre-production upgrade what the procedure is, and now we’d like to run the same on production.

Since this is a DR setup, there’ll be a few extra steps.

  • Fail everything over to a single data centre, Cluj in our example. The control-dns.yml playbook only controls the DNS configuration (a hypothetical sketch of it follows this list):
ansible-playbook -i production control-dns.yml --tags dc_cj
  • Stop Jenkins in the other data centre (Brasov):
ansible-playbook -i production-bv control-services.yml --tags stop-jenkins
  • Make a backup of all Jenkins configurations:
ansible -i production-bv master -a "/root/scripts/backup_jenkins_home.sh"
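
We won’t reproduce our actual DNS automation here, since it depends entirely on the DNS servers in use, but conceptually control-dns.yml just switches a record between the two data centres. A purely hypothetical sketch using the nsupdate module (the variables, zone and record names are made up):

# control-dns.yml – hypothetical sketch
- hosts: localhost
  connection: local
  tasks:
    - name: Point the CI record at the Cluj data centre
      nsupdate:
        server: "{{ dns_server }}"        # assumed inventory variable
        zone: "local"
        record: "jenkins"
        type: "A"
        value: "{{ cluj_jenkins_ip }}"    # assumed variable
      tags: dc_cj

    - name: Point the CI record at the Brasov data centre
      nsupdate:
        server: "{{ dns_server }}"
        zone: "local"
        record: "jenkins"
        type: "A"
        value: "{{ brasov_jenkins_ip }}"  # assumed variable
      tags: dc_bv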

Please note that for the Brasov steps above we only changed the inventory file, to the one that contains the Brasov hosts, while keeping the same roles, tags, tasks, etc.

  • Run a clean installation of Jenkins, with the newly provided version:
ansible-playbook -i production-bv master.yml
  • Restore the previous backup:
ansible-playbook -i production-bv restore.yml -e "backup_version=backup.daily/2016-10-10/jenkins-home-11.gz"
  • Start the upgraded Jenkins master in the Brasov data centre:
ansible-playbook -i production-bv control-services.yml --tags start-jenkins

After a quick check, we conclude that the Brasov side of the DR setup has been successfully upgraded. Now let’s do the same in the primary (Cluj) data centre.

  • Fail everything over to a single data centre again, this time Brasov:
ansible-playbook -i production control-dns.yml --tags dc_bv
  • Stop Jenkins in the other data centre (Cluj):
ansible-playbook -i production-cj control-services.yml --tags stop-jenkins
  • Make a backup of all Jenkins configurations:
ansible -i production-cj master -a "/root/scripts/backup_jenkins_home.sh"

Again, note we’re only changing the inventory file, to apply changes to the Cluj cluster.

  • Run a clean installation of Jenkins, with the newly provided version:
ansible-playbook -i production-cj master.yml
  • Restore the previous backup:
ansible-playbook -i production-cj restore.yml -e "backup_version=backup.daily/2016-10-10/jenkins-home-11.gz"
  • Start the upgraded Jenkins master in the Cluj data centre:
ansible-playbook -i production-cj control-services.yml --tags start-jenkins

And disaster strikes!

Let’s say, for some weird reason, the Jenkins master in Cluj doesn’t start. There are over 1000 Java exceptions, and debugging each one would take hours. We can’t even load the main Jenkins console, let alone start looking for a fix – everything is messed up. We touched a working system, and now it’s no longer working. Should’ve paid more homage to the SysAdmin Gods. 🙂

Now, since we already upgraded one data centre, we could in theory just leave the working one to run jobs, while we manually investigate and fix the Cluj cluster.

There are two problems with this approach though:

  • Network latency: jobs will take longer to run, as everything needs to be pushed to a remote location over a site-to-site VPN, which adds latency. The DR setup should send traffic to the Brasov data centre only when the Cluj one is unavailable.
  • Manual fixing conflicts with what we want to achieve in the first place: configuration management, identical reproducible setups, fully automated.

After a quick investigation, we conclude that the DNS and DHCP setup in the Cluj data centre is different from the one in Brasov, and the upgrade is affected by those minor differences (again, this is a purely imaginary scenario). So we decide that while the Network team fixes those differences, we need to revert and postpone our upgrade.

Nothing simpler! We first check out the previous version of the Ansible roles from git and re-run the upgrade procedure from above. We have to adjust it a bit though, as we’re already in full failover to the Brasov data centre, so the steps are the following (concrete commands are sketched after the list):

  • Git checkout the previous version
  • Run Ansible playbook against the Cluj inventory file for:
    • Jenkins install (we already have a “dead” Jenkins, so no need to stop or backup)
    • Jenkins restore from backup
    • Jenkins start
  • Run Ansible playbook against the production inventory to fail back from Brasov to Cluj only
  • Run Ansible playbook against the Brasov inventory file for:
    • Jenkins stop
    • Jenkins install
    • Jenkins restore from backup
    • Jenkins start
  • Run Ansible playbook against the production inventory to resume DR for the two data centres.
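
Expressed as concrete commands, that rollback sequence could look like the following (the git tag and the final DNS tag are illustrative, not our real names):

# Check out the previous, known-good version of the roles
git checkout <previous-release-tag>

# Reinstall, restore and start the old version in Cluj
ansible-playbook -i production-cj master.yml
ansible-playbook -i production-cj restore.yml -e "backup_version=backup.daily/2016-10-10/jenkins-home-11.gz"
ansible-playbook -i production-cj control-services.yml --tags start-jenkins

# Fail back from Brasov to Cluj
ansible-playbook -i production control-dns.yml --tags dc_cj

# Roll back the Brasov data centre the same way
ansible-playbook -i production-bv control-services.yml --tags stop-jenkins
ansible-playbook -i production-bv master.yml
ansible-playbook -i production-bv restore.yml -e "backup_version=backup.daily/2016-10-10/jenkins-home-11.gz"
ansible-playbook -i production-bv control-services.yml --tags start-jenkins

# Resume normal DR operation for both data centres (tag name is hypothetical)
ansible-playbook -i production control-dns.yml --tags dc_all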

Same as earlier, all those actions can be bundled into a single playbook.

For the sake of consistency, we should also revert to the old version in our pre-production environment, so that when the time comes we have a playground environment where we can run the upgrade again.

We learned that Ansible describes the state that we want a system to have. This means we specify how a resource should “be”. For example, we can:

  • Describe what state a service should have: stopped, started, restarted.
  • Describe the state of a package: installed, removed.
  • Describe what a file should contain, and what properties it should have (access modes, owner etc.).
  • Run freestyle commands, for more complex situations where we can’t rely on a package manager or a service file.

Combining all these options gives us the ability to control almost any aspect of a system and perform almost any kind of action, which leads to a very tightly controlled environment.
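
To make this concrete, here is a small, self-contained play that combines all four kinds of declarations (the task names and paths are illustrative, not copied from our roles):

# Illustrative play showing the state declarations listed above
- hosts: master
  become: true
  tasks:
    - name: The jenkins package should be installed            # package state
      yum:
        name: jenkins
        state: present

    - name: The jenkins service should be started              # service state
      service:
        name: jenkins
        state: started

    - name: /data should exist with the right owner and mode   # file properties
      file:
        path: /data
        state: directory
        owner: jenkins
        group: jenkins
        mode: '0755'

    - name: Freestyle command for anything the modules can't cover
      command: /root/scripts/backup_jenkins_home.sh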

Why would we want this?

Because moving from one state to another becomes much simpler. We know how a system “is”, we have all the information we need, and we can apply just the changes we need, without redoing or rechecking all previous settings. It also guarantees that we can achieve the exact same configuration on any other system we control with the configuration management tool.
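
This is also what makes dry runs cheap: before touching an environment, we can ask Ansible to report what it would change, for example:

# Preview the changes an upgrade would make, without applying them
ansible-playbook -i production-cj master.yml --check --diff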

Keeping this in mind, what may have caused the production failure?

A plausible and very simple answer could be: the DNS and DHCP were NOT set up using a configuration management tool. This introduced slight differences between two implementations that should have been identical (except for the obvious, intended differences such as IP range, search domain, etc.), the kind of differences usually attributed to human error.

The bottom line is that we should always use a configuration management tool, not just for initial setups, but also for day-to-day tasks, to ensure that the state a system reaches is indeed the desired state.

P.S.: The upgrade example provided above is purely fictional (at least the part with the failure) and is solely intended for learning and sharing purposes. 🙂 No data centres were harmed during any of our maintenance activities or during the making of this article.

For any helpful input and/or questions, we’re looking forward to your comments in the section below.

