Ansible Is Not (Yet) Perfect

A Review of Ansible in Production

2018-06-03

Introduction

I have been using Ansible for over a year now, both at work and at home (for example, to configure my Kubernetes cluster using kubespray).

When I first used Ansible, I was blown away by its power and simplicity. And all that by leveraging the existing SSH server, without a new client setup? Awesome!

But over time, I discovered more and more warts and limitations while using Ansible. In this blog post, I will go over the cases where it falls short of the promises it makes and where you start fighting against Ansible instead of working with it.

This post is in no way meant to put down Ansible. When focusing on the bad parts, one might get the impression that there are no good parts. This is absolutely not the case! But by listing its drawbacks, maybe we can come up with ideas on how to fix or work around them, benefiting everyone.

This post assumes some familiarity with Ansible. I can recommend the Getting Started documentation for the first steps with Ansible.

Ansible limitations

YAML as the configuration language

Ansible uses YAML for almost all of its configuration. YAML is an excellent choice when you want to express data, similar to JSON. You have dictionaries, lists, scalars, combinations of them and some syntactic sugar to save typing. Easy.
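
A quick illustration of the kind of plain data YAML is great at expressing (the values are made up):

server:
  name: web01
  ports: [80, 443]
  tls: true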

But YAML is not a good language to express program logic.

It is declarative, but without real logic constructs you end up poorly implementing a DSL on top of a data serialization language.

Let's start with a simple example:

- shell: echo {{ item }}
  with_items:
    - "one"
    - "two"
    - "three"

Easy to reason about: This outputs one, two and three. Now, Ansible has a way to combine such loops with conditionals, using YAML plus Jinja, like this:

- shell: echo {{ item }}
  when: item != 'two'
  with_items:
    - "one"
    - "two"
    - "three"

If you know Ansible, you most likely know what is going to happen: It will output one and three, skipping two. This is of course the only way the ordering between when and with_items makes sense, but it is not at all obvious or deducible by only looking at the code. If this were written procedurally instead, it would be immediately obvious:

for item in ['one', 'two', 'three']:
    if item != 'two':
        shell(f'echo {item}')

A similar problem occurs when using register together with a loop, like this:

- shell: echo {{ item }}
  with_items:
    - "one"
    - "two"
  register: out

Usually, register saves the output of a module into the variable you name as its value. But when using a loop, the structure of out differs from the one you would get without the loop: Instead of out being a dictionary containing the return data, out.results is a list of dictionaries with that data for every invocation of shell. Now you know that, and it might make sense, but it is so obscure that you will most likely have to look it up next time (I do every time).
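
To get at the individual outputs afterwards, you then have to loop over that list yourself; a minimal sketch:

- debug:
    msg: "{{ item.stdout }}"
  with_items: "{{ out.results }}"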

My guess is that YAML was chosen because it is declarative. But Ansible is not really declarative; at a high level, it is procedural.

On the module level, declarativeness makes a lot of sense. I do not actually care how that file gets its content and permissions, or how that package is installed. I just want to tell Ansible to make it so, and its job is to figure it out. So, YAML might actually be a good decision for module invocation:

- copy:
    dest: /etc/foo.bar
    content: 'Hey!'
    owner: foo
    group: foo
    mode: 0644

It is immediately obvious what is going to happen, apart from the not-so-obvious name of copy for the module. In the end, I end up with a file that has the exact properties I specified above. Nice.

There is another configuration management system that uses YAML together with Jinja for its syntax: SaltStack. The difference is the ordering of the "rendering pipeline": Ansible first parses the files as YAML, and then applies Jinja to certain parts (e.g. the when key). SaltStack's files are Jinja-templated YAML files, so it first passes the file through the Jinja engine and then parses the output as YAML.

This approach makes for a much more powerful syntax, because you actually have a Turing-complete language (Jinja) to write your declarations (YAML) with. It is also less magical: If you know Jinja well enough, it is easy to reason about the code without knowing SaltStack internals.

The problem: You can shoot yourself in the foot, and SaltStack placed your target right next to your foot. There is a fine line between "That makes sense!" and "This is messy!". Take the following code as an example, taken from my Salt formula to set up Nginx together with Let's Encrypt (link here):

{% for domain in params.domains %}
letsencrypt-keydir-{{ domain.name }}:
  file.directory:
    - name: {{ nginx_map.acme.home }}/{{ domain.name }}
    - user: {{ nginx_map.acme.user }}
    - group: {{ nginx_map.acme.user }}
    - mode: '0750'
    - require:
        - user: acme
{% endfor %}

This is quite easy to understand. But down the rabbit hole it goes, and you stumble upon something like this in a different file:

{% if params.get('manage_certs', True) %}
{% set no_commoncert = [] %}
{% for domain in params.domains %}
{% if domain.get('ssl_cert', false) %}
{% set main_name = domain.names[0] %}
{% do no_commoncert.append(1) %}
nginx-pkidir-{{ main_name }}:
  file.directory:
    - name: {{ nginx_map.conf.confdir }}/{{ nginx_map.pki.pkidir }}/{{ main_name }}
    - user: root
    - group: {{ defaults.rootgroup }}
    - mode: 700
    - require:
      - file: nginx-pkidir
{% endif %}
{% endfor %}
{% endif %}

Whatever it does, I think we can agree that this is not nice to read.

In the end, I think that YAML is simply not a good abstraction for configuration management files, and using Jinja as a crutch to get more functionality out of a data description language makes it even worse.

Configuration management needs a Turing-complete language with a declarative way to use modules. From this, you can generate a declaration of your desired configuration, which can then be used to configure your system. Complete declarativeness for the language, even though it is often touted as the end goal of configuration management systems, is not possible. Even the Puppet DSL has loops and conditions.

In a way, a strictly functional language might be the best way to go. NixOS is a really promising and interesting candidate.

Release engineering

Ansible moves fast and breaks lots of things. This is simply not a good feature for a configuration management system.

One example: For the package module, state: installed is now called state: present.

One bug that led to a lot of frustration for me and my team was a RecursionError caused by too many (>20) import_role statements in a playbook. It was introduced around version 2.0, fixed in version 2.3, resurfaced in version 2.4, and finally fixed for good (hopefully) in version 2.5.

This does not give me a lot of confidence in Ansible's release engineering. I know that it is a very hard job, and you always have to weigh stability against new features. But my impression is that the Ansible team leans a bit too much towards the latter, introducing breakage and forcing me to adapt my roles and modules every few releases.

Inventory and Host Variables

Ansible has a concept of host and group specific variables. There are a lot of places where you can set variables, and their precedence is strictly defined (look at the list in the official documentation!).
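
As a quick illustration of that precedence (file names and values are made up), a value from host_vars wins over the same key from group_vars:

# group_vars/all.yml
ntp_server: pool.ntp.org

# host_vars/db01.example.com.yml -- this one wins for db01
ntp_server: ntp.internal.example.com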

The problem with that is the merging strategy: Nested values are not merged; later definitions simply overwrite previous ones. This means that custom roles cannot have a single "main" key, e.g. postgresql_config for a PostgreSQL role, but have to pollute the top-level variable space with a prefixed list of variables, like this (taken from here):

postgresql_version: 9.6
postgresql_encoding: "UTF-8"
postgresql_data_checksums: false
postgresql_pwfile: ""
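
With a deep-merge strategy, the same settings could live under a single role-specific key, roughly like this (a hypothetical layout, not something the role actually supports):

postgresql:
  version: 9.6
  encoding: "UTF-8"
  data_checksums: false
  pwfile: ""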

The flat, prefixed style is simply ugly, and not the way YAML is meant to be used. Also, assume you have the following situation: You have a number of servers, and a number of admins that have access to those servers, like this:

admins:
  - name: "hannes"
    sudo: true
    sshpubkey: ssh-rsa ...
  - name: ...

Now, you want to add a new guy to your list of users, but only for a few servers (you do not want the new guy to break production!). In a perfect world, you would go to the group_vars of those servers and add him:

admins:
  - name: "newguy"
    sudo: false
    sshpubkey: ssh-rsa ...

This does not work with Ansible, because the second declaration would overwrite the first, and now only your new guy has access to those servers! The only solution to that problem (as far as I can tell) is to use a differently named key:

new_admins:
  - name: "newguy"
    sudo: false
    sshpubkey: ssh-rsa ...

Then, merge those keys in the role you use to create users:

- user:
    name: "{{ item.name }}"
    state: present
  with_items: "{{ admins + new_admins }}"

This does not scale: As soon as you need another distinct access rule, you have to add another key, and the cycle repeats.
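
A small mitigation (just a sketch, using the standard default filter) is to make the extra key optional, so groups that do not define new_admins still work:

- user:
    name: "{{ item.name }}"
    state: present
  with_items: "{{ admins + (new_admins | default([])) }}"

That keeps the play from failing on hosts without the extra key, but it does not solve the underlying merge problem.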

Speed

Simply put, the execution speed of Ansible playbooks is horrendous. This is due to its architecture, which requires SSH connections to all servers you run playbooks on. It might not be a problem for you, but the workarounds that were created (stuff like ansible-pull + cron) show that it is a problem for a significant number of people.
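
The ansible-pull approach inverts the model: every node periodically pulls the repository and applies the playbook locally. A rough sketch of setting that up with the cron module (the repository URL and playbook name are made up):

- cron:
    name: "ansible-pull"
    minute: "*/30"
    job: "ansible-pull -U https://git.example.com/config.git local.yml"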

Dry runs

When running Ansible playbooks, you can pass --check to the ansible-playbook command, and Ansible will show you what would be done without actually executing anything.

Except that this does not work reliably. The problem shows up most often when you add a YUM/APT repository and then install a package from that repository. If the repository is not yet present on the server, the (no-op) package installation will fail with a "package not found" error.

There are workarounds, like using when: not ansible_check_mode, but these are still just that: workarounds.
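
For example (the repository definition and package name are just placeholders), skipping the dependent task in check mode looks like this:

- yum_repository:
    name: example
    description: Example repository
    baseurl: https://repo.example.com/el7/
    state: present

- package:
    name: example-package
    state: present
  when: not ansible_check_mode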

Ansible does not give me the same sense of reliability as, for example, Puppet does.

My opinion

It might sound weird after all of the above, but I have to say: I really like Ansible. Not so much for configuration management, but for orchestration. There is simply nothing better.

I love having repetitive tasks written down as code, and having them reviewed before running them. Documentation that used to contain copy-paste shell snippets now simply links to an Ansible playbook that does the same thing, but repeatably, and without accidentally pasting into the wrong terminal window ;)

Slap something like Rundeck or StackStorm in front of Ansible, and you can give fine-grained access to your playbooks to other people, together with logging, auditing, and integration for your favourite tools.

But Ansible as a configuration management tool has not convinced me yet. As old school as it is, Puppet gives me more confidence while using it. Ansible still has a lot to do in that regard, but with lots and lots of people working on it, and with Red Hat backing it, I hope it will get even better in the future!