Ansible Advice

Recently, I got the chance to work on a non-trivial Ansible setup with which we manage a couple dozen VMs at work. Here are my learnings and recommendations from working on it.

Good Practices for Ansible

There is a very long and detailed article on Good Practices for Ansible. It is a treasure trove of guidance and contains many examples and rationales for why the authors believe certain things should be as they recommend. Do not let the short scrollbar deceive you, that website uses the <details>-tag a lot! I picked up some of the points from that article, but sometimes I disagree and might even give conflicting advice. Read it and make up your own mind about what works best for you and your team.

Read the manual

The ansible documentation is vast, difficult to navigate until you get used to it, and sometimes confusing. Do not dismiss it lightly! It contains so much important information. To give you some examples, here are my most important picks from it:

  • Ansible.Builtin Collection Documentation: It contains the references for all built-in modules, their arguments, and usually a great deal of examples which are almost ready to use.

  • Role directory structure: The basics of a good role directory can be achieved with ansible-galaxy role init <role name>, but this page explains what each directory is about in a concise way.

  • Understanding variable precedence: This always bites me, as there are more than 20 different places to configure variables in ansible. While most of them follow a “more generic gets overwritten with more specific”-approach, sometimes it is still difficult to remember all precedence rules.

  • Including vs Importing: While this is often not important and, at least in my use cases, include_X is often the way to go, it makes sense to read up on include and import differences.

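To illustrate the distinction, here is a minimal sketch (file names invented): import_tasks is resolved statically when the playbook is parsed, while include_tasks is processed dynamically at runtime, so only the latter supports loops and templated file names.

```yaml
# Static: resolved at parse time, visible to --list-tasks,
# but loops and variable file names are not allowed here.
- name: Import static tasks
  ansible.builtin.import_tasks: install.yml

# Dynamic: resolved at runtime, so loops and templated
# file names work; tasks only appear once the include runs.
- name: Include tasks per component
  ansible.builtin.include_tasks: "setup_{{ item }}.yml"
  loop:
    - database
    - webserver
```
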
There is also one more important tome of knowledge: The Jinja Template Designer Documentation. It contains all the details you need to know about handling strings in ansible or to write template files. If you are not familiar with Jinja, read at least the parts about built-in functions, those can be very important. Note that ansible comes with a few functions on its own, so sometimes you will not find the things you are looking for here but in the ansible documentation.
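
As a small illustration of that split (variable names invented): default and join come from Jinja itself, while a filter like to_nice_yaml ships with ansible and is documented in the ansible docs instead.

```yaml
# default and join are Jinja built-ins, documented in the
# Jinja Template Designer Documentation.
- name: Print server list
  ansible.builtin.debug:
    msg: "{{ my_servers | default([]) | join(', ') }}"

# to_nice_yaml ships with ansible, so you will find it in the
# ansible documentation, not in the Jinja manual.
- name: Dump a variable as readable YAML
  ansible.builtin.debug:
    msg: "{{ my_config | to_nice_yaml }}"
```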

Do not reinvent the wheel

Except for very narrow use-cases, many, many things have already been solved by the community and are included in, e.g., ansible.builtin and community.general. All of our setups are, after all, not that special. There is also a grand selection of roles and collections in the ansible-galaxy.

However, as with all third-party dependencies, consider the trade-offs between writing something yourself and using an off-the-shelf solution.

Use ansible-lint

Using ansible-lint is likely the best advice one can give when working with Ansible.

If you use VSCode or derivative editors, install the Ansible extension (Ansible extension in the Open VSX Registry) and add these settings to make it aware of all yml files in roles and playbooks:

"files.associations": {
    "**/collections/**/*.yml": "ansible",
    "**/roles/**/*.yml": "ansible",
    "**/playbooks/**/*.yml": "ansible"
}

Alternatively, you can install ansible-lint into your virtual environment or, if you installed ansible via pipx, inject it into that installation.

Ansible-lint protects you from many common code smells and issues. For example, it ensures that you always fully qualify module names, i.e., instead of copy, you would write the FQCN ansible.builtin.copy. This prevents accidental shadowing of modules and thus ensures that you always use the module you wanted to use.

Another one of my favorites is that ansible-lint warns when you use ansible.builtin.shell where you often do not need it and could use ansible.builtin.command instead. Even better: If you use, for example, curl inside an ansible.builtin.command, ansible-lint will happily suggest one of ansible's own modules to download files directly. These rules often require you to think about failed_when or changed_when checks, but in my opinion that is usually worth the effort.

You can disable the linter in special cases with magic comments such as # noqa: command-instead-of-module – or whatever the hint is you want to suppress:

- name: Copy files around on control node  # noqa: command-instead-of-module
  delegate_to: localhost
  ansible.builtin.command:
    argv:
      - "rsync"
      - "--archive"
      - "--delete"
      - "--cvs-exclude"
      - "--exclude=.git"
      - "-i"
      - "/from/path/"
      - "/to/path/"
  register: rsync_result
  failed_when: rsync_result.rc != 0
  changed_when: rsync_result.stdout_lines
  when: not ansible_check_mode

Building complex inventories

When building your inventory, place group and host variables inside an inventory directory. So instead of using two or three top-level directories (host_vars, group_vars, inventory), use one inventory directory and nest the host and group vars inside it. This seems to be the only reliable way for ansible to pick them up when you organize variables in per-group or per-host directories:

inventory
   |- group_vars
   |  `- group_name1
   |     |- variables1.yml
   |     `- variables2.yml
   |- host_vars
   |  |- hostfqdn
   |  |  |- variables_a.yml
   |  |  `- variables_b.yml
   |  `- hostfqdn2
   |     `- variables.yml
   |- hostfqdn
   `- hostfqdn2

If you manage your group variables outside of the inventory directory, ansible will not gather multiple variable files from subdirectories. I have not yet found any documentation as to why that is, but it took my team and me a while to figure this out; we only discovered it by accident when two of our ansible projects behaved differently and we narrowed it down to the difference in the inventory definition. There might be other factors at play, but so far we have been successful with this strategy.

Note that while ansible is able to merge certain variables in some ways, for inventories it uses a “last-wins” strategy. That means that in the example above, a variable set in both variables1.yml and variables2.yml would only take the value set in variables2.yml.
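
A minimal illustration of this last-wins behavior (variable and values invented):

```yaml
# inventory/group_vars/group_name1/variables1.yml
dns_server: 10.0.0.2

# inventory/group_vars/group_name1/variables2.yml
dns_server: 10.0.0.3

# Result: hosts in group_name1 see dns_server == "10.0.0.3",
# because the files are read in lexicographic order and the
# last definition wins.
```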

You can use this behavior in the same way many configuration directories on Linux do, by naming your files appropriately: 01-something.yml to 99-something.yml. Why, you ask? Good question indeed, we could use a single file and just dump everything in there! However, at work we automate some of those variables via OpenTofu, i.e., we have a few resources in our tofu state which automatically manage host variables for our ansible playbooks, and it is much easier to handle files than lines in files. Thus, even if you are not automating variables right now, consider using this pattern so you can automate things later.

ansible_advice | Task naming

ansible-lint enforces that task names start with capital letters, unless the names have a pipe in them. In that case, the convention is to use the task-file filename before the pipe. Here is an example from my home infrastructure, where I configure a pihole. It comes with three task files: roles/pihole/tasks/main.yml, configure_netplan.yml, and disable_systemd_resolved.yml. The tasks inside the main.yml are named “normally”, for example:

- name: Start pihole

Contrast this to the names in configure_netplan.yml:

- name: configure_netplan | Ensure netplan config is not accessible by others

This turned out to be a useful convention when running playbooks with the default verbosity to quickly find out in which file a certain task lives, for example to debug, monitor, or extend playbooks. Compare the following outputs of a made-up task to configure an internal DNS server:

# Verbosity 0 (ansible-playbook playbook.yml)
TASK [pihole : configure_netplan | Set DNS server] ******************************************
ok: [myserver]

# Verbosity 1 (ansible-playbook -v playbook.yml)
TASK [pihole : configure_netplan | Set DNS server] ******************************************
ok: [myserver] => {"ansible_facts": {"dns_server": "10.0.0.2"}, "changed": false}

# Verbosity 2 (ansible-playbook -vv playbook.yml)
TASK [pihole : configure_netplan | Set DNS server] ******************************************
task path: .../roles/pihole/tasks/configure_netplan.yml:4
ok: [myserver] => {"ansible_facts": {"dns_server": "10.0.0.2"}, "changed": false}

Ansible prints the filename of the task only when invoked with verbosity level 2 or higher, but it always logs the full task name. Thus, adding the task file's name as a prefix might feel redundant, but it helps to debug issues, especially when you are not really in control of the ansible execution environment, for example, a CI pipeline someone else controls.

File naming

Name your ansible YAML files with the yml extension. This is a convention used by the ansible-galaxy tools when you scaffold a new role. It also helps to distinguish YAML files used for ansible from those used for other purposes, and thus makes it easier to configure tools such as ansible-lint.

Similarly, jinja templates should be named .j2, rather than .jinja2 or similar, for example, myscript.sh.j2.

Role naming

Use descriptive role names! common_config might be descriptive in some context, but it is not always clear what it means. Additionally, such generic names invite an organic accumulation of tasks which might, at some point, have been common to all hosts, but might not be in the future. Think twice before adding something to such a role: are you sure that all of your servers require the latest CUDA installation?

Verbose playbook execution

Even though the task naming advice above removes some of its necessity, I do recommend running playbooks with -vv. This gives you – in my view – the most relevant information but leaves out unnecessary details such as connection logs, which only become visible at levels 3 and 4. The main reason is that not all collections and roles from the Ansible Galaxy follow the convention of naming their tasks with the filename, so it is easier to debug such playbooks with -vv.

Make sure your playbooks are check-mode-compatible

Before I apply playbooks, I run them with --check to see if they are, at least from ansible's point of view, more or less okay. Sometimes I also run them with --diff to see changes, but both options are only somewhat reliable. This, however, requires that you write your playbooks in a way that supports those modes, or at least the check mode.

You must sprinkle your code with a few magic values here and there to make them work with the check mode properly:

  • when: not ansible_check_mode

  • ignore_errors: "{{ ansible_check_mode }}"

  • check_mode: true/false

Many modules from the standard library automatically support the check mode, but if some task does not support it, you can skip its execution during check mode with when: not ansible_check_mode. Of course, this also works the other way around, for example when you want to run some debugging code only when the check mode is active. This is the most reliable and straight-forward way to deal with check mode and should be your go-to solution whenever possible.
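
As a sketch of this first option (script path and task invented):

```yaml
# This task changes state and does not support check mode
# itself, so we skip it entirely during --check runs.
- name: Run one-off bootstrap script
  ansible.builtin.command:
    argv:
      - /usr/local/bin/bootstrap.sh
  changed_when: true
  when: not ansible_check_mode
```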

If a task must run (e.g., when you need to register a result), you can also use ignore_errors: "{{ ansible_check_mode }}". This will run the task but ignore any errors it throws during check mode. Personally, I try to use it as rarely as possible, but sometimes it is unavoidable.

The last option, using check_mode, is slightly different conceptually. It allows you to control the check mode on a task level; i.e., when you set check_mode: true, then that task will always behave as if it was run in check mode, independent of the --check CLI argument. Similarly, check_mode: false will always run tasks as if the check mode was inactive. This can be useful to query data sources and make sure subsequent tasks get the information they need.
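
For example, a read-only lookup can be forced to run even during --check so that later tasks get the data they need (path and variable names invented):

```yaml
# check_mode: false makes this task run even during --check.
# Reading a file changes nothing, so this is safe, and later
# tasks can rely on the registered result.
- name: Read current app version
  ansible.builtin.command:
    argv:
      - cat
      - /opt/app/VERSION
  register: app_version
  changed_when: false
  check_mode: false
```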

Managing the check mode is useful to plan ahead and is one way to test changes, so it is usually quite helpful to control it properly.

Only become: root when you must

It is tempting to place become: true in your playbooks and not worry about it in your tasks. However, this can easily have unintended side-effects, for example unexpected ownership of files. Instead, it works best to use become: true sparingly: only on the task level or on the block level. The block level should only be used when you bundle a few closely related tasks together, otherwise future changes could lead to rootless tasks being crammed between the root tasks in the block.

For example, to set authorized ssh public keys for users, you must become root:

- name: Set ssh authorized keys
  become: true
  ansible.posix.authorized_key:
    user: "shoeffner"
    key: "ssh-ed25519 AAAA..."
    exclusive: true
  ignore_errors: "{{ ansible_check_mode }}"

Argument specs

Whenever possible, especially when you expose arguments for your roles, make sure to write a meta/argument_specs.yml. They are very tedious to write, but they come with tremendous benefits. First, they validate all role variables against your specification before any task is executed. This prevents abrupt failures in the middle of a role, which are sometimes difficult to debug. It also removes the need for filters like mandatory in your tasks and makes it obvious which arguments are required for a role. Second, it gives much more relevant error messages than whatever happens when one of the variables is ill-defined or missing.

The biggest disadvantage, however, is that the argument specs cannot pick up defaults from the defaults/main.yml or vice-versa; instead, you need to specify defaults twice. Still, I recommend setting defaults in defaults/main.yml where appropriate and adding a required: true without any default to the meta/argument_specs.yml where a default is neither possible nor useful.

While we are at it, make sure all your role input variables are “namespaced”, i.e., start with the role name; e.g., for the pihole role above, instead of just port:, use pihole_port:. This way, it is easier to tell which role a variable belongs to.
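
Putting both pieces of advice together, a minimal meta/argument_specs.yml for the pihole role could look roughly like this (the options shown are invented for illustration):

```yaml
# roles/pihole/meta/argument_specs.yml
argument_specs:
  main:
    short_description: Install and configure a pihole instance
    options:
      pihole_port:
        type: int
        default: 53  # must be kept in sync with defaults/main.yml
        description: Port the DNS service listens on.
      pihole_upstream_dns:
        type: str
        required: true  # no sensible default, so require it
        description: Upstream DNS server to forward queries to.
```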

Resource cleanup

Resource cleanup in ansible is a mess, or basically non-existent. This is of course an exaggeration; you can use tasks to ensure absence, but you still need to figure out manually what to clean up. The reason is that ansible has no state management like, for example, OpenTofu, but can only ensure whatever you specify in your playbooks.

Still, we can often build our own cleanup strategy with the following algorithm, which is my go-to solution for such problems:

  1. Get a list of existing resources of that type.

  2. Optionally, get a list of declared resources of that type, usually from the group and host vars.

  3. Get the resources to be cleaned up by taking the difference of existing and declared resources like this: existing | default([]) | difference(declared) – The declared resources should have a default in defaults/main.yml.

  4. Ensure all those undeclared resources are absent, e.g., with state: absent.

  5. Ensure all declared resources are present, e.g., with state: present.

Here is an example for local user accounts in a fictional role local_users:

- name: Lookup existing users
  ansible.builtin.getent:
    database: passwd

- name: Filter existing users with 60000 > uid > 1000
  ansible.builtin.set_fact:
    existing_users: "{{ (existing_users | default([])) + [item.key] }}"
  with_dict: "{{ ansible_facts.getent_passwd }}"
  when: "60000 > item.value[1] | int > 1000"

- name: Ensure non-declared users are absent
  become: true
  ansible.builtin.user:
    name: "{{ item }}"
    state: absent
    force: true
  loop: "{{ existing_users | default([]) | difference(local_users_users) }}"

- name: Ensure declared users are present
  become: true
  ansible.builtin.user:
    name: "{{ item }}"
  loop: "{{ local_users_users }}"

Dynamic includes

It is possible to include tasks dynamically, for example to choose different methods. This way, instead of using blocks with complicated when: ... statements, you can switch based on a variable:

- name: Fetch authentication credentials
  ansible.builtin.include_tasks: "fetch_{{ auth_fetch_method }}.yml"

Another use case for dynamic includes is to reuse more complex code in loops:

- name: Perform complex task
  ansible.builtin.include_tasks: "complex_task.yml"
  loop:
    - entity_a
    - entity_b
  loop_control:
    loop_var: complex_task_var

This will include complex_task.yml and pass the variable complex_task_var to it, once with the value entity_a and once with the value entity_b.

When you use these dynamic includes, do not nest things too deeply: Task files all live in the same directory, so at some point it might make sense to create new roles instead.

Secret management

While it is okay for small projects, avoid ansible-vault for bigger ones. If you must use it, make sure that you encrypt complete variable files (following the inventory trick above) rather than parts of them. And make sure that you have some mechanism in place, such as pre-commit hooks, to prevent those files from being checked in unencrypted. Otherwise, you are in for a lot of manual decrypting and encrypting of strings.

A better alternative is to use a plugin which retrieves secrets on demand, such as the hashi_vault plugin – granted, of course, that you have access to such a secret store. This way the name of the secret, instead of the secret itself, becomes part of your configuration, making it portable and safe to share, as no encryption is required anymore.
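
With the community.hashi_vault collection, such a lookup might look roughly like this (the secret path and key are placeholders; authentication setup is omitted):

```yaml
- name: Fetch database password from vault
  ansible.builtin.set_fact:
    db_password: >-
      {{ lookup('community.hashi_vault.hashi_vault',
                'secret=secret/data/myapp:db_password') }}
  no_log: true  # never print the secret in task output
```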

Meta dependencies

The dependencies: [] in meta/main.yml is somewhat ambiguously named. I always thought I had to list collections my role required because I would use one of their modules in my role. But it is different: You list roles which are always executed before your own tasks run. For example, if you have a role which starts a docker container, you might want to consider adding geerlingguy.docker as a dependency. Beware, though, that adding a dependency you do not need on every playbook run (e.g., one that installs packages) might slow down your playbook execution.
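
In meta/main.yml, that could look like this (the role name my_app is invented):

```yaml
# roles/my_app/meta/main.yml
dependencies:
  # Runs before my_app's own tasks on every play that uses my_app.
  - role: geerlingguy.docker
```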

Performance optimization

In many situations, there will be a CI pipeline running the playbook, and it will not matter much whether it takes a minute or two instead of a few seconds. However, leaving environmental impacts aside, slow playbooks make development and ad-hoc runs tedious and lead to frustration. Tags are a solution, but I don’t like tags that much – though I cannot pinpoint yet why. Instead I find myself stupidly toggling roles in playbooks when I work on something, which is arguably worse.

Anyway, to improve execution times, it is always important to measure them first. Luckily, ansible comes with a couple of handy tools for that, which you can configure in your ansible.cfg:

[defaults]
callbacks_enabled = timer, profile_roles

This will measure the execution times of each role (profile_roles) and print timestamps after your tasks (timer). Not only will this help you identify bottlenecks, it also makes debugging easier because you know when certain tasks were executed.