SOP Add Zabbix monitoring to the releng compose hosts

This SOP documents, step by step, what was required [14] to add Zabbix monitoring to the Fedora Releng systems. It is intended as a guide for community members who want to help roll out Zabbix to the wider Fedora Infra systems, or help maintain it going forward. Once complete, this SOP will live in [3], [4].

Resources

Releng Machine List

The following machines are relevant to Releng.

machines:
[releng_compose]
compose-x86-01.iad2.fedoraproject.org
compose-branched01.iad2.fedoraproject.org
compose-rawhide01.iad2.fedoraproject.org
compose-iot01.iad2.fedoraproject.org

[releng_compose_stg]
compose-x86-01.stg.iad2.fedoraproject.org

First, install the Zabbix agent on the releng_compose:releng_compose_stg hosts via the zabbix/zabbix_agent Ansible role [11]. We targeted the groups/releng-compose.yml playbook, as it is the playbook responsible for these hosts.

diff --git a/playbooks/groups/releng-compose.yml b/playbooks/groups/releng-compose.yml
index 04b68aba4f..69c0acdad3 100644
--- a/playbooks/groups/releng-compose.yml
+++ b/playbooks/groups/releng-compose.yml
@@ -28,6 +28,8 @@
   - ipa/client
   - rkhunter
   - nagios_client
+  - zabbix/zabbix_agent
   - collectd/base
   - sudo
   - role: keytab/service

Run the playbook on the batcave01 host with sudo rbac-playbook groups/releng-compose.yml. Then check the hosts section of the Zabbix console to ensure the new hosts have been picked up by Zabbix [5][17]. To get access to the Zabbix server, your FAS user must be a member of the sysadmin-noc group; once it is, run the playbook with sudo rbac-playbook groups/zabbix.yml. After that run you can authenticate via FAS on the Zabbix web console.

Requirements

No composes are run in the staging environment, so this unfortunately needs to be implemented in the production environment only.

Existing monitoring tracks whether a compose fails or finishes successfully; however, there is currently no monitoring to detect when a compose hangs.

Cronjobs are installed on the releng hosts via the following Ansible task [13]. There are 8 cronjobs in total.

  • 1: ftbfs weekly cron job "ftbfs.cron" /etc/cron.weekly/ on compose-x86-01.iad2

  • 2: branched compose cron "branched" /etc/cron.d/branched on compose-branched01.iad2

  • 3: rawhide compose cron "rawhide" /etc/cron.d/rawhide on compose-rawhide01.iad2

  • 4: cloud-updates compose cron "cloud-updates" /etc/cron.d/cloud-updates on compose-x86-01.iad2

  • 5: container-updates compose cron "container-updates" /etc/cron.d/container-updates on compose-x86-01.iad2

  • 6: clean-amis cron "clean-amis.j2" /etc/cron.d/clean-amis on compose-x86-01.iad2

  • 7: rawhide-iot compose cron "rawhide-iot" /etc/cron.d/rawhide-iot on compose-iot01.iad2

  • 8: sig_policy cron "sig_policy.j2" /etc/cron.d/sig_policy on compose-x86-01.iad2

At least one Zabbix check is needed per cronjob. Each check relies on a marker file that the cronjob maintains, as follows.

  • When a cronjob starts: create a file at /tmp/name-of-cron-job

  • When a cronjob ends: delete the file at /tmp/name-of-cron-job

  • If the file exists, assume the cronjob is running; if it has existed for longer than a set period, assume the cronjob has stalled.
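The stall rule in the last bullet can be illustrated locally with a small shell sketch. This is only an illustration of the rule, not how Zabbix evaluates it: the function name marker_is_stale and the default threshold of 3600 seconds are hypothetical, and it assumes GNU stat and date are available on the host.

```shell
# Hypothetical helper: succeeds (exit 0) when the marker file exists and
# was last modified more than MAX_AGE seconds ago, i.e. the job looks stalled.
# Usage: marker_is_stale /tmp/name-of-cron-job 3600
marker_is_stale() {
    local marker="$1"
    local max_age="${2:-3600}"   # assumed default threshold: 1 hour
    local now mtime

    # No marker file means the cronjob is not running, so it cannot be stalled.
    [ -e "$marker" ] || return 1

    now="$(date +%s)"
    mtime="$(stat -c %Y "$marker")"   # GNU stat: modification time in epoch seconds

    [ $(( now - mtime )) -gt "$max_age" ]
}
```

In the actual setup Zabbix makes this decision from the item history rather than from the file's modification time; the sketch is only useful for reasoning about the rule or for a quick manual check on a host.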

Implementation

  • Create a custom template called fedora releng compose cronjobs.

  • Create a host group called fedora releng compose.

  • Add the Ansible hosts from the releng_compose group to this host group. Do this in production only, since we currently don’t run composes in staging.

  • In this template create an item, one for each cronjob.

  • In this template create a trigger, one for each cronjob. Initially set the trigger to alert when the item returns true for more than 1 hour. This can be changed later when we understand just how long these cron jobs run for.

  • Implement this template in JSON; see [12] for inspiration and format examples. The template can then be placed in roles/zabbix/zabbix_server/files/zabbix_templates/releng_compose_cronjobs.json.

  • Create a task in roles/zabbix/zabbix_server/tasks that uses the zabbix_api key to create this template on the server; see [1].

  • Use the community Ansible role for adding this template to the releng hosts.

  • Update each cronjob in Ansible to create a file such as /tmp/name-of-cron-job when it starts and delete it when it completes.

Create a host group

- name: Create host groups
  # set task level variables as we change ansible_connection plugin here
  community.zabbix.zabbix_group:
    state: present
    host_groups: "{{ item['hostgroup'] }}"
  with_items: "{{ zabbix_templates }}" # Hostgroups specific to an ansible group can be overridden in inventory/group_vars/group_name
  run_once: True
  tags:
    - zabbix_hostgroups
  vars:
    ansible_zabbix_auth_key: "{{ (env == 'staging')|ternary(zabbix_stg_apikey, zabbix_apikey) }}"
    ansible_network_os: community.zabbix.zabbix
    ansible_connection: httpapi
    ansible_httpapi_port: 443
    ansible_httpapi_use_ssl: true
    ansible_httpapi_validate_certs: false
    ansible_host: "{{ (env == 'staging')|ternary(zabbix_stg_hostname, zabbix_hostname) }}"
    ansible_zabbix_url_path: ""  # If the Zabbix WebUI runs on a non-default (zabbix) path, e.g. http://<FQDN>/zabbixeu

Add production releng_compose hosts to the Zabbix host group

- name: Add hosts to hostgroups
  community.zabbix.zabbix_host:
    host_name: "{{ inventory_hostname }}"
    host_groups: "{{ item['hostgroup'] }}"
#    link_templates: "{{ item['template'] }}" # We're adding the template to hostgroups in a separate step, may not be required.
    force: false
  with_items: "{{ zabbix_templates }}"
  tags:
    - zabbix_add_hosts_to_hostgroups
  vars:
    ansible_zabbix_auth_key: "{{ (env == 'staging')|ternary(zabbix_stg_apikey, zabbix_apikey) }}"
    ansible_network_os: community.zabbix.zabbix
    ansible_connection: httpapi
    ansible_httpapi_port: 443
    ansible_httpapi_use_ssl: true
    ansible_httpapi_validate_certs: false
    ansible_host: "{{ (env == 'staging')|ternary(zabbix_stg_hostname, zabbix_hostname) }}"
    ansible_zabbix_url_path: ""  # If the Zabbix WebUI runs on a non-default (zabbix) path, e.g. http://<FQDN>/zabbixeu

Import a custom template

Use the community.zabbix.zabbix_template Ansible module to import the template.

Make sure to use JSON format. It is probably easiest to configure the template in the Zabbix UI first and then export it. Make sure that the JSON template is minimised before importing it back into Zabbix.

#- name: Get Zabbix template as JSON
#  community.zabbix.zabbix_template_info:
#    template_name: fedora releng compose cronjobs
#    format: json
#    omit_date: yes
#  register: zabbix_template_json

#- name: Write Zabbix template to JSON file
#  local_action:
#    module: copy
#    content: "{{ zabbix_template_json['template_json'] }}"
#    dest: "roles/zabbix_server/files/zabbix_templates/releng_compose_cronjobs.json"

- name: Import Zabbix templates from JSON
  community.zabbix.zabbix_template:
    template_json: "{{ lookup('file', item['template'] ) }}"
    state: present
  with_items: "{{ zabbix_templates }}" # Templates specific to an ansible group can be overridden in inventory/group_vars/group_name
  tags:
    - zabbix_templates
  vars:
    ansible_zabbix_auth_key: "{{ (env == 'staging')|ternary(zabbix_stg_apikey, zabbix_apikey) }}"
    ansible_network_os: community.zabbix.zabbix
    ansible_connection: httpapi
    ansible_httpapi_port: 443
    ansible_httpapi_use_ssl: true
    ansible_httpapi_validate_certs: false
    ansible_host: "{{ (env == 'staging')|ternary(zabbix_stg_hostname, zabbix_hostname) }}"
    ansible_zabbix_url_path: ""  # If the Zabbix WebUI runs on a non-default (zabbix) path, e.g. http://<FQDN>/zabbixeu

Add template to host groups

- name: Add templates to hosts
  community.zabbix.zabbix_host:
    host_name: "{{ inventory_hostname }}"
    host_groups: "{{ item['hostgroup'] }}"
    link_templates: "{{ item['template'] }}"
    force: false
  with_items: "{{ zabbix_templates }}"
  tags:
    - zabbix_add_templates_to_hosts
  vars:
    ansible_zabbix_auth_key: "{{ (env == 'staging')|ternary(zabbix_stg_apikey, zabbix_apikey) }}"
    ansible_network_os: community.zabbix.zabbix
    ansible_connection: httpapi
    ansible_httpapi_port: 443
    ansible_httpapi_use_ssl: true
    ansible_httpapi_validate_certs: false
    ansible_host: "{{ (env == 'staging')|ternary(zabbix_stg_hostname, zabbix_hostname) }}"
    ansible_zabbix_url_path: ""  # If the Zabbix WebUI runs on a non-default (zabbix) path, e.g. http://<FQDN>/zabbixeu

In this template create an item, one for each cronjob

  • Configure the type as Zabbix agent (active)

  • Configure the history to 7d to keep one week of data

  • Configure the update interval to 60m to check every hour

  • Configure the key to match something like the following, replacing the * with the name of the cronjob, e.g. rawhide

vfs.file.exists[/tmp/fedora-compose-*]

In this template create a trigger, one for each cronjob

  • Configure the trigger to fire once the marker file has existed for 8 hours.

  • Configure the severity to high

  • In this example, releng_compose_cronjobs.json is the name of the template. Referencing the template in the expression keeps it generic: when the template is applied to a host, the host gains the triggers and items contained in the template.

  • Configure the expression to something like the following, replacing the file name with the one used in the key of the matching item

last(/releng_compose_cronjobs.json/vfs.file.exists[/tmp/fedora-compose-branched])=1 and min(/releng_compose_cronjobs.json/vfs.file.exists[/tmp/fedora-compose-branched],8h)>0
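Since the same expression is needed once per cronjob, it can help to generate it rather than type it by hand. The following is a minimal sketch: the function name trigger_expression is hypothetical, and it assumes every marker file follows the /tmp/fedora-compose-NAME pattern used in the example above.

```shell
# Hypothetical helper: print the trigger expression for one cronjob.
# Usage: trigger_expression JOB_NAME [WINDOW]   (WINDOW defaults to 8h)
trigger_expression() {
    local job="$1"
    local window="${2:-8h}"
    # Template name and item key are taken from the example expression above.
    local item="/releng_compose_cronjobs.json/vfs.file.exists[/tmp/fedora-compose-${job}]"

    printf 'last(%s)=1 and min(%s,%s)>0\n' "$item" "$item" "$window"
}

# Example: print one expression per compose cronjob.
for job in branched rawhide cloud-updates container-updates rawhide-iot; do
    trigger_expression "$job"
done
```

The min(...,8h)>0 part only holds when every value collected in the window is 1, which is why the trigger fires only after the file has been present for the whole window.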

Modify each cronjob in ansible

  • When a cronjob starts: create a file at /tmp/name-of-cron-job

  • When a cronjob ends: delete the file at /tmp/name-of-cron-job

  • If the file exists, assume the cronjob is running; if it has existed for longer than a set period, assume the cronjob has stalled.
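One way to get this behaviour is to wrap the cron command in a small shell function. The sketch below is an assumption rather than the wrapper used in the Fedora Ansible repository: the name run_with_marker and the MARKER_DIR override (handy for testing outside /tmp) are hypothetical, and the marker name follows the /tmp/fedora-compose-NAME pattern used in the item key.

```shell
# Hypothetical wrapper: create a marker file, run the job, remove the marker.
# Usage: run_with_marker NAME COMMAND [ARGS...]
run_with_marker() {
    local name="$1"
    shift
    # MARKER_DIR is an assumed override; the default matches the /tmp path above.
    local marker="${MARKER_DIR:-/tmp}/fedora-compose-${name}"

    (
        # Remove the marker when the subshell exits, whether the job
        # succeeded or failed, so only a job that is still running keeps it.
        trap 'rm -f "$marker"' EXIT
        touch "$marker"
        "$@"
    )
    # The function returns the job's own exit status.
}
```

A cron entry could then call something like run_with_marker rawhide followed by the existing compose command. A job that is killed outright (for example with SIGKILL) leaves the marker behind, so Zabbix would eventually report it as stalled, which is arguably the desired outcome.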

Fedora Ansible Group Vars for Zabbix

The following variable structure is required to configure the new zabbix_templates role. See the example in inventory/group_vars/releng_compose for production:

zabbix_templates:
  - group: "releng_compose"
    template: "releng_compose_cronjobs.json"
    hostgroup: "fedora releng compose"

And for staging:

zabbix_templates: "{{ [] }}"

We currently do not run composes in staging, so this role has not been activated on the staging machines. Ordinarily, make sure to add the same vars in both the production and staging environments.

Each element in the list links a single Zabbix template to a Zabbix host group. The group parameter is not currently used, but it should be set to the Ansible group name for documentation purposes.

To use this role going forward:

  • Add template.json files to roles/zabbix/zabbix_templates/files

  • Add a zabbix_templates var to the inventory/group_vars/groupname file that matches the Ansible group

  • Import the role in the playbook corresponding to this Ansible group, e.g.:

# playbooks/groups/releng-compose.yml:33
  roles:
...
  - zabbix/zabbix_templates
...

Future Work

Replace the existing custom releng monitoring with Zabbix.