SOP Day-to-Day Zabbix usage for Infrastructure
Contact Information
-
Owner: Fedora Infrastructure Team
-
Contact: #fedora-admin, sysadmin-main, sysadmin-noc
-
Purpose: Ensure monitoring config is stored in configuration management for later (re)use.
Overview
Zabbix is the primary monitoring tool for Fedora Infrastructure. The layout is:
-
We have a Prod server for the main infra
-
The URL is [1]
-
Notifications are sent to the UI (for Warning level or lower)
-
Notifications are also sent to #noc:fedoraproject.org (for Average level or above)
-
We have a Staging server for testing things on
-
The URL is [2]
-
Notifications are sent to the UI (for Warning level or lower)
-
Notifications are also sent to #fedora-zodbot:fedora.im (for Average level or above)
For temporary changes, we use the Zabbix UI. This document covers the most common reasons for doing so in day-to-day operations.
For permanent changes, we use Ansible to record/deploy them - see SOP Add Zabbix template to Ansible for details on permanent changes.
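The notification routing described in the overview can be summarised as a small lookup. This is only an illustrative sketch, not the actual Zabbix media or action configuration: the `destinations` function and the `"ui"` label are hypothetical names, the severity numbers follow the standard Zabbix scale, and it assumes (from the word "also" above) that Average and higher alerts appear in the UI as well as in the chat room.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Standard Zabbix trigger severities, lowest to highest."""
    NOT_CLASSIFIED = 0
    INFORMATION = 1
    WARNING = 2
    AVERAGE = 3
    HIGH = 4
    DISASTER = 5


# Chat rooms per environment, as listed in the overview.
CHAT_ROOMS = {
    "prod": "#noc:fedoraproject.org",
    "stg": "#fedora-zodbot:fedora.im",
}


def destinations(env: str, severity: Severity) -> list[str]:
    """Return where an alert is expected to show up (hypothetical helper)."""
    targets = ["ui"]  # assumption: every alert is visible in the UI
    if severity >= Severity.AVERAGE:
        targets.append(CHAT_ROOMS[env])
    return targets
```

For example, `destinations("prod", Severity.WARNING)` yields only the UI, while `destinations("stg", Severity.HIGH)` adds the staging chat room.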
Common Tasks for Infrastructure
Acknowledging/closing an alert
This takes three possible paths. In all cases, get to the Update form by either:
-
clicking the Update button next to the alert on the Dashboard page, or on the Monitoring > Problems page.
-
clicking the alert in Matrix and then Update on the Event Details page.
Once you have the Update form open, you have three options:
-
If the alert is known and requires the team to take action (e.g. a disk has failed), check the Acknowledge box.
-
If it’s something we expect to go away, but would like to be re-alerted if it doesn’t (e.g. load on a proxy), check the Suppress box and pick a duration.
-
If it’s dealt with already (e.g. number of packages changed on a host), check the Close Problem box.
In all cases, a description is useful for the rest of the team.
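The same three options are also exposed through the Zabbix JSON-RPC API method `event.acknowledge`, which may be handy when updating many problems at once. The sketch below only builds the request body and does not send it. The helper name `update_problem` is hypothetical, the suppress bit assumes a reasonably recent Zabbix release, and authentication (an API token, passed in a header or in the body depending on the Zabbix version) is deliberately left out.

```python
import time

# Bit values for the "action" field of event.acknowledge.
CLOSE = 1
ACKNOWLEDGE = 2
ADD_MESSAGE = 4
SUPPRESS = 32


def update_problem(event_id, message, *, acknowledge=False, close=False,
                   suppress_for=None, now=None):
    """Build an event.acknowledge request body (hypothetical helper).

    suppress_for is a duration in seconds; now can be fixed for testing.
    """
    action = ADD_MESSAGE  # always leave a description for the team
    params = {"eventids": [event_id], "message": message}
    if acknowledge:
        action |= ACKNOWLEDGE
    if close:
        action |= CLOSE
    if suppress_for is not None:
        action |= SUPPRESS
        start = now if now is not None else time.time()
        params["suppress_until"] = int(start + suppress_for)
    params["action"] = action
    return {"jsonrpc": "2.0", "method": "event.acknowledge",
            "params": params, "id": 1}
```

A call such as `update_problem("12345", "load spike on proxy", suppress_for=3600)` corresponds to checking the Suppress box with a one-hour duration; the event ID here is made up.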
Common manual interventions
There are some frequent alerts that need to be handled manually:
-
Datanommer: FMN queue is stale:
-
This monitors the system at https://notifications.fedoraproject.org by checking when a rule was last edited. Create or delete a rule for yourself to trigger an update and clear the alert.
-
Datanommer: Bugzilla queue is stale:
-
The connection to the STOMP queue fails from time to time and needs to be prodded: restart the pod in the fedmsg2bugzilla project in OpenShift.
-
Datanommer: RPM Sign queue is stale:
-
The signing may have stalled. Log in to the autosign01(.stg) host and check whether it is running properly. See https://docs.fedoraproject.org/en-US/infra/howtos/fix_robosignatory/ for details.
Disabling/enabling a host
We sometimes need to take a host out of monitoring due to problems with the machine. To do this:
-
Search for the host in the search sidebar.
-
Click the hostname itself (in the Hosts column of the search results).
-
Scroll to the end of the Host form, and uncheck the Enabled field.
-
Save the host.
To bring the host back, edit it again and re-check Enabled.
While a host is disabled, no alerts are recorded and no data is collected. If data collection is wanted during the downtime, see Maintenance Windows below.
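For reference, the Enabled checkbox corresponds to the host `status` field in the Zabbix API (`0` for monitored, `1` for unmonitored). The following is a minimal sketch of the `host.update` request body; `set_host_enabled` is a hypothetical helper name, the host ID would have to be looked up first (for example with `host.get`), and sending the request with authentication is not shown.

```python
MONITORED = 0     # host is enabled
UNMONITORED = 1   # host is disabled: no alerts, no data collection


def set_host_enabled(host_id: str, enabled: bool) -> dict:
    """Build a host.update request body that toggles monitoring."""
    return {
        "jsonrpc": "2.0",
        "method": "host.update",
        "params": {
            "hostid": host_id,
            "status": MONITORED if enabled else UNMONITORED,
        },
        "id": 1,
    }
```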
Maintenance Windows
Zabbix can schedule time periods in which alerts are paused, and optionally data collection too, for one or more hosts. This can be scheduled in advance (e.g. for a migration of hosts to new hardware), or done ad-hoc, and can be one host or everything.
For a quick shush, there are scripts in Ansible (and a playbook) which you can use to disable alerts for all hosts (while continuing data collection), and to restore them afterwards. See:
More generally, you can create a maintenance window in the Zabbix UI like so:
-
Go to Data Collection > Maintenance
-
Click Create Maintenance Period (top-right)
-
Give it a name, type, start/end times, and an optional description
-
For hosts/hostgroups pick what you need. There is an "All hosts" group you can use for everything.
-
Save the maintenance window
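The form fields above map fairly directly onto the `maintenance.create` API method. The sketch below builds a one-time window for one or more host groups; it is not the script shipped in Ansible. `maintenance_window` is a hypothetical helper, the group IDs are placeholders, and the `groups` parameter shape assumes a recent Zabbix version (older releases used `groupids` instead).

```python
from datetime import datetime, timedelta, timezone


def maintenance_window(name, start, duration, group_ids, *,
                       collect_data=True, description=""):
    """Build a maintenance.create request body for a one-time window."""
    since = int(start.timestamp())
    seconds = int(duration.total_seconds())
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": name,
            "description": description,
            "active_since": since,
            "active_till": since + seconds,
            # 0 = alerts paused, data still collected; 1 = no data collection
            "maintenance_type": 0 if collect_data else 1,
            "groups": [{"groupid": g} for g in group_ids],
            # timeperiod_type 0 = one-time period
            "timeperiods": [{
                "timeperiod_type": 0,
                "start_date": since,
                "period": seconds,
            }],
        },
        "id": 1,
    }


# Example: a two-hour window starting now, for a made-up group ID.
request = maintenance_window(
    "Hardware migration", datetime.now(timezone.utc),
    timedelta(hours=2), ["42"],
)
```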
Graphing data
Often we want to look at the history for a given host, or even a set of items across a group. There are several ways to get this data; here is one:
-
Go to Monitoring > Latest Data
-
Search for a host of interest
-
Scroll down to find the item name (or use the item filters on name/tag)
-
Click the "Graph" button on the right
-
Select the desired time range
For a group of hosts, the UI often fails to show everything when searching for hostgroups, so the process is:
-
Go to Monitoring > Latest Data
-
Search for one hostname in the group (e.g. proxy01)
-
Find the item of interest (e.g. Load average)
-
Copy the item to the Name search box, and change the Hosts search to be a Hostgroup search instead (e.g. Proxies)
-
Check the hosts you want to compare, and click "Display Graph" or "Display Stacked Graph" depending on how you want to see it
If you’re investigating an alert, you can get a graph of the relevant item directly from the Problem (on the Dashboard or Event Detail page). Click the problem, hover over History, and select the item; you’ll be taken directly to the graph for that problem.
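If the raw numbers are wanted outside the UI (for example to compare several hosts in a script), the same data is available from the `history.get` API method. The sketch below builds a request for the last few hours of one or more items. `history_request` is a hypothetical helper, the item IDs are placeholders that would come from `item.get`, and `value_type=0` assumes floating-point items such as load average.

```python
import time


def history_request(item_ids, hours, *, now=None, value_type=0):
    """Build a history.get request body covering the last `hours` hours.

    value_type selects the history table: 0 = float, 3 = unsigned integer.
    """
    end = int(now if now is not None else time.time())
    return {
        "jsonrpc": "2.0",
        "method": "history.get",
        "params": {
            "output": "extend",
            "history": value_type,
            "itemids": list(item_ids),
            "time_from": end - hours * 3600,
            "time_till": end,
            "sortfield": "clock",
            "sortorder": "ASC",
        },
        "id": 1,
    }
```

Each returned record carries an `itemid`, a `clock` timestamp, and a `value`, which is enough to plot or compare series across hosts.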
Resources
-
[1] Zabbix Prod UI: https://zabbix.fedoraproject.org/zabbix.php
-
[2] Zabbix Stg UI: https://zabbix.fedoraproject.org/zabbix.php