OpenQA Infrastructure SOP
OpenQA is an automated test system used to run validation tests on nightly and candidate Fedora composes, and also to run a subset of these tests on critical path updates.
OpenQA production instance: https://openqa.fedoraproject.org
OpenQA staging (lab) instance: https://openqa.stg.fedoraproject.org
Wiki page on Fedora openQA deployment: https://fedoraproject.org/wiki/OpenQA
Upstream project page: http://open.qa/
Upstream repositories: https://github.com/os-autoinst
Contact Information
- Owner: Fedora QA devel
- Contact: #fedora-qa, #fedora-admin, qa-devel mailing list
- People: Adam Williamson (adamwill / adamw), Lukas Ruzicka (lruzicka)
- Machines: see ansible inventory groups with 'openqa' in the name
- Purpose: run automated tests on VMs via screen recognition and VNC input
Architecture
Each openQA instance consists of a server (these are virtual machines) and one or more worker hosts (these are bare metal systems). The server schedules tests ("jobs", in openQA parlance) and stores results and associated data. The worker hosts run "jobs" and send the results back to the server. The server also runs some message consumers to handle automatic scheduling of jobs and reporting of results to external systems (ResultsDB and Wikitcms).
Server
The server runs a web UI for viewing scheduled, running and completed tests and their data, with an admin interface where many aspects of the system can be configured (though we do not use the web UI for several aspects of configuration). There are several separate services that run on each server, and communicate with each other mainly via dbus. Each server requires its own PostgreSQL database. The web UI and websockets server are made externally available via reverse proxying through an Apache server.
It hosts an NFS share that contains the tests, the 'needles' (screenshots with metadata as JSON files that are used for screen matching), and test 'assets' like ISO files and disk images. The path is /var/lib/openqa/share/factory.
In our deployment, the PostgreSQL database for each instance is hosted by the QA database server. Also, some paths on the server are themselves mounted as NFS shares from the infra storage server. This is so that these are not lost if the server is re-deployed, and can easily be backed up. These locations contain the data from each executed job. As both the database and these key data files are not actually stored on the server, the server can be redeployed from scratch without loss of any data (at least, this is the intent).
Also in our deployment, an openQA plugin (which we wrote, but which is part of the upstream codebase) is enabled which publishes messages on various events.
The server systems run a message consumer for the purpose of automatically scheduling jobs in response to the appearance of new composes and critical path updates, and one each for the purpose of reporting the results of completed jobs to ResultsDB and Wikitcms. These use the fm-consumer@ pattern from fedora-messaging.
Worker hosts
The worker hosts run several individual worker 'instances' (via systemd’s 'instantiated service' mechanism), each of which registers with the server and accepts jobs from it, uploading the results of the job and some associated data to the server on completion. The worker instances and server communicate both via a conventional web API provided by the server and via websockets. When a worker runs a job, it starts a qemu virtual machine (directly - libvirt is not used) and interacts with it via VNC and the serial console, following a set of steps dictating what it should do and what response it should expect in terms of screen contents or serial console output. The server 'pushes' jobs to the worker instances over a websocket connection.
Each worker host must mount the /var/lib/openqa/share/factory NFS share provided by the server. If this share is not mounted, any jobs run will fail immediately due to expected asset and test files not being found.
Some worker hosts for each instance are designated 'tap workers', meaning they run some advanced jobs which use software-defined networking (openvswitch) to interact with each other. All the configuration for this should be handled by the ansible plays, but it is useful to be aware that there is complex software-defined networking on these hosts which could potentially be the source of problems. There is some more detail on this in the wiki page and the upstream docs; refer to the ansible plays for the details of how it is actually configured.
Deployment and regular operation
Deployment and normal update of the openQA systems should run entirely through Ansible. Just running the appropriate ansible plays for the systems should complete the entire deployment / update process, though it is best to check after running them that there are no failed services on any of the systems (restart any that failed), and that the web UI is properly accessible.
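For reference, a typical run from batcave01 looks something like this (a sketch: groups/openqa.yml is the play referenced elsewhere in this SOP, while the worker playbook name here is an assumption and should be verified in the ansible repository):
# server(s)
sudo rbac-playbook groups/openqa.yml
# worker hosts (assumed playbook name, check the ansible repo)
sudo rbac-playbook groups/openqa-workers.yml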
Regular operation of the openQA deployments is entirely automated. Jobs should be scheduled and run automatically when new composes and critical path updates appear, and results should be reported to ResultsDB and Wikitcms (when appropriate). Dynamically generated assets should be regenerated regularly, including across release boundaries (see the section on createhdds below): no manual intervention should be required when a new Fedora release appears. If any of this does not happen, something is wrong, and manual inspection is needed.
Our usual practice is to upgrade the openQA systems to new Fedora releases promptly as they appear, using dnf system-upgrade. This is done manually. We usually upgrade the staging instance first and watch for problems for a week or two before upgrading production.
Rebooting / restarting
The optimal approach to rebooting an entire openQA deployment is as follows:
- Reboot the server
- Check for failed services (systemctl --failed) and restart any that failed
- Once the server is fully functional, reboot the worker hosts
- Check for failed services and restart any that failed, particularly the NFS mount service, on each worker host
- Check in the web UI for failed jobs and restart them, especially tests of updates
Rebooting the workers after the server is important due to the NFS share.
If only the server needs restarting, the entire procedure above should ideally be followed in any case, to ensure there are no issues with the NFS mount breaking due to the server reboot, or the server and worker getting confused about running jobs due to the websockets connections being restarted.
If only a worker host needs restarting, there is no need to restart the server too. Ideally, wait until no jobs are running on that host, and stop all openqa-worker@ services on the host before rebooting it; but in a pinch, if you reboot with running jobs, they should be automatically rescheduled. Still, you should manually check in the web UI for failed jobs and restart them.
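For example, on a worker host with ten worker instances (adjust the range to match the host), the worker services can be stopped before a reboot like this:
for i in {1..10}; do systemctl stop openqa-worker@$i.service; done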
There are two ways to check whether jobs are running and, if so, where. You can go to the web UI for the server and click 'All Tests'. If any jobs are running, you can open each one individually (click the link in the 'Test' column) and look at the 'Assigned worker', which will tell you which host the job is running on. Or, if you have admin access, you can go to the admin menu (top right of the web UI, once you are logged in), click on 'Workers' to see the status of all known workers for that server, and select 'Working' in the state filter box to show all workers currently working on a job.
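If you prefer the command line, the upstream openqa-client tool (see 'Scheduling jobs manually' below) can also query the workers API route; a sketch, assuming GET is the default method and the route is readable with your configured credentials:
openqa-client --host https://openqa.fedoraproject.org workers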
Troubleshooting
New tests not being scheduled
Check that fm-consumer@fedora_openqa_scheduler.service is enabled, running, and not crashing. If that doesn’t do the trick, the scheduler may be broken or the expected messages may not be being published.
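A quick way to check the consumer's status and recent logs on the server is:
systemctl status fm-consumer@fedora_openqa_scheduler.service
journalctl -u fm-consumer@fedora_openqa_scheduler.service --since "1 hour ago"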
Results not being reported to resultsdb and/or the wiki
Check that fm-consumer@fedora_openqa_resultsdb_reporter.service and fm-consumer@fedora_openqa_wiki_reporter.service are enabled, running, and not crashing.
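If either consumer has failed or is crash-looping, restart it and watch the logs, for instance:
systemctl restart fm-consumer@fedora_openqa_resultsdb_reporter.service fm-consumer@fedora_openqa_wiki_reporter.service
journalctl -f -u fm-consumer@fedora_openqa_resultsdb_reporter.service -u fm-consumer@fedora_openqa_wiki_reporter.service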
Services that write to the wiki keep crashing
If fm-consumer@fedora_openqa_wiki_reporter.service (and other services that write to the wiki, like the relval and relvalami consumers) are constantly failing/crashing, the API token may have been overwritten somehow. Re-run the relevant plays (on batcave01):
sudo rbac-playbook groups/openqa.yml -t openqa_dispatcher
If this does not sort it out, you may need help from a wiki admin to work out what’s going on.
Many tests failing on the same worker host, in unusual ways
Sometimes, worker hosts can just "go bad", through memory exhaustion, for instance. This usually manifests as unusual test failures (for instance, failures very early in a test that aren’t caused by invalid test files, tests that time out when they usually would not, or tests that seem to just die suddenly with a cryptic error message). If you encounter this, just reboot the affected worker host. This is more common on staging than production, as we intentionally run the older, weaker worker hosts on the staging instance. If things are particularly bad you may not be able to ssh into the host, and will need to reboot it from the sideband controller; if you’re not sure how to do this, contact someone from sysadmin-main for assistance.
Tests failing early, complaining about missing assets
If many tests are failing early with errors suggesting they can’t find required files, check for failed services on the worker hosts. Sometimes the NFS mount service fails and needs restarting.
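A sketch of checking and fixing this on a worker host; the mount unit name is derived from the mount path by systemd convention, but verify it with systemctl list-units before restarting:
systemctl --failed
systemctl restart var-lib-openqa-share-factory.mount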
Disk space issues: server local root
If a server is running out of space on its local root partition, the cause is almost certainly asset storage. Almost all the space on the server root partition is used by test assets (ISO and hard disk image files).
openQA has a system for limiting the amount of space used by asset storage, which we configure via ansible variables. Check the values of the openqa_assetsize* variables in the openQA server group variables in ansible. If the settings for the server sum to the amount of space used, or more than it, those settings may need to be reduced.
If there seems to be more space used than the settings would allow for, there may be an issue preventing the openQA task that actually enforces the limits from running: check the "Minion Dashboard" (from the top-right menu) in the openQA web UI and look for stuck or failed limit_assets tasks (or just check whether any have completed recently; the task is scheduled after each completed job, so it should run frequently). There is also an "Assets" link in the menu which gives you a web UI view of the limits on each job group and the current size and present assets, though note that the list of present assets and the current size are updated by the limit_assets task, so they will be inaccurate if that task is not being run successfully. You must be an openQA operator to access the "Assets" view, and an administrator to access the "Minion Dashboard".
In a pinch, if there is no space and tests are failing, you can wipe older, larger asset files in /var/lib/openqa/share/factory/iso and /var/lib/openqa/share/factory/hdd to get things moving again while you debug the issue. This is better than letting new tests fail.
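A quick way to see which assets are using the most space (review the list before deleting anything, and avoid removing assets that currently-running jobs may need):
du -sh /var/lib/openqa/share/factory/iso/* /var/lib/openqa/share/factory/hdd/* | sort -h | tail -n 20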
Disk space issues: testresults and images NFS share
As mentioned above, the server mounts two NFS shares from the infra storage server, at /var/lib/openqa/images and /var/lib/openqa/testresults (they are both actually backed by a single volume). These are where the screenshots, video and logs of the executed tests are stored. If they fill up, tests will start to fail.
openQA has a garbage collection mechanism which deletes (most) files from (most) jobs when they are six months old, which ought to keep usage of these shares in a steady state. However, if we enhance test coverage so openQA is running more tests in any given six month period than earlier ones, space usage will increase correspondingly. It can also increase in response to odd triggers like a bug which causes a lot of messages to be logged to a serial console, or a test being configured to upload a very large file as a log.
More importantly, there is a snapshot mechanism configured on this volume for the production instance, so space usage will always gradually increase there. When the volume gets too full, we must delete some older snapshots to free up space. This must be done by an infra storage admin. The volume’s name is fedora_openqa.
Scheduling jobs manually
While it is not normally necessary, you may sometimes need to run or re-run jobs manually.
The simplest cases can be handled by an admin from the web UI: for a logged-in admin, all scheduled and running tests can be cancelled (from various views), and all completed tests can be restarted. 'Restarting' a job actually effectively clones it and schedules the clone to be run: it creates a new job with a new job ID, and the previous job still exists. openQA attempts to handle complex cases of inter-dependent jobs correctly when restarting, but doesn’t always manage to do it right; when it goes wrong, the best thing to do is usually to re-run all jobs for that medium.
Restarting a job should cause its status indicator (the little colored blob) to go blue. If nothing changes, the restart likely failed. An error message should explain why, but it always appears at the top of the page, so you may need to scroll up to see it. If restarting a test fails because an asset (an ISO file or hard disk image) is missing, you will need to re-schedule the tests (see below).
To run or re-run the full set of tests for a compose or update, you can use the fedora-openqa CLI. To run or re-run tests for a compose, use:
fedora-openqa compose -f (COMPOSE LOCATION)
where (COMPOSE LOCATION) is the full URL of the /compose subdirectory of the compose. If you have an existing test to use as a reference, go to the Settings tab, and the URL will be set as the LOCATION setting. This will only work for Pungi-produced composes with the expected productmd-format metadata, and a couple of other quite special cases.
The -f argument means 'force', and is necessary to re-run tests: usually, the scheduler will refuse to re-schedule tests that have already run, and -f overrides this.
To run or re-run tests for an update, use:
fedora-openqa update -f (UPDATEID)
where (UPDATEID) is the update’s ID - something like FEDORA-2018-blahblah.
To run or re-run only the tests for a specific "flavor", you can pass the --flavor (update) or --flavors (compose) argument - for an update it must be a single flavor, for a compose it may be a single flavor or a comma-separated list. The names of the flavors are shown in the web UI results overview for the compose or update, e.g. "Server-boot-iso". For update tests, omit the leading "updates-" in the flavor name (so, to re-schedule the "updates-workstation" tests for an update, you would pass --flavor workstation).
Less commonly, you can schedule tests for scratch builds using fedora-openqa task and side tags using fedora-openqa tag. This should usually only be done on the staging instance. See the help of fedora-openqa for more details.
openQA provides a special script for cloning an existing job but optionally changing one or more variable values, which can be useful in some situations. Using it looks like this:
/usr/share/openqa/script/clone_job.pl --skip-download --from localhost 123 RAWREL=28
to clone job 123 with the RAWREL variable set to '28', for instance.
For interdependent jobs, you may or may not want to use the --skip-deps argument to avoid re-running the cloned job’s parent job(s), depending on circumstances.
In very odd circumstances you may need to schedule jobs via an API request using the low-level CLI client provided by upstream, openqa-client; see http://open.qa/docs/#_triggering_tests for details on this. You may need to refer to the schedule.py file in the fedora_openqa source to figure out exactly what settings to pass to the scheduler when doing this. It’s extremely unusual to have to do this, though, so probably don’t worry about it.
Manual updates
In general, updates to any of the components of the deployments should be handled via ansible: push the changes out in the appropriate way (git repo update, package update, etc.) and then run the ansible plays. There is an openqa_scratch variable which can be set to a list of Koji task IDs for scratch builds; these will be downloaded and configured as a side repository. This can be used to deploy a newer build of openQA and/or os-autoinst before it has reached updates-testing if desired (usually we would do this only on the staging instance). Also, the openqa_repo variable can be set to "updates-testing" to install or update openQA components with updates-testing enabled, to get a new version before it has waited a week to reach stable.
However, sometimes we do want to update or test a change to something manually for some reason. Here are some notes on those cases.
For updating openQA and/or os-autoinst packages: ideally, ensure no jobs are running. Then, update all installed subpackages on the server. The server services should be automatically restarted as part of the package update. Then, update all installed subpackages on the worker hosts. Usually this should cause the worker services to be restarted, but if not, a 'for' loop can help with that, for instance:
for i in {1..10}; do systemctl restart openqa-worker@$i.service; done
on a host with ten worker instances.
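The package update step itself is just a normal package update of the openQA and os-autoinst packages, for instance (a sketch; adjust the globs if the installed subpackages differ):
dnf update 'openqa*' 'os-autoinst*'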
For updating the openQA tests:
cd /var/lib/openqa/share/tests/fedora
git pull (or git checkout (branch) or whatever)
./fifloader.py -c -l templates.fif.json templates-updates.fif.json
The fifloader step is only necessary if there are any changes to the templates files.
For updating the scheduler code:
cd /root/fedora_openqa
git pull (or whatever changes)
python setup.py install
systemctl restart fm-consumer@fedora_openqa_scheduler.service
systemctl restart fm-consumer@fedora_openqa_resultsdb_reporter.service
systemctl restart fm-consumer@fedora_openqa_wiki_reporter.service
Updating other components of the scheduling process follows the same pattern: update the code or package, then remember to restart the message consumers. The openQA instances sometimes need fedfind updates before they have been pushed to stable: for example, when a new compose type is invented that fedfind does not understand, openQA can end up trying to schedule tests for it, or the scheduler consumer can crash. When this happens we have to fix and update fedfind on the openQA instances as soon as possible.
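For example, to pull a fixed fedfind build from updates-testing and restart the scheduler consumer (a sketch, assuming the package name is python3-fedfind):
dnf --enablerepo=updates-testing update python3-fedfind
systemctl restart fm-consumer@fedora_openqa_scheduler.service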
Logging
Just about all useful logging information for all aspects of openQA and the scheduling and report tools is logged to the journal, except that the Apache server logs may be of interest in debugging issues related to accessing the web UI or websockets server. To get more detailed logging from openQA components, change the logging level in /etc/openqa/openqa.ini from 'info' to 'debug' and restart the relevant services. Any run of the Ansible plays will reset this back to 'info'.
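For reference, the relevant setting in /etc/openqa/openqa.ini looks like this (edit the existing entry rather than adding a duplicate section):
[logging]
level = debug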
Occasionally the test execution logs may be useful in figuring out why all tests are failing very early, or some specific tests are failing due to an asset going missing, etc. Each job’s execution logs can be accessed through the web UI, on the Logs & Assets tab of the job page; the files are autoinst-log.txt and worker-log.txt.
Dynamic asset generation (createhdds)
Some of the hard disk image file 'assets' used by the openQA tests are created by a tool called createhdds, which is checked out of a git repo to /root/createhdds on the servers and also on some guests. This tool uses virt-install and the Python bindings for libguestfs to create various hard disk images the tests need to run. It is usually run in two different ways. The ansible plays run it in a mode where it will only create expected images that are entirely missing: this is mainly meant to facilitate initial deployment. The plays also install a file to /etc/cron.daily causing it to be run daily in a mode where it will also recreate images that are 'too old' (the age-out conditions for images are part of the tool itself).
This process isn’t 100% reliable; virt-install can sometimes fail, either just quasi-randomly or every time, in which case the cause of the failure needs to be figured out and fixed so the affected image can be (re-)built. This kind of failure is quite "invisible", as when regeneration of an image fails, we just keep the old version; this might be the problem if update tests start failing because the initial update to bring the system fully up to date times out, for instance.
The images for each arch are built on one worker host of that arch (nominated by inclusion in an ansible inventory group that exists for this purpose); those hosts have write access to the NFS share for this purpose.
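If an image needs to be rebuilt by hand, the tool can be run manually on the nominated host for that arch; a sketch, assuming the checkout lives at /root/createhdds there, that the images should land in the shared hdd directory, and that the 'all' mode is the right one (check ./createhdds.py --help for the exact invocation):
cd /var/lib/openqa/share/factory/hdd
/root/createhdds/createhdds.py all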
Compose check reports (check-compose)
An additional ansible role runs on each openQA server, called check-compose. This role installs a tool (also called check-compose) and an associated message consumer. The consumer kicks in when all openQA tests for any compose finish, and uses the check-compose tool to send out an email report summarizing the results of the tests (well, the production server sends out emails, the staging server just logs the contents of the report). This role isn’t really a part of openQA proper, but is run on the openQA servers as it seems like as good a place as any to do it. As with all other message consumers, if making manual changes or updates to the components, remember to restart the consumer service afterwards.
Autocloud ResultsDB forwarder (autocloudreporter)
An ansible role called autocloudreporter also runs on the openQA production server. This has nothing to do with openQA at all, but is run there for convenience. This role deploys a fedmsg consumer that listens for fedmsgs indicating that Autocloud (a separate automated test system which tests cloud images) has completed a test run, then forwards those results to ResultsDB.