OpenQA Infrastructure SOP

OpenQA is an automated test system used to run validation tests on nightly and candidate Fedora composes, and also to run a subset of these tests on critical path updates.

OpenQA production instance: https://openqa.fedoraproject.org

OpenQA staging (lab) instance: https://openqa.stg.fedoraproject.org

Wiki page on Fedora openQA deployment: https://fedoraproject.org/wiki/OpenQA

Upstream project page: http://open.qa/

Upstream repositories: https://github.com/os-autoinst

Contact Information

Owner

Fedora QA devel

Contact

#fedora-qa, #fedora-admin, qa-devel mailing list

People

Adam Williamson (adamwill / adamw), Lukas Ruzicka (lruzicka)

Machines

See the ansible inventory groups with 'openqa' in the name

Purpose

Run automated tests on VMs via screen recognition and VNC input

Architecture

Each openQA instance consists of a server (these are virtual machines) and one or more worker hosts (these are bare metal systems). The server schedules tests ("jobs", in openQA parlance) and stores results and associated data. The worker hosts run "jobs" and send the results back to the server. The server also runs some message consumers to handle automatic scheduling of jobs and reporting of results to external systems (ResultsDB and Wikitcms).

Server

The server runs a web UI for viewing scheduled, running and completed tests and their data, with an admin interface where many aspects of the system can be configured (though we manage several aspects of configuration outside the web UI). There are several separate services that run on each server, and communicate with each other mainly via dbus. Each server requires its own PostgreSQL database. The web UI and websockets server are made externally available via reverse proxying through an Apache server.

It hosts an NFS share that contains the tests, the 'needles' (screenshots with metadata as JSON files that are used for screen matching), and test 'assets' like ISO files and disk images. The path is /var/lib/openqa/share/factory.

In our deployment, the PostgreSQL database for each instance is hosted by db-openqa01. Also, some paths on the server are themselves mounted as NFS shares from the infra storage server, so that their contents survive a re-deployment of the server and can easily be backed up. These locations contain the data from each executed job. Since neither the database nor these key data files are actually stored on the server, the server can be redeployed from scratch without loss of any data; we have done this successfully several times.

Also in our deployment, an openQA plugin that publishes messages on various events is enabled (we wrote this plugin, but it is part of the upstream codebase).

The server systems run a message consumer that automatically schedules jobs in response to the appearance of new composes and critical path updates, and one each for reporting the results of completed jobs to ResultsDB and Wikitcms. These use the fm-consumer@ pattern from fedora-messaging.

Worker hosts

The worker hosts run several individual worker 'instances' (via systemd’s 'instantiated service' mechanism), each of which registers with the server and accepts jobs from it, uploading the results of the job and some associated data to the server on completion. The worker instances and server communicate both via a conventional web API provided by the server and via websockets. When a worker runs a job, it starts a qemu virtual machine (directly - libvirt is not used) and interacts with it via VNC and the serial console, following a set of steps dictating what it should do and what response it should expect in terms of screen contents or serial console output. The server 'pushes' jobs to the worker instances over a websocket connection.
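For example, you can inspect the worker instances on a worker host like this (a minimal sketch, assuming the upstream openqa-worker@ unit name; the instance numbers configured on any given host are defined in ansible, and instance 1 here is illustrative):

# list all worker instances and their state on this host
systemctl list-units 'openqa-worker@*'
# inspect and, if needed, restart a single instance
systemctl status openqa-worker@1.service
sudo systemctl restart openqa-worker@1.service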

Each worker host must mount the /var/lib/openqa/share/factory NFS share provided by the server. If this share is not mounted, any jobs run will fail immediately due to expected asset and test files not being found.
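To verify the mount on a worker host:

# confirm the factory share is mounted, and from which server
findmnt /var/lib/openqa/share/factory
# if it is missing, remount everything from fstab
sudo mount -a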

Some worker hosts for each instance are designated 'tap workers', meaning they run some advanced jobs that use software-defined networking (openvswitch) to interact with each other. All the configuration for this should be handled by the ansible plays, but it is useful to be aware that complex software-defined networking is in play on these hosts and could potentially be the source of problems. There is some more detail on this in the wiki page and the upstream docs; refer to the ansible plays for the details of how it is actually configured.

Deployment and regular operation

Deployment and normal update of the openQA systems should run entirely through Ansible. Just running the appropriate ansible plays for the systems should complete the entire deployment / update process, though it is best to check after running them that there are no failed services on any of the systems (restart any that failed), and that the web UI is properly accessible.
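For reference, a typical run from batcave01 looks like this (a sketch: the full plays, run without tags, cover the whole deployment / update process; tagged invocations of the same plays appear later in this document):

sudo rbac-playbook groups/openqa.yml
sudo rbac-playbook groups/openqa-workers.yml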

Regular operation of the openQA deployments is entirely automated. Jobs should be scheduled and run automatically when new composes and critical path updates appear, and results should be reported to ResultsDB and Wikitcms (when appropriate). Dynamically generated assets should be regenerated regularly, including across release boundaries (see the section on createhdds below): no manual intervention should be required when a new Fedora release appears. If any of this does not happen, something is wrong, and manual inspection is needed.

Our usual practice is to upgrade the openQA systems to new Fedora releases promptly as they appear, using dnf system-upgrade. This is done manually. We usually upgrade the staging instance first and watch for problems for a week or two before upgrading production.
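For instance, upgrading a host to a hypothetical release 43 looks like this (standard dnf-plugin-system-upgrade usage; the release number is illustrative):

sudo dnf system-upgrade download --releasever=43
sudo dnf system-upgrade reboot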

Rebooting / restarting

The optimal approach to rebooting an entire openQA deployment is as follows:

  1. Reboot the server

  2. Check for failed services (systemctl --failed) and restart any that failed

  3. Once the server is fully functional, reboot the worker hosts

  4. Check for failed services and restart any that failed, particularly the NFS mount service, on each worker host

  5. Check in the web UI for failed jobs and restart them, especially tests of updates

Rebooting the workers after the server is important due to the NFS share.
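A minimal shell sketch of steps 1-4 (hostnames are illustrative; step 5 is done in the web UI):

# 1. reboot the server and wait for it to return
ssh openqa01.example.org sudo reboot
# 2. once it is back, check for failed services and restart any
ssh openqa01.example.org systemctl --failed
# 3. and 4. then do the same for each worker host in turn
ssh worker01.example.org sudo reboot
ssh worker01.example.org systemctl --failed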

If only the server needs restarting, the entire procedure above should ideally be followed in any case, to ensure there are no issues with the NFS mount breaking due to the server reboot, or the server and worker getting confused about running jobs due to the websockets connections being restarted.

If only a worker host needs restarting, there is no need to restart the server too. Ideally, wait until no jobs are running on that host and stop all openqa-worker@ services on it before rebooting; but in a pinch, if you reboot with jobs running, they should be automatically rescheduled. Still, you should manually check the web UI for failed jobs and restart them.

There are two ways to check whether jobs are running and, if so, where. You can go to the web UI for the server and click 'All Tests'. If any jobs are running, you can open each one individually (click the link in the 'Test' column) and look at the 'Assigned worker', which tells you which host the job is running on. Or, if you have admin access, you can go to the admin menu (top right of the web UI, once you are logged in), click 'Workers' to see the status of all known workers for that server, and select 'Working' in the state filter box to show only the workers currently running a job.

Creating needles

openQA has a "Developer mode" for creating new needles. You need to be an openQA admin or operator to use it. It's usually best to do this on the staging instance. On a running test, go to the Live View tab. You should see a "Developer mode" box above the test video. Click it, then change "Pause on screen mismatch" to "assert_screen timeout" and click "Confirm to control this test". Now, the next time a screen match assertion fails, the test will pause and a button to open the needle editor will be shown.

On the needle editor screen you can select an existing needle to use as a base at the top right. It's easiest if an existing needle is nearly matched - in this case you can probably just select it and hit Save. Otherwise, it is still a good idea to pick an existing needle to start from, as it will at least give you the tags and filename of that needle, which are probably a good base. If you don't need to change the tags, just change the match areas (unless they are already correct), tweak the filename, and hit Save.

The saved needle will be present on the server in /var/lib/openqa/share/tests/fedora/needles. Copy it off the server to your local system, into a checkout of the os-autoinst-distri-fedora repository, and place it in the appropriate subdirectory of needles/ - probably the same one as existing needles of the same type. Create a commit with the needle. If you have commit rights you can push it out directly; if you don't, or you want someone to check your work, create a pull request in the usual way. Once the needle is merged, go back to the server, do a git pull, and remove the copy of the needle that's directly in the needles/ directory. Remember to also pull the needle onto the other server instance! If the staging instance is not on the main branch, rebase whatever branch it's on (from your local system), then do git fetch origin followed by git reset --hard origin/<branch> on staging.
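A sketch of that workflow (the hostname, needle name, and target subdirectory are illustrative; needles are a JSON and PNG pair):

# from the root of your local os-autoinst-distri-fedora checkout
scp 'openqa01.example.org:/var/lib/openqa/share/tests/fedora/needles/mynewneedle-*' needles/anaconda/
git add needles/anaconda/mynewneedle-*
git commit -m "add mynewneedle" && git push
# once merged, on the server: update the checkout and drop the editor's copy
ssh openqa01.example.org
cd /var/lib/openqa/share/tests/fedora
sudo git pull
sudo rm needles/mynewneedle-*.json needles/mynewneedle-*.png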

Handling Fedora branch events

Branching of a new release is quite disruptive to openQA and requires some manual handling. It is best to collaborate closely with the release engineering team on this.

In general, tests for the newly-branched release and Rawhide may have issues until a first Branched compose and a first post-branching compose of Rawhide are both done and synced to mirrors, MirrorManager's configuration is correctly updated, and the relval release metadata is updated. These steps are all covered in the mass branching guide, but be aware that they need to happen, and work with releng to monitor them. You may need to edit the relval release metadata if the release engineer does not have access to it. On the openQA side, new base disk images need to be created for both the new Branched release and for Rawhide. This can only be done once the composes are complete and mirrored and the relval metadata is updated.

Until the relval metadata is updated, openQA will believe the to-be-branched release number is Rawhide, and any updates for the new Rawhide release number (one number higher) will confuse it. Once the metadata is updated, openQA will believe the to-be-branched release number is Branched and the new Rawhide release number is Rawhide; if the change is made too early, tests of updates for the new Branched number may fail due to the expected repositories not existing, etc. There is no perfect time to update the metadata, so just be aware that you may see odd behavior related to this during the branching process.

Update tests for Rawhide updates created after branching will expect base disk images with the new release number, and so will all fail until these are created. Update tests for the new Branched release will attempt to run with the existing base disk images for that release number, but because these will be Rawhide images (and have Rawhide repository configuration), various things will fail until new images are built from the new Branched compose.

The simplest way to create the appropriate base disk images is to wait until both Branched and Rawhide composes are done and mirrored and the relval metadata is updated, then delete all existing base images for the new branched release number from /var/lib/openqa/share/factory/hdd/fixed (you can do staging first as a test, then prod if staging works OK). Then run the openQA worker playbook, which will cause the necessary images to be built (this takes some time). You can also do things more manually by ssh’ing into the worker hosts that are configured to build disk images (see openqa_hdds_workers group in ansible), going to /var/lib/openqa/share/factory/hdd/fixed, and running /root/createhdds/createhdds.py commands.
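A sketch of the manual route (the image filename pattern and the exact createhdds.py invocation are assumptions; check the script's --help on the host):

# on a worker host in the openqa_hdds_workers group
cd /var/lib/openqa/share/factory/hdd/fixed
# remove the stale base images for the newly-branched release number (pattern illustrative)
sudo rm disk_f43_*.qcow2
# rebuild whatever is now missing
sudo /root/createhdds/createhdds.py all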

At the time a test is scheduled, several release-number-related variables are set based on info from fedfind, especially RAWREL, which is set to whatever number relval currently believes is Rawhide. If this value is wrong, various tests will fail. Watch out for tests that still have RAWREL set to the old Rawhide (now Branched) number after branching is complete. If this happens, you need to re-schedule the tests, not just re-run them, because re-running a test does not change its variables. Run:

fedora-openqa update -f <updateid>

to re-schedule.

If tests are failing for reasons that look related to mirroring, ask releng to look at the problems and fix the mirror config.

As part of branching, new fedora-repos and fedora-release packages must be built, but there may be a catch-22 situation where they fail tests while we need them pushed stable in order to proceed with branching and get back to a point where tests will pass. If that happens, just waive the failures in Bodhi to let the update go through.
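If you have the necessary Bodhi privileges, one way to waive is from the CLI (a sketch assuming bodhi-client's updates waive subcommand; the update ID is illustrative, and you can also waive from the update's page in the Bodhi web UI):

# waive the unsatisfied gating requirements for the update, with a reason
bodhi updates waive FEDORA-2025-0123456789 "failures caused by branching, not by the update"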

The desktop_background test will fail for the newly-branched release because there will be no needles for it yet (we do not test the background on Rawhide). If the backgrounds for the new release are already present and working, just create a new needle in needles/background and commit it. If the new backgrounds are not ready yet, you can either temporarily disable gating for the test (comment out the line for it in the infra ansible template roles/openshift-apps/greenwave/templates/fedora.yaml.j2 and re-run the relevant plays; you may need help from someone with sysadmin-main access to do this), or temporarily 'short-circuit' the test for the Branched release in the same way it is short-circuited for Rawhide, so it (incorrectly) passes.

Install tests on Rawhide updates will fail until new version_NN_ident needles are created - there is a check that the installer correctly shows the release number. Find a test failing on version_NN_ident and create the needle, then commit it.

Throughout the process, keep one Branched update and one Rawhide update handy to use as 'canaries': keep re-running the tests on them until they all pass. Once you have all the tests passing for a release, re-run (or re-schedule, if necessary) all failed tests on other updates for the same release. It is important to ensure there are no failures for any update (unless, of course, it's a "real" failure, not a result of branching). Note that the ostree tests are not gating, so it is most important to fix the other tests first; the ostree tests can be fixed a bit more slowly if necessary.

Troubleshooting

New tests not being scheduled

Check that fm-consumer@fedora_openqa_scheduler.service is enabled, running, and not crashing. If that doesn’t do the trick, the scheduler may be broken or the expected messages may not be being published.
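The same check pattern works for all the fm-consumer@ services mentioned in this document:

# on the server
systemctl status fm-consumer@fedora_openqa_scheduler.service
# look for tracebacks in the consumer's recent logs
journalctl -u fm-consumer@fedora_openqa_scheduler.service --since -1h
sudo systemctl restart fm-consumer@fedora_openqa_scheduler.service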

Results not being reported to resultsdb and/or the wiki

Check that fm-consumer@fedora_openqa_resultsdb_reporter.service and fm-consumer@fedora_openqa_wiki_reporter.service are enabled, running, and not crashing.

Services that write to the wiki keep crashing

If fm-consumer@fedora_openqa_wiki_reporter.service (and other services that write to the wiki, like the relval and relvalami consumers) are constantly failing/crashing, the API token may have been overwritten somehow. Re-run the relevant plays (on batcave01):

sudo rbac-playbook groups/openqa.yml -t openqa_dispatcher

If this does not sort it out, you may need help from a wiki admin to work out what’s going on.

Many tests failing on the same worker host, in unusual ways

Sometimes, worker hosts can just "go bad", through memory exhaustion, for instance. This usually manifests as unusual test failures (for instance, failures very early in a test that aren’t caused by invalid test files, tests that time out when they usually would not, or tests that seem to just die suddenly with a cryptic error message). If you encounter this, just reboot the affected worker host. This is more common on staging than production, as we intentionally run the older, weaker worker hosts on the staging instance. If things are particularly bad you may not be able to ssh into the host, and will need to reboot it from the sideband controller; if you’re not sure how to do this, contact someone from sysadmin-main for assistance.

Tests failing early, complaining about missing assets

If many tests are failing early with errors suggesting they can’t find required files, check for failed services on the worker hosts. Sometimes the NFS mount service fails and needs restarting.
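A quick check and fix on the affected worker host (the mount unit name below is derived from the mount point by systemd's naming rules, so it should look like this, but verify with systemctl list-units --type=mount):

systemctl --failed
sudo systemctl restart var-lib-openqa-share-factory.mount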

Tests failing early, complaining "could not configure /dev/net/tun (tapXX): Operation not permitted"

Run sudo rbac-playbook groups/openqa-workers.yml -t openqa_worker (on batcave01). Limit it to the specific affected hosts if you like. This happens when the qemu package is updated and /usr/bin/qemu-system-<arch> loses the capabilities it needs to configure tap devices (these capabilities are set by the playbook).
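You can confirm the diagnosis before and after running the play with getcap (the exact capability set is defined in the playbook; the value in the comment is an assumption):

# empty output means the file capabilities were lost in a qemu update;
# after the play, expect something like: /usr/bin/qemu-system-x86_64 cap_net_admin=ep
getcap /usr/bin/qemu-system-x86_64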

Disk space issues: server local root

If a server is running out of space on its local root partition, the cause is almost certainly asset storage. Almost all the space on the server root partition is used by test assets (ISO and hard disk image files).

openQA has a system for limiting the amount of space used by asset storage, which we configure via ansible variables. Check the values of the openqa_assetsize* variables in the openQA server group variables in ansible. If the settings for the server sum to the amount of space used, or more than it, those settings may need to be reduced.

If there seems to be more space used than the settings would allow for, there may be an issue preventing the openQA task that actually enforces the limits from running: check the "Minion Dashboard" (from the top-right menu) in the openQA web UI and look for stuck or failed limit_assets tasks (or just check whether any have completed recently; the task is scheduled after each completed job, so it should run frequently).

There is also an "Assets" link in the menu which gives you a web UI view of the limits on each job group and the current size and present assets. Note, though, that the list of present assets and the current size are updated by the limit_assets task, so they will be inaccurate if that task is not running successfully. You must be an openQA operator to access the "Assets" view, and an administrator to access the "Minion Dashboard".

In a pinch, if there is no space and tests are failing, you can wipe older, larger asset files in /var/lib/openqa/share/factory/iso and /var/lib/openqa/share/factory/hdd to get things moving again while you debug the issue. This is better than letting new tests fail.
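To find the biggest candidates (paths as above):

# total usage per asset directory
du -sh /var/lib/openqa/share/factory/iso /var/lib/openqa/share/factory/hdd
# the largest individual files first
ls -lhS /var/lib/openqa/share/factory/hdd | head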

Disk space issues: testresults and images NFS share

As mentioned above, the server mounts two NFS shares from the infra storage server, at /var/lib/openqa/images and /var/lib/openqa/testresults (they are both actually backed by a single volume). These are where the screenshots, video and logs of the executed tests are stored. If they fill up, tests will start to fail.
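To check the current usage of these shares from the server:

df -h /var/lib/openqa/images /var/lib/openqa/testresults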

openQA has a garbage collection mechanism which deletes (most) files from (most) jobs when they are six months old, which ought to keep usage of these shares in a steady state. However, if we enhance test coverage so that openQA runs more tests in any given six-month period than in earlier ones, space usage will increase correspondingly. It can also increase in response to odd triggers, like a bug that causes a lot of messages to be logged to a serial console, or a test being configured to upload a very large file as a log.

More importantly, there is a snapshot mechanism configured on this volume for the production instance, so space usage will always gradually increase there. When the volume gets too full, we must delete some older snapshots to free up space. This must be done by an infra storage admin. The volume’s name is fedora_openqa.

Scheduling jobs manually

While it is not normally necessary, you may sometimes need to run or re-run jobs manually.

The simplest cases can be handled by an admin from the web UI: for a logged-in admin, all scheduled and running tests can be cancelled (from various views), and all completed tests can be restarted. 'Restarting' a job actually effectively clones it and schedules the clone to be run: it creates a new job with a new job ID, and the previous job still exists. openQA attempts to handle complex cases of inter-dependent jobs correctly when restarting, but doesn’t always manage to do it right; when it goes wrong, the best thing to do is usually to re-run all jobs for that medium.

Restarting a job should cause its status indicator (the little colored blob) to go blue. If nothing changes, the restart likely failed. An error message should explain why, but it always appears at the top of the page, so you may need to scroll up to see it. If restarting a test fails because an asset (an ISO file or hard disk image) is missing, you will need to re-schedule the tests (see below).

To run or re-run the full set of tests for a compose or update, you can use the fedora-openqa CLI. To run or re-run tests for a compose, use:

fedora-openqa compose -f (COMPOSE LOCATION)

where (COMPOSE LOCATION) is the full URL of the /compose subdirectory of the compose. If you have an existing test to use as a reference, its Settings tab shows this URL as the LOCATION setting. This will only work for Pungi-produced composes with the expected productmd-format metadata, and a couple of other quite special cases.

The -f argument means 'force', and is necessary to re-run tests: usually, the scheduler will refuse to re-schedule tests that have already run, and -f overrides this.
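For example, to (re-)run all tests for a Rawhide compose (the compose URL shown is illustrative):

fedora-openqa compose -f https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20250101.n.0/compose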

To run or re-run tests for an update, use:

fedora-openqa update -f (UPDATEID)

where (UPDATEID) is the update’s ID - something like FEDORA-2018-blahblah.

To run or re-run only the tests for a specific "flavor", you can pass the --flavor (update) or --flavors (compose) argument - for an update it must be a single flavor, for a compose it may be a single flavor or a comma-separated list. The names of the flavors are shown in the web UI results overview for the compose or update, e.g. "Server-boot-iso". For update tests, omit the leading "updates-" in the flavor name (so, to re-schedule the "updates-workstation" tests for an update, you would pass --flavor workstation).
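For example (the update ID and compose URL are illustrative, and the exact argument placement is an assumption; check fedora-openqa --help):

# re-run only the workstation flavor of update tests
fedora-openqa update -f --flavor workstation FEDORA-2025-0123456789
# re-run two flavors of compose tests
fedora-openqa compose -f --flavors Server-boot-iso,Server-dvd-iso https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20250101.n.0/compose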

Less commonly, you can schedule tests for scratch builds using fedora-openqa task and side tags using fedora-openqa tag. This should usually only be done on the staging instance. See the help of fedora-openqa for more details.

openQA provides a special script for cloning an existing job but