Layered Image Build System

The Fedora Layered Image Build System, often referred to as OSBS (OpenShift Build Service) as that is the upstream project that this is based on, is used to build Layered Container Images in the Fedora Infrastructure via Koji.

Contact Information

Owner

Clement Verna (cverna)

Contact

#fedora-admin, #fedora-releng, #fedora-noc, sysadmin-main, sysadmin-releng

Location

osbs-control01, osbs-master01, osbs-node01, osbs-node02, registry.fedoraproject.org, candidate-registry.fedoraproject.org

osbs-control01.stg, osbs-master01.stg, osbs-node01.stg, osbs-node02.stg, registry.stg.fedoraproject.org, candidate-registry.stg.fedoraproject.org

x86_64 koji buildvms

Purpose

Layered Container Image Builds

Overview

The build system is set up so that Fedora Layered Image maintainers submit a build to Koji with the fedpkg container-build command from a container namespace within DistGit. This triggers the build to be scheduled in OpenShift via the osbs-client tooling, which creates a custom OpenShift Build that uses our pre-made buildroot container image. The Atomic Reactor (atomic-reactor) utility runs within the buildroot and prepares the build container where the actual build action executes. It also uploads the Content Generator metadata back to Koji and uploads the built image to the candidate docker registry. All of this runs on a host with iptables rules restricting access to the docker bridge, which further limits the access of the buildroot to the outside world and ensures that all sources of information come from Fedora.

Completed layered image builds are hosted in a candidate docker registry which is then used to pull the image and perform tests.

Setup

The Layered Image Build System setup is currently as follows (more detailed view available in the RelEng Architecture Document):

=== Layered Image Build System Overview ===

     +--------------+                           +-----------+
     |              |                           |           |
     |   koji hub   +----+                      |  batcave  |
     |              |    |                      |           |
     +--------------+    |                      +----+------+
                         |                           |
                         V                           |
             +----------------+                      V
             |                |           +----------------+
             |  koji builder  |           |                +-----------+
             |                |           | osbs-control01 +--------+  |
             +-+--------------+           |                +-----+  |  |
               |                          +----------------+     |  |  |
               |                                                 |  |  |
               |                                                 |  |  |
               |                                                 |  |  |
               V                                                 |  |  |
    +----------------+                                           |  |  |
    |                |                                           |  |  |
    | osbs-master01  +------------------------------+           [ansible]
    |                +-------+                      |            |  |  |
    +----------------+       |                      |            |  |  |
         ^                   |                      |            |  |  |
         |                   |                      |            |  |  |
         |                   V                      V            |  |  |
         |        +-----------------+       +----------------+   |  |  |
         |        |                 |       |                |   |  |  |
         |        |  osbs-node01    |       |  osbs-node02   |   |  |  |
         |        |                 |       |                |   |  |  |
         |        +-----------------+       +----------------+   |  |  |
         |               ^                           ^           |  |  |
         |               |                           |           |  |  |
         |               |                           +-----------+  |  |
         |               |                                          |  |
         |               +------------------------------------------+  |
         |                                                             |
         +-------------------------------------------------------------+

Deployment

From batcave you can run the following

$ sudo rbac-playbook groups/osbs/deploy-cluster.yml

This deploys the OpenShift cluster used by OSBS. Currently the playbook deploys two clusters (x86_64 and aarch64). Ansible tags can be used to deploy only one of them if needed, for example osbs-x86-deploy-openshift.
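As a sketch of a tagged run, the snippet below builds the command line for deploying only the x86_64 cluster. It assumes rbac-playbook accepts -t for tags in the same way as the buildvm invocation later in this document; the command is printed rather than executed so it can be reviewed first.

```shell
# Tag named in the text above; assumed to limit the run to the x86_64 cluster.
tag=osbs-x86-deploy-openshift

# Build the command as a string and print it (dry run, nothing is executed).
cmd="sudo rbac-playbook groups/osbs/deploy-cluster.yml -t $tag"
echo "$cmd"
```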

If the openshift-ansible playbook fails it can be easier to run it directly from osbs-control01 and use the verbose mode.

$ ssh osbs-control01.iad2.fedoraproject.org
$ sudo -i
# cd /root/openshift-ansible
# ansible-playbook -i cluster-inventory playbooks/prerequisites.yml
# ansible-playbook -i cluster-inventory playbooks/deploy_cluster.yml

Once these playbooks have completed successfully, you can configure OSBS on the cluster with the following playbook:

$ sudo rbac-playbook groups/osbs/configure-osbs.yml

When this is done we need to get the new koji service token and update its value in the private repository

$ ssh osbs-master01.iad2.fedoraproject.org
$ sudo -i
# oc -n osbs-fedora sa get-token koji
dsjflksfkgjgkjfdl ....

The token needs to be saved in the private ansible repo in files/osbs/production/x86-64-osbs-koji. Once this is done you can run the builder playbook to update that token.

$ sudo rbac-playbook groups/buildvm.yml -t osbs

Operation

Koji Hub schedules the containerBuild on a koji builder via the koji-containerbuild-hub plugin. The builder then submits the build to OpenShift via the koji-containerbuild-builder plugin, which uses the osbs-client python API. That API wraps the OpenShift API along with a custom OpenShift Build JSON payload.

The Build is then scheduled in OpenShift and its logs are captured by the koji plugins. Inside the buildroot, atomic-reactor uploads the built container image and provides the metadata to koji’s content generator.

Outage

If Koji is down, then builds can’t be scheduled but repairing Koji is outside the scope of this document.

If either the candidate-registry.fedoraproject.org or the registry.fedoraproject.org container registry is unavailable, builds that depend on it will fail, but repairing the registries is also outside the scope of this document.

OSBS Failures

OSBS itself has several known failure modes. The recovery procedures are listed below.

Ran out of disk space

Docker uses a lot of disk space, and while the osbs-nodes have been allocated what is considered to be ample disk space for builds (since they are automatically cleaned up periodically) it is possible this will run out.
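Before cleaning up, it can help to confirm that disk usage really is the problem. The following is a minimal sketch, not part of the original procedure: the usage_percent helper and the 90% threshold are hypothetical, and /var/lib/docker is assumed to be the Docker data root (the upstream default, which may not match the storage setup on the osbs-nodes).

```shell
# usage_percent PATH -> prints the use percentage of the filesystem holding PATH
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical alert threshold, in percent.
threshold=90

# Assumed Docker data root; fall back to / if the directory does not exist.
target=/var/lib/docker
[ -d "$target" ] || target=/

used=$(usage_percent "$target")
if [ "$used" -ge "$threshold" ]; then
    echo "disk usage on $target is ${used}%: cleanup recommended"
else
    echo "disk usage on $target is ${used}%: ok"
fi
```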

To resolve this, run the following commands:

# These commands will clean up old/dead docker containers and dangling
# images left behind by old OpenShift Pods

$ for i in $(sudo docker ps -a | awk '/Exited/ { print $1 }'); do sudo docker rm $i; done

$ for i in $(sudo docker images -q -f 'dangling=true'); do sudo docker rmi $i; done


# This command should only be run on osbs-master01 (it won't work on the
# nodes)
#
# This command will clean up old builds and related artifacts in OpenShift
# that are older than 30 days (We can get more aggressive about this if
# necessary, the main reason these still exist is in the event we need to
# debug something. All build info we care about is stored in Koji.)

$ oadm prune builds --orphans --keep-younger-than=720h0m0s --confirm
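To see what the cleanup commands above select without touching a real node, the sketch below runs the same awk filter over a made-up sample of docker ps -a output and shows how the 30 day retention maps to the 720h0m0s value. The sample rows, container IDs, and variable names are hypothetical.

```shell
# Hypothetical `docker ps -a` output; the IDs and names are made up.
sample_ps='CONTAINER ID  IMAGE      STATUS                    NAMES
3f2a9c1d7b10  buildroot  Exited (0) 2 days ago     old-build-1
9b8e7d6c5a43  buildroot  Up 5 hours                running-build
c0ffee123456  buildroot  Exited (137) 3 weeks ago  old-build-2'

# Same awk filter as the cleanup loop: keep column 1 of lines matching "Exited".
exited_ids=$(printf '%s\n' "$sample_ps" | awk '/Exited/ { print $1 }')
printf '%s\n' "$exited_ids"

# The prune window is expressed in hours: 30 days * 24 hours per day.
keep_days=30
keep_window="$((keep_days * 24))h0m0s"
echo "--keep-younger-than=$keep_window"
```

Only the two exited containers are selected; the running one is left alone. On a real node, docker ps -aq --filter status=exited should give an equivalent list without the awk step.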

A node is broken, how to remove it from the cluster?

If a node is having an issue, the following command marks it as unschedulable, which effectively removes it from the cluster temporarily.

In this example, we are removing osbs-node01

$ oadm manage-node osbs-node01.phx2.fedoraproject.org --schedulable=false
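Since the removal is temporary, the node has to be made schedulable again once it is repaired. The sketch below prints both commands side by side; node_schedulable_cmd is a hypothetical helper that only builds the command string and does not contact the cluster.

```shell
# node_schedulable_cmd NODE STATE -> prints the oadm command for that state
node_schedulable_cmd() {
    printf 'oadm manage-node %s --schedulable=%s\n' "$1" "$2"
}

node=osbs-node01.phx2.fedoraproject.org
node_schedulable_cmd "$node" false   # stop scheduling new builds on the node
node_schedulable_cmd "$node" true    # return it to service after the repair
```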

Container Builds are unable to access resources on the network

Sometimes the Container Builds will fail and the logs will show that the buildroot is unable to access networked resources (docker registry, dnf repos, etc).

This is because of a bug in OpenShift v1.3.1 (current upstream release at the time of this writing) where an OpenVSwitch flow is left behind when a Pod is destroyed instead of the flow being deleted along with the Pod.

Confirming the issue is unfortunately a multi-step process, since it’s not a cluster-wide issue but isolated to the node experiencing the problem.

First, the koji createContainer task has a log file called openshift-incremental.log, which contains a key:value pair in some JSON output similar to the following:

'openshift_build_selflink': u'/oapi/v1/namespaces/default/builds/cockpit-f24-6'

The last field of the value, in this example cockpit-f24-6, is the OpenShift build identifier. We need to ssh into osbs-master01 and find out which node that build ran on.

# On osbs-master01
#   Note: the output won't be pretty, but it gives you the info you need

$ sudo oc get build cockpit-f24-6 -o yaml | grep osbs-node
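The build identifier is simply the last path component of the openshift_build_selflink value, so it can be extracted with shell parameter expansion instead of being copied by hand. A minimal sketch, using the example value from the log excerpt above (the variable names are illustrative):

```shell
# Value copied from the openshift-incremental.log example.
selflink='/oapi/v1/namespaces/default/builds/cockpit-f24-6'

# Strip everything up to and including the last "/".
build_id="${selflink##*/}"
echo "$build_id"

# On osbs-master01 this could then be used as:
#   sudo oc get build "$build_id" -o yaml | grep osbs-node
```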

Once you know what machine you need, ssh into it and run the following:

$ sudo docker run --rm -ti buildroot /bin/bash

# now attempt to run a curl command

$ curl https://google.com
# This should get refused, but if this node is experiencing the networking
# issue then this command will hang and eventually time out
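Rather than waiting for the default timeout, the check can be bounded with curl's --max-time option and the result read from the exit status. The sketch below maps the documented curl exit codes 7 (failed to connect) and 28 (operation timed out) to the two outcomes described above; classify_curl_exit is a hypothetical helper, and the wording of its messages is an interpretation rather than part of the original procedure.

```shell
# classify_curl_exit CODE -> prints a diagnosis for a curl exit status
classify_curl_exit() {
    case "$1" in
        0)  echo "connected: outbound access does not appear to be restricted" ;;
        7)  echo "refused: expected result, node networking looks healthy" ;;
        28) echo "timed out: node is likely affected by the stale flow issue" ;;
        *)  echo "inconclusive: curl exit code $1" ;;
    esac
}

# Inside the buildroot container, the check could be run as:
#   curl --max-time 15 -sS -o /dev/null https://google.com
#   classify_curl_exit $?
classify_curl_exit 28
```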

How to fix:

Reboot the affected node. When the node comes back up, OpenShift will rebuild the flow tables on OpenVSwitch and things will be back to normal.

$ sudo systemctl reboot