Layered Image Build System
The Fedora Layered Image Build System, often referred to as OSBS (OpenShift Build Service) after the upstream project it is based on, is used to build Layered Container Images in the Fedora Infrastructure via Koji.
Contact Information
- Owner: Clement Verna (cverna)
- Contact: #fedora-admin, #fedora-releng, #fedora-noc, sysadmin-main, sysadmin-releng
- Location: osbs-control01, osbs-master01, osbs-node01, osbs-node02, registry.fedoraproject.org, candidate-registry.fedoraproject.org; osbs-control01.stg, osbs-master01.stg, osbs-node01.stg, osbs-node02.stg, registry.stg.fedoraproject.org, candidate-registry.stg.fedoraproject.org; x86_64 koji buildvms
- Purpose: Layered Container Image Builds
Overview
The build system is set up so that Fedora Layered Image maintainers submit a build to Koji with the fedpkg container-build command from the container namespace within DistGit. This triggers the build to be scheduled in OpenShift via the osbs-client tooling, which creates a custom OpenShift Build that uses the pre-made buildroot container image we have created. The Atomic Reactor (atomic-reactor) utility runs within the buildroot and preps the build container where the actual build action executes. It also uploads the Content Generator metadata back to Koji and uploads the built image to the candidate docker registry. All of this runs on a host with iptables rules restricting access to the docker bridge, which further limits the buildroot's access to the outside world so that all sources of information come from Fedora.
Completed layered image builds are hosted in a candidate docker registry which is then used to pull the image and perform tests.
Setup
The Layered Image Build System setup is currently as follows (more detailed view available in the RelEng Architecture Document):
=== Layered Image Build System Overview ===

+--------------+                         +-----------+
|              |                         |           |
|   koji hub   +----+                    |  batcave  |
|              |    |                    |           |
+--------------+    |                    +----+------+
                    |                         |
                    V                         |
      +----------------+                      V
      |                |              +----------------+
      |  koji builder  |              |                +------------+
      |                |              | osbs-control01 +---------+  |
      +-+--------------+              |                +------+  |  |
        |                             +----------------+      |  |  |
        |                                                     |  |  |
        |                                                     |  |  |
        |                                                     |  |  |
        V                                                     |  |  |
+----------------+                                            |  |  |
|                |                                            |  |  |
| osbs-master01  +------------------------------+           [ansible]
|                +-------+                      |             |  |  |
+----------------+       |                      |             |  |  |
     ^                   |                      |             |  |  |
     |                   |                      |             |  |  |
     |                   V                      V             |  |  |
     |          +-----------------+    +----------------+     |  |  |
     |          |                 |    |                |     |  |  |
     |          |   osbs-node01   |    |  osbs-node02   |     |  |  |
     |          |                 |    |                |     |  |  |
     |          +-----------------+    +----------------+     |  |  |
     |                   ^                      ^             |  |  |
     |                   |                      |             |  |  |
     |                   |                      +-------------+  |  |
     |                   |                                       |  |
     |                   +---------------------------------------+  |
     |                                                              |
     +--------------------------------------------------------------+
Deployment
From batcave you can run the following:
$ sudo rbac-playbook groups/osbs/deploy-cluster.yml
This deploys the OpenShift cluster used by OSBS. Currently the playbook deploys two clusters (x86_64 and aarch64). Ansible tags can be used to deploy only one of them if needed, for example osbs-x86-deploy-openshift.
If the openshift-ansible playbook fails, it can be easier to run it directly from osbs-control01 and use verbose mode.
$ ssh osbs-control01.iad2.fedoraproject.org
$ sudo -i
# cd /root/openshift-ansible
# ansible-playbook -i cluster-inventory playbooks/prerequisites.yml
# ansible-playbook -i cluster-inventory playbooks/deploy_cluster.yml
Once these playbooks have succeeded, you can configure OSBS on the cluster. For that, use the following playbook:
$ sudo rbac-playbook groups/osbs/configure-osbs.yml
When this is done, we need to get the new koji service token and update its value in the private repository:
$ ssh osbs-master01.iad2.fedoraproject.org
$ sudo -i
# oc -n osbs-fedora sa get-token koji
dsjflksfkgjgkjfdl ....
The token needs to be saved in the private ansible repo in files/osbs/production/x86-64-osbs-koji. Once this is done, you can run the builder playbook to update that token.
$ sudo rbac-playbook groups/buildvm.yml -t osbs
Operation
Koji Hub schedules the containerBuild on a koji builder via the koji-containerbuild-hub plugin. The builder then submits the build to OpenShift via the koji-containerbuild-builder plugin, which uses the osbs-client Python API that wraps the OpenShift API along with a custom OpenShift Build JSON payload.
The build is then scheduled in OpenShift and its logs are captured by the koji plugins. Inside the buildroot, atomic-reactor uploads the built container image and provides the metadata to koji's content generator.
Outage
If Koji is down, then builds can’t be scheduled but repairing Koji is outside the scope of this document.
If either the candidate-registry.fedoraproject.org or the registry.fedoraproject.org container registry is unavailable, repairing it is also outside the scope of this document.
OSBS Failures
OSBS itself has several known failure modes; the recovery procedures are listed below.
Ran out of disk space
Docker uses a lot of disk space, and while the osbs-nodes have been allocated what is considered to be ample disk space for builds (since they are automatically cleaned up periodically) it is possible this will run out.
To resolve this, run the following commands:
# These commands will clean up old/dead docker containers from old OpenShift
# Pods
$ for i in $(sudo docker ps -a | awk '/Exited/ { print $1 }'); do sudo docker rm $i; done
$ for i in $(sudo docker images -q -f 'dangling=true'); do sudo docker rmi $i; done

# This command should only be run on osbs-master01 (it won't work on the
# nodes)
#
# This command will clean up old builds and related artifacts in OpenShift
# that are older than 30 days (We can get more aggressive about this if
# necessary, the main reason these still exist is in the event we need to
# debug something. All build info we care about is stored in Koji.)
$ oadm prune builds --orphans --keep-younger-than=720h0m0s --confirm
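The retention window passed to --keep-younger-than is expressed in hours, so 30 days becomes 720h0m0s. If a more aggressive window is wanted, the conversion can be sketched as below. The helper name days_to_duration is hypothetical and not part of any existing tooling.

```shell
# Hypothetical helper: convert a retention period in days into the
# hours-based duration string used with --keep-younger-than.
days_to_duration() {
    printf '%sh0m0s\n' "$(( $1 * 24 ))"
}

# 30 days, matching the value used in the prune command
days_to_duration 30

# Possible use on osbs-master01 (shown as a comment, not executed here):
#   oadm prune builds --orphans --keep-younger-than="$(days_to_duration 14)" --confirm
```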
A node is broken, how to remove it from the cluster?
If a node is having an issue, the following command will mark it as unschedulable, which effectively removes it from the cluster temporarily.
In this example, we are removing osbs-node01:
$ oadm manage-node osbs-node01.phx2.fedoraproject.org --schedulable=false
Container Builds are unable to access resources on the network
Sometimes the Container Builds will fail and the logs will show that the buildroot is unable to access networked resources (docker registry, dnf repos, etc).
This is because of a bug in OpenShift v1.3.1 (current upstream release at the time of this writing) where an OpenVSwitch flow is left behind when a Pod is destroyed instead of the flow being deleted along with the Pod.
Confirming the issue is unfortunately a multi-step process, since it is not a cluster-wide issue but is isolated to the node experiencing the problem.
First, in the koji createContainer task there is a log file called openshift-incremental.log, and in there you will find a key:value pair in some JSON output similar to the following:
'openshift_build_selflink': u'/oapi/v1/namespaces/default/builds/cockpit-f24-6'
The last field of the value, in this example cockpit-f24-6, is the OpenShift build identifier. We need to ssh into osbs-master01 and find out which node that build ran on.
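Pulling the build identifier out of the selflink by hand is easy to get wrong with long paths. A minimal sketch using shell parameter expansion follows; the function name build_id_from_selflink is hypothetical, and it assumes the selflink value has already been copied out of the log without its surrounding quotes.

```shell
# Hypothetical helper: print the last path component of a selflink,
# which is the OpenShift build identifier.
build_id_from_selflink() {
    printf '%s\n' "${1##*/}"
}

build_id_from_selflink '/oapi/v1/namespaces/default/builds/cockpit-f24-6'
```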
# On osbs-master01
# Note: the output won't be pretty, but it gives you the info you need
$ sudo oc get build cockpit-f24-6 -o yaml | grep osbs-node
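Because the grep output is noisy, it can help to reduce it to the distinct node hostnames. The sketch below makes no assumption about which YAML fields carry the node name; it only matches anything that starts with osbs-node. The function name nodes_from_build_yaml is hypothetical.

```shell
# Hypothetical helper: read build YAML on stdin and print each distinct
# hostname beginning with "osbs-node" once.
nodes_from_build_yaml() {
    grep -o 'osbs-node[0-9A-Za-z.-]*' | sort -u
}

# Possible use on osbs-master01 (shown as a comment, not executed here):
#   sudo oc get build <build-id> -o yaml | nodes_from_build_yaml
```

A hostname that ends a sentence in a log message would keep its trailing period, so treat the output as a hint rather than an exact value.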
Once you know what machine you need, ssh into it and run the following:
$ sudo docker run --rm -ti buildroot /bin/bash
# now attempt to run a curl command
$ curl https://google.com
# This should get refused, but if this node is experiencing the networking
# issue then this command will hang and eventually time out
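The check above hinges on telling a refused connection apart from a hang. A minimal sketch, assuming curl's documented exit statuses (7 for a failed connection, 28 for a timeout), is shown below. The function name classify_curl_exit and the 15-second limit are illustrative choices, not part of the existing procedure.

```shell
# Hypothetical helper: interpret curl's exit status for the probe.
#   7  = failed to connect (for example, connection refused)
#   28 = operation timed out
classify_curl_exit() {
    case "$1" in
        0)  echo "reachable" ;;
        7)  echo "refused" ;;
        28) echo "timed-out" ;;
        *)  echo "other ($1)" ;;
    esac
}

# Possible use inside the buildroot container (comment only, not executed here):
#   curl --silent --max-time 15 --output /dev/null https://google.com
#   classify_curl_exit "$?"
classify_curl_exit 28
```

With --max-time set, a node showing the networking issue should report a timeout after the limit instead of hanging indefinitely, while a healthy node is expected to report a refusal.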
How to fix:
Reboot the affected node. When the node comes back up, OpenShift will rebuild the flow tables on OpenVSwitch and things will be back to normal.
systemctl reboot