Fedora Release Engineering Troubleshooting Guide
Fedora Release Engineering consists of many different systems, code bases, and tools, so things can get complex in a hurry. This complexity is not very welcoming to newcomers who would like to get involved. This guide is a place to educate those new to the processes, systems, code bases, and tools. It also serves as a reference for those who aren’t new but have been fortunate enough not to need to diagnose things in recent memory.
We certainly won’t be able to document every single component in the systems that could go wrong, but hopefully over time this document will become a proper knowledge base for reference and educational purposes on the topics listed below.
Compose
If something with a compose has gone wrong, there are a number of places to find information. Each of these is discussed below.
releng-cron list
The compose output logs are emailed to the releng-cron mailing list. It is good practice to check the releng-cron mailing list archives and find the latest output and give it a look.
compose machines
If the releng-cron mailing list archives don’t prove useful, you can move on to checking the contents of the composes themselves on the primary compose machines in the Fedora Infrastructure. At the time of this writing, there are multiple machines, and which one to check depends on the specific compose you are looking for:
- Two-Week Atomic Nightly Compose
  - compose-x86-01.phx2.fedoraproject.org
- Branched Compose
  - branched-composer.phx2.fedoraproject.org
- Rawhide Compose
  - rawhide-composer.phx2.fedoraproject.org
The full path you will end up inspecting depends on which specific compose you are looking for:
- For Two Week Atomic you will find the compose output in /mnt/fedora_koji/compose/
- For Release Candidate / Test Candidate composes you will find compose output in /mnt/koji/compose/
Note
It’s possible the mock logs are no longer available. The mock chroots are rewritten on subsequent compose runs.
You can also check for mock logs if they still persist from the compose you are interested in. Find the name of the specific mock chroot directory (it will reside in /var/lib/mock/) by checking the appropriate compose mock configuration (the following is only a sample provided at the time of this writing):
$ ls /etc/mock/*compose*
/etc/mock/fedora-22-compose-aarch64.cfg /etc/mock/fedora-branched-compose-aarch64.cfg
/etc/mock/fedora-22-compose-armhfp.cfg /etc/mock/fedora-branched-compose-armhfp.cfg
/etc/mock/fedora-22-compose-i386.cfg /etc/mock/fedora-branched-compose-i386.cfg
/etc/mock/fedora-22-compose-x86_64.cfg /etc/mock/fedora-branched-compose-x86_64.cfg
/etc/mock/fedora-23-compose-aarch64.cfg /etc/mock/fedora-rawhide-compose-aarch64.cfg
/etc/mock/fedora-23-compose-armhfp.cfg /etc/mock/fedora-rawhide-compose-armhfp.cfg
/etc/mock/fedora-23-compose-i386.cfg /etc/mock/fedora-rawhide-compose-i386.cfg
/etc/mock/fedora-23-compose-x86_64.cfg /etc/mock/fedora-rawhide-compose-x86_64.cfg
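The lookup described above can be sketched as a pair of shell helpers. This is only an illustration, assuming the mock configuration sets config_opts['root'] to a literal quoted string and that mock's default result directory (/var/lib/mock/<root>/result/) is in use. The function names mock_root_name and mock_logs are hypothetical and not part of any releng tooling.

```shell
# Print the chroot name that a mock config assigns to config_opts['root'].
# Assumption: the value is a plain quoted string, not a template.
mock_root_name() {
    sed -n "s/^config_opts\['root'\][[:space:]]*=[[:space:]]*['\"]\([^'\"]*\)['\"].*/\1/p" "$1" | tail -n 1
}

# List whatever log files still exist for the chroot named in a config.
# Assumption: mock's default result directory layout is in use.
mock_logs() {
    root_name="$(mock_root_name "$1")"
    [ -n "$root_name" ] || return 1
    ls -l "/var/lib/mock/${root_name}/result/"*.log 2>/dev/null
}

# Example (run on a compose machine):
#   mock_logs /etc/mock/fedora-23-compose-x86_64.cfg
```

If mock_logs prints nothing, the chroot has most likely been rewritten by a later compose run and the logs are gone.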
running the compose yourself
If you happen to strike out there and still need to debug, it might be time to run the compose yourself. The exact command needed can be found in the cron jobs located on their respective compose machines, which are listed in the compose machines section. Also note that each compose command should be run from its respective compose machine, as outlined in that section.
An example is below, setting the compose directory to username-debug.1. The .1 at the end is common notation for an incremental run of a compose. If you need to do another, increment it to .2 and continue. This makes it possible to compare composes.
$ TMPDIR=`mktemp -d /tmp/twoweek.XXXXXX` && cd $TMPDIR \
&& git clone -n https://pagure.io/releng.git && cd releng && \
git checkout -b stable twoweek-stable && \
LANG=en_US.UTF-8 ./scripts/make-updates 23 updates $USER-debug.1
The above command was pulled from the twoweek-atomic cron job with only the final parameter altered. This parameter is used as the name of the output directory.
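To avoid picking the increment by hand, the naming convention can be sketched as a small helper. This is a minimal sketch, not part of the releng scripts: next_debug_name is a hypothetical function, and the directory holding earlier runs is passed in as an argument because its location depends on the compose in question.

```shell
# Print the next unused "<user>-debug.N" name under a directory that holds
# earlier debug runs. Both arguments are supplied by the caller.
next_debug_name() {
    base_dir="$1"
    user="${2:-$USER}"
    highest=0
    for path in "${base_dir}/${user}-debug."*; do
        [ -e "$path" ] || continue
        n="${path##*.}"
        # Skip anything whose suffix is not a plain number.
        case "$n" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$n" -gt "$highest" ]; then
            highest="$n"
        fi
    done
    echo "${user}-debug.$((highest + 1))"
}

# Example, where /path/to/composes stands for wherever earlier runs live:
#   ./scripts/make-updates 23 updates "$(next_debug_name /path/to/composes)"
```

Keeping every run under its own numbered name is what makes the later comparison between composes possible.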
The compose can take some time to run, so don’t be alarmed if you don’t see output for a while. This should provide you with all the information needed to debug and/or diagnose further. When in doubt, ask in #fedora-releng on irc.libera.chat.
Docker Layered Image Build Service
The Docker Layered Image Build Service is built using a combination of technologies such as OpenShift, osbs-client, and the koji-containerbuild plugin that, when combined, are often referred to as an OpenShift Build Service (OSBS) instance. Something to note is that OpenShift is a Kubernetes distribution with many features built on top of Kubernetes, such as the build primitive that is used as the basis for the build service. This information will hopefully shed light on some of the terminology and commands used below.
There are a few "common" scenarios in which a build may fail or hang and that will need some sort of inspection of the build system.
Build appears to stall after being scheduled
In the event that a build scheduled through koji appears to be stalled and is not in a free state (i.e. it has been scheduled), an administrator will need to ssh into osbs-master01 or osbs-master01.stg (depending on stage vs. production) and inspect the build itself.
$ oc status
In project default on server https://10.5.126.216:8443
svc/kubernetes - 172.30.0.1 ports 443, 53, 53
bc/cockpit-f24 custom build of git://....
build #8 succeeded 7 weeks ago
build #9 failed 33 minutes ago
$ oc describe build/cockpit-f24-9
# lots of output about status of the specific build
$ oc logs build/cockpit-f24-9
# extremely verbose logs, these should in normal circumstances be found in
# the koji build log post build
The information found in the commands above will generally identify the issue.
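When several build configs are listed, the failed builds can be picked out of the oc status output mechanically. The following is a rough sketch that assumes the output looks like the sample above and that builds are named <buildconfig>-<number>, as in build/cockpit-f24-9. The function name failed_builds is hypothetical.

```shell
# Read `oc status` output on stdin and print one build name per failed build.
# Assumption: the text layout matches the sample shown in this guide.
failed_builds() {
    awk '
        $1 ~ /^bc\// { name = substr($1, 4) }
        $1 == "build" && $2 ~ /^#[0-9]+$/ && $3 == "failed" && name != "" {
            print name "-" substr($2, 2)
        }
    '
}

# Example on the OSBS master:
#   oc status | failed_builds | while read -r b; do
#       oc describe "build/$b"
#       oc logs "build/$b"
#   done
```

Parsing human-readable output is fragile, so treat this as a convenience for triage rather than something to depend on.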
Build fails but there’s no log output in the Koji Task
Sometimes there is a communication issue between Koji and OSBS which causes a failure to be listed in Koji without all the logs. These failures can be diagnosed by checking the kojid logs on the Koji builder listed in the task output.
For example:
$ fedpkg container-build
Created task: 90123598
Task info: http://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90123598
Watching tasks (this may be safely interrupted)...
90123598 buildContainer (noarch): free
90123598 buildContainer (noarch): free -> open (buildvm-04.stg.phx2.fedoraproject.org)
90123599 createContainer (x86_64): free
90123599 createContainer (x86_64): free -> open (buildvm-02.stg.phx2.fedoraproject.org)
90123599 createContainer (x86_64): open (buildvm-02.stg.phx2.fedoraproject.org) -> closed
0 free 1 open 1 done 0 failed
90123598 buildContainer (noarch): open (buildvm-04.stg.phx2.fedoraproject.org) -> FAILED: Fault: <Fault 2001: 'Image build failed. OSBS build id: cockpit-f24-9'>
0 free 0 open 1 done 1 failed
90123598 buildContainer (noarch) failed
In this example the buildContainer task was scheduled and ran on buildvm-04.stg, with the actual createContainer task running on buildvm-02.stg. buildvm-02.stg is where we want to begin looking for failures to communicate with OSBS, as this is the point of contact with the external system.
Logs can be found in /var/log/kojid.log or, if necessary, on the koji hub in question. Generally, you will want to start with the first point of contact with OSBS and "work your way back". In the above example you would first check buildvm-02.stg, then move on to buildvm-04.stg if nothing useful was found in the logs of the previous machine, and finally move on to the koji hub if neither of the builder machines involved provided useful log information.
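The "work your way back" order can be derived from saved task-watcher output. This is a sketch only: builders_to_check is a hypothetical helper, and it assumes that child tasks such as createContainer receive higher task IDs than their parent, so sorting by task ID in descending order puts the builder closest to OSBS first.

```shell
# Read fedpkg/koji task-watcher output on stdin and print the builders to
# inspect, closest to OSBS first.
# Assumption: child tasks have higher task IDs than their parent task.
builders_to_check() {
    awk '
        match($0, /open \([^)]+\)/) {
            task = $1
            if (task in seen) next
            seen[task] = 1
            # "open (" is 6 characters; drop it and the closing parenthesis.
            host = substr($0, RSTART + 6, RLENGTH - 7)
            print task, $2, host
        }
    ' | sort -rn | awk '{ print $3 " (" $2 ")" }'
}

# Example:
#   fedpkg container-build 2>&1 | tee build.log
#   builders_to_check < build.log
```

On each host printed, start with /var/log/kojid.log, then fall back to the koji hub if none of the builders shows anything useful.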
Build fails because it can’t get to a network resource
Sometimes the firewall rules get messed up on one of the OpenShift Nodes in the environment. This can cause output similar to the following:
$ fedpkg container-build --scratch
Created task: 90066343
Task info: http://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90066343
Watching tasks (this may be safely interrupted)...
90066343 buildContainer (noarch): free
90066343 buildContainer (noarch): free -> open (buildvm-03.stg.phx2.fedoraproject.org)
90066344 createContainer (x86_64): open (buildvm-04.stg.phx2.fedoraproject.org)
90066344 createContainer (x86_64): open (buildvm-04.stg.phx2.fedoraproject.org) -> FAILED: Fault: <Fault 2001: "Image build failed. Error in plugin distgit_fetch_artefacts: OSError(2, 'No such file or directory'). OSBS build id: scratch-20161102132628">
0 free 1 open 0 done 1 failed
90066343 buildContainer (noarch): open (buildvm-03.stg.phx2.fedoraproject.org) -> closed
0 free 0 open 1 done 1 failed
If we go to the OSBS Master and run the following commands, we will see the root symptom:
# oc logs build/scratch-20161102132628
Error from server: Get https://osbs-node02.stg.phx2.fedoraproject.org:10250/containerLogs/default/scratch-20161102132628-build/custom-build: dial tcp 10.5.126.213:10250: getsockopt: no route to host
# ping 10.5.126.213
PING 10.5.126.213 (10.5.126.213) 56(84) bytes of data.
64 bytes from 10.5.126.213: icmp_seq=1 ttl=64 time=0.299 ms
64 bytes from 10.5.126.213: icmp_seq=2 ttl=64 time=0.299 ms
64 bytes from 10.5.126.213: icmp_seq=3 ttl=64 time=0.253 ms
64 bytes from 10.5.126.213: icmp_seq=4 ttl=64 time=0.233 ms
^C
--- 10.5.126.213 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3073ms
rtt min/avg/max/mdev = 0.233/0.271/0.299/0.028 ms
# http get 10.5.126.213:10250
http: error: ConnectionError: HTTPConnectionPool(host='10.5.126.213', port=10250): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdab064b320>: Failed to establish a new connection: [Errno 113] No route to host',)) while doing GET request to URL: http://10.5.126.213:10250/
In the above output, we can see that we do have network connectivity to the Node, but we cannot connect to the OpenShift service that should be listening on port 10250.
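The same two-step check (is the host reachable, and is the port reachable) can be scripted. This is a minimal sketch, assuming bash and the coreutils timeout command are available on the machine you run it from. The function names port_open and classify_reachability are hypothetical.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Assumption: bash (for /dev/tcp) and coreutils timeout are installed.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Turn the exit codes of a ping check and a port check into a diagnosis.
classify_reachability() {
    ping_rc="$1"
    port_rc="$2"
    if [ "$ping_rc" -ne 0 ]; then
        echo "host unreachable: check routing or whether the node is up"
    elif [ "$port_rc" -ne 0 ]; then
        echo "host up, port unreachable: check firewall rules and the node service"
    else
        echo "host and port reachable"
    fi
}

# Example, using the node address from the output above:
#   ping -c 1 -W 2 10.5.126.213 >/dev/null 2>&1; ping_rc=$?
#   port_open 10.5.126.213 10250; port_rc=$?
#   classify_reachability "$ping_rc" "$port_rc"
```

The second outcome corresponds to the situation shown in this section: ping succeeds while the connection to port 10250 fails.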
To fix this, ssh into the OpenShift Node that you can’t connect to via port 10250 and run the following commands. This should resolve the issue.
iptables -F && iptables -t nat -F && systemctl restart docker && systemctl restart origin-node