Copr
Copr is a build system for 3rd-party packages.
- Frontend
- Backend: copr-be.cloud.fedoraproject.org
- Package signer: copr-keygen.cloud.fedoraproject.org
- Dist-git: copr-dist-git.fedorainfracloud.org
- Devel instances (NO NEED TO CARE ABOUT THEM, JUST THOSE ABOVE):
  - copr-keygen-dev.cloud.fedoraproject.org
  - copr-dist-git-dev.fedorainfracloud.org
Contact Information
- Owner: msuchy (mirek)
- Contact: #fedora-admin, #fedora-buildsys
- Location: Fedora Cloud
- Purpose: Build system
This document
This document provides condensed information to help you keep Copr alive and working. For more sophisticated business processes, please see https://docs.pagure.org/copr.copr/maintenance_documentation.html
TROUBLESHOOTING
Almost every problem with Copr is caused either by problems with spawning builder VMs or by problems with processing the action queue on the backend.
VM spawning/termination problems
Try to restart the copr-backend service:
$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl restart copr-backend
If this doesn’t solve the problem, follow the logs for clues:
$ tail -f /var/log/copr-backend/{vmm,spawner,terminator}.log
As a last resort, you can terminate all builders and let copr-backend throw away all information about them. This will obviously interrupt all running builds; they will be rescheduled:
$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl stop copr-backend
$ cleanup_vm_nova.py
$ redis-cli
> FLUSHALL
$ systemctl start copr-backend
Sometimes OpenStack cannot handle spawning too many VMs at the same time. In that case it is safer to edit, on copr-be.cloud.fedoraproject.org:
vi /etc/copr/copr-be.conf
and change:
group0_max_workers=12
to 6. Start the copr-backend service and, some time later, increase the value back to the original. Copr automatically detects the change and increases the number of workers accordingly.
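A minimal sketch of the whole cycle, assuming the service was stopped while the problem was being investigated:

$ ssh root@copr-be.cloud.fedoraproject.org
$ vi /etc/copr/copr-be.conf     # change group0_max_workers=12 to group0_max_workers=6
$ systemctl start copr-backend
# some time later, edit the value back to 12; copr-backend detects the change
# on its own and raises the number of workers again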
The set of aarch64 VMs isn’t maintained by OpenStack, but by Copr’s backend itself. Steps to diagnose:
$ ssh root@copr-be.cloud.fedoraproject.org
[root@copr-be ~][PROD]# systemctl status resalloc
● resalloc.service - Resource allocator server
...
[root@copr-be ~][PROD]# less /var/log/resallocserver/main.log
[root@copr-be ~][PROD]# su - resalloc
[resalloc@copr-be ~][PROD]$ resalloc-maint resource-list
13569 - aarch64_01_prod_00013569_20190613_151319 pool=aarch64_01_prod tags=aarch64 status=UP
13597 - aarch64_01_prod_00013597_20190614_083418 pool=aarch64_01_prod tags=aarch64 status=UP
13594 - aarch64_02_prod_00013594_20190614_082303 pool=aarch64_02_prod tags=aarch64 status=STARTING
...
[resalloc@copr-be ~][PROD]$ resalloc-maint ticket-list
879 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013569_20190613_151319
918 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013608_20190614_135536
904 - state=OPEN tags=aarch64 resource=aarch64_02_prod_00013594_20190614_082303
919 - state=OPEN tags=aarch64
...
Be careful when some resource stays in the STARTING state. If that is the case, check:

/usr/bin/tail -F -n +0 /var/log/resallocserver/hooks/013594_alloc
Copr takes tickets from the resalloc server; if the resources fail to spawn, the ticket numbers are not assigned an appropriately tagged resource for a long time. If that happens (it shouldn’t) and there is some inconsistency between resalloc’s database and the actual status on the aarch64 hypervisors (ssh copr@virthost-aarch64-os0{1,2}.fedorainfracloud.org; use virsh there to inspect their statuses), use resalloc-maint resource-delete, resalloc ticket-close, or psql commands to fix up resalloc’s database.
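A hypothetical clean-up session, assuming the stuck resource 13594 and the ticket 904 from the listing above; check resalloc-maint --help and resalloc --help for the exact argument syntax before running anything:

$ ssh root@copr-be.cloud.fedoraproject.org
$ su - resalloc
$ resalloc-maint resource-list | grep STARTING    # identify the stuck resource
$ resalloc-maint resource-delete 13594            # drop it from resalloc's database
$ resalloc-maint ticket-list | grep 13594         # find tickets pointing at it
$ resalloc ticket-close 904                       # close the orphaned ticket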
Backend Troubleshooting
Information about the status of the Copr backend services:
systemctl status copr-backend*.service
Utilization of workers:
ps axf
Worker processes change their $0 (process title) to show which task they are working on and on which builder.
To list which VM builders are tracked by the copr-vmm service:
/usr/bin/copr_get_vm_info.py
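For instance, a quick status sweep combining the commands above (the grep filter is only illustrative):

$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl status 'copr-backend*.service'
$ ps axf | grep -i worker         # worker processes rewrite $0 with the task and builder
$ /usr/bin/copr_get_vm_info.py    # VM builders tracked by copr-vmm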
Appstream builder troubleshooting
Appstream builder is painfully slow when running on a repository with a huge number of packages. See https://github.com/hughsie/appstream-glib/issues/301. You might need to disable it for some projects:
$ ssh root@copr-be.cloud.fedoraproject.org
$ cd /var/lib/copr/public_html/results/<owner>/<project>/
$ touch .disable-appstream
# You should probably also delete existing appstream data because
# they might be obsolete
$ rm -rf ./appdata
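To re-enable appstream generation later, removing the flag file should presumably be enough (an assumption based on the flag-file mechanism above, not a documented procedure):

$ rm /var/lib/copr/public_html/results/<owner>/<project>/.disable-appstream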
Backend action queue issues
First check the number of not-yet-processed actions. If that number is non-zero and is not decreasing reasonably fast (say, a single action takes longer than 30s), there might be a problem. Logs for the action dispatcher can be found in:
/var/log/copr-backend/action_dispatcher.log
Check that there is no stuck process under the Action dispatch parent process in the pstree -a copr output.
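A minimal sketch of the checks described above:

$ ssh root@copr-be.cloud.fedoraproject.org
$ tail -f /var/log/copr-backend/action_dispatcher.log   # watch actions being picked up and processed
$ pstree -a copr                                        # look for stuck children under the Action dispatch process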
Deploy information
Using playbooks and rbac:
$ sudo rbac-playbook groups/copr-backend.yml
$ sudo rbac-playbook groups/copr-frontend-cloud.yml
$ sudo rbac-playbook groups/copr-keygen.yml
$ sudo rbac-playbook groups/copr-dist-git.yml
The copr-setup.txt manual is severely outdated, but there is no up-to-date alternative. We should extract useful information from it and put it here in the SOP or into https://docs.pagure.org/copr.copr/maintenance_documentation.html and then throw the copr-setup.txt away.
The copr-backend service (which spawns several processes) should be running on the backend. The backend spawns VMs in the Fedora Cloud. You cannot log in to those machines directly; instead:
$ ssh root@copr-be.cloud.fedoraproject.org
$ su - copr
$ copr_get_vm_info.py     # find the IP address of the VM that you want
$ ssh root@172.16.3.3
Instances can be easily terminated in https://fedorainfracloud.org/dashboard
Order of start up
When reprovisioning, start the copr-keygen and copr-dist-git machines first (in any order). Then you can start copr-be. You can start it sooner, but make sure the copr-* services on it stay stopped until keygen and dist-git are up.
The copr-fe machine is completely independent and can be started at any time. If the backend is stopped, the frontend will just queue jobs.
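For example, if copr-be is booted early, keep its services stopped and start them only once keygen and dist-git are reachable (a hedged sketch):

$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl stop 'copr-backend*.service'   # keep stopped while keygen/dist-git are still down
# ... bring up copr-keygen and copr-dist-git ...
$ systemctl start copr-backend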
Logs
Backend
- /var/log/copr-backend/action_dispatcher.log
- /var/log/copr-backend/actions.log
- /var/log/copr-backend/backend.log
- /var/log/copr-backend/build_dispatcher.log
- /var/log/copr-backend/logger.log
- /var/log/copr-backend/spawner.log
- /var/log/copr-backend/terminator.log
- /var/log/copr-backend/vmm.log
- /var/log/copr-backend/worker.log
There are also several logs for non-essential features, such as copr_prune_results.log, hitcounter.log, and cleanup_vms.log, which you don’t need to worry about.
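To follow all of these at once while debugging (a simple illustration using the directory listed above):

$ ssh root@copr-be.cloud.fedoraproject.org
$ tail -f /var/log/copr-backend/*.log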
Services
PPC64LE Builders
Builders for PPC64LE are located at rh-power2.fit.vutbr.cz, and anyone with access to the buildsys ssh key can get there, e.g. as msuchy@rh-power2.fit.vutbr.cz.
There are commands:
$ ls bin/
destroy-all.sh  reinit-vm26.sh  reinit-vm28.sh  virsh-destroy-vm26.sh  virsh-destroy-vm28.sh  virsh-start-vm26.sh  virsh-start-vm28.sh
get-one-vm.sh   reinit-vm27.sh  reinit-vm29.sh  virsh-destroy-vm27.sh  virsh-destroy-vm29.sh  virsh-start-vm27.sh  virsh-start-vm29.sh
- destroy-all.sh: destroys all VMs and reinitializes them
- reinit-vmXX.sh: copies the VM image from the template
- virsh-destroy-vmXX.sh: destroys the VM
- virsh-start-vmXX.sh: starts the VM
- get-one-vm.sh: starts one VM and returns its IP; this is used in Copr playbooks
In case of a big queue of PPC64LE tasks, simply call bin/destroy-all.sh; it will destroy the stuck VMs and the Copr backend will spawn new ones.
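For example (substitute your own user for msuchy, as noted above):

$ ssh msuchy@rh-power2.fit.vutbr.cz
$ bin/destroy-all.sh    # destroys the stuck VMs; the Copr backend then spawns fresh ones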
Ports opened to the public
Frontend:

| Port | Protocol | Service | Reason |
|------|----------|---------|--------|
| 22   | TCP      | ssh     | Remote control |
| 80   | TCP      | http    | Serving Copr frontend website |
| 443  | TCP      | https   | Serving Copr frontend website |
Backend:

| Port | Protocol | Service | Reason |
|------|----------|---------|--------|
| 22   | TCP      | ssh     | Remote control |
| 80   | TCP      | http    | Serving build results and repos |
| 443  | TCP      | https   | Serving build results and repos |
Distgit:

| Port | Protocol | Service | Reason |
|------|----------|---------|--------|
| 22   | TCP      | ssh     | Remote control |
| 80   | TCP      | http    | Serving cgit interface |
| 443  | TCP      | https   | Serving cgit interface |
Keygen:

| Port | Protocol | Service | Reason |
|------|----------|---------|--------|
| 22   | TCP      | ssh     | Remote control |
Resources justification
Copr currently uses the following resources.
Frontend
- RAM: 2G (out of 4G) and some swap
- CPU: 2 cores (3400MHz) with load 0.92, 0.68, 0.65
Most of the memory is consumed by PostgreSQL, followed by Apache. The CPU is also used mainly by those two services, but in the reverse order.
I don’t think we can settle for any instance that provides less than 2G of RAM; ideally, we need 3G+. A 2-core CPU is good enough.
- Disk space: 17G for the system and 8G for the pgsql db directory
If needed, we can clean old dumps and backups out of the database directory and get down to around 4G of disk space.
Backend
- RAM: 5G (out of 16G)
- CPU: 8 cores (3400MHz) with load 4.09, 4.55, 4.24
The backend takes care of spinning up builders, running Ansible playbooks on them, running createrepo_c (on big repositories), and so on. Copr uses two queues: one for builds, which are delegated to OpenStack builders, and one for actions. Actions, however, are processed directly by the backend, so they can spike the load. Ideally we would like to keep the same computing power we have now. Maybe we can go lower than 16G of RAM, possibly down to 12G.
- Disk space: 30G for the system, 5.6T (out of 6.8T) for build results
Currently we have 1.3T of backup data that is going to be deleted soon, but nevertheless we cannot go any lower on storage. Disk space is a long-term issue for us, and we have to make a lot of compromises just to survive our daily increase (around 10G of new data). Many features are blocked by not having enough storage. We cannot go any lower, and we also cannot go much longer with the current storage.
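A quick way to check the current storage situation on the backend (the results path is the one used elsewhere in this SOP; whether it sits on its own volume is an assumption):

$ ssh root@copr-be.cloud.fedoraproject.org
$ df -h /var/lib/copr/public_html                               # build-results storage
$ du -sh /var/lib/copr/public_html/results/* | sort -h | tail   # biggest per-owner directories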
Distgit
- RAM: ~270M (out of 4G), but climbs to ~1G when busy
- CPU: 2 cores (3400MHz) with load 1.35, 1.00, 0.53
Personally, I wouldn’t downgrade this machine too much. Possibly we can live with 3G of RAM, but I wouldn’t go any lower.
- Disk space: 7G for the system, 1.3T of dist-git data
We currently employ a lot of aggressive cleaning strategies on our distgit data, so we can’t go any lower than what we have.
Keygen
- RAM: ~150M (out of 2G)
- CPU: 1 core (3400MHz) with load 0.10, 0.31, 0.25
We are basically running just signd and httpd here, both with minimal resource requirements. The memory usage is topped by systemd-journald.
- Disk space: 7G for the system and ~500M (out of ~700M) for GPG keys
We are slowly pushing the GPG key storage to its limit, so if copr-keygen is ever migrated somewhere else, we would like to scale it up to at least 1G.