Mass Upgrade Infrastructure SOP

Periodically we need to apply mass updates to our servers for security fixes and other upgrades.

Contact Information

Owner

Fedora Infrastructure Team

Contact

#fedora-admin, sysadmin-main, infrastructure@lists.fedoraproject.org, #fedora-noc

Location

All over the world.

Servers

all

Purpose

Apply kernel/other upgrades to all of our servers

Preparation

Mass updates are usually applied every few months, or sooner if critical bugs have been fixed. Mass updates are done outside of freeze windows to avoid causing problems for Fedora releases.

The following items are all done before the actual mass update:

  • Plan an outage window or windows outside of a freeze.

  • File an outage ticket in the fedora-infrastructure tracker, using the outage template. This should describe the exact time/date and what is included.

  • Get the outage ticket reviewed by someone else to confirm there are no mistakes in it.

  • Send the outage announcement to the infrastructure and devel-announce lists (for outages that affect only contributors) or to the infrastructure, devel-announce and announce lists (for outages that affect all users).

  • Add a 'planned' outage to fedorastatus. This will show the planned outage there for higher visibility.

  • Set up a hackmd or other shared document that lists all the virthosts and bare metal hosts that need rebooting, organized by day. This is used to track which admin is handling which server(s).

Typically, updates/reboots are done on all staging hosts on Monday, then on all non-outage-causing hosts on Tuesday, and finally the outage reboots are done on Wednesday.

Staging

Any updates that can be tested in staging or a pre-production environment should be tested there first. This includes new kernels, updates to core database applications and libraries, web applications, and so on. This is typically done a few days before the actual outage. Too far in advance and things may have changed again, so it's important to do this just before the production updates.

Non outage causing hosts

Some hosts can be safely updated/rebooted without an outage because they are behind a load balancer with multiple machines, are not visible to end users, or for other reasons. These updates are typically done on Tuesday of the outage week so they are finished before the outage on Wednesday. These hosts include proxies and a number of virthosts whose VMs meet these criteria.

Special Considerations

While this may not be a complete list, here are some special things that must be taken into account before rebooting certain systems:

Post reboot action

The following machines require post-boot actions (mostly entering passphrases). Make sure admins that have the passphrases are on hand for the reboot:

  • backup01 (ssh agent passphrase for backup ssh key)

  • sign-vault01 (NSS passphrase for sigul service and luks passphrase)

  • sign-bridge01 (run: 'sigul_bridge -dvv' after it comes back up, no passphrase needed)

  • autosign01 (NSS passphrase for robosignatory service and luks passphrase)

  • buildvm-s390x-15/16/16 (need the sshfs mount of the koji volume redone)

  • batcave01 (ssh agent passphrase for ansible ssh key)

  • notifs-backend01 (run: rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'; systemctl restart fmn-backend@1; for i in $(seq 1 24); do echo $i; systemctl restart fmn-worker@$i | cat; done)
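
For the hosts that need an ssh agent passphrase re-entered after boot (backup01 and batcave01), the usual pattern is to re-add the key to the agent; the key path below is purely illustrative and the real path should be confirmed on the host itself:

ssh-add /path/to/the/relevant/ssh/key
ssh-add -l   # confirm the key is now loaded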

Bastion01 and Bastion02 and openvpn server

If a reboot of bastion01 is done during an outage, nothing needs to be changed here. However, if bastion01 will be down for an extended period of time openvpn can be switched to bastion02 by stopping openvpn-server@openvpn on bastion01 and starting it on bastion02.

on bastion01: 'systemctl stop openvpn-server@openvpn'
on bastion02: 'systemctl start openvpn-server@openvpn'

and the process can be reversed after the other host is back. Clients try 01 first, then 02 if it's down. It's important to make sure all the clients are using one machine or the other, because if they are split, routing between machines may be confused.
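
To double-check that only one bastion is serving VPN clients after a switch (a simple sanity check, not a required step from this SOP), query the service state on both hosts; exactly one should report 'active':

on bastion01: 'systemctl is-active openvpn-server@openvpn'
on bastion02: 'systemctl is-active openvpn-server@openvpn'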

batcave01

batcave01 is our ansible control host. It's where you run the playbooks mentioned in this SOP. However, it too needs updating and rebooting, and you cannot use the vhost_reboot playbook for it, since that would mean rebooting its own virthost. For this host you should go to the virthost and 'virsh shutdown' all the other VMs, then 'virsh shutdown' batcave01, then reboot the virthost manually.
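
A minimal sketch of that manual sequence, run as root on batcave01's virthost (guest names are illustrative; adjust for whatever VMs are actually present):

virsh list --name                     # see which guests are running
virsh shutdown <other-guest>          # repeat for every guest except batcave01
virsh shutdown batcave01
virsh list --all                      # wait until all guests show 'shut off'
reboot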

noc01 / dhcp server

noc01 is our dhcp server. Unfortunately, when the vmhost that contains the noc01 VM is rebooted, there is no dhcp server to answer it when it boots and tries to configure its network to talk to the tang server. To work around this you can run a simple dhcpd on batcave01: start it there, let the vmhost with noc01 come up, and then stop it. Ideally we would set up another dhcp host at some point to avoid this issue.

batcave01: 'systemctl start dhcpd'

Remember to stop it after the host comes back up: 'systemctl stop dhcpd' on batcave01.

Special package management directives

Sometimes we need to exclude something from being updated. This can be done with the package_excludes variable. Set that and the playbooks doing updates will exclude the listed items.

This variable is set in ansible/host_vars or ansible/group_vars for the host or group.
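
As a hedged illustration only (the exact format, string vs. list, should be checked against existing entries in ansible/host_vars or ansible/group_vars, and the package names here are made up):

# ansible/host_vars/<somehost> -- example only
package_excludes: "kernel* someotherpackage"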

Update Leader

Each update should have a Leader appointed. This person will be in charge of doing any read-write operations and delegating tasks to others. If you aren't specifically asked by the Leader to reboot or change something, please don't. The Leader will assign machine groups to reboot, or ask specific people to look at machines that didn't come back up from reboot or aren't working right after reboot. It's important to avoid multiple people operating on a single machine in a read-write manner and interfering with each other's changes.

Usually for a mass update/reboot there will be a hackmd or similar document that tracks which machines have already been rebooted and who is working on which one. Please check with the Leader for a link to this document.

Updates and Reboots via playbook

There are several playbooks related to this task:

  • vhost_update.yml applies updates to a vmhost and all its guests

  • vhost_reboot.yml shuts down VMs and reboots a vmhost

  • vhost_update_reboot.yml does both of the above

For hosts handled outside of the outage, you probably want to use these to make sure updates are applied before the reboots. Once updates have been applied globally before the outage, you will want to use just the reboot playbook.

Additionally there are two more playbooks to check things:

  • check-for-nonvirt-updates.yml

  • check-for-updates.yml

See those playbooks for more information; basically they allow you to see how many updates are pending on all the virthosts/bare metal machines and/or all machines. This is good to run at the end of an outage to confirm that everything was updated.
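
A hedged example of invoking these from batcave01; the playbook paths and the 'target' variable name used to select the virthost are assumptions here, so check the playbooks themselves (and run them however your access normally requires) before copying this:

sudo ansible-playbook playbooks/vhost_update_reboot.yml -e target=<some-virthost>
sudo ansible-playbook playbooks/check-for-updates.yml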

Doing the upgrade

If possible, system upgrades should be done in advance of the reboot (with relevant testing of new packages on staging). To do the upgrades, make sure that the Infrastructure RHEL repo is updated as necessary to pull in the new packages (Infrastructure Yum Repo SOP)

Before the outage, ansible can be used to apply all updates to hosts, or to apply all updates to staging hosts before those are done. Something like: ansible hostlist -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
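
The same ad-hoc pattern can be pointed at staging first; the 'staging' group name here is an assumption, so check the inventory for the right group or pattern:

ansible staging -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'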

Aftermath

  1. Make sure that everything’s running fine

  2. Check nagios for alerts and clear them all

  3. Re-enable nagios notifications after they are cleared.

  4. Make sure to perform any manual post-boot setup (such as entering passphrases for encrypted volumes)

  5. Consider running check-for-updates or check-for-nonvirt-updates to confirm that all hosts are updated.

  6. Close the fedorastatus outage.

  7. Close the outage ticket.

Non virthost reboots

If you need to reboot specific hosts and make sure they recover, consider using:

sudo ansible hostname -m reboot
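
The ansible reboot module waits for the host to come back before reporting success. As an optional follow-up sanity check that the host is reachable again:

sudo ansible hostname -m ping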