SOP: Add an OCP4 Node to an Existing Cluster

This SOP should be used in the following scenario:
  • A Red Hat OpenShift Container Platform 4.x cluster was installed some time ago (more than a day ago) and additional worker nodes are required to increase the capacity of the cluster.

Steps

  1. Add the new nodes to the Ansible inventory file in the appropriate group.

    e.g.:

    [ocp_workers]
    worker01.ocp.iad2.fedoraproject.org
    worker02.ocp.iad2.fedoraproject.org
    worker03.ocp.iad2.fedoraproject.org
    
    
    [ocp_workers_stg]
    worker01.ocp.stg.iad2.fedoraproject.org
    worker02.ocp.stg.iad2.fedoraproject.org
    worker03.ocp.stg.iad2.fedoraproject.org
    worker04.ocp.stg.iad2.fedoraproject.org
    worker05.ocp.stg.iad2.fedoraproject.org
  2. Add new host_vars for each host being added. See the following examples for VM vs. bare-metal hosts.

    # control plane VM
    inventory/host_vars/ocp01.ocp.iad2.fedoraproject.org
    
    # compute baremetal
    inventory/host_vars/worker01.ocp.iad2.fedoraproject.org
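
    A quick way to create the new host_vars file is to copy that of an existing node of the same type and then adjust the host-specific values (MAC address, IP address, and so on). This is only a sketch; the worker04 name below is an illustrative example, not a real host:

    # copy an existing worker's host_vars as a starting point (example hostname)
    cp inventory/host_vars/worker01.ocp.iad2.fedoraproject.org \
       inventory/host_vars/worker04.ocp.iad2.fedoraproject.org
    # then edit the new file and update the host-specific values
    $EDITOR inventory/host_vars/worker04.ocp.iad2.fedoraproject.org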
  3. If the nodes are compute/worker nodes, they must also be added to the ocp_nodes list in the following group_vars files: proxies for production and proxies_stg for staging.

    inventory/group_vars/proxies:ocp_nodes:
    inventory/group_vars/proxies_stg:ocp_nodes_stg:
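
    To find where the new hostnames need to be appended, grepping for the variable names in the group_vars files is enough; a minimal sketch, run from the top of the ansible repository checkout:

    # locate the node lists that need the new worker hostnames appended
    grep -n 'ocp_nodes' inventory/group_vars/proxies
    grep -n 'ocp_nodes_stg' inventory/group_vars/proxies_stg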
  4. Changes must be made to the roles/dhcp_server/files/dhcpd.conf.noc01.iad2.fedoraproject.org file so that DHCP assigns the node an IP address based on its MAC address and points the node at the next-server, where it can find the UEFI boot configuration.

    host worker01-ocp {                        # UPDATE THIS
         hardware ethernet 68:05:CA:CE:A3:C9;  # UPDATE THIS
         fixed-address 10.3.163.123;           # UPDATE THIS
         filename "uefi/grubx64.efi";
         next-server 10.3.163.10;
         option routers 10.3.163.254;
         option subnet-mask 255.255.255.0;
    }
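
    Before committing the DHCP change, it is worth verifying that the MAC address and fixed address are not already allocated to another host; a simple grep (using the example values above) is enough:

    # confirm the MAC address and IP are not already present in the DHCP config
    grep -i '68:05:CA:CE:A3:C9' roles/dhcp_server/files/dhcpd.conf.noc01.iad2.fedoraproject.org
    grep '10\.3\.163\.123' roles/dhcp_server/files/dhcpd.conf.noc01.iad2.fedoraproject.org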
  5. Changes must be made to DNS. To do this you must be a member of sysadmin-main; if you are not, send a patch request to the Fedora Infra mailing list for review, and it will be merged by the sysadmin-main members.

    See the following examples for the worker01.ocp nodes for production and staging.

    master/163.3.10.in-addr.arpa:123      IN        PTR      worker01.ocp.iad2.fedoraproject.org.
    master/166.3.10.in-addr.arpa:118      IN        PTR      worker01.ocp.stg.iad2.fedoraproject.org.
    master/iad2.fedoraproject.org:worker01.ocp            IN      A       10.3.163.123
    master/stg.iad2.fedoraproject.org:worker01.ocp            IN      A       10.3.166.118
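
    Once the DNS change has been merged and deployed, the new records can be checked with dig, for example for the production worker above:

    # forward and reverse lookups should match the A and PTR records added
    dig +short worker01.ocp.iad2.fedoraproject.org
    dig +short -x 10.3.163.123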
  6. Run the playbooks to apply the DHCP/TFTP changes and to update the haproxy config so that the new nodes are monitored and added to the load balancer.

    sudo rbac-playbook groups/noc.yml -t "tftp_server,dhcp_server"
    sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd'
  7. DHCP instructs the node to reach out to the next-server when it is handed an IP address. The next-server runs a TFTP server which serves the kernel, the initramfs, and the UEFI boot configuration (uefi/grub.cfg). This grub.cfg contains the following entries relating to the OCP4 nodes:

    menuentry 'RHCOS 4.8 worker staging' {
      linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda
    coreos.live.rootfs_url=http://10.3.166.50/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.166.50/rhcos/worker.ign
      initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
    }
    menuentry 'RHCOS 4.8 worker production' {
      linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda
    coreos.live.rootfs_url=http://10.3.163.65/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.163.65/rhcos/worker.ign
      initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
    }

    When a node boots and reads this UEFI boot configuration, the appropriate menu option must be selected manually:

    • To add a node to the staging cluster choose: RHCOS 4.8 worker staging

    • To add a node to the production cluster choose: RHCOS 4.8 worker production
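
    If you want to confirm what the TFTP server is actually serving before booting the node, the grub configuration can be fetched directly from the next-server; this assumes curl is built with TFTP support and uses the next-server address and path from the DHCP configuration above:

    # fetch the UEFI boot configuration from the TFTP server on the next-server
    curl -s tftp://10.3.163.10/uefi/grub.cfg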

  8. Connect to the os-control01 node that corresponds to the ENV (environment) the new node is being added to.

    Verify that you are authenticated correctly to the OpenShift cluster as the system:admin user.

    oc whoami
    system:admin
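
    It can also help to record the current list of nodes before the installation, so the new node is easy to spot once it joins:

    # baseline list of cluster nodes before adding the new one
    oc get nodes -o wide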
  9. The UEFI boot menu configuration contains links to a web server running on the ENV-specific os-control01 host. This server should only run when reinstalling an existing node or installing a new node. Start it manually using systemctl:

    systemctl start httpd.service
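
    With httpd running, the rootfs and ignition URLs referenced in the grub.cfg entries above can be checked from a host on the network; the production URLs are used here as an example:

    # both should return HTTP 200 if the web server is serving the artifacts
    curl -I http://10.3.163.65/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img
    curl -I http://10.3.163.65/rhcos/worker.ign

    Since this web server should only run during an installation, stop it again once the new node has been installed:

    systemctl stop httpd.service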
  10. Boot up the node and select the appropriate menu entry to install the node into the correct cluster. Wait until the node displays an SSH login prompt with the node's name; it may reboot several times during the process.

  11. As the new nodes are provisioned, they will attempt to join the cluster. Their certificate signing requests (CSRs) must first be approved. From the os-control01 node run the following:

    # List the CSRs. Entries in Pending status are the worker/compute nodes attempting to join the cluster; they must be approved.
    oc get csr
    
    # Approve all pending node CSRs in one go
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve

    This process usually needs to be repeated twice for each new node.
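
    After the CSRs have been approved, the new node should register and eventually reach the Ready state; a quick way to confirm from os-control01:

    # the new node should appear in the list and become Ready
    oc get nodes -o wide
    # no output here means there are no more pending CSRs
    oc get csr | grep -i pending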

To see more information about adding new worker/compute nodes to a user-provisioned infrastructure (UPI) based OCP4 cluster, see the detailed steps at [1], [2].