SOP Installation/Configuration of OCP4 on Fedora Infra

Install

To install OCP4 on Fedora Infra, one must be a member of the following groups:

  • sysadmin-openshift

  • sysadmin-noc

Prerequisites

Visit the OpenShift Console and download the following OpenShift tools:

  • A Red Hat Access account is required

  • OC client tools Here

  • OC installation tool Here

  • Ensure the downloaded tools are available on the PATH (a quick check is sketched after this list)

  • A valid OCP4 subscription is required to complete the installation and configuration; by default you have a 60 day trial.

  • Take a copy of your pull secret file; you will need to put this in the install-config.yaml file in the next step.
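
As a quick check that both tools are on the PATH, you can print their versions (a minimal sketch; both subcommands exist in current releases):

# confirm the client tools and installer are resolvable from the PATH
oc version --client
openshift-install version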

Generate install-config.yaml file

We must create an install-config.yaml file. Use the following example for inspiration, or alternatively refer to the documentation[1] for more detailed information/explanations.

apiVersion: v1
baseDomain: stg.fedoraproject.org
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: 'ocp'
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
fips: false
pullSecret: 'PUT PULL SECRET HERE'
sshKey: 'PUT SSH PUBLIC KEY HERE kubeadmin@core'

  • Log in to the os-control01 corresponding to the environment

  • Make a directory to hold the installation files: mkdir ocp4-<ENV>

  • Enter this newly created directory: cd ocp4-<ENV>

  • Generate a fresh SSH keypair: ssh-keygen -f ./ocp4-<ENV>-ssh

  • Create an ssh directory and place this keypair into it.

  • Put the contents of the public key in the sshKey value in the install-config.yaml file

  • Put the contents of your Pull Secret in the pullSecret value in the install-config.yaml

  • Take a backup of the install-config.yaml to install-config.yaml.bak; running the next steps consumes this file, so having a backup allows you to recover from mistakes quickly. (These steps are sketched as shell commands after this list.)
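
Taken together, the steps above look roughly like the following (a minimal sketch; <ENV> is a placeholder, and install-config.yaml still has to be filled in by hand):

# on os-control01 for the target environment
mkdir ocp4-<ENV> && cd ocp4-<ENV>

# generate a fresh SSH keypair for the cluster
ssh-keygen -f ./ocp4-<ENV>-ssh

# keep the keypair in an ssh directory alongside the install files
mkdir ssh
mv ocp4-<ENV>-ssh ocp4-<ENV>-ssh.pub ssh/

# edit install-config.yaml (paste in the pull secret and the public key),
# then keep a backup, since the next steps consume the file
cp install-config.yaml install-config.yaml.bak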

Create the Installation Files

Using the openshift-install tool we can generate the installation files. Make sure that the install-config.yaml file is in the /path/to/ocp4-<ENV> location before attempting the next steps.

Create the Manifest Files

The manifest files are human readable; at this stage you can add any customisations required before the installation begins.

  • Create the manifests: openshift-install create manifests --dir=/path/to/ocp4-<ENV>

  • All configuration for RHCOS must be done via MachineConfig resources. If there is known configuration which must be applied, such as NTP, you can copy the MachineConfigs into the /path/to/ocp4-<ENV>/openshift directory now.

  • The following step should also be performed at this point: edit /path/to/ocp4-<ENV>/manifests/cluster-scheduler-02-config.yml and change the mastersSchedulable value to false, so that regular workloads are not scheduled on the control plane nodes (see the sketch after this list).
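
A minimal sketch of this stage, assuming any prepared MachineConfigs live in a local machineconfigs/ directory (a hypothetical path) and that the scheduler manifest contains a mastersSchedulable: true line:

# generate the manifests
openshift-install create manifests --dir=/path/to/ocp4-<ENV>

# drop in any prepared MachineConfigs (e.g. NTP/chrony settings)
cp machineconfigs/*.yaml /path/to/ocp4-<ENV>/openshift/

# keep regular workloads off the control plane nodes
sed -i 's/mastersSchedulable: true/mastersSchedulable: false/' \
    /path/to/ocp4-<ENV>/manifests/cluster-scheduler-02-config.yml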

Create the Ignition Files

The ignition files are generated from the manifests and MachineConfig files and are the final installation files for the three roles: bootstrap, master and worker. In Fedora we prefer not to use the term master here, so we have renamed this role to controlplane.

  • Create the ignition files: openshift-install create ignition-configs --dir=/path/to/ocp4-<ENV>

  • At this point you should have the following three files: bootstrap.ign, master.ign and worker.ign.

  • Rename master.ign to controlplane.ign (as shown in the sketch after this list).

  • A directory named auth has also been created. It contains two files, kubeadmin-password and kubeconfig, which allow cluster-admin access to the cluster.
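
Put together, a minimal sketch of this stage (the rename is a plain mv; the auth directory is created by the installer itself):

# turn the manifests into ignition files for bootstrap, master and worker
openshift-install create ignition-configs --dir=/path/to/ocp4-<ENV>

# Fedora prefers the controlplane name over master
mv /path/to/ocp4-<ENV>/master.ign /path/to/ocp4-<ENV>/controlplane.ign

# cluster-admin credentials generated by the installer
ls /path/to/ocp4-<ENV>/auth
# kubeadmin-password  kubeconfig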

Copy the Ignition files to the batcave01 server

On batcave01, at the following location: /srv/web/infra/bigfiles/openshiftboot/:

  • Create a directory to match the environment: mkdir /srv/web/infra/bigfiles/openshiftboot/ocp4-<ENV>

  • Copy the ignition files, the ssh files and the auth files generated in previous steps to this newly created directory (a copy command sketch follows the directory listing below). Users in the sysadmin-openshift group should have the necessary permissions to write to this location.

  • When this is complete it should look like the following:

    ├── <ENV>
    │   ├── auth
    │   │   ├── kubeadmin-password
    │   │   └── kubeconfig
    │   ├── bootstrap.ign
    │   ├── controlplane.ign
    │   ├── ssh
    │   │   ├── id_rsa
    │   │   └── id_rsa.pub
    │   └── worker.ign
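
A minimal sketch of the copy, assuming it is run from the installation directory on os-control01, that you have ssh access to batcave01, and that the target directory matches the one created earlier (the hostname and paths are illustrative):

# from os-control01, push the generated files to batcave01
scp -r bootstrap.ign controlplane.ign worker.ign ssh auth \
    batcave01.iad2.fedoraproject.org:/srv/web/infra/bigfiles/openshiftboot/<ENV>/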

Update the ansible inventory

The ansible inventory, host vars and group vars should be updated with the new hosts' information.

For inspiration see the following PR where we added the ocp4 production changes.

Update the DNS/DHCP configuration

The DNS and DHCP configuration must also be updated. This PR contains the necessary DHCP changes for prod, which can be done in ansible.

However, the DNS changes may only be performed by sysadmin-main. For this reason any DNS changes must go via a patch snippet which is emailed to the infrastructure@lists.fedoraproject.org mailing list for review and approval. This process may take several days.

Generate the TLS Certs for the new environment

This is beyond the scope of this SOP; the best option is to create a ticket for Fedora Infra to request that these certs are created and made available for use. The following certs should be available:

  • *.apps.<ENV>.fedoraproject.org

  • api.<ENV>.fedoraproject.org

  • api-int.<ENV>.fedoraproject.org

Run the Playbooks

There are a number of playbooks that must be run. Once all the previous steps have been completed, we can run these playbooks from the batcave01 instance.

  • sudo rbac-playbook groups/noc.yml -t 'tftp_server,dhcp_server'

  • sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd,iptables'

Baremetal / VMs

Depending on whether some of the nodes are VMs or baremetal, different tags should be supplied to the following playbook. If the entire cluster is baremetal you can drop the kvm_deploy tag entirely (see the variant after the command below).

If VMs are used for some of the roles, make sure to leave the kvm_deploy tag in.

  • sudo rbac-playbook manual/ocp4-place-ignitionfiles.yml -t "ignition,repo,kvm_deploy"
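
For an all-baremetal cluster the same playbook would be run without the kvm_deploy tag, for example:

# all-baremetal variant: no VMs to deploy, so drop kvm_deploy
sudo rbac-playbook manual/ocp4-place-ignitionfiles.yml -t "ignition,repo"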

Baremetal

At this point we can switch on the baremetal nodes and begin the PXE/UEFI boot process. Via DHCP/DNS, the baremetal nodes should have the configuration necessary to reach out to the noc01.iad2.fedoraproject.org server and retrieve the UEFI boot configuration via PXE.

Once booted up, you should visit the management console for this node, and manually choose the UEFI configuration appropriate for its role.

The node will begin booting, and during the boot process it will reach out to the os-control01 instance specific to the <ENV> to retrieve the ignition file appropriate to its role.

The system then proceeds autonomously; it will install and potentially reboot multiple times as updates are retrieved and applied.

Eventually you will be presented with an SSH login prompt, which should show the correct hostname (e.g. ocp01) to match what is in the DNS configuration.

Bootstrapping completed

When the control plane is up, we should see all controlplane instances available in the appropriate haproxy dashboard, e.g. haproxy.

At this time we should take the bootstrap instance out of the haproxy load balancer.

  • Make the necessary changes to ansible at: ansible/roles/haproxy/templates/haproxy.cfg

  • Once merged, run the following playbook once more: sudo rbac-playbook groups/proxies.yml -t 'haproxy'

Begin installation of the worker nodes

Follow the same processes listed in the Baremetal section above to switch on the worker nodes and begin installation.

Configure the os-control01 to authenticate with the new OCP4 cluster

Copy the kubeconfig to ~root/.kube/config on the os-control01 instance. This allows the root user to be automatically authenticated to the new OCP4 cluster with cluster-admin privileges.
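
A minimal sketch, assuming the auth directory from the installation is still available under /path/to/ocp4-<ENV>:

# on os-control01: give root a default kubeconfig with cluster-admin access
mkdir -p /root/.kube
cp /path/to/ocp4-<ENV>/auth/kubeconfig /root/.kube/config
chmod 600 /root/.kube/config

# confirm access
oc whoami
oc get nodes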

Accept Node CSR Certs

To accept the worker/compute nodes into the cluster we need to accept their CSR certs.

List the CSR certs. The ones we’re interested in will show as pending:

oc get csr

To accept all pending OCP4 node CSRs in a one-liner, do the following (new CSRs appear as each node joins, so this may need to be run more than once):

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve

Once completed, the cluster nodes should look something like this:

[root@os-control01 ocp4][STG]= oc get nodes
NAME                                      STATUS   ROLES    AGE   VERSION
ocp01.ocp.stg.iad2.fedoraproject.org      Ready    master   34d   v1.21.1+9807387
ocp02.ocp.stg.iad2.fedoraproject.org      Ready    master   34d   v1.21.1+9807387
ocp03.ocp.stg.iad2.fedoraproject.org      Ready    master   34d   v1.21.1+9807387
worker01.ocp.stg.iad2.fedoraproject.org   Ready    worker   21d   v1.21.1+9807387
worker02.ocp.stg.iad2.fedoraproject.org   Ready    worker   20d   v1.21.1+9807387
worker03.ocp.stg.iad2.fedoraproject.org   Ready    worker   20d   v1.21.1+9807387
worker04.ocp.stg.iad2.fedoraproject.org   Ready    worker   34d   v1.21.1+9807387
worker05.ocp.stg.iad2.fedoraproject.org   Ready    worker   34d   v1.21.1+9807387

At this point the cluster is basically up and running.