---
title: Adding Baremetal Installer Provisioned Infrastructure (IPI) to OpenShift
authors:
  - "@sadasu"
reviewers:
  - "@smarterclayton"
  - "@abhinavdahiya"
  - "@enxebre"
  - "@deads2k"
approvers:
  - "@abhinavdahiya"
  - "@smarterclayton"
  - "@enxebre"
  - "@deads2k"
creation-date: 2019-11-06
last-updated: 2019-11-06
status: implemented
---

# Adding Baremetal IPI capabilities to OpenShift

This enhancement provides context for the series of features and
enhancements that will follow to make Baremetal IPI deployments via
OpenShift a reality.

At the time of this writing, code for some of these enhancements has already
merged, some is in progress, and the rest is yet to be implemented. References
to all of these features, in their different stages of development, are
provided below.

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift/docs]

## Summary

Baremetal IPI deployments enable OpenShift to enroll baremetal servers as
Nodes that can run Kubernetes workloads.
The Baremetal Operator [1], along with other provisioning services (Ironic and
its dependencies), runs in its own pod called "metal3". This pod is deployed by
the Machine API Operator when the Platform type is `BareMetal`. The OpenShift
Installer is responsible for providing all the configuration required for
a successful deployment.
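
For illustration, the platform type that gates the "metal3" deployment is
reported on the cluster-scoped `Infrastructure` object. A minimal sketch of
what that object looks like on a baremetal cluster (exact fields vary by
OpenShift release):

```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  # The Machine API Operator deploys the "metal3" pod only when the
  # platform is BareMetal.
  platform: BareMetal
```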

## Motivation

The motivation for this enhancement request is to provide background for all
the subsequent enhancement requests for Baremetal IPI deployments.

### Goals

The goal of this enhancement request is to provide context for all the changes
that have already been merged towards making Baremetal IPI deployments a
reality. All future Baremetal enhancement requests will refer back to this one
to provide context.

### Non-Goals

Raising development PRs as a result of this enhancement request.

## Proposal

Every OpenShift-based Baremetal IPI deployment will run a "metal3" pod on
one Master Node. A "metal3" pod includes a container running the BareMetal
Operator (BMO) and several other supporting containers that work together.

The BMO and the other supporting containers together are able to discover a
baremetal server on a pre-determined provisioning network, learn the
hardware attributes of the server, and eventually boot it to make it available
as a Machine within a MachineSet.

The Machine API Operator (MAO) currently deploys the "metal3" pod only
when the Platform type is `BareMetal`, but the BareMetalHost CRD is exposed
by the MAO as part of the release payload, which is managed by the Cluster
Version Operator. The MAO is responsible for starting the BMO and the
containers running the Ironic services, and for providing these containers
with their necessary configuration via environment variables.
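
As a hedged sketch of that wiring (the variable and ConfigMap names below are
illustrative, not the exact set the MAO injects), each container receives its
provisioning configuration roughly like this:

```yaml
containers:
- name: ironic-conductor
  env:
  # IP the Ironic services listen on within the provisioning network.
  - name: PROVISIONING_IP
    valueFrom:
      configMapKeyRef:
        name: metal3-config      # hypothetical ConfigMap name
        key: provisioning_ip
  # NIC on the host that is attached to the provisioning network.
  - name: PROVISIONING_INTERFACE
    valueFrom:
      configMapKeyRef:
        name: metal3-config
        key: provisioning_interface
```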

The installer is responsible for kicking off a Baremetal IPI deployment
with the right configuration.

### User Stories

With the addition of the features described in this and the other enhancements
in this directory, OpenShift can be used to bring up a functioning cluster
starting from a set of baremetal servers. As mentioned earlier, these
enhancements rely on the Baremetal Operator (BMO) [1] running within the
"metal3" pod to manage baremetal hosts. The BMO in turn relies on the Ironic
service [3] to manage and provision baremetal servers.

1. Will enable the user to deploy a control plane with 3 master nodes.
2. Will enable the user to grow the cluster by dynamically adding worker
nodes.
3. Will enable the user to scale down the cluster by removing worker nodes.
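
Stories 2 and 3 are ultimately exercised by scaling a worker MachineSet. A
minimal sketch, with a hypothetical MachineSet name:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-worker-0       # hypothetical name
  namespace: openshift-machine-api
spec:
  # Raising replicas provisions additional BareMetalHosts as workers;
  # lowering it deprovisions them.
  replicas: 3
```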

### Implementation Details/Notes/Constraints

Baremetal IPI is integrated with OpenShift through the metal3.io [8] project.
Metal3.io is a set of Kubernetes controllers that wrap the OpenStack Ironic
project to provide Kubernetes-native APIs for managing deployment and
monitoring of physical hosts.

The installer support for Baremetal IPI deployments is described in more detail
in [7]. The installer runs on a special "provisioning host" that needs to be
connected to both a "provisioning network" and an "external network". The
provisioning network is a dedicated network used solely for configuring
baremetal servers to be part of the cluster. The traffic on the
provisioning network needs to be isolated from the traffic on the external
network (hence the two separate networks). The external network carries
cluster traffic, which includes cluster control plane traffic, application
traffic, and data traffic.

#### Control Plane Deployment

1. A minimal Baremetal IPI deployment consists of 4 hosts: one to be used
first as a provisioning host and later potentially re-purposed as a worker,
while the other 3 make up the control plane. These 4 hosts need to be connected
to both the provisioning and external networks.

2. Installation can be kicked off by downloading and running
"openshift-baremetal-install". This binary differs from the "openshift-install"
binary only because libvirt always needs to be linked for the baremetal
install. Removing the bootstrap node would remove the dependency on libvirt,
at which point baremetal IPI installs could be part of the normal OpenShift
installer. This is on the roadmap for this work and is being investigated.

3. The installer starts a bootstrap VM on the provisioning host. With other
platform types supported by OpenShift, a cloud already exists and the installer
runs the bootstrap VM on the control plane of this existing cloud. In the case
of the baremetal platform type, this cloud does not already exist, so the
installer starts the bootstrap VM using libvirt.

4. The bootstrap VM needs to be connected to the provisioning network, so the
network interface on the provisioning host that is connected to the
provisioning network needs to be provided to the installer.

5. The bootstrap VM must be configured with a special well-known IP within the
provisioning network that needs to be provided as input to the installer (both
this IP and the interface from the previous step appear in the illustrative
install-config excerpt after this list).

6. The installer uses Ironic in the bootstrap VM to provision each host that
makes up the control plane. The installer uses terraform to invoke the Ironic
API, which configures each host to boot over the provisioning network using
DHCP and PXE.

7. The bootstrap VM runs a DHCP server and responds with network information
and PXE instructions when Ironic powers on a host. The host boots the Ironic
Agent image, which is hosted on the httpd instance also running on the
bootstrap VM.

8. After the Ironic Agent on the host boots and runs from its ramdisk image, it
looks for the Ironic service either by using a URL passed in as a kernel
command line argument in the PXE response or by using mDNS to search for Ironic
in the local L2 network.

9. Ironic on the bootstrap VM then copies the RHCOS image hosted on the httpd
instance to the local disk of the host and also writes the necessary ignition
files so that the host can start creating the control plane when it runs the
local image.

10. After Ironic writes the image and ignition configs to the local disk of the
host, Ironic power cycles the host, causing it to reboot. The boot order on the
host is set to boot from the image on the local drive instead of PXE booting.

11. After the control plane hosts have an OS, the normal bootstrapping process
continues with the help of the bootstrap VM. The bootstrap VM runs a temporary
API service to talk to the etcd cluster on the control plane hosts.

12. The manifests constructed by the installer are pushed into the new cluster.
The operators launched in the new cluster bring up other services and reconcile
cluster state and configuration.

13. The Machine API Operator (MAO) running on the control plane cluster detects
the platform type as being "baremetal" and launches the "metal3" pod and the
cluster-api-provider-baremetal (CAPBM) controller. The metal3 pod runs several
Ironic services in containers in addition to the baremetal-operator (BMO).
After the control plane is completely up, the bootstrap VM is destroyed.

14. The baremetal-operator that is part of the metal3 service starts monitoring
hosts using the Ironic service, which is also part of metal3. The
baremetal-operator uses the BareMetalHost CRD to get information about the
on-board controllers on the servers. As mentioned previously in this document,
this CRD exists on non-baremetal platform types too, but does not represent any
usable information for other platforms.
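
As referenced from step 5, here is a hedged, abridged sketch of the baremetal
platform section of an install-config.yaml. The field names follow the
installer's baremetal documentation [7]; all values (names, addresses, MACs,
credentials) are hypothetical:

```yaml
apiVersion: v1
baseDomain: example.com            # hypothetical
metadata:
  name: mycluster                  # hypothetical
platform:
  baremetal:
    # NIC on each host attached to the provisioning network (step 4).
    provisioningNetworkInterface: enp1s0
    # Well-known IP the bootstrap VM takes on the provisioning network (step 5).
    bootstrapProvisioningIP: 172.22.0.2
    apiVIP: 192.168.111.5
    ingressVIP: 192.168.111.4
    hosts:
    - name: master-0
      role: master
      bmc:
        address: ipmi://192.168.111.1:6230
        username: admin
        password: password
      bootMACAddress: 52:54:00:00:00:01
```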

#### Worker Deployment

Unlike the control plane deployment, the worker deployment is managed by
metal3. Not all aspects of worker deployment are implemented completely.

1. All worker nodes need to be attached to both the provisioning and external
networks and configured to PXE boot over the provisioning network. A temporary
provisioning IP address in the provisioning network is assigned to each of
these hosts.

2. The user adds hosts to the available inventory for their cluster by creating
BareMetalHost CRs (a hedged sketch appears after this list). For more
information about the 3 CRs that already exist for a host transitioning from a
baremetal host to a Node, please refer to [9].

3. The cluster-api-provider-baremetal (CAPBM) controller finds an
unassigned/free BareMetalHost and uses it to fulfill a Machine resource. It
then sets the configuration on the host to start provisioning with the RHCOS
image (using the RHCOS image URL present in the Machine provider spec) and the
worker ignition config for the cluster.

4. The Baremetal Operator uses the Ironic service to provision the worker nodes
in a process that is very similar to the provisioning of the control plane,
with some key differences. The DHCP server is now running within the metal3 pod
instead of in the bootstrap VM.

5. The provisioning IP used to bring up worker nodes remains the same as in the
control plane case, and the provisioning network also remains the same. The
installer also provides a DHCP range within the same network from which the
workers are assigned IP addresses.

6. The ignition configs for the worker nodes are passed as user data in the
config drive. Just as with the control plane hosts, Ironic power cycles the
hosts, which then boot using the RHCOS image now on their local disk. Each host
then joins the cluster as a worker.
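
As referenced from step 2, a minimal sketch of a BareMetalHost CR and its BMC
credentials Secret. The kinds and fields follow the metal3-io
baremetal-operator API [1]; the names, addresses, and credentials are
hypothetical:

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
  namespace: openshift-machine-api
spec:
  online: true
  # MAC of the NIC that PXE boots over the provisioning network.
  bootMACAddress: 52:54:00:00:00:11
  bmc:
    address: ipmi://192.168.111.1:6231
    credentialsName: worker-0-bmc-secret
---
apiVersion: v1
kind: Secret
metadata:
  name: worker-0-bmc-secret
  namespace: openshift-machine-api
type: Opaque
stringData:
  username: admin
  password: password
```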

Currently, there is no way to pass the provisioning config known to the
installer to metal3, which is responsible for provisioning the workers. This
gap is the subject of enhancement [2].

### Risks and Mitigations

These will be specified in the follow-up enhancement requests mentioned above.

## Design Details

### Test Plan

True e2e and integration testing can happen only after the implementation for
enhancement [2] lands. Until then, e2e testing is being performed with the
help of some developer scripts.

Unit tests have been added to the MAO and the Installer to test the additions
made for the Baremetal IPI case.

### Graduation Criteria

Metal3 integration is in tech preview in 4.2 and is targeted for GA in 4.4.

Metal3 integration is currently missing an important piece: a way to pass in
information about the baremetal servers and their provisioning environment.
Without this, true end-to-end testing cannot be performed in order to graduate
to GA.

### Upgrade / Downgrade Strategy

Metal3 integration is in tech preview in 4.2 and is missing key pieces that
allow a user to specify the baremetal server details and their provisioning
setup. It is not usable in this state without the help of external scripts that
provide the above information in the form of a ConfigMap.
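
A hedged sketch of the shape of such a ConfigMap (the name and key set are
illustrative, not the exact interface; see enhancement [2]):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metal3-config              # hypothetical name
  namespace: openshift-machine-api
data:
  provisioning_ip: 172.22.0.3/24
  provisioning_interface: enp1s0
  # Pool the metal3 DHCP server hands out to PXE-booting workers.
  dhcp_range: 172.22.0.10,172.22.0.100
  deploy_kernel_url: http://172.22.0.3:6180/images/ironic-python-agent.kernel
  deploy_ramdisk_url: http://172.22.0.3:6180/images/ironic-python-agent.initramfs
```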

In 4.4, when all the installer features land, the Metal3 integration will be
fully functional within OpenShift. For these reasons, an upgrade strategy is
not necessary at this point.

### Version Skew Strategy

This enhancement serves as background for the rest of the enhancements. We will
discuss the version skew strategy for each enhancement individually in its
respective request.

## Implementation History

Implementation to deploy a Metal3 cluster from the MAO was added via [4].

## Infrastructure Needed

The Baremetal IPI solution depends on the Baremetal Operator and the baremetal
Machine actuator, both of which can be found at [5].
OpenShift integration can be found at [6].
Implementation is complete on the metal3-io side, and the relevant bits have
been added to the OpenShift repos.

[1] - https://github.com/metal3-io/baremetal-operator
[2] - https://github.com/openshift/enhancements/blob/master/enhancements/baremetal/baremetal-provisioning-config.md
[3] - https://github.com/openstack/ironic
[4] - https://github.com/openshift/machine-api-operator/commit/43dd52d5d2dfea1559504a01970df31925501e35
[5] - https://github.com/metal3-io
[6] - https://github.com/openshift-metal3
[7] - https://github.com/openshift/installer/blob/master/docs/user/metal/install_ipi.md
[8] - https://metal3.io/
[9] - https://github.com/metal3-io/metal3-docs/blob/master/design/nodes-machines-and-hosts.md