---
title: connected-assisted-installer
authors:
  - "@avishayt"
  - "@hardys"
  - "@dhellmann"
reviewers:
  - "@beekhof"
  - "@deads2k"
  - "@hexfusion"
  - "@mhrivnak"
approvers:
  - "@crawford"
  - "@abhinavdahiya"
  - "@eparis"
creation-date: 2020-06-09
last-updated: 2020-06-10
status: implementable
see-also:
  - "/enhancements/baremetal/minimise-baremetal-footprint.md"
---

# Assisted Installer for Connected Environments

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
## Summary

This enhancement describes changes in and around the installer to
assist with deployment on user-provisioned infrastructure. The use
cases are primarily relevant for bare metal, but in the future may be
applicable to cloud users who are running an installer UI directly
instead of using a front-end such as `cloud.redhat.com` or the UI
provided by their cloud vendor.

## Motivation

The target user is someone wanting to deploy OpenShift, especially on
bare metal, with as few up-front infrastructure dependencies as
possible. This person has access to server hardware and wants to run
workloads quickly. They do not necessarily have the administrative
privileges to create private VLANs, configure DHCP/PXE servers, or
manage other aspects of the infrastructure surrounding the hardware
where the cluster will run. If they do have the required privileges,
they may not want to delegate them to the OpenShift installer for an
installer-provisioned infrastructure installation, preferring instead
to use their existing tools and processes for some or all of that
configuration. They are willing to accept that the cluster they build
may not have all of the infrastructure automation features
immediately, but that by taking additional steps they will be able to
add those features later.

### Goals

- Make initial deployment of usable and supportable clusters simpler.
- Move more infrastructure configuration from day 1 to day 2.
- Support connected on-premises deployments.
- Support existing infrastructure automation features, especially for
  day 2 cluster management and scale-out.

### Non-Goals

- Because the initial focus is on bare metal, this enhancement does
  not exhaustively cover variations needed to offer similar features
  on other platforms (such as changes to image formats, the way a host
  boots, etc.). It is desirable to support those platforms, but that
  work will be described separately.
- Environments with restricted networks where hosts cannot reach the
  internet unimpeded ("disconnected" or "air-gapped") will require
  more work to support this installation workflow than simply
  packaging the hosted solution built to support fully connected
  environments. The work to support disconnected environments will be
  covered by a future enhancement.
- Replace the existing OpenShift installer.
- Describe how these workflows would work for multi-cluster
  deployments managed with Hive or ACM.

## Proposal

There are several separate changes to enable the assisted installer
workflows, including a GUI front-end for the installer, a cloud-based
orchestration service, and changes to the installer and bootstrapping
process.

The process starts when the user goes to an "assisted installer"
application running on `cloud.redhat.com`, enters details needed by
the installer (OpenShift version, ssh keys, proxy settings, etc.), and
then downloads a live RHCOS ISO image with the software and settings
they need to complete the installation locally.

The user then boots the live ISO on each host they want to be part of
the cluster (control plane and workers). They can do this by hand
using thumb drives, by attaching the ISO using virtual media support
in the BMC of the host, or any other way they choose.

When the ISO boots, it starts an agent that communicates with the REST
API for the assisted installer service running on `cloud.redhat.com`
to receive instructions. The agent registers the host with the
service, using the user's pull secret embedded in the ISO's Ignition
config to authenticate. The agent identifies itself based on the
serial number from the host it is running on. Communication always
flows from agent to service via HTTPS so that firewalls and proxies
work as expected.
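
The serial-number identification could be as simple as reading the DMI
data the kernel exposes. The sketch below is illustrative only: the
`host_identifier` helper and its fallback behavior are assumptions,
not the actual agent code, though the sysfs path shown is the standard
location on Linux.

```python
# Hypothetical sketch: deriving a host identifier from the machine's
# serial number via the kernel's DMI interface. The fallback value is
# an assumption for illustration, not the real agent's behavior.
from pathlib import Path

DMI_SERIAL_PATH = "/sys/class/dmi/id/product_serial"


def host_identifier(dmi_path: str = DMI_SERIAL_PATH) -> str:
    """Return the DMI product serial, or "unknown" if it is unreadable."""
    try:
        serial = Path(dmi_path).read_text().strip()
    except OSError:
        return "unknown"
    return serial or "unknown"
```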

Each host agent periodically asks the service what tasks to perform,
and the service replies with a list of commands and arguments. A
command can be to:

1. Return hardware information for its host
2. Return L2 and L3 connectivity information between its host and the
   other hosts (the IPs and MAC addresses of the other hosts are
   passed as arguments)
3. Begin the installation of its host (arguments include the host's
   role, boot device, etc.). The agent executes different installation
   logic depending on its role (bootstrap-master, master, or worker).
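
The polling loop described above amounts to a simple step dispatch on
the agent side. The sketch below shows the idea; the step names and
payload shape are hypothetical, not the service's actual wire format.

```python
# Illustrative step dispatch, assuming a service reply shaped like
# {"step_type": ..., "args": [...]}. Step names and payload layout are
# assumptions for this sketch, not the real protocol.
import json


def dispatch_step(step: dict) -> str:
    """Map a service-issued step to the action the agent would run."""
    handlers = {
        "inventory": lambda args: "collect hardware inventory",
        "connectivity-check": lambda args: f"probe {len(args)} peer hosts",
        "install": lambda args: f"install as role {args[0]}",
    }
    handler = handlers.get(step["step_type"])
    if handler is None:
        raise ValueError(f"unknown step: {step['step_type']}")
    return handler(step.get("args", []))


reply = json.loads('{"step_type": "install", "args": ["bootstrap-master", "/dev/sda"]}')
# dispatch_step(reply) -> "install as role bootstrap-master"
```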

The agent posts the results of each command back to the
service. During the actual installation, the agents post progress.

As agents report to the assisted installer, their hosts appear in the
UI and the user is given an opportunity to examine the hardware
details reported and to set the role and cluster of each host.

The assisted installer orchestrates a set of validations on all
hosts. It ensures there is full L2 and L3 connectivity between all of
the hosts, that the hosts all meet minimum hardware requirements, and
that the API and ingress VIPs are on the same machine network.
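
The VIP validation, for example, reduces to a subnet membership check.
A minimal sketch (the function name and shape are illustrative
assumptions, not the service's actual code):

```python
# Minimal sketch of the VIP validation: every VIP must fall inside the
# machine network CIDR. Illustrative only; not the service's real code.
import ipaddress


def vips_on_machine_network(machine_cidr: str, *vips: str) -> bool:
    """True when every VIP address lies within the machine network."""
    network = ipaddress.ip_network(machine_cidr)
    return all(ipaddress.ip_address(vip) in network for vip in vips)
```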

The discovered hardware and networking details are combined with the
results of the validation to derive defaults for the machine network
CIDR, the API VIP, and other network configuration settings for the
hosts.
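
One way such a default could be derived is to choose, from the
networks discovered on the hosts' interfaces, a CIDR that contains
every host address. The sketch below only shows the idea; the names
and inputs are illustrative, and the real selection logic is more
involved.

```python
# Rough sketch of deriving a default machine network CIDR: pick the
# first candidate network that contains every discovered host address.
# Illustrative assumption, not the actual default-selection algorithm.
import ipaddress
from typing import Optional


def default_machine_cidr(host_ips: list, candidate_cidrs: list) -> Optional[str]:
    for cidr in candidate_cidrs:
        network = ipaddress.ip_network(cidr)
        if all(ipaddress.ip_address(ip) in network for ip in host_ips):
            return cidr
    return None  # no single candidate network covers all hosts
```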

When enough hosts are configured, the assisted installer application
replies to the agent on each host with the instructions it needs to
take part in forming the cluster. The assisted installer application
selects one host to run the bootstrap services used during
installation, and the other hosts are told to write an RHCOS image to
disk and set up Ignition to fetch configuration from the
machine-config-operator in the usual way.

During installation, progress and error information is reported to the
assisted installer application on `cloud.redhat.com` so it can be
shown in the UI.
### Integration with Existing Bare Metal Infrastructure Management Tools

Clusters built using the assisted installer workflow use the same
"baremetal" platform setting as clusters built with
installer-provisioned infrastructure. The cluster runs metal3, without
PXE booting support.

BareMetalHosts created by the assisted installer workflow do not have
BMC credentials set. This means that power-based fencing is not
available for the associated nodes until the user provides the BMC
details.
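
As an illustration, such a host might look roughly like the following
BareMetalHost manifest. The names and addresses here are made up, and
the exact fields the workflow sets may differ; the point is only the
absence of the `bmc` stanza until day 2:

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0                      # illustrative name
  namespace: openshift-machine-api
spec:
  online: true
  bootMACAddress: 52:54:00:00:00:01   # illustrative value
  # No "bmc" stanza: power-based fencing stays unavailable for this
  # host until the user adds BMC details on day 2, for example:
  #
  #   bmc:
  #     address: ipmi://192.168.111.1:6230
  #     credentialsName: worker-0-bmc-secret
```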

### User Stories

#### Story 1

As a cluster deployer, I want to install OpenShift on a small set of
hosts without having to make configuration changes to my network or
obtain administrator access to infrastructure so I can experiment
before committing to a full production-quality setup.

#### Story 2

As a cluster deployer, I want to install OpenShift on a large number
of hosts using my existing provisioning tools to automate launching
the installer so I can adapt my existing admin processes and
infrastructure tools instead of replacing them.

#### Story 3

As a cluster deployer, I want to install a production-ready OpenShift
cluster without committing to delegating all infrastructure control to
the installer or to the cluster, so I can adapt my existing admin
processes and infrastructure management tools instead of replacing
them.

#### Story 4

As a cluster hardware administrator, I want to enable power control
for the hosts that make up my running cluster so I can use features
like fencing and failure remediation.

### Implementation Details/Notes/Constraints

Much of the work described by this enhancement already exists as a
proof-of-concept implementation. Some aspects will need to change as
part of moving from PoC to product. At the very least, the code will
need to be moved into a more suitable GitHub org.

The agent discussed in this design is different from the
`ironic-python-agent` used by Ironic in the current
installer-provisioned infrastructure implementation.

### Risks and Mitigations

The current implementation relies on
[minimise-baremetal-footprint](https://github.com/openshift/enhancements/pull/361). If
that approach cannot be supported, users can proceed by providing an
extra host (4 hosts to build a 3-node cluster, 6 hosts to build a
5-node cluster, etc.).

## Design Details

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:
- Maturity levels - `Dev Preview`, `Tech Preview`, `GA`
- Deprecation

Clearly define what graduation means.

#### Examples

These are generalized examples to consider, in addition to the aforementioned
[maturity levels][maturity-levels].

##### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers

##### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

##### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

This work is all about building clusters on day 1. After the cluster
is running, it should be possible to upgrade or downgrade it like any
other cluster.

### Version Skew Strategy

The assisted installer and agent need to know enough about the
installer version to construct the installer's inputs correctly. This
is a development-time skew, for the most part, and the service that
builds the live ISOs with the assisted installer components should be
able to adjust the version of the assisted installer to match the
version of OpenShift, if necessary.

## Implementation History

### Proof of Concept (June, 2020)

* https://github.com/filanov/bm-inventory : The REST service
* https://github.com/ori-amizur/introspector : Gathers hardware and
  connectivity info on a host
* https://github.com/oshercc/coreos_installation_iso : Creates the
  RHCOS ISO - run by bm-inventory as a Kubernetes job
* https://github.com/oshercc/ignition-manifests-and-kubeconfig-generate :
  Script that generates ignition manifests and kubeconfig - run by
  bm-inventory as a Kubernetes job
* https://github.com/tsorya/test-infra : Called by
  openshift-metal3/dev-scripts to end up with a cluster of VMs like in
  dev-scripts, but using the assisted installer
* https://github.com/eranco74/assisted-installer.git : The actual
  installer code that runs on the hosts

## Drawbacks

The idea is to find the best form of an argument why this enhancement should _not_ be implemented.

## Alternatives

The telco/edge bare metal team is working on support for automating
virtual media and dropping the need for a separate provisioning
network. Using the results will still require the user to understand
how to tell the installer the BMC type and credentials and to ensure
each host has an IP provided by an outside DHCP server. Hardware
support for automating virtual media is not consistent between
vendors.

## Infrastructure Needed [optional]

The existing code (see "Proof of Concept" above) will need to be moved
into an official GitHub organization.