Commit 8df2b11

connected assisted installer

This enhancement describes changes in and around the installer to assist with deployment on user-provisioned infrastructure. The use cases are primarily relevant for bare metal, but in the future may be applicable to cloud users who are running an installer UI directly instead of using a front-end such as `cloud.redhat.com` or the UI provided by their cloud vendor.

Signed-off-by: Doug Hellmann <[email protected]>

1 parent 449380a

---
title: connected-assisted-installer
authors:
  - "@avishayt"
  - "@hardys"
  - "@dhellmann"
reviewers:
  - "@beekhof"
  - "@deads2k"
  - "@hexfusion"
  - "@mhrivnak"
approvers:
  - "@crawford"
  - "@abhinavdahiya"
  - "@eparis"
creation-date: 2020-06-09
last-updated: 2020-06-10
status: implementable
see-also:
  - "/enhancements/baremetal/minimise-baremetal-footprint.md"
---

# Assisted Installer for Connected Environments

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement describes changes in and around the installer to assist with deployment on user-provisioned infrastructure. The use cases are primarily relevant for bare metal, but in the future may be applicable to cloud users who are running an installer UI directly instead of using a front-end such as `cloud.redhat.com` or the UI provided by their cloud vendor.

## Motivation

The target user is someone wanting to deploy OpenShift, especially on bare metal, with as few up-front infrastructure dependencies as possible. This person has access to server hardware and wants to run workloads quickly. They do not necessarily have the administrative privileges to create private VLANs, configure DHCP/PXE servers, or manage other aspects of the infrastructure surrounding the hardware where the cluster will run. If they do have the required privileges, they may not want to delegate them to the OpenShift installer for an installer-provisioned infrastructure installation, preferring instead to use their existing tools and processes for some or all of that configuration. They are willing to accept that the cluster they build may not have all of the infrastructure automation features immediately, but that by taking additional steps they will be able to add those features later.

### Goals

- Make initial deployment of usable and supportable clusters simpler.
- Move more infrastructure configuration from day 1 to day 2.
- Support connected on-premise deployments.
- Support existing infrastructure automation features, especially for day 2 cluster management and scale-out.

### Non-Goals

- Because the initial focus is on bare metal, this enhancement does not exhaustively cover variations needed to offer similar features on other platforms (such as changes to image formats, the way a host boots, etc.). It is desirable to support those platforms, but that work will be described separately.
- Environments with restricted networks where hosts cannot reach the internet unimpeded ("disconnected" or "air-gapped") will require more work to support this installation workflow than simply packaging the hosted solution built to support fully connected environments. The work to support disconnected environments will be covered by a future enhancement.
- Replace the existing OpenShift installer.
- Describe how these workflows would work for multi-cluster deployments managed with Hive or ACM.

## Proposal

There are several separate changes to enable the assisted installer workflows, including a GUI front-end for the installer, a cloud-based orchestration service, and changes to the installer and bootstrapping process.

The process starts when the user goes to an "assisted installer" application running on `cloud.redhat.com`, enters details needed by the installer (OpenShift version, ssh keys, proxy settings, etc.), and then downloads a live RHCOS ISO image with the software and settings they need to complete the installation locally.

The user then boots the live ISO on each host they want to be part of the cluster (control plane and workers). They can do this by hand using thumb drives, by attaching the ISO using virtual media support in the BMC of the host, or any other way they choose.

When the ISO boots, it starts an agent that communicates with the REST API for the assisted installer service running on `cloud.redhat.com` to receive instructions. The agent registers the host with the service, using the user's pull secret embedded in the ISO's Ignition config to authenticate. The agent identifies itself based on the serial number from the host it is running on. Communication always flows from agent to service via HTTPS so that firewalls and proxies work as expected.

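
The REST API itself is defined by the assisted installer service and is not specified here. As a rough illustration only, the following Go sketch shows the shape of such a registration call; the endpoint path, payload fields, bearer-token auth scheme, and the DMI path used to read the serial number are all assumptions made for the example, not the real API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// registerHost is a minimal sketch of the agent's first call to the
// assisted installer service. The endpoint, payload, and auth scheme
// shown here are illustrative assumptions.
func registerHost(serviceURL, pullSecretToken string) error {
	// The agent identifies itself by the host serial number, read here
	// from the DMI information the kernel exposes under /sys.
	serial, err := os.ReadFile("/sys/class/dmi/id/product_serial")
	if err != nil {
		return fmt.Errorf("reading serial number: %w", err)
	}

	payload, err := json.Marshal(map[string]string{
		"serial_number": strings.TrimSpace(string(serial)),
	})
	if err != nil {
		return err
	}

	// All traffic is agent-initiated HTTPS so proxies and firewalls
	// behave as they would for any outbound web request.
	req, err := http.NewRequest(http.MethodPost,
		serviceURL+"/api/assisted-install/v1/hosts", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	// The pull secret embedded in the live ISO's Ignition config is
	// used to authenticate the agent to the hosted service.
	req.Header.Set("Authorization", "Bearer "+pullSecretToken)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("registration failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := registerHost("https://cloud.redhat.com", os.Getenv("PULL_SECRET_TOKEN")); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```
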
Each host agent periodically asks the service what tasks to perform, and the service replies with a list of commands and arguments. A command can be to:

1. Return hardware information for its host
2. Return L2 and L3 connectivity information between its host and the other hosts (the IPs and MAC addresses of the other hosts are passed as arguments)
3. Begin the installation of its host (arguments include the host's role, boot device, etc.). The agent executes different installation logic depending on its role (bootstrap-master, master, or worker).

The agent posts the results for the command back to the service. During the actual installation, the agents post progress.

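
To make the poll-and-dispatch pattern concrete, here is a hedged sketch of the agent side of the loop. Only the overall flow (poll for commands, execute, post results) comes from the design above; the step names, payload shape, and helper functions are placeholders invented for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// instruction is an illustrative stand-in for whatever command payload
// the service returns; the real API defines its own schema.
type instruction struct {
	StepType string            // e.g. "inventory", "connectivity-check", "install"
	Args     map[string]string // e.g. other hosts' IPs/MACs, role, boot device
}

// pollLoop sketches the agent side of the protocol: periodically ask
// the service what to do, run each command, and report the result back.
func pollLoop(getNextSteps func() ([]instruction, error),
	postResult func(step string, output string, err error)) {

	for {
		steps, err := getNextSteps() // agent-initiated HTTPS request
		if err != nil {
			time.Sleep(30 * time.Second)
			continue
		}
		for _, step := range steps {
			var out string
			var runErr error
			switch step.StepType {
			case "inventory":
				out, runErr = collectHardwareInventory()
			case "connectivity-check":
				out, runErr = checkConnectivity(step.Args)
			case "install":
				// Role-specific installation logic: bootstrap-master,
				// master, or worker, as chosen by the service.
				out, runErr = runInstaller(step.Args["role"], step.Args["boot_device"])
			default:
				runErr = fmt.Errorf("unknown step %q", step.StepType)
			}
			postResult(step.StepType, out, runErr)
		}
		time.Sleep(60 * time.Second)
	}
}

// The helpers below are placeholders for the real agent logic.
func collectHardwareInventory() (string, error)           { return "{}", nil }
func checkConnectivity(map[string]string) (string, error) { return "{}", nil }
func runInstaller(role, device string) (string, error)    { return "started " + role + " on " + device, nil }

func main() {} // the real agent wires pollLoop to its HTTP client
```
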
As agents report to the assisted installer, their hosts appear in the UI and the user is given an opportunity to examine the hardware details reported and to set the role and cluster of each host.

The assisted installer orchestrates a set of validations on all hosts. It ensures there is full L2 and L3 connectivity between all of the hosts, that the hosts all meet minimum hardware requirements, and that the API and ingress VIPs are on the same machine network.

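
As an illustration of the kind of check involved (not the service's actual validation code), the VIP validation reduces to a CIDR-membership test:

```go
package main

import (
	"fmt"
	"net"
)

// vipsOnMachineNetwork is a simplified sketch of one validation: both
// the API VIP and the ingress VIP must fall inside the machine network
// CIDR shared by the hosts.
func vipsOnMachineNetwork(machineCIDR, apiVIP, ingressVIP string) (bool, error) {
	_, network, err := net.ParseCIDR(machineCIDR)
	if err != nil {
		return false, fmt.Errorf("bad machine network %q: %w", machineCIDR, err)
	}
	for _, vip := range []string{apiVIP, ingressVIP} {
		ip := net.ParseIP(vip)
		if ip == nil {
			return false, fmt.Errorf("bad VIP %q", vip)
		}
		if !network.Contains(ip) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, err := vipsOnMachineNetwork("192.168.111.0/24", "192.168.111.5", "192.168.111.6")
	fmt.Println(ok, err) // true <nil>
}
```
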
The discovered hardware and networking details are combined with the results of the validation to derive defaults for the machine network CIDR, the API VIP, and other network configuration settings for the hosts.

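
For example (a sketch only; the real derivation logic lives in the assisted installer service), the machine network CIDR default can be computed from an interface address and prefix reported in a host's inventory:

```go
package main

import (
	"fmt"
	"net"
)

// machineNetworkFromHost sketches how a machine network CIDR default
// could be derived from a single discovered interface address, e.g.
// "192.168.111.20/24" reported by an agent.
func machineNetworkFromHost(addrWithPrefix string) (string, error) {
	_, network, err := net.ParseCIDR(addrWithPrefix)
	if err != nil {
		return "", err
	}
	// Only the network portion is kept; the host address itself is not needed.
	return network.String(), nil
}

func main() {
	cidr, _ := machineNetworkFromHost("192.168.111.20/24")
	fmt.Println(cidr) // 192.168.111.0/24
}
```
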
When enough hosts are configured, the assisted installer application replies to the agent on each host with the instructions it needs to take part in forming the cluster. The assisted installer application selects one host to run the bootstrap services used during installation, and the other hosts are told to write an RHCOS image to disk and set up ignition to fetch configuration from the machine-config-operator in the usual way.

During installation, progress and error information is reported to the assisted installer application on `cloud.redhat.com` so it can be shown in the UI.

### Integration with Existing Bare Metal Infrastructure Management Tools

Clusters built using the assisted installer workflow use the same "baremetal" platform setting as clusters built with installer-provisioned infrastructure. The cluster runs metal3, without PXE booting support.

BareMetalHosts created by the assisted installer workflow do not have BMC credentials set. This means that power-based fencing is not available for the associated nodes until the user provides the BMC details.

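
Supplying those details later is a day-2 change to the BareMetalHost resources. As a sketch (the `spec.bmc.address` and `spec.bmc.credentialsName` fields follow the metal3 BareMetalHost API; the secret name, BMC address, and workflow shown are illustrative assumptions, and this enhancement does not prescribe how the credentials are collected), the change boils down to a credentials Secret plus a merge patch on the host:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bmcPatch builds the JSON merge patch that points an existing
// BareMetalHost at its BMC, enabling power-based fencing on day 2.
// Field names follow the metal3 BareMetalHost API; the values are
// examples only.
func bmcPatch(bmcAddress, credentialsSecret string) ([]byte, error) {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"bmc": map[string]interface{}{
				"address":         bmcAddress,        // e.g. "redfish://10.0.0.5/redfish/v1/Systems/1"
				"credentialsName": credentialsSecret, // Secret with "username" and "password" keys
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := bmcPatch("redfish://10.0.0.5/redfish/v1/Systems/1", "worker-0-bmc-secret")
	if err != nil {
		panic(err)
	}
	// The payload could be applied with, for example,
	// `oc patch baremetalhost <name> -n openshift-machine-api --type merge -p '<body>'`.
	fmt.Println(string(body))
}
```
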
### User Stories

#### Story 1

As a cluster deployer, I want to install OpenShift on a small set of hosts without having to make configuration changes to my network or obtain administrator access to infrastructure so I can experiment before committing to a full production-quality setup.

#### Story 2

As a cluster deployer, I want to install OpenShift on a large number of hosts using my existing provisioning tools to automate launching the installer so I can adapt my existing admin processes and infrastructure tools instead of replacing them.

#### Story 3

As a cluster deployer, I want to install a production-ready OpenShift cluster without committing to delegating all infrastructure control to the installer or to the cluster, so I can adapt my existing admin processes and infrastructure management tools instead of replacing them.

#### Story 4

As a cluster hardware administrator, I want to enable power control for the hosts that make up my running cluster so I can use features like fencing and failure remediation.

### Implementation Details/Notes/Constraints

Much of the work described by this enhancement already exists as a proof-of-concept implementation. Some aspects will need to change as part of moving from PoC to product. At the very least, the code will need to be moved into a more suitable GitHub org.

The agent discussed in this design is different from the `ironic-python-agent` used by Ironic in the current installer-provisioned infrastructure implementation.

### Risks and Mitigations

The current implementation relies on [minimise-baremetal-footprint](https://github.com/openshift/enhancements/pull/361). If that approach cannot be supported, users can proceed by providing an extra host (4 hosts to build a 3-node cluster, 6 hosts to build a 5-node cluster, etc.).

## Design Details

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?

No need to outline all of the test cases, just the general strategy. Anything that would count as tricky in the implementation and anything particularly challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage expectations).

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal should keep this high-level with a focus on what signals will be looked at to determine graduation.

Consider the following in developing the graduation criteria for this enhancement:
- Maturity levels - `Dev Preview`, `Tech Preview`, `GA`
- Deprecation

Clearly define what graduation means.

#### Examples

These are generalized examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].

##### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers

##### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default

**For non-optional features moving to GA, the graduation criteria must include end to end tests.**

##### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

This work is all about building clusters on day 1. After the cluster is running, it should be possible to upgrade or downgrade it like any other cluster.

### Version Skew Strategy

The assisted installer and agent need to know enough about the installer version to construct its inputs correctly. This is a development-time skew, for the most part, and the service that builds the live ISOs with the assisted installer components should be able to adjust the version of the assisted installer to match the version of OpenShift, if necessary.

## Implementation History

### Proof of Concept (June 2020)

* https://github.com/filanov/bm-inventory : The REST service
* https://github.com/ori-amizur/introspector : Gathers hardware and connectivity info on a host
* https://github.com/oshercc/coreos_installation_iso : Creates the RHCOS ISO; run by bm-inventory as a k8s job
* https://github.com/oshercc/ignition-manifests-and-kubeconfig-generate : Script that generates ignition manifests and kubeconfig; run by bm-inventory as a k8s job
* https://github.com/tsorya/test-infra : Called by openshift-metal3/dev-scripts to end up with a cluster of VMs like in dev-scripts, but using the assisted installer
* https://github.com/eranco74/assisted-installer.git : The actual installer code that runs on the hosts

## Drawbacks

The idea is to find the best form of an argument why this enhancement should _not_ be implemented.

## Alternatives

The telco/edge bare metal team is working on support for automating virtual media and dropping the need for a separate provisioning network. Using the results will still require the user to understand how to tell the installer the BMC type and credentials and to ensure each host has an IP provided by an outside DHCP server. Hardware support for automating virtual media is not consistent between vendors.

## Infrastructure Needed [optional]

The existing code (see "Proof of Concept" above) will need to be moved into an official GitHub organization.
