enhancements/baremetal-config.md

---
title: Baremetal IPI Config Enhancement
authors:
- "@sadasu"
reviewers:
- "@deads2k"
- "@hardys"
- "@enxebre"
- "@squeed"
- "@abhinavdahiya"

approvers:
- "@deads2k"

creation-date: 2019-10-23
last-updated: 2019-10-23
status: implemented
see-also:
- "https://github.com/openshift/api/pull/480"
replaces:
-
superseded-by:
-
---

# Config required for Baremetal IPI deployments

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift/docs]

## Open Questions [optional]

None.

## Summary

The configuration required to bring up a Baremetal IPI (metal3) cluster needs
to be available in a Config Resource. The installer already adds some
configuration to the Infrastructure Config CR, and it is also aware of the
additional configuration required during the deployment of a metal3 cluster.
This enhancement proposes to extend the BareMetalPlatformStatus CR to hold
these additional values needed for a metal3 deployment.

## Motivation

Adding these additional config items to the BareMetalPlatformStatus CR allows
them to be versioned.

### Goals

The goal of this enhancement is to enable the installer to add vital config
values for a Baremetal IPI deployment to a CR. The MAO will then be able to
access these values when it kicks off a metal3 cluster.

The update lifecycle for bootimages is an unsolved problem for all platforms.
For more information, refer to: https://github.com/openshift/os/issues/381.

### Non-Goals

None of these config values, except RHCOSImageURL, are expected to change
after a metal3 cluster is up. Handling changes to RHCOSImageURL after install
time has not been explored and is a non-goal at this time.
Contributor:
We could mention that the update lifecycle for bootimage is an unsolved problem for all platforms ref openshift/os#381

Contributor Author:
Done.

Contributor:
@imcleod assembled some notes on this issue from a discussion we had a couple months ago as well: https://docs.google.com/document/d/1W_N3pcOuo6E_DcM5GQ731cTTN2VJ1KUa2A49_Zi3IRU/edit

Contributor Author:
@ericavonb & @imcleod is it OK for me to add a link to this doc in my enhancement request?


## Proposal

To support configuration of a Baremetal IPI (metal3) deployment, a set of new
config items needs to be made available via a Config Resource. The
BareMetalPlatformStatus is an ideal fit because it contains configuration
that is specific to a Baremetal deployment.

This proposal aims to add the following to BareMetalPlatformStatus (a sketch of the corresponding Go fields follows the list below):

Contributor:
ProvisioningInterface, ProvisioningDHCPRange, ProvisioningIP, ProvisioningNetworkCIDR seem like the information that best belongs on the baremetal host CR itself, because my assumption was that it described a baremetal-server

Contributor:
They aren't related to the BareMetalHost CR - that describes the hosts to be provisioned, this is an implementation detail of how the provisioning service works

Contributor:
Right. The host CR is the inventory that the installer uses to fulfill a Machine. These settings tell the provisioning system how to do that work, by telling it which network it can use and what IPs it should use on that network.

Member:
This could go either way. For example, you'd want it on the Machine if you wanted the machine-API to be able to support multiple providers (e.g. "make machine A using provider A and make machine B using provider B"). If we don't want to support that (and we don't for other providers like AWS), then having these configured once in a central location like this makes sense. And even if we do support per-machine settings, having a central default makes sense to support folks creating Machines who don't want to care about selecting these things. We have precedent for all of these, like the default AWS region on the Infrastructure object, which can be overridden in more-specific configs as desired, the GCP project ID on the Infrastructure object, which maybe can't be overridden (I dunno), and on libvirt the libvirtd URI is on the machine provider with no cluster-level default. Also, if this is going to be a machine-API-provider-only config, not used by other cluster components, it might be worth putting the cluster default somewhere specific to the machine-API provider, and not in the more generic Infrastructure.

Contributor Author:
@wking Thanks for the comments.
Machine-api-operator does refer to the PlatformStatus CR which contains the BareMetalPlatformStatus.

@enxebre is there a different CR that the machine-api-operator refers to and is more appropriate for this?

Contributor:
> And even if we do support per-machine settings, having a central default makes sense to support folks creating Machines who don't want to care about selecting these things.

I think that is the key issue here, the provisioning toolchain is supposed to be an implementation detail, so we do need some way to set sane defaults, and allow them to be overridden easily by the installer to match local network configuration.

It could be one day that we need custom configuration per machineset or something (I'm thinking a spine/leaf deployment topology), but it's not something currently under consideration, we just need to wire in the defaults from the installer as a first step.

1. ProvisioningInterface : This is the interface name on the underlying
Contributor:
Is this something like talk to this interface if you want to talk to the API, like QEMU URL/amazonaws.com

Also this seems like BareMetalHost Custom Resource should have this information, because i would assume users might want to have different interface on each baremetal-server that is used to talk to provisioning network??

Contributor:
This is the physical nic used for the dedicated network on the nodes where the baremetal operator and Ironic runs. e.g eno1, eth3 or whatever.

This shouldn't be part of the BareMetalHost CRD because that describes the nodes to be provisioned, this is an implementation detail of how the provisioning service runs on the cluster.

Contributor:
@abhinavdahiya perhaps a bit more context would help here:

AFAICS the changes proposed by @sadasu are aligned with the guidance previously provided by @wking, but perhaps you can provide more detailed guidance as to how we can refine this approach to make progress.

We've been trying to add this small amount of additional configuration data for nearly 3 months now, and it seems like conflicting feedback on the various PRs is part of the reason we're making such slow progress - perhaps you and @wking could chat and provide some very specific guidance on how to proceed? Thanks!

Contributor Author:
> Is this something like talk to this interface if you want to talk to the API, like QEMU URL/amazonaws.com
>
> Also this seems like BareMetalHost Custom Resource should have this information, because i would assume users might want to have different interface on each baremetal-server that is used to talk to provisioning network??

@abhinavdahiya That is a good point. But, currently we are operating under the assumption that in a given installation, we would have baremetal servers that will all use an interface of the same name to be physically connected to the provisioning network.
Secondly, the BareMetalHost Custom Resource currently contains information about baremetal hosts known to the BareMetal Operator. We also have an additional set of configs that are used by Ironic, which is a service that provisions baremetal servers and gets them ready for use as hosts and then as nodes. This config cannot be added to the BareMetalHost Custom Resource.
So, instead of creating a brand new Custom Resource, we are augmenting the BareMetalPlatformStatus Custom Resource. Agreed that adding this makes the BareMetalPlatformStatus CR look a lot different from similar CRs used by other clouds. But, in the Baremetal case, we are also responsible for provisioning of the underlying physical servers.

Contributor:
An example value for this might be "eth1" or "ens8" or the name of a bridge (we use "provisioning" in the VM-based development environment for this work).

Contributor:
So the provision network information is for the baremetal-controller so that it knows how to reach the provisioning network on the host it's running to do the provisioning of a new node?

Member:
> So the provision network information is for the baremetal-controller so that it knows how to reach the provisioning network on the host it's running to do the provisioning of a new node?

Wait, why can't this just be a routing config on the controller node? Controller: "I'm trying to reach the provisioning API at https://whatever.example.com", DNS: "whatever.example.com is 1.2.3.4", kernel: "take eth1 to reach 1.2.3.4". Doesn't seem bare-metal-specific to me.

Contributor Author:
A config similar to that might be the IngressIP and it is present in the BareMetalPlatformStatus CR. https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L176.

Contributor:
> So the provision network information is for the baremetal-controller so that it knows how to reach the provisioning network on the host it's running to do the provisioning of a new node?

Not quite, it's so that the baremetal-operator pod can configure a dedicated interface on the node where that pod runs, this happens via https://github.com/openshift/ironic-static-ip-manager which is a bit hacky but works in the absence of an external DHCP server or any other external dependencies.

So to reply to @wking routing doesn't help us here, we have to bring up a nic on the node so we can serve DHCP/TFTP etc for pxe booting the nodes.

In future this may become optional, when we support use of external DHCP servers on the provisioning network, but right now it's required to make pxe boot work.

Contributor (@hardys, Nov 13, 2019):
> Controller: "I'm trying to reach the provisioning API at https://whatever.example.com"

FWIW I think this is a fundamental misunderstanding here - there is no existing provisioning API we have to reach - the metal3 pod deployed by the MAO runs that API, and most of the data under discussion here is just the configuration needed to make that self-hosted API work.

This is very different to the typical "cloud provider" API interaction, which I think is why some of this discussion is difficult without prior context :)

baremetal server which would be physically connected to the provisioning
network.

2. ProvisioningIP : This is the IP address used to bring up a NIC on
Contributor:
hmm, this is a little difficult to understand what exactly this is, I don't completely understand baremetal-server provisioning... a little bit of context here?

Contributor:
This is the IP address that the interface above is configured with - we've got a mechanism that statically configures ProvisioningInterface on whatever master node has the metal3 baremetal-operator pod running, but this is a short lease such that if/when the pod gets rescheduled the IP goes down on the old host, and is then brought back up on the new host.

Contributor:
> This is the IP address that the interface above is configured with - we've got a mechanism that statically configures ProvisioningInterface on whatever master node has the metal3 baremetal-operator pod running, but this is a short lease such that if/when the pod gets rescheduled the IP goes down on the old host, and is then brought back up on the new host.

@sadasu this would be good info to include in the body of the doc. Maybe link to any metal3 docs about it, too?

Member:
> this would be good info to include in the body of the doc. Maybe link to any metal3 docs about it, too?

A number of the "what is this?" threads on this PR could benefit from a link to or local docs about how metal3 bootstrapping/management works. Currently we have bits and pieces of it described in response comments in the PR; would be nice to collect that up and put an explainer in a Background section right before this Proposal section.

Contributor Author:
Creating a background "enhancement" here #102 so that we can point to that from this enhancement request and others that follow. Feel free to comment on that PR too.

As per the actual config items, this enhancement request would be the one where all the details would exist.

the baremetal server for provisioning. This value should not be in the
DHCP range and should not be used in the provisioning network for any
other purpose (it should be a free IP address). It is expected to be
provided in the format : 10.1.0.3.

3. ProvisioningNetworkCIDR : This is the provisioning network in CIDR
notation. The ProvisioningIP and the ProvisioningDHCPRange are expected to
be within this network. The value for this config is expected to be
provided in the format : 10.1.0.0/24.

4. ProvisioningDHCPRange : This is a range of IP addresses from which
Contributor:
Is there a reason we shouldn't just use the MachineCIDR?

Contributor:
This is a completely different network to the MachineCIDR, it's used for baremetal deployment to pxe boot the servers and access their BMCs via IPMI.

In the installer there is a validation to ensure this network does not overlap with MachineCIDR ref openshift/installer#2449

Contributor:
Right, for baremetal we assume at least 2 networks. The MachineCIDR describes the usual network where control plane and application traffic go. This new ProvisioningCIDR describes the network where provisioning traffic goes. That traffic needs to be isolated from the control plane and applications because it involves protocols like DHCP and PXE, and you don't want to mix those with publicly facing traffic, so they are typically on different L2 network segments.

Contributor:
Gotcha, Fun.

baremetal hosts can be assigned their IP address. This range should be
provided as two comma-separated IP addresses where the addresses represent
a contiguous range of IPs that are available to be allocated. It is expected
to be provided in the format : 10.1.0.10, 10.1.0.100.

5. CacheURL : This is a location where the image required to bring up baremetal
Contributor:
So the machine object should have this set, how is this a global configuration.

Contributor:
This is unrelated to the machine object, it's an implementation detail of the baremetal provisioning service.

Member:
might be good to expand on "image" (maybe "bootimage"? Something else), so folks don't read this and assume "container image" like I did ;)

Contributor:
Ah yeah that's a good point, it is where we'll cache the qcow2 bootimage indeed

Contributor:
Suggested change:
- `5. CacheURL : This is a location where image required to bring up baremetal`
+ `5. CacheURL : This is a location where the image required to bring up baremetal`

hosts has been previously downloaded for faster image downloads within the
Contributor:
Suggested change:
- `hosts have been previously downloaded for faster image downloads within the`
+ `hosts has been previously downloaded for faster image downloads within the`

cluster. This is an optional parameter and is expected to be provided
in the format : http://111.111.111.1/images

6. RHCOSImageURL : This config is expected to be the location of the
Contributor:
This most definitely is not a cluster global configuration.

Contributor:
How would you suggest the MAO gets this data instead?

Contributor:
To clarify why this is needed, https://github.com/openshift/ironic-rhcos-downloader is one of the containers we start as part of the ironic deployment.

This downloads the required RHCOS image, and the URL for that is set in the installer via the data/data/rhcos.json - openshift/os#381 discusses updating bootimages in future, but for now we need some way to download the installer-specified version of the image to enable deployment of worker nodes via the BMO.

The installer has that data, so how would you prefer that we pass the URL to the downloader container, which is started via the MAO?

Contributor:
One option I guess would be to use the URL from the machine providerspec for this, but currently that points at the cached URL:

openshift/installer@ac89074

One problem is that at the point where the MAO starts the BMO pod, we don't necessarily have any machines defined.

I guess we could block metal3 deployment on the machine existence, with logic that checks if the cache already contains a specific image, but then somewhere we'd have to rewrite the image used by the deployment to point at the cache (to avoid non-local download for potentially $many machine deployments).

Contributor:
As Steve points out, we do have an image URL in the machine provider spec but we don't want to have to download that image from the internet for every deployment. This setting would allow us to populate a cache, and then have the machine provider spec refer to the cached location.

Contributor:
I think the machine objects should have the link to the RHCOS image as known in rhcos.json.
The provisioning controllers can see that the image already exists in their cache and then configure the server to use the cache URL (hopefully the controller knows that because it created it?) to download the image.

The controller can do the caching when it sees the first machine object.

And as for allowing caching across deployments, users should be free to override the RHCOS URL chosen by the installer and set in the machine object, such that the controller gets the image from a local network hosted link.

Contributor:
Thanks @abhinavdahiya I think we're in agreement re the RHCOSImageURL as per my reply to @sadasu below.

One question/clarification though:

> The provisioning controllers can see that the image already exists in their cache and then configure the server to use the cache URL

So should we add the logic to check the machineconfig URL and compare it to images available in the cache to the MAO (for example by doing a HEAD on the image, or downloading the md5sum we expose there), or do we need to bake that logic into the baremetal-operator?

The latter may be cleaner from an interface perspective, but I'm not sure if it aligns with the upstream vision for the BMO interfaces - @dhellmann do you think it would be reasonable to add logic to the BMO that checks the cache for an image, and rewrites the deployment image if it's found?

official RHCOS release image. It is a required parameter and will be used
to deploy new nodes in the cluster.
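
The following is a minimal sketch, in Go, of what the extended type could look like. The field names, JSON tags, and package placement are illustrative assumptions only; the authoritative definition is the one proposed in https://github.com/openshift/api/pull/480, and existing fields such as APIServerInternalIP and IngressIP are omitted for brevity.

```go
package v1

// BareMetalPlatformStatus (sketch): the new provisioning-related fields this
// enhancement proposes to add. Names and tags are illustrative, not authoritative.
type BareMetalPlatformStatus struct {
	// ProvisioningInterface is the NIC name on the underlying baremetal
	// servers that is physically connected to the provisioning network,
	// e.g. "eth1", "ens8", or a bridge name such as "provisioning".
	ProvisioningInterface string `json:"provisioningInterface,omitempty"`

	// ProvisioningIP is a free address (outside the DHCP range) used to bring
	// up that NIC on whichever node runs the metal3 pod, e.g. "10.1.0.3".
	ProvisioningIP string `json:"provisioningIP,omitempty"`

	// ProvisioningNetworkCIDR is the provisioning network in CIDR notation,
	// e.g. "10.1.0.0/24"; ProvisioningIP and ProvisioningDHCPRange must fall
	// inside it.
	ProvisioningNetworkCIDR string `json:"provisioningNetworkCIDR,omitempty"`

	// ProvisioningDHCPRange is a contiguous range expressed as two
	// comma-separated addresses, e.g. "10.1.0.10, 10.1.0.100".
	ProvisioningDHCPRange string `json:"provisioningDHCPRange,omitempty"`

	// CacheURL optionally points at a pre-populated bootimage cache,
	// e.g. "http://111.111.111.1/images".
	CacheURL string `json:"cacheURL,omitempty"`

	// RHCOSImageURL is the location of the official RHCOS release image used
	// to deploy new nodes; required.
	RHCOSImageURL string `json:"rhcosImageURL"`
}
```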

### User Stories [optional]

This enhancement allows users to customize their Baremetal IPI deployments
to their specific hardware and network requirements.
Contributor:
It also allows IPI baremetal deployments without manual workarounds such as creating the required ConfigMap object manually (which is what we're currently doing for testing).

Contributor:
The user story doesn't actually provide the components that will be using these fields.

Since this proposal is adding these new fields to the global platform configs, I would like to make sure whether this is actually infrastructure-level information or whether it belongs to a specific operator.

Contributor Author:
Augmented the User Stories section to take care of the above 2 comments.


This also allows IPI baremetal deployments to take place without manual
workarounds like creating a ConfigMap for the config (which is the
current approach we are using to test baremetal deployments).

The new config items will be set by the installer and used by
Contributor:
Since the only use-case is some machine-api-operator component, I am inclined towards making these fields available to just that component and not add these to the global configuration.

Contributor:
The additions are specific to the baremetal platform in openshift/api#480 - and this data is required for all deployments when the baremetal platform is selected.

Can you suggest how we'd make this available only to a component? I'm not entirely clear what you are proposing, and how it differs from using the BareMetalPlatformStatus?

Member:
> Can you suggest how we'd make this available only to a component?

E.g. ingress-specific configuration or the registry-specific config or other operator.openshift.io stuff.

> ... how it differs from using the BareMetalPlatformStatus?

Mostly in scoping. It's easier to reason about the impact of config changes if you can scope read access down more tightly. If the MAO is the only component with creds to read a metal config object, you don't have to worry about other operators or cluster components sneaking in to borrow your settings and give you later trouble when you need to migrate to a different schema. But if multiple components need access, a central location like Infrastructure is DRYer for cluster admins.

the machine-api-operator (MAO) to generate more config items that are
derivable from these basic parameters. Put together, these config
parameters are passed in as environment variables to the various containers
that make up a metal3 baremetal IPI deployment.
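
As an illustration of that flow, the sketch below shows how the basic parameters could be handed to the metal3 containers as environment variables. The package, the local struct, the helper function, and the environment variable names are all hypothetical; the actual mapping and any derived values live in the machine-api-operator.

```go
package metal3config

import corev1 "k8s.io/api/core/v1"

// provisioningConfig is a local stand-in for the subset of
// BareMetalPlatformStatus fields the MAO consumes; it is defined here only to
// keep the sketch self-contained.
type provisioningConfig struct {
	ProvisioningInterface   string
	ProvisioningIP          string
	ProvisioningNetworkCIDR string
	ProvisioningDHCPRange   string
	CacheURL                string
	RHCOSImageURL           string
}

// metal3EnvVars illustrates passing the basic parameters to the metal3
// containers as environment variables. The variable names are illustrative,
// not the ones the MAO actually uses.
func metal3EnvVars(c provisioningConfig) []corev1.EnvVar {
	return []corev1.EnvVar{
		{Name: "PROVISIONING_INTERFACE", Value: c.ProvisioningInterface},
		{Name: "PROVISIONING_IP", Value: c.ProvisioningIP},
		{Name: "PROVISIONING_NETWORK_CIDR", Value: c.ProvisioningNetworkCIDR},
		{Name: "DHCP_RANGE", Value: c.ProvisioningDHCPRange},
		{Name: "CACHE_URL", Value: c.CacheURL},
		{Name: "RHCOS_IMAGE_URL", Value: c.RHCOSImageURL},
		// Derived settings (for example a PXE boot URL computed from the
		// ProvisioningIP) would be appended here as well.
	}
}
```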

## Design Details

### Test Plan

The test plan should involve making sure the openshift/installer generates
all configuration items within the BareMetalPlatformStatus when the platform
type is Baremetal.

The MAO reads this configuration and uses it to derive the additional
configuration required to bring up a metal3 cluster. E2e testing should make
sure that the MAO

What type of negative testing do you plan on doing? Malformed strings, IP addresses/ranges overlapping, not in the expected ranges, poorly formed IP addresses, images not where they are specified, slow or interrupted downloads. Is this going to be tested and within scope of this enhancement testing?

Contributor:
> What type of negative testing do you plan on doing? Malformed strings, IP addresses/ranges overlapping, not in the expected ranges, poorly formed IP addresses, images not where they are specified, slow or interrupted downloads. Is this going to be tested and within scope of this enhancement testing?

All of these values will be provided by the installer ref openshift/installer#2449 - that does have unit tests for the installer level validation of those values, and I'd assume we'll have similar validation and unit tests in the MAO @sadasu ?

For e2e testing we've got a 3rd party metal3 CI job which can be used to validate these changes, but not yet fully integrated openshift e2e CI for this new platform - @derekhiggins is working on that.

Contributor:
Also note that the only way to use this platform at present is to manually create a ConfigMap resource, e.g a completely unconstrained interface, so IMO having the validation at the installer layer will be a big improvement.

is able to bring up a metal3 cluster using config from this enhanced
BareMetalPlatformStatus CR.
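
For the negative cases raised in the review (malformed addresses, values outside the expected ranges), a unit-level check along the following lines could be exercised. This is a hedged sketch only, reusing the hypothetical provisioningConfig type from the earlier sketch; the installer's actual validation lives in openshift/installer#2449.

```go
package metal3config

import (
	"fmt"
	"net"
	"strings"
)

// validateProvisioningConfig is an illustrative example of the relationships a
// unit test could exercise: the CIDR and all addresses must parse, and the
// ProvisioningIP must fall inside the ProvisioningNetworkCIDR.
func validateProvisioningConfig(c provisioningConfig) error {
	ip := net.ParseIP(c.ProvisioningIP)
	if ip == nil {
		return fmt.Errorf("invalid ProvisioningIP %q", c.ProvisioningIP)
	}
	_, cidr, err := net.ParseCIDR(c.ProvisioningNetworkCIDR)
	if err != nil {
		return fmt.Errorf("invalid ProvisioningNetworkCIDR %q: %v", c.ProvisioningNetworkCIDR, err)
	}
	if !cidr.Contains(ip) {
		return fmt.Errorf("ProvisioningIP %s is not within %s", c.ProvisioningIP, c.ProvisioningNetworkCIDR)
	}
	bounds := strings.Split(c.ProvisioningDHCPRange, ",")
	if len(bounds) != 2 {
		return fmt.Errorf("ProvisioningDHCPRange %q must be two comma-separated addresses", c.ProvisioningDHCPRange)
	}
	for _, b := range bounds {
		if net.ParseIP(strings.TrimSpace(b)) == nil {
			return fmt.Errorf("invalid address %q in ProvisioningDHCPRange", b)
		}
	}
	return nil
}
```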

### Upgrade / Downgrade Strategy

The Baremetal Platform type will be available for customers to use for the
first time in OpenShift 4.3, and when it is installed it will always start as
a fresh baremetal installation, at least in 4.3. There is no use case where a
4.2 installation would be upgraded to a 4.3 installation with Baremetal
Platform support enabled.

For the above reason, we would not have to make additional provisions for an
upgrade strategy in this particular case.

The Baremetal platform is experimental in 4.2, but in the unlikely event that
a user upgrades from 4.2 to 4.3 and also sets the Platform type to Baremetal
for the first time during the upgrade, our upgrade strategy would be to
support obtaining the config from both the ConfigMap and the
BareMetalPlatformStatus Config Resource. In 4.3, the MAO would be implemented
to first look for the baremetal config in the CR and, if that fails, it would
try to read it from the ConfigMap.
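
A minimal sketch of that lookup order, again using the hypothetical provisioningConfig type from the earlier sketches; both loader functions are injected as parameters purely to keep the example self-contained and do not correspond to real MAO functions.

```go
package metal3config

// configFromCROrConfigMap illustrates the 4.2 -> 4.3 compatibility behaviour
// described above: prefer the values carried by the BareMetalPlatformStatus
// CR, and fall back to the legacy ConfigMap only when the CR does not provide
// them. Both loaders are hypothetical.
func configFromCROrConfigMap(
	fromPlatformStatus func() (*provisioningConfig, error),
	fromConfigMap func() (*provisioningConfig, error),
) (*provisioningConfig, error) {
	if cfg, err := fromPlatformStatus(); err == nil && cfg != nil {
		return cfg, nil
	}
	return fromConfigMap()
}
```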