Deploy Kubernetes-nmstate with openshift #161

---
title: kubernetes-nmstate
authors:
  - "@schseba"
  - "@bcrochet"
reviewers:
  - TBD
approvers:
  - TBD
creation-date: 2019-12-18
last-updated: 2019-12-18
status: implementable
---

# kubernetes-nmstate

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

A proposal to deploy [kubernetes-nmstate](https://github.com/nmstate/kubernetes-nmstate/)
on OpenShift: node networking configuration driven by Kubernetes and executed by
[nmstate](https://nmstate.github.io/).

## Motivation

With hybrid clouds, node networking setup is becoming even more challenging.
Different payloads have different networking requirements, and not everything
can be satisfied as overlays on top of the main interface of the node (e.g.
SR-IOV, L2, other L2). The
[Container Network Interface](https://github.com/containernetworking/cni)
(CNI) standard enables different solutions for connecting networks on the node
with pods. Some of them are
[part of the standard](https://github.com/containernetworking/plugins), and
others extend support for
[Open vSwitch bridges](https://github.com/kubevirt/ovs-cni),
[SR-IOV](https://github.com/hustcat/sriov-cni), and more.

However, in all of these cases, the node must have the networks set up before
the pod is scheduled. Setting up those networks in a dynamic, heterogeneous
cluster with changing networking requirements is a challenge in itself, and
that is what this project addresses.

### Goals

- Dynamic network interface configuration for OpenShift on Day 2 via an API

### Non-Goals

- Replace the SR-IOV operator

Contributor: Thanks for spelling this out as a non-goal :-)

Member: kubernetes-nmstate will touch interfaces only when you explicitly ask
it to. The only SR-IOV-related feature in nmstate is setting the number of VFs,
AFAIK. So unless somebody creates a Policy changing that, there should be no
conflict.

Contributor: IIUC, kubernetes-nmstate will be able to take over control of VF
devices even if the provisioning of VFs is done by the SR-IOV Operator.

Author: If the user wants to create a policy that changes the VF configuration,
we don't block it right now.

## Proposal

A new kubernetes-nmstate operator Deployment is deployed to the cluster as part
of the OpenShift installation, via the CVO.

Member: As a new network operator, or as part of the existing network operator?
Why is this required on every OpenShift cluster, versus just opt-in for clusters
where the advanced capability is required? This document should explain why 100%
of OpenShift clusters require this capability.

Member: I think the core problem here is that, like a number of other things,
nmstate is mostly irrelevant in cloud environments but really useful for bare
metal. Another example is the extension system, where …

Contributor: Agreed that this doesn't seem to be required on every cluster. If
we had a way to say "install this by default on bare metal", I'd recommend
adding it to that list. An operator that's a no-op on cloud clusters doesn't
seem ideal, either. It seems like an OLM-managed operator is the only choice we
have, unless in some future we can have a bare metal CVO profile or something
like that.

The operator has a single CRD, called `NMState`. When a custom resource of kind
`NMState` with the name `nmstate` is created, the operator creates the CRDs for
the kubernetes-nmstate handler; then the namespace, RBAC, and finally the
DaemonSet are applied. Custom resources of kind `NMState` with other names are
ignored.

The `NMState` CR accepts a node filter, allowing one to label nodes to indicate
where the DaemonSet will be deployed.

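For illustration, a minimal `NMState` custom resource might look like the
sketch below; the `apiVersion` and the exact node-selector field name are
assumptions based on the upstream kubernetes-nmstate API and may differ between
releases:

```yaml
apiVersion: nmstate.io/v1beta1   # assumed; the group is nmstate.io, the version may vary
kind: NMState
metadata:
  name: nmstate                  # only this name is acted upon; other names are ignored
spec:
  nodeSelector:                  # optional filter: deploy the handler DaemonSet
    node-role.kubernetes.io/worker: ""   # only to nodes carrying these labels
```
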
A new kubernetes-nmstate handler DaemonSet is deployed in the cluster by the
operator.

Comment: Maybe it is worth mentioning that the operator CR spec supports
filtering which nodes the DaemonSet will be deployed to.

Member: Please explain how this component relates to #344.

This DaemonSet contains the nmstate package and interacts with the
NetworkManager on the host by mounting the related D-Bus socket. The project
contains three Custom Resource Definitions: `NodeNetworkState`,
`NodeNetworkConfigurationPolicy`, and `NodeNetworkConfigurationEnactment`. A
`NodeNetworkState` object is created for each node in the cluster and reports
the available interfaces and network configuration. These objects are created
by kubernetes-nmstate and must not be touched by a user.
`NodeNetworkConfigurationPolicy` objects specify the desired networking state
for a node or set of nodes, using an API similar to `NodeNetworkState`. A
`NodeNetworkConfigurationEnactment` object is created per node, per matching
`NodeNetworkConfigurationPolicy`.

The kubernetes-nmstate DaemonSet creates a `NodeNetworkState` custom resource
for each node and keeps it updated with the network topology of that OpenShift
node.

The user configures host network changes by applying a policy as a
`NodeNetworkConfigurationPolicy` custom resource. The network topology is
configured via the `desiredState` section of the policy, and multiple
`NodeNetworkConfigurationPolicy` custom resources can be created.

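As a sketch of this API (field names follow the upstream nmstate schema; the
`apiVersion` is an assumption), a policy that creates a Linux bridge on all
worker nodes could look like:

```yaml
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br1-eth1-policy
spec:
  nodeSelector:                      # apply only to nodes matching these labels
    node-role.kubernetes.io/worker: ""
  desiredState:                      # declarative nmstate-style network state
    interfaces:
    - name: br1
      type: linux-bridge
      state: up
      bridge:
        port:
        - name: eth1                 # attach eth1 to the new bridge
```
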
Upon receiving a notification event of a `NodeNetworkState` update, the
kubernetes-nmstate daemon verifies the correctness of the `NodeNetworkState`
custom resource and applies the selected profile to the specific node.

A `NodeNetworkConfigurationEnactment` object is a read-only object that
represents Policy execution for each matching Node; it exposes the
configuration status per Node.

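For example, per-node status could surface as conditions on the enactment
object, roughly as follows (a hypothetical excerpt; the exact condition types
and reasons are defined by the upstream API):

```yaml
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationEnactment
metadata:
  name: node01.br1-eth1-policy   # one object per node, per matching policy
status:
  conditions:
  - type: Available              # the policy was applied successfully on this node
    status: "True"
    reason: SuccessfullyConfigured
```
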
A new container image (kubernetes-nmstate-handler) will be created.

A new container image (kubernetes-nmstate-operator) will be created. However,
the operator and the handler will co-exist in the same upstream repo.

The upstream API group `nmstate.io` is currently used.

### User Stories

Contributor: There is a use case where the user doesn't want to assign SR-IOV
VFs to an SR-IOV Pod; instead they'd like to use macvlan on top of VFs. Do you
think kubernetes-nmstate can be used in such a case to manage this kind of VF
once it is created by the SR-IOV Operator? By managing VFs, I mean configuring
network attributes such as VLAN, MTU, etc. on the VF device (which, in this
case, can be considered a host device).

Member: AFAIK nmstate does not support configuring these attributes of a VF.
@EdDev? Not sure whether there is support to configure VF parameters through
NetworkManager.

Comment: A VF will eventually become a regular interface, so you could define
whatever you want on it. If the PF or VF is defined by a different tool (CNI?),
nmstate will "see" it but will consider it "down". You can take control over it
using nmstate and define whatever you want.

Contributor: Thanks for the linked PR for supporting PF/VF capabilities via
nmstate. The current SR-IOV Operator is responsible for VF provisioning
(creating a number of VFs on the host) and manages SR-IOV sub-components such
as the SR-IOV CNI and the SR-IOV Device Plugin, but we are adding VF index
support in the SR-IOV Operator to allow it to manage only a subset of the VF
devices from a PF. Once merged, this will leave the rest of the VFs on the same
PF unmanaged, which is where I see kubernetes-nmstate fitting in to take over
management of the remaining VFs. The SR-IOV CNI managed via the SR-IOV Operator
can configure VF properties such as spoof check, trusted VF, MAC address, link
state, and tx_rate, the same as kubernetes-nmstate, but it is only used when a
VF is requested and attached to a Pod. I think this is a clear dividing line to
keep in mind going forward between the use of kubernetes-nmstate and the SR-IOV
Operator for configuring VF properties: if it's for Pod VF configuration, the
SR-IOV CNI shall be used; if it's for host-level VF configuration that may be
used for other purposes rather than directly in an SR-IOV Pod, then
kubernetes-nmstate shall be used. I have several questions regarding
provisioning VFs (setting the number of VFs): …

Comment: DPDK and 3rd-party interfaces have been discussed as part of OpenStack
requirements. The current implementation uses NetworkManager as the provider to
change these; if NetworkManager supports it, nmstate and kubernetes-nmstate
support it. This seems to fit the discussions we had with OpenStack on
3rd-party interfaces.

Contributor: Thanks for the advice! The reason I'm asking is that the SR-IOV
Operator has implemented VF provisioning functions and GA'd that support in
4.3, and I don't see an equivalent in nmstate at this point. But I see nmstate
as a project to converge the implementation of VF configuration/provisioning in
future releases of OpenShift. Until we get there, there might be issues we need
to pay attention to for the co-existence of VF provisioning functions in both
nmstate and the SR-IOV Operator. One example I can think of: if a user
configures the number of VFs (with different numbers) from both operators,
which both work declaratively, the node enters an infinite loop of being
cordoned (the SR-IOV Operator cordons the node before re-provisioning VFs).
Ack, I think we use different ways to define and manage the VF resource pools
on Kubernetes, but that doesn't stop us from using the same nmstate library for
host-level VF provisioning and configuration. Can we list this as a user story
in support of the customer case?

Author: Done. @zshi-redhat can you please take another look?

Contributor: Hi @SchSeba, the change looks good to me.

Author: You are right, my bad, thanks!

Contributor: @SchSeba @zshi-redhat can you modify the PR to include the
VF-related use case? The discussion above is quite long and involves plenty of
implementation and calendar concerns, so I ended up not understanding the
motivation. Do you want to create the VF? Only set its VLAN/MAC/MTU/IP? Why do
you want to do that rather than use the PF?

Contributor: Meta-criticism - these aren't user stories. User stories look like
this: "As a (person), I'd like to do (high-level goal)." Start with a goal,
then show how your proposed solution fulfills this story: "With (features x, y,
and z), I can accomplish this." It's important that user stories are written
without the proposed solution in mind; ideally they come "first" in the design
process. Their role is to identify real needs, rather than extant features.
That way you can be sure you're designing a solution for problems, not the
other way around. Make sense?

Comment: The format is just a guideline, not a fixed spec of what a user story
is. The needs below are focused on network-specific points, like "be able to
create a bond". It is not the intent here to explain why bonds are needed or
how an admin should use them. We are at a level below that, where the need for
these things should be obvious.

Member: I strongly agree with @squeed. The most important argument this
enhancement needs to make is a clear case for why this capability is required
on all OpenShift clusters by default. Nothing in this enhancement in its
current form explains why this is needed universally in OpenShift rather than
as an advanced optional add-on component delivered via OLM.

#### Bond creation

* As an OpenShift administrator, my customer base will change. Each customer
  will have different needs, network-wise. Most will be able to utilize the
  typical network, but some customers may need more bandwidth than a single
  pipe can provide. In order to satisfy these customers' needs, I would like
  the ability to create a bond on my nodes dynamically, without the need for a
  reboot.

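A sketch of the `desiredState` for such a bond, using field names from the
upstream nmstate schema (older nmstate releases used `slaves` where newer ones
use `port`; treat the exact keys as assumptions to verify against the deployed
nmstate version):

```yaml
desiredState:
  interfaces:
  - name: bond0
    type: bond
    state: up
    link-aggregation:
      mode: active-backup   # or e.g. 802.3ad for LACP aggregation
      port:                 # member interfaces combined into the bond
      - eth1
      - eth2
```
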
#### VLAN creation

* As an OpenShift administrator, my customer base will change. Each customer
  will have different needs, network-wise. Most will be able to utilize the
  typical network, but some customers may need more than one network and desire
  a VLAN setup. In order to satisfy these customers' needs, I would like the
  ability to create a VLAN on top of a node interface, without the need for a
  reboot.

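A sketch of the corresponding `desiredState`, again following the nmstate
schema:

```yaml
desiredState:
  interfaces:
  - name: eth1.100          # conventional <base-interface>.<vlan-id> naming
    type: vlan
    state: up
    vlan:
      base-iface: eth1      # existing node interface to tag on top of
      id: 100               # VLAN tag
```
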
#### Assign IP address

* As an OpenShift administrator, I have a need to create interfaces, such as
  VLAN interfaces, and assign either a static or a dynamic IP address to that
  interface. I would like the ability to configure either a static address or a
  dynamic address without the need for a reboot.

* As an OpenShift administrator, I have the need to create a bridge, add an
  existing interface to the bridge, and move the IP from that interface to the
  bridge, all without having to reboot the node.

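A sketch of static IPv4 addressing on such a VLAN interface; setting
`dhcp: true` (and dropping the `address` list) would cover the dynamic case.
The address shown is an RFC 5737 documentation value:

```yaml
desiredState:
  interfaces:
  - name: eth1.100
    type: vlan
    state: up
    vlan:
      base-iface: eth1
      id: 100
    ipv4:
      enabled: true
      dhcp: false           # set to true for a dynamically assigned address
      address:
      - ip: 192.0.2.10      # example static address
        prefix-length: 24
```
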
#### Create/Update/Remove network routes

* As an OpenShift administrator, I have a need to create, update, and remove
  network routes for specific interfaces; this might include source routing.
  This must be possible without having to reboot the node.

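In the nmstate schema, routes sit next to `interfaces` under `desiredState`. A
sketch (the `table-id` key, for steering routes into a non-default table as
source routing requires, is an assumption to verify against the nmstate version
in use):

```yaml
desiredState:
  routes:
    config:
    - destination: 198.51.100.0/24   # example documentation prefix
      next-hop-address: 192.0.2.1    # gateway for this route
      next-hop-interface: eth1       # interface the route is bound to
      table-id: 254                  # optional; non-default tables enable source routing
```
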
#### Manage/Configure host SR-IOV network device

* As an OpenShift administrator, I would like to be able to change the host
  Virtual Function configuration (for VFs not managed by the sriov-operator),
  such as VLANs, MTU, driver, etc.

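nmstate exposes SR-IOV settings under the `ethernet.sr-iov` key of an
interface. A sketch of provisioning VFs and configuring one of them (the
per-VF fields shown are assumptions based on upstream nmstate documentation and
depend on the NetworkManager version):

```yaml
desiredState:
  interfaces:
  - name: ens1f0            # physical function (PF) on the host
    type: ethernet
    state: up
    ethernet:
      sr-iov:
        total-vfs: 8        # number of VFs to provision on this PF
        vfs:
        - id: 0             # per-VF configuration
          trust: true
          vlan-id: 100
```
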
#### Rollback

* As an OpenShift administrator, if a configuration that I apply is somehow
  invalid, I would like network configuration to be rolled back automatically
  when connectivity to the OpenShift API server is lost after a policy is
  applied. This should happen without my intervention and restore connectivity
  as it was prior to application of the faulty configuration.

### Implementation Details

https://docs.google.com/document/d/1k7_vWtVRbOvTmOOTFx7qRPvYXJh3YyBPvuwh6H-g_jA/edit#heading=h.cdwyj2vhalzy

## Design Details

Contributor: You need to fill this section out :)

Distributed: a handler Pod runs on every labeled node.

### Test Plan

Contributor: And write some tests :-)

- Unit tests (implemented)
- e2e tests of the kubernetes-nmstate handler (implemented)
- e2e tests of the kubernetes-nmstate operator (implemented)

All tests will be automated in CI.

### Graduation Criteria

Initial support for kubernetes-nmstate will be Tech Preview.

#### Tech Preview

- kubernetes-nmstate can be installed via container image
- Host network topology can be configured via CRDs

#### Tech Preview -> GA

Contributor: Questions you need to answer when adding a rich feature such as
this: …

- Record a session, with slides, explaining usage for sharing with CEE.
- Documented under CNV

### Upgrade / Downgrade Strategy

The kubernetes-nmstate operator will handle upgrades. Downgrades are not
specified at the moment.

### Version Skew Strategy

Contributor: This needs to be filled out as well. How will the configuration
change? What are the components affected here? What are the APIs between them?
Are these versioned? This goes far beyond defining a CRD. How tightly is this
coupled to the version of NetworkManager? To the kernel? Will this work on
RHEL7? Will users need to care? What if NetworkManager deprecates a
configuration directive? Adds a new one? How tightly is this coupled to the
advanced CNI plugins, SR-IOV, OVN-Kubernetes, etc.? Remember, component
upgrades are not strictly ordered.

Member: There is a single strong dependency, and that is between
kubernetes-nmstate and the NetworkManager running on the host. The
communication happens through D-Bus. The NetworkManager API is backwards
compatible, which makes kubernetes-nmstate forward compatible: when we release,
we support a certain version of NetworkManager and everything after it. The
nmstate release cycle (not kubernetes-nmstate) is tightly bound to the RHEL and
NetworkManager versions, so the only thing we need to ensure in
kubernetes-nmstate is that we don't push a version newer than the
NetworkManager currently available on RHCOS. There should be no coupling with
the kernel version apart from that done by NetworkManager. This will not work
on RHEL7, since RHEL7 has old versions of NetworkManager, and support for RHEL7
is not on the nmstate roadmap. Users don't need to care; we just need to make
sure to release only nmstate that is compatible with the RHCOS/RHEL supported
by the current version of OpenShift (with the exception of RHEL7). Our operator
will not deploy kubernetes-nmstate on RHEL7 nodes. In case there is a breaking
change on the NetworkManager side, we can tackle it in nmstate or
kubernetes-nmstate with a workaround; however, since backward compatibility is
guaranteed, that would be a NetworkManager bug. It is not coupled with any CNI.
kubernetes-nmstate just controls configuration, making sure to reach the
desired state; it won't intervene in any CNI configuration unless it is asked
to.

## Implementation History

### Version 4.6

Tech Preview

## Infrastructure Needed

Contributor: What testing infrastructure will you need? Special hardware?
Special kernels? Will this have CI coverage, or will it have to be entirely
manual?

This requires a GitHub repo to be created under the openshift org to hold a
clone of kubernetes-nmstate. Any CI system can run the unit tests as-is, and
there is no need for specialized hardware. The e2e tests require multiple
interfaces on the nodes.

Comment: This already exists today, so you're going to need to be more specific
about the problem that you are trying to solve.

Comment: What is providing dynamic (not taking the node down) network interface
configuration after installation finishes?

Comment: The Machine Config Operator (specifically the Machine Config Daemon)
is the thing that applies configuration changes today. From the cluster's
perspective, that's dynamic reconfiguration. It sounds like you are concerned
about the individual node, so you'll have to help me understand why that's
important. If it's a question of performance, then I'd argue that the MCD is
the place to work on those optimizations (that design has started in #159).

Comment: To me, one of the biggest pieces of value this can offer is a
domain-specific API for something that can be pretty complex. Managing it with
direct config files via MachineConfig isn't ideal, so I think it's worth
thinking about what a network config management API might look like. I was
concerned about how this would fit into OCP 4, since it was originally
developed completely disconnected from the MCO, but it seems there are some
ideas forming to help address that.