-
Notifications
You must be signed in to change notification settings - Fork 525
CONVENTIONS: Add taints and tolerations #344
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CONVENTIONS: Add taints and tolerations #344
Conversation
* CONVENTIONS.md: Change "issuses" to "issues".
* CONVENTIONS.md: Add a section on operators with a subsection on taints and tolerations to codify what tolerations may be specified for operators and operands, and under what conditions they may be used.
| DNS node resolver, which adds an entry to the `/etc/hosts` file on all node hosts so | ||
| that the container runtime is able to resolve the name of the cluster image registry; | ||
| absent this entry in `/etc/hosts`, upgrades could fail to pull images of core | ||
| components. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think following what's written here, we need to include the MCO as it's responsible for the overall node lifecycle and thus, the node cannot be properly configured otherwise (kubelet among others) https://bugzilla.redhat.com/show_bug.cgi?id=1780318
|
this is perfect. we may want to pull this into a separate "operator conventions" doc as we grow these. /lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: derekwaynecarr, Miciah The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), | ||
| and so it is incumbent on us to establish some conventions around operators. | ||
|
|
||
| #### Taints and Tolerations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sjenning You worked a lot on tolerations as I recall. ptal
| with explicit justification. | ||
|
|
||
| Note that the DefaultTolerationSeconds and PodTolerationRestriction admission plugins | ||
| may add time-bound tolerations to an operator or operand in addition to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember @sjenning getting worked up over this, but I have no memory of what the details were
@derekwaynecarr did you extract an e2e test to enforce this before merging it? |
|
|
||
| In exceptional cases, an operand may tolerate all taints: | ||
|
|
||
| * if the operand is required to form a functional Kubernetes node, or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for being late on reviewing this PR.
To me this doesn't warrant operand to have blanket tolerations. This means we should be explicit on the tolerations that need to matched for the pod to keep running.
Some of the side effects that we noticed with this blanket toleration are
- node controller wouldn't evict the pods when the node is marked as not ready. We faced this in the past. So allowing allowing blanket tolerations might cause problems in future.
- In future, we'd like to add Windows nodes to OpenShift clusters. The current way to ensure that the Windows workloads lands on Windows nodes and linux workloads lands on linux nodes is via nodeSelector and taints&tolerations. In a way, with this we are allowing operand or operator to tolerate Windows Specific taints and let them land on the Windows nodes which will cause problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @ravisantoshgudimetla. Having an operand tolerate all taints will definitely cause issues on clusters that have enabled Windows workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me this doesn't warrant operand to have blanket tolerations. This means we should be explicit on the tolerations that need to matched for the pod to keep running.
Users can set arbitrary taints, so prohibiting operands from specifying blanket tolerations would require either (a) requiring users to specify operands' tolerations too or (b) restricting the user to some set of predefined taints.
Some of the side effects that we noticed with this blanket toleration are
- node controller wouldn't evict the pods when the node is marked as not ready. We faced this in the past. So allowing allowing blanket tolerations might cause problems in future.
That Bugzilla report seems to be about operators, not about operands that need to run on each node. For example, cluster-dns-operator only needs to run on some node somewhere, but the dns-node-resolver container in its daemonset operand must run any given node in order for that node to be fully functional; does specifying a blanket toleration on that operand cause a problem?
- In future, we'd like to add Windows nodes to OpenShift clusters. The current way to ensure that the Windows workloads lands on Windows nodes and linux workloads lands on linux nodes is via nodeSelector and taints&tolerations. In a way, with this we are allowing operand or operator to tolerate Windows Specific taints and let them land on the Windows nodes which will cause problems.
Why do you need both a node selector and a taint? Why isn't a node selector sufficient? For example, cluster-dns-operator creates a daemonset that tolerates all taints but also specifies a node selector: kubernetes.io/os: linux. Would it help to add a section on node selectors with a statement that an operator or operand that requires a specific operating system must specify an appropriate node selector?
CONVENTIONS: Add taints and tolerations
CONVENTIONS.md: Add a section on operators with a subsection on taints and tolerations to codify what tolerations may be specified for operators and operands, and under what conditions they may be used.CONVENTIONS: Fix typo: "issuses".
CONVENTIONS.md: Change "issuses" to "issues".@derekwaynecarr, this is a follow-up to #335 (comment) to unblock openshift/cluster-dns-operator#171 (for which I am hoping you can cancel your hold in order to get the PR merged today or tomorrow). Does this PR look reasonable?