Anti affinity predicate for pods #9560
Conversation
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") If this message is too spammy, please complain to ixdy.
Do you have use cases for strict anti-affinity other than agent-per-node scheduling? Agent-per-node scheduling can be implemented through a controller, rather than requiring a scheduler change.
As @etune alluded to, we will have a per-node controller for one-pod-on-every-node scenarios (#1518). I assume that kind of controller will still let you specify a label selector, so it will be flexible enough to do one-pod-on-every-node-of-a-subset-of-all-nodes as well. I guess the main advantage of the approach you are suggesting is that you wouldn't need to use a special controller. I think there is an interesting open question in Kubernetes about when you want a new controller vs. a configuration option to an existing controller vs. a configuration option to pods (thus a scheduling feature that will work under any controller). There are tradeoffs among usability, flexibility, duplication of code, etc.
@erictune @davidopp Thank you for the pointer to #1518. We have an implementation of a replication controller for that. I will submit a patch for it soon. Not sure if anyone is already working on it; happy to collaborate.
@ravigadde However, I'm not crystal clear on what the broader implications are of using the current "best-effort spreading" versus the proposed "spreading or bust" behavior. My default position is not to expand the Pod/Scheduler API unless we have a pretty deep understanding of the use cases for the new feature; otherwise it is just clutter, both in the API and in terms of understanding scheduler behaviors. Here is a first pass at a comparison. A deeper one is welcome.

For use case 1 (zookeeper): If the cluster is not too full, it seems like the best-effort spreading behavior and the spreading-or-bust behavior should have the same result. It is only when the cluster is close to full that they differ. So I think we need to lay out what behavior we want when the statically sized cluster is almost full (such that pods would go pending with the spreading-or-bust policy), and also what behavior we want from an autoscaled cluster when adding nodes would be required to ensure all pods schedule and are spread.

For use case 2 (sharded storage): IIUC, there isn't actually a requirement that the pods themselves have to be on different nodes.
For the first use case, I think an anti-affinity predicate would be useful. I don't think we necessarily need to do the fully generic thing we do in Borg, where you can do an attribute limit on any combination of labels, but at least having attr_limit=host=1 seems pretty reasonable, and I don't think it would make the API too confusing. At the very least we need an API extension mechanism so users who want to write their own scheduling policies can pipe options like this all the way through to the scheduler; but my (limited) experience with extensible config languages suggests these turn into a mess.

I'd be in favor of adding the feature suggested here in the limited form (just a "max of one of these pods per host" option). I agree that best-effort spreading works in a not-close-to-full cluster, but I can see the value of making it a guarantee. The feature isn't a perfect match for the second use case, but it seems somewhat more user-friendly than telling the app that it has to fix the layout if it gets unlucky in the best-effort spreading.
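To make the "max of one of these pods per host" idea concrete, here is a minimal, self-contained Go sketch of such a strict predicate. The pod type, the fitsHost helper, and the grouping label "app" are illustrative assumptions, not the PR's actual code.

```go
package main

import "fmt"

type pod struct {
	name   string
	labels map[string]string
}

// fitsHost rejects a node that already runs a pod with the same value
// of the grouping label as the candidate. Unlike best-effort spreading,
// rejection on every node leaves the candidate Pending rather than
// co-locating it.
func fitsHost(candidate pod, groupLabel string, podsOnNode []pod) bool {
	want, ok := candidate.labels[groupLabel]
	if !ok {
		return true // candidate does not carry the grouping label; no limit applies
	}
	for _, p := range podsOnNode {
		if p.labels[groupLabel] == want {
			return false
		}
	}
	return true
}

func main() {
	node := []pod{{name: "zk-1", labels: map[string]string{"app": "zookeeper"}}}
	zk2 := pod{name: "zk-2", labels: map[string]string{"app": "zookeeper"}}
	fmt.Println(fitsHost(zk2, "app", node)) // false: this host already has one of these pods
}
```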
@erictune It seems Docker also has a similar option: affinity:container!=neo4jclusternode*. I remember CoreOS has support for it in their fleet model. Also, this makes implementing agents very easy: all we have to do is add an anti-affinity rule and update the count of replication controller instances based on available minions. Let me know your thoughts.
I would propose we defer debating new features until after 1.0. BTW, working on the critical-for-1.0 features is not limited to Google folks. Please feel free to help with any of the unassigned 1.0 issues. We welcome your PRs!
@erictune @davidopp

@davidopp
Previous discussions of anti-affinity include #367 and #4301 (comment).
If we were going to add conflicts, we should use standard label selection semantics, as I described here:
#4301 (comment)
@bgrant0607
Thanks for the pointers and the feedback. I could switch to using a label selector, but I have one concern with it. With a map, it's possible to specify that if key1 == val1 OR key2 == val2, it's a conflict. A label selector has AND semantics, which wouldn't make this possible.
The OR semantics may be very useful for conflicts. The label selector operators that are likely to be used with conflicts are EqualsOperator, DoubleEqualsOperator, InOperator and ExistsOperator. Other than InOperator, the rest of them are covered by a map, albeit with OR semantics.
Let me know your thoughts on how to proceed.
In the issue I cited, I proposed a list of selectors:
Another alternative model would be a list of label selectors in the Pod to be used for anti-affinity.
While individual selectors wouldn't provide top-level OR, the list would: if a pod's labels matched selector1 OR selector2 OR selector3, then it would be considered a conflict.
I don't recommend this mechanism for anti-affinity for like pods, though it could be used that way.
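As a concrete illustration of the list-of-selectors model described above (hypothetical types, not the real labels package): requirements are ANDed within each selector, and the list is ORed across, which restores the top-level OR that a single selector lacks.

```go
package main

import "fmt"

// selector: key=value requirements that must ALL match (AND within a selector).
type selector map[string]string

func matchesAll(sel selector, podLabels map[string]string) bool {
	for k, v := range sel {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

// conflicts reports whether podLabels matches selector1 OR selector2
// OR ... selectorN from the pod's anti-affinity list.
func conflicts(list []selector, podLabels map[string]string) bool {
	for _, sel := range list {
		if matchesAll(sel, podLabels) {
			return true
		}
	}
	return false
}

func main() {
	// Expresses "key1 == val1 OR key2 == val2 is a conflict".
	list := []selector{{"key1": "val1"}, {"key2": "val2"}}
	fmt.Println(conflicts(list, map[string]string{"key1": "val1"})) // true
	fmt.Println(conflicts(list, map[string]string{"key3": "val3"})) // false
}
```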
@bgrant0607
Makes sense, sorry I missed that. I am confused by the last line in your comment: should I or should I not make the change for anti-affinity using a list of label selectors?
When I tried to use []labels.LabelSelector, I ran into this issue with the generated conversion code:
pkg/api/v1/conversion_generated.go:1535: cannot use make([]labels.LabelSelector, len(in.Conflicts)) (type []labels.LabelSelector) as type [][]labels.Requirement in assignment
Perhaps someone familiar with the code generation can comment on whether slice of slice is ok.
I changed it to [][]labels.Requirement for now and submitted a patch.
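For readers unfamiliar with the Go detail behind that error, here is a self-contained illustration, assuming (as the error message suggests) that LabelSelector is declared as a slice of Requirement; the types below are simplified stand-ins, not the real labels package.

```go
package main

import "fmt"

type Requirement struct {
	Key, Op string
	Values  []string
}

// LabelSelector has the underlying type []Requirement, so a slice of
// selectors has the underlying type [][]Requirement.
type LabelSelector []Requirement

func main() {
	in := []LabelSelector{
		{{Key: "app", Op: "in", Values: []string{"zookeeper"}}},
	}
	// The generated code effectively wants:
	//   var out [][]Requirement = make([]LabelSelector, len(in)) // compile error
	// Go never converts between slice types whose element types differ,
	// even when the elements themselves are convertible; it has to be
	// done element by element instead:
	out := make([][]Requirement, len(in))
	for i := range in {
		out[i] = []Requirement(in[i]) // per-element conversion is fine
	}
	fmt.Println(len(out))
}
```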
Sorry I didn't see this earlier. I'm swamped with 1.0 at the moment, but let me know if you have questions about the comments in the referenced issues and I'll try to get to them when I can.
Assigning to myself since it's gated on an API change.
@ravigadde If you're looking to help, we're currently focused on bug fixes, critical performance improvements, documentation, and usability improvements (e.g., kubectl improvements), in addition to those things in the v1.0 milestone on GitHub.
Force-pushed from 4302625 to 3e6c291.
pkg/api/v1/types.go
See also #341 and #7053. We need a LabelSelector type that has the right json, description, and patch tags.
This will also need description and patch tags.
Insert the field just below NodeSelector.
cc @sdminonne
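A hypothetical sketch of what this request might look like in pkg/api/v1/types.go. The field name PodConflictSelectors, the stub LabelSelector/Requirement types, and the tag wording are assumptions for illustration, not the merged API.

```go
package v1

// LabelSelector is a stand-in for the shared selector type requested
// in #341/#7053: a set of requirements that are ANDed together.
type LabelSelector []Requirement

// Requirement is a stub for a single key/operator/values term.
type Requirement struct {
	Key      string   `json:"key" description:"label key the requirement applies to"`
	Operator string   `json:"operator" description:"relationship between the key and the values"`
	Values   []string `json:"values,omitempty" description:"values to compare the key against"`
}

// PodSpec fragment (other fields elided).
type PodSpec struct {
	NodeSelector map[string]string `json:"nodeSelector,omitempty" description:"selector which must match a node's labels for the pod to be scheduled on that node"`

	// New field, inserted just below NodeSelector as requested, with
	// json, description, and patch tags.
	PodConflictSelectors []LabelSelector `json:"podConflictSelectors,omitempty" patchStrategy:"merge" description:"selectors identifying pods this pod must not share a node with; any match is a conflict"`
}
```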
Changed to using []labels.LabelSelector and squashed the changes. I had to make some changes to the conversion and deep-copy code for slices; I am not sure if there is a better way of handling that. The recursive calls convert []labels.LabelSelector to [][]labels.Requirement, which the compiler doesn't like. Passed the unit tests. Not sure why Shippable fails.
…netes Conflicts: pkg/api/v1/conversion_generated.go pkg/api/v1/types.go pkg/api/v1beta3/conversion.go pkg/api/v1beta3/deep_copy_generated.go pkg/api/v1beta3/types.go pkg/util/set.go
pkg/api/v1/types.go
I would suggest calling this NodeConflictSelectors
We're already using "affinity" to refer to best-effort spreading policies, so I think it's better if we use a new word ("conflict") here. I believe @bgrant0607 was OK with this.
Also, the reason I'm suggesting NodeConflictSelectors instead of ConflictSelectors is that we might eventually want Pods to be able to conflict with Pods that are already running on the node, not just with labels on the Node (in which case we might call that PodConflictSelectors or something like that).
Thanks for the suggestion. This changeset is for the second use case you mentioned, for Pods to be able to conflict with other Pods that are already scheduled to a Node. Should I call it PodConflictSelectors?
Conflicts: pkg/api/v1/types.go
I will wait for #7053 to be merged before refreshing the code; there is overlap in the selector-related changes.
PodConflictSelectors SGTM
This is one piece of the puzzle that kind of overlaps with some of the things we would like to do. However, the changeset in the diff seems pretty conflated: there is cleanup/preference work alongside the feature addition. IMHO they are orthogonal and should be broken up.
Totally agree. I ran into conversion problems with a slice of LabelSelector, which itself happens to be a slice of Requirement. Are you aware of anyone working on a fix for this? The generated code doesn't compile; it says the left and right sides are of incompatible types. IIRC, the conversion code expands the right side to [][]Requirement while the left side is []Selector.
v1beta3 is gone. Please clean that up. @sdminonne re: the label selector changes.
Is the thing you want to specify here []labels.LabelSelector (which is what you have in the PR) or []string, where the string is a key for a label selector? That is, should the semantics be "don't put any two Pods on the same machine if the Pods have the same value of label X" instead of what you currently have ("don't put any two Pods on the same machine if the Pods have value Y for label X")?
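A small, self-contained contrast of the two semantics being asked about (hypothetical helpers, not code from the PR): conflict by key ("same value of label X") versus conflict by explicit selector ("value Y for label X").

```go
package main

import "fmt"

// conflictByKey: []string form. Two pods conflict if they carry the
// same value for any listed key, whatever that value happens to be.
func conflictByKey(keys []string, a, b map[string]string) bool {
	for _, k := range keys {
		av, aok := a[k]
		bv, bok := b[k]
		if aok && bok && av == bv {
			return true
		}
	}
	return false
}

// conflictBySelector: selector form, simplified to exact key=value
// pairs. The conflicting value has to be written into the pod spec.
func conflictBySelector(selectors []map[string]string, other map[string]string) bool {
	for _, sel := range selectors {
		match := len(sel) > 0
		for k, v := range sel {
			if other[k] != v {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	a := map[string]string{"app": "zk", "shard": "1"}
	b := map[string]string{"app": "zk", "shard": "2"}
	fmt.Println(conflictByKey([]string{"app"}, a, b))                      // true, without naming "zk" anywhere
	fmt.Println(conflictBySelector([]map[string]string{{"app": "zk"}}, b)) // true, but "zk" must be spelled out
}
```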
#7053 hasn't progressed in months. I don't think this PR needs to block on it if you want to continue with it. But it should start with the more general style of selector.
@bgrant0607 @davidopp @sdminonne I am almost ready with the changes; I resolved the deep-copy issue with slices of slices in a better way. Should we move the selector to the api package? I see at least one test case that assumes objects are of "versioned" or "built-in" types (validateField in SwaggerSchema):
--- FAIL: TestValidateOk (0.10s)
I was hoping this issue would be addressed in #7053. There is another test failure, TestRoundTripTypes, that I still need to debug.
Continued in #14543
This is useful for cases where two instances of a Pod cannot run on the same node. Unlike ServiceAntiAffinity, this is strict and leaves unschedulable Pods in Pending state. This can also be used to implement agents (a single instance of the agent on every node in the cluster). Will submit a separate pull request for that.
PS: If 1.0 is frozen and this needs to wait, I would still like to get review comments/feedback.
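As a rough illustration of the behavior described above (a toy simulation, not scheduler code): with strict anti-affinity, a surplus replica stays Pending instead of doubling up on a node, which is also what makes the one-agent-per-node pattern work when the replica count matches the node count.

```go
package main

import "fmt"

func main() {
	nodes := []string{"node-a", "node-b"}
	replicas := 3
	placed := map[string]int{} // pods of this group already on each node

	for i := 0; i < replicas; i++ {
		scheduled := false
		for _, n := range nodes {
			if placed[n] == 0 { // strict: at most one per node
				placed[n]++
				fmt.Printf("pod-%d -> %s\n", i, n)
				scheduled = true
				break
			}
		}
		if !scheduled {
			// Best-effort spreading (ServiceAntiAffinity-style) would pick the
			// least-loaded node here; the strict predicate leaves the pod Pending.
			fmt.Printf("pod-%d -> Pending\n", i)
		}
	}
}
```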