Conversation

@ravigadde
Contributor

This is useful for cases where two instances of a Pod cannot run on the same node. Unlike ServiceAntiAffinity, this is strict and leaves unschedulable Pods in Pending state. This can also be used to implement agents (a single instance of the agent on every node in the cluster). Will submit a separate pull request for that.
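As a rough illustration of the intended behavior (simplified, hypothetical types - not the actual change in this PR): the predicate rejects a node if any pod already bound to it matches one of the new pod's conflict label sets, so a pod that fits no node stays Pending.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real API types.
type Pod struct {
	Labels    map[string]string
	Conflicts []map[string]string // label sets that must not already be present on the node
}

// matches reports whether every key/value pair in sel appears in labels.
func matches(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// fitsNode returns false if the candidate pod conflicts with any pod already on the node.
func fitsNode(candidate Pod, podsOnNode []Pod) bool {
	for _, sel := range candidate.Conflicts {
		for _, existing := range podsOnNode {
			if matches(sel, existing.Labels) {
				return false
			}
		}
	}
	return true
}

func main() {
	running := Pod{Labels: map[string]string{"app": "zookeeper"}}
	candidate := Pod{
		Labels:    map[string]string{"app": "zookeeper"},
		Conflicts: []map[string]string{{"app": "zookeeper"}},
	}
	fmt.Println(fitsNode(candidate, []Pod{running})) // false: the node already runs a zookeeper pod
}
```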

PS: If 1.0 is frozen and this needs to wait, would still like to get review comments/feedback.

@k8s-bot

k8s-bot commented Jun 10, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@erictune
Contributor

Do you have use cases for strict anti-affinity other than agent-per-node scheduling?
Other than that case, it's not clear to me when being Pending is better than co-scheduling.

Agent-per-node scheduling can be implemented through a controller, rather than requiring a scheduler change.

@davidopp
Contributor

This can also be used to implement agents (a single instance of the agent on every node in the cluster).

As @erictune alluded to, we will have a per-node controller for one-pod-on-every-node scenarios (#1518). I assume that kind of controller will still let you specify a label selector, so it will be flexible enough to do one-pod-on-every-node-of-a-subset-of-all-nodes as well. I guess the main advantage of the approach you are suggesting is that you wouldn't need to use a special controller. I think there is an interesting open question in Kubernetes about when you want a new controller vs. a configuration option to an existing controller vs. a configuration option on pods (thus a scheduling feature that will work under any controller). There are tradeoffs among usability, flexibility, duplication of code, etc.

@ravigadde
Contributor Author

@erictune
Yes, there are other common use cases.

  1. A 3-pod ZooKeeper or Cassandra cluster - don't want two of those pods to land on the same host.
  2. Mongo with multiple shards is more complicated - don't want any two primaries on the same node, don't want a primary and a secondary from the same shard on the same node, and don't want two secondaries from the same shard on the same node (see the sketch below).
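For illustration only, here is one way the Mongo rules might be encoded as per-pod conflict label sets (hypothetical field and labels, not this PR's final API):

```go
package main

import "fmt"

type conflictSet map[string]string

func main() {
	// A secondary in shard rs0 must not share a node with any other member of rs0
	// (this covers both the primary/secondary and the secondary/secondary rule).
	secondaryConflicts := []conflictSet{
		{"app": "mongo", "shard": "rs0"},
	}

	// A primary additionally must not share a node with any other primary.
	primaryConflicts := []conflictSet{
		{"app": "mongo", "shard": "rs0"},
		{"app": "mongo", "role": "primary"},
	}

	fmt.Println(secondaryConflicts, primaryConflicts)
}
```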

@davidopp
The replication controller cannot guarantee the desired behavior because it can only suggest spreading placement using labels. There are good use cases that need these guarantees from the scheduler.

Thank you for the pointer to #1518. We have an implementation of a replication controller for that and I will submit a patch for it soon. Not sure if anyone is already working on it; happy to collaborate.

@erictune
Contributor

@ravigadde
Thanks for the use cases. I agree with your statement of the use cases.

However, I'm not crystal clear on what the broader implications are of using the current "best-effort spreading" versus the proposed "spreading or bust" behavior. My default position is not to expand the Pod/Scheduler API unless we have a pretty deep understanding of the use cases for the new feature - otherwise it is just clutter, both in the API and in terms of understanding scheduler behaviors.

Here is a first pass at a comparison. A deeper one is welcome.

For use case 1 (zookeeper)

If the cluster is not too full, the best-effort spreading behavior and the spreading-or-bust behavior should have the same result; it is only when the cluster is close to full that they differ. So I think we need to lay out what behavior we want when a statically sized cluster is almost full (such that pods would go pending with the spreading-or-bust policy), and also what behavior we want from an autoscaled cluster when adding nodes is required to ensure all pods schedule and are spread.

For use case 2 (sharded storage)

IIUC, there isn't actually a requirement that the pods themselves have to be on different nodes.
The requirement is that replica chunks should not be on the same node.
MongoDB already handles some migration of chunks. Is it better to have the scheduler leave mongo pods pending (spreading or bust), or for the scheduler to do best-effort spreading and let the mongo balancer try to rebalance chunks to minimize risk?

@davidopp
Contributor

For the first use case, I think an anti-affinity predicate would be useful. I don't think we necessarily need to do the fully generic thing we do in Borg, where you can do an attribute limit on any combination of labels, but at least having attr_limit=host=1 seems pretty reasonable and I don't think it would make the API too confusing. At the very least we need an API extension mechanism so users who want to write their own scheduling policies can pipe options like this all the way through to the scheduler. But my (limited) experience with extensible config languages suggests these turn into a mess. I'd be in favor of adding the feature suggested here in the limited form (just a "max of one of these pods per host" option).

I agree that best-effort spreading works in a not-close-to-full cluster, but I can see the value of making it a guarantee.

The feature isn't a perfect match for the second use case but it seems somewhat more user-friendly than telling the app that it has to fix the layout if it gets unlucky in the best-effort spreading.

@chakri-nelluri
Contributor

@erictune
Hi Eric, IIUC the default spreading scheduler takes anti-affinity into account in its scheduling but makes no guarantees; we cannot strictly guarantee that no two instances run on one node.

It seems Docker also has a similar option: affinity:container!=neo4jclusternode*. I remember CoreOS has support for this in their fleet model.

Also, this makes implementing agents very easy: all we have to do is add an anti-affinity rule and set the replication controller's replica count based on the number of available minions.

Let me know your thoughts.

@davidopp
Contributor

I would propose we defer debating new features until after 1.0.

BTW working on the critical-for-1.0 features is not limited to Google folks. Please feel free to help with any of the unassigned 1.0 issues. We welcome your PRs!

https://github.com/googlecloudplatform/kubernetes/issues?q=is%3Aopen+is%3Aissue+milestone%3Av1.0+no%3Aassignee

@ravigadde
Contributor Author

@erictune
A couple of things I would like to point out:

  1. These apps handle their own replication, so a replication controller is not necessary to run them. In fact, assuming they use local storage for performance, they can't be moved on a node failure - which is perfectly fine because it's a clustered app.
  2. The replication controller doesn't offer any guarantee that two replicas won't land on the same node; it leaves that up to the spreading algorithm and is best effort.

@davidopp
Is attr_limit=host=1 on the image name? It is limiting if it's not tied to a label; it should be possible for ZooKeeper pods from two different ZooKeeper clusters to run on the same node.

@ravigadde
Contributor Author

@davidopp
I heard there is some WIP for some of these features (agents). We would like to contribute to/influence those efforts. Agree with the urgency for 1.0 though, so this discussion can wait or happen at a slower pace than normal. Will look through the bug list and see where I can help.

@bgrant0607
Member

Previous discussions of anti-affinity include #367 and #4301 (comment)

Member

If we were going to add conflicts, we should use standard label selection semantics, as I described here:
#4301 (comment)

Contributor Author

@bgrant0607
Thanks for the pointers and the feedback. I could switch it to using a label selector, but I have one concern with that. With a map, it's possible to specify that if key1 == val1 OR key2 == val2, it's a conflict. A label selector has AND semantics, which wouldn't make this possible.

The OR semantics may be very useful for conflicts. The label selector operators likely to be used with conflicts are EqualsOperator, DoubleEqualsOperator, InOperator, and ExistsOperator. Other than InOperator, all of them are covered by a map, albeit with OR semantics.

Let me know your thoughts on how to proceed.

Member

In the issue I cited, I proposed a list of selectors:

Another alternative model would be a list of label selectors in the Pod to be used for anti-affinity.

While individual selectors wouldn't provide top-level OR, the list would: if a pod's labels matched selector1 OR selector2 OR selector3, then it would be considered a conflict.

I don't recommend this mechanism for anti-affinity for like pods, though it could be used that way.
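For concreteness, a rough self-contained sketch of that OR-of-ANDs evaluation (simplified types, not the actual labels package): each selector is a conjunction of requirements, and the pod conflicts if its labels satisfy any selector in the list.

```go
package main

import "fmt"

type requirement struct {
	key, op, value string // op is "=" or "exists"
}

// A selector is a conjunction of requirements (AND).
type selector []requirement

func (s selector) matches(labels map[string]string) bool {
	for _, r := range s {
		v, ok := labels[r.key]
		switch r.op {
		case "exists":
			if !ok {
				return false
			}
		default: // "="
			if !ok || v != r.value {
				return false
			}
		}
	}
	return true
}

// conflicts returns true if any selector in the list matches (OR across the list).
func conflicts(selectors []selector, podLabels map[string]string) bool {
	for _, s := range selectors {
		if s.matches(podLabels) {
			return true
		}
	}
	return false
}

func main() {
	antiAffinity := []selector{
		{{key: "app", op: "=", value: "zookeeper"}, {key: "cluster", op: "=", value: "zk-1"}},
		{{key: "role", op: "=", value: "primary"}},
	}
	fmt.Println(conflicts(antiAffinity, map[string]string{"app": "zookeeper", "cluster": "zk-1"})) // true
	fmt.Println(conflicts(antiAffinity, map[string]string{"app": "zookeeper", "cluster": "zk-2"})) // false
}
```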

Contributor Author

@bgrant0607
Makes sense, sorry I missed that. I am confused by the last line in your comment: should I or should I not make the change for anti-affinity using a list of label selectors?

Contributor Author

@bgrant0607

When I tried to use []labels.LabelSelector, I ran into this issue with the generated conversion code.

pkg/api/v1/conversion_generated.go:1535: cannot use make([]labels.LabelSelector, len(in.Conflicts)) (type []labels.LabelSelector) as type [][]labels.Requirement in assignment

Perhaps someone familiar with the code generation can comment on whether slice of slice is ok.
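For reference, the mismatch can be reproduced in isolation (placeholder types below, not the actual generated code): a named slice type is not identical to its underlying slice type, so a []LabelSelector value cannot be assigned where [][]Requirement is expected without converting element by element.

```go
package main

type Requirement struct{ Key, Value string }

// LabelSelector is a named slice of requirements, mirroring the error above.
type LabelSelector []Requirement

func main() {
	var selectors []LabelSelector
	var raw [][]Requirement

	// raw = selectors // does not compile: cannot use selectors (type []LabelSelector) as type [][]Requirement

	// An explicit per-element conversion works instead:
	raw = make([][]Requirement, len(selectors))
	for i := range selectors {
		raw[i] = []Requirement(selectors[i])
	}
	_ = raw
}
```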

I changed it to [][]labels.Requirement for now and submitted a patch.

@bgrant0607 bgrant0607 added this to the v1.0-post milestone Jun 25, 2015
@bgrant0607
Member

Sorry I didn't see this earlier. I'm swamped with 1.0 at the moment, but let me know if you have questions about the comments in the referenced issues and I'll try to get to them when I can.

@bgrant0607
Member

Assigning to myself since it's gated on an API change.

@bgrant0607 bgrant0607 assigned bgrant0607 and unassigned davidopp Jun 25, 2015
@bgrant0607
Member

@ravigadde If you're looking to help, we're currently focused on bug fixes, critical performance improvements, documentation, and usability improvements (e.g., kubectl improvements), in addition to those things in the v1.0 milestone on github.

ravigadde referenced this pull request in rjnagal/kubernetes Jun 25, 2015
@ravigadde ravigadde force-pushed the master branch 2 times, most recently from 4302625 to 3e6c291 on June 26, 2015 08:38
Member

See also #341 and #7053. We need a LabelSelector type that has the right json, description, and patch tags.
This will also need description and patch tags.
Insert the field just below NodeSelector.
cc @sdminonne

@ravigadde
Contributor Author

@bgrant0607

Changed to using []labels.LabelSelector and squashed the changes. I had to make some changes to the conversion and deep-copy code for slices; I am not sure if there is a better way of handling that. The recursive calls convert []labels.LabelSelector to [][]labels.Requirement, which the compiler doesn't like.

Passed the unit tests. Not sure why Shippable fails.

…netes

Conflicts:
	pkg/api/v1/conversion_generated.go
	pkg/api/v1/types.go
	pkg/api/v1beta3/conversion.go
	pkg/api/v1beta3/deep_copy_generated.go
	pkg/api/v1beta3/types.go
	pkg/util/set.go
Contributor

I would suggest calling this NodeConflictSelectors

We're already using "affinity" to refer to best-effort spreading policies, so I think it's better if we use a new word ("conflict") here. I believe @bgrant0607 was OK with this.

Also, the reason I'm suggesting NodeConflictSelectors instead of ConflictSelectors is that we might eventually want Pods to be able to conflict with Pods that are already running on the node, not just with labels on the Node (in which case we might call that PodConflictSelectors or something like that).

Contributor Author

@davidopp

Thanks for the suggestion. This changeset is for the second use case you mentioned, for Pods to be able to conflict with other Pods that are already scheduled to a Node. Should I call it PodConflictSelectors?

Conflicts:
	pkg/api/v1/types.go
@k8s-bot

k8s-bot commented Jul 27, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@ravigadde
Contributor Author

I will wait for #7053 to be merged before refreshing the code; there is overlap in the selector-related changes.

Member

PodConflictSelectors SGTM

@timothysc
Contributor

This is one piece of the puzzle that overlaps somewhat with some of the things we would like to do. However, the changeset in the diff seems pretty conflated.

There is cleanup/preference work alongside the feature addition. IMHO they are orthogonal and should be broken up.

@ravigadde
Contributor Author

@timothysc

Totally agree. I ran into conversion problems with a slice of LabelSelector, which itself happens to be a slice of Requirement. Are you aware of anyone working on a fix for this? The generated code doesn't compile, as it says the left and right sides are of incompatible types. IIRC, the conversion code expands the right side to [][]Requirement while the left side is []LabelSelector.

@bgrant0607
Member

v1beta3 is gone. Please clean that up.

@sdminonne re. the label selector changes.

@davidopp
Contributor

davidopp commented Aug 3, 2015

Is the thing you want to specify here []labels.LabelSelector (which is what you have in the PR) or []string where the string is a key for a label selector? That is, should the semantics be "don't put any two Pods on the same machine if the Pods have the same value of label X" instead of what you currently have ("don't put any two Pods on the same machine if the Pods have value Y for label X")?
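To make the contrast concrete, a rough sketch of the two semantics (simplified helpers with hypothetical names):

```go
package main

import "fmt"

// selectorConflict: "don't co-locate if the existing pod has value Y for label X"
// - the selector pins both the key and the value.
func selectorConflict(sel, existing map[string]string) bool {
	for k, v := range sel {
		if existing[k] != v {
			return false
		}
	}
	return true
}

// keyConflict: "don't co-locate if the two pods have the same value of label X"
// - only the key is given; any shared value counts as a conflict.
func keyConflict(keys []string, candidate, existing map[string]string) bool {
	for _, k := range keys {
		cv, cok := candidate[k]
		ev, eok := existing[k]
		if cok && eok && cv == ev {
			return true
		}
	}
	return false
}

func main() {
	a := map[string]string{"shard": "rs0"}
	b := map[string]string{"shard": "rs0"}
	fmt.Println(selectorConflict(map[string]string{"shard": "rs0"}, b)) // true: the value is pinned to "rs0"
	fmt.Println(keyConflict([]string{"shard"}, a, b))                   // true: the values match, whatever they are
}
```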

@ravigadde
Contributor Author

@davidopp It is the second semantic (labels with the same value conflict). I would like to specify it as []labels.LabelSelector in the spec; []string is more abstract and may not make it clear to the user what to specify.

Will clean up the diff after #7053 is merged.

@k8s-github-robot k8s-github-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 27, 2015
@k8s-github-robot

Labelling this PR as size/XXL

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 1, 2015
@k8s-github-robot

Labelling this PR as size/L

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 9, 2015
@bgrant0607
Member

#7053 hasn't progressed in months. I don't think this PR needs to block on it if you want to continue with it. But it should start with the more general style of selector.

@ravigadde
Contributor Author

@bgrant0607 @davidopp @sdminonne

I am almost ready with the changes; I resolved the deep-copy issue with slices of slices in a better way. Should we move the selector to the api package? I see at least one test case that assumes objects are of "versioned" or "built-in" types (validateField in SwaggerSchema).

--- FAIL: TestValidateOk (0.10s)
schema_test.go:80: unexpected error: unexpected type: labels.LabelSelector

I was hoping this issue would be addressed in #7053. There is another test failure, TestRoundTripTypes, that I still need to debug.

@ravigadde
Contributor Author

Continued in #14543

@ravigadde ravigadde closed this Sep 25, 2015