Conversation

@ravigadde
Contributor

This is useful for cases where two instances of a Pod cannot run on the same node. Unlike ServiceAntiAffinity, this is strict and leaves unschedulable Pods in Pending state. This can also be used to implement agents (a single instance of the agent on every node in the cluster). Will submit a separate pull request for that.
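As a rough illustration of the intended behavior (simplified, hypothetical types - not the actual change in this PR): the predicate rejects a node if any pod already bound to it matches one of the new pod's conflict label sets, so a pod that fits no node stays Pending.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real API types.
type Pod struct {
	Labels    map[string]string
	Conflicts []map[string]string // label sets that must not already be present on the node
}

// matches reports whether every key/value pair in sel appears in labels.
func matches(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// fitsNode returns false if the candidate pod conflicts with any pod already on the node.
func fitsNode(candidate Pod, podsOnNode []Pod) bool {
	for _, sel := range candidate.Conflicts {
		for _, existing := range podsOnNode {
			if matches(sel, existing.Labels) {
				return false
			}
		}
	}
	return true
}

func main() {
	running := Pod{Labels: map[string]string{"app": "zookeeper"}}
	candidate := Pod{
		Labels:    map[string]string{"app": "zookeeper"},
		Conflicts: []map[string]string{{"app": "zookeeper"}},
	}
	fmt.Println(fitsNode(candidate, []Pod{running})) // false: the node already runs a zookeeper pod
}
```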

PS: If 1.0 is frozen and this needs to wait, would still like to get review comments/feedback.

@k8s-bot

k8s-bot commented Jun 10, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@erictune
Contributor

Do you have use cases for strict anti-affinity other than agent-per-node scheduling?
Other than that case, it's not clear to me when being Pending is better than co-scheduling.

Agent-per-node scheduling can be implemented through a controller, rather than requiring a scheduler change.

@davidopp
Contributor

This can also be used to implement agents (a single instance of the agent on every node in the cluster).

As @erictune alluded to, we will have a per-node controller for one-pod-on-every-node scenarios (#1518). I assume that kind of controller will still let you specify a label selector, so it will be flexible enough to do one-pod-on-every-node-of-a-subset-of-all-nodes as well. I guess the main advantage of the approach you are suggesting is that you wouldn't need to use a special controller. I think there is an interesting open question in Kubernetes about when you want a new controller vs. a configuration option to an existing controller vs. a configuration option on pods (thus a scheduling feature that will work under any controller). There are tradeoffs among usability, flexibility, duplication of code, etc.

@ravigadde
Contributor Author

@erictune
Yes, there are other common use cases.

  1. A 3-pod ZooKeeper or Cassandra cluster - don't want two of those pods to land on the same host.
  2. Mongo with multiple shards is more complicated - don't want any two primaries on the same node, don't want a primary and a secondary from the same shard on the same node, and don't want two secondaries from the same shard on the same node (see the sketch below).
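For illustration only, here is one way the Mongo rules might be encoded as per-pod conflict label sets (hypothetical field and labels, not this PR's final API):

```go
package main

import "fmt"

type conflictSet map[string]string

func main() {
	// A secondary in shard rs0 must not share a node with any other member of rs0
	// (this covers both the primary/secondary and the secondary/secondary rule).
	secondaryConflicts := []conflictSet{
		{"app": "mongo", "shard": "rs0"},
	}

	// A primary additionally must not share a node with any other primary.
	primaryConflicts := []conflictSet{
		{"app": "mongo", "shard": "rs0"},
		{"app": "mongo", "role": "primary"},
	}

	fmt.Println(secondaryConflicts, primaryConflicts)
}
```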

@davidopp
The replication controller cannot guarantee the desired behavior because it can only suggest spreading placement using labels. There are good use cases that need these guarantees from the scheduler.

Thank you for the pointer to #1518. We have an implementation of a replication controller for that and I will submit a patch for it soon. Not sure if anyone is already working on it; happy to collaborate.

@erictune
Contributor

@ravigadde
Thanks for the use cases. I agree with your statement of the use cases.

However, I'm not crystal clear on what the broader implications are of using the current "best-effort spreading" versus the proposed "spreading or bust" behavior. My default position is not to expand the Pod/Scheduler API unless we have a pretty deep understanding of the use cases for the new feature - otherwise it is just clutter, both in the API and in terms of understanding scheduler behaviors.

Here is a first pass at a comparison. A deeper one is welcome.

For use case 1 (zookeeper)

If the cluster is not too full, the best-effort spreading behavior and the spreading-or-bust behavior should have the same result; it is only when the cluster is close to full that they differ. So I think we need to lay out what behavior we want when a statically sized cluster is almost full (such that pods would go pending with the spreading-or-bust policy), and also what behavior we want from an autoscaled cluster when adding nodes is required to ensure all pods schedule and are spread.

For use case 2 (sharded storage)

IIUC, there isn't actually a requirement that the pods themselves have to be on different nodes.
The requirement is that replica chunks should not be on the same node.
MongoDB already handles some migration of chunks. Is it better to have the scheduler leave mongo pods pending (spreading or bust), or for the scheduler to do best-effort spreading and let the mongo balancer try to rebalance chunks to minimize risk?

@davidopp
Contributor

For the first use case, I think an anti-affinity predicate would be useful. I don't think we necessarily need to do the fully generic thing we do in Borg, where you can do an attribute limit on any combination of labels, but at least having attr_limit=host=1 seems pretty reasonable and I don't think it would make the API too confusing. At the very least we need an API extension mechanism so users who want to write their own scheduling policies can pipe options like this all the way through to the scheduler. But my (limited) experience with extensible config languages suggests these turn into a mess. I'd be in favor of adding the feature suggested here in the limited form (just a "max of one of these pods per host" option).

I agree that best-effort spreading works in a not-close-to-full cluster, but I can see the value of making it a guarantee.

The feature isn't a perfect match for the second use case but it seems somewhat more user-friendly than telling the app that it has to fix the layout if it gets unlucky in the best-effort spreading.

@chakri-nelluri
Contributor

@erictune
Hi Eric, IIUC the default spreading scheduler takes anti-affinity into account in its scheduling but makes no guarantees; we cannot strictly guarantee that no two instances run on one node.

It seems Docker also has a similar option: affinity:container!=neo4jclusternode*. I remember CoreOS has support for this in their fleet model.

Also, this makes implementing agents very easy: all we have to do is add an anti-affinity rule and set the replication controller's replica count based on the number of available minions.

Let me know your thoughts.

@davidopp
Contributor

I would propose we defer debating new features until after 1.0.

BTW working on the critical-for-1.0 features is not limited to Google folks. Please feel free to help with any of the unassigned 1.0 issues. We welcome your PRs!

https://github.com/googlecloudplatform/kubernetes/issues?q=is%3Aopen+is%3Aissue+milestone%3Av1.0+no%3Aassignee

@ravigadde
Contributor Author

@erictune
A couple of things I would like to point out:

  1. These apps handle their own replication, so a replication controller is not necessary to run them. In fact, assuming they use local storage for performance, they can't be moved on a node failure - which is perfectly fine because it's a clustered app.
  2. The replication controller doesn't offer any guarantee that two replicas won't land on the same node; it leaves that up to the spreading algorithm and is best effort.

@davidopp
Is attr_limit=host=1 on the image name? It is limiting if it's not tied to a label; it should be possible for ZooKeeper pods from two different ZooKeeper clusters to run on the same node.

@ravigadde
Contributor Author

@davidopp
I heard there is some WIP for some of these features (agents). We would like to contribute to/influence those efforts. Agree with the urgency for 1.0 though, so this discussion can wait or happen at a slower pace than normal. Will look through the bug list and see where I can help.

@bgrant0607
Member

Previous discussions of anti-affinity include #367 and #4301 (comment)

Member

If we were going to add conflicts, we should use standard label selection semantics, as I described here:
#4301 (comment)

Contributor Author

@bgrant0607
Thanks for the pointers and the feedback. I could switch it to using a label selector, but I have one concern with that. With a map, it's possible to specify that if key1 == val1 OR key2 == val2, it's a conflict. A label selector has AND semantics, which wouldn't make this possible.

The OR semantics may be very useful for conflicts. The label selector operators likely to be used with conflicts are EqualsOperator, DoubleEqualsOperator, InOperator, and ExistsOperator. Other than InOperator, all of them are covered by a map, albeit with OR semantics.

Let me know your thoughts on how to proceed.

Member

In the issue I cited, I proposed a list of selectors:

Another alternative model would be a list of label selectors in the Pod to be used for anti-affinity.

While individual selectors wouldn't provide top-level OR, the list would: if a pod's labels matched selector1 OR selector2 OR selector3, then it would be considered a conflict.

I don't recommend this mechanism for anti-affinity for like pods, though it could be used that way.
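For concreteness, a rough self-contained sketch of that OR-of-ANDs evaluation (simplified types, not the actual labels package): each selector is a conjunction of requirements, and the pod conflicts if its labels satisfy any selector in the list.

```go
package main

import "fmt"

type requirement struct {
	key, op, value string // op is "=" or "exists"
}

// A selector is a conjunction of requirements (AND).
type selector []requirement

func (s selector) matches(labels map[string]string) bool {
	for _, r := range s {
		v, ok := labels[r.key]
		switch r.op {
		case "exists":
			if !ok {
				return false
			}
		default: // "="
			if !ok || v != r.value {
				return false
			}
		}
	}
	return true
}

// conflicts returns true if any selector in the list matches (OR across the list).
func conflicts(selectors []selector, podLabels map[string]string) bool {
	for _, s := range selectors {
		if s.matches(podLabels) {
			return true
		}
	}
	return false
}

func main() {
	antiAffinity := []selector{
		{{key: "app", op: "=", value: "zookeeper"}, {key: "cluster", op: "=", value: "zk-1"}},
		{{key: "role", op: "=", value: "primary"}},
	}
	fmt.Println(conflicts(antiAffinity, map[string]string{"app": "zookeeper", "cluster": "zk-1"})) // true
	fmt.Println(conflicts(antiAffinity, map[string]string{"app": "zookeeper", "cluster": "zk-2"})) // false
}
```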

Contributor Author

@bgrant0607
Makes sense, sorry I missed that. I am confused by the last line in your comment: should I or should I not make the change for anti-affinity using a list of label selectors?

Contributor Author

@bgrant0607

When I tried to use []labels.LabelSelector, I ran into this issue with the generated conversion code.

pkg/api/v1/conversion_generated.go:1535: cannot use make([]labels.LabelSelector, len(in.Conflicts)) (type []labels.LabelSelector) as type [][]labels.Requirement in assignment

Perhaps someone familiar with the code generation can comment on whether slice of slice is ok.
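For reference, the mismatch can be reproduced in isolation (placeholder types below, not the actual generated code): a named slice type is not identical to its underlying slice type, so a []LabelSelector value cannot be assigned where [][]Requirement is expected without converting element by element.

```go
package main

type Requirement struct{ Key, Value string }

// LabelSelector is a named slice of requirements, mirroring the error above.
type LabelSelector []Requirement

func main() {
	var selectors []LabelSelector
	var raw [][]Requirement

	// raw = selectors // does not compile: cannot use selectors (type []LabelSelector) as type [][]Requirement

	// An explicit per-element conversion works instead:
	raw = make([][]Requirement, len(selectors))
	for i := range selectors {
		raw[i] = []Requirement(selectors[i])
	}
	_ = raw
}
```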

I changed it to [][]labels.Requirement for now and submitted a patch.

@bgrant0607 bgrant0607 added this to the v1.0-post milestone Jun 25, 2015
@bgrant0607
Member

Sorry I didn't see this earlier. I'm swamped with 1.0 at the moment, but let me know if you have questions about the comments in the referenced issues and I'll try to get to them when I can.

@bgrant0607
Member

Assigning to myself since it's gated on an API change.

@bgrant0607 bgrant0607 assigned bgrant0607 and unassigned davidopp Jun 25, 2015
@bgrant0607
Member

@ravigadde If you're looking to help, we're currently focused on bug fixes, critical performance improvements, documentation, and usability improvements (e.g., kubectl improvements), in addition to those things in the v1.0 milestone on github.

ravigadde referenced this pull request in rjnagal/kubernetes Jun 25, 2015
@ravigadde ravigadde force-pushed the master branch 2 times, most recently from 4302625 to 3e6c291 on June 26, 2015 08:38
Member

See also #341 and #7053. We need a LabelSelector type that has the right json, description, and patch tags.
This will also need description and patch tags.
Insert the field just below NodeSelector.
cc @sdminonne

@ravigadde
Contributor Author

@bgrant0607

Changed to using []labels.LabelSelector and squashed the changes. I had to make some changes to the conversion and deep-copy code for slices; I am not sure if there is a better way of handling that. The recursive calls convert []labels.LabelSelector to [][]labels.Requirement, which the compiler doesn't like.

Passed the unit tests. Not sure why Shippable fails.

…netes

Conflicts:
	pkg/api/v1/conversion_generated.go
	pkg/api/v1/types.go
	pkg/api/v1beta3/conversion.go
	pkg/api/v1beta3/deep_copy_generated.go
	pkg/api/v1beta3/types.go
	pkg/util/set.go
Contributor

I would suggest calling this NodeConflictSelectors

We're already using "affinity" to refer to best-effort spreading policies, so I think it's better if we use a new word ("conflict") here. I believe @bgrant0607 was OK with this.

Also, the reason I'm suggesting NodeConflictSelectors instead of ConflictSelectors is that we might eventually want Pods to be able to conflict with Pods that are already running on the node, not just with labels on the Node (in which case we might call that PodConflictSelectors or something like that).

Contributor Author

@davidopp

Thanks for the suggestion. This changeset is for the second use case you mentioned, for Pods to be able to conflict with other Pods that are already scheduled to a Node. Should I call it PodConflictSelectors?

Conflicts:
	pkg/api/v1/types.go
@k8s-bot

k8s-bot commented Jul 27, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@ravigadde
Contributor Author

I will wait for #7053 to be merged before refreshing the code; there is overlap in the selector-related changes.

Member

PodConflictSelectors SGTM

@timothysc
Contributor

This is one piece of the puzzle that overlaps somewhat with some of the things we would like to do. However, the changeset in the diff seems pretty conflated.

There is cleanup/preference work alongside the feature addition. IMHO they are orthogonal and should be broken up.

@ravigadde
Contributor Author

@timothysc

Totally agree. I ran into conversion problems with a slice of LabelSelector, which itself happens to be a slice of Requirement. Are you aware of anyone working on a fix for this? The generated code doesn't compile, as it says the left and right sides are of incompatible types. IIRC, the conversion code expands the right side to [][]Requirement while the left side is []LabelSelector.

@bgrant0607
Member

v1beta3 is gone. Please clean that up.

@sdminonne re. the label selector changes.

@davidopp
Contributor

davidopp commented Aug 3, 2015

Is the thing you want to specify here []labels.LabelSelector (which is what you have in the PR) or []string where the string is a key for a label selector? That is, should the semantics be "don't put any two Pods on the same machine if the Pods have the same value of label X" instead of what you currently have ("don't put any two Pods on the same machine if the Pods have value Y for label X")?
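To make the contrast concrete, a rough sketch of the two semantics (simplified helpers with hypothetical names):

```go
package main

import "fmt"

// selectorConflict: "don't co-locate if the existing pod has value Y for label X"
// - the selector pins both the key and the value.
func selectorConflict(sel, existing map[string]string) bool {
	for k, v := range sel {
		if existing[k] != v {
			return false
		}
	}
	return true
}

// keyConflict: "don't co-locate if the two pods have the same value of label X"
// - only the key is given; any shared value counts as a conflict.
func keyConflict(keys []string, candidate, existing map[string]string) bool {
	for _, k := range keys {
		cv, cok := candidate[k]
		ev, eok := existing[k]
		if cok && eok && cv == ev {
			return true
		}
	}
	return false
}

func main() {
	a := map[string]string{"shard": "rs0"}
	b := map[string]string{"shard": "rs0"}
	fmt.Println(selectorConflict(map[string]string{"shard": "rs0"}, b)) // true: the value is pinned to "rs0"
	fmt.Println(keyConflict([]string{"shard"}, a, b))                   // true: the values match, whatever they are
}
```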

@ravigadde
Contributor Author

@davidopp It is the second semantic (labels with the same value conflict). I would like to specify it as []labels.LabelSelector in the spec; []string is more abstract and may not make it clear to the user what to specify.

Will clean up the diff after #7053 is merged.

@k8s-github-robot k8s-github-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 27, 2015
@k8s-github-robot

Labelling this PR as size/XXL

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 1, 2015
@k8s-github-robot

Labelling this PR as size/L

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 9, 2015
@bgrant0607
Member

#7053 hasn't progressed in months. I don't think this PR needs to block on it if you want to continue with it. But it should start with the more general style of selector.

@ravigadde
Contributor Author

@bgrant0607 @davidopp @sdminonne

I am almost ready with the changes; I resolved the deep-copy issue with slices of slices in a better way. Should we move the selector to the api package? I see at least one test case that assumes objects are of "versioned" or "built-in" types (validateField in SwaggerSchema).

--- FAIL: TestValidateOk (0.10s)
schema_test.go:80: unexpected error: unexpected type: labels.LabelSelector

I was hoping this issue would be addressed in #7053. There is another test failure, TestRoundTripTypes, that I still need to debug.

@ravigadde
Contributor Author

Continued in #14543

@ravigadde ravigadde closed this Sep 25, 2015