Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Oct 1, 2019

We grew replicas-zeroing in c22d042 (#1649) to set the stage for changing the replicas: 0 semantics from "we'll make you some dummy MachineSets" to "we won't make you MachineSets". But that hasn't happened yet, and since 64f96df (#2004) replicas: 0 for compute has also meant "add the worker role to control-plane nodes". That leads to racy problems when ingress comes through a load balancer, because Kubernetes load balancers exclude control-plane nodes from their target set (see here and here, although this may get relaxed soonish). If the router pods get scheduled on the control plane machines due to the worker role, they are not reachable from the load balancer and ingress routing breaks. @sjenning says:

pod nodeSelectors are not like taints/tolerations. They only have effect at scheduling time. They are not continually enforced.

which means that attempting to address this issue as a day-2 operation would mean removing the worker role from the control-plane nodes and then manually evicting the router pods to force rescheduling. So until we get the changes from here, we can either drop the zeroing (#2402) or adjust the scheduler configuration to remove the effect of the zeroing. In both cases, this is a change we'll want to revert later once we bump Kubernetes to pick up a fix for the service load-balancer targets.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Oct 1, 2019
@openshift-ci-robot
Copy link
Contributor

@wking: This pull request references Bugzilla bug 1755073, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Bug 1755073: docs/user/*/install_upi:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 1, 2019
@abhinavdahiya
Copy link
Contributor

/approve

for another review
/cc @sdodson

@sdodson
Copy link
Member

sdodson commented Oct 2, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 2, 2019
@wking wking force-pushed the explicitly-set-control-plane-unschedulable branch from 7daf18e to 55caef5 Compare October 2, 2019 17:05
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 2, 2019
@wking wking changed the title Bug 1755073: docs/user/*/install_upi: Bug 1755073: docs/user/*/install_upi: explicitly-set-control-plane-unschedulable Oct 2, 2019
@wking
Copy link
Member Author

wking commented Oct 2, 2019

I'd left off the meat of the commit-message subject 🤦‍♂️. Fixed with 7daf18eb8 -> 55caef592. Can I get a fresh /lgtm?

@sdodson
Copy link
Member

sdodson commented Oct 2, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 2, 2019
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

routers are now ingress controllers :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

routers are now ingress controllers :-)

The pod is still called router. But is "ingress controller pod" the preferred English? Do we have plans to rename the pods to match?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have "Ingress router pods" in the official docs, so if any rephrasing is needed we'd probably want to bump those too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking
Copy link
Member Author

wking commented Oct 2, 2019

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2019
We grew replicas-zeroing in c22d042 (docs/user/aws/install_upi: Add
'sed' call to zero compute replicas, 2019-05-02, openshift#1649) to set the
stage for changing the 'replicas: 0' semantics from "we'll make you
some dummy MachineSets" to "we won't make you MachineSets".  But that
hasn't happened yet, and since 64f96df (scheduler: Use schedulable
masters if no compute hosts defined, 2019-07-16, openshift#2004) 'replicas: 0'
for compute has also meant "add the 'worker' role to control-plane
nodes".  That leads to racy problems when ingress comes through a load
balancer, because Kubernetes load balancers exclude control-plane
nodes from their target set [1,2] (although this may get relaxed
soonish [3]).  If the router pods get scheduled on the control plane
machines due to the 'worker' role, they are not reachable from the
load balancer and ingress routing breaks [4].  Seth says:

> pod nodeSelectors are not like taints/tolerations.  They only have
> effect at scheduling time.  They are not continually enforced.

which means that attempting to address this issue as a day-2 operation
would mean removing the 'worker' role from the control-plane nodes and
then manually evicting the router pods to force rescheduling.  So
until we get the changes from [3], we can either drop the zeroing [5]
or adjust the scheduler configuration to remove the effect of the
zeroing.  In both cases, this is a change we'll want to revert later
once we bump Kubernetes to pick up a fix for the service load-balancer
targets.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1
[2]: kubernetes/kubernetes#65618
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1744370#c6
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1755073
[5]: openshift#2402
@wking wking force-pushed the explicitly-set-control-plane-unschedulable branch from 55caef5 to 485057a Compare October 2, 2019 20:11
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 2, 2019
@wking
Copy link
Member Author

wking commented Oct 2, 2019

/hold cancel

Still need to work out #2440 (comment) , but with the True->False bumps clearing the lgtm, we no longer need the hold to keep this from merging.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2019
@abhinavdahiya
Copy link
Contributor

/lgtm

/hold
to see if @danehans thinks the doc change referencing router is required.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2019
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 2, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, sdodson, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,sdodson,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdodson
Copy link
Member

sdodson commented Oct 3, 2019

/hold cancel
Lets not block on that, surely everyone will understand what is meant and it matches the name of the pod in current implementation. PRs welcome though 😁

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 3, 2019
@openshift-merge-robot openshift-merge-robot merged commit 056a561 into openshift:master Oct 3, 2019
@wking wking deleted the explicitly-set-control-plane-unschedulable branch October 3, 2019 16:22
@danehans
Copy link
Contributor

danehans commented Oct 3, 2019

I don't think the router > ingress controller is required, but an overall installer renaming for all instances of router should be considered as a future effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants