Avoid reboots for ICSP changes #1922

beekhof · 2020-07-13T12:03:50Z

- What I did

Cleaned up the PoC code for openshift/enhancements#159, replaces PR #1626

- How to verify it

e2e and unit tests included

- Description for the changelog

Provide an alternative to node reboots in the specific cases that we know it is safe to do so

openshift-ci-robot · 2020-07-13T12:04:09Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: beekhof
To complete the pull request process, please assign ericavonb
You can assign the PR to them by writing /assign @ericavonb in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

runcom · 2020-07-13T12:12:14Z

pkg/daemon/apply.go

s/Storage/Registy/g perhaps? Also, I think this must be abstracted out, we won't add if/else(s) to this

I'm also in two minds about have flexible vs. simple this should be.
Definitely open to suggestions.

At one point there was basically a lookup table that mapped filenames to actions, with reboot used for anything not present.
How would you feel about re-introducing that?

runcom · 2020-07-13T12:12:45Z

pkg/daemon/apply.go

uhm, same as above - don't have a solution rn, will think about it!

pkg/daemon/update.go

pkg/daemon/apply.go

kikisdeliveryservice

All of the user stories on this are baremetal, why not only roll this out to baremetal clusters (at least first)?

ashcrow · 2020-07-16T13:54:16Z

All of the user stories on this are baremetal, why not only roll this out to baremetal clusters (at least first)?

That makes a lot of sense to me 👍

beekhof · 2020-07-16T23:19:28Z

I'd be happy to do that... any hints on how?

beekhof · 2020-07-17T00:20:35Z

/retest e2e-aws

openshift-ci-robot · 2020-07-17T00:20:49Z

@beekhof: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

/test cluster-bootimages
/test e2e-aws
/test e2e-aws-disruptive
/test e2e-aws-scaleup-rhel7
/test e2e-azure
/test e2e-gcp-op
/test e2e-gcp-upgrade
/test e2e-metal-ipi
/test e2e-openstack
/test e2e-ovirt
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-upi
/test images
/test unit
/test verify

Use /test all to run the following jobs:

pull-ci-openshift-machine-config-operator-master-e2e-aws
pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade
pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/retest e2e-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

beekhof · 2020-07-17T00:21:27Z

/test e2e-aws

beekhof · 2020-07-18T11:49:19Z

//nogocyclo

Signed-off-by: Andrew Beekhof <[email protected]>

beekhof · 2020-07-28T04:32:12Z

I0724 03:14:14.087933 2040 update.go:1511] Starting update from rendered-worker-ae9b36c934fcb6457ef7cd6a2165608a to rendered-worker-ae9b36c934fcb6457ef7cd6a2165608a: &{osUpdate:false kargs:false fips:false passwd:false files:false units:false kernelType:false}

From your previous comment, that seems an update to the very same machine config - is that a result of this PR? is it causing it to restart again?

A second go around the update was a result of this PR.
Looks like the finalise flow was incomplete, I'm testing a followup patch that should address this.

beekhof · 2020-07-28T04:59:13Z

Can you split off a "prep patches" PR from this? For example I split off #1947 from #1946

I'll leave this PR as-is for now, so that I can look into feature sets, but I've created #1953 which is a squashed version that still always performs a reboot.

runcom · 2020-07-28T07:58:47Z

What is the reason for needing to introduce this config flag? To further @cgwalters point above, adding another flag introduces another variant for our clusters, so I'm hesitant to go that route. Is this too risky to enable across the board?

@crawford this is a shift of almost 2y on this pattern of "always reboot", now we're going the route of "avoid reboots" which, given how important reboots are for upgrades, warrants more time to finalize and stabilize. In the enhancement we've clarified that this ICSP case is definitely a start in line with the plan of avoiding reboots and we can evolve from here but we haven't gauged and thought about other cases where maybe this needs further changes and maybe some break here and there (which will make us support another migration? or added logic to handle different cases for compatibility?).
What we're proposing sounds aligned with how Kube introduces features. Sure, it creates another variant but we know this feature targets a specific use case already whether we won't have any enabler on most clouds :) Narrowing this down to whoever opt-in for this feature gives us some space in the future to discuss and come back at this, if we enable it everywhere, we're going to rush fixing whatever cases we missed I think.

There are ~4 ways this can be gated:

featuregates like @cgwalters suggested, the MCD would just read that and act accordingly
gate on the platform we know is seeking this feature like @kikisdeliveryservice suggested (from within the MCD again)
gate at the MCP level (for better testing)
a combination of the above? platform+feature gate somehow?

Generally I agree that having more configuration flag adds to the supportability matrix. However, in this case it may be warranted. Previously in talking with @runcom and members of the MCO team there was a lot of research they wanted to do before making a fully fledged feature allowing for limited rebooting. If this is still the case I support limiting the scope of this feature to just the area that this is needed sooner rather than later with the understanding it will likely change with output of the MCO teams design work. If this is not the case and the above is now as far as we are taking this work, then I'd tend to agree that limiting may not be helpful.

Right; if we're doing gating, doing it on the pool makes a lot more sense to me, because it will allow us to easily test the code on cloud platforms (there's nothing actually metal specific about it).

@cgwalters @ashcrow I've replied above for those comments as well

…reboots flow Signed-off-by: Andrew Beekhof <[email protected]>

beekhof · 2020-07-29T03:01:56Z

I0724 03:14:14.087933 2040 update.go:1511] Starting update from rendered-worker-ae9b36c934fcb6457ef7cd6a2165608a to rendered-worker-ae9b36c934fcb6457ef7cd6a2165608a: &{osUpdate:false kargs:false fips:false passwd:false files:false units:false kernelType:false}

From your previous comment, that seems an update to the very same machine config - is that a result of this PR? is it causing it to restart again?

A second go around the update was a result of this PR.
Looks like the finalise flow was incomplete, I'm testing a followup patch that should address this.

@runcom fixed in the latest patch

beekhof · 2020-07-29T07:19:29Z

Is there a final decision about whether to gate, and if so, how?

openshift-ci-robot · 2020-09-14T18:08:40Z

@beekhof: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-aws-proxy	ad7765a93df7b7fcfb903a6ea752b8c2f5dd8efc	link	`/test e2e-aws-proxy`
ci/prow/e2e-gcp-upgrade	`a4f56a2`	link	`/test e2e-gcp-upgrade`
ci/prow/e2e-gcp-op	`a4f56a2`	link	`/test e2e-gcp-op`
ci/prow/e2e-metal-ipi	`a4f56a2`	link	`/test e2e-metal-ipi`
ci/prow/e2e-upgrade	`a4f56a2`	link	`/test e2e-upgrade`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-merge-robot · 2020-11-05T21:51:00Z

@beekhof: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-agnostic-upgrade	`a4f56a2`	link	`/test e2e-agnostic-upgrade`
ci/prow/e2e-aws-serial	`a4f56a2`	link	`/test e2e-aws-serial`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-bot · 2021-02-04T01:35:48Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2021-03-06T01:50:08Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2021-04-05T06:11:12Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot · 2021-04-05T06:11:24Z

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 13, 2020

openshift-ci-robot requested review from runcom and sinnykumari July 13, 2020 12:04

runcom reviewed Jul 13, 2020

View reviewed changes

pkg/daemon/apply.go Outdated

Copy link

Member

runcom Jul 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm, same as above - don't have a solution rn, will think about it!

runcom reviewed Jul 13, 2020

View reviewed changes

pkg/daemon/update.go Outdated Show resolved Hide resolved

mhrivnak reviewed Jul 14, 2020

View reviewed changes

pkg/daemon/apply.go Outdated Show resolved Hide resolved

pkg/daemon/apply.go Outdated Show resolved Hide resolved

pkg/daemon/apply.go Outdated Show resolved Hide resolved

beekhof force-pushed the PR-1626-refactored branch from 69df51d to 1ead451 Compare July 15, 2020 04:02

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2020

beekhof force-pushed the PR-1626-refactored branch 2 times, most recently from d094e44 to 0093ff6 Compare July 15, 2020 06:22

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2020

beekhof force-pushed the PR-1626-refactored branch from 0093ff6 to b2bacda Compare July 15, 2020 06:39

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2020

kikisdeliveryservice reviewed Jul 15, 2020

View reviewed changes

beekhof mentioned this pull request Jul 16, 2020

RFC: Reboot whitelisting #1626

Closed

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2020

beekhof force-pushed the PR-1626-refactored branch 2 times, most recently from 16b9bc0 to 9f02c47 Compare July 18, 2020 11:26

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2020

beekhof force-pushed the PR-1626-refactored branch from 9f02c47 to 3c58868 Compare July 18, 2020 11:47

beekhof changed the title ~~WIP: Avoid reboots for ICSP changes~~ Avoid reboots for ICSP changes Jul 18, 2020

beekhof added 11 commits July 28, 2020 12:49

Add basic unit tests

55de13c

Signed-off-by: Andrew Beekhof <[email protected]>

Improved unit tests

6793ff8

Signed-off-by: Andrew Beekhof <[email protected]>

Convert actionResult to a public interface so that we can compare them

5988356

Signed-off-by: Andrew Beekhof <[email protected]>

Add additional unit tests for applying file based changes

8070dab

Signed-off-by: Andrew Beekhof <[email protected]>

Removed unneeded test inputs

ce68cba

Signed-off-by: Andrew Beekhof <[email protected]>

Reduce the cyclomatic complexity of update() in pkg/daemon

c39a8cd

Signed-off-by: Andrew Beekhof <[email protected]>

Remove unnecessary use of Daemon from ActionResult

6f07cb2

Signed-off-by: Andrew Beekhof <[email protected]>

Log each action description in ActionResult tests

647e637

Signed-off-by: Andrew Beekhof <[email protected]>

Prevent use-of-nil when finalizing MachineConfig updates

570fa31

Signed-off-by: Andrew Beekhof <[email protected]>

Add an e2e test for registry changes

d6dfffd

Signed-off-by: Andrew Beekhof <[email protected]>

Rename the interface for config update actions

54b43c3

beekhof force-pushed the PR-1626-refactored branch from 08de2c6 to 54b43c3 Compare July 28, 2020 03:04

beekhof mentioned this pull request Jul 28, 2020

Prepare for rebooting to be an optional part of applying updates #1953

Closed

Refactor MCD bootstrap finalize so that it can be reused for the non-…

a4f56a2

…reboots flow Signed-off-by: Andrew Beekhof <[email protected]>

beekhof force-pushed the PR-1626-refactored branch from e59a449 to a4f56a2 Compare July 29, 2020 03:02

openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2021

openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 6, 2021

openshift-ci-robot closed this Apr 5, 2021

Avoid reboots for ICSP changes #1922

Avoid reboots for ICSP changes #1922

Uh oh!

Conversation

beekhof commented Jul 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Jul 13, 2020

Uh oh!

runcom Jul 13, 2020

Choose a reason for hiding this comment

Uh oh!

beekhof Jul 13, 2020

Choose a reason for hiding this comment

Uh oh!

beekhof Jul 13, 2020

Choose a reason for hiding this comment

Uh oh!

runcom Jul 13, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kikisdeliveryservice left a comment

Choose a reason for hiding this comment

Uh oh!

ashcrow commented Jul 16, 2020

Uh oh!

beekhof commented Jul 16, 2020

Uh oh!

beekhof commented Jul 17, 2020

Uh oh!

openshift-ci-robot commented Jul 17, 2020

Uh oh!

beekhof commented Jul 17, 2020

Uh oh!

beekhof commented Jul 18, 2020

Uh oh!

beekhof commented Jul 28, 2020

Uh oh!

beekhof commented Jul 28, 2020

Uh oh!

runcom commented Jul 28, 2020

Uh oh!

beekhof commented Jul 29, 2020

Uh oh!

beekhof commented Jul 29, 2020

Uh oh!

openshift-ci-robot commented Sep 14, 2020

Uh oh!

openshift-merge-robot commented Nov 5, 2020

Uh oh!

openshift-bot commented Feb 4, 2021

Uh oh!

openshift-bot commented Mar 6, 2021

Uh oh!

openshift-bot commented Apr 5, 2021

Uh oh!

openshift-ci-robot commented Apr 5, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

beekhof commented Jul 13, 2020 •

edited

Loading