Skip to content

Conversation

@beekhof
Copy link

@beekhof beekhof commented Jul 13, 2020

- What I did

Cleaned up the PoC code for openshift/enhancements#159, replaces PR #1626

- How to verify it

e2e and unit tests included

- Description for the changelog

Provide an alternative to node reboots in the specific cases that we know it is safe to do so

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 13, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: beekhof
To complete the pull request process, please assign ericavonb
You can assign the PR to them by writing /assign @ericavonb in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Storage/Registy/g perhaps? Also, I think this must be abstracted out, we won't add if/else(s) to this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also in two minds about have flexible vs. simple this should be.
Definitely open to suggestions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At one point there was basically a lookup table that mapped filenames to actions, with reboot used for anything not present.
How would you feel about re-introducing that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm, same as above - don't have a solution rn, will think about it!

@beekhof beekhof force-pushed the PR-1626-refactored branch from 69df51d to 1ead451 Compare July 15, 2020 04:02
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2020
@beekhof beekhof force-pushed the PR-1626-refactored branch 2 times, most recently from d094e44 to 0093ff6 Compare July 15, 2020 06:22
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2020
@beekhof beekhof force-pushed the PR-1626-refactored branch from 0093ff6 to b2bacda Compare July 15, 2020 06:39
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2020
Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the user stories on this are baremetal, why not only roll this out to baremetal clusters (at least first)?

@ashcrow
Copy link
Member

ashcrow commented Jul 16, 2020

All of the user stories on this are baremetal, why not only roll this out to baremetal clusters (at least first)?

That makes a lot of sense to me 👍

@beekhof
Copy link
Author

beekhof commented Jul 16, 2020

I'd be happy to do that... any hints on how?

@beekhof
Copy link
Author

beekhof commented Jul 17, 2020

/retest e2e-aws

@openshift-ci-robot
Copy link
Contributor

@beekhof: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

  • /test cluster-bootimages
  • /test e2e-aws
  • /test e2e-aws-disruptive
  • /test e2e-aws-scaleup-rhel7
  • /test e2e-azure
  • /test e2e-gcp-op
  • /test e2e-gcp-upgrade
  • /test e2e-metal-ipi
  • /test e2e-openstack
  • /test e2e-ovirt
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upi
  • /test images
  • /test unit
  • /test verify

Use /test all to run the following jobs:

  • pull-ci-openshift-machine-config-operator-master-e2e-aws
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi
  • pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/retest e2e-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@beekhof
Copy link
Author

beekhof commented Jul 17, 2020

/test e2e-aws

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2020
@beekhof beekhof force-pushed the PR-1626-refactored branch 2 times, most recently from 16b9bc0 to 9f02c47 Compare July 18, 2020 11:26
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2020
@beekhof beekhof force-pushed the PR-1626-refactored branch from 9f02c47 to 3c58868 Compare July 18, 2020 11:47
@beekhof
Copy link
Author

beekhof commented Jul 18, 2020

//nogocyclo

@beekhof beekhof changed the title WIP: Avoid reboots for ICSP changes Avoid reboots for ICSP changes Jul 18, 2020
@beekhof beekhof force-pushed the PR-1626-refactored branch from 08de2c6 to 54b43c3 Compare July 28, 2020 03:04
@beekhof
Copy link
Author

beekhof commented Jul 28, 2020

I0724 03:14:14.087933 2040 update.go:1511] Starting update from rendered-worker-ae9b36c934fcb6457ef7cd6a2165608a to rendered-worker-ae9b36c934fcb6457ef7cd6a2165608a: &{osUpdate:false kargs:false fips:false passwd:false files:false units:false kernelType:false}

From your previous comment, that seems an update to the very same machine config - is that a result of this PR? is it causing it to restart again?

A second go around the update was a result of this PR.
Looks like the finalise flow was incomplete, I'm testing a followup patch that should address this.

@beekhof
Copy link
Author

beekhof commented Jul 28, 2020

Can you split off a "prep patches" PR from this? For example I split off #1947 from #1946

I'll leave this PR as-is for now, so that I can look into feature sets, but I've created #1953 which is a squashed version that still always performs a reboot.

@runcom
Copy link
Member

runcom commented Jul 28, 2020

What is the reason for needing to introduce this config flag? To further @cgwalters point above, adding another flag introduces another variant for our clusters, so I'm hesitant to go that route. Is this too risky to enable across the board?

@crawford this is a shift of almost 2y on this pattern of "always reboot", now we're going the route of "avoid reboots" which, given how important reboots are for upgrades, warrants more time to finalize and stabilize. In the enhancement we've clarified that this ICSP case is definitely a start in line with the plan of avoiding reboots and we can evolve from here but we haven't gauged and thought about other cases where maybe this needs further changes and maybe some break here and there (which will make us support another migration? or added logic to handle different cases for compatibility?).
What we're proposing sounds aligned with how Kube introduces features. Sure, it creates another variant but we know this feature targets a specific use case already whether we won't have any enabler on most clouds :) Narrowing this down to whoever opt-in for this feature gives us some space in the future to discuss and come back at this, if we enable it everywhere, we're going to rush fixing whatever cases we missed I think.

There are ~4 ways this can be gated:

  • featuregates like @cgwalters suggested, the MCD would just read that and act accordingly
  • gate on the platform we know is seeking this feature like @kikisdeliveryservice suggested (from within the MCD again)
  • gate at the MCP level (for better testing)
  • a combination of the above? platform+feature gate somehow?

Generally I agree that having more configuration flag adds to the supportability matrix. However, in this case it may be warranted. Previously in talking with @runcom and members of the MCO team there was a lot of research they wanted to do before making a fully fledged feature allowing for limited rebooting. If this is still the case I support limiting the scope of this feature to just the area that this is needed sooner rather than later with the understanding it will likely change with output of the MCO teams design work. If this is not the case and the above is now as far as we are taking this work, then I'd tend to agree that limiting may not be helpful.

Right; if we're doing gating, doing it on the pool makes a lot more sense to me, because it will allow us to easily test the code on cloud platforms (there's nothing actually metal specific about it).

@cgwalters @ashcrow I've replied above for those comments as well

@beekhof
Copy link
Author

beekhof commented Jul 29, 2020

I0724 03:14:14.087933 2040 update.go:1511] Starting update from rendered-worker-ae9b36c934fcb6457ef7cd6a2165608a to rendered-worker-ae9b36c934fcb6457ef7cd6a2165608a: &{osUpdate:false kargs:false fips:false passwd:false files:false units:false kernelType:false}

From your previous comment, that seems an update to the very same machine config - is that a result of this PR? is it causing it to restart again?

A second go around the update was a result of this PR.
Looks like the finalise flow was incomplete, I'm testing a followup patch that should address this.

@runcom fixed in the latest patch

@beekhof beekhof force-pushed the PR-1626-refactored branch from e59a449 to a4f56a2 Compare July 29, 2020 03:02
@beekhof
Copy link
Author

beekhof commented Jul 29, 2020

Is there a final decision about whether to gate, and if so, how?

@openshift-ci-robot
Copy link
Contributor

@beekhof: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-proxy ad7765a93df7b7fcfb903a6ea752b8c2f5dd8efc link /test e2e-aws-proxy
ci/prow/e2e-gcp-upgrade a4f56a2 link /test e2e-gcp-upgrade
ci/prow/e2e-gcp-op a4f56a2 link /test e2e-gcp-op
ci/prow/e2e-metal-ipi a4f56a2 link /test e2e-metal-ipi
ci/prow/e2e-upgrade a4f56a2 link /test e2e-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot
Copy link
Contributor

@beekhof: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-agnostic-upgrade a4f56a2 link /test e2e-agnostic-upgrade
ci/prow/e2e-aws-serial a4f56a2 link /test e2e-aws-serial

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2021
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 6, 2021
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Copy link
Contributor

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants