@wking
Member

@wking wking commented Sep 14, 2021

Since this package was created in d9f6718 (#7), the volume(mount) merge logic has required manifest entries to exist, but has allowed in-cluster entries to persist without removal. That hasn't been a problem until:

  1. In 4.3, the autoscaler asked for a ca-cert volume mount, based on the cluster-autoscaler-operator-ca config map.
  2. In 4.4, the autoscaler dropped those manifest entries.
  3. In 4.9, the autoscaler asked the CVO to remove the config map.

That led some born-in 4.3 clusters to have crashlooping autoscalers, because the mount attempts kept failing on the missing config map.

We couldn't think of a plausible reason why cluster admins would want to inject additional volume mounts in a CVO-managed pod configuration, so this commit removes that ability and begins clearing away any volume(mount) configuration that is not present in the reconciling manifest. Cluster administrators who do need to add additional mounts in an emergency are free to use ClusterVersion's spec.overrides to take control of a particular CVO-managed resource.

This joins a series of similar previous tightenings, including 02bb9ba (#549) and ca299b8 (#322).
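
For reviewers who want the new behavior in miniature, here is a rough sketch of the tightened merge. It is not the code in lib/resourcemerge/core: the `Volume` type is a simplified stand-in for the Kubernetes volume type, `ensureVolumes` is a hypothetical helper name, and the `cert` volume is invented for the example. Only `ca-cert` and `cluster-autoscaler-operator-ca` come from the description above.

```go
package main

import "fmt"

// Volume is a simplified stand-in for a pod volume (assumption: the real
// type has many more fields; two comparable strings are enough here).
type Volume struct {
	Name      string
	ConfigMap string
}

// ensureVolumes is a hypothetical helper. It returns the volumes that should
// remain after reconciling existing (in-cluster) against required (manifest),
// plus whether anything changed.
func ensureVolumes(existing, required []Volume) ([]Volume, bool) {
	want := make(map[string]Volume, len(required))
	for _, r := range required {
		want[r.Name] = r
	}

	modified := false
	kept := make(map[string]bool, len(required))
	merged := make([]Volume, 0, len(required))

	for _, cur := range existing {
		r, inManifest := want[cur.Name]
		if !inManifest || kept[cur.Name] {
			// Not in the manifest (or a duplicate name): drop it.
			modified = true
			continue
		}
		if cur != r {
			modified = true // the manifest wins on content
		}
		merged = append(merged, r)
		kept[cur.Name] = true
	}

	// Add manifest entries that were missing in-cluster.
	for _, r := range required {
		if !kept[r.Name] {
			merged = append(merged, r)
			kept[r.Name] = true
			modified = true
		}
	}
	return merged, modified
}

func main() {
	existing := []Volume{
		{Name: "ca-cert", ConfigMap: "cluster-autoscaler-operator-ca"},
		{Name: "cert"}, // hypothetical volume still present in the manifest
	}
	required := []Volume{{Name: "cert"}}

	merged, modified := ensureVolumes(existing, required)
	fmt.Println(len(merged), merged[0].Name, modified)
}
```

The old behavior corresponds to skipping the "drop it" branch, which is how the stale `ca-cert` entry survived on clusters born in 4.3.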

@openshift-ci
Contributor

openshift-ci bot commented Sep 14, 2021

@wking: This pull request references Bugzilla bug 2002834, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @shellyyang1989

In response to this:

Bug 2002834: lib/resourcemerge/core: Remove unrecognized volumes and mounts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Sep 14, 2021
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2021
@wking wking force-pushed the remove-volumes-not-in-manifest branch 4 times, most recently from 44618c3 to b5cba93 Compare September 14, 2021 23:50
Since this package was created in d9f6718 (lib: add lib for
applying objects, 2018-08-14, openshift#7), the volume(mount) merge logic has
required manifest entries to exist, but has allowed in-cluster entries
to persist without removal.  That hasn't been a problem until [1]:

1. In 4.3, the autoscaler asked for a ca-cert volume mount, based on
   the cluster-autoscaler-operator-ca config map.
2. In 4.4, the autoscaler dropped those manifest entries [2].
3. In 4.9, the autoscaler asked the CVO to remove the config map [3].

That led some born-in 4.3 clusters to have crashlooping autoscalers,
because the mount attempts kept failing on the missing config map.

We couldn't think of a plausible reason why cluster admins would want
to inject additional volume mounts in a CVO-managed pod configuration,
so this commit removes that ability and begins clearing away any
volume(mount) configuration that is not present in the reconciling
manifest.  Cluster administrators who do need to add additional mounts
in an emergency are free to use ClusterVersion's spec.overrides to
take control of a particular CVO-managed resource.

This joins a series of similar previous tightenings, including
02bb9ba (lib/resourcemerge/core: Clear env and envFrom if unset in
manifest, 2021-04-20, openshift#549) and ca299b8 (lib/resourcemerge: remove
ports which are no longer required, 2020-02-13, openshift#322).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2002834
[2]: openshift/cluster-autoscaler-operator@f08589d#diff-547486373183980619528df695869ed32b80c18383bc16b57a5ee931bf0edd39L89
[3]: openshift/cluster-autoscaler-operator@9a7b3be#diff-d0cf785e044c611986a4d9bdd65bb373c86f9eb1c97bd3f105062184342a872dR4
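
As an illustration of the spec.overrides escape hatch mentioned in the commit message, the sketch below builds a JSON merge patch that marks one resource as unmanaged. The struct mirrors the fields of a ClusterVersion override entry (kind, group, namespace, name, unmanaged); the helper name `overridePatch` is hypothetical, and the target (a Deployment named cluster-autoscaler-operator in openshift-machine-api) is an assumption chosen to match the autoscaler case, not something stated in this pull request.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// componentOverride mirrors one entry of ClusterVersion's spec.overrides.
type componentOverride struct {
	Kind      string `json:"kind"`
	Group     string `json:"group"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	Unmanaged bool   `json:"unmanaged"`
}

type clusterVersionSpec struct {
	Overrides []componentOverride `json:"overrides"`
}

type clusterVersionPatch struct {
	Spec clusterVersionSpec `json:"spec"`
}

// overridePatch (hypothetical helper) renders a JSON merge patch that sets
// spec.overrides to exactly the given entries.
func overridePatch(overrides ...componentOverride) (string, error) {
	data, err := json.Marshal(clusterVersionPatch{
		Spec: clusterVersionSpec{Overrides: overrides},
	})
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	patch, err := overridePatch(componentOverride{
		Kind:      "Deployment",
		Group:     "apps",
		Namespace: "openshift-machine-api",       // assumed namespace
		Name:      "cluster-autoscaler-operator", // assumed resource name
		Unmanaged: true,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(patch)
}
```

Because this is a merge patch, it replaces the whole overrides list, so any existing entries would need to be included as well. Overrides are meant as a temporary measure; while one is set the CVO stops reconciling that resource.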
@wking wking force-pushed the remove-volumes-not-in-manifest branch from b5cba93 to 83faa6e Compare September 14, 2021 23:53
@vrutkovs vrutkovs left a comment

/lgtm

@vrutkovs

/retest

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 15, 2021
@openshift-ci
Contributor

openshift-ci bot commented Sep 15, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrutkovs, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Member Author

wking commented Sep 15, 2021

/cherrypick release-4.9

@openshift-cherrypick-robot

@wking: once the present PR merges, I will cherry-pick it on top of release-4.9 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.9


@sdodson
Member

sdodson commented Sep 15, 2021

/test e2e-agnostic-upgrade

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@wking
Member Author

wking commented Sep 15, 2021

Sandbox issue is orthogonal.

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Sep 15, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

In response to this:

Sandbox issue is orthogonal.

/override ci/prow/e2e-agnostic-upgrade


@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@wking
Member Author

wking commented Sep 16, 2021

Sandbox issue is still orthogonal.

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Sep 16, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

In response to this:

Sandbox issue is still orthogonal.

/override ci/prow/e2e-agnostic-upgrade


@wking
Member Author

wking commented Sep 16, 2021

I dunno what this TestIntegrationCVO_initializeAndHandleError flake is about, but that's orthogonal too, because we have no e2e-operator exposure to pod manifests changing volume settings.

/override ci/prow/e2e-agnostic-operator

@openshift-ci
Contributor

openshift-ci bot commented Sep 16, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-operator

In response to this:

I dunno what this TestIntegrationCVO_initializeAndHandleError flake is about, but that's orthogonal too, because we have no e2e-operator exposure to pod manifests changing volume settings.

/override ci/prow/e2e-agnostic-operator


@openshift-merge-robot openshift-merge-robot merged commit 09cddce into openshift:master Sep 16, 2021
@openshift-ci
Contributor

openshift-ci bot commented Sep 16, 2021

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 2002834 has been moved to the MODIFIED state.

In response to this:

Bug 2002834: lib/resourcemerge/core: Remove unrecognized volumes and mounts


@openshift-cherrypick-robot

@wking: new pull request created: #657

In response to this:

/cherrypick release-4.9


@wking wking deleted the remove-volumes-not-in-manifest branch September 16, 2021 01:29
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 20, 2021
…-api-access

This content is injected by an admission webhook [1,2].  When we
started removing not-in-manifest volumes in 83faa6e
(lib/resourcemerge/core: Remove unrecognized volumes and mounts,
2021-09-14, openshift#654), the cluster-version operator started removing the
webhook-injected volume, leading to the cluster-version operator
crash-looping on updates from 4.8 to 4.9 with messages like [3]:

  F0920 13:23:23.565439       1 start.go:24] error: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

With this commit, we follow the precedent of the Kubernetes API
server's own manifest [4,5].

[1]: https://github.com/kubernetes/kubernetes/blob/2f68346fbb6246961ce0a3176418630950aea500/plugin/pkg/admission/serviceaccount/admission.go#L53-L54
[2]: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=2005581
[4]: openshift/cluster-kube-apiserver-operator#1142
[5]: https://bugzilla.redhat.com/show_bug.cgi?id=1946479
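
To make the failure mode concrete, here is a minimal sketch of how a volume that appears only in-cluster ends up on the removal list once the merge is strict. The helper `prunedVolumes` and both volume names are hypothetical; the `kube-api-access-` prefix with a random suffix is an assumption based on how Kubernetes 1.21 and later name the injected bound service account token volume.

```go
package main

import (
	"fmt"
	"strings"
)

// prunedVolumes (hypothetical helper) returns the in-cluster volume names
// that are not declared in the manifest, i.e. the ones a strict merge removes.
func prunedVolumes(inCluster, manifest []string) []string {
	declared := make(map[string]bool, len(manifest))
	for _, name := range manifest {
		declared[name] = true
	}
	var pruned []string
	for _, name := range inCluster {
		if !declared[name] {
			pruned = append(pruned, name)
		}
	}
	return pruned
}

func main() {
	manifest := []string{"serving-cert"} // hypothetical manifest volume
	inCluster := []string{"serving-cert", "kube-api-access-abcde"}

	for _, name := range prunedVolumes(inCluster, manifest) {
		injected := strings.HasPrefix(name, "kube-api-access-")
		fmt.Printf("would remove %q (looks injected: %v)\n", name, injected)
	}
}
```

Anything the manifest does not declare is treated as stale, regardless of who added it, which is why content injected at admission time needs to be accounted for explicitly.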
wking added a commit to wking/cluster-version-operator that referenced this pull request Dec 3, 2021
We had been merging by name since ensureVolumeMounts landed in
83faa6e (lib/resourcemerge/core: Remove unrecognized volumes and
mounts, 2021-09-14, openshift#654).  But as pointed out in [1], a single volume
may be mounted at multiple paths.  With this commit, I'm pivoting to
merge by mountPath, which is the patchMergeKey [2] (and it makes sense
that you wouldn't have multiple volumes mounted at the same path).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2026560
[2]: https://github.com/kubernetes/api/blob/1d6faf224f146dd002553f55cd9fcaaaa0dc00cb/core/v1/types.go#L2367
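
A small sketch of why the merge key matters: when one volume is mounted at two paths, indexing by name collapses the two mounts into one entry, while indexing by mountPath keeps both. The `VolumeMount` type here is a trimmed stand-in for the Kubernetes type, and the helper names, the `certs` volume, and both paths are hypothetical.

```go
package main

import "fmt"

// VolumeMount is a trimmed stand-in for the Kubernetes volume mount type.
type VolumeMount struct {
	Name      string
	MountPath string
}

// indexMounts (hypothetical helper) indexes mounts by the given key; later
// entries overwrite earlier ones that share a key.
func indexMounts(mounts []VolumeMount, key func(VolumeMount) string) map[string]VolumeMount {
	out := make(map[string]VolumeMount, len(mounts))
	for _, m := range mounts {
		out[key(m)] = m
	}
	return out
}

func byName(m VolumeMount) string      { return m.Name }
func byMountPath(m VolumeMount) string { return m.MountPath }

func main() {
	// One volume mounted at two paths.
	required := []VolumeMount{
		{Name: "certs", MountPath: "/etc/tls"},
		{Name: "certs", MountPath: "/etc/pki/tls"},
	}

	fmt.Println(len(indexMounts(required, byName)))      // 1: one mount is lost
	fmt.Println(len(indexMounts(required, byMountPath))) // 2: both mounts survive
}
```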
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Dec 14, 2021
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

6 participants