Skip to content

Conversation

@matheuscscp
Copy link
Contributor

@matheuscscp matheuscscp commented Oct 29, 2025

Closes: #31382

The behavior expected for a release with a custom wait context that gets done during a wait operation is expected to be the same as the behavior for a timeout (which is the default behavior).

@pull-request-size pull-request-size bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 29, 2025
@matheuscscp matheuscscp force-pushed the cancel-health-checks branch 2 times, most recently from d471fea to 27f6766 Compare October 29, 2025 16:58
@pull-request-size pull-request-size bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 29, 2025
@matheuscscp
Copy link
Contributor Author

@joejulian I think you mentioned in the dev meeting that some changes are required? Can you please clarify? 🙏 Thanks!

@pull-request-size pull-request-size bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 31, 2025
@matheuscscp matheuscscp marked this pull request as ready for review October 31, 2025 21:08
@matheuscscp
Copy link
Contributor Author

@joejulian @gjenkins8 @mattfarina I added tests, please review and let me know if anything else is missing 🙏

@matheuscscp
Copy link
Contributor Author

@gjenkins8 Fixed both comments, thanks! PTAL one more time 🙏

@matheuscscp
Copy link
Contributor Author

@joejulian I was reflecting about what you said in the last dev meeting and I think I understood what you meant. Upon a wait timeout, the release is left in the pending state i.e. pending-install, pending-upgrade or pending-rollback.

Flux has dealt with this a long time ago: we detect this pending state and "unlock" the release to keep moving forward. A wait timeout is a very common issue and Flux couldn't leave this unattended for so long. Here's how we do it:

Detection:

  1. https://github.com/fluxcd/helm-controller/blob/96f1d9e1e28c3228d27524a5c28457df412c12e3/internal/reconcile/state.go#L89-L105
  2. https://github.com/fluxcd/helm-controller/blob/96f1d9e1e28c3228d27524a5c28457df412c12e3/internal/reconcile/atomic_release.go#L375-L377

Resolution: https://github.com/fluxcd/helm-controller/blob/96f1d9e1e28c3228d27524a5c28457df412c12e3/internal/reconcile/unlock.go#L55-L91

So, as I said before, from the perspective of us SDK users, the feature being introduced in this PR is merely "a faster/smarter/customized way for producing timeouts on wait operations". It's beyond the scope of this PR dealing with timeouts.

I think with this we can lift the embargo here, what do you think? @joejulian @gjenkins8 @mattfarina @scottrigby @stealthybox @stefanprodan

Furthermore, I have backported the legacy waiter changes here to a fork from v3.19.0 and built an e2e PoC for the Flux feature, which is working as expected. I created a Flux HelmRelease with a bogus image tag, causing Helm to get stuck in a wait operation with timeout set to 3m due to the pods in ErrImagePull. When the Flux HelmRelease .spec.values.image.tag is updated to a valid tag, I immediately see this Kubernetes event emitted by helm-controller:

- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2025-11-01T12:13:44Z"
  involvedObject:
    apiVersion: helm.toolkit.fluxcd.io/v2
    kind: HelmRelease
    name: podinfo
    namespace: frontend
    resourceVersion: "92444307"
    uid: 29c2a513-5662-42b6-afe4-ed157a5c60f2
  kind: Event
  lastTimestamp: "2025-11-01T12:13:44Z"
  message: |-
    Helm upgrade failed for release frontend/podinfo with chart [email protected]+971fef0d04d5: New reconciliation triggered by HelmRelease/frontend/podinfo

    Last Helm logs:

    2025-11-01T12:12:50.46219424Z: Looks like there are no changes for Service "podinfo"
    2025-11-01T12:12:50.477013785Z: Patch Deployment "podinfo" in namespace frontend
    2025-11-01T12:12:50.494020691Z: Looks like there are no changes for Ingress "podinfo"
    2025-11-01T12:12:50.508865623Z: Patch ServiceMonitor "podinfo" in namespace frontend
    2025-11-01T12:12:50.526101352Z: waiting for release podinfo resources (created: 0 updated: 4  deleted: 0)
    2025-11-01T12:12:50.526105791Z: beginning wait for 4 resources with timeout of 3m0s
    2025-11-01T12:12:50.550119838Z: Deployment is not ready: frontend/podinfo. observedGeneration (22) does not match spec generation (23).
    2025-11-01T12:12:52.547186135Z: Deployment is not ready: frontend/podinfo. 0 out of 1 expected pods are ready
    2025-11-01T12:13:42.540396144Z: Deployment is not ready: frontend/podinfo. 0 out of 1 expected pods are ready (24 duplicate lines omitted)
    2025-11-01T12:13:44.349242962Z: wait for resources failed after 54s: context canceled
    2025-11-01T12:13:44.367689523Z: warning: Upgrade "podinfo" failed: context canceled
  metadata:
    annotations:
      helm.toolkit.fluxcd.io/app-version: 6.9.2
      helm.toolkit.fluxcd.io/oci-digest: sha256:971fef0d04d5b3d03d035701dad59411ea0f60e28d16190f02469ddfe5587588
      helm.toolkit.fluxcd.io/revision: 6.9.2+971fef0d04d5
      helm.toolkit.fluxcd.io/token: sha256:1bb60c0d9974502e0cbea5486437617fcec597d9c99f034796135fdabf91b062
    creationTimestamp: "2025-11-01T12:13:44Z"
    name: podinfo.1873e0f58808d7bd
    namespace: frontend
    resourceVersion: "92444900"
    uid: 2a338055-6054-4038-8072-44c389311d4b
  reason: HealthCheckCanceled
  reportingComponent: helm-controller
  reportingInstance: ""
  source:
    component: helm-controller
  type: Warning

And immediately after this event appears, helm-controller reconciles the Flux HelmRelease once again, this time successfully:

NAME                      READY   STATUS    RESTARTS   AGE
podinfo-5564f7698-bmtld   1/1     Running   0          4m4s
podinfo-5564f7698-wgpc8   1/1     Running   0          2d5h

NAME      AGE   READY   STATUS
podinfo   27d   True    Helm upgrade succeeded for release frontend/podinfo.v28 with chart [email protected]+971fef0d04
d5

This shows that Flux is correctly dealing with the release left on the pending state, i.e. the same code path that deals with wait timeouts is kicking in.

@joejulian
Copy link
Contributor

Sure, that's fine with me. I just wanted to make sure we had an expected outcome that can be documented.

gjenkins8
gjenkins8 previously approved these changes Nov 1, 2025
Copy link
Member

@gjenkins8 gjenkins8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes here -- introducing a context for the wait operation (that can be canceled by the caller) -- look good to me.

Additional behavior around what to do with the release once the wait has been canceled, remains/is the callers responsibility (and hopefully Helm can improve here over time too)

Authors have tested the changes above (via flux)

@gjenkins8 gjenkins8 added the Has One Approval This PR has one approval. It still needs a second approval to be merged. label Nov 1, 2025
Copy link
Member

@scottrigby scottrigby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@scottrigby scottrigby added feature and removed Has One Approval This PR has one approval. It still needs a second approval to be merged. labels Nov 3, 2025
@scottrigby scottrigby added this to the 4.0.0 milestone Nov 3, 2025
@scottrigby scottrigby merged commit 99cd196 into helm:main Nov 3, 2025
5 checks passed
@scottrigby scottrigby added the docs needed Indicates that a PR includes changes that need to be documented. Should not block a PR being merged label Nov 3, 2025
@matheuscscp matheuscscp deleted the cancel-health-checks branch November 3, 2025 18:58
Copy link
Contributor

@joejulian joejulian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already been approved and merged, but I'll still add my posthumous (not sure if that's a good word for this) approval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs needed Indicates that a PR includes changes that need to be documented. Should not block a PR being merged feature size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SDK Feature Request: Introduce a context for canceling wait operations

4 participants