fix(controller): health assessment fixes #4674

krancour · 2025-07-23T01:41:04Z

Fixes #4673 among other problems.

Per #4673, the post-op "cool down" period before App health status is deemed reliable was not being enforced in the (very common) case where desired revisions are not specified by the health check (because they were not specified by the argocd-update step that created the health check).

Apart from fixing that, this PR also fixes some other problems I've noted while working on this section of code:

We probably shouldn't count App health status as reliable if the current/last operation is in-progress or has any status other than Succeeded.
Pausing for the cool down period to elapse before deeming App health reliable is inadequate if we don't then repeat all checks we'd previously cleared to determine what new health problems may have arisen during the cool down period. New error conditions could have been introduced. An entirely new operation may have started.

The changes in this PR amount to the following:

If the current/last operation is in any state other than Succeeded, App health is immaterial and Stage health is automatically Unknown.
If the last operation Succeeded, but fewer than 10 seconds ago, App health is immaterial and Stage health is automatically Unknown.
No other aspect of App health is examined until/unless the last operation Succeeded more than 10 seconds prior.
"Cool down" no longer occurs synchronously within the health check logic.
The (regular) Stage reconciler now treats Unknown Stage health as an error condition, prompting the Stage to be requeued for reconciliation while observing a progressive backoff. This allows "cool down" to happen naturally and without blocking other pending Stage reconciliations.

netlify · 2025-07-23T01:41:13Z

✅ Deploy Preview for docs-kargo-io ready!

Name	Link
🔨 Latest commit	`2c96746`
🔍 Latest deploy log	https://app.netlify.com/projects/docs-kargo-io/deploys/688131f715e2c100087c1c3a
😎 Deploy Preview	https://deploy-preview-4674.docs.kargo.io

internal/health/checker/builtin/argocd_test.go

codecov · 2025-07-23T01:48:33Z

Codecov Report

❌ Patch coverage is 84.21053% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.38%. Comparing base (045305d) to head (2c96746).
⚠️ Report is 6 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/controller/stages/regular_stages.go	25.00%	5 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4674      +/-   ##
==========================================
+ Coverage   53.35%   53.38%   +0.03%     
==========================================
  Files         388      388              
  Lines       32711    32718       +7     
==========================================
+ Hits        17454    17468      +14     
+ Misses      14380    14373       -7     
  Partials      877      877

internal/health/checker/builtin/argocd.go

Signed-off-by: Kent Rancourt <[email protected]>

akuitybot · 2025-07-25T21:30:28Z

Backport failed for release-1.6, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release-1.6
git worktree add -d .worktree/backport-4674-to-release-1.6 origin/release-1.6
cd .worktree/backport-4674-to-release-1.6
git switch --create backport-4674-to-release-1.6
git cherry-pick -x bc6de1652fda0a30aacc505dbcc8781bb952531e

krancour added this to the v1.6.2 milestone Jul 23, 2025

krancour self-assigned this Jul 23, 2025

krancour added the kind/bug label Jul 23, 2025

krancour requested a review from a team as a code owner July 23, 2025 01:41

krancour added the priority/normal, area/controller, and backport/release-1.6 labels Jul 23, 2025

krancour force-pushed the krancour/app-health branch from ad80250 to 1a9cdde Compare July 23, 2025 01:43

krancour commented Jul 23, 2025

internal/health/checker/builtin/argocd_test.go (outdated, resolved)

krancour requested a review from hiddeco July 23, 2025 01:45

krancour force-pushed the krancour/app-health branch from 1a9cdde to db85ed8 Compare July 23, 2025 03:37

hiddeco reviewed Jul 23, 2025

internal/health/checker/builtin/argocd.go (outdated, resolved)

krancour added 2 commits July 23, 2025 14:05

apply app health cool down when desired revisions are unknown

8dfb772

Signed-off-by: Kent Rancourt <[email protected]>

revise approach to trusting/not trusting app health

2c96746

Signed-off-by: Kent Rancourt <[email protected]>

krancour force-pushed the krancour/app-health branch from db85ed8 to 2c96746 Compare July 23, 2025 19:03

krancour changed the title ~~fix(controller): apply app health cool down when desired revisions are unknown~~ fix(controller): health assessment fixes Jul 23, 2025

hiddeco approved these changes Jul 24, 2025

krancour added this pull request to the merge queue Jul 25, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 25, 2025

krancour added this pull request to the merge queue Jul 25, 2025

Merged via the queue into akuity:main with commit bc6de16 Jul 25, 2025
20 checks passed

krancour deleted the krancour/app-health branch July 25, 2025 21:30

krancour removed the backport/release-1.6 label Jul 25, 2025

krancour modified the milestones: v1.6.2, v1.7.0 Jul 25, 2025

krancour mentioned this pull request Aug 16, 2025

fix(controller): be more intentional about treating unknown health status like an error #4871

Merged

krancour mentioned this pull request Sep 12, 2025

Should Stage health evaluated be a ERROR in the kargo-controller? #5045

Closed

krancour mentioned this pull request Sep 24, 2025

fix: return nil err when there are no healthchecks #5093

Closed
