Skip to content

Conversation

@krancour
Copy link
Member

@krancour krancour commented Jul 23, 2025

Fixes #4673 among other problems.

Per #4673, the post-op "cool down" period before App health status is deemed reliable was not being enforced in the (very common) case where desired revisions are not specified by the health check (because they were not specified by the argocd-update step that created the health check).

Apart from fixing that, this PR also fixes some other problems I've noted while working on this section of code:

  1. We probably shouldn't count App health status as reliable if the current/last operation is in-progress or has any status other than Succeeded.

  2. Pausing for the cool down period to elapse before deeming App health reliable is inadequate if we don't then repeat all checks we'd previously cleared to determine what new health problems may have arisen during the cool down period. New error conditions could have been introduced. An entirely new operation may have started.

The changes in this PR amount to the following:

  1. If the current/last operation is in any state other than Succeeded, App health is immaterial and Stage health is automatically Unknown.

  2. If last operation Succeeded, but fewer than 10 seconds ago, App health is immaterial and Stage health is automatically Unknown.

  3. No other aspect of App health is examined until/unless the last operation Succeeded more than 10 seconds prior.

  4. "Cool down," no longer occurs synchronously within the health check logic.

  5. The (regular) Stage reconciler now treats Unknown Stage heath as an error condition, prompting the Stage to be requeued for reconciliation while observing a progressive backoff. This allows "cool down" to happen naturally and without blocking other pending Stage reconciliations.

@krancour krancour added this to the v1.6.2 milestone Jul 23, 2025
@krancour krancour self-assigned this Jul 23, 2025
@krancour krancour added the kind/bug Something isn't working as intended; If unsure that something IS a bug, start a discussion instead label Jul 23, 2025
@krancour krancour requested a review from a team as a code owner July 23, 2025 01:41
@krancour krancour added priority/normal This is the priority for most work area/controller Affects the (main) controller backport/release-1.6 PRs with this label will automatically be back-ported to the release-1.6 branch labels Jul 23, 2025
@netlify
Copy link

netlify bot commented Jul 23, 2025

Deploy Preview for docs-kargo-io ready!

Name Link
🔨 Latest commit 2c96746
🔍 Latest deploy log https://app.netlify.com/projects/docs-kargo-io/deploys/688131f715e2c100087c1c3a
😎 Deploy Preview https://deploy-preview-4674.docs.kargo.io
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@krancour krancour force-pushed the krancour/app-health branch from ad80250 to 1a9cdde Compare July 23, 2025 01:43
@krancour krancour requested a review from hiddeco July 23, 2025 01:45
@codecov
Copy link

codecov bot commented Jul 23, 2025

Codecov Report

❌ Patch coverage is 84.21053% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.38%. Comparing base (045305d) to head (2c96746).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
internal/controller/stages/regular_stages.go 25.00% 5 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4674      +/-   ##
==========================================
+ Coverage   53.35%   53.38%   +0.03%     
==========================================
  Files         388      388              
  Lines       32711    32718       +7     
==========================================
+ Hits        17454    17468      +14     
+ Misses      14380    14373       -7     
  Partials      877      877              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@krancour krancour force-pushed the krancour/app-health branch from 1a9cdde to db85ed8 Compare July 23, 2025 03:37
@krancour krancour force-pushed the krancour/app-health branch from db85ed8 to 2c96746 Compare July 23, 2025 19:03
@krancour krancour changed the title fix(controller): apply app health cool down when desired revisions are unknown fix(controller): health assessment fixes Jul 23, 2025
@krancour krancour added this pull request to the merge queue Jul 25, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 25, 2025
@krancour krancour added this pull request to the merge queue Jul 25, 2025
Merged via the queue into akuity:main with commit bc6de16 Jul 25, 2025
20 checks passed
@krancour krancour deleted the krancour/app-health branch July 25, 2025 21:30
@akuitybot
Copy link

Backport failed for release-1.6, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release-1.6
git worktree add -d .worktree/backport-4674-to-release-1.6 origin/release-1.6
cd .worktree/backport-4674-to-release-1.6
git switch --create backport-4674-to-release-1.6
git cherry-pick -x bc6de1652fda0a30aacc505dbcc8781bb952531e

@krancour krancour removed the backport/release-1.6 PRs with this label will automatically be back-ported to the release-1.6 branch label Jul 25, 2025
@krancour krancour modified the milestones: v1.6.2, v1.7.0 Jul 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/controller Affects the (main) controller kind/bug Something isn't working as intended; If unsure that something IS a bug, start a discussion instead priority/normal This is the priority for most work

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: app health "cool down" period not enforced when desire revision(s) are not specified

3 participants