fix(controller): health assessment fixes #4674
Codecov Report
Coverage diff, main vs. #4674:
            main     #4674    +/-
Coverage    53.35%   53.38%   +0.03%
Files       388      388
Lines       32711    32718    +7
Hits        17454    17468    +14
Misses      14380    14373    -7
Partials    877      877
Signed-off-by: Kent Rancourt <[email protected]>
Backport failed. Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin release-1.6
git worktree add -d .worktree/backport-4674-to-release-1.6 origin/release-1.6
cd .worktree/backport-4674-to-release-1.6
git switch --create backport-4674-to-release-1.6
git cherry-pick -x bc6de1652fda0a30aacc505dbcc8781bb952531e
Fixes #4673 among other problems.
Per #4673, the post-op "cool down" period before App health status is deemed reliable was not being enforced in the (very common) case where desired revisions are not specified by the health check (because they were not specified by the argocd-update step that created the health check).
Apart from fixing that, this PR also fixes some other problems I've noted while working on this section of code:
We probably shouldn't count App health status as reliable if the current/last operation is in-progress or has any status other than Succeeded.
Pausing for the cool down period to elapse before deeming App health reliable is inadequate if we don't then repeat all checks we'd previously cleared to determine what new health problems may have arisen during the cool down period. New error conditions could have been introduced. An entirely new operation may have started.
The changes in this PR amount to the following:
If the current/last operation is in any state other than Succeeded, App health is immaterial and Stage health is automatically Unknown.
If the last operation Succeeded, but fewer than 10 seconds ago, App health is immaterial and Stage health is automatically Unknown.
No other aspect of App health is examined until/unless the last operation Succeeded more than 10 seconds prior.
"Cool down" no longer occurs synchronously within the health check logic.
The (regular) Stage reconciler now treats Unknown Stage health as an error condition, prompting the Stage to be requeued for reconciliation with progressive backoff. This allows "cool down" to happen naturally and without blocking other pending Stage reconciliations.