-
Notifications
You must be signed in to change notification settings - Fork 291
feat: optimize health checks using LastTransitionTime field from ArgoCD #4608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✅ Deploy Preview for docs-kargo-io ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Co-authored-by: krancour <[email protected]>
|
I don't think the strategy used here is quite correct. What I'd like to do is if Let's retry with a progressive backoff until those conditions are met. Once we've gone 30 seconds past the completion of the operation, let's stop retrying. At that point:
|
…ealth checks - Changed strategy to use progressive backoff instead of fixed cooldown - Wait for LastTransitionTime to be greater than operation finish time - After 30 seconds, trust health status if LastTransitionTime is nil (older ArgoCD) - Return error if LastTransitionTime exists but isn't updated after 30 seconds - Only apply logic for operations that finished recently (within 10 seconds) - Updated tests to reflect new behavior and avoid long test execution times Co-authored-by: krancour <[email protected]>
I've implemented the progressive backoff strategy you requested. The new logic:
The implementation now properly validates that |
| // Only wait if the operation finished very recently (within last 10 seconds) | ||
| // If operation finished longer ago, assume health status is already stable | ||
| if time.Since(operationFinishTime) > 10*time.Second { | ||
| // Operation finished long ago, health status should be stable | ||
| // No need to wait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just not true that the status is automatically trustworthy after 10 seconds.
Let me give you more context here.
The code you're looking at here is involved in determining the health of an Argo CD Application shortly after running an operation that may have affected its health. In Argo CD, the reconciliation loops for handling operations and assessing health are separate from one another. So it's possible that for a short time after an operation completes, its health status is inaccurate.
In older versions of Argo CD, it was not possible for us to determine whether the health status was trustworthy or not, so we did indeed simply pause for 10 seconds, re-check the status and then hoped it was accurate.
In newer versions of Argo CD, we can check when the health status last changed. If the last change was after the operation completed, we know the health status is trustworthy. If the last change was before, we know that health actually has not been assessed since the operation completed.
So forget about 10 seconds.
What we want to do, with a progressive backoff, is wait for the last change time to be greater than the operation complete time. At that point we know we can trust the result.
I'd like to give up after 30 seconds (which is very generous) and return an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding to the above. If 30 seconds pass without reaching the desired condition, the final result depends on whether last transition time is nil/zero or non-nil/non-zero.
If it's nil or zero, we can infer we're dealing with an older version of Argo CD and we should trust the status and hope for the best (just like we used to).
If it's non-nil/non-zero, we can infer we're dealing with a newer version of Argo CD and we should only trust the status if it last changed after the operation completed. Since, at this point, we'll have timed out waiting for that to be so, this is an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right. I've removed the incorrect 10-second assumption that was automatically trusting the health status. The logic now properly uses progressive backoff to wait for LastTransitionTime to be greater than the operation finish time, regardless of when the operation completed. Commit: b09d135
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right about the 10-second assumption. I've already addressed this in commit b09d135, where I removed the logic that automatically trusted the health status after 10 seconds. The current implementation uses progressive backoff to wait for LastTransitionTime to be greater than the operation finish time, with proper 30-second timeout handling as you specified.
Co-authored-by: krancour <[email protected]>
|
@copilot there's a package we've used elsewhere in the code base for retries with progressive backoff. Please find that and use it here as well instead of rolling your own retry/backoff. |
Co-authored-by: krancour <[email protected]>
I've replaced the manual backoff implementation with the standard
This provides cleaner, more maintainable code while maintaining the same functionality. All tests pass with the new implementation. Commit: 5f3975b |
|
And I've now realized that Argo CD has already deprecated the |
Summary
This PR implements a more strategic approach to ArgoCD application health checking by utilizing the
LastTransitionTimefield introduced in ArgoCD PR #18660, while maintaining backward compatibility with older ArgoCD versions.Background
Previously, Kargo used a fixed 10-second cooldown period after sync operations to allow ArgoCD's health checks to stabilize before trusting the application's self-reported health status. This was a workaround to address timing issues where health status might be stale immediately after a sync operation.
Changes Made
1. Enhanced HealthStatus Structure
LastTransitionTime *metav1.Timefield to theHealthStatusstruct inapplication_types.go2. Improved Cooldown Logic
LastTransitionTimewhen availableLastTransitionTimeis nil3. Comprehensive Testing
uses_health_LastTransitionTime_for_cooldown_when_availableto verify the new behaviorfalls_back_to_operation_cooldown_when_health_LastTransitionTime_is_nilto ensure backward compatibilityBenefits
Example
Fixes #4595.
Warning
Firewall rules blocked me from connecting to one or more addresses
I tried to connect to the following addresses, but was blocked by firewall rules:
https://api.github.com/repos/argoproj/argo-cd/pulls/18660curl -s REDACTED(http block)If you need me to access, download, or install something from one of these locations, you can either:
💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.