feat: optimize health checks using LastTransitionTime field from ArgoCD #4608

Copilot · 2025-07-11T22:21:03Z

Summary

This PR implements a more strategic approach to ArgoCD application health checking by utilizing the LastTransitionTime field introduced in ArgoCD PR #18660, while maintaining backward compatibility with older ArgoCD versions.

Background

Previously, Kargo used a fixed 10-second cooldown period after sync operations to allow ArgoCD's health checks to stabilize before trusting the application's self-reported health status. This was a workaround to address timing issues where health status might be stale immediately after a sync operation.

Changes Made

1. Enhanced HealthStatus Structure

Added LastTransitionTime *metav1.Time field to the HealthStatus struct in application_types.go
Updated generated deepcopy methods to handle the new field

2. Improved Cooldown Logic

Modified the health check logic to prefer using the health status's LastTransitionTime when available
Maintains backward compatibility by falling back to operation-based cooldown when LastTransitionTime is nil
Replaced the TODO comment with the actual implementation

3. Comprehensive Testing

Added test case uses_health_LastTransitionTime_for_cooldown_when_available to verify the new behavior
Added test case falls_back_to_operation_cooldown_when_health_LastTransitionTime_is_nil to ensure backward compatibility
All existing tests continue to pass

Benefits

More Precise Timing: Uses actual health transition timestamps instead of fixed delays
Backward Compatibility: Gracefully handles older ArgoCD versions without the new field
Better Performance: Potentially reduces unnecessary waiting time when health status is already up-to-date
Strategic Implementation: Addresses the original TODO comment with a proper solution

Example

// New logic automatically detects and uses LastTransitionTime when available
if app.Status.Health.LastTransitionTime != nil && !app.Status.Health.LastTransitionTime.IsZero() {
    cooldown = app.Status.Health.LastTransitionTime.Time
} else {
    // Falls back to operation-based cooldown for older ArgoCD versions
    cooldown = app.Status.OperationState.FinishedAt.Time
}

Fixes #4595.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

https://api.github.com/repos/argoproj/argo-cd/pulls/18660
- Triggering command: curl -s REDACTED (http block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to my firewall allow list

💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

netlify · 2025-07-11T22:21:08Z

✅ Deploy Preview for docs-kargo-io ready!

Name	Link
🔨 Latest commit	`5f3975b`
🔍 Latest deploy log	https://app.netlify.com/projects/docs-kargo-io/deploys/6872ff2cee7faa000896d991
😎 Deploy Preview	https://deploy-preview-4608.docs.kargo.io
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Co-authored-by: krancour <[email protected]>

krancour · 2025-07-12T18:38:16Z

I don't think the strategy used here is quite correct.

What I'd like to do is if LastTransitionTime is non-nil and non-zero, and greater than the time at which the operation finished, we will deem the Application's self-reported health status to be trustworthy.

Let's retry with a progressive backoff until those conditions are met. Once we've gone 30 seconds past the completion of the operation, let's stop retrying. At that point:

If LastTransitionTime remains nil or zero, let's assume we're simply dealing with an older version of Argo CD and we will trust the Application's self-reported health status (we'll have no other choice).
If LastTransitionTime is non-nil and non-zero, we're dealing with a newer version of Argo CD, but since, at this point, we've given up waiting for the LastTransitionTime to be greater than the time the operation completed, Argo CD must not be properly updating Application health information. We can return Kargo's unknown health state, the App's untrustworthy health state, and an appropriate error.

…ealth checks - Changed strategy to use progressive backoff instead of fixed cooldown - Wait for LastTransitionTime to be greater than operation finish time - After 30 seconds, trust health status if LastTransitionTime is nil (older ArgoCD) - Return error if LastTransitionTime exists but isn't updated after 30 seconds - Only apply logic for operations that finished recently (within 10 seconds) - Updated tests to reflect new behavior and avoid long test execution times Co-authored-by: krancour <[email protected]>

Copilot · 2025-07-12T19:00:03Z

I don't think the strategy used here is quite correct.

What I'd like to do is if LastTransitionTime is non-nil and non-zero, and greater than the time at which the operation finished, we will d...

I've implemented the progressive backoff strategy you requested. The new logic:

Uses progressive backoff to wait for LastTransitionTime to be greater than the operation finish time
Applies only to recent operations (within 10 seconds) to avoid unnecessary waiting
Handles the 30-second timeout correctly:
- If LastTransitionTime is nil/zero: assumes older ArgoCD version and trusts health status
- If LastTransitionTime exists but isn't updated: returns error indicating ArgoCD isn't updating properly
Starts with 100ms intervals and doubles up to 1 second maximum for efficient backoff

The implementation now properly validates that LastTransitionTime is greater than the operation finish time before trusting the health status. Commit: e05ddbc

krancour · 2025-07-12T21:10:08Z

internal/health/checker/builtin/argocd.go

+			// Only wait if the operation finished very recently (within last 10 seconds)
+			// If operation finished longer ago, assume health status is already stable
+			if time.Since(operationFinishTime) > 10*time.Second {
+				// Operation finished long ago, health status should be stable
+				// No need to wait


This is just not true that the status is automatically trustworthy after 10 seconds.

Let me give you more context here.

The code you're looking at here is involved in determining the health of an Argo CD Application shortly after running an operation that may have affected its health. In Argo CD, the reconciliation loops for handling operations and assessing health are separate from one another. So it's possible that for a short time after an operation completes, its health status is inaccurate.

In older versions of Argo CD, it was not possible for us to determine whether the health status was trustworthy or not, so we did indeed simply pause for 10 seconds, re-check the status and then hoped it was accurate.

In newer versions of Argo CD, we can check when the health status last changed. If the last change was after the operation completed, we know the health status is trustworthy. If the last change was before, we know that health actually has not been assessed since the operation completed.

So forget about 10 seconds.

What we want to do, with a progressive backoff, is wait for the last change time to be greater than the operation complete time. At that point we know we can trust the result.

I'd like to give up after 30 seconds (which is very generous) and return an error.

Adding to the above. If 30 seconds pass without reaching the desired condition, the final result depends on whether last transition time is nil/zero or non-nil/non-zero.

If it's nil or zero, we can infer we're dealing with an older version of Argo CD and we should trust the status and hope for the best (just like we used to).

If it's non-nil/non-zero, we can infer we're dealing with a newer version of Argo CD and we should only trust the status if it last changed after the operation completed. Since, at this point, we'll have timed out waiting for that to be so, this is an error.

You're absolutely right. I've removed the incorrect 10-second assumption that was automatically trusting the health status. The logic now properly uses progressive backoff to wait for LastTransitionTime to be greater than the operation finish time, regardless of when the operation completed. Commit: b09d135

You're absolutely right about the 10-second assumption. I've already addressed this in commit b09d135, where I removed the logic that automatically trusted the health status after 10 seconds. The current implementation uses progressive backoff to wait for LastTransitionTime to be greater than the operation finish time, with proper 30-second timeout handling as you specified.

Co-authored-by: krancour <[email protected]>

krancour · 2025-07-13T00:13:49Z

@copilot there's a package we've used elsewhere in the code base for retries with progressive backoff. Please find that and use it here as well instead of rolling your own retry/backoff.

Co-authored-by: krancour <[email protected]>

Copilot · 2025-07-13T00:35:02Z

@copilot there's a package we've used elsewhere in the code base for retries with progressive backoff. Please find that and use it here as well instead of rolling your own retry/backoff.

I've replaced the manual backoff implementation with the standard wait.ExponentialBackoff package that's used elsewhere in the codebase. The implementation now follows the same patterns as other parts of the code, using:

k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff for progressive backoff
wait.Interrupted() for timeout detection (replacing deprecated ErrWaitTimeout)
Standard backoff configuration (100ms duration, 2.0 factor, 1s cap)

This provides cleaner, more maintainable code while maintaining the same functionality. All tests pass with the new implementation. Commit: 5f3975b

krancour · 2025-07-22T23:54:29Z

And I've now realized that Argo CD has already deprecated the LastTransitionTime field in argoproj/argo-cd#23018 so this PR cannot realistically proceed.

Initial plan

6d1edce

Copilot AI assigned Copilot and krancour Jul 11, 2025

Copilot started work on behalf of krancour July 11, 2025 22:21 View session

Implement LastTransitionTime-based health check optimization

18dae1a

Co-authored-by: krancour <[email protected]>

Copilot AI changed the title ~~[WIP] check app health's LastTransitionTime field~~ feat: optimize health checks using LastTransitionTime field from ArgoCD Jul 11, 2025

Copilot AI requested a review from krancour July 11, 2025 22:44

Copilot finished work on behalf of krancour July 11, 2025 22:44

Copilot started work on behalf of krancour July 12, 2025 18:38 View session

Copilot finished work on behalf of krancour July 12, 2025 19:00

krancour reviewed Jul 12, 2025

View reviewed changes

Copilot started work on behalf of krancour July 12, 2025 21:10 View session

Remove incorrect 10-second assumption from health check logic

b09d135

Co-authored-by: krancour <[email protected]>

Copilot AI requested a review from krancour July 12, 2025 21:17

Copilot finished work on behalf of krancour July 12, 2025 21:17

Copilot started work on behalf of krancour July 12, 2025 21:17 View session

Copilot finished work on behalf of krancour July 12, 2025 21:24

Copilot started work on behalf of krancour July 13, 2025 00:13 View session

Replace manual backoff with standard wait.ExponentialBackoff package

5f3975b

Co-authored-by: krancour <[email protected]>

Copilot finished work on behalf of krancour July 13, 2025 00:35

krancour closed this Jul 22, 2025

krancour mentioned this pull request Jul 23, 2025

bug: app health "cool down" period not enforced when desire revision(s) are not specified #4673

Closed

krancour deleted the copilot/fix-4595 branch October 3, 2025 03:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: optimize health checks using LastTransitionTime field from ArgoCD #4608

feat: optimize health checks using LastTransitionTime field from ArgoCD #4608

Uh oh!

Copilot AI commented Jul 11, 2025 •

edited

Loading

Uh oh!

netlify bot commented Jul 11, 2025 •

edited

Loading

Uh oh!

krancour commented Jul 12, 2025

Uh oh!

Copilot AI commented Jul 12, 2025

Uh oh!

krancour Jul 12, 2025

Uh oh!

krancour Jul 12, 2025

Uh oh!

Copilot AI Jul 12, 2025

Uh oh!

Copilot AI Jul 12, 2025

Uh oh!

krancour commented Jul 13, 2025

Uh oh!

Copilot AI commented Jul 13, 2025

Uh oh!

krancour commented Jul 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: optimize health checks using LastTransitionTime field from ArgoCD #4608

feat: optimize health checks using LastTransitionTime field from ArgoCD #4608

Uh oh!

Conversation

Copilot AI commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Background

Changes Made

1. Enhanced HealthStatus Structure

2. Improved Cooldown Logic

3. Comprehensive Testing

Benefits

Example

I tried to connect to the following addresses, but was blocked by firewall rules:

Uh oh!

netlify bot commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for docs-kargo-io ready!

Uh oh!

krancour commented Jul 12, 2025

Uh oh!

Copilot AI commented Jul 12, 2025

Uh oh!

krancour Jul 12, 2025

Choose a reason for hiding this comment

Uh oh!

krancour Jul 12, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jul 12, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jul 12, 2025

Choose a reason for hiding this comment

Uh oh!

krancour commented Jul 13, 2025

Uh oh!

Copilot AI commented Jul 13, 2025

Uh oh!

krancour commented Jul 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Jul 11, 2025 •

edited

Loading

netlify bot commented Jul 11, 2025 •

edited

Loading