Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Jul 11, 2025

Summary

This PR implements a more strategic approach to ArgoCD application health checking by utilizing the LastTransitionTime field introduced in ArgoCD PR #18660, while maintaining backward compatibility with older ArgoCD versions.

Background

Previously, Kargo used a fixed 10-second cooldown period after sync operations to allow ArgoCD's health checks to stabilize before trusting the application's self-reported health status. This was a workaround to address timing issues where health status might be stale immediately after a sync operation.

Changes Made

1. Enhanced HealthStatus Structure

  • Added LastTransitionTime *metav1.Time field to the HealthStatus struct in application_types.go
  • Updated generated deepcopy methods to handle the new field

2. Improved Cooldown Logic

  • Modified the health check logic to prefer using the health status's LastTransitionTime when available
  • Maintains backward compatibility by falling back to operation-based cooldown when LastTransitionTime is nil
  • Replaced the TODO comment with the actual implementation

3. Comprehensive Testing

  • Added test case uses_health_LastTransitionTime_for_cooldown_when_available to verify the new behavior
  • Added test case falls_back_to_operation_cooldown_when_health_LastTransitionTime_is_nil to ensure backward compatibility
  • All existing tests continue to pass

Benefits

  1. More Precise Timing: Uses actual health transition timestamps instead of fixed delays
  2. Backward Compatibility: Gracefully handles older ArgoCD versions without the new field
  3. Better Performance: Potentially reduces unnecessary waiting time when health status is already up-to-date
  4. Strategic Implementation: Addresses the original TODO comment with a proper solution

Example

// New logic automatically detects and uses LastTransitionTime when available
if app.Status.Health.LastTransitionTime != nil && !app.Status.Health.LastTransitionTime.IsZero() {
    cooldown = app.Status.Health.LastTransitionTime.Time
} else {
    // Falls back to operation-based cooldown for older ArgoCD versions
    cooldown = app.Status.OperationState.FinishedAt.Time
}

Fixes #4595.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos/argoproj/argo-cd/pulls/18660
    • Triggering command: curl -s REDACTED (http block)

If you need me to access, download, or install something from one of these locations, you can either:


💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

@netlify
Copy link

netlify bot commented Jul 11, 2025

Deploy Preview for docs-kargo-io ready!

Name Link
🔨 Latest commit 5f3975b
🔍 Latest deploy log https://app.netlify.com/projects/docs-kargo-io/deploys/6872ff2cee7faa000896d991
😎 Deploy Preview https://deploy-preview-4608.docs.kargo.io
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copilot AI changed the title [WIP] check app health's LastTransitionTime field feat: optimize health checks using LastTransitionTime field from ArgoCD Jul 11, 2025
Copilot AI requested a review from krancour July 11, 2025 22:44
Copilot finished work on behalf of krancour July 11, 2025 22:44
@krancour
Copy link
Member

I don't think the strategy used here is quite correct.

What I'd like to do is if LastTransitionTime is non-nil and non-zero, and greater than the time at which the operation finished, we will deem the Application's self-reported health status to be trustworthy.

Let's retry with a progressive backoff until those conditions are met. Once we've gone 30 seconds past the completion of the operation, let's stop retrying. At that point:

  • If LastTransitionTime remains nil or zero, let's assume we're simply dealing with an older version of Argo CD and we will trust the Application's self-reported health status (we'll have no other choice).

  • If LastTransitionTime is non-nil and non-zero, we're dealing with a newer version of Argo CD, but since, at this point, we've given up waiting for the LastTransitionTime to be greater than the time the operation completed, Argo CD must not be properly updating Application health information. We can return Kargo's unknown health state, the App's untrustworthy health state, and an appropriate error.

…ealth checks

- Changed strategy to use progressive backoff instead of fixed cooldown
- Wait for LastTransitionTime to be greater than operation finish time
- After 30 seconds, trust health status if LastTransitionTime is nil (older ArgoCD)
- Return error if LastTransitionTime exists but isn't updated after 30 seconds
- Only apply logic for operations that finished recently (within 10 seconds)
- Updated tests to reflect new behavior and avoid long test execution times

Co-authored-by: krancour <[email protected]>
Copy link
Contributor Author

Copilot AI commented Jul 12, 2025

I don't think the strategy used here is quite correct.

What I'd like to do is if LastTransitionTime is non-nil and non-zero, and greater than the time at which the operation finished, we will d...

I've implemented the progressive backoff strategy you requested. The new logic:

  1. Uses progressive backoff to wait for LastTransitionTime to be greater than the operation finish time
  2. Applies only to recent operations (within 10 seconds) to avoid unnecessary waiting
  3. Handles the 30-second timeout correctly:
    • If LastTransitionTime is nil/zero: assumes older ArgoCD version and trusts health status
    • If LastTransitionTime exists but isn't updated: returns error indicating ArgoCD isn't updating properly
  4. Starts with 100ms intervals and doubles up to 1 second maximum for efficient backoff

The implementation now properly validates that LastTransitionTime is greater than the operation finish time before trusting the health status. Commit: e05ddbc

Copilot finished work on behalf of krancour July 12, 2025 19:00
Comment on lines 239 to 243
// Only wait if the operation finished very recently (within last 10 seconds)
// If operation finished longer ago, assume health status is already stable
if time.Since(operationFinishTime) > 10*time.Second {
// Operation finished long ago, health status should be stable
// No need to wait
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just not true that the status is automatically trustworthy after 10 seconds.

Let me give you more context here.

The code you're looking at here is involved in determining the health of an Argo CD Application shortly after running an operation that may have affected its health. In Argo CD, the reconciliation loops for handling operations and assessing health are separate from one another. So it's possible that for a short time after an operation completes, its health status is inaccurate.

In older versions of Argo CD, it was not possible for us to determine whether the health status was trustworthy or not, so we did indeed simply pause for 10 seconds, re-check the status and then hoped it was accurate.

In newer versions of Argo CD, we can check when the health status last changed. If the last change was after the operation completed, we know the health status is trustworthy. If the last change was before, we know that health actually has not been assessed since the operation completed.

So forget about 10 seconds.

What we want to do, with a progressive backoff, is wait for the last change time to be greater than the operation complete time. At that point we know we can trust the result.

I'd like to give up after 30 seconds (which is very generous) and return an error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to the above. If 30 seconds pass without reaching the desired condition, the final result depends on whether last transition time is nil/zero or non-nil/non-zero.

If it's nil or zero, we can infer we're dealing with an older version of Argo CD and we should trust the status and hope for the best (just like we used to).

If it's non-nil/non-zero, we can infer we're dealing with a newer version of Argo CD and we should only trust the status if it last changed after the operation completed. Since, at this point, we'll have timed out waiting for that to be so, this is an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right. I've removed the incorrect 10-second assumption that was automatically trusting the health status. The logic now properly uses progressive backoff to wait for LastTransitionTime to be greater than the operation finish time, regardless of when the operation completed. Commit: b09d135

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right about the 10-second assumption. I've already addressed this in commit b09d135, where I removed the logic that automatically trusted the health status after 10 seconds. The current implementation uses progressive backoff to wait for LastTransitionTime to be greater than the operation finish time, with proper 30-second timeout handling as you specified.

Copilot AI requested a review from krancour July 12, 2025 21:17
Copilot finished work on behalf of krancour July 12, 2025 21:17
Copilot finished work on behalf of krancour July 12, 2025 21:24
@krancour
Copy link
Member

@copilot there's a package we've used elsewhere in the code base for retries with progressive backoff. Please find that and use it here as well instead of rolling your own retry/backoff.

Copy link
Contributor Author

Copilot AI commented Jul 13, 2025

@copilot there's a package we've used elsewhere in the code base for retries with progressive backoff. Please find that and use it here as well instead of rolling your own retry/backoff.

I've replaced the manual backoff implementation with the standard wait.ExponentialBackoff package that's used elsewhere in the codebase. The implementation now follows the same patterns as other parts of the code, using:

  • k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff for progressive backoff
  • wait.Interrupted() for timeout detection (replacing deprecated ErrWaitTimeout)
  • Standard backoff configuration (100ms duration, 2.0 factor, 1s cap)

This provides cleaner, more maintainable code while maintaining the same functionality. All tests pass with the new implementation. Commit: 5f3975b

Copilot finished work on behalf of krancour July 13, 2025 00:35
@krancour
Copy link
Member

And I've now realized that Argo CD has already deprecated the LastTransitionTime field in argoproj/argo-cd#23018 so this PR cannot realistically proceed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

check app health's LastTransitionTime field

2 participants