@lplazas lplazas commented Oct 31, 2025

Fixes: http://github.com/argoproj/argo-cd/issues/19854

This issue becomes more apparent with any sharding algorithm that moves clusters between shards with some frequency.

Simple steps to reproduce:

1. When a cluster is moved for any reason (new cluster added, new app added, etc.), the logs show something like: Cluster https://XYZ.gr7.us-east-1.eks.amazonaws.com has changed shard from 4 to 6
2.1 handleModEvent() is triggered on shard 4:
It enters the if !c.canHandleCluster(newCluster) { branch and deletes the cluster from the local cache: https://github.com/argoproj/argo-cd/blob/master/controller/cache/cache.go#L833
2.2 handleModEvent() is triggered on shard 6:
Because shard 6 has never had newCluster.Server in its cache, ok is false, the entire body of handleModEvent is skipped, and the cluster is never added to this shard's cache.

	c.clusterSharding.Update(oldCluster, newCluster)
	c.lock.Lock()
	cluster, ok := c.clusters[newCluster.Server]
	c.lock.Unlock()
	if ok { // on shard 6 ok is false, so this block is skipped entirely and the function does nothing

With this fix, when the cluster is not in the local cache, handleModEvent checks whether it should be added. The call to getSyncedCluster populates the local cache, because it calls getCluster, which sets the entry in the cache: https://github.com/argoproj/argo-cd/blob/master/controller/cache/cache.go#L479
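To make the before/after behaviour easier to follow, here is a minimal, self-contained sketch of the scenario. It is a toy model, not the actual code in controller/cache/cache.go: Cluster, shardCache, has, handleModEventBuggy and handleModEventFixed are hypothetical names made up for illustration, the shard number is stored directly on the cluster, and the fixed handler writes to the map directly, whereas the real change goes through getSyncedCluster.

```go
package main

import (
	"fmt"
	"sync"
)

// Cluster is a hypothetical stand-in for the real cluster type.
type Cluster struct {
	Server string
	Shard  int // shard currently assigned to this cluster (assumption of this toy model)
}

// shardCache is a toy model of one controller replica's local cluster cache.
type shardCache struct {
	shard    int
	lock     sync.Mutex
	clusters map[string]Cluster
}

func newShardCache(shard int, initial ...Cluster) *shardCache {
	c := &shardCache{shard: shard, clusters: map[string]Cluster{}}
	for _, cl := range initial {
		c.clusters[cl.Server] = cl
	}
	return c
}

// canHandleCluster reports whether this replica owns the cluster's shard.
func (c *shardCache) canHandleCluster(cl Cluster) bool { return cl.Shard == c.shard }

// has reports whether the cluster is present in the local cache.
func (c *shardCache) has(server string) bool {
	c.lock.Lock()
	defer c.lock.Unlock()
	_, ok := c.clusters[server]
	return ok
}

// handleModEventBuggy models the behaviour described above:
// a cluster that is not already cached is ignored.
func (c *shardCache) handleModEventBuggy(newCluster Cluster) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if _, ok := c.clusters[newCluster.Server]; !ok {
		return // the new owner shard never learns about the cluster
	}
	if !c.canHandleCluster(newCluster) {
		delete(c.clusters, newCluster.Server)
		return
	}
	c.clusters[newCluster.Server] = newCluster
}

// handleModEventFixed models the idea of the fix:
// an uncached cluster is added when this shard should handle it.
func (c *shardCache) handleModEventFixed(newCluster Cluster) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if !c.canHandleCluster(newCluster) {
		delete(c.clusters, newCluster.Server) // no-op if the entry is absent
		return
	}
	// Owned by this shard: update the entry, or add it if it was missing.
	c.clusters[newCluster.Server] = newCluster
}

func main() {
	// The cluster has just moved from shard 4 to shard 6.
	moved := Cluster{Server: "https://example.invalid", Shard: 6}

	b4 := newShardCache(4, Cluster{Server: moved.Server, Shard: 4})
	b6 := newShardCache(6)
	b4.handleModEventBuggy(moved)
	b6.handleModEventBuggy(moved)
	fmt.Printf("buggy: shard4=%v shard6=%v\n", b4.has(moved.Server), b6.has(moved.Server))

	f4 := newShardCache(4, Cluster{Server: moved.Server, Shard: 4})
	f6 := newShardCache(6)
	f4.handleModEventFixed(moved)
	f6.handleModEventFixed(moved)
	fmt.Printf("fixed: shard4=%v shard6=%v\n", f4.has(moved.Server), f6.has(moved.Server))
}
```

In this model the buggy handler leaves the cluster cached on neither shard, while the fixed handler removes it from shard 4 and adds it to shard 6.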

I won't have time to look into adding tests for a few days, so if someone can point me to the right place to add them, it will save me a lot of time.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Title of the PR
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

@lplazas lplazas requested a review from a team as a code owner October 31, 2025 02:00
bunnyshell bot commented Oct 31, 2025

❗ Preview Environment deployment failed on Bunnyshell

See: Environment Details | Pipeline Logs

Available commands (reply to this comment):

  • 🚀 /bns:deploy to redeploy the environment
  • /bns:delete to remove the environment
