Description
Checklist:
- I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- I've included steps to reproduce the bug.
- I've pasted the output of argocd version.
Describe the bug
For context - we are running a pretty large (~30k applications) setup with ArgoCD. We are running on GKE, with application controller shards managing clusters across all three cloud providers.
For sharding, we are using the legacy algorithm, but with some custom logic. We have a controller that sets ARGOCD_CONTROLLER_REPLICAS to math.MaxInt32 and sets ARGOCD_CONTROLLER_SHARD so that each application controller manages at most 10 clusters (a rough sketch of this bucketing is shown below).
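To make the setup concrete, here is a minimal Go sketch of the kind of bucketing our controller performs; the names (assignShards, maxClustersPerShard) and the sorting-based grouping are illustrative assumptions for this issue, not our production code:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// maxClustersPerShard caps how many clusters a single application
// controller shard manages in our setup.
const maxClustersPerShard = 10

// assignShards is an illustrative sketch (not our production controller):
// it sorts cluster names for a stable ordering and places each block of
// maxClustersPerShard clusters onto the same shard index.
func assignShards(clusters []string) map[string]int {
	sorted := append([]string(nil), clusters...)
	sort.Strings(sorted)

	shards := make(map[string]int, len(sorted))
	for i, cluster := range sorted {
		shards[cluster] = i / maxClustersPerShard
	}
	return shards
}

func main() {
	clusters := []string{"gke-prod-1", "aks-prod-1", "eks-prod-1"}
	for cluster, shard := range assignShards(clusters) {
		// Each application controller then gets ARGOCD_CONTROLLER_SHARD set
		// to its shard index, while ARGOCD_CONTROLLER_REPLICAS is set to
		// math.MaxInt32 so any shard index is accepted.
		fmt.Printf("cluster %s -> shard %d (replicas=%d)\n", cluster, shard, math.MaxInt32)
	}
}
```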
When upgrading to version 3.0.11, we followed the docs and accepted all the new changes, except for ignoreResourceStatusField, which we kept as crd (this item in the upgrade docs).
When updating from version 2.14.3 to 3.0.11, we noticed a significant increase in CPU utilisation across all of our application controllers.
We profiled CPU time across all of our application controllers; here is a 200-second CPU profile:
From this, we decided to change the ARGO_CD_UPDATE_CLUSTER_INFO_TIMEOUT setting to 59 seconds (the maximum is 1 minute), which is why you see a dip in CPU utilisation in Grafana, but usage is still nowhere near its pre-upgrade level.
We also noticed that some shards were OOMing when they previously were not. Investigating these shards revealed that the clusters they manage have significantly more resources to watch than the others (they are Azure clusters with azure-service-operator custom resources). However, even after reducing the number of watched resources via exclusions, we are still seeing OOMs, albeit less frequently.
Whilst investigating the issue, we took repeated heap dumps of an application controller that periodically OOMed; here is the heap dump taken 15 seconds before it OOMed:
With debug logs enabled, we noticed that the controller OOMs pretty much as soon as it needs to reconcile a resource (or sometimes even earlier; see below for a typical log pattern):
Checking if cluster x with clusterShard y should be processed by shard z
Checking if cluster x with clusterShard y should be processed by shard z
Checking if cluster x with clusterShard y should be processed by shard z
Checking if cluster x with clusterShard y should be processed by shard z
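For reference, this is roughly how we understand the legacy distribution check that emits those lines. This is a simplified sketch of our reading of the behaviour, not ArgoCD's actual source; the fnv-style hash, the fallback of hashing the cluster ID modulo the replica count, and the field names are assumptions:

```go
package main

import (
	"hash/fnv"
	"log"
)

// clusterShard mirrors our understanding of the legacy algorithm: an explicit
// shard on the cluster secret wins, otherwise the shard is a hash of the
// cluster ID modulo the replica count. The hash choice is an assumption.
func clusterShard(clusterID string, explicitShard *int64, replicas int) int {
	if explicitShard != nil {
		return int(*explicitShard)
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(clusterID))
	return int(h.Sum32() % uint32(replicas))
}

// shouldProcess is evaluated for every cluster on every pass, which is why
// the "Checking if cluster ..." debug line repeats so often in our logs.
func shouldProcess(clusterID string, explicitShard *int64, replicas, myShard int) bool {
	cs := clusterShard(clusterID, explicitShard, replicas)
	log.Printf("Checking if cluster %s with clusterShard %d should be processed by shard %d", clusterID, cs, myShard)
	return cs == myShard
}

func main() {
	shard := int64(3)
	_ = shouldProcess("cluster-a", &shard, 1<<31-1, 3) // explicit shard set on the cluster
	_ = shouldProcess("cluster-b", nil, 1<<31-1, 3)    // falls back to hash-based assignment
}
```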
To Reproduce
This is difficult to reproduce, hence we have provided as much context about our setup as possible!
Expected behavior
Similar (or better) performance from ArgoCD when upgrading.
Version
argocd: v3.0.11+240a183
BuildDate: 2025-07-10T14:53:50Z
GitCommit: 240a1833c0f3ce73078575d692146673e69b6990
GitTreeState: clean
GoVersion: go1.24.4
Compiler: gc
Platform: linux/arm64