Description
Checklist:
- I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- I've included steps to reproduce the bug.
- I've pasted the output of argocd version.
Describe the bug
For context - we are running a pretty large (~30k applications) setup with ArgoCD. We are running on GKE, with application controller shards managing clusters across all three cloud providers.
For sharding, we are using the legacy algorithm, but with some custom logic. We have a controller that sets ARGOCD_CONTROLLER_REPLICAS to math.MaxInt32 and sets ARGOCD_CONTROLLER_SHARD so that each application controller manages at most 10 clusters (a rough sketch of this bucketing is shown below).
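To make the setup concrete, here is a minimal Go sketch of the kind of bucketing our controller performs; the names (assignShards, maxClustersPerShard) and the sorting-based grouping are illustrative assumptions for this issue, not our production code:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// maxClustersPerShard caps how many clusters a single application
// controller shard manages in our setup.
const maxClustersPerShard = 10

// assignShards is an illustrative sketch (not our production controller):
// it sorts cluster names for a stable ordering and places each block of
// maxClustersPerShard clusters onto the same shard index.
func assignShards(clusters []string) map[string]int {
	sorted := append([]string(nil), clusters...)
	sort.Strings(sorted)

	shards := make(map[string]int, len(sorted))
	for i, cluster := range sorted {
		shards[cluster] = i / maxClustersPerShard
	}
	return shards
}

func main() {
	clusters := []string{"gke-prod-1", "aks-prod-1", "eks-prod-1"}
	for cluster, shard := range assignShards(clusters) {
		// Each application controller then gets ARGOCD_CONTROLLER_SHARD set
		// to its shard index, while ARGOCD_CONTROLLER_REPLICAS is set to
		// math.MaxInt32 so any shard index is accepted.
		fmt.Printf("cluster %s -> shard %d (replicas=%d)\n", cluster, shard, math.MaxInt32)
	}
}
```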
When upgrading to version 3.0.11, we followed the docs and accepted all the new changes, except for ignoreResourceStatusField, which we kept as crd (this item in the upgrade docs).
When updating from version 2.14.3 to 3.0.11, we noticed a significant increase in CPU utilisation across all of our application controllers.
We profiled CPU time across all of our application controllers; here is a 200-second CPU profile:
From this, we decided to change the ARGO_CD_UPDATE_CLUSTER_INFO_TIMEOUT setting to 59 seconds (the maximum is 1 minute), which is why you see a dip in CPU utilisation in Grafana, but usage is still nowhere near its pre-upgrade level.
We also noticed that some shards were OOMing when they previously were not. Investigating these shards revealed that the clusters they manage have significantly more resources to watch than the others (they are Azure clusters with azure-service-operator custom resources). However, even after reducing the number of watched resources via exclusions, we are still seeing OOMs, albeit less frequently.
Whilst investigating the issue, we took repeated heap dumps of an application controller that periodically OOMed; here is the heap dump taken 15 seconds before it OOMed:
With debug logs enabled, we noticed that the controller OOMs pretty much as soon as it needs to reconcile a resource (or sometimes even earlier; see below for a typical log pattern):
Checking if cluster x with clusterShard y should be processed by shard z
Checking if cluster x with clusterShard y should be processed by shard z
Checking if cluster x with clusterShard y should be processed by shard z
Checking if cluster x with clusterShard y should be processed by shard z
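For reference, this is roughly how we understand the legacy distribution check that emits those lines. This is a simplified sketch of our reading of the behaviour, not ArgoCD's actual source; the fnv-style hash, the fallback of hashing the cluster ID modulo the replica count, and the field names are assumptions:

```go
package main

import (
	"hash/fnv"
	"log"
)

// clusterShard mirrors our understanding of the legacy algorithm: an explicit
// shard on the cluster secret wins, otherwise the shard is a hash of the
// cluster ID modulo the replica count. The hash choice is an assumption.
func clusterShard(clusterID string, explicitShard *int64, replicas int) int {
	if explicitShard != nil {
		return int(*explicitShard)
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(clusterID))
	return int(h.Sum32() % uint32(replicas))
}

// shouldProcess is evaluated for every cluster on every pass, which is why
// the "Checking if cluster ..." debug line repeats so often in our logs.
func shouldProcess(clusterID string, explicitShard *int64, replicas, myShard int) bool {
	cs := clusterShard(clusterID, explicitShard, replicas)
	log.Printf("Checking if cluster %s with clusterShard %d should be processed by shard %d", clusterID, cs, myShard)
	return cs == myShard
}

func main() {
	shard := int64(3)
	_ = shouldProcess("cluster-a", &shard, 1<<31-1, 3) // explicit shard set on the cluster
	_ = shouldProcess("cluster-b", nil, 1<<31-1, 3)    // falls back to hash-based assignment
}
```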
To Reproduce
This is difficult to reproduce, hence we have provided as much context about our setup as possible!
Expected behavior
Similar (or better) performance from ArgoCD when upgrading.
Version
argocd: v3.0.11+240a183
BuildDate: 2025-07-10T14:53:50Z
GitCommit: 240a1833c0f3ce73078575d692146673e69b6990
GitTreeState: clean
GoVersion: go1.24.4
Compiler: gc
Platform: linux/arm64