
High Application Controller CPU/Memory Usage in version 3.0.11 #24167

@HarleyB123

Description


Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

For context - we are running a pretty large (~30k applications) setup with ArgoCD. We are running on GKE, with application controller shards managing clusters across all three cloud providers.

For sharding, we are using the legacy mode, but with some custom logic. We have a controller that sets ARGOCD_CONTROLLER_REPLICAS to math.MaxInt32, and sets ARGOCD_CONTROLLER_SHARD on each application controller so that each one manages at most 10 clusters.

When upgrading to version 3.0.11, we followed the docs and accepted all the new defaults, except for ignoreResourceStatusField, which we kept as crd (this item in the upgrade docs).

When updating from version 2.14.3 to 3.0.11, we noticed a significant increase in CPU utilisation across all of our application controllers.

[Image: Grafana graph of application controller CPU utilisation before and after the upgrade]

We profiled CPU time of all application controllers, here is a 200 second CPU profile:

[Image: 200-second CPU profile of the application controllers]

Based on this profile, we changed the ARGO_CD_UPDATE_CLUSTER_INFO_TIMEOUT setting to 59 seconds (the maximum is 1 minute), which explains the dip in CPU utilisation in Grafana - but usage is still nowhere near its pre-upgrade level.

We also noticed that some shards were OOMing when they previously were not. Investigating these shards revealed that the clusters they manage have significantly more resources to watch than the others (they are Azure clusters, which have azure-service-operator custom resources). However, even after reducing the number of watched resources via exclusions, we are still seeing OOMs - albeit less frequently.
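For reference, exclusions of this kind are configured via resource.exclusions in the argocd-cm ConfigMap. A minimal sketch (the Azure API group glob is illustrative, not our exact exclusion list):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.exclusions: |
    - apiGroups:
      - "*.azure.com"    # illustrative: azure-service-operator API groups
      kinds:
      - "*"
      clusters:
      - "*"
```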

While investigating the issue, we took repeated heap dumps of an application controller that periodically OOMed; here is the heap dump taken 15 seconds before it OOMed:

[Image: heap dump taken 15 seconds before the application controller OOMed]

With debug logs enabled, we noticed that the controller OOMs almost as soon as it needs to reconcile a resource (or sometimes even earlier; see below for a familiar log pattern):

Checking if cluster x with clusterShard y should be processed by shard z
Checking if cluster x with clusterShard y should be processed by shard z
Checking if cluster x with clusterShard y should be processed by shard z
Checking if cluster x with clusterShard y should be processed by shard z

To Reproduce

Difficult to reproduce, hence providing as much context about our setup as possible!

Expected behavior

Similar (or better) performance from ArgoCD when upgrading.

Version

argocd: v3.0.11+240a183
  BuildDate: 2025-07-10T14:53:50Z
  GitCommit: 240a1833c0f3ce73078575d692146673e69b6990
  GitTreeState: clean
  GoVersion: go1.24.4
  Compiler: gc
  Platform: linux/arm64

Labels: bug