
Add argocd_cluster_events_ignored_total metric for ignoreResourceUpdates #27519

@staffanselander

Summary

Add a Prometheus counter, argocd_cluster_events_ignored_total, that increments each time skipResourceUpdate() filters out a resource event due to ignoreResourceUpdates rules. Today there is no way to observe whether these rules are working other than a debug-level log line.

Motivation

The ignoreResourceUpdates feature (introduced in v2.8) suppresses unnecessary reconciliation when watched Kubernetes resources change in fields that operators have deemed irrelevant (e.g., /status, /metadata/managedFields). However, there is no metric to observe how many events are being filtered. The only signal is a debug-level log line in controller/cache/cache.go:

log.WithFields(log.Fields{...}).Debugf("Ignoring change of object ...")

This makes it impossible to measure the effectiveness of ignoreResourceUpdates rules without enabling debug logging on the application controller, which is prohibitively expensive at scale.

The cost of debug logging (the only current alternative)

We operate a large ArgoCD deployment (10 controller shards, 300+ clusters). When we temporarily enabled debug logging on the application controller to observe ignoreResourceUpdates behavior, we measured the following impact over 10-minute windows:

| Level | Log lines / 10 min | Bytes / 10 min | Lines / hour | Bytes / hour |
|---|---|---|---|---|
| info | ~454K | ~170 MB | ~2.7M | ~1.0 GB |
| debug | ~7.8M | ~1.4 GB | ~46.7M | ~8.0 GB |
| multiplier | 17x | 8x | 17x | 8x |

Extrapolated: debug logging costs an additional ~169 GB/day in log volume. This makes it impractical to run debug logging for any extended period to tune ignoreResourceUpdates rules, yet without it there is zero observability into whether the rules are working or how much load they're shedding.

Use cases

  1. Measure effectiveness: Compare argocd_cluster_events_ignored_total against argocd_cluster_events_total to see what percentage of events are being filtered per resource type.
  2. Tune rules: Identify high-frequency resource types that aren't yet covered by ignore rules.
  3. Detect regressions: Alert if the ratio suddenly changes, indicating a misconfiguration or upstream behavior change.
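
To make the "measure effectiveness" use case concrete, here is a minimal, stdlib-only Go sketch of the ratio calculation. In practice this would be a query against Prometheus rather than Go code. The `labels` type, the `ignoredRatio` helper, and the sample values are all hypothetical, and the sketch assumes argocd_cluster_events_total counts every event, including those that are later ignored.

```go
package main

import "fmt"

// labels mirrors the label set shared by both counters (server, group, kind).
// Hypothetical type, used only for this illustration.
type labels struct {
	Server, Group, Kind string
}

// ignoredRatio returns ignored/total for each label set present in total.
// Label sets with a zero total are skipped to avoid dividing by zero.
func ignoredRatio(total, ignored map[labels]float64) map[labels]float64 {
	out := make(map[labels]float64, len(total))
	for k, t := range total {
		if t == 0 {
			continue
		}
		out[k] = ignored[k] / t
	}
	return out
}

func main() {
	deploy := labels{Server: "https://kubernetes.default.svc", Group: "apps", Kind: "Deployment"}

	// Made-up counter snapshots, not measurements.
	total := map[labels]float64{deploy: 200}
	ignored := map[labels]float64{deploy: 50}

	for k, r := range ignoredRatio(total, ignored) {
		fmt.Printf("%s/%s: %.0f%% of events ignored\n", k.Group, k.Kind, r*100)
	}
}
```

A resource type with a high event rate and a ratio near zero is a candidate for a new ignore rule; a sudden change in the ratio is the regression signal described above.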

Proposal

Add a new counter using the same labels as the existing argocd_cluster_events_total counter (server, group, kind):

argocd_cluster_events_ignored_total{server="...", group="apps", kind="Deployment"} 42

Changes required

  1. controller/metrics/metrics.go: Define the clusterEventsIgnoredCounter counter, add a struct field for it, register it, expose an IncClusterEventsIgnoredCount() method, and reset it on expiration.
  2. controller/cache/cache.go: Call IncClusterEventsIgnoredCount() in the skipResourceUpdate early-return path.
  3. controller/metrics/metrics_test.go: Add test for the new counter.
  4. docs/operator-manual/metrics.md: Document the new metric.
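
The shape of changes 1 and 2 can be sketched roughly as follows. This is not the implementation referred to in this issue: it is a stdlib-only illustration in which `counterVec` stands in for a labeled Prometheus counter, `metricsServer` is a reduced hypothetical struct, and `handleEvent` with its `shouldSkip` argument stands in for the code path around skipResourceUpdate(). The signature of IncClusterEventsIgnoredCount() is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// eventKey mirrors the proposed label set: server, group, kind.
type eventKey struct {
	Server, Group, Kind string
}

// counterVec is a stand-in for a labeled Prometheus counter (hypothetical).
type counterVec struct {
	mu     sync.Mutex
	counts map[eventKey]uint64
}

func newCounterVec() *counterVec {
	return &counterVec{counts: make(map[eventKey]uint64)}
}

// Inc adds one to the counter for the given label set.
func (c *counterVec) Inc(k eventKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[k]++
}

// Get returns the current value for the given label set.
func (c *counterVec) Get(k eventKey) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[k]
}

// metricsServer is a reduced, hypothetical version of the controller's metrics server.
type metricsServer struct {
	clusterEventsIgnored *counterVec
}

// IncClusterEventsIgnoredCount uses the method name from the proposal;
// the parameter list is an assumption.
func (m *metricsServer) IncClusterEventsIgnoredCount(server, group, kind string) {
	m.clusterEventsIgnored.Inc(eventKey{Server: server, Group: group, Kind: kind})
}

// handleEvent models the early-return path. shouldSkip stands in for the
// result of skipResourceUpdate(). It reports whether the event proceeds.
func handleEvent(m *metricsServer, server, group, kind string, shouldSkip bool) bool {
	if shouldSkip {
		m.IncClusterEventsIgnoredCount(server, group, kind)
		return false
	}
	return true
}

func main() {
	m := &metricsServer{clusterEventsIgnored: newCounterVec()}
	const server = "https://kubernetes.default.svc"

	handleEvent(m, server, "apps", "Deployment", true)  // ignored
	handleEvent(m, server, "apps", "Deployment", true)  // ignored
	handleEvent(m, server, "apps", "Deployment", false) // processed

	k := eventKey{Server: server, Group: "apps", Kind: "Deployment"}
	fmt.Println("ignored events:", m.clusterEventsIgnored.Get(k))
}
```

Incrementing inside the early-return branch keeps the new counter tied to exactly the events that the debug log line reports today, so the two signals should agree when both are enabled.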

I have a working implementation ready and can submit a PR.
