Skip to content

feat: add argocd_cluster_events_ignored_total metric#27520

Open
staffanselander wants to merge 2 commits into
argoproj:masterfrom
staffanselander:feat/add-ignored-events-metric
Open

feat: add argocd_cluster_events_ignored_total metric#27520
staffanselander wants to merge 2 commits into
argoproj:masterfrom
staffanselander:feat/add-ignored-events-metric

Conversation

@staffanselander
Copy link
Copy Markdown

@staffanselander staffanselander commented Apr 23, 2026

Closes #27519

Description

Adds a Prometheus counter argocd_cluster_events_ignored_total that tracks k8s resource events
filtered by ignoreResourceUpdates rules. This enables operators to measure rule effectiveness
without debug logging (which increases log volume 8-17x at scale).

Uses the same labels (server, group, kind) as the existing argocd_cluster_events_total.

Checklist

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Title of the PR
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.

@bunnyshell
Copy link
Copy Markdown

bunnyshell Bot commented Apr 23, 2026

❌ Preview Environment undeployed from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

@staffanselander staffanselander force-pushed the feat/add-ignored-events-metric branch from 3f4b9b1 to 7080d4d Compare April 23, 2026 13:38
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.24%. Comparing base (5a20f9b) to head (2776a73).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #27520   +/-   ##
=======================================
  Coverage   64.23%   64.24%           
=======================================
  Files         422      422           
  Lines       57853    57860    +7     
=======================================
+ Hits        37163    37173   +10     
+ Misses      17183    17179    -4     
- Partials     3507     3508    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@staffanselander staffanselander force-pushed the feat/add-ignored-events-metric branch from 7080d4d to fd8be48 Compare April 27, 2026 11:24
@staffanselander staffanselander marked this pull request as ready for review April 27, 2026 11:27
@staffanselander staffanselander requested review from a team as code owners April 27, 2026 11:27
Copy link
Copy Markdown
Contributor

@ppapapetrou76 ppapapetrou76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I have a question

argocd_cluster_events_total uses OnEvent; this new counter uses OnResourceUpdated.

Is there any reason for this?

Comment thread controller/metrics/metrics_test.go Outdated
@rumstead rumstead requested a review from Copilot April 27, 2026 17:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new application-controller Prometheus counter (argocd_cluster_events_ignored_total) to provide observability into how often Kubernetes watch events are filtered out by ignoreResourceUpdates, enabling operators to evaluate rule effectiveness without enabling high-volume debug logs.

Changes:

  • Add and register argocd_cluster_events_ignored_total{server,group,kind} in the controller metrics server, including cache-expiration reset support.
  • Increment the new counter when skipResourceUpdate() causes an update event to be ignored.
  • Add a unit test and document the new metric in the operator metrics reference.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
docs/operator-manual/metrics.md Documents the new argocd_cluster_events_ignored_total counter.
controller/metrics/metrics.go Defines/registers the new counter and exposes IncClusterEventsIgnoredCount(), including reset on expiration.
controller/cache/cache.go Increments the ignored-events counter on the ignoreResourceUpdates early-return path.
controller/metrics/metrics_test.go Verifies the new counter is emitted with expected labels/values.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@rumstead
Copy link
Copy Markdown
Member

LGTM but I have a question

argocd_cluster_events_total uses OnEvent; this new counter uses OnResourceUpdated.

Is there any reason for this?

This is a good call out because it will make something like argocd_cluster_events_ignored_total / argocd_cluster_events_total result in misleading results. We should document the difference between the two.

@ppapapetrou76
Copy link
Copy Markdown
Contributor

LGTM but I have a question
argocd_cluster_events_total uses OnEvent; this new counter uses OnResourceUpdated.
Is there any reason for this?

This is a good call out because it will make something like argocd_cluster_events_ignored_total / argocd_cluster_events_total result in misleading results. We should document the difference between the two.

Yup but I would just go with OnEvent here to make the code aligned / consistent

@crenshaw-dev
Copy link
Copy Markdown
Member

@staffanselander please fix the DCO check

@staffanselander
Copy link
Copy Markdown
Author

Thanks for the careful review — really useful feedback.

The more I dig in, the more I think the right answer here isn't a one-line move. Looking at the cache code, OnEvent doesn't have the (oldRes, newRes) pair that skipResourceUpdate needs to make the decision, and OnResourceUpdated fires for several distinct sources (initial list, relists, Add/Modify/Delete) where only Modifies are ever ignorable — so even a paired counter would still produce a biased ratio.

I'd like to take a bit more time to evaluate options that fit Argo CD's existing patterns better and avoid the misleading-ratio confusion you both flagged. Will report back with a concrete proposal.

@staffanselander staffanselander force-pushed the feat/add-ignored-events-metric branch from fee5e0d to 045bdd9 Compare April 30, 2026 09:54
@agaudreault
Copy link
Copy Markdown
Member

The OnEvent metric tells you which object are monitored (events)
The OnResourceUpdated will correctly tell you for which resource the relevant event was ignored.
If we want to have a ratio of work avoided, then we would need a new metric in handleObjectUpdated to tell the actual number of event processed.

So maybe what is missing is a argocd_cluster_events_processed_total increased in handleObjectUpdated on relevant events alongside the Requesting app refresh caused by object update log.

Although, this PR seems to address the related issue correctly IMO.

}, append(descClusterDefaultLabels, "group", "kind"))

clusterEventsIgnoredCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "argocd_cluster_events_ignored_total",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can call this argocd_resource_updates_ignored_total?

Suggested change
Name: "argocd_cluster_events_ignored_total",
Name: "argocd_resource_updates_ignored_total",

Closes argoproj#27519

Signed-off-by: Staffan Selander <staffan.selander.li@gmail.com>
Address PR review feedback to associate the metrics body output with
the test (only printed on failure / verbose mode) rather than emitting
it unconditionally via the global logger.

Signed-off-by: Staffan Selander <staffan.selander.li@gmail.com>
@staffanselander staffanselander force-pushed the feat/add-ignored-events-metric branch from 045bdd9 to 2776a73 Compare May 19, 2026 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add argocd_cluster_events_ignored_total metric for ignoreResourceUpdates

7 participants