feat: add argocd_cluster_events_ignored_total metric #27520
staffanselander wants to merge 2 commits into
Codecov Report
✅ All modified and coverable lines are covered by tests.

| | master | #27520 | +/- |
|---|---|---|---|
| Coverage | 64.23% | 64.24% | |
| Files | 422 | 422 | |
| Lines | 57853 | 57860 | +7 |
| Hits | 37163 | 37173 | +10 |
| Misses | 17183 | 17179 | -4 |
| Partials | 3507 | 3508 | +1 |
ppapapetrou76 left a comment:
LGTM but I have a question
argocd_cluster_events_total uses OnEvent; this new counter uses OnResourceUpdated.
Is there any reason for this?
Pull request overview
Adds a new application-controller Prometheus counter (argocd_cluster_events_ignored_total) to provide observability into how often Kubernetes watch events are filtered out by ignoreResourceUpdates, enabling operators to evaluate rule effectiveness without enabling high-volume debug logs.
Changes:
- Add and register `argocd_cluster_events_ignored_total{server,group,kind}` in the controller metrics server, including cache-expiration reset support.
- Increment the new counter when `skipResourceUpdate()` causes an update event to be ignored.
- Add a unit test and document the new metric in the operator metrics reference.
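The three changes listed above can be pictured with a small, standard-library-only sketch. It is a stand-in for a labelled Prometheus counter, not Argo CD's code: the names `labelledCounter`, `labels`, `handleUpdate`, and the `ignore` flag are hypothetical, and the real PR uses the Prometheus client library.

```go
package main

import (
	"fmt"
	"sync"
)

// labels mirrors the label set named in the PR: server, group, kind.
type labels struct {
	Server, Group, Kind string
}

// labelledCounter is a hypothetical, stdlib-only stand-in for a
// labelled counter; it is not Argo CD's implementation.
type labelledCounter struct {
	mu     sync.Mutex
	values map[labels]uint64
}

func newLabelledCounter() *labelledCounter {
	return &labelledCounter{values: make(map[labels]uint64)}
}

// Inc adds one to the series identified by l.
func (c *labelledCounter) Inc(l labels) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[l]++
}

// Value returns the current count for l (zero if never incremented).
func (c *labelledCounter) Value(l labels) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[l]
}

// Reset drops every series, as a cache-expiration hook might do.
func (c *labelledCounter) Reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values = make(map[labels]uint64)
}

// handleUpdate models the early-return path: when the (assumed) ignore
// decision is true, the event is counted and processing stops.
func handleUpdate(ignored *labelledCounter, l labels, ignore bool) (processed bool) {
	if ignore {
		ignored.Inc(l)
		return false
	}
	return true
}

func main() {
	ignored := newLabelledCounter()
	deploy := labels{Server: "https://kubernetes.default.svc", Group: "apps", Kind: "Deployment"}

	handleUpdate(ignored, deploy, true)
	handleUpdate(ignored, deploy, true)
	handleUpdate(ignored, deploy, false)
	fmt.Println(ignored.Value(deploy)) // 2

	ignored.Reset()
	fmt.Println(ignored.Value(deploy)) // 0
}
```

Counting on the early-return path means only updates that were actually skipped contribute to the series, and the reset keeps stale label combinations from lingering after expiration.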
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| docs/operator-manual/metrics.md | Documents the new argocd_cluster_events_ignored_total counter. |
| controller/metrics/metrics.go | Defines/registers the new counter and exposes IncClusterEventsIgnoredCount(), including reset on expiration. |
| controller/cache/cache.go | Increments the ignored-events counter on the ignoreResourceUpdates early-return path. |
| controller/metrics/metrics_test.go | Verifies the new counter is emitted with expected labels/values. |
This is a good call out because it will make something like |
Yup but I would just go with |
@staffanselander please fix the DCO check
Thanks for the careful review; it was really useful feedback. The more I dig in, the more I think the right answer here isn't a one-line move. Looking at the cache code, I'd like to take a bit more time to evaluate options that fit Argo CD's existing patterns better and avoid the misleading-ratio confusion you both flagged. Will report back with a concrete proposal.
The OnEvent metric tells you which objects are monitored (events). So maybe what is missing is a
Although, this PR seems to address the related issue correctly IMO.
    }, append(descClusterDefaultLabels, "group", "kind"))

    clusterEventsIgnoredCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "argocd_cluster_events_ignored_total",
Perhaps we can call this argocd_resource_updates_ignored_total?
Suggested change:

    - Name: "argocd_cluster_events_ignored_total",
    + Name: "argocd_resource_updates_ignored_total",
Closes argoproj#27519 Signed-off-by: Staffan Selander <staffan.selander.li@gmail.com>
Address PR review feedback to associate the metrics body output with the test (only printed on failure / verbose mode) rather than emitting it unconditionally via the global logger. Signed-off-by: Staffan Selander <staffan.selander.li@gmail.com>
Closes #27519
Description
Adds a Prometheus counter `argocd_cluster_events_ignored_total` that tracks k8s resource events filtered by `ignoreResourceUpdates` rules. This enables operators to measure rule effectiveness without debug logging (which increases log volume 8-17x at scale).
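As a rough illustration of "measuring rule effectiveness", the sketch below computes the share of events that were ignored for one `{server, group, kind}` series. The function name `ignoredShare` and the sample values are hypothetical. Reviewers in this thread point out that the two counters are incremented from different callbacks (`OnEvent` versus `OnResourceUpdated`), so the result may be only an approximation; the helper therefore refuses to report a share when the inputs are not comparable.

```go
package main

import "fmt"

// ignoredShare returns ignored/total as a fraction in [0, 1].
// ok is false when total is not positive, or when ignored is negative
// or exceeds total, which would suggest the two counters are not
// counting the same population of events.
func ignoredShare(ignored, total float64) (share float64, ok bool) {
	if total <= 0 || ignored < 0 || ignored > total {
		return 0, false
	}
	return ignored / total, true
}

func main() {
	// Hypothetical sample values for a single label combination.
	share, ok := ignoredShare(750, 1000)
	fmt.Println(share, ok) // 0.75 true
}
```

A share close to 1 would suggest the rules filter most updates for that kind; a share near 0 would suggest the rules rarely match.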
Uses the same labels (`server`, `group`, `kind`) as the existing `argocd_cluster_events_total`.

Checklist