Fix memory leak in Prometheus metrics by removing unbounded labels #164
The `TASKS_PERPETUATED`, `TASKS_RETRIED`, and `TASKS_STRICKEN` metrics were using `execution.specific_labels()` which includes task keys, timestamps, and attempt numbers. These create unbounded cardinality in Prometheus, causing memory growth over time as each unique combination creates a new time series.

Changed these metrics to use `execution.general_labels()` instead, which only includes bounded labels like task name, docket name, and worker name. This keeps the metrics useful for observability while preventing the cardinality explosion.

OpenTelemetry spans and log contexts can still safely use `specific_labels()` since they're sampled/streamed rather than aggregated like Prometheus metrics.

Added tests to verify metrics only use safe labels going forward.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
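To make the cardinality argument concrete, here is a minimal standard-library sketch of how the number of time series grows under each labeling scheme. The `Execution` class, its fields, and the label names are hypothetical stand-ins chosen for illustration; they are not the project's actual classes or label keys, and the sketch only counts distinct label combinations rather than talking to a real Prometheus client.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Execution:
    """Hypothetical stand-in for a task execution (not the project's real class)."""

    task: str
    docket: str
    worker: str
    key: str
    when: datetime
    attempt: int

    def general_labels(self) -> dict:
        # Bounded: the distinct values are fixed by the deployment.
        return {"task": self.task, "docket": self.docket, "worker": self.worker}

    def specific_labels(self) -> dict:
        # Unbounded: key, timestamp, and attempt differ for nearly every execution.
        return {
            **self.general_labels(),
            "key": self.key,
            "when": self.when.isoformat(),
            "attempt": str(self.attempt),
        }


def count_series(label_sets) -> int:
    """Count distinct label combinations, i.e. the series a counter would retain."""
    series = Counter(tuple(sorted(labels.items())) for labels in label_sets)
    return len(series)


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
executions = [
    Execution("send_email", "default", "worker-1", f"key-{i}",
              start + timedelta(seconds=i), 1)
    for i in range(1000)
]

specific_series = count_series(e.specific_labels() for e in executions)
general_series = count_series(e.general_labels() for e in executions)
print(specific_series, general_series)  # 1000 1
```

In this toy setup, one thousand executions of the same task on the same worker produce one thousand series with the specific labels and a single series with the general ones, which is the growth pattern the description attributes to the leak.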
Claude finished @chrisguidry's task

Review Checklist

Code Review Summary

This PR correctly addresses the Prometheus metrics memory leak issue by switching from unbounded to bounded labels. The solution is well-implemented and includes appropriate test coverage.

✅ Strengths
✅ Code Quality
✅ Performance Considerations
✅ Security Considerations
📚 Documentation has been built for this PR!
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #164   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           33        33
  Lines         4789      4820   +31
  Branches       267       267
=========================================
+ Hits          4789      4820   +31

Flags with carried forward coverage won't be shown.