Fix memory leak in Prometheus metrics by removing unbounded labels #164

chrisguidry · 2025-10-07T20:42:08Z

The TASKS_PERPETUATED, TASKS_RETRIED, and TASKS_STRICKEN metrics were using execution.specific_labels() which includes task keys, timestamps, and attempt numbers. These create unbounded cardinality in Prometheus, causing memory growth over time as each unique combination creates a new time series.

Changed these metrics to use execution.general_labels() instead, which only includes bounded labels like task name, docket name, and worker name. This keeps the metrics useful for observability while preventing the cardinality explosion.

OpenTelemetry spans and log contexts can still safely use specific_labels() since they're sampled/streamed rather than aggregated like Prometheus metrics.

Added tests to verify metrics only use safe labels going forward.

🤖 Generated with Claude Code

The `TASKS_PERPETUATED`, `TASKS_RETRIED`, and `TASKS_STRICKEN` metrics were using `execution.specific_labels()` which includes task keys, timestamps, and attempt numbers. These create unbounded cardinality in Prometheus, causing memory growth over time as each unique combination creates a new time series. Changed these metrics to use `execution.general_labels()` instead, which only includes bounded labels like task name, docket name, and worker name. This keeps the metrics useful for observability while preventing the cardinality explosion. OpenTelemetry spans and log contexts can still safely use `specific_labels()` since they're sampled/streamed rather than aggregated like Prometheus metrics. Added tests to verify metrics only use safe labels going forward. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

claude · 2025-10-07T20:42:24Z

Claude finished @chrisguidry's task —— View job

Review Checklist

Analyze the code changes in all 3 files
Verify the fix addresses the memory leak issue
Check code quality and best practices
Review test coverage and implementation
Assess performance and security implications
Provide comprehensive feedback

Code Review Summary

This PR correctly addresses the Prometheus metrics memory leak issue by switching from unbounded to bounded labels. The solution is well-implemented and includes appropriate test coverage.

✅ Strengths

Root Cause Properly Identified: The PR correctly identifies that specific_labels() includes unbounded cardinality labels (task keys, timestamps, attempt numbers) which cause memory growth in Prometheus over time.
Correct Solution: Switching to general_labels() is the right approach as it only includes bounded labels:
- general_labels(): Only docket.task (bounded by number of registered tasks)
- specific_labels(): Includes docket.key, docket.when, docket.attempt (all unbounded)
Consistent Implementation: All three affected metrics (TASKS_PERPETUATED, TASKS_RETRIED, TASKS_STRICKEN) have been updated consistently.
Excellent Test Coverage: Added comprehensive tests that:
- Verify TASKS_RETRIED uses only bounded labels (tests/test_instrumentation.py:503-526)
- Verify TASKS_PERPETUATED uses only bounded labels (tests/test_instrumentation.py:528-553)
- Explicitly assert absence of unbounded labels (docket.key, docket.when, docket.attempt)

✅ Code Quality

Clean, minimal changes focused on the issue
No unnecessary modifications or scope creep
Follows existing code patterns and conventions
Preserves backward compatibility for spans/logging

✅ Performance Considerations

Major improvement: Prevents unbounded metric cardinality explosion
Memory efficiency: Each unique label combination creates a new Prometheus time series. The fix prevents unlimited growth
No performance regression: general_labels() is actually more efficient than specific_labels()

✅ Security Considerations

No security issues identified
Removing task keys from metrics could be considered a slight security improvement as it reduces information leakage

⚠️ Minor Observations

Missing Test for TASKS_STRICKEN: While the code change in src/docket/docket.py:458 fixes TASKS_STRICKEN, there's no corresponding test verifying this metric uses bounded labels. However, the implementation is consistent with the other two metrics, so this is a minor point.
Documentation: The PR description correctly notes that OpenTelemetry spans and logs can still use specific_labels() since they're sampled/streamed rather than aggregated. This understanding is correct.

✅ Test Coverage

The test coverage requirement of 100% appears to be maintained. The new tests are well-structured and comprehensive:

Use appropriate fixtures and mocks
Test the actual label content, not just the behavior
Include both positive assertions (what should be present) and negative assertions (what should be absent)

Recommendation

APPROVED ✅ - This PR successfully fixes a critical memory leak issue with a clean, well-tested solution. The switch from unbounded to bounded labels is the correct approach for Prometheus metrics while preserving detailed information where appropriate (spans and logs).

The fix will significantly improve the production stability of systems using Docket with Prometheus monitoring by preventing metric cardinality explosion.

github-actions · 2025-10-07T20:42:25Z

📚 Documentation has been built for this PR!

You can download the documentation directly here:
https://github.com/chrisguidry/docket/actions/runs/18325651303/artifacts/4208297143

codecov-commenter · 2025-10-07T20:43:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (d1eb84c) to head (9c685fa).

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #164   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           33        33           
  Lines         4789      4820   +31     
  Branches       267       267           
=========================================
+ Hits          4789      4820   +31

Flag	Coverage Δ
python-3.12	`100.00% <100.00%> (ø)`
python-3.13	`100.00% <100.00%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/docket/docket.py	`100.00% <ø> (ø)`
src/docket/worker.py	`100.00% <100.00%> (ø)`
tests/test_instrumentation.py	`100.00% <100.00%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

chrisguidry merged commit 1eec4a7 into main Oct 7, 2025
17 checks passed

chrisguidry deleted the fix-metric-cardinality branch October 7, 2025 20:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix memory leak in Prometheus metrics by removing unbounded labels #164

Fix memory leak in Prometheus metrics by removing unbounded labels #164

Uh oh!

chrisguidry commented Oct 7, 2025

Uh oh!

claude bot commented Oct 7, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Oct 7, 2025

Uh oh!

codecov-commenter commented Oct 7, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix memory leak in Prometheus metrics by removing unbounded labels #164

Fix memory leak in Prometheus metrics by removing unbounded labels #164

Uh oh!

Conversation

chrisguidry commented Oct 7, 2025

Uh oh!

claude bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review Checklist

Code Review Summary

✅ Strengths

✅ Code Quality

✅ Performance Considerations

✅ Security Considerations

⚠️ Minor Observations

✅ Test Coverage

Recommendation

Uh oh!

github-actions bot commented Oct 7, 2025

Uh oh!

codecov-commenter commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude bot commented Oct 7, 2025 •

edited

Loading

codecov-commenter commented Oct 7, 2025 •

edited

Loading