Skip to content

Conversation

@chrisguidry
Copy link
Owner

The TASKS_PERPETUATED, TASKS_RETRIED, and TASKS_STRICKEN metrics were using execution.specific_labels() which includes task keys, timestamps, and attempt numbers. These create unbounded cardinality in Prometheus, causing memory growth over time as each unique combination creates a new time series.

Changed these metrics to use execution.general_labels() instead, which only includes bounded labels like task name, docket name, and worker name. This keeps the metrics useful for observability while preventing the cardinality explosion.

OpenTelemetry spans and log contexts can still safely use specific_labels() since they're sampled/streamed rather than aggregated like Prometheus metrics.

Added tests to verify metrics only use safe labels going forward.

🤖 Generated with Claude Code

The `TASKS_PERPETUATED`, `TASKS_RETRIED`, and `TASKS_STRICKEN` metrics were
using `execution.specific_labels()` which includes task keys, timestamps, and
attempt numbers. These create unbounded cardinality in Prometheus, causing
memory growth over time as each unique combination creates a new time series.

Changed these metrics to use `execution.general_labels()` instead, which only
includes bounded labels like task name, docket name, and worker name. This
keeps the metrics useful for observability while preventing the cardinality
explosion.

OpenTelemetry spans and log contexts can still safely use `specific_labels()`
since they're sampled/streamed rather than aggregated like Prometheus metrics.

Added tests to verify metrics only use safe labels going forward.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@claude
Copy link
Contributor

claude bot commented Oct 7, 2025

Claude finished @chrisguidry's task —— View job


Review Checklist

  • Analyze the code changes in all 3 files
  • Verify the fix addresses the memory leak issue
  • Check code quality and best practices
  • Review test coverage and implementation
  • Assess performance and security implications
  • Provide comprehensive feedback

Code Review Summary

This PR correctly addresses the Prometheus metrics memory leak issue by switching from unbounded to bounded labels. The solution is well-implemented and includes appropriate test coverage.

Strengths

  1. Root Cause Properly Identified: The PR correctly identifies that specific_labels() includes unbounded cardinality labels (task keys, timestamps, attempt numbers) which cause memory growth in Prometheus over time.

  2. Correct Solution: Switching to general_labels() is the right approach as it only includes bounded labels:

    • general_labels(): Only docket.task (bounded by number of registered tasks)
    • specific_labels(): Includes docket.key, docket.when, docket.attempt (all unbounded)
  3. Consistent Implementation: All three affected metrics (TASKS_PERPETUATED, TASKS_RETRIED, TASKS_STRICKEN) have been updated consistently.

  4. Excellent Test Coverage: Added comprehensive tests that:

    • Verify TASKS_RETRIED uses only bounded labels (tests/test_instrumentation.py:503-526)
    • Verify TASKS_PERPETUATED uses only bounded labels (tests/test_instrumentation.py:528-553)
    • Explicitly assert absence of unbounded labels (docket.key, docket.when, docket.attempt)

Code Quality

  • Clean, minimal changes focused on the issue
  • No unnecessary modifications or scope creep
  • Follows existing code patterns and conventions
  • Preserves backward compatibility for spans/logging

Performance Considerations

  • Major improvement: Prevents unbounded metric cardinality explosion
  • Memory efficiency: Each unique label combination creates a new Prometheus time series. The fix prevents unlimited growth
  • No performance regression: general_labels() is actually more efficient than specific_labels()

Security Considerations

  • No security issues identified
  • Removing task keys from metrics could be considered a slight security improvement as it reduces information leakage

⚠️ Minor Observations

  1. Missing Test for TASKS_STRICKEN: While the code change in src/docket/docket.py:458 fixes TASKS_STRICKEN, there's no corresponding test verifying this metric uses bounded labels. However, the implementation is consistent with the other two metrics, so this is a minor point.

  2. Documentation: The PR description correctly notes that OpenTelemetry spans and logs can still use specific_labels() since they're sampled/streamed rather than aggregated. This understanding is correct.

Test Coverage

The test coverage requirement of 100% appears to be maintained. The new tests are well-structured and comprehensive:

  • Use appropriate fixtures and mocks
  • Test the actual label content, not just the behavior
  • Include both positive assertions (what should be present) and negative assertions (what should be absent)

Recommendation

APPROVED ✅ - This PR successfully fixes a critical memory leak issue with a clean, well-tested solution. The switch from unbounded to bounded labels is the correct approach for Prometheus metrics while preserving detailed information where appropriate (spans and logs).

The fix will significantly improve the production stability of systems using Docket with Prometheus monitoring by preventing metric cardinality explosion.

@github-actions
Copy link

github-actions bot commented Oct 7, 2025

📚 Documentation has been built for this PR!

You can download the documentation directly here:
https://github.com/chrisguidry/docket/actions/runs/18325651303/artifacts/4208297143

@codecov-commenter
Copy link

codecov-commenter commented Oct 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (d1eb84c) to head (9c685fa).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #164   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           33        33           
  Lines         4789      4820   +31     
  Branches       267       267           
=========================================
+ Hits          4789      4820   +31     
Flag Coverage Δ
python-3.12 100.00% <100.00%> (ø)
python-3.13 100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/docket/docket.py 100.00% <ø> (ø)
src/docket/worker.py 100.00% <100.00%> (ø)
tests/test_instrumentation.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chrisguidry chrisguidry merged commit 1eec4a7 into main Oct 7, 2025
17 checks passed
@chrisguidry chrisguidry deleted the fix-metric-cardinality branch October 7, 2025 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants