fix: connection closure metrics and enable all metric groups by default#3735
fix: connection closure metrics and enable all metric groups by default#3735ofekshenawa merged 9 commits intomasterfrom
Conversation
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
16563ce to
6cea2f0
Compare
This commit addresses three issues in the redisotel-native metrics implementation:
1. Fix connection closure metrics not appearing in dashboards
- Root cause: CloseConn() in internal/pool/pool.go was not calling
the metricConnectionClosedCallback when closing stale connections
- Added callback invocation in CloseConn() to record 'stale' closures
2. Verify tracing and metrics compatibility
- Added TestTracingAndMetricsCompatibility to document that
redisotel (tracing via AddHook) and redisotel-native (metrics via
SetOTelRecorder) can coexist without conflicts
3. Enable all metric groups by default
- Changed default MetricGroups from connection-basic+resiliency
to MetricGroupAll for comprehensive observability out of the box
E2E validated: redis_client_connection_closed_total now appears in
Prometheus with reason='stale' when connections are closed.
6cea2f0 to
9f26be2
Compare
Upgrades OpenTelemetry Go SDK from v1.33.0 to v1.40.0 to fix a HIGH severity security vulnerability (arbitrary code execution via PATH hijacking on macOS). Changes: - go.opentelemetry.io/otel: v1.33.0 -> v1.40.0 - go.opentelemetry.io/otel/metric: v1.33.0 -> v1.40.0 - go.opentelemetry.io/otel/sdk: v1.33.0 -> v1.40.0 - go.opentelemetry.io/otel/sdk/metric: v1.33.0 -> v1.40.0 - go.opentelemetry.io/otel/trace: v1.33.0 -> v1.40.0 - go.opentelemetry.io/auto/sdk: v1.1.0 -> v1.2.1 (transitive) - github.com/go-logr/logr: v1.4.2 -> v1.4.3 (transitive) - golang.org/x/sys: v0.30.0 -> v0.40.0 (transitive) All tests pass with the upgraded dependencies.
…e metrics Addresses the Medium Severity issue where connection closure reasons were misattributed. The CloseConn method was hardcoding 'stale' as the reason even when connections were closed due to hook errors. Changes: - Updated Pooler.CloseConn interface: CloseConn(*Conn) → CloseConn(*Conn, string) - Updated ConnPool.CloseConn to accept and use the reason parameter - Updated StickyConnPool.CloseConn to pass reason to underlying pool - Updated SingleConnPool.CloseConn to pass reason to underlying pool - Updated call sites with correct reasons: - Line 767: 'stale' for unhealthy/stale connections - Line 778: 'hook_error' for hook errors on idle connections - Line 827: 'hook_error' for hook errors on new connections - redis.go:350: 'auth_error' for authentication errors This ensures the redis.client.connection.closed metric accurately reflects the actual reason for connection closure, improving observability and debugging capabilities.
- Add getMetricConnectionStateChangeCallback() call in CloseConn to record 'idle' -> '' state transition when connections are closed, preventing connection count gauge drift - Update CloseConn mock implementations to match new signature with reason parameter: - internal/auth/streaming/manager_test.go - maintnotifications/pool_hook_test.go Fixes metric drift issue where closed connections weren't being decremented from the db.client.connection.count UpDownCounter.
- Add fromState parameter to Pooler.CloseConn interface to explicitly specify the metric state the connection was in before closing - Update all CloseConn implementations (ConnPool, SingleConnPool, StickyConnPool) - Update all call sites with the correct fromState value: - pool.go:767 (stale): 'idle' - connection was popped from idle pool - pool.go:778 (hook_error): 'idle' - connection was popped from idle pool - pool.go:827 (hook_error on new): 'idle' - newConn recorded '' → 'idle' - redis.go:350 (auth_error): 'idle' - re-auth happens on idle connections - Update test mocks to match new signature This ensures accurate connection count metrics by making the metric state explicit rather than relying on implicit assumptions. Addresses PR review feedback about potential state drift in UpDownCounter metrics.
ndyakov
left a comment
There was a problem hiding this comment.
Overall looks good to me, left a suggestion.
| // Re-auth happens on connections that were idle in the pool (the pool hook | ||
| // waits for IDLE state before transitioning to UNUSABLE for re-auth). | ||
| // From metrics perspective, the connection was never "used" by a client. | ||
| err := c.connPool.CloseConn(poolCn, "auth_error", "idle") |
There was a problem hiding this comment.
can we extract the reasons as consts? also the states are defined as consts, you can use them.
…orrelation - Update Pooler interface CloseConn signature to accept ctx as first parameter - Update ConnPool.CloseConn to propagate context to metric callbacks - Update SingleConnPool and StickyConnPool wrapper implementations - Update all call sites to pass appropriate context - Fix mock implementations in test files This enables OpenTelemetry trace-to-metric correlation by ensuring the caller's context (with active span) is passed to metric callbacks instead of using context.Background().
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
| // Notify metrics: new connection is created and used | ||
| if cb := getMetricConnectionStateChangeCallback(); cb != nil { | ||
| cb(ctx, newcn, "", "used") | ||
| cb(ctx, newcn, "", MetricStateUsed) |
There was a problem hiding this comment.
Wrong fromState causes idle gauge metric drift
High Severity
The state change callback at the success path for new connections uses "" as fromState, but newConn() already recorded a "" → MetricStateIdle transition at line 578. The correct fromState here is MetricStateIdle, not "". The error path at line 862 correctly uses MetricStateIdle (and the comment at line 861 explicitly documents this reasoning), but the success path doesn't apply the same logic. This causes the idle gauge to be permanently incremented by +1 for every new connection successfully acquired through Get(), since the "" → idle from newConn is never offset by an idle → used transition.
ndyakov
left a comment
There was a problem hiding this comment.
left one suggestion, not a blocker, you can merge (so you can continue with the otel changes) and we can add this as a todo and do it in the near future.
| const ( | ||
| // MetricStateIdle indicates the connection is idle in the pool, | ||
| // ready to be acquired. | ||
| MetricStateIdle = "idle" | ||
|
|
||
| // MetricStateUsed indicates the connection is currently being used | ||
| // by a client operation. | ||
| MetricStateUsed = "used" | ||
| ) |
There was a problem hiding this comment.
can we use ConnState instead? It won't be a breaking change if we decide to do so in the future, since this is an internal packaged and even if exported, we can drop this and switch to ConnState


This commit addresses three issues in the redisotel-native metrics implementation:
Fix connection closure metrics not appearing in dashboards
Verify tracing and metrics compatibility
Enable all metric groups by default
Note
Medium Risk
Updates the
Pooler.CloseConnAPI and connection lifecycle paths to emit new close/state-change metrics, which could affect pool implementations and metric volume/cardinality. Also changes redisotel-native defaults and bumps OTel deps, so dashboards/alerting may see new or renamed series and higher emission rates.Overview
Fixes missing connection-close metrics and standardizes pool close semantics.
Pooler.CloseConnis changed to acceptctx, a closereason, andfromState, with new constants (CloseReason*,MetricState*) and updated call sites (stale/hook/auth failures) to ensuremetricConnectionClosedCallbackand state transitions are recorded.Makes redisotel-native more “on-by-default” for observability and adds coverage.
NewConfig()now defaultsMetricGroupstoMetricGroupAll, adds semconv metric-name assertions plus Redis-backed stress/compatibility tests, and updatesextra/redisotel-nativedependencies to OTelv1.40.0(includingsdk/metric).Written by Cursor Bugbot for commit dfbc5d7. This will update automatically on new commits. Configure here.