Skip to content

fix: connection closure metrics and enable all metric groups by default#3735

Merged
ofekshenawa merged 9 commits intomasterfrom
fix/metrics-improvements
Mar 16, 2026
Merged

fix: connection closure metrics and enable all metric groups by default#3735
ofekshenawa merged 9 commits intomasterfrom
fix/metrics-improvements

Conversation

@ofekshenawa
Copy link
Copy Markdown
Collaborator

@ofekshenawa ofekshenawa commented Mar 9, 2026

This commit addresses three issues in the redisotel-native metrics implementation:

  1. Fix connection closure metrics not appearing in dashboards

    • Root cause: CloseConn() in internal/pool/pool.go was not calling the metricConnectionClosedCallback when closing stale connections
    • Added callback invocation in CloseConn() to record 'stale' closures
  2. Verify tracing and metrics compatibility

    • Added TestTracingAndMetricsCompatibility to document that redisotel (tracing via AddHook) and redisotel-native (metrics via SetOTelRecorder) can coexist without conflicts
  3. Enable all metric groups by default

    • Changed default MetricGroups from connection-basic+resiliency to MetricGroupAll for comprehensive observability out of the box

Note

Medium Risk
Updates the Pooler.CloseConn API and connection lifecycle paths to emit new close/state-change metrics, which could affect pool implementations and metric volume/cardinality. Also changes redisotel-native defaults and bumps OTel deps, so dashboards/alerting may see new or renamed series and higher emission rates.

Overview
Fixes missing connection-close metrics and standardizes pool close semantics. Pooler.CloseConn is changed to accept ctx, a close reason, and fromState, with new constants (CloseReason*, MetricState*) and updated call sites (stale/hook/auth failures) to ensure metricConnectionClosedCallback and state transitions are recorded.

Makes redisotel-native more “on-by-default” for observability and adds coverage. NewConfig() now defaults MetricGroups to MetricGroupAll, adds semconv metric-name assertions plus Redis-backed stress/compatibility tests, and updates extra/redisotel-native dependencies to OTel v1.40.0 (including sdk/metric).

Written by Cursor Bugbot for commit dfbc5d7. This will update automatically on new commits. Configure here.

@jit-ci
Copy link
Copy Markdown

jit-ci Bot commented Mar 9, 2026

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

@ofekshenawa ofekshenawa force-pushed the fix/metrics-improvements branch from 16563ce to 6cea2f0 Compare March 9, 2026 20:31
This commit addresses three issues in the redisotel-native metrics implementation:

1. Fix connection closure metrics not appearing in dashboards
   - Root cause: CloseConn() in internal/pool/pool.go was not calling
     the metricConnectionClosedCallback when closing stale connections
   - Added callback invocation in CloseConn() to record 'stale' closures

2. Verify tracing and metrics compatibility
   - Added TestTracingAndMetricsCompatibility to document that
     redisotel (tracing via AddHook) and redisotel-native (metrics via
     SetOTelRecorder) can coexist without conflicts

3. Enable all metric groups by default
   - Changed default MetricGroups from connection-basic+resiliency
     to MetricGroupAll for comprehensive observability out of the box

E2E validated: redis_client_connection_closed_total now appears in
Prometheus with reason='stale' when connections are closed.
@ofekshenawa ofekshenawa force-pushed the fix/metrics-improvements branch from 6cea2f0 to 9f26be2 Compare March 9, 2026 20:48
Upgrades OpenTelemetry Go SDK from v1.33.0 to v1.40.0 to fix a HIGH severity
security vulnerability (arbitrary code execution via PATH hijacking on macOS).

Changes:
- go.opentelemetry.io/otel: v1.33.0 -> v1.40.0
- go.opentelemetry.io/otel/metric: v1.33.0 -> v1.40.0
- go.opentelemetry.io/otel/sdk: v1.33.0 -> v1.40.0
- go.opentelemetry.io/otel/sdk/metric: v1.33.0 -> v1.40.0
- go.opentelemetry.io/otel/trace: v1.33.0 -> v1.40.0
- go.opentelemetry.io/auto/sdk: v1.1.0 -> v1.2.1 (transitive)
- github.com/go-logr/logr: v1.4.2 -> v1.4.3 (transitive)
- golang.org/x/sys: v0.30.0 -> v0.40.0 (transitive)

All tests pass with the upgraded dependencies.
@ofekshenawa ofekshenawa marked this pull request as ready for review March 9, 2026 21:15
Comment thread internal/pool/pool.go
…e metrics

Addresses the Medium Severity issue where connection closure reasons were
misattributed. The CloseConn method was hardcoding 'stale' as the reason
even when connections were closed due to hook errors.

Changes:
- Updated Pooler.CloseConn interface: CloseConn(*Conn) → CloseConn(*Conn, string)
- Updated ConnPool.CloseConn to accept and use the reason parameter
- Updated StickyConnPool.CloseConn to pass reason to underlying pool
- Updated SingleConnPool.CloseConn to pass reason to underlying pool
- Updated call sites with correct reasons:
  - Line 767: 'stale' for unhealthy/stale connections
  - Line 778: 'hook_error' for hook errors on idle connections
  - Line 827: 'hook_error' for hook errors on new connections
  - redis.go:350: 'auth_error' for authentication errors

This ensures the redis.client.connection.closed metric accurately reflects
the actual reason for connection closure, improving observability and
debugging capabilities.
Comment thread internal/pool/pool.go
- Add getMetricConnectionStateChangeCallback() call in CloseConn to record
  'idle' -> '' state transition when connections are closed, preventing
  connection count gauge drift

- Update CloseConn mock implementations to match new signature with reason
  parameter:
  - internal/auth/streaming/manager_test.go
  - maintnotifications/pool_hook_test.go

Fixes metric drift issue where closed connections weren't being decremented
from the db.client.connection.count UpDownCounter.
Comment thread internal/pool/pool.go
- Add fromState parameter to Pooler.CloseConn interface to explicitly specify
  the metric state the connection was in before closing
- Update all CloseConn implementations (ConnPool, SingleConnPool, StickyConnPool)
- Update all call sites with the correct fromState value:
  - pool.go:767 (stale): 'idle' - connection was popped from idle pool
  - pool.go:778 (hook_error): 'idle' - connection was popped from idle pool
  - pool.go:827 (hook_error on new): 'idle' - newConn recorded '' → 'idle'
  - redis.go:350 (auth_error): 'idle' - re-auth happens on idle connections
- Update test mocks to match new signature

This ensures accurate connection count metrics by making the metric state
explicit rather than relying on implicit assumptions. Addresses PR review
feedback about potential state drift in UpDownCounter metrics.
@ofekshenawa ofekshenawa requested a review from ndyakov March 9, 2026 22:52
Copy link
Copy Markdown
Member

@ndyakov ndyakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, left a suggestion.

Comment thread redis.go Outdated
// Re-auth happens on connections that were idle in the pool (the pool hook
// waits for IDLE state before transitioning to UNUSABLE for re-auth).
// From metrics perspective, the connection was never "used" by a client.
err := c.connPool.CloseConn(poolCn, "auth_error", "idle")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we extract the reasons as consts? also the states are defined as consts, you can use them.

Comment thread internal/pool/pool.go
…orrelation

- Update Pooler interface CloseConn signature to accept ctx as first parameter
- Update ConnPool.CloseConn to propagate context to metric callbacks
- Update SingleConnPool and StickyConnPool wrapper implementations
- Update all call sites to pass appropriate context
- Fix mock implementations in test files

This enables OpenTelemetry trace-to-metric correlation by ensuring the
caller's context (with active span) is passed to metric callbacks instead
of using context.Background().
Comment thread internal/pool/pool.go
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Comment thread internal/pool/pool.go
// Notify metrics: new connection is created and used
if cb := getMetricConnectionStateChangeCallback(); cb != nil {
cb(ctx, newcn, "", "used")
cb(ctx, newcn, "", MetricStateUsed)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong fromState causes idle gauge metric drift

High Severity

The state change callback at the success path for new connections uses "" as fromState, but newConn() already recorded a "" → MetricStateIdle transition at line 578. The correct fromState here is MetricStateIdle, not "". The error path at line 862 correctly uses MetricStateIdle (and the comment at line 861 explicitly documents this reasoning), but the success path doesn't apply the same logic. This causes the idle gauge to be permanently incremented by +1 for every new connection successfully acquired through Get(), since the "" → idle from newConn is never offset by an idle → used transition.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Member

@ndyakov ndyakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left one suggestion, not a blocker, you can merge (so you can continue with the otel changes) and we can add this as a todo and do it in the near future.

Comment thread internal/pool/pool.go
Comment on lines +38 to +46
const (
// MetricStateIdle indicates the connection is idle in the pool,
// ready to be acquired.
MetricStateIdle = "idle"

// MetricStateUsed indicates the connection is currently being used
// by a client operation.
MetricStateUsed = "used"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use ConnState instead? It won't be a breaking change if we decide to do so in the future, since this is an internal packaged and even if exported, we can drop this and switch to ConnState

@ofekshenawa ofekshenawa merged commit cd3949d into master Mar 16, 2026
78 of 81 checks passed
@ofekshenawa ofekshenawa deleted the fix/metrics-improvements branch March 16, 2026 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants