Skip to content

fix: do not close sentinel when replica list is empty#3795

Open
shahyash2609 wants to merge 7 commits intoredis:masterfrom
shahyash2609:fix/sentinel-no-replica-rediscovery-loop
Open

fix: do not close sentinel when replica list is empty#3795
shahyash2609 wants to merge 7 commits intoredis:masterfrom
shahyash2609:fix/sentinel-no-replica-rediscovery-loop

Conversation

@shahyash2609
Copy link
Copy Markdown

@shahyash2609 shahyash2609 commented Apr 27, 2026

Summary

  • Bug: replicaAddrss() in sentinel.go calls closeSentinel() when getReplicaAddrs returns an empty list without error. This tears down the sentinel connection (and its +switch-master pub/sub subscription), forcing a full sentinel rediscovery on every subsequent operation.
  • Impact: On master-only sentinel setups (no replicas), this creates a continuous rediscovery loop — every dial triggers concurrent queries to all sentinel nodes, flooding logs and adding unnecessary latency.
  • Fix: Return []string{}, nil instead of calling closeSentinel() when replicas are empty but there's no error. An empty replica list is a valid, stable state for master-only deployments. The sentinel connection must be preserved for ongoing master discovery and failover monitoring.

Root cause

In sentinel.go, function replicaAddrss():

} else {
    // No error and no replicas.
    _ = c.closeSentinel()  // ← BUG: destroys sentinel for a valid state
}

After closeSentinel(), c.sentinel is nil and c.pubsub is nil. The next call to MasterAddr() must perform full sentinel discovery (querying all sentinel nodes concurrently) because there's no cached sentinel client to use.

With replicas present this code path is never reached — replicaAddrs() returns the cached list and does an async refresh. The bug only manifests on master-only clusters.

Changes

File Change
sentinel.go Replace closeSentinel() with return []string{}, nil in the zero-replica branch
export_test.go Add test helpers to inspect sentinelFailover internal state
sentinel_test.go Add TestSentinelFailover_ReplicaAddrs_NoReplicas verifying sentinel survives repeated calls with zero replicas

Test plan

  • TestSentinelFailover_ReplicaAddrs_NoReplicas — confirms sentinel stays alive after ReplicaAddrs returns empty list
  • Existing sentinel tests continue to pass (no behavioral change for setups with replicas)
  • Verified in production at Meesho on master-only sentinel clusters — rediscovery loop eliminated

🤖 Generated with Claude Code


Note

Medium Risk
Touches Sentinel failover connection lifecycle logic; while narrowly scoped, a mistake could cause stale Sentinel state or unnecessary reconnects in production.

Overview
Prevents sentinelFailover.replicaAddrs from calling closeSentinel() when Sentinel returns no replicas and no error, returning an empty list instead to preserve the Sentinel client/pubsub subscription for master discovery and failover monitoring.

Adds test-only helpers (HasSentinel, NewTestSentinelFailover, ReplicaAddrs) and a new unit/integration test TestSentinelFailover_ReplicaAddrs_NoReplicas that provisions a temporary master-only Sentinel monitor and asserts the Sentinel connection remains alive across repeated ReplicaAddrs calls.

Reviewed by Cursor Bugbot for commit 318995c. Bugbot is set up for automated code reviews on this repo. Configure here.

When a sentinel-monitored Redis master has no replicas (master-only
setup), replicaAddrss() was calling closeSentinel() on receiving an
empty replica list without error. This destroyed the sentinel
connection (and its pub/sub failover subscription), forcing a full
sentinel rediscovery on every subsequent operation — creating a
continuous rediscovery loop that flooded logs and added latency.

The empty-replica case is a valid, stable state for master-only
deployments. The sentinel connection must be preserved for ongoing
master discovery and failover monitoring via pub/sub.

The fix returns early with an empty slice instead of tearing down
the sentinel. A test is added to verify the sentinel survives
repeated ReplicaAddrs calls when no replicas exist.

Fixes: sentinel rediscovery loop on master-only clusters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jit-ci
Copy link
Copy Markdown

jit-ci Bot commented Apr 27, 2026

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.

Questions? Comments? Want to learn more? Get in touch with us.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4620700cac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread sentinel.go
// No error and no replicas — valid steady state for master-only setups.
// Do NOT close sentinel; it is still needed for master discovery
// and failover pub/sub monitoring.
return []string{}, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep disconnected-replica fallback when no healthy replicas

Returning immediately on the zero-replica branch bypasses the useDisconnected path in sentinelFailover.replicaAddrs, so when FailoverOptions.UseDisconnectedReplicas is enabled and the cached sentinel reports only disconnected replicas, RandomReplicaAddr can no longer discover those replicas and will route reads to the master instead. This is a regression in failover behavior under degraded replication states because the second call (replicaAddrs(ctx, true)) now exits before reaching the loop that applies parseReplicaAddrs(..., useDisconnected).

Useful? React with 👍 / 👎.

Comment thread export_test.go Outdated
Comment thread sentinel.go
Address review feedback: the early return for zero replicas must only
apply when useDisconnected=false (the normal path). When
useDisconnected=true, fall through to the discovery loop which
correctly passes useDisconnected to parseReplicaAddrs — since
getReplicaAddrs hardcodes false, the cached sentinel path can never
find disconnected replicas.

Also remove unused SetSentinelForTest helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread sentinel_test.go
shahyash2609 and others added 2 commits April 27, 2026 22:20
Address review feedback: the previous test used sentinelName which
has two replicas in the test environment, so getReplicaAddrs returned
a non-empty list and the zero-replica code path was never exercised.

Now the test registers a temporary master-only name via SENTINEL
MONITOR (pointing to the existing master port but with a new name
and no replicas), verifies that ReplicaAddrs returns empty without
closing the sentinel, and cleans up via SENTINEL REMOVE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a connectivity check at the start of the test so it is skipped
(not failed) in environments where sentinel processes are not running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit a95c9f1. Configure here.

Comment thread sentinel_test.go Outdated
shahyash2609 and others added 3 commits April 27, 2026 23:43
Point SENTINEL MONITOR at the standalone Redis instance (redisPort)
instead of sentinelMasterPort, which already has two replicas. This
prevents Sentinel from discovering replicas via INFO REPLICATION and
ensures the test reliably exercises the zero-replica code path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shahyash2609
Copy link
Copy Markdown
Author

@ofekshenawa @ndyakov Can you please review this PR?

@ndyakov
Copy link
Copy Markdown
Member

ndyakov commented Apr 29, 2026

@shahyash2609 thank you for pinging us, will check it as soon as possible.

@shahyash2609
Copy link
Copy Markdown
Author

Hi @ndyakov, Hope you are doing well.
Can you pleaes review this once you get time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants