Fix replica can't finish failover when config epoch is outdated #2178

enjoy-binbin · 2025-06-05T15:43:52Z

When the primary changes the config epoch and then down immediately,
the replica may not update the config epoch in time. Although we will
broadcast the change in cluster (see #1813), there may be a race in
the network or in the code. In this case, the replica will never finish
the failover since other primaries will refuse to vote because the
replica's slot config epoch is old.

We need a way to allow the replica can finish the failover in this case.

When the primary refuses to vote because the replica's config epoch is
less than the dead primary's config epoch, it can send an UPDATE packet
to the replica to inform the replica about the dead primary. The UPDATE
message contains information about the dead primary's config epoch and
owned slots. The failover will time out, but later the replica can try
again with the updated config epoch and it can succeed.

Fixes #2169.

When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. Signed-off-by: Binbin <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]>

codecov · 2025-06-05T15:58:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.54%. Comparing base (0039adc) to head (d5bedc2).
⚠️ Report is 280 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2178      +/-   ##
============================================
+ Coverage     71.45%   71.54%   +0.08%     
============================================
  Files           122      122              
  Lines         66210    66442     +232     
============================================
+ Hits          47311    47535     +224     
- Misses        18899    18907       +8

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.72% <100.00%> (-0.12%)`	⬇️

... and 32 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

hpatro

Thanks for the PR.

Have an apprehension around large cluster setup. Shared couple of ideas in a comment.

src/cluster_legacy.c

zuiderkwast

Awesome work!

I think we can merge this because it fixes a problem the cluster can't recover from. We can backport this PR.

To address Hari's concern about a traffic storm, we can introduce a new message as a follow up, either an UPDATE with lightweight header or an AUTH_DENY lightweight message. But this would be only for future versions, not to be backported.

tests/unit/cluster/failover2.tcl

Signed-off-by: Binbin <[email protected]>

tests/unit/cluster/failover2.tcl

src/cluster_legacy.c

tests/unit/cluster/failover2.tcl

Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>

…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>

The new test was added in #2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]>

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]>

When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see #1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes #2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>

The new test was added in #2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]>

…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: chzhoo <[email protected]>

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]> Signed-off-by: chzhoo <[email protected]>

…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b)

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]> (cherry picked from commit 2019337)

…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b)

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]> (cherry picked from commit 2019337)

…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: shanwan1 <[email protected]>

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]> Signed-off-by: shanwan1 <[email protected]>

…ated (valkey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. Signed-off-by: Ran Shidlansik <[email protected]>

…dated (#2178) to 7.2 (#2232) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see #1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes #2169. --------- Signed-off-by: Ran Shidlansik <[email protected]>

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]>

…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b)

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]> (cherry picked from commit 2019337)

…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b) Signed-off-by: Viktor Söderqvist <[email protected]>

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]> (cherry picked from commit 2019337) Signed-off-by: Viktor Söderqvist <[email protected]>

…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b) Signed-off-by: Viktor Söderqvist <[email protected]>

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]> (cherry picked from commit 2019337) Signed-off-by: Viktor Söderqvist <[email protected]>

When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see #1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes #2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b) Signed-off-by: Viktor Söderqvist <[email protected]>

The new test was added in #2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]> (cherry picked from commit 2019337) Signed-off-by: Viktor Söderqvist <[email protected]>

…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b) Signed-off-by: Viktor Söderqvist <[email protected]>

The new test was added in valkey-io#2178, obviously there may be pending reads in the connection, so there may be a race in the DROP-CLUSTER-PACKET-FILTER part causing the test to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure that the replica does not process the packet. Signed-off-by: Binbin <[email protected]> (cherry picked from commit 2019337) Signed-off-by: Viktor Söderqvist <[email protected]>

enjoy-binbin requested review from PingXie, hpatro, madolson and zuiderkwast June 5, 2025 15:43

hpatro reviewed Jun 5, 2025

View reviewed changes

src/cluster_legacy.c Show resolved Hide resolved

zuiderkwast approved these changes Jun 6, 2025

View reviewed changes

tests/unit/cluster/failover2.tcl Show resolved Hide resolved

zuiderkwast mentioned this pull request Jun 6, 2025

8.1.2 release #2173

Merged

zuiderkwast changed the title ~~Fix replica can to finish failover when config epoch is outdated~~ Fix replica can't finish failover when config epoch is outdated Jun 6, 2025

hpatro added this to Valkey 8.1, Valkey 8.0 and Valkey 7.2 Jun 6, 2025

start_cluster comment around the proc

c7a5f74

Signed-off-by: Binbin <[email protected]>

madolson reviewed Jun 9, 2025

View reviewed changes

tests/unit/cluster/failover2.tcl Show resolved Hide resolved

hpatro approved these changes Jun 9, 2025

View reviewed changes

src/cluster_legacy.c Show resolved Hide resolved

madolson approved these changes Jun 9, 2025

View reviewed changes

tests/unit/cluster/failover2.tcl Outdated Show resolved Hide resolved

tests/unit/cluster/failover2.tcl Show resolved Hide resolved

Apply suggestions from code review

d5bedc2

Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>

hpatro merged commit 476671b into valkey-io:unstable Jun 10, 2025
52 checks passed

github-project-automation bot moved this to To be backported in Valkey 8.1 Jun 10, 2025

github-project-automation bot moved this to To be backported in Valkey 7.2 Jun 10, 2025

github-project-automation bot moved this to To be backported in Valkey 8.0 Jun 10, 2025

enjoy-binbin deleted the epoch_vote branch June 10, 2025 01:53

enjoy-binbin mentioned this pull request Jun 11, 2025

Add packet-drop to fix the new flaky failover test #2196

Merged

hpatro moved this from To be backported to 8.1.2 in Valkey 8.1 Jun 11, 2025

vitarb mentioned this pull request Jun 13, 2025

Fixes for Valkey 8.0.5 #2210

Merged

enjoy-binbin added the release-notes This issue should get a line item in the release notes label Jun 17, 2025

ranshid moved this from To be backported to In Progress in Valkey 7.2 Jun 18, 2025

zuiderkwast moved this from To be backported to 8.0.5 in Valkey 8.0 Aug 18, 2025

rainsupreme moved this from In Progress to 7.2.10 in Valkey 7.2 Sep 29, 2025

enjoy-binbin added the cluster label Nov 2, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix replica can't finish failover when config epoch is outdated #2178

Fix replica can't finish failover when config epoch is outdated #2178

Uh oh!

enjoy-binbin commented Jun 5, 2025

Uh oh!

codecov bot commented Jun 5, 2025 •

edited

Loading

Uh oh!

hpatro left a comment

Uh oh!

Uh oh!

zuiderkwast left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Fix replica can't finish failover when config epoch is outdated #2178

Fix replica can't finish failover when config epoch is outdated #2178

Uh oh!

Conversation

enjoy-binbin commented Jun 5, 2025

Uh oh!

codecov bot commented Jun 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

hpatro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codecov bot commented Jun 5, 2025 •

edited

Loading