@ranshid ranshid commented Jun 18, 2025

When the primary changes its config epoch and then goes down immediately,
the replica may not learn the new config epoch in time. Although we
broadcast the change in the cluster (see #1813), there may still be a race
in the network or in the code. In this case, the replica will never finish
the failover, because other primaries will refuse to vote for it: the
replica's slot config epoch is stale.

We need a way to let the replica finish the failover in this case.

When a primary refuses to vote because the replica's config epoch is
less than the dead primary's config epoch, it can send an UPDATE packet
to the replica to inform it about the dead primary. The UPDATE message
carries the dead primary's config epoch and owned slots. The current
failover attempt will time out, but the replica can later retry with the
updated config epoch and succeed.

Fixes #2169.

…ated (valkey-io#2178)

Signed-off-by: Ran Shidlansik <[email protected]>
@ranshid ranshid requested a review from enjoy-binbin June 18, 2025 07:55
@enjoy-binbin

The test fails because of other changes around the test suite; see the top comment in #2210.
We decided to drop the test to avoid a major backport, unless other changes rely on it.

@ranshid ranshid merged commit 525551a into valkey-io:7.2 Jun 18, 2025
15 checks passed
@codecov
codecov bot commented Jun 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (5dc6632) to head (baeafc4).
Report is 1 commits behind head on 7.2.
