@enjoy-binbin enjoy-binbin commented Jun 17, 2025

In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is,
to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a delay of
500-1000ms. The original purpose of this delay:

  1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure
    they will vote when a replica sends failover auth request.
  2. Allow replicas to exchange their offsets, so they will have a correct view of
    their own rank.

We want to minimize this delay while ensuring safety. It is useful for example in
these cases:

  1. If there is only one replica, then we don't need any delay, or
  2. If there are more replicas, with a fast network, the replicas can exchange the
    offsets very quickly and start the failover within a few milliseconds instead
    of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked replica, we let
it initiate a failover immediately and completely remove the delay.

How to ensure safety?

  1. To make sure this replica has the best rank, it only skips the delay if it is
    sure that it has the best rank and that all replicas in the same shard agree that
    the primary is failing. A new flag CLUSTER_NODE_MY_PRIMARY_FAIL is introduced to
    indicate that each replica has marked its primary as FAIL. If all replicas say that
    the primary is failing, we also know that the offsets will not change, because the
    offsets stop incrementing when the primary is failing. We can skip the delay only
    if we have received a message from all replicas and they all have set this flag.

  2. To make sure the primaries will vote even if they didn't receive the FAIL yet,
    we use the CLUSTERMSG_FLAG0_FORCEACK to make sure they will vote. This is equivalent
    to broadcasting a FAIL message to all primaries before we broadcast the failover
    auth request (but cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated
in the following sequence diagram:

A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|

Details

This is how the failover is initiated, with new steps marked with (new):

  1. A majority of primaries have marked another primary as PFAIL.
  2. Some node counts the failure reports and marks the failing primary as FAIL. The node
    that detects FAIL broadcasts it to all nodes in the cluster.
  3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it
    schedules a failover:
    a. It sets a timeout (500ms + random 0-500ms).
    b. It broadcasts pong to the other replicas in the same shard.
    c. (new) The pong (actually the clusterMsg header) has a new flag CLUSTER_NODE_MY_PRIMARY_FAIL.
    When the replicas broadcast pong to each other here, this flag is set.
  4. (new) When the following conditions are met, skip the remaining delay and start
    the failover using AUTH REQUEST with the FORCE ACK flag set, that is if
    a. a PONG is received from every other replica in the same shard (broadcast within
    the shard) and
    b. all replicas have marked that its primary is FAIL in their last message (the new
    CLUSTER_NODE_MY_PRIMARY_FAIL flag is set) and
    c. this is the best replica (rank = 0) and
    d. my replication offset != 0.
  5. When the delay has passed and no other replica has initiated failover, then initiate failover.

Notes:

  • With the FORCE ACK flag in step 4, we don't need to wait for FAIL to propagate to all voting primaries.
    At this point, a FAIL has already been broadcast by some node, but there is a race
    so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag
    ensures the primaries will vote for us. (It is equivalent to broadcasting another FAIL
    just before broadcasting auth request.)
  • 4(b) ensures that we have received the replication offset from all other replicas and
    that it's up to date. If a replica says that its primary is failing, it also means that
    the replication from the primary to that replica has stopped.
  • 4(d) is to avoid a special bad case. It can happen that not all replicas know about each
    other. In this case, two replicas can think they are both the best replica and start the
    failover at the same time. This can already happen without this PR. When it happens, it
    usually means that a new replica has just joined and it has no data (offset = 0) and if
    it wins the election, there is a problem of data loss (discussed and partially mitigated
    for the replica migration case in Fix data loss when replica do a failover with a old history repl offset #885). To avoid this case, skip this fast failover path
    if the replica has offset = 0.

In valkey-io#2209, we are exploring ways to make failover faster, that is, to
minimize the delay.

The purpose of this delay is to allow us to have a scheduled election,
so that during the delay, there is a way for the cluster to get more
detailed information. For example, other primary nodes need to receive
the propagation of FAIL in order to vote, and we need to know the offset
of other replica nodes in order to rank.

We want to minimize delay while ensuring safety. It is very useful: for
example, if there is only one replica, then we don't need any delay.

In this PR, when we find that the replica is the best ranked replica,
we let it initiate a failover immediately and completely remove the delay.

How to ensure safety?

1. We try to broadcast a FAIL message to the cluster before the replica
initiates the election, so that other primaries will not refuse to vote.

2. Each replica node will mark in its flags whether its primary node is in
FAIL state, that is, a new flag CLUSTER_NODE_MY_PRIMARY_FAIL is introduced.
When we gossip with others, a replica can know, based on this flag, whether
all replicas under the primary node have reached a consensus on the FAIL
state. If they have reached a consensus, it means that we know the offset
information of all replicas, so we can compute the rank from the existing
information without worrying about the offsets being outdated.

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 17, 2025
codecov bot commented Jun 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.48%. Comparing base (e9edba4) to head (2c0b049).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2227      +/-   ##
============================================
+ Coverage     71.36%   71.48%   +0.11%     
============================================
  Files           123      123              
  Lines         67094    67121      +27     
============================================
+ Hits          47884    47980      +96     
+ Misses        19210    19141      -69     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.93% <100.00%> (+0.21%) ⬆️

... and 13 files with indirect coverage changes


@zuiderkwast zuiderkwast left a comment


Good job!

I have some ideas. See comments.

Some test case would be nice to see, to check that we can do a failover faster than the delay that we skip. For example with node timeout 2000, check that we can do failover in 2500 ms? I guess a test like this can be flaky but at least try it to see if the feature works, would be good.

I'm a little confused about CLUSTER_TODO_HANDLE_FASTER_FAILOVER. Why do we need it? Why is it different to CLUSTER_TODO_HANDLE_FAILOVER, when can we use it, when can we not use it, etc. It is not clear to me.

IIUC, it's not really needed for the feature (skip delay if clusterAllReplicasThinkPrimaryIsFail() returns true and my rank is 0), is it? If it's a separate optimization, then I guess we skip the FASTER_FAILOVER idea in this PR, to keep it simpler. WDYT?


enjoy-binbin commented Jun 18, 2025

Some test case would be nice to see, to check that we can do a failover faster than the delay that we skip. For example with node timeout 2000, check that we can do failover in 2500 ms? I guess a test like this can be flaky but at least try it to see if the feature works, would be good.

Yes, I will add the test later. I tested it locally yesterday and it works.

I'm a little confused about CLUSTER_TODO_HANDLE_FASTER_FAILOVER. Why do we need it? Why is it different to CLUSTER_TODO_HANDLE_FAILOVER, when can we use it, when can we not use it, etc. It is not clear to me.

Yes, it is indeed a bit confusing, sorry I did not mention it. I want to put CLUSTER_TODO_HANDLE_FASTER_FAILOVER after the other flags, like after CLUSTER_TODO_UPDATE_STATE, so that we will first print CLUSTER IS DOWN and then try to trigger the failover. That way we trigger the failover faster; in the past, we would update the cluster state and then try to failover in clusterCron.

IIUC, it's not really needed for the feature (skip delay if clusterAllReplicasThinkPrimaryIsFail() returns true and my rank is 0), is it? If it's a separate optimization, then I guess we skip the FASTER_FAILOVER idea in this PR, to keep it simpler. WDYT?

Sure, I can remove FASTER_FAILOVER and then only put it in #2209

@zuiderkwast

@hpatro I have discussed with @enjoy-binbin and updated the top comment with a more thorough reasoning about the possible scenarios. PTAL.

The special case that a new replica joins and doesn't know about the other replicas is a serious problem, but it's not new. It's not solved here but we can avoid making it worse. (I hope we can solve it in another PR, to make sure a replica is made aware of the other replicas in the shard ASAP, for example when it starts replicating.)

@enjoy-binbin

Let's also include a check that myself's replication offset != 0 here?

It makes sure that if myself is a new replica that just connected and doesn't yet know about the other replicas, we don't try to do this fast failover.

The special case that a new replica joins and doesn't know about the other replicas is a serious problem, but it's not new. It's not solved here but we can avoid making it worse. (I hope we can solve it in another PR, to make sure a replica is made aware of the other replicas in the shard ASAP, for example it starts replicating.)

Ok, I pushed the new commit. Sorry for the late response, I somehow forgot it and thought we were waiting for a response from @hpatro

@zuiderkwast

@madolson This is based on your idea to keep track of other replicas' view of the primary's fail status. Great if you can take a look.

@hpatro hpatro left a comment


Haven't taken a complete look.

@enjoy-binbin Could you please share whether we still continue with the primary rank logic and don't bypass it with the fast failover logic?

@enjoy-binbin

Could you please share whether we still continue with the primary rank logic and don't bypass it with the fast failover logic?

What do you mean? Sorry, I don't quite get the question.


@zuiderkwast zuiderkwast left a comment


I would like to have this in 9.0. I hope others will agree. Maybe we can merge it after RC1.

@zuiderkwast zuiderkwast moved this to Optional for next release candidate in Valkey 9.0 Aug 4, 2025
madolson commented Aug 6, 2025

I would like to pause this for 9.0. We have seen a lot of cluster instability in AWS after merging 8.1 (much worse than in any past release), and I don't think we have the testing in place for all of these optimizations to make sure we aren't ending up in odd states.

@madolson madolson moved this from Optional for next release candidate to Todo in Valkey 9.0 Aug 11, 2025
zuiderkwast added a commit that referenced this pull request Aug 18, 2025
In clusters with a very short node timeout such as 2-3 seconds, the
extra failover delay of 500-1000 milliseconds (500 + random value 0-500;
total 750 on average) before initiating a failover is a significant
extra downtime to the cluster. This PR makes this delay relative to node
timeout, using a shorter failover delay for a smaller configured node
timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.

| Node timeout  | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500          | 250                  |
| 3000          | 100                  |
| 1500          | 50                   |

Additional change: Add an extra 500ms delay to new replicas that may not
yet know about the other replicas. This avoids the scenario where a new
replica with no data wins the failover. This change turned out to be
needed for the stability of some test cases.

The purposes of the failover delay are

1. Allow FAIL to propagate to the voting primaries in the cluster
2. Allow replicas to exchange their offsets, so they will have a correct
view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added
replicas to discover other replicas in the cluster via gossip and to
compute their rank, and realize they are not the best replica. This case
is mitigated by adding another 500ms delay to new replicas, i.e. if it
has replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume
that the above needs less time than in a cluster with a higher node
timeout.

These delays don't affect the correctness of the algorithm. They are
just there to increase the probability that a failover will succeed, by
making sure that the FAIL message has enough time to propagate in the
cluster; the random part reduces the probability that two replicas
initiate the failover at the same time.

The typical use case is when data consistency matters and writes can't
be skipped. For example, in some application, we buffer writes in the
application during node failures to be able to apply them when the
failover is completed. The application can't buffer them for a very long
time, so we need the cluster to be up again within e.g. 5 seconds from
the time a node starts to fail.

I hope this PR can be considered safer than #2227, although the two
changes are orthogonal.

Part of issue #2023.

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
asagege pushed a commit to asagege/valkey that referenced this pull request Aug 19, 2025
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request Sep 19, 2025
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025