@enjoy-binbin enjoy-binbin commented Jun 17, 2025

In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is,
to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a delay of
500-1000ms. The original purpose of this delay:

  1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure
    they will vote when a replica sends failover auth request.
  2. Allow replicas to exchange their offsets, so they will have a correct view of
    their own rank.

We want to minimize this delay while ensuring safety. It is useful for example in
these cases:

  1. If there is only one replica, then we don't need any delay, or
  2. If there are more replicas, with a fast network, the replicas can exchange the
    offsets very quickly and start the failover within a few milliseconds instead
    of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked replica, we let
it initiate a failover immediately and completely remove the delay.

How to ensure safety?

  1. To make sure this replica has the best rank, it only skips the delay if it is
    sure that it has the best rank and that all replicas in the same shard agree that
    the primary is failing. A new flag CLUSTER_NODE_MY_PRIMARY_FAIL is introduced to
    indicate that each replica has marked its primary as FAIL. If all replicas say that
    the primary is failing, we also know that the offsets will not change, because the
    offsets stop incrementing when the primary is failing. We can skip the delay only
    if we have received a message from all replicas and they all have set this flag.

  2. To make sure the primaries will vote even if they didn't receive the FAIL yet,
    we use the CLUSTERMSG_FLAG0_FORCEACK to make sure they will vote. This is equivalent
    to broadcasting a FAIL message to all primaries before we broadcast the failover
    auth request (but cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated
in the following sequence diagram:

A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|

Details

This is how the failover is initiated, with new steps marked with (new):

  1. A majority of primaries have marked another primary as PFAIL.
  2. Some node counts the failure reports and marks the failing primary as FAIL. The node
    that detects FAIL broadcasts it to all nodes in the cluster.
  3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it
    schedules a failover:
    a. It sets a timeout (500ms + random 0-500ms).
    b. It broadcasts pong to the other replicas in the same shard.
    c. (new) The pong (actually the clusterMsg header) has a new flag CLUSTER_NODE_MY_PRIMARY_FAIL.
    When the replicas broadcast pong to each other here, this flag is set.
  4. (new) When the following conditions are met, skip the remaining delay and start
    the failover using AUTH REQUEST with the FORCE ACK flag set, that is if
    a. a PONG is received from every other replica in the same shard (broadcast within
    the shard) and
    b. all replicas have marked that its primary is FAIL in their last message (the new
    CLUSTER_NODE_MY_PRIMARY_FAIL flag is set) and
    c. this is the best replica (rank = 0) and
    d. my replication offset != 0.
  5. When the delay has passed and no other replica has initiated failover, then initiate failover.

Notes:

  • With the FORCE ACK flag in step 4, we don't need to wait for FAIL to propagate to all voting primaries.
    At this point, a FAIL has already been broadcast by some node, but there is a race
    so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag
    ensures the primaries will vote for us. (It is equivalent to broadcasting another FAIL
    just before broadcasting auth request.)
  • 4(b) ensures that we have received the replication offset from all other replicas and
    that it's up to date. If a replica says that its primary is failing, it also means that
    the replication from the primary to that replica has stopped.
  • 4(d) is to avoid a special bad case. It can happen that not all replicas know about each
    other. In this case, two replicas can think they are both the best replica and start the
    failover at the same time. This can already happen without this PR. When it happens, it
    usually means that a new replica has just joined and it has no data (offset = 0) and if
    it wins the election, there is a problem of data loss (discussed and partially mitigated
    for the replica migration case in Fix data loss when replica do a failover with a old history repl offset #885). To avoid this case, skip this fast failover path
    if the replica has offset = 0.

In valkey-io#2209, we are exploring ways to make failover faster, that is, to
minimize the delay.

The purpose of this delay is to allow us to have a scheduled election,
so that during the delay, there is a way for the cluster to get more
detailed information. For example, other primary nodes need to receive
the propagation of FAIL in order to vote, and we need to know the offset
of other replica nodes in order to rank.

We want to minimize delay while ensuring safety. It is very useful: for
example, if there is only one replica, then we don't need any delay.

In this PR, when we find that the replica is the best ranked replica,
we let it initiate a failover immediately and completely remove the delay.

How to ensure safety?

1. We try to broadcast a FAIL message to the cluster before the replica
initiates the election, so that other primaries will not refuse to vote.

2. Each replica node will mark in its flags whether its primary node is in
FAIL state, that is, a new flag CLUSTER_NODE_MY_PRIMARY_FAIL is introduced.
When we gossip with others, a replica can know, based on this flag, whether
all replicas under the primary node have reached a consensus on the FAIL
state. If they have reached a consensus, it means that we know the offset
information of all replicas, so we can compute the rank from the existing
information without worrying about the offsets being outdated.

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 17, 2025
codecov bot commented Jun 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.48%. Comparing base (e9edba4) to head (2c0b049).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2227      +/-   ##
============================================
+ Coverage     71.36%   71.48%   +0.11%     
============================================
  Files           123      123              
  Lines         67094    67121      +27     
============================================
+ Hits          47884    47980      +96     
+ Misses        19210    19141      -69     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.93% <100.00%> (+0.21%) ⬆️

... and 13 files with indirect coverage changes


@zuiderkwast zuiderkwast left a comment


Good job!

I have some ideas. See comments.

Some test case would be nice to see, to check that we can do a failover faster than the delay that we skip. For example with node timeout 2000, check that we can do failover in 2500 ms? I guess a test like this can be flaky but at least try it to see if the feature works, would be good.

I'm a little confused about CLUSTER_TODO_HANDLE_FASTER_FAILOVER. Why do we need it? Why is it different to CLUSTER_TODO_HANDLE_FAILOVER, when can we use it, when can we not use it, etc. It is not clear to me.

IIUC, it's not really needed for the feature (skip delay if clusterAllReplicasThinkPrimaryIsFail() returns true and my rank is 0), is it? If it's a separate optimization, then I guess we skip the FASTER_FAILOVER idea in this PR, to keep it simpler. WDYT?


enjoy-binbin commented Jun 18, 2025

Some test case would be nice to see, to check that we can do a failover faster than the delay that we skip. For example with node timeout 2000, check that we can do failover in 2500 ms? I guess a test like this can be flaky but at least try it to see if the feature works, would be good.

Yes, I will add the test later. I tested it locally yesterday and it works.

I'm a little confused about CLUSTER_TODO_HANDLE_FASTER_FAILOVER. Why do we need it? Why is it different to CLUSTER_TODO_HANDLE_FAILOVER, when can we use it, when can we not use it, etc. It is not clear to me.

Yes, it is indeed a bit confusing, sorry I did not mention it. I want to put CLUSTER_TODO_HANDLE_FASTER_FAILOVER after the other flags, like after CLUSTER_TODO_UPDATE_STATE, so that we will first print CLUSTER IS DOWN and then try to trigger the failover. That way we trigger the failover faster; in the past, we would update the cluster state and then try to failover in clusterCron.

IIUC, it's not really needed for the feature (skip delay if clusterAllReplicasThinkPrimaryIsFail() returns true and my rank is 0), is it? If it's a separate optimization, then I guess we skip the FASTER_FAILOVER idea in this PR, to keep it simpler. WDYT?

Sure, I can remove FASTER_FAILOVER and then only put it in #2209

@zuiderkwast

@hpatro I have discussed with @enjoy-binbin and updated the top comment with a more thorough reasoning about the possible scenarios. PTAL.

The special case that a new replica joins and doesn't know about the other replicas is a serious problem, but it's not new. It's not solved here but we can avoid making it worse. (I hope we can solve it in another PR, to make sure a replica is made aware of the other replicas in the shard ASAP, for example when it starts replicating.)

@enjoy-binbin

Let's also include a check that myself's replication offset != 0 here?

It makes sure that if myself is a new replica that just connected and doesn't yet know about the other replicas, we don't try to do this fast failover.

The special case that a new replica joins and doesn't know about the other replicas is a serious problem, but it's not new. It's not solved here but we can avoid making it worse. (I hope we can solve it in another PR, to make sure a replica is made aware of the other replicas in the shard ASAP, for example it starts replicating.)

Ok, I pushed the new commit. Sorry for the late response, I somehow forgot it and thought we were waiting for a response from @hpatro

@zuiderkwast

@madolson This is based on your idea to keep track of other replicas' view of the primary's fail status. Great if you can take a look.

@hpatro hpatro left a comment


Haven't taken a complete look.

@enjoy-binbin Could you please share whether we still continue with the primary rank logic and don't bypass it with the fast failover logic?

@enjoy-binbin

Could you please share whether we still continue with the primary rank logic and don't bypass it with the fast failover logic?

What do you mean? Sorry, I don't quite get the question.


@zuiderkwast zuiderkwast left a comment


I would like to have this in 9.0. I hope others will agree. Maybe we can merge it after RC1.

@zuiderkwast zuiderkwast moved this to Optional for next release candidate in Valkey 9.0 Aug 4, 2025
madolson commented Aug 6, 2025

I would like to pause this for 9.0. We have seen a lot of cluster instability in AWS after merging 8.1 (much worse than in any past release), and I don't think we have the testing in place for all of these optimizations to make sure we aren't ending up in odd states.

@madolson madolson moved this from Optional for next release candidate to Todo in Valkey 9.0 Aug 11, 2025
zuiderkwast added a commit that referenced this pull request Aug 18, 2025
In clusters with a very short node timeout such as 2-3 seconds, the
extra failover delay of 500-1000 milliseconds (500 + random value 0-500;
total 750 on average) before initiating a failover is a significant
extra downtime to the cluster. This PR makes this delay relative to node
timeout, using a shorter failover delay for a smaller configured node
timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.

| Node timeout  | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500          | 250                  |
| 3000          | 100                  |
| 1500          | 50                   |

Additional change: Add an extra 500ms delay to new replicas that may not
yet know about the other replicas. This avoids the scenario where a new
replica with no data wins the failover. This change turned out to be
needed for the stability of some test cases.

The purposes of the failover delay are

1. Allow FAIL to propagate to the voting primaries in the cluster
2. Allow replicas to exchange their offsets, so they will have a correct
view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added
replicas to discover other replicas in the cluster via gossip and to
compute their rank, and realize they are not the best replica. This case
is mitigated by adding another 500ms delay to new replicas, i.e. if it
has replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume
that the above needs less time than in a cluster with a higher node
timeout.

These delays don't affect the correctness of the algorithm. They are
just there to increase the probability that a failover will succeed, by
making sure that the FAIL message has enough time to propagate in the
cluster; the random part reduces the probability that two replicas
initiate the failover at the same time.

The typical use case is when data consistency matters and writes can't
be skipped. For example, in some application, we buffer writes in the
application during node failures to be able to apply them when the
failover is completed. The application can't buffer them for a very long
time, so we need the cluster to be up again within e.g. 5 seconds from
the time a node starts to fail.

I hope this PR can be considered safer than #2227, although the two
changes are orthogonal.

Part of issue #2023.

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
asagege pushed a commit to asagege/valkey that referenced this pull request Aug 19, 2025
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request Sep 19, 2025
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025