
Conversation

@enjoy-binbin
Member

@enjoy-binbin enjoy-binbin commented Jun 12, 2025

When myself is a replica, if we receive a message that my primary is
FAIL, we can try to set CLUSTER_TODO_HANDLE_FAILOVER so that we can
try to failover as soon as possible in beforeSleep without waiting
for clusterCron to kick in.

Add a new markNodeAsFailing method; we move the FAIL-flag-related code
there so we can easily check whether the failing node is my primary.

The following is all old content. It was deleted during the discussion, but the content is worth keeping. (Don't merge this text)

================================================================================

Remove the 500ms fixed delay for automatic failover for faster failover

An automatic failover has a fixed delay of 500ms, its main purpose is
to wait for the FAIL packet to propagate in the cluster so that other
primaries can respond to the vote request.

```
server.cluster->failover_auth_time =
now +
500 +           /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
random() % 500; /* Random delay between 0 and 500 milliseconds. */
```

A fixed delay of 500ms is a bit long nowadays, so we are looking into
removing it for a faster failover. If we can ensure it is safe to remove,
then we can remove it.

Currently a replica can only learn of the death of its primary
node in two ways:

  1. Collecting pfail messages from other primary nodes, and then checking
    in markNodeAsFailingIfNeeded to see if we have the quorum.
  2. Receiving FAIL messages from other primary nodes (or replica nodes).

Either way, the failing state is triggered by collecting failure reports
from primaries.

In markNodeAsFailingIfNeeded, if `nodeIsReplica(myself) && myself->replicaof == node`
is true, then we can try to trigger a failover in clusterBeforeSleep as soon as
possible without waiting for clusterCron. Moreover, we can remove the 500ms, because
we broadcast a FAIL here, which means that myself's FAILOVER_AUTH request will
definitely arrive after the FAIL, so we don't have to wait 500ms for the FAIL to propagate.

```
void markNodeAsFailingIfNeeded(clusterNode *node) {
    ...
    /* Broadcast the failing node name to everybody, forcing all the other
     * reachable nodes to flag the node as FAIL.
     * We do that even if this node is a replica and not a primary: anyway
     * the failing state is triggered collecting failure reports from primaries,
     * so here the replica is only helping propagating this status. */
    clusterSendFail(node->name);
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_SAVE_CONFIG);
```

Moreover, if we do the same when receiving a FAIL — we check whether the failing
node is myself's primary, and we call clusterSendFail to forward the FAIL — it
looks like we can also remove the 500ms delay there. It is somewhat like the first
case, in a degraded way: as if all the replicas got the pfail and then sent the fail.

```
    } else if (type == CLUSTERMSG_TYPE_FAIL) {
        clusterNode *failing;

        if (sender) {
            failing = clusterLookupNode(hdr->data.fail.about.nodename, CLUSTER_NAMELEN);
            if (failing && !(failing->flags & (CLUSTER_NODE_FAIL | CLUSTER_NODE_MYSELF))) {
                serverLog(LL_NOTICE, "FAIL message received from %.40s (%s) about %.40s (%s)", hdr->sender,
                          sender->human_nodename, hdr->data.fail.about.nodename, failing->human_nodename);
                failing->flags |= CLUSTER_NODE_FAIL;
                failing->fail_time = now;
                failing->flags &= ~CLUSTER_NODE_PFAIL;
                clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE);
            }
```

So in this commit, we try to remove the fixed delay of 500ms. For now, we
keep the random 500ms and use the ranking to avoid vote conflicts.

Related to #2023.


Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin added the major-decision-pending Major decision pending by TSC team label Jun 12, 2025
@codecov

codecov bot commented Jun 12, 2025

Codecov Report

Attention: Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.53%. Comparing base (2287261) to head (00d4b4c).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cluster_legacy.c | 90.90% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2209      +/-   ##
============================================
- Coverage     71.54%   71.53%   -0.01%     
============================================
  Files           122      122              
  Lines         66491    66501      +10     
============================================
+ Hits          47570    47573       +3     
- Misses        18921    18928       +7     
| Files with missing lines | Coverage Δ |
|---|---|
| src/cluster_legacy.c | 86.82% <90.90%> (+0.18%) ⬆️ |

... and 14 files with indirect coverage changes


Contributor

@zuiderkwast zuiderkwast left a comment


In the time between scheduling a failover and starting the vote, the replica sends its offset to the other replicas in the same shard, to make sure the nodes agree on the rank before the failover starts.

```
        /* Now that we have a scheduled election, broadcast our offset
         * to all the other replicas so that they'll updated their offsets
         * if our offset is better. */
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_REPLICAS);
```

Assuming the other replicas will also receive the FAIL, they will also broadcast their offset to other replicas in the same shard, so all replicas will have updated offset before any replica actually starts the vote.

This code updates the rank and the delay if some replica reports a better offset:

```
    /* It is possible that we received more updated offsets from other
     * replicas for the same primary since we computed our election delay.
     * Update the delay if our rank changed.
     *
     * It is also possible that we received the message that telling a
     * shard is up. Update the delay if our failed_primary_rank changed.
     *
     * Not performed if this is a manual failover. */
    if (server.cluster->failover_auth_sent == 0 && server.cluster->mf_end == 0) {
        int newrank = clusterGetReplicaRank();
        if (newrank != server.cluster->failover_auth_rank) {
            long long added_delay = (newrank - server.cluster->failover_auth_rank) * 1000;
            server.cluster->failover_auth_time += added_delay;
            server.cluster->failover_auth_rank = newrank;
            serverLog(LL_NOTICE, "Replica rank updated to #%d, added %lld milliseconds of delay.", newrank,
                      added_delay);
        }
```

With this PR, the delay can be near 0, so there is not enough time for the replicas to agree about the rank. In the worst case, two replicas both believe they are rank 0 and start the failover at the same time, and none of them will win the election, causing even more downtime before we have a successful failover.

Do we need a fixed delay for this case, or is there some other mechanism to avoid this race between replicas?

@madolson
Member

madolson commented Jun 12, 2025

Do we need a fixed delay for this case, or is there some other mechanism to avoid this race between replicas?

We will always race without a consensus algorithm like Raft or Paxos with the data flowing through it. It's just a question of what degree of raciness we are willing to tolerate.

We could wait for an affirmative ping from the other replicas that they believe their primary is failed (it will be in their gossip pfail) before starting to nominate ourselves if we have the highest offset or 500ms has elapsed.

Collaborator

@hpatro hpatro left a comment


My assumption from #2023 was that we wanted to make this delay dynamic w.r.t. cluster-node-timeout. I'm apprehensive about having a separate flow to handle faster failover.

@enjoy-binbin
Member Author

My assumption from #2023 was that we wanted to make this delay dynamic w.r.t. cluster-node-timeout. I'm apprehensive about having a separate flow to handle faster failover.

Yes, Viktor and I talked about this offline yesterday. I think we can do it with other fixed numbers, like the random() % 500, server.cluster->failover_auth_rank * 1000, server.cluster->failover_failed_primary_rank * 500, something like these.

And regarding this fixed 500ms delay, I wonder if the current mechanism will be safe enough without it; if we determine that it is indeed safe, then I would love to remove it.

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 13, 2025
@zuiderkwast
Contributor

We will always race without a consensus algorithm like raft or paxos with the data flowing through it. It's just what degree of raciness are we willing to tolerate.

@madolson Sure, but we try to make the race less likely to happen. We have rank and now also primary rank that helps avoid a race if two primaries fail at the same time.

We wait for an affirmative ping from the other replicas that they believe their primary is failed (it will be in their gossip pfail) before starting to nominate ourselves if we have the highest offset or 500ms has elapsed.

Do we do this? Whenever we get a pong from all replicas while waiting, we abort the 500ms wait and start the election immediately? I haven't seen this. Can you point to the code?

@hwware
Member

hwware commented Jun 13, 2025

My understanding of this requirement is to change the current 500 ms delay (the fixed value) to a config parameter and let the user decide the value (the default value stays at 500 ms). Added in the Valkey weekly meeting.

@madolson
Member

madolson commented Jun 13, 2025

Do we do this? Whenever we get a pong from all replicas while waiting, we abort the 500ms wait and start the election immediately? I haven't seen this. Can you point to the code?

No, I was proposing that was something we could do. I think I missed a word in my original comment.

We are waiting for the fail message from the other replicas to make sure we have their repl offsets to compute the rank. If a given replica has received that information from all other replicas, it doesn't need to wait.

@madolson
Member

We discussed this in the weekly meeting. The discussion was that we don't want to fully remove the 500ms logic, since we believe we do need to wait to get the messages from the other replicas to verify we have the best replication offset. We need some verification that we have the offsets from the other replicas before starting the election.

@enjoy-binbin
Member Author

enjoy-binbin commented Jun 16, 2025

btw, if we only have one replica, that means my rank is definitely 0, and my offset is definitely the highest one. The waiting is just a waste in this case.

@enjoy-binbin
Member Author

Assuming the comment is correct and we are only waiting for the FAIL to propagate, or we otherwise consider it safe, then I would like to remove it completely.

```
        server.cluster->failover_auth_time = now +
                                             500 +           /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
                                             random() % 500; /* Random delay between 0 and 500 milliseconds. */
```

Or there are other options:

  1. Make it a configuration item; an experienced administrator can use any value he wants, zero or 500ms. One can set it based on the network (if you have a small node timeout, you can change it) or the cluster size (if you only have one replica, you can just go to zero), etc. This works for everyone; the disadvantage is that a new configuration item needs to be introduced, which is not easy for ordinary people to use, but to some extent many of our configuration items rely on users understanding some details.
  2. zuiderkwast's idea: doing some math like node_timeout / 30. I am fine with that, but I should mention that we don't modify this timeout value often, so it is unlikely to be useful for us.

Anyway, bottom line: if we feel it is safe, then I want to remove it. If we are not sure, that's fine; after all, stability is also very important.

@zuiderkwast
Contributor

@enjoy-binbin we can do Madelyn's idea. We wait, but when we receive information from all replicas that they know the primary is failing and we get their offsets, we start immediately. It can be very fast.

@enjoy-binbin
Member Author

enjoy-binbin commented Jun 17, 2025

ok, that is a good idea and I think we are making progress step by step. It can definitely solve the one-replica case and help other cases; I will do it in a new PR.

new PR: #2227

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Jun 17, 2025
In valkey-io#2209, we are exploring ways to make failover faster, that is, to
minimize the delay.

The purpose of this delay is to allow us to have a scheduled election,
so that during the delay, there is a way for the cluster to get more
detailed information. For example, other primary nodes need to receive
the propagation of FAIL in order to vote, and we need to know the offset
of other replica nodes in order to rank.

We want to minimize the delay while ensuring safety. For example, if there
is only one replica, then we don't need any delay at all.

In this PR, when we find that the replica is the best ranked replica,
we let it initiate a failover immediately and completely remove the delay.

How to ensure safety?

1. We try to broadcast a FAIL message to the cluster before the replica
initiates the election, so that other primaries will not refuse to vote.

2. Each replica node will mark in its flags whether its primary node is in
FAIL state; that is, a new CLUSTER_NODE_MY_PRIMARY_FAIL flag is introduced.
And when we gossip with others, a replica can know whether all replicas
under the primary node have reached a consensus on the FAIL state based
on this flag. If they have reached a consensus, it means that we know the
offset information of all replicas, which means that we can calculate the
rank based on the existing information without worrying about the offsets
being outdated.

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin removed the major-decision-pending Major decision pending by TSC team label Jun 27, 2025
@enjoy-binbin enjoy-binbin changed the title Remove the 500ms fixed delay for automatic failover for faster failover Trigger the election as soon as possible when myself marked primary as FAIL Jun 27, 2025
@enjoy-binbin
Member Author

@zuiderkwast I reverted the 500ms change and only kept the TODO FAILOVER logic. (I guess we can ignore the DCO for now; I can squash a signoff when I merge it.)

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment


Change LGTM. Now it only skips the wait for the next clusterCron cycle.

I think you can change the title from "as soon as possible" to for example something like "Start failover without waiting for the next cluster cron cycle".

Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin changed the title Trigger the election as soon as possible when myself marked primary as FAIL Start failover without waiting for the next cluster cron cycle Jun 27, 2025
@enjoy-binbin enjoy-binbin merged commit 23fb9f2 into valkey-io:unstable Jun 27, 2025
51 of 58 checks passed
@enjoy-binbin enjoy-binbin deleted the faster_failover branch June 27, 2025 16:04
@hpatro
Collaborator

hpatro commented Jun 27, 2025

@enjoy-binbin @zuiderkwast Could you explain the rationale behind this change better?

I don't understand the benefit of this very clearly. Let me share my reasoning. I believe in a single-primary-failure scenario with just one replica this change could potentially help with faster failover; however, even that is not guaranteed if the healthy primaries are not yet aware of the failure of the primary for which the vote is requested.

Further, if there are multiple replicas in a given shard, it's very likely they would conflict with each other and split the votes amongst themselves, which isn't ideal. It's also possible that a replica with a lower offset could be elected.

And in a multiple-primary-failure scenario, we would again split the votes for a given epoch amongst different shards, wasting an election cycle and introducing delay.

I would also like the @valkey-io/core-team to share their opinion on this.

@zuiderkwast
Contributor

@hpatro This PR is now very minor. It doesn't remove the 500ms delay nor the random delay. It only removes the wait for the next cluster cron cycle, which I believe is not an intentional delay.

Maybe you want to discuss this one instead?


Labels

cluster run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
