Start failover without waiting for the next cluster cron cycle #2209
Conversation
An automatic failover has a fixed delay of 500ms. Its main purpose is
to let the FAIL packet propagate through the cluster so that other
primaries can respond to the vote request.
```
server.cluster->failover_auth_time =
    now +
    500 +           /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
    random() % 500; /* Random delay between 0 and 500 milliseconds. */
```
A fixed delay of 500ms is rather long nowadays, so we are looking into
removing it for a faster failover. If we can ensure it is safe to remove,
then we can remove it.
Currently a replica can only learn of the death of its primary
node from two places:
1. Collecting PFAIL messages from other primary nodes, then checking
in markNodeAsFailingIfNeeded whether we have a quorum.
2. Receiving FAIL messages from other primary nodes (or replica nodes).
Either way, the failing state is triggered by collecting failure reports
from primaries.
In markNodeAsFailingIfNeeded, if `nodeIsReplica(myself) && myself->replicaof == node`
is true, then we can trigger a failover in clusterBeforeSleep as soon as
possible, without waiting for clusterCron. Moreover, we can remove the 500ms,
because we broadcast a FAIL here, which means that myself's FAILOVER_AUTH request
will definitely arrive after the FAIL, so we don't have to wait 500ms for the FAIL
to propagate.
```
void markNodeAsFailingIfNeeded(clusterNode *node) {
    ...
    /* Broadcast the failing node name to everybody, forcing all the other
     * reachable nodes to flag the node as FAIL.
     * We do that even if this node is a replica and not a primary: anyway
     * the failing state is triggered collecting failure reports from primaries,
     * so here the replica is only helping propagating this status. */
    clusterSendFail(node->name);
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_SAVE_CONFIG);
```
Moreover, if we do the same when receiving a FAIL message (check whether the
failing node is myself's primary, and call clusterSendFail to forward the FAIL),
it looks like we can remove the 500ms delay in that path as well. It is essentially
the first case in a degraded form, as if all the replicas collected the PFAIL
reports and then sent the FAIL themselves.
```
} else if (type == CLUSTERMSG_TYPE_FAIL) {
    clusterNode *failing;

    if (sender) {
        failing = clusterLookupNode(hdr->data.fail.about.nodename, CLUSTER_NAMELEN);
        if (failing && !(failing->flags & (CLUSTER_NODE_FAIL | CLUSTER_NODE_MYSELF))) {
            serverLog(LL_NOTICE, "FAIL message received from %.40s (%s) about %.40s (%s)", hdr->sender,
                      sender->human_nodename, hdr->data.fail.about.nodename, failing->human_nodename);
            failing->flags |= CLUSTER_NODE_FAIL;
            failing->fail_time = now;
            failing->flags &= ~CLUSTER_NODE_PFAIL;
            clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE);
        }
```
So in this commit, we try to remove the fixed delay of 500ms. For now, we
keep the random 500ms and use the ranking to avoid vote conflicts.
Signed-off-by: Binbin <[email protected]>
Codecov Report
```
@@            Coverage Diff             @@
##           unstable    #2209    +/-   ##
============================================
- Coverage     71.54%   71.53%   -0.01%
============================================
  Files           122      122
  Lines         66491    66501      +10
============================================
+ Hits          47570    47573       +3
- Misses        18921    18928       +7
```
In the time between scheduling a failover and starting the vote, the replica sends its offset to the other replicas in the same shard, to make sure the nodes agree on the rank before the failover starts.
```
/* Now that we have a scheduled election, broadcast our offset
 * to all the other replicas so that they'll updated their offsets
 * if our offset is better. */
clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_REPLICAS);
```
Assuming the other replicas will also receive the FAIL, they will also broadcast their offsets to the other replicas in the same shard, so all replicas will have updated offsets before any replica actually starts the vote.
This code updates the rank and the delay if some replica reports a better offset:
```
/* It is possible that we received more updated offsets from other
 * replicas for the same primary since we computed our election delay.
 * Update the delay if our rank changed.
 *
 * It is also possible that we received the message that telling a
 * shard is up. Update the delay if our failed_primary_rank changed.
 *
 * Not performed if this is a manual failover. */
if (server.cluster->failover_auth_sent == 0 && server.cluster->mf_end == 0) {
    int newrank = clusterGetReplicaRank();
    if (newrank != server.cluster->failover_auth_rank) {
        long long added_delay = (newrank - server.cluster->failover_auth_rank) * 1000;
        server.cluster->failover_auth_time += added_delay;
        server.cluster->failover_auth_rank = newrank;
        serverLog(LL_NOTICE, "Replica rank updated to #%d, added %lld milliseconds of delay.", newrank,
                  added_delay);
    }
}
```
With this PR, the delay can be near 0, so there is not enough time for the replicas to agree on the rank. In the worst case, two replicas both believe they are rank 0 and start the failover at the same time, and neither of them wins the election, causing even more downtime before we have a successful failover.
Do we need a fixed delay for this case, or is there some other mechanism to avoid this race between replicas?
We will always race without a consensus algorithm like Raft or Paxos with the data flowing through it; it's just a question of what degree of raciness we are willing to tolerate. We could wait for an affirmative ping from the other replicas confirming that they believe their primary is failed (it will be in their gossip PFAIL) before starting to nominate ourselves, if we have the highest offset or 500ms has elapsed.
hpatro
left a comment
My assumption from #2023 was that we wanted to make this delay dynamic w.r.t. cluster-node-timeout. I'm apprehensive about having a separate flow to handle faster failover.
Yes, Viktor and I talked about this offline yesterday; I think we can do it with another fixed number. And with this fixed 500ms delay, I wonder if the current mechanism is safe enough; if we determine that it is indeed safe, then I would love to remove it.
@madolson Sure, but we try to make the race less likely to happen. We have the rank, and now also the primary rank, which helps avoid a race if two primaries fail at the same time.
Do we do this? Whenever we get a pong from all replicas while waiting, do we abort the 500ms wait and start the election immediately? I haven't seen this. Can you point to the code?
My understanding of this requirement is to change the current 500 ms delay (the fixed value) to a config parameter and let the user decide the value (the default value stays at 500 ms). Added in the Valkey weekly meeting.
No, I was proposing that as something we could do. I think I missed a word in my original comment. We are waiting for the FAIL message from the other replicas to make sure we have their replication offsets to compute the rank. If a given replica has received that information from all other replicas, it doesn't need to wait.
We discussed this in the weekly meeting. The discussion was that we don't want to fully remove the 500ms logic, since we believe we do need to wait to get the messages from the other replicas to verify we have the best replication offset. We need some verification that we have the offsets from the other replicas before starting the election.
Btw, if we only have one replica, that means my rank is definitely 0, and my offset is definitely the highest. The waiting is just a waste in this case.
Assuming the comments are correct, we are waiting for FAIL to propagate. If we consider it safe, then I would like to remove it completely. Or there are other options:
Anyway, bottom line: if we feel safe, then I want to remove it. If we are not sure, that's fine; after all, stability is also very important.
@enjoy-binbin we can do Madelyn's idea. We wait, but when we receive information from all replicas that they know the primary is failing, and we have their offsets, we start immediately. It can be very fast.
OK, that is a good idea, and I think we are making progress step by step. It can definitely solve the one-replica case and help other cases. I will do it in a new PR. New PR: #2227
In valkey-io#2209, we are exploring ways to make failover faster, that is, to minimize the delay. The purpose of this delay is to allow us to have a scheduled election, so that during the delay the cluster has a way to gather more detailed information. For example, other primary nodes need to receive the propagation of FAIL in order to vote, and we need to know the offsets of the other replica nodes in order to rank. We want to minimize the delay while ensuring safety. For example, if there is only one replica, then we don't need any delay at all.

In this PR, when we find that the replica is the best-ranked replica, we let it initiate a failover immediately and completely remove the delay. How do we ensure safety?

1. We broadcast a FAIL message to the cluster before the replica initiates the election, so that other primaries will not refuse to vote.
2. Each replica node marks in its flags whether its primary node is in the FAIL state; that is, a new CLUSTER_NODE_MY_PRIMARY_FAIL flag is introduced. When we gossip with others, a replica can use this flag to know whether all replicas under the primary have reached a consensus on the FAIL state. If they have, it means we know the offset information of all replicas, so we can calculate the rank from the existing information without worrying about the offsets being outdated.

Signed-off-by: Binbin <[email protected]>
…r failover" This reverts commit 00d4b4c.
Signed-off-by: Binbin <[email protected]>
…s FAIL Signed-off-by: Binbin <[email protected]>
@zuiderkwast I reverted the 500ms change and only kept the TODO FAILOVER logic. (I guess we can ignore the DCO for now; I can squash a sign-off when I merge it.)
zuiderkwast
left a comment
Change LGTM. Now it only skips the wait for the next clusterCron cycle.
I think you can change the title from "as soon as possible" to something like "Start failover without waiting for the next cluster cron cycle".
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Binbin <[email protected]>
@enjoy-binbin @zuiderkwast Could you explain the rationale behind this change better? I don't understand its benefit very clearly. Let me share my reasoning. I believe that in a single-primary-failure scenario with just one replica, this change could potentially help with faster failover; however, even that is not guaranteed if the healthy primaries are not yet aware of the failure of the primary for which the vote is requested. Further, if there are multiple replicas in a given shard, it's very likely they would conflict with each other and divide the votes amongst themselves, which isn't ideal. It's also possible that a replica with a lower offset could be elected. And in a multiple-primary-failure scenario, we would again divide the votes for a given epoch amongst different shards, wasting an election cycle and introducing delay. I would also like the @valkey-io/core-team to share their opinion on this.
@hpatro This PR is now very minor. It doesn't remove the 500ms delay nor the random delay. It only removed the wait for the next cluster cron cycle, which I believe is not an intentional delay. Maybe you want to discuss this one instead?
When myself is a replica, if we receive a message that my primary is
FAIL, we can try to set CLUSTER_TODO_HANDLE_FAILOVER so that we can
try to failover as soon as possible in beforeSleep without waiting
for clusterCron to kick in.
Add a new markNodeAsFailing method; we move the FAIL-flag-related code
there so we can easily check whether the failing node is my primary.
The following is all old content. It was deleted during the discussion, but the content is worth keeping. (Don't merge this text)
================================================================================
Remove the 500ms fixed delay for automatic failover for faster failover
An automatic failover has a fixed delay of 500ms. Its main purpose is
to let the FAIL packet propagate through the cluster so that other
primaries can respond to the vote request.
A fixed delay of 500ms is rather long nowadays, so we are looking into
removing it for a faster failover. If we can ensure it is safe to remove,
then we can remove it.
Currently a replica can only learn of the death of its primary
node from two places:
1. Collecting PFAIL messages from other primary nodes, then checking
in markNodeAsFailingIfNeeded whether we have a quorum.
2. Receiving FAIL messages from other primary nodes (or replica nodes).
Either way, the failing state is triggered by collecting failure reports
from primaries.
In markNodeAsFailingIfNeeded, if `nodeIsReplica(myself) && myself->replicaof == node`
is true, then we can trigger a failover in clusterBeforeSleep as soon as
possible, without waiting for clusterCron. Moreover, we can remove the 500ms,
because we broadcast a FAIL here, which means that myself's FAILOVER_AUTH request
will definitely arrive after the FAIL, so we don't have to wait 500ms for the FAIL
to propagate.
Moreover, if we do the same when receiving a FAIL message (check whether the
failing node is myself's primary, and call clusterSendFail to forward the FAIL),
it looks like we can remove the 500ms delay in that path as well. It is essentially
the first case in a degraded form, as if all the replicas collected the PFAIL
reports and then sent the FAIL themselves.
So in this commit, we try to remove the fixed delay of 500ms. For now, we
keep the random 500ms and use the ranking to avoid vote conflicts.
Related to #2023.