Do the failover immediately if the replica is the best ranked replica #2227
Conversation
In valkey-io#2209, we are exploring ways to make failover faster, that is, to minimize the delay before a failover starts. The purpose of this delay is to allow a scheduled election, so that during the delay the cluster can gather more detailed information. For example, other primary nodes need to receive the propagation of FAIL in order to vote, and we need to know the offsets of the other replicas in order to rank them. We want to minimize the delay while ensuring safety. For example, if there is only one replica, no delay is needed at all.

In this PR, when we find that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay.

How to ensure safety?

1. We broadcast a FAIL message to the cluster before the replica initiates the election, so that other primaries will not refuse to vote.
2. Each replica marks in its flags whether its primary is in FAIL state; that is, a new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is introduced. When we gossip with others, a replica can tell from this flag whether all replicas under the primary have reached a consensus on the FAIL state. If they have, we know the offset information of all replicas, which means we can calculate the rank from the existing information without worrying about the offsets being outdated.

Signed-off-by: Binbin <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##           unstable    #2227  +/-  ##
========================================
+ Coverage     71.36%   71.48%  +0.11%
========================================
  Files           123      123
  Lines         67094    67121     +27
========================================
+ Hits          47884    47980     +96
+ Misses        19210    19141     -69
```
zuiderkwast
left a comment
Good job!
I have some ideas. See comments.
A test case would be nice, to check that we can do a failover faster than the delay that we skip. For example, with node timeout 2000, check that we can do a failover within 2500 ms? I guess a test like this can be flaky, but it would be good to at least try it to see if the feature works.
I'm a little confused about CLUSTER_TODO_HANDLE_FASTER_FAILOVER. Why do we need it? Why is it different to CLUSTER_TODO_HANDLE_FAILOVER, when can we use it, when can we not use it, etc. It is not clear to me.
IIUC, it's not really needed for the feature to skip the delay if clusterAllReplicasThinkPrimaryIsFail() returns true and my rank is 0, is it? If it's a separate optimization, then I guess we can skip the FASTER_FAILOVER idea in this PR, to keep it simpler. WDYT?
Yes, I will add the test later; I tested it locally yesterday and it works.
Yes, it is indeed a bit confusing, sorry I did not mention it. I want to put CLUSTER_TODO_HANDLE_FASTER_FAILOVER after the other flags, like after CLUSTER_TODO_UPDATE_STATE, so that we first print CLUSTER IS DOWN and then try to trigger the failover. That is, now we trigger the failover faster; in the past, we would update the cluster state and then try to failover in clusterCron.
Sure, I can remove FASTER_FAILOVER and only introduce it in #2209.
@hpatro I have discussed with @enjoy-binbin and updated the top comment with a more thorough reasoning about the possible scenarios. PTAL. The special case that a new replica joins and doesn't know about the other replicas is a serious problem, but it's not new. It's not solved here, but we can avoid making it worse. (I hope we can solve it in another PR, to make sure a replica is made aware of the other replicas in the shard ASAP, for example when it starts replicating.)
Ok, I pushed the new commit. Sorry for the late response, I somehow forgot it and thought we were waiting for a response from @hpatro.
@madolson This is based on your idea to keep track of other replicas' view of the primary's fail status. Great if you can take a look.
hpatro
left a comment
Haven't taken a complete look.
@enjoy-binbin Could you please share whether we still keep the replica rank logic, or bypass it entirely with the fast failover logic?
What do you mean? Sorry, I don't quite get the question.
zuiderkwast
left a comment
I would like to have this in 9.0. I hope others will agree. Maybe we can merge it after RC1.
I would like to pause this for 9.0. We have seen a lot of cluster instability in AWS after merging 8.1 (much worse than in any past release), and I don't think we have the testing in place for all of these optimizations to make sure we aren't ending up in odd states.
In clusters with a very short node timeout, such as 2-3 seconds, the extra failover delay of 500-1000 milliseconds (500 + random value 0-500; 750 on average) before initiating a failover is significant extra downtime for the cluster. This PR makes the delay relative to the node timeout, using a shorter failover delay for a smaller configured node timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.

| Node timeout  | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500          | 250                  |
| 3000          | 100                  |
| 1500          | 50                   |

Additional change: Add an extra 500ms delay for new replicas that may not yet know about the other replicas. This avoids the scenario where a new replica with no data wins the failover. This change turned out to be needed for the stability of some test cases.

The purposes of the failover delay are:

1. Allow FAIL to propagate to the voting primaries in the cluster.
2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added replicas to discover other replicas in the cluster via gossip and to compute their rank, to realize they are not the best replica. This case is mitigated by adding another 500ms delay to new replicas, i.e. if a replica has replication offset 0. A low node timeout only makes sense in fast networks, so we can assume that the above needs less time than in a cluster with a higher node timeout.

These delays don't affect the correctness of the algorithm. They are only there to increase the probability that a failover will succeed, by making sure the FAIL message has enough time to propagate in the cluster; the random part reduces the probability that two replicas initiate the failover at the same time.

The typical use case is when data consistency matters and writes can't be skipped. For example, in some applications, we buffer writes in the application during node failures so we can apply them when the failover is completed. The application can't buffer them for very long, so we need the cluster to be up again within e.g. 5 seconds from the time a node starts to fail.

I hope this PR can be considered safer than #2227, although the two changes are orthogonal. Part of issue #2023.

Signed-off-by: Viktor Söderqvist <[email protected]>
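The `fixed_delay = min(500, node_timeout / 30)` formula above can be sketched as follows. This is illustrative C, not the actual Valkey code; the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper implementing fixed_delay = min(500, node_timeout / 30).
 * All values are in milliseconds. */
static long long fixed_failover_delay(long long node_timeout) {
    long long delay = node_timeout / 30;
    if (delay > 500) delay = 500; /* cap at the previous fixed 500ms delay */
    return delay;
}
```

Note that the random 0-500ms component described above is added on top of this fixed part.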
In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is,
to minimize the delay.
When a node is marked as FAIL and before the failover starts, there is a delay of
500-1000ms. The original purposes of this delay:

1. Allow FAIL to propagate to the voting primaries in the cluster, so
   they will vote when a replica sends a failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct view of
   their own rank.

We want to minimize this delay while ensuring safety. It is useful for example in
these cases:

- If there is only one replica, we don't need any delay at all.
- Replicas in the same shard can exchange their
  offsets very quickly and start the failover within a few milliseconds instead
  of 500-1000ms.
In this PR, when we can be sure that the replica is the best ranked replica, we let
it initiate a failover immediately and completely remove the delay.
How to ensure safety?

To make sure this replica has the best rank, it only skips the delay if it is
sure that it has the best rank and that all replicas in the same shard agree that
the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to
indicate that a replica has marked its primary as FAIL. If all replicas say that
the primary is failing, we also know that the offsets are up to date, because the
offset is not incrementing while the primary is failing. We can skip the delay only
if we have received a message from all replicas and they all have set this flag.
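A minimal sketch of this consensus check (illustrative C; the real Valkey function clusterAllReplicasThinkPrimaryIsFail() operates on clusterNode structures, and the flag's bit value and the struct here are made up):

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_NODE_MY_PRIMARY_FAIL (1 << 14) /* bit value is illustrative */

/* Simplified stand-in for a sibling replica's last-known gossip state. */
typedef struct replica_view {
    int flags; /* flags from the last clusterMsg header received */
} replica_view;

/* Returns 1 only if every sibling replica has reported, via the new flag,
 * that it also considers the shared primary to be failing. */
static int all_replicas_think_primary_is_fail(const replica_view *r, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!(r[i].flags & CLUSTER_NODE_MY_PRIMARY_FAIL)) return 0;
    }
    return 1;
}
```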
To make sure the primaries will vote even if they haven't received the FAIL yet,
we use the `CLUSTERMSG_FLAG0_FORCEACK` flag to make sure they will vote. This is equivalent
to broadcasting a FAIL message to all primaries before we broadcast the failover
auth request (but cheaper).
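The voter-side effect of the flag might look like this. This is a sketch under assumptions: the helper name and the flag's bit value are invented, and the real vote path has many more checks (epoch, slot ownership, vote cooldown, etc.).

```c
#include <assert.h>

#define CLUSTERMSG_FLAG0_FORCEACK (1 << 3) /* bit value is illustrative */

/* A primary normally refuses to vote unless it has itself marked the
 * requester's primary as FAIL. With FORCEACK set, it trusts the requester's
 * view, which is equivalent to having received a FAIL broadcast just before
 * the auth request arrived. */
static int may_grant_vote(int msg_flags, int primary_marked_fail_locally) {
    return primary_marked_fail_locally || (msg_flags & CLUSTERMSG_FLAG0_FORCEACK);
}
```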
The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in a sequence diagram in the original PR description.
This is how the failover is initiated, with new steps marked with (new):

1. The node that detects FAIL broadcasts it to all nodes in the cluster.
2. When a replica of the failing primary receives the FAIL, it schedules a failover:

   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts a pong to the other replicas in the same shard.
   c. (new) The pong (actually the clusterMsg header) has a new flag
      `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pongs to each other here, this flag is set.
3. (new) Before the timeout has elapsed, the replica can immediately initiate
   the failover using AUTH REQUEST with the FORCE ACK flag set, that is if:

   a. a PONG is received from every other replica in the same shard (broadcast within
      the shard) and
   b. all replicas have marked that their primary is FAIL in their last message (the new
      `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.
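Conditions a-d above can be combined into a single predicate (illustrative C, not the actual Valkey code; the function and parameter names are hypothetical):

```c
#include <assert.h>

/* Hypothetical predicate: skip the failover delay only when all four
 * conditions from the list above hold. */
static int can_skip_failover_delay(int pong_from_every_replica,
                                   int all_marked_primary_fail,
                                   int my_rank,
                                   long long my_repl_offset) {
    return pong_from_every_replica && /* (a) heard from the whole shard  */
           all_marked_primary_fail && /* (b) consensus on the FAIL state */
           my_rank == 0 &&            /* (c) best ranked replica         */
           my_repl_offset != 0;       /* (d) not an empty new replica    */
}
```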
Notes:

- At this point, a FAIL has already been broadcast by some node, but there is a race,
  so our auth request may arrive at some node before the FAIL. Using the FORCE ACK flag
  ensures the primaries will vote for us. (It is equivalent to broadcasting another FAIL
  just before broadcasting the auth request.)
- The offset last received from each replica can be trusted to be up to date: if a replica
  says that its primary is failing, it also means that
  the replication from the primary to that replica has stopped.
- There is a race if two replicas don't know about each
  other. In this case, two replicas can both think they are the best replica and start the
  failover at the same time. This can already happen without this PR. When it happens, it
  usually means that a new replica has just joined and has no data (offset = 0), and if
  it wins the election, there is a problem of data loss (discussed and partially mitigated
  for the replica migration case in Fix data loss when replica do a failover with a old history repl offset #885). To avoid this case, we skip this fast failover path
  if the replica has offset = 0.