Commit 2ca0dd8

Make cluster failover delay relative to node timeout (#2449)
In clusters with a very short node timeout such as 2-3 seconds, the extra failover delay of 500-1000 milliseconds (500 plus a random value of 0-500; 750 on average) before initiating a failover is significant extra downtime for the cluster. This PR makes the delay relative to the node timeout, using a shorter failover delay for a smaller configured node timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.

| Node timeout  | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500          | 250                  |
| 3000          | 100                  |
| 1500          | 50                   |

Additional change: Add an extra 500 ms delay to new replicas that may not yet know about the other replicas. This avoids the scenario where a new replica with no data wins the failover. This change turned out to be needed for the stability of some test cases.

The purposes of the failover delay are:

1. Allow FAIL to propagate to the voting primaries in the cluster.
2. Allow replicas to exchange their offsets, so they have a correct view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added replicas to discover other replicas in the cluster via gossip, compute their rank, and realize they are not the best replica. This case is mitigated by adding another 500 ms delay to new replicas, i.e. replicas with replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume that the above needs less time than in a cluster with a higher node timeout. These delays don't affect the correctness of the algorithm. They are just there to increase the probability that a failover will succeed, by making sure that the FAIL message has enough time to propagate in the cluster; the random part reduces the probability that two replicas initiate the failover at the same time.

The typical use case is when data consistency matters and writes can't be skipped. For example, in some applications, writes are buffered in the application during node failures so they can be applied when the failover is completed. The application can't buffer them for very long, so the cluster needs to be up again within e.g. 5 seconds from the time a node starts to fail.

I hope this PR can be considered safer than #2227, although the two changes are orthogonal. Part of issue #2023.

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
1 parent 7a5c0d0 commit 2ca0dd8
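
To make the delay scheme concrete, here is a minimal standalone sketch of how the fixed, random, rank-proportional and new-replica delays described above combine. It is not the actual cluster_legacy.c code; the function name `failoverAuthDelay`, its parameters, and the local `min` macro are illustrative.

```c
/* Minimal sketch of the failover delay arithmetic described above.
 * This is NOT the real cluster code; names and parameters are illustrative.
 * Assumes node_timeout_ms >= 30 so that delay > 0. */
#include <stdlib.h>

#define min(a, b) ((a) < (b) ? (a) : (b))

/* Returns the total delay in milliseconds before a replica may start its
 * election, given the configured node timeout and the replica's ranks. */
static long long failoverAuthDelay(long long node_timeout_ms,
                                   int replica_rank,
                                   int failed_primary_rank,
                                   int is_new_replica) {
    /* Fixed delay relative to node timeout, capped at 500 ms
     * (500 for the default node timeout of 15000). */
    long long delay = min(node_timeout_ms / 30, 500);

    long long total = delay +           /* let the FAIL message propagate */
                      random() % delay; /* spread out competing replicas */

    /* Replicas with a worse replication offset wait longer: 2 * delay per
     * rank, i.e. 1 second per rank with the default node timeout. */
    total += replica_rank * (delay * 2);

    /* Stagger elections for different failed primaries: delay per rank. */
    total += failed_primary_rank * delay;

    /* A brand-new replica (replication offset 0) waits an extra 500 ms. */
    if (is_new_replica) total += 500;

    return total;
}
```

With a node timeout of 3000 ms this yields a base delay of 100-199 ms for the best-ranked replica, matching the table above.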

File tree

1 file changed: +19 -8 lines changed


src/cluster_legacy.c

Lines changed: 19 additions & 8 deletions
@@ -5167,6 +5167,10 @@ void clusterHandleReplicaFailover(void) {
     if (auth_timeout < CLUSTER_OPERATION_TIMEOUT) auth_timeout = CLUSTER_OPERATION_TIMEOUT;
     auth_retry_time = auth_timeout * 2;
 
+    /* Use a failover delay relative to node timeout: 500 for the default node
+     * timeout of 15000, less for lower node timeout, but not more. */
+    long long delay = min(server.cluster_node_timeout / 30, 500);
+
     /* Pre conditions to run the function, that must be met both in case
      * of an automatic or manual failover:
      * 1) We are a replica.
@@ -5212,20 +5216,27 @@ void clusterHandleReplicaFailover(void) {
      * elapsed, we can setup a new one. */
     if (auth_age > auth_retry_time) {
         server.cluster->failover_auth_time = now +
-                                             500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
-                                             random() % 500; /* Random delay between 0 and 500 milliseconds. */
+                                             delay + /* Fixed delay to let FAIL msg propagate. */
+                                             random() % delay; /* Random delay between 0 and the fixed delay. */
         server.cluster->failover_auth_count = 0;
         server.cluster->failover_auth_sent = 0;
         server.cluster->failover_auth_rank = clusterGetReplicaRank();
         /* We add another delay that is proportional to the replica rank.
-         * Specifically 1 second * rank. This way replicas that have a probably
+         * By default, 1 second * rank. This way replicas that have a probably
          * less updated replication offset, are penalized. */
-        server.cluster->failover_auth_time += server.cluster->failover_auth_rank * 1000;
+        server.cluster->failover_auth_time += server.cluster->failover_auth_rank * (delay * 2);
+        /* If this is a newly added replica, there is a risk it doesn't know
+         * about other replicas yet, so it may think it's the best replica even
+         * if there are others with better replication offsets. Add an extra
+         * delay to make it less likely that it will win the failover. */
+        if (getNodeReplicationOffset(myself) == 0) {
+            server.cluster->failover_auth_time += 500;
+        }
         /* We add another delay that is proportional to the failed primary rank.
-         * Specifically 0.5 second * rank. This way those failed primaries will be
+         * By default, 0.5 second * rank. This way those failed primaries will be
          * elected in rank to avoid the vote conflicts. */
         server.cluster->failover_failed_primary_rank = clusterGetFailedPrimaryRank();
-        server.cluster->failover_auth_time += server.cluster->failover_failed_primary_rank * 500;
+        server.cluster->failover_auth_time += server.cluster->failover_failed_primary_rank * delay;
         /* However if this is a manual failover, no delay is needed. */
         if (server.cluster->mf_end) {
             server.cluster->failover_auth_time = now;
@@ -5262,7 +5273,7 @@ void clusterHandleReplicaFailover(void) {
     if (server.cluster->failover_auth_sent == 0 && server.cluster->mf_end == 0) {
         int newrank = clusterGetReplicaRank();
         if (newrank != server.cluster->failover_auth_rank) {
-            long long added_delay = (newrank - server.cluster->failover_auth_rank) * 1000;
+            long long added_delay = (newrank - server.cluster->failover_auth_rank) * (delay * 2);
             server.cluster->failover_auth_time += added_delay;
             server.cluster->failover_auth_rank = newrank;
             serverLog(LL_NOTICE, "Replica rank updated to #%d, added %lld milliseconds of delay.", newrank,
@@ -5271,7 +5282,7 @@ void clusterHandleReplicaFailover(void) {
 
         int new_failed_primary_rank = clusterGetFailedPrimaryRank();
         if (new_failed_primary_rank != server.cluster->failover_failed_primary_rank) {
-            long long added_delay = (new_failed_primary_rank - server.cluster->failover_failed_primary_rank) * 500;
+            long long added_delay = (new_failed_primary_rank - server.cluster->failover_failed_primary_rank) * delay;
             server.cluster->failover_auth_time += added_delay;
             server.cluster->failover_failed_primary_rank = new_failed_primary_rank;
             serverLog(LL_NOTICE, "Failed primary rank updated to #%d, added %lld milliseconds of delay.",
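
As a worked example of the new arithmetic: with cluster-node-timeout 3000, `delay` becomes min(3000 / 30, 500) = 100 ms, so the best-ranked replica schedules its election 100-199 ms after FAIL (instead of 500-999 ms), each replica rank adds another 200 ms (delay * 2), each failed-primary rank adds 100 ms, and a replica with replication offset 0 still waits an extra fixed 500 ms. With the default node timeout of 15000 or more, `delay` stays at 500 and the behavior is unchanged from before.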
