Commit 1007ded

enjoy-binbin authored and zvischn committed
Fix replica not able to initiate election in time when epoch fails (valkey-io#1009)
If multiple primary nodes go down at the same time, their replicas will initiate elections at the same time, and there is a certain probability that they will do so in the same epoch. In the current election mechanism, only one replica can collect enough votes in a given epoch. The other replicas fail to reach a majority, their elections time out, and they have to wait for a retry, which results in a long failure time.

If another node has already won the election in the failover epoch, a replica can assume that its own election has failed and retry as soon as possible.

Signed-off-by: Binbin <[email protected]>
1 parent 879c94c commit 1007ded

2 files changed, +51 -0 lines changed

src/cluster_legacy.c

Lines changed: 18 additions & 0 deletions
@@ -3135,6 +3135,24 @@ int clusterProcessPacket(clusterLink *link) {
         if (sender_claims_to_be_primary && sender_claimed_config_epoch > sender->configEpoch) {
             sender->configEpoch = sender_claimed_config_epoch;
             clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG);
+
+            if (server.cluster->failover_auth_time && sender->configEpoch >= server.cluster->failover_auth_epoch) {
+                /* Another node has claimed an epoch greater than or equal to ours.
+                 * If we have an ongoing election, reset it because we cannot win
+                 * with an epoch smaller than or equal to the incoming claim. This
+                 * allows us to start a new election as soon as possible. */
+                server.cluster->failover_auth_time = 0;
+                serverLog(LL_WARNING,
+                          "Failover election in progress for epoch %llu, but received a claim from "
+                          "node %.40s (%s) with an equal or higher epoch %llu. Resetting the election "
+                          "since we cannot win an election in the past.",
+                          (unsigned long long)server.cluster->failover_auth_epoch,
+                          sender->name, sender->human_nodename,
+                          (unsigned long long)sender->configEpoch);
+                /* Maybe we could start a new election, set a flag here to make sure
+                 * we check as soon as possible, instead of waiting for a cron. */
+                clusterDoBeforeSleep(CLUSTER_TODO_HANDLE_FAILOVER);
+            }
         }
         /* Update the replication offset info for this node. */
         sender->repl_offset = ntohu64(hdr->offset);

tests/unit/cluster/failover2.tcl

Lines changed: 33 additions & 0 deletions
@@ -64,3 +64,36 @@ start_cluster 3 4 {tags {external:skip cluster} overrides {cluster-ping-interval
     }
 
 } ;# start_cluster
+
+
+start_cluster 7 3 {tags {external:skip cluster} overrides {cluster-ping-interval 1000 cluster-node-timeout 5000}} {
+    test "Primaries will not time out when they are elected in the same epoch" {
+        # Because of the election delay, these nodes may not initiate the
+        # election at the same time (same epoch). But if they do, we make
+        # sure there is no failover timeout.
+
+        # Killing three primary nodes.
+        pause_process [srv 0 pid]
+        pause_process [srv -1 pid]
+        pause_process [srv -2 pid]
+
+        # Wait for the failover
+        wait_for_condition 1000 50 {
+            [s -7 role] == "master" &&
+            [s -8 role] == "master" &&
+            [s -9 role] == "master"
+        } else {
+            fail "No failover detected"
+        }
+
+        # Make sure there is no failover timeout.
+        verify_no_log_message -7 "*Failover attempt expired*" 0
+        verify_no_log_message -8 "*Failover attempt expired*" 0
+        verify_no_log_message -9 "*Failover attempt expired*" 0
+
+        # Resume these primary nodes to speed up the shutdown.
+        resume_process [srv 0 pid]
+        resume_process [srv -1 pid]
+        resume_process [srv -2 pid]
+    }
+} ;# start_cluster
