
Conversation

@enjoy-binbin
Member

@enjoy-binbin enjoy-binbin commented Sep 30, 2024

Update: The auto-failover-on-shutdown config was removed in #2292; see #2292 for more details.

========================================

When a primary disappears, its slots are not served until an automatic
failover happens. That takes roughly the node timeout plus a few seconds,
which is too long for us to be unable to accept writes.

If the host machine is about to shut down for any reason, the processes
typically get a SIGTERM and have some time to shut down gracefully. In
Kubernetes, this grace period is 30 seconds by default.

When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a failover
to one of its replicas as part of the graceful shutdown. This can reduce
the unavailability window: normally a replica must first detect the primary
failure, which takes up to the node timeout, before initiating an election,
but now it can initiate the election immediately, win it, and gossip the result.

The primary does this by sending a CLUSTER FAILOVER command to the replica.
We added a REPLICAID argument to CLUSTER FAILOVER: after receiving the
command, the replica checks whether the given node-id is its own; if not,
the command is ignored. The primary learns the replica's node-id during
the replication handshake (see the REPLCONF SET-CLUSTER-NODE-ID section below).

New argument for CLUSTER FAILOVER

The format now becomes CLUSTER FAILOVER [FORCE|TAKEOVER] [REPLICAID node-id].
This argument is not intended for users, so it is not added to the command's
JSON file.

Replica sends REPLCONF SET-CLUSTER-NODE-ID to inform its node-id

During the replication handshake, the replica now uses REPLCONF
SET-CLUSTER-NODE-ID to inform the primary of its node-id.

Primary issues CLUSTER FAILOVER

The primary sends CLUSTER FAILOVER FORCE REPLICAID node-id to all replicas,
since the replication buffer is shared, but only the replica with the
matching id will execute it.

Add a new auto-failover-on-shutdown config

Users can disable this feature if they don't want it; the default is 0 (disabled).
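For example, in valkey.conf (note the update at the top of this description: this config was later removed in #2292 in favor of other shutdown options):

```
# Enable failing over the primary role to a replica on graceful shutdown
auto-failover-on-shutdown yes
```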

This closes #939.


Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Sep 30, 2024
@codecov

codecov bot commented Sep 30, 2024

Codecov Report

❌ Patch coverage is 88.23529% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.09%. Comparing base (dfdcbfe) to head (3b18c45).
⚠️ Report is 337 commits behind head on unstable.

Files with missing lines   Patch %   Lines
src/cluster_legacy.c       85.00%    6 Missing ⚠️
src/replication.c          92.59%    2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1091      +/-   ##
============================================
+ Coverage     71.02%   71.09%   +0.06%     
============================================
  Files           123      123              
  Lines         65683    65766      +83     
============================================
+ Hits          46653    46754     +101     
+ Misses        19030    19012      -18     
Files with missing lines   Coverage Δ
src/config.c               78.39% <ø> (ø)
src/server.c               87.57% <100.00%> (+0.02%) ⬆️
src/server.h               100.00% <ø> (ø)
src/replication.c          87.32% <92.59%> (+0.05%) ⬆️
src/cluster_legacy.c       86.08% <85.00%> (+0.05%) ⬆️

... and 13 files with indirect coverage changes


Contributor

@zuiderkwast zuiderkwast left a comment


Nice! Thanks for doing this.

The PR description can be updated to explain the solution. Now it is just copy-pasted from the issue. :)

I'm thinking that doing failover in finishShutdown() is maybe too late. finishShutdown is only called when all replicas already have replication offset equal to the primary (checked by isReadyToShutdown()), or after timeout (10 seconds). If one replica is very slow, it will delay the failover. I think we can do the manual failover earlier.

This is the sequence:

  1. SHUTDOWN or SIGTERM calls prepareForShutdown(). Here, pause clients for writing and start waiting for replicas offset.
  2. In serverCron(), we check isReadyToShutdown() which checks if all replicas have repl_ack_off == primary_repl_offset. If yes, finishShutdown() is called, otherwise wait more.
  3. finishShutdown.

I think we can send CLUSTER FAILOVER FORCE to the first replica which has repl_ack_off == primary_repl_offset. We can do it in isReadyToShutdown(), I think. (We can rename it to indicate that it does more than check readiness.) Then, we also wait for it to send the failover auth request, and the primary votes, before isReadyToShutdown() returns true.

What do you think?

@enjoy-binbin
Member Author

The PR description can be updated to explain the solution. Now it is just copy-pasted from the issue. :)

The issue description is good and very detailed, so I copied it; I will update it later.

I'm thinking that doing failover in finishShutdown() is maybe too late. finishShutdown is only called when all replicas already have replication offset equal to the primary (checked by isReadyToShutdown()), or after timeout (10 seconds). If one replica is very slow, it will delay the failover. I think we can do the manual failover earlier.

Yeah, a failover as soon as possible is good, but isn't it true that the primary is down only after it actually exits? So in this case, if a replica is slow and does not get the chance to catch up with the primary, and another replica triggers the failover, the slow replica will need a full sync when it does the reconfiguration.

I think we can send CLUSTER FAILOVER FORCE to the first replica which has repl_ack_off == primary_repl_offset. We can do it in isReadyToShutdown() I think. (We can rename to indicated it does more then check if ready.) Then, we also wait for it to send failover auth request and the primary votes before isReadyToShutdown() returns true.

So let me sort it out again: you are suggesting that if one replica has already caught up with the offset, we should trigger a failover immediately?

I guess that also makes sense in this case.

@zuiderkwast
Contributor

if a replica is slow and does not get the chance to catch up with the primary, and another replica triggers the failover, the slow replica will need a full sync when it does the reconfiguration.

I didn't think about this. The replica can't do a psync to the new primary after the failover? If it can't, then maybe you're right that the primary should wait for all replicas, at least for some time, to avoid a full sync.

So, wait for all, then trigger manual failover. If you want, we can add another wait after that (after "finish shutdown"), so the primary can vote for the replica before exit. Wdyt?

@enjoy-binbin
Member Author

enjoy-binbin commented Oct 18, 2024

Sorry for the late reply, I somehow missed this thread.

I didn't think about this. The replica can't do a psync to the new primary after the failover? If it can't, then maybe you're right that the primary should wait for all replicas, at least for some time, to avoid a full sync.

Yes, I think this may happen: if the primary does not flush its output buffer to the slow replica, then during the reconfiguration the slow replica may use an old offset to psync with the new primary, which will cause a full sync. The probability should be small, though, since the primary calls flushReplicasOutputBuffers to write out as much as possible before shutdown.

So, wait for all, then trigger manual failover. If you want, we can add another wait after that (after "finish shutdown"), so the primary can vote for the replica before exit. Wdyt?

Waiting for the vote: I think both options are OK. Even if we don't wait, I think the replica will get enough votes. If we really want to, we could even wait until the replica successfully becomes the primary before exiting... Do you have a final decision? I will do whatever you think is right.

@zuiderkwast
Contributor

Waiting for the vote: I think both options are OK. Even if we don't wait, I think the replica will get enough votes. If we really want to, we could even wait until the replica successfully becomes the primary before exiting... Do you have a final decision? I will do whatever you think is right.

I'm wondering if there are any corner cases, like if the cluster is too small to have a quorum without the shutting-down primary...

If it is simple, I prefer to let the primary wait and vote. Then we can avoid the server.cluster->mf_is_primary_failover variable. I don't like this variable and special case. :)

But if the implementation to wait for the vote would be too complex, then let's just skip the vote. I think that's also fine. Without this feature, we wait for an automatic failover, which also will not get a vote from the already-shut-down primary.

@enjoy-binbin
Member Author

But if the implementation to wait for the vote would be too complex, then let's just skip the vote. I think that's also fine. Without this feature, we wait for an automatic failover, which also will not get a vote from the already-shut-down primary.

I am going to skip the vote for now; I tried a bit, and it did not seem easy or clean to finish. Maybe I'll have a better idea later; I will keep it in mind.

Contributor

@zuiderkwast zuiderkwast left a comment


I am going to skip the vote for now; I tried a bit, and it did not seem easy or clean to finish. Maybe I'll have a better idea later; I will keep it in mind.

I understand. Simple is better.

But possible data loss is not good. See comments below.

Signed-off-by: Binbin <[email protected]>
@PingXie
Member

PingXie commented Oct 28, 2024

This is an interesting idea. I like the direction we are going in, but I agree with @zuiderkwast that potential data loss is not appealing.

We can do both, though IMO triggering a (graceful) failover as part of CLUSTER FORGET is more valuable than making it part of shutdown, because it is cleaner to forget a node prior to shutting it down in any production environment.

Today, we can't forget "myself" or "my primary" (with the latter being a dynamic state). This adds operational complexity. Imagine that the admin could just send CLUSTER FORGET to any node in the cluster, and the server would do the right thing: failing over the primaryship to one of its replicas, if applicable, and then broadcasting the forget message to the cluster.

@zuiderkwast
Contributor

We can do both though IMO triggering a (graceful) failover as part of CLUSTER FORGET is more valuable than making it part of shutdown, because it is cleaner to forget a node prior to shutting it down in any production environment.

@PingXie Yes, it's a good idea, but this PR is about the scenario where the machine is taken down outside the control of the Valkey admin. For example, in Kubernetes, when a worker is shut down, SIGTERM is sent to all processes and they have 30 seconds by default. When you shut down your laptop, I believe it's similar: each application gets a SIGTERM and has some time to do a graceful shutdown.

@zuiderkwast zuiderkwast removed this from Valkey 8.1 Apr 7, 2025
@zuiderkwast zuiderkwast moved this to In Progress in Valkey 9.0 Apr 7, 2025
@madolson madolson added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Apr 7, 2025
@madolson
Member

madolson commented Apr 7, 2025

Discussed briefly in core team meeting. No more open questions, so once low level details are done we can merge it for 9.0.

@zuiderkwast
Contributor

@hpatro @madolson Do you want to complete your reviews, or can we merge it?

@PingXie
Member

PingXie commented Apr 7, 2025

can we merge it?

I have a few new open questions. I will also reset my approval next.

Member

@PingXie PingXie left a comment


LGTM.

Member

@madolson madolson left a comment


I never reviewed all the details, but the high level looks good to me now.

@hwware hwware merged commit 44dafba into valkey-io:unstable Apr 8, 2025
58 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Valkey 9.0 Apr 8, 2025
@enjoy-binbin enjoy-binbin deleted the shutdown_failover branch April 9, 2025 01:56
murphyjacob4 pushed a commit to enjoy-binbin/valkey that referenced this pull request Apr 13, 2025
…key-io#1091)

Signed-off-by: Binbin <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
Co-authored-by: Harkrishn Patro <[email protected]>
@zuiderkwast
Contributor

When reviewing #2195, I'm thinking back on this feature again. Here, we added a new config, auto-failover-on-shutdown. Why didn't we instead add a new failover option to the existing shutdown-on-sigterm and shutdown-on-sigint configs, plus an argument to the SHUTDOWN command (SHUTDOWN [FAILOVER]), so that the failover on shutdown is triggered based on how the shutdown itself is triggered? This is how other shutdown behavior is selected.

This is not released yet so we can still change this.

@enjoy-binbin
Member Author

Right, I totally forgot about it. I guess it might be the difference between active and passive: auto-failover-on-shutdown (and a shutdown-on-sig failover option, for that matter) is a passive way, while SHUTDOWN FAILOVER is an active way.

And you are right, this isn't released yet anyway, so we can change it for the better.

@valkey-io/core-team thoughts?

zuiderkwast added a commit that referenced this pull request Jul 22, 2025
In #1091, a new config `auto-failover-on-shutdown` was added. This PR
changes the config to make it unified with other shutdown related
options. This feature has not yet been released, so it's not a breaking
change.

The auto-failover-on-shutdown config is replaced by

* A new "failover" option to the existing configs `shutdown-on-sigterm`
and `shutdown-on-sigint`.
* A new FAILOVER option to the SHUTDOWN command.

Additionally, a history entry is added to the SHUTDOWN command which was
missing in #2195.

Follow-up of #1091.

Signed-off-by: Viktor Söderqvist <[email protected]>
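For reference, the replacement interface described in the commit above can be used roughly like this (a hedged example based solely on that commit message):

```
# valkey.conf: request a failover as part of a graceful shutdown on SIGTERM
shutdown-on-sigterm failover

# Or per invocation, via the command:
# SHUTDOWN FAILOVER
```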

Labels

cluster
major-decision-approved Major decision approved by TSC team
release-notes This issue should get a line item in the release notes
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[NEW] Trigger manual failover on SIGTERM to primary (cluster)

6 participants