Add support for Atomic Slot Migration to CLI #2755

murphyjacob4 · 2025-10-20T22:35:07Z

Adds a new option --cluster-use-atomic-slot-migration. This will apply to both --cluster reshard and --cluster rebalance commands.

There is room for further optimization, but for now we batch all the slot ranges for each (source, target) pair and send them as a single CLUSTER MIGRATESLOTS request. We then wait for the request to finish by polling CLUSTER GETSLOTMIGRATIONS once every 100ms. We parse the CLUSTER GETSLOTMIGRATIONS reply, find the most recent migration affecting the requested slot range, and check whether it is in progress, failed, cancelled, or successful. If the migration failed or was cancelled, we report the error to the user.
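The selection step described above ("most recent migration affecting the requested slot range") can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `migrationEntry` struct, the `start_time` field, the state strings, and the helper names are all assumptions, and "affecting" is simplified here to "fully covering". The real reply format of CLUSTER GETSLOTMIGRATIONS is not modelled, and the 100ms polling loop is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Outcome names follow the enum quoted later in the review; the type
 * name itself is hypothetical. */
typedef enum {
    MIGRATION_SUCCESS,
    MIGRATION_CANCELLED,
    MIGRATION_FAILED,
    MIGRATION_IN_PROGRESS,
} migrationState;

/* Hypothetical, simplified view of one parsed migration entry. */
typedef struct {
    int first_slot;
    int last_slot;
    long long start_time; /* assumed: a larger value means more recent */
    const char *state;    /* assumed state strings, see classifyState() */
} migrationEntry;

/* Map an (assumed) state string to an outcome. Anything that is not a
 * known terminal state is treated as still in progress. */
static migrationState classifyState(const char *state) {
    if (strcmp(state, "success") == 0) return MIGRATION_SUCCESS;
    if (strcmp(state, "cancelled") == 0) return MIGRATION_CANCELLED;
    if (strcmp(state, "failed") == 0) return MIGRATION_FAILED;
    return MIGRATION_IN_PROGRESS;
}

/* Return the most recent entry whose range covers [first, last],
 * or NULL if no entry covers it. */
static const migrationEntry *findLatest(const migrationEntry *entries, size_t n,
                                        int first, int last) {
    assert(first <= last);
    const migrationEntry *best = NULL;
    for (size_t i = 0; i < n; i++) {
        const migrationEntry *e = &entries[i];
        if (e->first_slot > first || e->last_slot < last) continue;
        if (best == NULL || e->start_time > best->start_time) best = e;
    }
    return best;
}
```

A caller would invoke `findLatest` on each poll and stop as soon as `classifyState` returns something other than `MIGRATION_IN_PROGRESS`.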

Fixes #2504

Signed-off-by: Jacob Murphy <[email protected]>

codecov · 2025-10-20T23:00:06Z

Codecov Report

❌ Patch coverage is 76.53846% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.61%. Comparing base (1cf0df9) to head (2287bef).
⚠️ Report is 32 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/valkey-cli.c	76.53%	61 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2755      +/-   ##
============================================
+ Coverage     72.60%   72.61%   +0.01%     
============================================
  Files           128      128              
  Lines         71303    71526     +223     
============================================
+ Hits          51767    51941     +174     
- Misses        19536    19585      +49

Files with missing lines	Coverage Δ
src/valkey-cli.c	`57.06% <76.53%> (+0.84%)`	⬆️

... and 10 files with indirect coverage changes

zuiderkwast

LGTM in general

zuiderkwast · 2025-10-30T10:33:36Z

src/valkey-cli.c

+            /* For atomic slot migration, we move everything as one command */
+            int result = clusterManagerMoveSlotRangesASM(item->source, target, item->slot_ranges, opts, &err);
+            if (!result) {
+                clusterManagerLogErr("clusterManagerMoveSlotRangeASM failed: %s\n", err);


These error messages are to the user. They shouldn't normally contain references to the code internals.

How about something like this?

Suggested change

-                clusterManagerLogErr("clusterManagerMoveSlotRangeASM failed: %s\n", err);
+                clusterManagerLogErr("Atomic slot migration failed: %s\n", err);

zuiderkwast · 2025-10-30T11:03:11Z

src/valkey-cli.c

+                if (opts & CLUSTER_MANAGER_CMD_FLAG_USE_ATOMIC_SLOT_MIGRATION) {
+                    /* Now that the migration is done, print all the #'s */
+                    printf("#");
+                    continue;
+                }


Hehe, this is an atomic progress bar, completing atomically in one step. 😄

Would it make sense to track the progress in clusterManagerMoveSlotRangesASM and print some progress indicator based on the syncslots states or something? Maybe later? We can ignore it for now.

zuiderkwast · 2025-10-30T11:10:05Z

src/valkey-cli.c

    return clusterManagerCommandCheck(argc, argv);
 }

+static int clusterApplyReshardTable(list *table, clusterManagerNode *target, int opts) {


Can we add a small documentation comment to this function?

Suggested change

-static int clusterApplyReshardTable(list *table, clusterManagerNode *target, int opts) {
+/* Perform the slot migrations specified in the table, which is a list of
+ * clusterManagerReshardTableItem pointers. Opts is a bitwise-or of
+ * CLUSTER_MANAGER_CMD_FLAG_ flags. Returns 1 on success, 0 on error. */
+static int clusterApplyReshardTable(list *table, clusterManagerNode *target, int opts) {

zuiderkwast · 2025-10-30T11:24:33Z

src/valkey-cli.c

-            }
-            goto cleanup;
-        }
+    int opts = CLUSTER_MANAGER_OPT_VERBOSE | config.cluster_manager_command.flags;


Mixing CLUSTER_MANAGER_OPT_* and CLUSTER_MANAGER_CMD_FLAG_* here? These are two separate sets of flags that can conflict, can't they?
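To make the concern concrete, here is a minimal sketch of how two independently numbered flag sets can collide when OR-ed into one integer. The `DEMO_*` names and bit values are hypothetical and chosen only to show the effect; they are not the actual values defined in valkey-cli.c.

```c
#include <assert.h>

/* Two hypothetical flag sets, each numbered from bit 0 on its own. */
enum { DEMO_OPT_VERBOSE = 1 << 0, DEMO_OPT_COLD = 1 << 1 };
enum { DEMO_CMD_FLAG_FIX = 1 << 0, DEMO_CMD_FLAG_ATOMIC = 1 << 2 };

#define DEMO_OPT_ALL (DEMO_OPT_VERBOSE | DEMO_OPT_COLD)
#define DEMO_CMD_FLAG_ALL (DEMO_CMD_FLAG_FIX | DEMO_CMD_FLAG_ATOMIC)

/* Nonzero when at least one bit is claimed by both masks. */
static int flagSetsOverlap(int mask_a, int mask_b) {
    return (mask_a & mask_b) != 0;
}

/* One way to avoid the ambiguity: carry each set in its own field. */
typedef struct {
    int opts;      /* DEMO_OPT_* bits only */
    int cmd_flags; /* DEMO_CMD_FLAG_* bits only */
} demoMoveArgs;

static demoMoveArgs demoMakeArgs(int opts, int cmd_flags) {
    /* Reject bits that do not belong to the respective set. */
    assert((opts & ~DEMO_OPT_ALL) == 0);
    assert((cmd_flags & ~DEMO_CMD_FLAG_ALL) == 0);
    demoMoveArgs args = {opts, cmd_flags};
    return args;
}
```

With these illustrative values, `DEMO_OPT_COLD | DEMO_CMD_FLAG_FIX` also tests true for `DEMO_OPT_VERBOSE`, even though verbose was never requested. Whether the real constants overlap depends on their definitions in valkey-cli.c.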

enjoy-binbin

overall LGTM.

enjoy-binbin · 2025-11-10T02:23:26Z

src/valkey-cli.c

+    valkeyAppendCommandArgv(node1->context, argv_idx, argv, argvlen);
+    valkeyReply *reply;
+    if (err != NULL) *err = NULL;
+    if (valkeyGetReply(node1->context, (void **)&reply) != VALKEY_OK) {


Suggested change

-    if (valkeyGetReply(node1->context, (void **)&reply) != VALKEY_OK) {
+    if (valkeyGetReply(node1->context, (void **)&reply) != VALKEY_OK || reply == NULL) {

enjoy-binbin · 2025-11-10T02:24:33Z

src/valkey-cli.c

+            CLUSTER_MANAGER_PRINT_REPLY_ERROR(node1, reply->str);
+        goto cleanup;
+    }
+cleanup:


Suggested change

cleanup:

cleanup:

enjoy-binbin · 2025-11-10T02:25:46Z

src/valkey-cli.c

+    MIGRATION_SUCCESS,
+    MIGRATION_CANCELLED,
+    MIGRATION_FAILED,
+    MIGRATION_IN_PROGRESS


so we won't touch this line again if we add a new field

Suggested change

-    MIGRATION_IN_PROGRESS
+    MIGRATION_IN_PROGRESS,

enjoy-binbin · 2025-11-10T02:28:21Z

tests/unit/cluster/cli.tcl

+foreach use_atomic_slot_migration {0 1} {
+    # start three servers
+    set base_conf [list cluster-enabled yes cluster-node-timeout 1000 cluster-databases 16]


let's just do something like this, so we can avoid the huge diff.

Suggested change

-foreach use_atomic_slot_migration {0 1} {
-    # start three servers
-    set base_conf [list cluster-enabled yes cluster-node-timeout 1000 cluster-databases 16]
+foreach use_atomic_slot_migration {0 1} {
+# start three servers
+set base_conf [list cluster-enabled yes cluster-node-timeout 1000 cluster-databases 16]

enjoy-binbin · 2025-11-10T02:29:47Z

tests/unit/cluster/cli.tcl

 }

-}
+}


Suggested change

-}
+} ;# foreach use_atomic_slot_migration

enjoy-binbin · 2025-11-10T02:39:51Z

src/valkey-cli.c

    return success;
 }

+static int clusterManagerMoveSlotRangesASM(clusterManagerNode *source, clusterManagerNode *target, list *slot_ranges, int opts, char **err) {


we are sharing the same opts flags as clusterManagerMoveSlot, we need a comment in here

enjoy-binbin · 2025-11-10T02:41:09Z

src/valkey-cli.c

+        fflush(stdout);
+        sdsfree(to_print);
+    }
+    int print_dots = (opts & CLUSTER_MANAGER_OPT_VERBOSE), option_cold = (opts & CLUSTER_MANAGER_OPT_COLD), success = 1, in_progress = 0;


what does CLUSTER_MANAGER_OPT_COLD do in ASM? do we actually use cold in ASM?

Add support for Atomic Slot Migration to CLI

e05963c

Signed-off-by: Jacob Murphy <[email protected]>

github-actions bot assigned murphyjacob4 Oct 20, 2025

murphyjacob4 requested a review from enjoy-binbin October 20, 2025 22:38

Clang format fixes

2287bef

Signed-off-by: Jacob Murphy <[email protected]>

zuiderkwast reviewed Oct 30, 2025

View reviewed changes

enjoy-binbin reviewed Nov 10, 2025

View reviewed changes

enjoy-binbin added this to Valkey 9.1 Nov 10, 2025

enjoy-binbin added the release-notes This issue should get a line item in the release notes label Nov 10, 2025
