@murphyjacob4 (Contributor) commented Oct 20, 2025

Adds a new option --cluster-use-atomic-slot-migration. This will apply to both --cluster reshard and --cluster rebalance commands.

We could do some more optimizations here, but for now we batch all the slot ranges for one (source, target) pair and send them off as one CLUSTER MIGRATESLOTS request. We then wait for this request to finish through polling CLUSTER GETSLOTMIGRATIONS once every 100ms. We parse CLUSTER GETSLOTMIGRATIONS and look for the most recent migration affecting the requested slot range, then check if it is in progress, failed, cancelled, or successful. If there is a failure or cancellation, we give this error to the user.
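
As a rough illustration of the wait-and-classify step described above, here is a minimal sketch, not the code from this PR. Every `demo_` name is hypothetical, the state strings are placeholders rather than the server's exact vocabulary, and the `fetch` callback stands in for "send CLUSTER GETSLOTMIGRATIONS and pick the most recent entry for the requested slot range". The `max_polls` cap is also an assumption added so the sketch always terminates.

```c
#define _POSIX_C_SOURCE 200809L /* for nanosleep() */
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical outcome set: the four cases named in the description. */
typedef enum {
    DEMO_MIGRATION_SUCCESS,
    DEMO_MIGRATION_CANCELLED,
    DEMO_MIGRATION_FAILED,
    DEMO_MIGRATION_IN_PROGRESS,
} demo_migration_state;

/* Map a state string to an outcome. The strings are placeholders for this
 * sketch; anything that is not a known terminal state counts as in progress. */
static demo_migration_state demo_classify_state(const char *state) {
    if (strcmp(state, "success") == 0) return DEMO_MIGRATION_SUCCESS;
    if (strcmp(state, "cancelled") == 0) return DEMO_MIGRATION_CANCELLED;
    if (strcmp(state, "failed") == 0) return DEMO_MIGRATION_FAILED;
    return DEMO_MIGRATION_IN_PROGRESS;
}

/* Hypothetical callback: returns the state string of the most recent
 * migration covering the requested slot range. */
typedef const char *(*demo_fetch_state_fn)(void *ctx);

/* Poll until a terminal state is seen or max_polls attempts are used up.
 * Returns DEMO_MIGRATION_IN_PROGRESS if the cap is reached first. */
static demo_migration_state demo_wait_for_migration(demo_fetch_state_fn fetch, void *ctx,
                                                    long interval_ms, int max_polls) {
    assert(fetch != NULL && interval_ms >= 0);
    struct timespec delay = {.tv_sec = interval_ms / 1000,
                             .tv_nsec = (interval_ms % 1000) * 1000000L};
    for (int i = 0; i < max_polls; i++) {
        demo_migration_state state = demo_classify_state(fetch(ctx));
        if (state != DEMO_MIGRATION_IN_PROGRESS) return state;
        nanosleep(&delay, NULL);
    }
    return DEMO_MIGRATION_IN_PROGRESS;
}

/* Scripted fetcher for trying the loop without a server: replays a fixed
 * list of states and then keeps repeating the last one. */
typedef struct {
    const char **states;
    int count;
    int next;
} demo_script;

static const char *demo_script_fetch(void *ctx) {
    demo_script *script = ctx;
    if (script->next < script->count) return script->states[script->next++];
    return script->states[script->count - 1];
}
```

Calling `demo_wait_for_migration(demo_script_fetch, &script, 100, 50)` with a scripted sequence ending in a terminal state returns that state; a caller would then report failure or cancellation to the user, as the description says.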

Fixes #2504

Signed-off-by: Jacob Murphy <[email protected]>

codecov bot commented Oct 20, 2025

Codecov Report

❌ Patch coverage is 76.53846% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.61%. Comparing base (1cf0df9) to head (2287bef).
⚠️ Report is 32 commits behind head on unstable.

Files with missing lines    Patch %    Lines
src/valkey-cli.c            76.53%     61 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2755      +/-   ##
============================================
+ Coverage     72.60%   72.61%   +0.01%     
============================================
  Files           128      128              
  Lines         71303    71526     +223     
============================================
+ Hits          51767    51941     +174     
- Misses        19536    19585      +49     
Files with missing lines    Coverage Δ
src/valkey-cli.c            57.06% <76.53%> (+0.84%) ⬆️

... and 10 files with indirect coverage changes

@zuiderkwast (Contributor) left a comment

LGTM in general

/* For atomic slot migration, we move everything as one command */
int result = clusterManagerMoveSlotRangesASM(item->source, target, item->slot_ranges, opts, &err);
if (!result) {
clusterManagerLogErr("clusterManagerMoveSlotRangeASM failed: %s\n", err);

These error messages are to the user. They shouldn't normally contain references to the code internals.

How about something like this?

Suggested change
- clusterManagerLogErr("clusterManagerMoveSlotRangeASM failed: %s\n", err);
+ clusterManagerLogErr("Atomic slot migration failed: %s\n", err);

Comment on lines +7645 to +7649
if (opts & CLUSTER_MANAGER_CMD_FLAG_USE_ATOMIC_SLOT_MIGRATION) {
/* Now that the migration is done, print all the #'s */
printf("#");
continue;
}

Hehe, this is an atomic progress bar, completing atomically in one step. 😄

Would it make sense to track the progress in clusterManagerMoveSlotRangesASM and print some progress indicator based on the syncslots states or something? Maybe later? We can ignore it for now.

return clusterManagerCommandCheck(argc, argv);
}

static int clusterApplyReshardTable(list *table, clusterManagerNode *target, int opts) {

Can we add a small documentation comment to this function?

Suggested change
- static int clusterApplyReshardTable(list *table, clusterManagerNode *target, int opts) {
+ /* Perform the slot migrations specified in the table, which is a list of
+  * clusterManagerReshardTableItem pointers. Opts is a bitwise-or of
+  * CLUSTER_MANAGER_CMD_FLAG_ flags. Returns 1 on success, 0 on error. */
+ static int clusterApplyReshardTable(list *table, clusterManagerNode *target, int opts) {

}
goto cleanup;
}
int opts = CLUSTER_MANAGER_OPT_VERBOSE | config.cluster_manager_command.flags;

Mixing CLUSTER_MANAGER_OPT_* and CLUSTER_MANAGER_CMD_FLAG_* here? These are two separate sets of flags that can conflict, can't they?
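
To make the concern concrete, here is a small sketch of how two independently numbered flag sets can alias once they are OR-ed into a single `int`. The names and bit values are hypothetical and chosen to collide on purpose; they are not copied from valkey-cli.c. Keeping the two sets in separate fields is one possible way to avoid the problem, not necessarily the fix this PR should adopt.

```c
/* Hypothetical flag sets. The bit values are illustrative only and were
 * picked so that the two sets overlap. */
enum { DEMO_OPT_COLD = 1 << 1, DEMO_OPT_VERBOSE = 1 << 7 };
enum { DEMO_CMD_FLAG_REPLACE = 1 << 1, DEMO_CMD_FLAG_USE_ASM = 1 << 7 };

/* The hazard: once both sets share one int, a bit no longer says which
 * set it came from. */
static int demo_merged(int opts, int cmd_flags) {
    return opts | cmd_flags;
}

/* One way out: carry the two sets in separate fields. */
typedef struct {
    int opts;      /* DEMO_OPT_* bits only */
    int cmd_flags; /* DEMO_CMD_FLAG_* bits only */
} demo_move_options;

static int demo_is_verbose(const demo_move_options *o) {
    return (o->opts & DEMO_OPT_VERBOSE) != 0;
}

static int demo_uses_asm(const demo_move_options *o) {
    return (o->cmd_flags & DEMO_CMD_FLAG_USE_ASM) != 0;
}
```

With these demo values, `demo_merged(DEMO_OPT_VERBOSE, 0)` also tests true for `DEMO_CMD_FLAG_USE_ASM`, while a `demo_move_options` with only `opts` set reports verbose without reporting atomic slot migration.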

@enjoy-binbin (Member) left a comment

overall LGTM.

valkeyAppendCommandArgv(node1->context, argv_idx, argv, argvlen);
valkeyReply *reply;
if (err != NULL) *err = NULL;
if (valkeyGetReply(node1->context, (void **)&reply) != VALKEY_OK) {

Suggested change
- if (valkeyGetReply(node1->context, (void **)&reply) != VALKEY_OK) {
+ if (valkeyGetReply(node1->context, (void **)&reply) != VALKEY_OK || reply == NULL) {
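
The point of the extra check is that later code dereferences the reply (for example `reply->str`), so a NULL reply must be rejected before that happens. A minimal sketch of the pattern with stand-in types follows; `demo_reply`, the `DEMO_` constants, and `demo_check_reply` are hypothetical and only mimic the shape of a client reply, they are not the client library's API.

```c
#include <stddef.h>

/* Stand-ins for a client library's status codes and reply types. */
#define DEMO_OK 0
#define DEMO_ERR (-1)
#define DEMO_REPLY_ERROR 1
#define DEMO_REPLY_STATUS 2

typedef struct {
    int type;
    const char *str;
} demo_reply;

/* Returns 1 if the reply is usable, 0 otherwise. On failure *err (if
 * provided) points at a message; on success it is reset to NULL. The
 * reply is only dereferenced after both the status and the pointer have
 * been checked. */
static int demo_check_reply(int status, const demo_reply *reply, const char **err) {
    if (err != NULL) *err = NULL;
    if (status != DEMO_OK || reply == NULL) {
        if (err != NULL) *err = "no reply from node";
        return 0;
    }
    if (reply->type == DEMO_REPLY_ERROR) {
        if (err != NULL) *err = reply->str;
        return 0;
    }
    return 1;
}
```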

CLUSTER_MANAGER_PRINT_REPLY_ERROR(node1, reply->str);
goto cleanup;
}
cleanup:

Suggested change
cleanup:
cleanup:

MIGRATION_SUCCESS,
MIGRATION_CANCELLED,
MIGRATION_FAILED,
MIGRATION_IN_PROGRESS

Add a trailing comma so we won't touch this line again if we add a new value.

Suggested change
- MIGRATION_IN_PROGRESS
+ MIGRATION_IN_PROGRESS,

Comment on lines +35 to +37
foreach use_atomic_slot_migration {0 1} {
# start three servers
set base_conf [list cluster-enabled yes cluster-node-timeout 1000 cluster-databases 16]

let's just do something like this, so we can avoid the huge diff.

Suggested change
foreach use_atomic_slot_migration {0 1} {
# start three servers
set base_conf [list cluster-enabled yes cluster-node-timeout 1000 cluster-databases 16]
foreach use_atomic_slot_migration {0 1} {
# start three servers
set base_conf [list cluster-enabled yes cluster-node-timeout 1000 cluster-databases 16]

}

}
}

Suggested change
- }
+ } ;# foreach use_atomic_slot_migration

return success;
}

static int clusterManagerMoveSlotRangesASM(clusterManagerNode *source, clusterManagerNode *target, list *slot_ranges, int opts, char **err) {

We are sharing the same opts flags as clusterManagerMoveSlot, so we need a comment here.

fflush(stdout);
sdsfree(to_print);
}
int print_dots = (opts & CLUSTER_MANAGER_OPT_VERBOSE), option_cold = (opts & CLUSTER_MANAGER_OPT_COLD), success = 1, in_progress = 0;

What does CLUSTER_MANAGER_OPT_COLD do in ASM? Do we actually use cold in ASM?

@enjoy-binbin added the release-notes label Nov 10, 2025
Labels

release-notes This issue should get a line item in the release notes

Development

Successfully merging this pull request may close these issues.

[NEW] Use atomic slot migration in valkey-cli

3 participants