Add support for Atomic Slot Migration to CLI #2755
base: unstable
Conversation
Signed-off-by: Jacob Murphy <[email protected]>
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##            unstable    #2755      +/-  ##
============================================
+ Coverage      72.60%   72.61%    +0.01%
============================================
  Files            128      128
  Lines          71303    71526      +223
============================================
+ Hits           51767    51941      +174
- Misses         19536    19585       +49
zuiderkwast left a comment
LGTM in general
/* For atomic slot migration, we move everything as one command */
int result = clusterManagerMoveSlotRangesASM(item->source, target, item->slot_ranges, opts, &err);
if (!result) {
    clusterManagerLogErr("clusterManagerMoveSlotRangeASM failed: %s\n", err);
These error messages are shown to the user. They shouldn't normally contain references to code internals.
How about something like this?
-    clusterManagerLogErr("clusterManagerMoveSlotRangeASM failed: %s\n", err);
+    clusterManagerLogErr("Atomic slot migration failed: %s\n", err);
if (opts & CLUSTER_MANAGER_CMD_FLAG_USE_ATOMIC_SLOT_MIGRATION) {
    /* Now that the migration is done, print all the #'s */
    printf("#");
    continue;
}
Hehe, this is an atomic progress bar, completing atomically in one step. 😄
Would it make sense to track the progress in clusterManagerMoveSlotRangesASM and print some progress indicator based on the syncslots states or something? Maybe later? We can ignore it for now.
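A minimal sketch of what that could look like (illustrative only, not code from this PR); clusterManagerCountFinishedRanges() is a hypothetical helper that would reuse the CLUSTER GETSLOTMIGRATIONS parsing this PR already has and report how many of the requested ranges have reached a terminal state:

/* Rough sketch only: print one '#' per finished slot range while polling,
 * instead of printing them all after the migration completes. */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: parses CLUSTER GETSLOTMIGRATIONS on the source node
 * and returns how many of the requested ranges are finished, or -1 on error. */
extern int clusterManagerCountFinishedRanges(clusterManagerNode *source, list *slot_ranges, char **err);

static int waitForMigrationWithProgress(clusterManagerNode *source, list *slot_ranges, int total, char **err) {
    int printed = 0;
    while (printed < total) {
        int done = clusterManagerCountFinishedRanges(source, slot_ranges, err);
        if (done < 0) return 0;          /* propagate the error to the caller */
        while (printed < done) {         /* one '#' per newly finished range */
            printf("#");
            printed++;
        }
        fflush(stdout);
        if (printed < total) usleep(100 * 1000); /* same 100ms cadence the PR already uses */
    }
    printf("\n");
    return 1;
}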
    return clusterManagerCommandCheck(argc, argv);
}

static int clusterApplyReshardTable(list *table, clusterManagerNode *target, int opts) {
Can we add a small documentation comment to this function?
-static int clusterApplyReshardTable(list *table, clusterManagerNode *target, int opts) {
+/* Perform the slot migrations specified in the table, which is a list of
+ * clusterManagerReshardTableItem pointers. Opts is a bitwise-or of
+ * CLUSTER_MANAGER_CMD_FLAG_ flags. Returns 1 on success, 0 on error. */
+static int clusterApplyReshardTable(list *table, clusterManagerNode *target, int opts) {
}
goto cleanup;
}
int opts = CLUSTER_MANAGER_OPT_VERBOSE | config.cluster_manager_command.flags;
Mixing CLUSTER_MANAGER_OPT_* and CLUSTER_MANAGER_CMD_FLAG_* here? These are two separate sets of flags that can conflict, can't they?
enjoy-binbin left a comment
overall LGTM.
valkeyAppendCommandArgv(node1->context, argv_idx, argv, argvlen);
valkeyReply *reply;
if (err != NULL) *err = NULL;
if (valkeyGetReply(node1->context, (void **)&reply) != VALKEY_OK) {
-if (valkeyGetReply(node1->context, (void **)&reply) != VALKEY_OK) {
+if (valkeyGetReply(node1->context, (void **)&reply) != VALKEY_OK || reply == NULL) {
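For context, a minimal sketch of the full defensive pattern this points at (illustrative only, not code from the PR); it assumes the libvalkey names valkeyContext, VALKEY_REPLY_ERROR, and freeReplyObject, and it skips proper error-string allocation to stay short:

/* Hypothetical helper: reads one queued reply and returns 1 on success,
 * 0 on failure. Both a VALKEY_ERR return and a NULL reply are treated as
 * connection-level failures; an error reply is a server-level failure. */
static int readOneReply(valkeyContext *ctx, const char **err) {
    valkeyReply *reply = NULL;
    if (err != NULL) *err = NULL;
    if (valkeyGetReply(ctx, (void **)&reply) != VALKEY_OK || reply == NULL) {
        if (err != NULL) *err = ctx->errstr;   /* no usable reply object */
        return 0;
    }
    if (reply->type == VALKEY_REPLY_ERROR) {
        if (err != NULL) *err = "server returned an error reply";
        freeReplyObject(reply);
        return 0;
    }
    freeReplyObject(reply);                    /* release the reply on every path */
    return 1;
}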
    CLUSTER_MANAGER_PRINT_REPLY_ERROR(node1, reply->str);
    goto cleanup;
}
cleanup:
-cleanup:
+cleanup:
    MIGRATION_SUCCESS,
    MIGRATION_CANCELLED,
    MIGRATION_FAILED,
    MIGRATION_IN_PROGRESS
so we won't touch this line again if we add a new field
-    MIGRATION_IN_PROGRESS
+    MIGRATION_IN_PROGRESS,
foreach use_atomic_slot_migration {0 1} {
    # start three servers
    set base_conf [list cluster-enabled yes cluster-node-timeout 1000 cluster-databases 16]
let's just do something like this, so we can avoid the huge diff.
-foreach use_atomic_slot_migration {0 1} {
-    # start three servers
-    set base_conf [list cluster-enabled yes cluster-node-timeout 1000 cluster-databases 16]
+foreach use_atomic_slot_migration {0 1} {
+# start three servers
+set base_conf [list cluster-enabled yes cluster-node-timeout 1000 cluster-databases 16]
}

}
}
-}
+} ;# foreach use_atomic_slot_migration
    return success;
}

static int clusterManagerMoveSlotRangesASM(clusterManagerNode *source, clusterManagerNode *target, list *slot_ranges, int opts, char **err) {
We are sharing the same opts flags as clusterManagerMoveSlot, so we need a comment here.
    fflush(stdout);
    sdsfree(to_print);
}
int print_dots = (opts & CLUSTER_MANAGER_OPT_VERBOSE), option_cold = (opts & CLUSTER_MANAGER_OPT_COLD), success = 1, in_progress = 0;
what does CLUSTER_MANAGER_OPT_COLD do in ASM? do we actually use cold in ASM?
Adds a new option, --cluster-use-atomic-slot-migration. This will apply to both the --cluster reshard and --cluster rebalance commands.

We could do some more optimizations here, but for now we batch all the slot ranges for one (source, target) pair and send them off as one CLUSTER MIGRATESLOTS request. We then wait for this request to finish by polling CLUSTER GETSLOTMIGRATIONS once every 100ms. We parse CLUSTER GETSLOTMIGRATIONS and look for the most recent migration affecting the requested slot range, then check whether it is in progress, failed, cancelled, or successful. If there is a failure or cancellation, we report this error to the user.

Fixes #2504
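For illustration, a minimal sketch of that flow (not the PR's actual code). The two helpers marked hypothetical stand in for the real MIGRATESLOTS command building and GETSLOTMIGRATIONS parsing, and the state enum mirrors the one quoted earlier in the review:

/* Sketch of the described flow: one batched CLUSTER MIGRATESLOTS request per
 * (source, target) pair, then poll CLUSTER GETSLOTMIGRATIONS every 100ms until
 * the migration covering our ranges reaches a terminal state. */
#include <unistd.h>

typedef enum { MIGRATION_IN_PROGRESS, MIGRATION_SUCCESS, MIGRATION_CANCELLED, MIGRATION_FAILED } migrationState;

/* Hypothetical: sends CLUSTER MIGRATESLOTS for every batched range to the
 * source node, targeting target_id; returns 1 if the command was accepted. */
extern int sendBatchedMigrateSlots(clusterManagerNode *source, const char *target_id, list *slot_ranges, char **err);

/* Hypothetical: runs CLUSTER GETSLOTMIGRATIONS on the source and returns the
 * state of the most recent migration that covers the requested ranges. */
extern migrationState queryMigrationState(clusterManagerNode *source, list *slot_ranges, char **err);

static int migrateRangesAtomically(clusterManagerNode *source, const char *target_id, list *slot_ranges, char **err) {
    if (!sendBatchedMigrateSlots(source, target_id, slot_ranges, err)) return 0;
    for (;;) {
        migrationState st = queryMigrationState(source, slot_ranges, err);
        if (st == MIGRATION_SUCCESS) return 1;
        if (st == MIGRATION_FAILED || st == MIGRATION_CANCELLED) return 0; /* *err carries the reason */
        usleep(100 * 1000); /* 100ms polling cadence from the description above */
    }
}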
CLUSTER MIGRATESLOTSrequest. We then wait for this request to finish through pollingCLUSTER GETSLOTMIGRATIONSonce every 100ms. We parseCLUSTER GETSLOTMIGRATIONSand look for the most recent migration affecting the requested slot range, then check if it is in progress, failed, cancelled, or successful. If there is a failure or cancellation, we give this error to the user.Fixes #2504