Skip to content

Conversation

@enjoy-binbin
Copy link
Member

@enjoy-binbin enjoy-binbin commented Aug 21, 2025

When we adding atomic slot migration in #1949, we reused a lot of rdb save code,
it was an easier way to implement ASM in the first time, but it comes with some
side effect. Like we are using CHILD_TYPE_RDB to do the fork, we use rdb.c/rdb.h
function to save the snapshot, these mess up the logs (we will print some logs
saying we are doing RDB stuff) and mess up the info fields (we will say we are
rdb_bgsave_in_progress but actually we are doing slot migration).

In addition, it makes the code difficult to maintain. The rdb_save method uses
a lot of rdb_* variables, but we are actually doing slot migration. If we want
to support one fork with multiple target nodes, we need to rewrite these code
for a better cleanup.

Note that the changes to rdb.c/rdb.h are reverting previous changes from when
we was reusing this code for slot migration. The slot migration snapshot logic
is similar to the previous diskless replication. We use pipe to transfer the
snapshot data from the child process to the parent process.

Interface changes:

  • New slot_migration_fork_in_progress info field.
  • New cow_size field in CLUSTER GETSLOTMIGRATIONS command.
  • Also add slot migration fork to the cluster class trace latency.

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 21, 2025
Signed-off-by: Binbin <[email protected]>
@codecov
Copy link

codecov bot commented Aug 21, 2025

Codecov Report

❌ Patch coverage is 69.60784% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.23%. Comparing base (93d7cca) to head (442c51f).
⚠️ Report is 6 commits behind head on unstable.

Files with missing lines Patch % Lines
src/replication.c 40.35% 34 Missing ⚠️
src/cluster_migrateslots.c 81.48% 15 Missing ⚠️
src/rdb.c 84.00% 8 Missing ⚠️
src/server.c 66.66% 4 Missing ⚠️
src/module.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2533      +/-   ##
============================================
+ Coverage     72.21%   72.23%   +0.01%     
============================================
  Files           127      127              
  Lines         70826    70934     +108     
============================================
+ Hits          51147    51237      +90     
- Misses        19679    19697      +18     
Files with missing lines Coverage Δ
src/aof.c 81.17% <100.00%> (+0.06%) ⬆️
src/childinfo.c 97.43% <100.00%> (-0.07%) ⬇️
src/db.c 93.25% <100.00%> (+<0.01%) ⬆️
src/rdb.h 100.00% <ø> (ø)
src/server.h 100.00% <ø> (ø)
src/module.c 9.78% <0.00%> (-0.01%) ⬇️
src/server.c 88.36% <66.66%> (-0.09%) ⬇️
src/rdb.c 76.80% <84.00%> (+0.36%) ⬆️
src/cluster_migrateslots.c 91.52% <81.48%> (-1.02%) ⬇️
src/replication.c 85.97% <40.35%> (-1.02%) ⬇️

... and 12 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@enjoy-binbin enjoy-binbin marked this pull request as ready for review August 21, 2025 11:06
@enjoy-binbin enjoy-binbin added release-notes This issue should get a line item in the release notes major-decision-pending Major decision pending by TSC team labels Aug 22, 2025
@enjoy-binbin enjoy-binbin moved this to In Progress in Valkey 9.0 Aug 22, 2025
Signed-off-by: Binbin <[email protected]>
enjoy-binbin and others added 2 commits August 26, 2025 10:10
Co-authored-by: Jacob Murphy <[email protected]>
Signed-off-by: Binbin <[email protected]>
Copy link
Contributor

@murphyjacob4 murphyjacob4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@madolson - I think you mentioned you had some opinions on the INFO fields during this weeks meeting?

addReplyBulkCString(c, "message");
addReplyBulkCString(c, job->status_msg ? job->status_msg : "");
addReplyBulkCString(c, "cow_size");
addReplyLongLong(c, (long long)job->stat_cow_bytes);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also expose output buffer size in here? We may be missing this information right now to allow people to monitor its progress. (Or we could do this in another PR, by exposing both the import slot client and the export slot client in the client info, i.e. adding a client flag. But i can't think of a good flag char, since import source already taken 'I' char)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can probably go with 'i' and 'e', stand for import or export. But no_evict already taken 'e'.

So we can go with 'i' and 'm', stand for importing or migrating as the old word.s

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To monitor it, use CLIENT LIST? 🤔 I guess it's possible, yes, but maybe it's easier to use if we put some progress information in CLUSTER SLOTMIGRATIONS.

Maybe we can do it at the same time as #2504, if we want valkey-cli to print some progress indicator in interactive mode (if stdout is a TTY).

@madolson madolson added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Sep 8, 2025
@enjoy-binbin enjoy-binbin merged commit f6a0f8c into valkey-io:unstable Sep 18, 2025
110 of 112 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Valkey 9.0 Sep 18, 2025
@enjoy-binbin enjoy-binbin deleted the asm-snapshot-improve branch September 18, 2025 08:26
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request Sep 19, 2025
When we adding atomic slot migration in valkey-io#1949, we reused a lot of rdb save code,
it was an easier way to implement ASM in the first time, but it comes with some
side effect. Like we are using CHILD_TYPE_RDB to do the fork, we use rdb.c/rdb.h
function to save the snapshot, these mess up the logs (we will print some logs
saying we are doing RDB stuff) and mess up the info fields (we will say we are
rdb_bgsave_in_progress but actually we are doing slot migration).

In addition, it makes the code difficult to maintain. The rdb_save method uses
a lot of rdb_* variables, but we are actually doing slot migration. If we want
to support one fork with multiple target nodes, we need to rewrite these code
for a better cleanup.

Note that the changes to rdb.c/rdb.h are reverting previous changes from when
we was reusing this code for slot migration. The slot migration snapshot logic
is similar to the previous diskless replication. We use pipe to transfer the
snapshot data from the child process to the parent process.

Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in CLUSTER GETSLOTMIGRATIONS command.
- Also add slot migration fork to the cluster class trace latency.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Co-authored-by: Jacob Murphy <[email protected]>
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
When we adding atomic slot migration in #1949, we reused a lot of rdb save code,
it was an easier way to implement ASM in the first time, but it comes with some
side effect. Like we are using CHILD_TYPE_RDB to do the fork, we use rdb.c/rdb.h
function to save the snapshot, these mess up the logs (we will print some logs
saying we are doing RDB stuff) and mess up the info fields (we will say we are
rdb_bgsave_in_progress but actually we are doing slot migration).

In addition, it makes the code difficult to maintain. The rdb_save method uses
a lot of rdb_* variables, but we are actually doing slot migration. If we want
to support one fork with multiple target nodes, we need to rewrite these code
for a better cleanup.

Note that the changes to rdb.c/rdb.h are reverting previous changes from when
we was reusing this code for slot migration. The slot migration snapshot logic
is similar to the previous diskless replication. We use pipe to transfer the
snapshot data from the child process to the parent process.

Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in CLUSTER GETSLOTMIGRATIONS command.
- Also add slot migration fork to the cluster class trace latency.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Co-authored-by: Jacob Murphy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cluster major-decision-approved Major decision approved by TSC team release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants