Separate RDB snapshotting from atomic slot migration #2533

enjoy-binbin · 2025-08-21T08:32:26Z

When we added atomic slot migration (ASM) in #1949, we reused a lot of the RDB
save code. That was the easiest way to implement ASM at first, but it comes with
some side effects. For example, we use CHILD_TYPE_RDB for the fork and the
rdb.c/rdb.h functions to save the snapshot. This messes up the logs (we print
messages saying we are doing RDB work) and the INFO fields (we report
rdb_bgsave_in_progress when we are actually doing slot migration).
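
The mislabeling can be illustrated with a toy model. The enum values and function names below are hypothetical and are not Valkey's actual definitions; the sketch only shows why a status field derived from a shared child type cannot tell a slot migration fork from an RDB save, and how a dedicated child type avoids that.

```c
#include <stdbool.h>

/* Hypothetical child types; the real server defines its own set. */
typedef enum {
    TOY_CHILD_NONE,
    TOY_CHILD_RDB,
    TOY_CHILD_SLOT_MIGRATION
} toy_child_type;

/* Each status field is derived purely from the type of the running child. */
bool toy_rdb_bgsave_in_progress(toy_child_type t) {
    return t == TOY_CHILD_RDB;
}

bool toy_slot_migration_fork_in_progress(toy_child_type t) {
    return t == TOY_CHILD_SLOT_MIGRATION;
}
```

In this model, a slot migration fork started as TOY_CHILD_RDB would make toy_rdb_bgsave_in_progress return true, which is the kind of misreporting described above.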

In addition, it makes the code difficult to maintain. The rdb_save method uses
a lot of rdb_* variables even though we are actually doing slot migration. If we
want to support one fork with multiple target nodes, we need to rewrite this
code to allow a better cleanup.

Note that the changes to rdb.c/rdb.h revert earlier changes made when we were
reusing this code for slot migration. The slot migration snapshot logic is
similar to the previous diskless replication: we use a pipe to transfer the
snapshot data from the child process to the parent process.
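
A minimal sketch of the child-to-parent pipe pattern, assuming a POSIX system. This is not the Valkey implementation: the function name is hypothetical, and the payload string stands in for serialized slot data.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that writes `payload` into a pipe, then reads it back in
 * the parent. Returns the number of bytes received, or -1 on error.
 * `out` must have room for `outlen` bytes. */
ssize_t toy_snapshot_via_pipe(const char *payload, char *out, size_t outlen) {
    int fds[2];
    if (pipe(fds) == -1) return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: it sees a copy-on-write view of the parent's memory as of
         * the fork, so `payload` plays the role of the snapshot data. */
        close(fds[0]);
        size_t len = strlen(payload), off = 0;
        while (off < len) {
            ssize_t w = write(fds[1], payload + off, len - off);
            if (w <= 0) _exit(1);
            off += (size_t)w;
        }
        close(fds[1]);
        _exit(0);
    }

    /* Parent: read until EOF or until `out` is full. A real server would
     * forward each chunk to the target node instead of buffering it. */
    close(fds[1]);
    size_t total = 0;
    ssize_t n = 0;
    while (total < outlen &&
           (n = read(fds[0], out + total, outlen - total)) > 0) {
        total += (size_t)n;
    }
    close(fds[0]);

    /* Reap the child so it does not linger as a zombie. */
    if (waitpid(pid, NULL, 0) == -1) return -1;
    if (n < 0) return -1;
    return (ssize_t)total;
}
```

The parent stays the only process that talks to the outside world, and the child only serializes the data it saw at fork time.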

Interface changes:

- New slot_migration_fork_in_progress info field.
- New cow_size field in CLUSTER GETSLOTMIGRATIONS command.
- Also add slot migration fork to the cluster class trace latency.
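
A monitoring tool could poll the new INFO field. The helper below is a sketch that assumes the usual `name:value` line format of INFO output; the function name is hypothetical, and this PR description does not say which INFO section carries the field.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the integer value of `field` in an INFO-style payload made of
 * "name:value" lines, or -1 if the field is absent. */
long toy_info_field(const char *info, const char *field) {
    size_t flen = strlen(field);
    const char *line = info;
    while (line && *line) {
        /* Match the whole field name, followed directly by ':'. */
        if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
            return strtol(line + flen + 1, NULL, 10);
        }
        /* Advance to the start of the next line, if any. */
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL;
    }
    return -1;
}
```

Calling `toy_info_field(reply, "slot_migration_fork_in_progress")` on a fetched INFO reply would then tell a slot migration fork apart from `rdb_bgsave_in_progress`.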

Signed-off-by: Jacob Murphy <[email protected]>

Signed-off-by: Binbin <[email protected]>

codecov · 2025-08-21T09:06:29Z

Codecov Report

❌ Patch coverage is 69.60784% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.23%. Comparing base (93d7cca) to head (442c51f).
⚠️ Report is 6 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/replication.c	40.35%	34 Missing ⚠️
src/cluster_migrateslots.c	81.48%	15 Missing ⚠️
src/rdb.c	84.00%	8 Missing ⚠️
src/server.c	66.66%	4 Missing ⚠️
src/module.c	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2533      +/-   ##
============================================
+ Coverage     72.21%   72.23%   +0.01%     
============================================
  Files           127      127              
  Lines         70826    70934     +108     
============================================
+ Hits          51147    51237      +90     
- Misses        19679    19697      +18

Files with missing lines	Coverage Δ
src/aof.c	`81.17% <100.00%> (+0.06%)`	⬆️
src/childinfo.c	`97.43% <100.00%> (-0.07%)`	⬇️
src/db.c	`93.25% <100.00%> (+<0.01%)`	⬆️
src/rdb.h	`100.00% <ø> (ø)`
src/server.h	`100.00% <ø> (ø)`
src/module.c	`9.78% <0.00%> (-0.01%)`	⬇️
src/server.c	`88.36% <66.66%> (-0.09%)`	⬇️
src/rdb.c	`76.80% <84.00%> (+0.36%)`	⬆️
src/cluster_migrateslots.c	`91.52% <81.48%> (-1.02%)`	⬇️
src/replication.c	`85.97% <40.35%> (-1.02%)`	⬇️

... and 12 files with indirect coverage changes



murphyjacob4

@madolson - I think you mentioned you had some opinions on the INFO fields during this week's meeting?


enjoy-binbin · 2025-09-02T02:41:39Z

src/cluster_migrateslots.c

        addReplyBulkCString(c, "message");
        addReplyBulkCString(c, job->status_msg ? job->status_msg : "");
+        addReplyBulkCString(c, "cow_size");
+        addReplyLongLong(c, (long long)job->stat_cow_bytes);


Let's also expose the output buffer size here? We may be missing this information right now, and people need it to monitor progress. (Or we could do this in another PR by exposing both the import slot client and the export slot client in the client info, i.e. by adding a client flag. But I can't think of a good flag char, since import source has already taken the 'I' char.)

We could probably go with 'i' and 'e', standing for import and export. But no_evict has already taken 'e'.

So we can go with 'i' and 'm', standing for importing and migrating, as in the old wording.

To monitor it, use CLIENT LIST? 🤔 I guess it's possible, yes, but maybe it's easier to use if we put some progress information in CLUSTER GETSLOTMIGRATIONS.

Maybe we can do it at the same time as #2504, if we want valkey-cli to print a progress indicator in interactive mode (if stdout is a TTY).
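
The "only if stdout is a TTY" check can be sketched as follows, assuming a POSIX system. The function name and the message format are hypothetical and are not taken from valkey-cli.

```c
#include <stdio.h>
#include <unistd.h>

/* Writes a one-line progress update to `fd` only when `fd` refers to a
 * terminal, so piped or redirected output stays clean. Returns 1 if the
 * update was written, 0 otherwise. */
int toy_print_progress(int fd, const char *job, long long bytes) {
    if (!isatty(fd)) return 0;

    char line[128];
    /* '\r' returns to column 0 so the next update overwrites this one. */
    int len = snprintf(line, sizeof(line), "\r%s: %lld bytes", job, bytes);
    if (len < 0) return 0;
    if ((size_t)len >= sizeof(line)) len = (int)sizeof(line) - 1;

    return write(fd, line, (size_t)len) == (ssize_t)len ? 1 : 0;
}
```

A caller would pass STDOUT_FILENO; when the output is redirected to a file or a pipe, the function prints nothing.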


murphyjacob4 and others added 2 commits August 21, 2025 00:28

Separate RDB snapshotting from atomic slot migration

eec7de9

Signed-off-by: Jacob Murphy <[email protected]>

add pipe back

8bdb2a5

Signed-off-by: Binbin <[email protected]>

github-actions bot assigned enjoy-binbin Aug 21, 2025

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 21, 2025

fix use-after-free

a354b0c

Signed-off-by: Binbin <[email protected]>

enjoy-binbin added 2 commits August 21, 2025 18:48

add slot-migration-cpulist

b298572

Signed-off-by: Binbin <[email protected]>

Fix test

5bc838e

Signed-off-by: Binbin <[email protected]>

enjoy-binbin marked this pull request as ready for review August 21, 2025 11:06

enjoy-binbin added release-notes This issue should get a line item in the release notes major-decision-pending Major decision pending by TSC team labels Aug 22, 2025

enjoy-binbin added this to Valkey 9.0 Aug 22, 2025

enjoy-binbin moved this to In Progress in Valkey 9.0 Aug 22, 2025

enjoy-binbin requested review from PingXie, madolson, murphyjacob4 and zuiderkwast August 22, 2025 02:39

also add slot migration fork to the cluster trace latency

1859d16

Signed-off-by: Binbin <[email protected]>

murphyjacob4 reviewed Aug 22, 2025

src/cluster_migrateslots.c (outdated)

src/replication.c

src/cluster_migrateslots.c

code review

cd2f3e3

Signed-off-by: Binbin <[email protected]>

murphyjacob4 reviewed Aug 25, 2025

src/cluster_migrateslots.c (outdated)

murphyjacob4 reviewed Aug 26, 2025

src/cluster_migrateslots.c (outdated)

murphyjacob4 reviewed Aug 26, 2025

src/cluster_migrateslots.c (outdated)

enjoy-binbin and others added 2 commits August 26, 2025 10:10

Update src/cluster_migrateslots.c

d4e05df

Co-authored-by: Jacob Murphy <[email protected]> Signed-off-by: Binbin <[email protected]>

code review from murphyjacob4

a575969

Signed-off-by: Binbin <[email protected]>

murphyjacob4 approved these changes Aug 26, 2025


madolson reviewed Sep 1, 2025

src/server.c (outdated)

madolson reviewed Sep 1, 2025

src/server.c (outdated)

madolson reviewed Sep 1, 2025

valkey.conf (outdated)

code review from madelyn

a8b0994

Signed-off-by: Binbin <[email protected]>

enjoy-binbin force-pushed the asm-snapshot-improve branch from 4b4a508 to a8b0994 Compare September 1, 2025 16:56

zuiderkwast reviewed Sep 1, 2025

src/cluster_migrateslots.c (outdated)

enjoy-binbin added 2 commits September 2, 2025 08:59

Change to cow_size

2f7994e

Signed-off-by: Binbin <[email protected]>

Merge remote-tracking branch 'upstream/unstable' into asm-snapshot-im…

cb29622

…prove Signed-off-by: Binbin <[email protected]>

enjoy-binbin commented Sep 2, 2025


madolson added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Sep 8, 2025

PingXie reviewed Sep 10, 2025

src/aof.c (outdated)

src/cluster_migrateslots.c

enjoy-binbin added 4 commits September 12, 2025 10:23

Merge remote-tracking branch 'upstream/unstable' into asm-snapshot-im…

2f7f4e7

…prove Signed-off-by: Binbin <[email protected]>

Remove RIO_FLAG_SLOT_MIGRATION_AOF

7ee4456

Signed-off-by: Binbin <[email protected]>

Merge remote-tracking branch 'upstream/unstable' into asm-snapshot-im…

501aa55

…prove Signed-off-by: Binbin <[email protected]>

Handle new fields in GETSLOTMIGRATIONS

442c51f

Signed-off-by: Binbin <[email protected]>

enjoy-binbin merged commit f6a0f8c into valkey-io:unstable Sep 18, 2025
110 of 112 checks passed

github-project-automation bot moved this from In Progress to Done in Valkey 9.0 Sep 18, 2025

enjoy-binbin deleted the asm-snapshot-improve branch September 18, 2025 08:26

enjoy-binbin added the cluster label Sep 19, 2025

zuiderkwast mentioned this pull request Sep 30, 2025

Fix accounting for dual channel RDB bytes in replication stats #2602

Merged
