Conversation

@hanxizh9910 (Contributor) commented Oct 20, 2025

The original test code only checks:

  1. wait_for_cluster_size 4, which calls cluster_size_consistent for every node.
     For each node, that function queries cluster_known_nodes, which is computed as
     (unsigned long long)dictSize(server.cluster->nodes). However, when a new node
     joins the cluster, it is first created in the HANDSHAKE state, and clusterAddNode
     immediately adds it to the nodes hash table. The new node can therefore still be
     in HANDSHAKE status (the handshake completes asynchronously) even though every
     node already appears to “know” that there are 4 nodes in the cluster.

  2. cluster_state for every node. However, when a new node is added, its
     server.cluster->state may still be FAIL.

Some handshake processes may not have completed yet, which is the likely cause of the flakiness. To address this, this PR adds a --cluster check to ensure that the config state is consistent; see the sketch below.
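
For reference, the size check in question boils down to roughly the following (a paraphrased sketch; the actual helpers wait_for_cluster_size and cluster_size_consistent live in the test suite and may differ in detail):

    # Paraphrased sketch: the check only compares the node count that each
    # node reports via CLUSTER INFO (cluster_known_nodes). A node still in
    # HANDSHAKE state is already counted in that hash table, so passing
    # this check does not guarantee the handshakes have completed.
    proc cluster_size_consistent {cluster_size} {
        for {set j 0} {$j < $cluster_size} {incr j} {
            if {[CI $j cluster_known_nodes] ne $cluster_size} {
                return 0
            }
        }
        return 1
    }

    proc wait_for_cluster_size {cluster_size} {
        wait_for_condition 1000 50 {
            [cluster_size_consistent $cluster_size] eq 1
        } else {
            fail "Cluster size is not consistent"
        }
    }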

Fixes #2693.

@hanxizh9910 hanxizh9910 changed the title Fix/cluster slot migration flaky test 2693 [Deflake ]Fix/cluster slot migration flaky test 2693 Oct 20, 2025
@hanxizh9910 hanxizh9910 changed the title [Deflake ]Fix/cluster slot migration flaky test 2693 [Deflake] Fix/cluster slot migration flaky test 2693 Oct 20, 2025
@hanxizh9910 hanxizh9910 changed the title [Deflake] Fix/cluster slot migration flaky test 2693 [DEFLAKE] Fix/cluster slot migration flaky test Oct 20, 2025
codecov bot commented Oct 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.43%. Comparing base (b4c93cc) to head (2708cdb).
⚠️ Report is 57 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2756      +/-   ##
============================================
- Coverage     72.59%   72.43%   -0.16%     
============================================
  Files           128      128              
  Lines         71301    70414     -887     
============================================
- Hits          51759    51006     -753     
+ Misses        19542    19408     -134     

see 105 files with indirect coverage changes


@zuiderkwast zuiderkwast marked this pull request as draft October 22, 2025 11:08
@zuiderkwast (Contributor) commented:

As binbin noted, this change is not necessary, so there is no need to review it right now. I changed the PR to a draft to avoid drawing maintainers' attention. What is the plan for this PR: will it be updated or closed later?

@hanxizh9910 (Contributor, Author) commented:

> As binbin noted, this change is not necessary, so there is no need to review it right now. I changed the PR to a draft to avoid drawing maintainers' attention. What is the plan for this PR: will it be updated or closed later?

Yes, thank you for doing this! I'm trying to think through other possible causes. It seems that the code inserts the node into cluster->nodes on each server, which increases the reported size to 4 even though some internal registration steps might not have completed yet. I'll look into it further.

@hanxizh9910 hanxizh9910 force-pushed the fix/cluster-slot-migration-flaky-test-2693 branch from 5d670bd to 440c04a Compare October 29, 2025 20:36
@hanxizh9910 hanxizh9910 marked this pull request as ready for review November 17, 2025 22:31
@hanxizh9910 hanxizh9910 marked this pull request as draft November 17, 2025 23:57
@hanxizh9910 hanxizh9910 marked this pull request as ready for review November 18, 2025 00:37
@hanxizh9910 (Contributor, Author) commented:

Hi @enjoy-binbin, I’ve made another modification to help deflake the test and also updated the top comment. Could you take a look when you have a moment?

@enjoy-binbin (Member) commented:

> cluster_state for every node, but when a new node is added, server.cluster->state remains CLUSTER_OK.

You mean the new node's state is CLUSTER_OK when it is added? I thought it would be FAIL, since we init it to FAIL when the server starts.

@enjoy-binbin (Member) commented:

Let's use the --cluster check trick like below. We've used it elsewhere, so I think we can trust it.

        # Cluster check just verifies that the config state is self-consistent;
        # waiting for cluster_state to be ok is an independent check that all the
        # nodes actually believe each other are healthy, preventing cluster-down
        # errors.
        wait_for_condition 1000 50 {
            [catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv 0 port]}] == 0 &&
            [catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv -1 port]}] == 0 &&
            [catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv -2 port]}] == 0 &&
            [CI 0 cluster_state] eq {ok} &&
            [CI 1 cluster_state] eq {ok} &&
            [CI 2 cluster_state] eq {ok}
        } else {
            fail "Cluster doesn't stabilize"
        }
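
For context: Tcl's catch returns 0 when the enclosed script raises no error, and exec raises an error on a nonzero exit status, so each [catch {exec src/valkey-cli --cluster check ...}] == 0 clause holds only once valkey-cli exits cleanly, i.e. once --cluster check finds that node's view of the slot configuration self-consistent. The [CI n cluster_state] clauses then independently require every node to report cluster_state:ok.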

@hanxizh9910 (Contributor, Author) commented:

> > cluster_state for every node, but when a new node is added, server.cluster->state remains CLUSTER_OK.
>
> You mean the new node's state is CLUSTER_OK when it is added? I thought it would be FAIL, since we init it to FAIL when the server starts.

@enjoy-binbin, I agree; I was mistaken in assuming it would remain CLUSTER_OK. The state should indeed be FAIL when a new node is added. However, as I mentioned earlier, some handshake processes may not have completed yet, which likely causes the flakiness. I'll apply the changes you suggested.

@enjoy-binbin enjoy-binbin changed the title [DEFLAKE] Fix/cluster slot migration flaky test Fix cluster slot migration flaky test Nov 20, 2025
@enjoy-binbin enjoy-binbin merged commit ed8856b into valkey-io:unstable Nov 20, 2025
55 checks passed