Conversation

@hanxizh9910 (Contributor) commented Oct 20, 2025

The original test code only checks:

  1. wait_for_cluster_size 4, which calls cluster_size_consistent for every node.
     For each node, that function queries cluster_known_nodes, which is computed as
     (unsigned long long)dictSize(server.cluster->nodes). However, when a new node
     joins the cluster, it is first created in the HANDSHAKE state, and clusterAddNode
     immediately adds it to the nodes hash table. The new node can therefore still be
     in HANDSHAKE status (the handshake completes asynchronously) even though every
     node already appears to “know” that there are 4 nodes in the cluster.

  2. cluster_state for every node. However, when a new node is added, its
     server.cluster->state may still be FAIL.

Some handshake processes may not have completed yet, which is the likely cause of the flakiness. To address this, this PR adds a --cluster check to ensure that the config state is consistent; see the sketch below.
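
For reference, the size check in question boils down to roughly the following (a paraphrased sketch; the actual helpers wait_for_cluster_size and cluster_size_consistent live in the test suite and may differ in detail):

    # Paraphrased sketch: the check only compares the node count that each
    # node reports via CLUSTER INFO (cluster_known_nodes). A node still in
    # HANDSHAKE state is already counted in that hash table, so passing
    # this check does not guarantee the handshakes have completed.
    proc cluster_size_consistent {cluster_size} {
        for {set j 0} {$j < $cluster_size} {incr j} {
            if {[CI $j cluster_known_nodes] ne $cluster_size} {
                return 0
            }
        }
        return 1
    }

    proc wait_for_cluster_size {cluster_size} {
        wait_for_condition 1000 50 {
            [cluster_size_consistent $cluster_size] eq 1
        } else {
            fail "Cluster size is not consistent"
        }
    }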

Fixes #2693.

@hanxizh9910 hanxizh9910 changed the title Fix/cluster slot migration flaky test 2693 [Deflake ]Fix/cluster slot migration flaky test 2693 Oct 20, 2025
@hanxizh9910 hanxizh9910 changed the title [Deflake ]Fix/cluster slot migration flaky test 2693 [Deflake] Fix/cluster slot migration flaky test 2693 Oct 20, 2025
@hanxizh9910 hanxizh9910 changed the title [Deflake] Fix/cluster slot migration flaky test 2693 [DEFLAKE] Fix/cluster slot migration flaky test Oct 20, 2025
codecov bot commented Oct 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.43%. Comparing base (b4c93cc) to head (2708cdb).
⚠️ Report is 57 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2756      +/-   ##
============================================
- Coverage     72.59%   72.43%   -0.16%     
============================================
  Files           128      128              
  Lines         71301    70414     -887     
============================================
- Hits          51759    51006     -753     
+ Misses        19542    19408     -134     

see 105 files with indirect coverage changes


@zuiderkwast zuiderkwast marked this pull request as draft October 22, 2025 11:08
@zuiderkwast (Contributor) commented:

As binbin noted, this change is not necessary, so there is no need to review it right now. I changed the PR to a draft to avoid drawing maintainers' attention. What is the plan for this PR: will it be updated or closed later?

@hanxizh9910 (Contributor, Author) commented:

> As binbin noted, this change is not necessary, so there is no need to review it right now. I changed the PR to a draft to avoid drawing maintainers' attention. What is the plan for this PR: will it be updated or closed later?

Yes, thank you for doing this! I'm trying to think through other possible causes. It seems that the code inserts the node into cluster->nodes on each server, which increases the reported size to 4 even though some internal registration steps might not have completed yet. I'll look into it further.

@hanxizh9910 hanxizh9910 force-pushed the fix/cluster-slot-migration-flaky-test-2693 branch from 5d670bd to 440c04a Compare October 29, 2025 20:36
@hanxizh9910 hanxizh9910 marked this pull request as ready for review November 17, 2025 22:31
@hanxizh9910 hanxizh9910 marked this pull request as draft November 17, 2025 23:57
@hanxizh9910 hanxizh9910 marked this pull request as ready for review November 18, 2025 00:37
@hanxizh9910 (Contributor, Author) commented:

Hi @enjoy-binbin, I’ve made another modification to help deflake the test and also updated the top comment. Could you take a look when you have a moment?

@enjoy-binbin (Member) commented:

> cluster_state for every node, but when a new node is added, server.cluster->state remains CLUSTER_OK.

You mean the new node's state is CLUSTER_OK when it is added? I thought it would be FAIL, since we init it to FAIL when the server starts.

@enjoy-binbin (Member) commented:

Let's use the --cluster check trick like below. We've used it elsewhere, so I think we can trust it.

        # Cluster check just verifies that the config state is self-consistent;
        # waiting for cluster_state to be ok is an independent check that all the
        # nodes actually believe each other are healthy, preventing cluster-down
        # errors.
        wait_for_condition 1000 50 {
            [catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv 0 port]}] == 0 &&
            [catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv -1 port]}] == 0 &&
            [catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv -2 port]}] == 0 &&
            [CI 0 cluster_state] eq {ok} &&
            [CI 1 cluster_state] eq {ok} &&
            [CI 2 cluster_state] eq {ok}
        } else {
            fail "Cluster doesn't stabilize"
        }
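
For context: Tcl's catch returns 0 when the enclosed script raises no error, and exec raises an error on a nonzero exit status, so each [catch {exec src/valkey-cli --cluster check ...}] == 0 clause holds only once valkey-cli exits cleanly, i.e. once --cluster check finds that node's view of the slot configuration self-consistent. The [CI n cluster_state] clauses then independently require every node to report cluster_state:ok.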

@hanxizh9910 (Contributor, Author) commented:

> > cluster_state for every node, but when a new node is added, server.cluster->state remains CLUSTER_OK.
>
> You mean the new node's state is CLUSTER_OK when it is added? I thought it would be FAIL, since we init it to FAIL when the server starts.

@enjoy-binbin, I agree; I was mistaken in assuming it would remain CLUSTER_OK. The state should indeed be FAIL when a new node is added. However, as I mentioned earlier, some handshake processes may not have completed yet, which likely causes the flakiness. I'll apply the changes you suggested.

@enjoy-binbin enjoy-binbin changed the title [DEFLAKE] Fix/cluster slot migration flaky test Fix cluster slot migration flaky test Nov 20, 2025
@enjoy-binbin enjoy-binbin merged commit ed8856b into valkey-io:unstable Nov 20, 2025
55 checks passed