Fix cluster slot migration flaky test #2756
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##           unstable    #2756      +/-   ##
============================================
- Coverage     72.59%   72.43%   -0.16%
============================================
  Files            128      128
  Lines          71301    70414     -887
============================================
- Hits           51759    51006     -753
+ Misses         19542    19408     -134
As binbin noted, this change is not necessary, so there is no need to review it right now. I changed it to a draft to avoid drawing the maintainers' attention. I don't know the plans for this PR: will you update it or close it later?
Yes, thank you for doing this! I’m trying to think through other possible causes. It seems that the code inserts the node into cluster->nodes on each server, which increases the size to 4, but some internal registration steps might not have completed yet. I’ll look into it further.
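For illustration, here is a hypothetical way to observe that window from the test side. This is only a sketch, not part of the actual change; CI (a CLUSTER INFO field getter) and R (a per-node client helper) are assumed to be the existing test framework helpers.

```tcl
# Hypothetical probe of the window described above: a node can already report
# cluster_known_nodes == 4 while the new node is still flagged "handshake" in
# its CLUSTER NODES output, i.e. not yet fully registered.
puts "known nodes on node 0:   [CI 0 cluster_known_nodes]"
puts "handshake still pending: [string match {*handshake*} [R 0 cluster nodes]]"
```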
Force-pushed from 5d670bd to 440c04a.
… know the new node before move slot Signed-off-by: Hanxi Zhang <[email protected]>
Hi @enjoy-binbin, I’ve made another modification to help deflake the test and also updated the top comment. Could you take a look when you have a moment?
You mean the new node's state is CLUSTER_OK when it was added? I thought it would be FAIL, since we init it to FAIL when the server starts?
Let's use the --cluster check trick like below. We did use it somewhere, so I think we can trust it.
@enjoy-binbin, I agree, I was mistaken in assuming it would remain CLUSTER_OK. The state should indeed be FAIL when a new node is added. However, as I mentioned earlier, some handshake processes may not have completed yet, which likely causes the flakiness. I’ll apply the changes you suggested.
…re healthy Signed-off-by: Hanxi Zhang <[email protected]>
The original test code only checks:
- wait_for_cluster_size 4, which calls cluster_size_consistent for every node. Inside that function, cluster_size_consistent queries each node's cluster_known_nodes, which is computed as (unsigned long long)dictSize(server.cluster->nodes). However, when a new node is added to the cluster, it is first created in the HANDSHAKE state, and clusterAddNode inserts it into the nodes hash table right away. So the new node can still be in HANDSHAKE status (the handshake completes asynchronously) even though every node already appears to "know" that there are 4 nodes in the cluster (see the sketch after this list).
- cluster_state for every node, but when a new node is added, server.cluster->state remains FAIL.
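For context, here is a rough Tcl sketch of what those two checks amount to. This is a paraphrase rather than the exact helper code; CI (a CLUSTER INFO field getter), wait_for_condition, and fail are assumed to come from the existing test framework.

```tcl
# Paraphrased size check: every node must report cluster_known_nodes equal to
# the expected cluster size. cluster_known_nodes is
# dictSize(server.cluster->nodes), which already counts nodes that are still
# in the HANDSHAKE state, hence the race.
proc cluster_size_consistent {cluster_size} {
    for {set j 0} {$j < $cluster_size} {incr j} {
        if {[CI $j cluster_known_nodes] ne $cluster_size} {
            return 0
        }
    }
    return 1
}

proc wait_for_cluster_size {cluster_size} {
    wait_for_condition 1000 50 {
        [cluster_size_consistent $cluster_size] eq 1
    } else {
        fail "cluster did not reach a consistent size of $cluster_size nodes"
    }
}

# The state check similarly polls CLUSTER INFO on every node until the
# cluster_state field reports "ok".
```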
Some handshake processes may not have completed yet, which likely causes the flakiness.
To address this, I added a --cluster check to ensure that the config state is consistent.
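A minimal sketch of that kind of check is below. It is not the exact diff in this PR: the valkey-cli path, the srv host/port helpers, and the output string being matched are all assumptions.

```tcl
# Minimal sketch: poll `valkey-cli --cluster check` until it runs cleanly and
# reports full slot coverage, so every node's view of the cluster config is
# consistent before the slot migration starts.
wait_for_condition 1000 50 {
    ![catch {exec src/valkey-cli --cluster check [srv 0 host]:[srv 0 port]} check_output] &&
    [string match {*All 16384 slots covered*} $check_output]
} else {
    fail "cluster config is not consistent after adding the new node"
}
```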
Fixes #2693.