
Conversation

@sarthakaggarwal97
Contributor

@sarthakaggarwal97 sarthakaggarwal97 commented May 29, 2025

In cluster mode, the node attempts to reconnect to nodes where link == NULL during each execution cycle of clusterCron(), which runs every 100 milliseconds by default. This behavior results in continuous reconnection attempts to nodes that are unreachable or in a failed state (PFAIL or FAIL).

This PR addresses an issue where the system aggressively attempts to reconnect to failed nodes, which can lead to resource exhaustion and potential instability. This change limits the number of attempts we make per failed node.
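For illustration, here is a condensed sketch of the pre-existing cron behavior (clusterLinkConnect() is a hypothetical stand-in for the actual connect path, not an identifier from the codebase):

/* Simplified sketch of the old clusterCron() behavior: every cron tick
 * (default 100 ms), any node without a link gets a fresh connection
 * attempt, even if the node is in PFAIL or FAIL state.
 * clusterLinkConnect() is a hypothetical stand-in for the real connect path. */
dictIterator *di = dictGetSafeIterator(server.cluster->nodes);
dictEntry *de;
while ((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);
    if (node->link == NULL) clusterLinkConnect(node);
}
dictReleaseIterator(di);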

The PR improves engine CPU of the P99 nodes (20-30 nodes in a cluster) by 35%, and of the P90 (200-300 nodes) and the average (across all nodes) by 10%, when there are failed nodes in the cluster.

[Screenshots: engine CPU benchmark graphs, 2025-06-11 and 2025-06-12]

Resolves #2122

@codecov

codecov bot commented May 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.25%. Comparing base (0999007) to head (c78a420).
Report is 3 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2154      +/-   ##
============================================
- Coverage     71.44%   71.25%   -0.19%     
============================================
  Files           123      123              
  Lines         67122    67139      +17     
============================================
- Hits          47955    47842     -113     
- Misses        19167    19297     +130     
Files with missing lines    Coverage            Δ
src/cluster_legacy.c        86.94% <100.00%>    (+0.21%) ⬆️
src/server.c                88.09% <100.00%>    (+<0.01%) ⬆️
src/server.h                100.00% <ø>         (ø)

... and 15 files with indirect coverage changes


@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 2 times, most recently from 7a67cd5 to e3e3cb0 Compare May 29, 2025 22:03
@sarthakaggarwal97 sarthakaggarwal97 marked this pull request as ready for review May 29, 2025 22:06
@sarthakaggarwal97 sarthakaggarwal97 changed the title Limiting the new reconnections for failed nodes. Limiting the new reconnections for failed nodes May 29, 2025
@sarthakaggarwal97 sarthakaggarwal97 self-assigned this May 29, 2025
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 2 times, most recently from f2f8c4e to d58411a Compare May 30, 2025 00:13
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 3 times, most recently from d0f83ad to 5fc2afa Compare May 30, 2025 22:06
Collaborator

@hpatro hpatro left a comment

Could you also add unit tests for the new dictionary API(s)?

Consider rehashing, entry addition/removal during the iteration.
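
For instance, a test along these lines (a sketch only: createTestDict() and genKey() are hypothetical test-harness helpers, while the iterator calls are the existing dict.h API):

/* Sketch of a safe-iterator test that mutates the dict mid-iteration,
 * which can trigger incremental rehash steps. createTestDict() and
 * genKey() are hypothetical test-harness helpers. */
#include <assert.h>
void test_safe_iterator_with_mutation(void) {
    dict *d = createTestDict();                      /* hypothetical helper */
    for (int i = 0; i < 1000; i++) dictAdd(d, genKey(i), NULL);

    dictIterator *di = dictGetSafeIterator(d);
    dictEntry *de;
    long seen = 0;
    while ((de = dictNext(di)) != NULL) {
        seen++;
        /* Safe iterators must tolerate additions and deletions while
         * the iteration is in progress. */
        if (seen % 10 == 0) dictAdd(d, genKey(1000 + seen), NULL);
        if (seen % 7 == 0) dictDelete(d, dictGetKey(de));
    }
    dictReleaseIterator(di);
    assert(dictSize(d) > 0); /* exact count depends on the mutation pattern */
}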

@hpatro hpatro requested a review from sungming2 June 12, 2025 18:03
@sungming2
Contributor

It might be a minor improvement, but is it necessary to maintain two iterations in the cron?

valkey/src/cluster_legacy.c

Lines 5396 to 5407 in 2287261

di = dictGetSafeIterator(server.cluster->nodes);
while ((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);
    /* We free the inbound or outbound link to the node if the link has an
     * oversized message send queue and immediately try reconnecting. */
    clusterNodeCronFreeLinkOnBufferLimitReached(node);
    /* The protocol is that function(s) below return non-zero if the node was
     * terminated. */
    if (clusterNodeCronHandleReconnect(node, now)) continue;
}
dictReleaseIterator(di);

valkey/src/cluster_legacy.c

Lines 5443 to 5444 in 2287261

di = dictGetSafeIterator(server.cluster->nodes);
while ((de = dictNext(di)) != NULL) {
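
For reference, a sketch of what folding the two passes into one could look like (abbreviated; the body of the second loop is elided):

/* Hypothetical single-pass variant of the two loops quoted above. */
di = dictGetSafeIterator(server.cluster->nodes);
while ((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);
    clusterNodeCronFreeLinkOnBufferLimitReached(node);
    if (clusterNodeCronHandleReconnect(node, now)) continue;
    /* ... the checks from the second loop would continue here ... */
}
dictReleaseIterator(di);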

@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 4 times, most recently from 1924c80 to 498bd0a Compare June 13, 2025 01:47
@hpatro
Collaborator

hpatro commented Jun 13, 2025

Just thought about this, @sarthakaggarwal97; let me know if you've already explored this and tested it.

Could we make the diff smaller by removing the random iterator part and just introducing the delay in attempting reconnection? I think that should solve the issue. During steady state, iterating over all the entries doesn't lead to any regression, and there is no syscall involved.
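
A minimal sketch of that delay-only approach (the field name last_conn_attempt_time and the interval expression are illustrative, not the exact identifiers in the patch):

/* Hypothetical delay-based throttle inside clusterNodeCronHandleReconnect():
 * skip a node if its last attempt was too recent. */
static int clusterNodeCronHandleReconnect(clusterNode *node, mstime_t now) {
    if (node->link != NULL) return 0;          /* already connected */
    mstime_t retry_interval = server.cluster_node_timeout / 20; /* assumed expression */
    if (now - node->last_conn_attempt_time < retry_interval) return 0;
    node->last_conn_attempt_time = now;
    /* ... proceed with the usual connection attempt ... */
    return 0;
}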

@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 5 times, most recently from bd318cf to 9718d37 Compare July 17, 2025 01:03
@sarthakaggarwal97
Contributor Author

sarthakaggarwal97 commented Jul 17, 2025

Sharing the benchmark numbers with the latest change, killing 450 out of 1000 primaries. The P99 is still improved by 75% (best case), while the average and P90 still see a 10% improvement.

[Screenshots: benchmark graphs, 2025-07-16]

Collaborator

@hpatro hpatro left a comment

Thanks for simplifying it, @sarthakaggarwal97; looks good to me. The logic is dynamic, based on the timeout value: with the default config, each node attempts reconnection every 750 ms, i.e. 10 connection retries within the window from PFAIL to FAIL (if all nodes agree).

Some more explanation:

For the default cluster node timeout (15 seconds), we disconnect and reconnect the link at half of that interval, i.e. at 7.5 seconds. On current unstable, after we treat a node as PFAIL, we disconnect the link and then retry connecting on every cron cycle, i.e. 75 times (7.5 seconds / 100 ms). With this change, we introduce a delay between the connection attempts.

This change avoids the flood of connection attempts made when a large number of nodes in a cluster fail, which caused high CPU utilization and spikes.
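
In concrete numbers (the /20 divisor below is an assumption consistent with the 750 ms figure above, not necessarily the exact expression in the patch):

/* Worked numbers for the default configuration. */
mstime_t node_timeout = 15000;                  /* cluster-node-timeout, ms */
mstime_t window = node_timeout / 2;             /* 7500 ms: link dropped at half the timeout */
long old_attempts = window / 100;               /* 75: one attempt per 100 ms cron tick */
mstime_t retry_interval = node_timeout / 20;    /* 750 ms with this change (assumed expression) */
long new_attempts = window / retry_interval;    /* 10 attempts in the same window */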

@hpatro hpatro requested a review from enjoy-binbin July 18, 2025 17:17
@hpatro
Collaborator

hpatro commented Jul 18, 2025

@enjoy-binbin / @madolson If either of you could take a look at this, that would be great.

Member

@madolson madolson left a comment

The new strategy seems pretty smart, I like it!

Member

@enjoy-binbin enjoy-binbin left a comment

LGTM.

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jul 22, 2025
@hpatro
Collaborator

hpatro commented Jul 22, 2025

I don't see any outstanding comments left. Will merge this in after the test runs.

@hpatro hpatro merged commit f21015c into valkey-io:unstable Jul 22, 2025
57 of 61 checks passed

Labels

cluster, run-extra-tests (Run extra tests on this PR: runs all tests from daily except valgrind and RESP)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[NEW] Excessive connection attempts to failed nodes in clusterCron() cause CPU overhead

6 participants