
Conversation

@sarthakaggarwal97
Contributor

@sarthakaggarwal97 sarthakaggarwal97 commented May 29, 2025

In cluster mode, the node attempts to reconnect to nodes where link == NULL during each execution cycle of clusterCron(), which runs every 100 milliseconds by default. This behavior results in continuous reconnection attempts to nodes that are unreachable or in a failed state (PFAIL or FAIL).

This PR addresses an issue where the system aggressively attempts to reconnect to failed nodes, which can lead to resource exhaustion and potential instability. This change limits the number of attempts we make per failed node.
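For illustration, here is a condensed sketch of the pre-existing cron behavior (clusterLinkConnect() is a hypothetical stand-in for the actual connect path, not an identifier from the codebase):

/* Simplified sketch of the old clusterCron() behavior: every cron tick
 * (default 100 ms), any node without a link gets a fresh connection
 * attempt, even if the node is in PFAIL or FAIL state.
 * clusterLinkConnect() is a hypothetical stand-in for the real connect path. */
dictIterator *di = dictGetSafeIterator(server.cluster->nodes);
dictEntry *de;
while ((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);
    if (node->link == NULL) clusterLinkConnect(node);
}
dictReleaseIterator(di);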

The PR improves engine CPU of the P99 nodes (20-30 nodes in a cluster) by 35%, and of the P90 (200-300 nodes) and the average (across all nodes) by 10%, when there are failed nodes in the cluster.

[Screenshots: engine CPU benchmark graphs, 2025-06-11 and 2025-06-12]

Resolves #2122

@codecov

codecov bot commented May 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.25%. Comparing base (0999007) to head (c78a420).
Report is 3 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2154      +/-   ##
============================================
- Coverage     71.44%   71.25%   -0.19%     
============================================
  Files           123      123              
  Lines         67122    67139      +17     
============================================
- Hits          47955    47842     -113     
- Misses        19167    19297     +130     
Files with missing lines    Coverage            Δ
src/cluster_legacy.c        86.94% <100.00%>    (+0.21%) ⬆️
src/server.c                88.09% <100.00%>    (+<0.01%) ⬆️
src/server.h                100.00% <ø>         (ø)

... and 15 files with indirect coverage changes


@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 2 times, most recently from 7a67cd5 to e3e3cb0 Compare May 29, 2025 22:03
@sarthakaggarwal97 sarthakaggarwal97 marked this pull request as ready for review May 29, 2025 22:06
@sarthakaggarwal97 sarthakaggarwal97 changed the title Limiting the new reconnections for failed nodes. Limiting the new reconnections for failed nodes May 29, 2025
@sarthakaggarwal97 sarthakaggarwal97 self-assigned this May 29, 2025
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 2 times, most recently from f2f8c4e to d58411a Compare May 30, 2025 00:13
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 3 times, most recently from d0f83ad to 5fc2afa Compare May 30, 2025 22:06
Collaborator

@hpatro hpatro left a comment

Could you also add unit tests for the new dictionary API(s)?

Consider rehashing, entry addition/removal during the iteration.
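
For instance, a test along these lines (a sketch only: createTestDict() and genKey() are hypothetical test-harness helpers, while the iterator calls are the existing dict.h API):

/* Sketch of a safe-iterator test that mutates the dict mid-iteration,
 * which can trigger incremental rehash steps. createTestDict() and
 * genKey() are hypothetical test-harness helpers. */
#include <assert.h>
void test_safe_iterator_with_mutation(void) {
    dict *d = createTestDict();                      /* hypothetical helper */
    for (int i = 0; i < 1000; i++) dictAdd(d, genKey(i), NULL);

    dictIterator *di = dictGetSafeIterator(d);
    dictEntry *de;
    long seen = 0;
    while ((de = dictNext(di)) != NULL) {
        seen++;
        /* Safe iterators must tolerate additions and deletions while
         * the iteration is in progress. */
        if (seen % 10 == 0) dictAdd(d, genKey(1000 + seen), NULL);
        if (seen % 7 == 0) dictDelete(d, dictGetKey(de));
    }
    dictReleaseIterator(di);
    assert(dictSize(d) > 0); /* exact count depends on the mutation pattern */
}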

@hpatro hpatro requested a review from sungming2 June 12, 2025 18:03
@sungming2
Contributor

It might be a minor improvement, but is it necessary to maintain two iterations in the cron?

valkey/src/cluster_legacy.c

Lines 5396 to 5407 in 2287261

di = dictGetSafeIterator(server.cluster->nodes);
while ((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);
    /* We free the inbound or outbound link to the node if the link has an
     * oversized message send queue and immediately try reconnecting. */
    clusterNodeCronFreeLinkOnBufferLimitReached(node);
    /* The protocol is that function(s) below return non-zero if the node was
     * terminated. */
    if (clusterNodeCronHandleReconnect(node, now)) continue;
}
dictReleaseIterator(di);

valkey/src/cluster_legacy.c

Lines 5443 to 5444 in 2287261

di = dictGetSafeIterator(server.cluster->nodes);
while ((de = dictNext(di)) != NULL) {
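
For reference, a sketch of what folding the two passes into one could look like (abbreviated; the body of the second loop is elided):

/* Hypothetical single-pass variant of the two loops quoted above. */
di = dictGetSafeIterator(server.cluster->nodes);
while ((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);
    clusterNodeCronFreeLinkOnBufferLimitReached(node);
    if (clusterNodeCronHandleReconnect(node, now)) continue;
    /* ... the checks from the second loop would continue here ... */
}
dictReleaseIterator(di);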

@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 4 times, most recently from 1924c80 to 498bd0a Compare June 13, 2025 01:47
@hpatro
Collaborator

hpatro commented Jun 13, 2025

Just thought about this, @sarthakaggarwal97; let me know if you've already explored this and tested it.

Could we make the diff smaller by removing the random iterator part and just introducing the delay in attempting reconnection? I think that should solve the issue. During steady state, iterating over all the entries doesn't lead to any regression, and there is no syscall involved.
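
A minimal sketch of that delay-only approach (the field name last_conn_attempt_time and the interval expression are illustrative, not the exact identifiers in the patch):

/* Hypothetical delay-based throttle inside clusterNodeCronHandleReconnect():
 * skip a node if its last attempt was too recent. */
static int clusterNodeCronHandleReconnect(clusterNode *node, mstime_t now) {
    if (node->link != NULL) return 0;          /* already connected */
    mstime_t retry_interval = server.cluster_node_timeout / 20; /* assumed expression */
    if (now - node->last_conn_attempt_time < retry_interval) return 0;
    node->last_conn_attempt_time = now;
    /* ... proceed with the usual connection attempt ... */
    return 0;
}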

@sarthakaggarwal97 sarthakaggarwal97 force-pushed the max-conn-clustercron branch 5 times, most recently from bd318cf to 9718d37 Compare July 17, 2025 01:03
@sarthakaggarwal97
Contributor Author

sarthakaggarwal97 commented Jul 17, 2025

Sharing the benchmark numbers with the latest change, killing 450 out of 1000 primaries. The P99 is still improved by 75% (best case), while the average and P90 still see a 10% improvement.

[Screenshots: benchmark graphs, 2025-07-16]

Collaborator

@hpatro hpatro left a comment

Thanks for simplifying it, @sarthakaggarwal97; looks good to me. The logic is dynamic, based on the timeout value: with the default config, each node attempts reconnection every 750 ms, i.e. 10 connection retries within the window from PFAIL to FAIL (if all nodes agree).

Some more explanation:

For the default cluster node timeout (15 seconds), we disconnect and reconnect the link at half of that interval, i.e. at 7.5 seconds. On current unstable, after we treat a node as PFAIL, we disconnect the link and then retry connecting on every cron cycle, i.e. 75 times (7.5 seconds / 100 ms). With this change, we introduce a delay between the connection attempts.

This change avoids the flood of connection attempts made when a large number of nodes in a cluster fail, which caused high CPU utilization and spikes.
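
In concrete numbers (the /20 divisor below is an assumption consistent with the 750 ms figure above, not necessarily the exact expression in the patch):

/* Worked numbers for the default configuration. */
mstime_t node_timeout = 15000;                  /* cluster-node-timeout, ms */
mstime_t window = node_timeout / 2;             /* 7500 ms: link dropped at half the timeout */
long old_attempts = window / 100;               /* 75: one attempt per 100 ms cron tick */
mstime_t retry_interval = node_timeout / 20;    /* 750 ms with this change (assumed expression) */
long new_attempts = window / retry_interval;    /* 10 attempts in the same window */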

@hpatro hpatro requested a review from enjoy-binbin July 18, 2025 17:17
@hpatro
Collaborator

hpatro commented Jul 18, 2025

@enjoy-binbin / @madolson If either of you could take a look at this, that would be great.

Member

@madolson madolson left a comment

The new strategy seems pretty smart, I like it!

Member

@enjoy-binbin enjoy-binbin left a comment

LGTM.

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jul 22, 2025
@hpatro
Collaborator

hpatro commented Jul 22, 2025

I don't see any outstanding comments left. Will merge this in after the test runs.

@hpatro hpatro merged commit f21015c into valkey-io:unstable Jul 22, 2025
57 of 61 checks passed

Labels

cluster, run-extra-tests (Run extra tests on this PR: runs all tests from daily except valgrind and RESP)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[NEW] Excessive connection attempts to failed nodes in clusterCron() cause CPU overhead

6 participants