Limiting the new reconnections for failed nodes #2154
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           unstable    #2154      +/-   ##
============================================
- Coverage     71.44%   71.25%   -0.19%
============================================
  Files           123      123
  Lines         67122    67139      +17
============================================
- Hits          47955    47842     -113
- Misses        19167    19297     +130
hpatro
left a comment
Could you also add unit tests for the new dictionary API(s)?
Consider rehashing, entry addition/removal during the iteration.
It might be a minor improvement, but is it necessary to maintain two iterations in the cron? See lines 5396 to 5407 and lines 5443 to 5444 in 2287261.
Just thought about this @sarthakaggarwal97, let me know if you've already explored and tested this. Could we make the diff smaller by removing the random iterator part and just introducing the delay in attempting reconnection? I think that should solve the issue, since during steady state iterating over all the entries doesn't lead to any regression and there is no syscall involved.
hpatro
left a comment
Thanks for simplifying it @sarthakaggarwal97, looks good to me. The logic is dynamic, based on the timeout value: with the default config value, each node attempts reconnection every 750 ms, i.e. it makes 10 connection retries while the peer goes from the PFAIL state to FAIL (if all nodes agree).
Some more explanation:
For the default cluster node timeout (15 seconds), we try disconnecting and reconnecting at half of that interval, i.e. at 7.5 seconds. With the current unstable, after we treat the node as PFAIL, we disconnect the link and then retry connecting at each cron cycle, i.e. 75 times (7.5 seconds / 100 ms). With this change, we introduce a delay between the connection attempts.
This change avoids making too many connection attempts when a large number of nodes fail in a cluster, which previously caused high CPU utilization and spikes.
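To make the arithmetic above concrete, here is a minimal sketch in C. The function names are hypothetical and not taken from the patch. The divisor of 20 is only inferred from the quoted numbers (15000 ms / 20 = 750 ms) and is an assumption about how the delay scales with the timeout.

```c
#include <stdint.h>

typedef int64_t mstime_t;

/* The cluster cron runs every 100 ms by default (per the PR description). */
#define CRON_PERIOD_MS 100

/* Window in which reconnects are retried: half of the node timeout. */
mstime_t retry_window_ms(mstime_t node_timeout_ms) {
    return node_timeout_ms / 2;
}

/* ASSUMPTION: delay = node_timeout / 20, chosen only because it yields
 * the 750 ms quoted for the default 15000 ms timeout. */
mstime_t reconnect_delay_ms(mstime_t node_timeout_ms) {
    return node_timeout_ms / 20;
}

/* Attempts per window when every cron cycle retries (current unstable). */
mstime_t attempts_without_delay(mstime_t node_timeout_ms) {
    return retry_window_ms(node_timeout_ms) / CRON_PERIOD_MS;
}

/* Attempts per window when retries are spaced by the delay. */
mstime_t attempts_with_delay(mstime_t node_timeout_ms) {
    mstime_t delay = reconnect_delay_ms(node_timeout_ms);
    /* The cron cannot retry more often than it runs. */
    if (delay < CRON_PERIOD_MS) delay = CRON_PERIOD_MS;
    return retry_window_ms(node_timeout_ms) / delay;
}
```

For a 15000 ms timeout, attempts_without_delay gives 75 and attempts_with_delay gives 10, matching the figures in the comment.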
@enjoy-binbin / @madolson, could either of you take a look at this? That would be great.
madolson
left a comment
The new strategy seems pretty smart, I like it!
Signed-off-by: Sarthak Aggarwal <[email protected]>
enjoy-binbin
left a comment
LGTM.
I don't see any outstanding comments left. Will merge this in after the tests run.



In cluster mode, the node attempts to reconnect to nodes where link == NULL during each execution cycle, which occurs every 100 milliseconds by default. This behavior results in continuous reconnection attempts to nodes that are unreachable or in a failed state (PFAIL or FAIL).
This PR addresses an issue where the system aggressively attempts to reconnect to failed nodes, which can lead to resource exhaustion and potential instability. This change limits the number of attempts we make per failed node.
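As a rough illustration of the idea, and not the code in this PR, the throttling could be sketched as below. The struct, its fields, and both function names are hypothetical stand-ins for the real cluster structures. The sketch also assumes that only nodes flagged as failing are throttled, while other disconnected nodes are retried on every cycle.

```c
#include <stddef.h>
#include <stdint.h>

typedef int64_t mstime_t;

/* Simplified, hypothetical stand-in for a cluster node. */
typedef struct node {
    void *link;               /* NULL when there is no connection */
    int failing;              /* non-zero when flagged PFAIL or FAIL */
    mstime_t last_attempt_ms; /* time of the last reconnect attempt */
} node;

/* Returns 1 if the cron should try to reconnect to this node now. */
int should_attempt_reconnect(const node *n, mstime_t now_ms, mstime_t delay_ms) {
    if (n->link != NULL) return 0; /* already connected */
    if (!n->failing) return 1;     /* assumption: healthy nodes are not throttled */
    return now_ms - n->last_attempt_ms >= delay_ms;
}

/* One cron pass over all nodes; returns the number of attempts started. */
int cron_pass(node *nodes, size_t count, mstime_t now_ms, mstime_t delay_ms) {
    int attempts = 0;
    for (size_t i = 0; i < count; i++) {
        if (!should_attempt_reconnect(&nodes[i], now_ms, delay_ms)) continue;
        /* A real implementation would create the link here. */
        nodes[i].last_attempt_ms = now_ms;
        attempts++;
    }
    return attempts;
}
```

In this sketch a failed node costs only a timestamp comparison on most cron cycles, and a connection attempt is started only once the delay has elapsed.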
When there are failed nodes in the cluster, the PR improves engine CPU utilization by 35% for the P99 nodes (20-30 nodes in a cluster), and by 10% for the P90 nodes (200-300 nodes) and on average (across all nodes).
Resolves #2122