[NEW] Excessive connection attempts to failed nodes in clusterCron() cause CPU overhead #2122

**The problem/use-case that the feature addresses**

In `clusterCron()`, the function `clusterNodeCronHandleReconnect()` tries to reconnect to every node whose `link == NULL` on each iteration (10 times per second by default). As a result, nodes that are unreachable or in a failed (`PFAIL` or `FAIL`) state are retried continuously.

When there are multiple failing nodes, this retry logic results in:

- High CPU usage due to frequent `connConnect()` calls.
- Excessive memory allocation/free churn from repeatedly creating and destroying `clusterLink` objects.
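
To give a sense of scale, here is a minimal model of the retry pattern described above. It is not the actual cluster code: `modelNode` and `modelCronTick` are hypothetical names, and the model assumes that a failed attempt is detected before the next cron tick, so the link is empty again when the tick runs.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cluster node (not the real struct). */
typedef struct modelNode {
    int has_link;  /* 0 models `link == NULL` */
    int reachable; /* 0 models a node whose connection attempts always fail */
} modelNode;

/* One cron tick: attempt a connection to every node that has no link.
 * Returns the number of attempts made during this tick. */
size_t modelCronTick(modelNode *nodes, size_t count) {
    size_t attempts = 0;
    for (size_t i = 0; i < count; i++) {
        if (nodes[i].has_link) continue;
        attempts++; /* models allocating a link and issuing a connect call */
        /* Assumption: a failed attempt leaves the node without a link again. */
        nodes[i].has_link = nodes[i].reachable;
    }
    return attempts;
}
```

Under these assumptions, `k` unreachable nodes cost `k` attempts per tick, or roughly `10 * k` attempts per second at the default cron frequency, for as long as the nodes stay down.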

*Profiles of the connected nodes:*

<img width="1194" alt="Image" src="https://github.com/user-attachments/assets/78b99c38-838b-4a6d-a655-72797d8f37b5" />

---

*Engine CPU / Used Memory metrics:*

The initial spike coincides with the period when a large number of primary nodes (n/2 - 1) were killed. The subsequent increase in memory appears to be caused by the frequent reconnect attempts, based on the profile above.

<img width="949" alt="Image" src="https://github.com/user-attachments/assets/ead6d94c-d07e-439e-a4c4-99d139402e7f" />

---

Once the failed nodes are brought back, compute usage returns to its steady-state level.

**Description of the feature**

Add a backoff mechanism that limits how often a node retries connecting to failed nodes.
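
One possible shape for such a mechanism is a capped exponential backoff kept per node. The sketch below is only an illustration of the idea, not a proposed patch: `reconnectState`, the helper functions, and both constants are hypothetical and do not exist in the current code.

```c
#include <stdint.h>

/* Hypothetical constants, chosen for illustration only. */
#define RECONNECT_BACKOFF_BASE_MS 100  /* one cron tick at the default 10 Hz */
#define RECONNECT_BACKOFF_MAX_MS 10000 /* cap, so a recovered node is retried within ~10 s */

/* Hypothetical per-node reconnect bookkeeping. */
typedef struct reconnectState {
    uint32_t failed_attempts; /* consecutive failed connection attempts */
    int64_t next_attempt_ms;  /* earliest time the next attempt is allowed */
} reconnectState;

/* Delay before the next attempt: base * 2^failed_attempts, capped at the maximum. */
int64_t reconnectBackoffDelay(uint32_t failed_attempts) {
    int64_t delay = RECONNECT_BACKOFF_BASE_MS;
    while (failed_attempts-- > 0 && delay < RECONNECT_BACKOFF_MAX_MS) delay *= 2;
    return delay > RECONNECT_BACKOFF_MAX_MS ? RECONNECT_BACKOFF_MAX_MS : delay;
}

/* Returns 1 if a cron tick at now_ms may start a connection attempt. */
int reconnectAllowed(const reconnectState *st, int64_t now_ms) {
    return now_ms >= st->next_attempt_ms;
}

/* Record a failed attempt and schedule the next allowed one. */
void reconnectOnFailure(reconnectState *st, int64_t now_ms) {
    st->next_attempt_ms = now_ms + reconnectBackoffDelay(st->failed_attempts);
    if (st->failed_attempts < UINT32_MAX) st->failed_attempts++;
}

/* Reset the backoff once a connection succeeds. */
void reconnectOnSuccess(reconnectState *st) {
    st->failed_attempts = 0;
    st->next_attempt_ms = 0;
}
```

With these example values, a node that stays down for 60 seconds would be tried 12 times instead of on each of the 600 cron ticks, while a node that comes back is still retried within about 10 seconds. The cap is the main trade-off: a larger value saves more work but delays reconnection after recovery.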


