[NEW] Excessive connection attempts to failed nodes in clusterCron() cause CPU overhead #2122

@sarthakaggarwal97

Description

The problem/use-case that the feature addresses

In clusterCron(), clusterNodeCronHandleReconnect() tries to reconnect to every node whose link == NULL on each iteration (10 times per second by default). As a result, nodes that are unreachable or in a failed state (PFAIL or FAIL) are retried continuously.

When there are multiple failing nodes, this retry logic results in:

  • High CPU usage due to frequent connConnect() calls.
  • Excessive memory allocation/free churn from repeatedly creating and destroying clusterLink objects.

Profiles of the connected nodes:

Image

Engine CPU / Used Memory metrics:

The initial spike coincides with the period when a large number of primary nodes (n/2 - 1) were killed. The subsequent increase in memory appears to be caused by the frequent reconnect attempts shown in the profile above.

Image

Once the failed nodes are resumed, CPU usage returns to its steady-state level.

Description of the feature

Add a backoff mechanism for reconnect attempts, so that failed nodes are not retried on every clusterCron() iteration.
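
As a rough illustration of what such a backoff could look like, here is a minimal sketch of capped exponential backoff kept per node. Everything in it is hypothetical: the reconnectBackoff struct, the backoff*() helpers, and the BACKOFF_BASE_MS / BACKOFF_MAX_MS values are assumptions made for this example and do not exist in the codebase. The idea is that the reconnect path would ask backoffShouldAttempt() before creating a link, call backoffOnFailure() when a connect attempt fails, and call backoffOnSuccess() once the link is established.

```c
#include <stdint.h>

/* Hypothetical per-node reconnect state; not an existing struct. */
typedef struct reconnectBackoff {
    int64_t next_attempt_ms; /* Earliest time the next connect may be tried. */
    int64_t delay_ms;        /* Current delay; 0 means no failure recorded. */
} reconnectBackoff;

#define BACKOFF_BASE_MS 100   /* Assumed: one cron tick at 10 iterations/sec. */
#define BACKOFF_MAX_MS 10000  /* Assumed cap, chosen only for illustration. */

/* Returns 1 if a reconnect may be attempted at time now_ms, 0 otherwise. */
int backoffShouldAttempt(const reconnectBackoff *b, int64_t now_ms) {
    return now_ms >= b->next_attempt_ms;
}

/* Record a failed attempt: start at the base delay, then double it up to
 * the cap, and schedule the next allowed attempt relative to now_ms. */
void backoffOnFailure(reconnectBackoff *b, int64_t now_ms) {
    if (b->delay_ms == 0) {
        b->delay_ms = BACKOFF_BASE_MS;
    } else if (b->delay_ms > BACKOFF_MAX_MS / 2) {
        b->delay_ms = BACKOFF_MAX_MS;
    } else {
        b->delay_ms *= 2;
    }
    b->next_attempt_ms = now_ms + b->delay_ms;
}

/* Record a successful connect: the next failure starts from the base delay. */
void backoffOnSuccess(reconnectBackoff *b) {
    b->delay_ms = 0;
    b->next_attempt_ms = 0;
}
```

With these assumed constants, a node that stays down would be retried after 100 ms, 200 ms, 400 ms, and so on, settling at one attempt every 10 seconds instead of one per cron iteration. Whether a cap of that size interacts badly with failure detection or failover timing is an open question that this sketch does not address.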

Status: Done