The problem/use-case that the feature addresses
In clusterCron(), the function clusterNodeCronHandleReconnect() attempts to reconnect to every node whose link == NULL on each iteration (10 times per second by default). As a result, connection attempts are repeated continuously against nodes that are unreachable or in a failed (PFAIL or FAIL) state.
When there are multiple failing nodes, this retry logic results in:
- High CPU usage due to frequent connConnect() calls.
- Excessive memory allocation/free churn from repeatedly creating and destroying clusterLink objects.
Profiles of the connected nodes:
Engine CPU / Used Memory metrics:
The initial spike coincides with the time when a large number of primary nodes (n/2 - 1) were killed. Based on the profile above, the subsequent increase in memory appears to be caused by the frequent reconnect attempts.
Once the failed nodes are resumed, resource usage returns to its steady-state level.
Description of the feature
Add a backoff mechanism that limits how often a node retries connecting to failed nodes.
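
One possible shape for such a mechanism is a capped exponential backoff tracked per node. The sketch below is illustrative only and is not taken from the codebase: the reconnectBackoff struct, the backoff* helpers, and the RECONNECT_* constants are all hypothetical names, and the 100 ms base and 10 s cap are assumed values chosen for the example. The idea is that the reconnect handler would check backoffShouldAttempt() before creating a new link, call backoffOnFailure() when a connect attempt fails, and call backoffOnSuccess() once the link is established.

```c
#include <stdint.h>

/* Hypothetical tuning values, not from the codebase. */
#define RECONNECT_BASE_DELAY_MS 100   /* assumed: one default cron period */
#define RECONNECT_MAX_DELAY_MS 10000  /* assumed upper bound on the delay */

/* Hypothetical per-node state; in practice this could live on the node. */
typedef struct {
    uint32_t failed_attempts; /* consecutive failed connect attempts */
    int64_t next_attempt_ms;  /* earliest time the next attempt is allowed */
} reconnectBackoff;

/* Delay after `failed_attempts` prior failures: base * 2^n, capped. */
static int64_t backoffDelayMs(uint32_t failed_attempts) {
    int64_t delay = RECONNECT_BASE_DELAY_MS;
    while (failed_attempts-- > 0 && delay < RECONNECT_MAX_DELAY_MS) delay *= 2;
    return delay > RECONNECT_MAX_DELAY_MS ? RECONNECT_MAX_DELAY_MS : delay;
}

/* Returns 1 if a reconnect may be attempted at time now_ms. */
static int backoffShouldAttempt(const reconnectBackoff *b, int64_t now_ms) {
    return now_ms >= b->next_attempt_ms;
}

/* Record a failed attempt and schedule the next one further out. */
static void backoffOnFailure(reconnectBackoff *b, int64_t now_ms) {
    b->next_attempt_ms = now_ms + backoffDelayMs(b->failed_attempts);
    if (b->failed_attempts < UINT32_MAX) b->failed_attempts++;
}

/* Reset the state once the link is established again. */
static void backoffOnSuccess(reconnectBackoff *b) {
    b->failed_attempts = 0;
    b->next_attempt_ms = 0;
}
```

With these assumed values, the first retry after a failure is allowed after 100 ms, the next after 200 ms, then 400 ms, and so on up to the 10 s cap. Resetting on success means a recovered node is treated like any healthy node if its link drops again later. A real implementation might also add random jitter so that many nodes do not retry the same peer at the same instant.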