Log failed cluster node(s) state periodically to capture transient state for debuggability #2011

hpatro · 2025-04-26T07:14:01Z

This PR logs CLUSTER INFO / CLUSTER NODES output every 5 seconds to the log file for verbose/debug loglevel mode.

At times a few nodes fail to converge with the rest of the cluster, and no logs capture the divergence. This logging could help us analyze such cases in test setups, where we can afford to log cluster information more aggressively.
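As a rough illustration of the gating described above (emit only at verbose/debug loglevel, at most once per 5 seconds), here is a minimal standalone sketch. It is not the code in this PR: `enum log_level`, `CLUSTER_STATE_LOG_PERIOD_MS`, and `cluster_state_log_due` are hypothetical names chosen for the example, and the caller is assumed to supply a millisecond clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical log levels, ordered from most to least verbose. */
enum log_level { LOG_DEBUG = 0, LOG_VERBOSE = 1, LOG_NOTICE = 2, LOG_WARNING = 3 };

/* The 5 second period mentioned in the PR description (name is an assumption). */
#define CLUSTER_STATE_LOG_PERIOD_MS 5000

/* Returns true when a periodic cluster-state log line is due.
 * *last_log_ms holds the time of the previous emission, or -1 if none yet.
 * It is updated only when the function returns true. */
static bool cluster_state_log_due(enum log_level configured, int64_t now_ms, int64_t *last_log_ms) {
    /* Only log when the configured level is verbose or debug. */
    if (configured > LOG_VERBOSE) return false;
    /* Rate limit: skip if the previous emission was less than one period ago. */
    if (*last_log_ms >= 0 && now_ms - *last_log_ms < CLUSTER_STATE_LOG_PERIOD_MS) return false;
    *last_log_ms = now_ms;
    return true;
}
```

A periodic job would call `cluster_state_log_due` on every tick and build the log line only when it returns true, so the cost of formatting cluster state is paid at most once per period.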

…debugability Signed-off-by: Harkrishn Patro <[email protected]>

codecov · 2025-04-26T07:30:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.99%. Comparing base (0b94ca6) to head (4e7f83c).
Report is 8 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2011      +/-   ##
============================================
- Coverage     71.01%   70.99%   -0.03%     
============================================
  Files           123      123              
  Lines         66033    66125      +92     
============================================
+ Hits          46892    46944      +52     
- Misses        19141    19181      +40

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/cluster_legacy.c | `86.19% <100.00%> (+0.10%)` | ⬆️ |

... and 22 files with indirect coverage changes

madolson · 2025-04-28T17:14:22Z

src/cluster_legacy.c

+            sds cluster_info = genClusterInfoString();
+            sds cluster_nodes = clusterGenNodesDescription(NULL, 0, 0);
+
+            sds infostring = sdscatprintf(sdsempty(), "\r\n# Cluster info\r\n");


Newlines break log parsing. If someone turns this on for some reason, the output should still be "valid" log lines. I'm OK with logging the state, but I don't think it should be the info fields verbatim.

hpatro · 2025-04-28

Interesting. So the crash report is an exception in being allowed to contain newlines?

hpatro · 2025-04-29

@madolson and I discussed this offline. We agreed on a single log line with information about the failed nodes rather than all the nodes.
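To make the "single log line about failed nodes" idea concrete, here is a small standalone sketch. It is an assumption-laden illustration, not the implementation from this PR: `struct node_view`, the `NODE_PFAIL` / `NODE_FAIL` bits, and `format_failed_nodes` are hypothetical stand-ins for the real cluster node structures and flags. The point is only that healthy nodes are skipped and the output contains spaces rather than newlines, so it stays a valid single log line.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flag bits standing in for the real node failure flags. */
#define NODE_PFAIL (1 << 0) /* suspected failure */
#define NODE_FAIL (1 << 1)  /* confirmed failure */

/* Hypothetical, simplified view of a cluster node. */
struct node_view {
    const char *name;
    int flags;
};

/* Writes a space-separated "name=state" summary of failed nodes into buf
 * (no newlines) and returns how many nodes are in a failure state.
 * If buf is too small the text is truncated, but the count stays accurate. */
static int format_failed_nodes(const struct node_view *nodes, size_t count, char *buf, size_t buflen) {
    size_t used = 0;
    int failed = 0;
    if (buflen == 0) return 0;
    buf[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        if (!(nodes[i].flags & (NODE_PFAIL | NODE_FAIL))) continue; /* healthy node: skip */
        failed++;
        if (used >= buflen - 1) continue; /* buffer full: keep counting, stop appending */
        const char *state = (nodes[i].flags & NODE_FAIL) ? "fail" : "pfail";
        int n = snprintf(buf + used, buflen - used, "%s%s=%s", used ? " " : "", nodes[i].name, state);
        if (n < 0) break; /* encoding error: stop appending */
        /* snprintf reports the untruncated length; clamp to what was actually stored. */
        used += ((size_t)n < buflen - used) ? (size_t)n : buflen - used - 1;
    }
    return failed;
}
```

The caller can then emit the buffer through whatever logging call it uses, and skip logging altogether when the returned count is zero.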

enjoy-binbin · 2025-05-08T07:15:29Z

Is the main purpose debugging? I.e., someone finds that the cluster is not behaving normally, adjusts the loglevel to verbose, and catches it?

hpatro · 2025-06-16T17:50:28Z

> Is the main purpose debugging? I.e., someone finds that the cluster is not behaving normally, adjusts the loglevel to verbose, and catches it?

Yes. Even when investigating an incident that occurred in the past, it is quite difficult for operators to figure out what went wrong with the current state of logging. I would like this to be active at NOTICE level, limited to failed-node information, which is what is actually relevant: #2011 (comment)

Log cluster state periodically to capture transient state for better …

4e7f83c

…debugability Signed-off-by: Harkrishn Patro <[email protected]>

hpatro requested review from enjoy-binbin and madolson April 26, 2025 07:14

madolson reviewed Apr 28, 2025

hpatro mentioned this pull request Apr 29, 2025

Avoid log spam about cluster node failure detection by each primary #2010

Merged

hpatro changed the title ~~Log cluster state periodically to capture transient state for debuggability~~ Log failed cluster node(s) state periodically to capture transient state for debuggability Jun 16, 2025

hpatro mentioned this pull request Jun 27, 2025

Support Large Valkey Cluster #2281

Open

15 tasks
