-
Notifications
You must be signed in to change notification settings - Fork 961
Log failed cluster node(s) state periodically to capture transient state for debuggability #2011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Conversation
…debugability Signed-off-by: Harkrishn Patro <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #2011 +/- ##
============================================
- Coverage 71.01% 70.99% -0.03%
============================================
Files 123 123
Lines 66033 66125 +92
============================================
+ Hits 46892 46944 +52
- Misses 19141 19181 +40
🚀 New features to boost your workflow:
|
| sds cluster_info = genClusterInfoString(); | ||
| sds cluster_nodes = clusterGenNodesDescription(NULL, 0, 0); | ||
|
|
||
| sds infostring = sdscatprintf(sdsempty(), "\r\n# Cluster info\r\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New lines break log parsing. If someone turns on for some reason, it should still be "valid" log lines. I'm OK logging the state, but I don't think it should just be verbatim the info fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. So, it's an exception for crash report to have new lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@madolson and I discussed offline. We discussed we can have a single log line with information around failed nodes rather than all the nodes.
|
Is the main purpose for debugging? ie someone find the cluster is not normal and adjust the loglevel to verbose and catch it? |
Yes. Even to investigate incident which occurred in the past it's quite difficult for operators to figure out the issue with the current state of logging. I would like this to be active at NOTICE level with failed nodes information which is actually relevant #2011 (comment) |
This PR logs CLUSTER INFO / CLUSTER NODES output every 5 seconds to the log file for verbose/debug loglevel mode.
Certain times few nodes are not in convergence with the entire cluster and there are no logs captured about the divergence. This logging could help us better analyze in test setup where we can aggressively log more cluster information.