loganalyzer: write markers directly to /var/log/syslog to prevent UDP drop#22776
loganalyzer: write markers directly to /var/log/syslog to prevent UDP drop#22776Pterosaur wants to merge 1 commit intosonic-net:masterfrom
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
… drop When the DUT is under heavy load (e.g. config_reload with CPU stress from test_po_cleanup_after_reload), the host rsyslogd UDP receive buffer can overflow. Markers sent solely through /dev/log (syslog UDP socket) may be silently dropped by the kernel, causing wait_for_marker() to time out with: RuntimeError: cannot find marker end-LogAnalyzer-... in /var/log/syslog Root cause: - _add_end_marker() did not pass --logs to loganalyzer.py on the DUT, so the end marker was only written via place_marker_to_syslog() which sends through the /dev/log UDP socket. - _setup_marker() only passed --logs when additional_files were configured, leaving the default case without a direct file write. - place_marker() in the DUT-side loganalyzer.py always called place_marker_to_syslog() (UDP path) in addition to any file writes. Under heavy syslog traffic (container restarts, orchagent floods), the kernel drops UDP datagrams when rsyslogd's receive buffer is full. The marker never reaches /var/log/syslog, and wait_for_marker() times out after 120 seconds. Fix: - _add_end_marker(): always pass --logs /var/log/syslog so the marker is written directly to the file. - _setup_marker(): ensure /var/log/syslog is always in the --logs list. - place_marker() (DUT-side): remove the place_marker_to_syslog() call since markers are now always written directly to log files, making the unreliable UDP path redundant. Signed-off-by: Ze Gan <[email protected]>
66c400c to
e3d7920
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
banidoru
left a comment
There was a problem hiding this comment.
Overall approach is sound — replacing unreliable UDP syslog writes with direct file appends fixes the root cause. Three issues below: a stale docstring referencing removed code, a mutable-default-argument-style bug, and a fragile caller contract.
|
|
||
| Always pass --logs /var/log/syslog so that the marker is written | ||
| directly to the file via place_marker_to_file(), in addition to | ||
| the normal syslog-socket path (place_marker_to_syslog()). |
There was a problem hiding this comment.
This docstring says the marker is written "in addition to the normal syslog-socket path (place_marker_to_syslog())", but the ansible-side change removes the place_marker_to_syslog() call entirely. The docstring is now misleading — please update it to reflect that only direct file writes are used.
| rsyslogd may drop UDP messages, causing markers sent solely via | ||
| /dev/log to be silently lost. Direct file writes are immune to | ||
| this. See also: _add_end_marker(). | ||
| """ |
There was a problem hiding this comment.
Mutating the caller's log_files list in-place (log_files.append(...)) is a subtle side effect. If a caller passes a list and reuses it later, it will unexpectedly contain /var/log/syslog. Safer to copy first:
log_files = list(log_files) if log_files else []| ''' | ||
| @summary: Place marker into '/dev/log' and each log file specified. | ||
| @summary: Place marker directly into each log file specified. | ||
| @param log_file_list : List of file paths, to be applied with marker. |
There was a problem hiding this comment.
The docstring says "Must include /var/log/syslog (callers are responsible for ensuring this)" but the code doesn't validate or enforce it. Since place_marker_to_syslog() is removed, forgetting /var/log/syslog in the list means the marker silently never reaches syslog. Consider either:
- Adding an assertion / auto-append here (defensive), or
- Dropping the caller-responsibility note if the pytest-side code already guarantees it.
banidoru
left a comment
There was a problem hiding this comment.
Overall the fix is sound — replacing unreliable UDP syslog writes with direct file appends is the right call for marker reliability. A few concerns remain:
place_marker_to_syslog()is now dead code — should be removed or kept with intent documented.- Direct file writes are vulnerable to log rotation (marker lost if file is rotated between write and read).
_add_end_markeronly passes/var/log/syslogbut not the other monitored log files, unlike_setup_markerwhich passes all of them.
| ''' | ||
|
|
||
| for log_file in log_file_list: | ||
| self.place_marker_to_file(log_file, marker) |
There was a problem hiding this comment.
place_marker_to_syslog() is no longer called from place_marker() and there appear to be no other call sites. This makes the method dead code. Consider removing place_marker_to_syslog() entirely to avoid confusion — or if it's intentionally kept as a fallback, add a comment explaining why.
| Direct file writes are immune to this problem. | ||
| ''' | ||
|
|
||
| for log_file in log_file_list: |
There was a problem hiding this comment.
Direct file writes are vulnerable to log rotation: if logrotate rotates /var/log/syslog between place_marker_to_file() and wait_for_marker(), the marker ends up in the rotated file and wait_for_marker() won't find it. The old syslog-socket approach was immune because rsyslogd handles rotation internally.
This may be rare in practice, but worth noting — consider re-writing the marker after a wait_for_marker() timeout as a retry, or documenting this as a known limitation.
| cmd = "python {run_dir}/loganalyzer.py --action add_end_marker --run_id {marker}" \ | ||
| " --logs /var/log/syslog"\ | ||
| .format(run_dir=self.dut_run_dir, marker=marker) | ||
|
|
There was a problem hiding this comment.
_add_end_marker only passes --logs /var/log/syslog, whereas _setup_marker passes all monitored log files plus /var/log/syslog. If the analyzer monitors additional log files, the end marker won't appear in them (it previously did via place_marker_to_syslog() which rsyslogd could route to multiple files). Is this asymmetry intentional?
banidoru
left a comment
There was a problem hiding this comment.
Overall approach is sound — replacing unreliable UDP syslog writes with direct file appends is a pragmatic fix. One behavioral concern below; the rest of the issues I noticed were already flagged by prior reviewers.
| if log_files: | ||
| cmd += " --logs {}".format(','.join(log_files)) | ||
| if not log_files: | ||
| log_files = [] |
There was a problem hiding this comment.
Behavioral regression when log_files=None: Previously, when no log_files were specified, --logs was omitted entirely, letting the ansible-side script fall through to its own default log file list. Now this always passes --logs /var/log/syslog, which could restrict monitoring to only syslog when the caller expected the full default set.
If callers rely on the ansible script's built-in defaults (e.g. monitoring multiple log files), this change silently narrows the scope. Please verify that all _setup_marker() call sites always pass explicit log_files, or preserve the old no---logs behavior when the original list was empty.
|
|
||
| for log_file in log_file_list: | ||
| self.place_marker_to_file(log_file, marker) | ||
|
|
There was a problem hiding this comment.
place_marker_to_file() has a missing
eturn after the early-exit guard (lines 176-178):
`python
if not len(log_file) or self.is_filename_stdin(log_file):
self.print_diagnostic_message('Log file {} not found. Skip adding marker.'.format(log_file))
<-- no return here; falls through to open(log_file, 'a')
`
For an empty string, open('', 'a') raises FileNotFoundError. With the old code, place_marker_to_syslog() was called unconditionally after the loop, so a file-write failure was silently absorbed. Now that place_marker_to_syslog() is removed, this exception propagates up and the marker is lost entirely. Please add
eturn after the diagnostic message.
|
|
||
| self.place_marker_to_syslog(marker) | ||
| if wait_for_marker: | ||
| if self.wait_for_marker(marker) is False: |
There was a problem hiding this comment.
wait_for_marker() always polls the hardcoded path /var/log/syslog (not the files in log_file_list that were just written to). This creates an implicit contract: log_file_list must always contain /var/log/syslog, otherwise place_marker_to_file() writes the marker to a completely different file and wait_for_marker() times out — the exact same failure mode this PR is trying to fix, just from a different cause.
The callers added in this PR (_add_end_marker, _setup_marker) do guarantee /var/log/syslog is in the list, but there's nothing enforcing this contract for future callers or direct script users. Consider either:
- Adding an assertion at the top of place_marker() when wait_for_marker=True: �ssert '/var/log/syslog' in log_file_list
- Or updating wait_for_marker() to accept the file list as a parameter and search in the files that were actually written to.
| cmd = "python {run_dir}/loganalyzer.py --action add_end_marker --run_id {marker}" \ | ||
| " --logs /var/log/syslog"\ | ||
| .format(run_dir=self.dut_run_dir, marker=marker) | ||
|
|
There was a problem hiding this comment.
_add_end_marker always passes --logs /var/log/syslog regardless of which log files the LogAnalyzer instance was configured to monitor. This is fine for the _add_end_marker use case (it's only used to anchor the syslog extraction window), but worth noting: the ansible �nalyze action (line 833-834 of ansible's loganalyzer.py) also calls place_marker(log_file_list, end_marker) using the full monitored file list, which will produce a second �nd-LogAnalyzer-{run_id} entry in /var/log/syslog.
Please verify that �nalyze_file() handles duplicate end markers gracefully (i.e., �rr_duplicate_end_marker is not set as an error when the same end marker string appears twice in the file). If it does treat duplicates as an error, this PR could introduce a new failure mode in tests where the analysis log file is the same as the one _add_end_marker writes to.
| if log_files: | ||
| cmd += " --logs {}".format(','.join(log_files)) | ||
| if not log_files: | ||
| log_files = [] |
There was a problem hiding this comment.
Nit: log_files is mutated in place here, which is a side effect on the caller's list (already noted elsewhere). But there's also a behavioural change worth double-checking: with the old code, when log_files=None and --logs was omitted, the ansible-side main() at line 844-845 would append system_log_file to log_file_list only for the �nalyze action — not for init. So for init with no --logs, log_file_list was empty and the start marker was written purely via the syslog socket.
With this change, init now receives --logs /var/log/syslog, so place_marker(['/var/log/syslog'], start_marker) is called, which calls place_marker_to_file('/var/log/syslog', ...). This is functionally equivalent only if the file write is reliable and the calling user has write permission to /var/log/syslog. If the loganalyzer script is ever run without root / �dm group privileges, open('/var/log/syslog', 'a') will raise PermissionError — whereas the old syslog-socket path worked for any user.
Description of PR
Summary:
Fix LogAnalyzer marker placement to write directly to
/var/log/sysloginstead of relying solely on the syslog UDP socket (/dev/log), which can silently drop messages under heavy system load.Type of change
Back port request
Approach
What is the motivation for this PR?
test_po_cleanup_after_reloadfails intermittently with:Full error trace:
Root cause analysis:
The test creates heavy CPU load (16 cores running
yes) and then callsconfig_reload, which restarts all containers. During this period:_add_end_marker()callsloganalyzer.py --action add_end_markerwithout--logs, solog_file_listis empty.place_marker()skipsplace_marker_to_file()(no files in list) and only callsplace_marker_to_syslog(), which writes through the/dev/logUnix domain socket (UDP/SOCK_DGRAM).rsyslogdreceives massive syslog traffic from all restarting containers (orchagent, syncd, teamd, etc.). Its UDP receive buffer overflows and the kernel silently drops the marker message.wait_for_marker()polls/var/log/syslogfor 120 seconds but the marker never appears, resulting inRuntimeError.This is a generic issue affecting any test that uses
LogAnalyzerunder heavy system load, not justtest_po_cleanup_after_reload.Note: PR #20327 attempted to fix this by stopping swss rsyslogd before config_reload, but
config_reloadrestarts the swss container, bringing rsyslogd back. Even with swss rsyslogd stopped, other containers still generate enough syslog traffic to overflow the UDP buffer.How did you do it?
Three changes:
_add_end_marker()(tests/common/plugins/loganalyzer/loganalyzer.py): Always pass--logs /var/log/syslogso the end marker is written directly to the file._setup_marker()(tests/common/plugins/loganalyzer/loganalyzer.py): Ensure/var/log/syslogis always included in--logs, so the start marker is also written directly to the file.place_marker()(ansible/roles/test/files/tools/loganalyzer/loganalyzer.py): Remove theplace_marker_to_syslog()call. Since callers now always include/var/log/sysloginlog_file_list,place_marker_to_file()writes the marker directly viaopen()+write()+flush(). The unreliable UDP socket path is no longer needed.Direct file writes (
open+append+flush) are immune to UDP buffer overflow — they always succeed regardless of system load.How did you verify/test it?
Ran
test_po_cleanup_after_reloadon sn5640 (str4-sn5640-3, t1 topology, 16 vCPUs). Previously failed consistently; passes reliably after the fix.Any platform specific information?
No. This is a generic loganalyzer issue affecting all platforms.
Supported testbed topology if it's a new test case?
N/A (bug fix)
Documentation
N/A