Skip to content

loganalyzer: write markers directly to /var/log/syslog to prevent UDP drop#22776

Closed
Pterosaur wants to merge 1 commit intosonic-net:masterfrom
Pterosaur:fix/loganalyzer-marker-udp-drop
Closed

loganalyzer: write markers directly to /var/log/syslog to prevent UDP drop#22776
Pterosaur wants to merge 1 commit intosonic-net:masterfrom
Pterosaur:fix/loganalyzer-marker-udp-drop

Conversation

@Pterosaur
Copy link
Copy Markdown
Contributor

@Pterosaur Pterosaur commented Mar 6, 2026

Description of PR

Summary:
Fix LogAnalyzer marker placement to write directly to /var/log/syslog instead of relying solely on the syslog UDP socket (/dev/log), which can silently drop messages under heavy system load.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

test_po_cleanup_after_reload fails intermittently with:

RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.2026-03-04-08:34:39 in /var/log/syslog

Full error trace:

tests.common.errors.RunAnsibleModuleFail: run module command failed, Ansible Results =>
failed = True
changed = True
rc = 1
cmd = ['python', '/tmp/loganalyzer.py', '--action', 'add_end_marker', '--run_id', 'port_channel_cleanup.2026-03-04-08:34:39']
start = 2026-03-04 08:48:07.817786
end = 2026-03-04 08:50:19.294633
delta = 0:02:11.476847
msg = non-zero return code
stderr =
Traceback (most recent call last):
  File "/tmp/loganalyzer.py", line 877, in <module>
    main(sys.argv[1:])
  File "/tmp/loganalyzer.py", line 858, in main
    analyzer.place_marker(
  File "/tmp/loganalyzer.py", line 262, in place_marker
    raise RuntimeError(
RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.2026-03-04-08:34:39 in /var/log/syslog

Root cause analysis:

The test creates heavy CPU load (16 cores running yes) and then calls config_reload, which restarts all containers. During this period:

  1. _add_end_marker() calls loganalyzer.py --action add_end_marker without --logs, so log_file_list is empty.
  2. place_marker() skips place_marker_to_file() (no files in list) and only calls place_marker_to_syslog(), which writes through the /dev/log Unix domain socket (UDP/SOCK_DGRAM).
  3. The host rsyslogd receives massive syslog traffic from all restarting containers (orchagent, syncd, teamd, etc.). Its UDP receive buffer overflows and the kernel silently drops the marker message.
  4. wait_for_marker() polls /var/log/syslog for 120 seconds but the marker never appears, resulting in RuntimeError.

This is a generic issue affecting any test that uses LogAnalyzer under heavy system load, not just test_po_cleanup_after_reload.

Note: PR #20327 attempted to fix this by stopping swss rsyslogd before config_reload, but config_reload restarts the swss container, bringing rsyslogd back. Even with swss rsyslogd stopped, other containers still generate enough syslog traffic to overflow the UDP buffer.

How did you do it?

Three changes:

  1. _add_end_marker() (tests/common/plugins/loganalyzer/loganalyzer.py): Always pass --logs /var/log/syslog so the end marker is written directly to the file.

  2. _setup_marker() (tests/common/plugins/loganalyzer/loganalyzer.py): Ensure /var/log/syslog is always included in --logs, so the start marker is also written directly to the file.

  3. place_marker() (ansible/roles/test/files/tools/loganalyzer/loganalyzer.py): Remove the place_marker_to_syslog() call. Since callers now always include /var/log/syslog in log_file_list, place_marker_to_file() writes the marker directly via open() + write() + flush(). The unreliable UDP socket path is no longer needed.

Direct file writes (open + append + flush) are immune to UDP buffer overflow — they always succeed regardless of system load.

How did you verify/test it?

Ran test_po_cleanup_after_reload on sn5640 (str4-sn5640-3, t1 topology, 16 vCPUs). Previously failed consistently; passes reliably after the fix.

Any platform specific information?

No. This is a generic loganalyzer issue affecting all platforms.

Supported testbed topology if it's a new test case?

N/A (bug fix)

Documentation

N/A

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

… drop

When the DUT is under heavy load (e.g. config_reload with CPU stress
from test_po_cleanup_after_reload), the host rsyslogd UDP receive buffer
can overflow. Markers sent solely through /dev/log (syslog UDP socket)
may be silently dropped by the kernel, causing wait_for_marker() to time
out with:

  RuntimeError: cannot find marker end-LogAnalyzer-... in /var/log/syslog

Root cause:
- _add_end_marker() did not pass --logs to loganalyzer.py on the DUT,
  so the end marker was only written via place_marker_to_syslog() which
  sends through the /dev/log UDP socket.
- _setup_marker() only passed --logs when additional_files were
  configured, leaving the default case without a direct file write.
- place_marker() in the DUT-side loganalyzer.py always called
  place_marker_to_syslog() (UDP path) in addition to any file writes.

Under heavy syslog traffic (container restarts, orchagent floods), the
kernel drops UDP datagrams when rsyslogd's receive buffer is full.
The marker never reaches /var/log/syslog, and wait_for_marker() times
out after 120 seconds.

Fix:
- _add_end_marker(): always pass --logs /var/log/syslog so the marker
  is written directly to the file.
- _setup_marker(): ensure /var/log/syslog is always in the --logs list.
- place_marker() (DUT-side): remove the place_marker_to_syslog() call
  since markers are now always written directly to log files, making
  the unreliable UDP path redundant.

Signed-off-by: Ze Gan <[email protected]>
@Pterosaur Pterosaur force-pushed the fix/loganalyzer-marker-udp-drop branch from 66c400c to e3d7920 Compare March 6, 2026 14:37
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall approach is sound — replacing unreliable UDP syslog writes with direct file appends fixes the root cause. Three issues below: a stale docstring referencing removed code, a mutable-default-argument-style bug, and a fragile caller contract.


Always pass --logs /var/log/syslog so that the marker is written
directly to the file via place_marker_to_file(), in addition to
the normal syslog-socket path (place_marker_to_syslog()).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring says the marker is written "in addition to the normal syslog-socket path (place_marker_to_syslog())", but the ansible-side change removes the place_marker_to_syslog() call entirely. The docstring is now misleading — please update it to reflect that only direct file writes are used.

rsyslogd may drop UDP messages, causing markers sent solely via
/dev/log to be silently lost. Direct file writes are immune to
this. See also: _add_end_marker().
"""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutating the caller's log_files list in-place (log_files.append(...)) is a subtle side effect. If a caller passes a list and reuses it later, it will unexpectedly contain /var/log/syslog. Safer to copy first:

log_files = list(log_files) if log_files else []

'''
@summary: Place marker into '/dev/log' and each log file specified.
@summary: Place marker directly into each log file specified.
@param log_file_list : List of file paths, to be applied with marker.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says "Must include /var/log/syslog (callers are responsible for ensuring this)" but the code doesn't validate or enforce it. Since place_marker_to_syslog() is removed, forgetting /var/log/syslog in the list means the marker silently never reaches syslog. Consider either:

  • Adding an assertion / auto-append here (defensive), or
  • Dropping the caller-responsibility note if the pytest-side code already guarantees it.

Copy link
Copy Markdown

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the fix is sound — replacing unreliable UDP syslog writes with direct file appends is the right call for marker reliability. A few concerns remain:

  1. place_marker_to_syslog() is now dead code — should be removed or kept with intent documented.
  2. Direct file writes are vulnerable to log rotation (marker lost if file is rotated between write and read).
  3. _add_end_marker only passes /var/log/syslog but not the other monitored log files, unlike _setup_marker which passes all of them.

'''

for log_file in log_file_list:
self.place_marker_to_file(log_file, marker)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

place_marker_to_syslog() is no longer called from place_marker() and there appear to be no other call sites. This makes the method dead code. Consider removing place_marker_to_syslog() entirely to avoid confusion — or if it's intentionally kept as a fallback, add a comment explaining why.

Direct file writes are immune to this problem.
'''

for log_file in log_file_list:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct file writes are vulnerable to log rotation: if logrotate rotates /var/log/syslog between place_marker_to_file() and wait_for_marker(), the marker ends up in the rotated file and wait_for_marker() won't find it. The old syslog-socket approach was immune because rsyslogd handles rotation internally.

This may be rare in practice, but worth noting — consider re-writing the marker after a wait_for_marker() timeout as a retry, or documenting this as a known limitation.

cmd = "python {run_dir}/loganalyzer.py --action add_end_marker --run_id {marker}" \
" --logs /var/log/syslog"\
.format(run_dir=self.dut_run_dir, marker=marker)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_add_end_marker only passes --logs /var/log/syslog, whereas _setup_marker passes all monitored log files plus /var/log/syslog. If the analyzer monitors additional log files, the end marker won't appear in them (it previously did via place_marker_to_syslog() which rsyslogd could route to multiple files). Is this asymmetry intentional?

Copy link
Copy Markdown

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall approach is sound — replacing unreliable UDP syslog writes with direct file appends is a pragmatic fix. One behavioral concern below; the rest of the issues I noticed were already flagged by prior reviewers.

if log_files:
cmd += " --logs {}".format(','.join(log_files))
if not log_files:
log_files = []
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavioral regression when log_files=None: Previously, when no log_files were specified, --logs was omitted entirely, letting the ansible-side script fall through to its own default log file list. Now this always passes --logs /var/log/syslog, which could restrict monitoring to only syslog when the caller expected the full default set.

If callers rely on the ansible script's built-in defaults (e.g. monitoring multiple log files), this change silently narrows the scope. Please verify that all _setup_marker() call sites always pass explicit log_files, or preserve the old no---logs behavior when the original list was empty.

Copy link
Copy Markdown

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All reviewers approved. LGTM.

@Pterosaur Pterosaur marked this pull request as draft March 10, 2026 22:08

for log_file in log_file_list:
self.place_marker_to_file(log_file, marker)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

place_marker_to_file() has a missing
eturn after the early-exit guard (lines 176-178):

`python
if not len(log_file) or self.is_filename_stdin(log_file):
self.print_diagnostic_message('Log file {} not found. Skip adding marker.'.format(log_file))

<-- no return here; falls through to open(log_file, 'a')

`

For an empty string, open('', 'a') raises FileNotFoundError. With the old code, place_marker_to_syslog() was called unconditionally after the loop, so a file-write failure was silently absorbed. Now that place_marker_to_syslog() is removed, this exception propagates up and the marker is lost entirely. Please add
eturn after the diagnostic message.


self.place_marker_to_syslog(marker)
if wait_for_marker:
if self.wait_for_marker(marker) is False:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait_for_marker() always polls the hardcoded path /var/log/syslog (not the files in log_file_list that were just written to). This creates an implicit contract: log_file_list must always contain /var/log/syslog, otherwise place_marker_to_file() writes the marker to a completely different file and wait_for_marker() times out — the exact same failure mode this PR is trying to fix, just from a different cause.

The callers added in this PR (_add_end_marker, _setup_marker) do guarantee /var/log/syslog is in the list, but there's nothing enforcing this contract for future callers or direct script users. Consider either:

  • Adding an assertion at the top of place_marker() when wait_for_marker=True: �ssert '/var/log/syslog' in log_file_list
  • Or updating wait_for_marker() to accept the file list as a parameter and search in the files that were actually written to.

cmd = "python {run_dir}/loganalyzer.py --action add_end_marker --run_id {marker}" \
" --logs /var/log/syslog"\
.format(run_dir=self.dut_run_dir, marker=marker)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_add_end_marker always passes --logs /var/log/syslog regardless of which log files the LogAnalyzer instance was configured to monitor. This is fine for the _add_end_marker use case (it's only used to anchor the syslog extraction window), but worth noting: the ansible �nalyze action (line 833-834 of ansible's loganalyzer.py) also calls place_marker(log_file_list, end_marker) using the full monitored file list, which will produce a second �nd-LogAnalyzer-{run_id} entry in /var/log/syslog.

Please verify that �nalyze_file() handles duplicate end markers gracefully (i.e., �rr_duplicate_end_marker is not set as an error when the same end marker string appears twice in the file). If it does treat duplicates as an error, this PR could introduce a new failure mode in tests where the analysis log file is the same as the one _add_end_marker writes to.

if log_files:
cmd += " --logs {}".format(','.join(log_files))
if not log_files:
log_files = []
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: log_files is mutated in place here, which is a side effect on the caller's list (already noted elsewhere). But there's also a behavioural change worth double-checking: with the old code, when log_files=None and --logs was omitted, the ansible-side main() at line 844-845 would append system_log_file to log_file_list only for the �nalyze action — not for init. So for init with no --logs, log_file_list was empty and the start marker was written purely via the syslog socket.

With this change, init now receives --logs /var/log/syslog, so place_marker(['/var/log/syslog'], start_marker) is called, which calls place_marker_to_file('/var/log/syslog', ...). This is functionally equivalent only if the file write is reliable and the calling user has write permission to /var/log/syslog. If the loganalyzer script is ever run without root / �dm group privileges, open('/var/log/syslog', 'a') will raise PermissionError — whereas the old syslog-socket path worked for any user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants