loganalyzer: write markers directly to /var/log/syslog to prevent UDP drop by Pterosaur · Pull Request #22776 · sonic-net/sonic-mgmt

Pterosaur · 2026-03-06T14:35:47Z

Description of PR

Summary:
Fix LogAnalyzer marker placement to write directly to /var/log/syslog instead of relying solely on the syslog UDP socket (/dev/log), which can silently drop messages under heavy system load.

Type of change

Back port request

Approach

What is the motivation for this PR?

test_po_cleanup_after_reload fails intermittently with:

RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.2026-03-04-08:34:39 in /var/log/syslog

Full error trace:

tests.common.errors.RunAnsibleModuleFail: run module command failed, Ansible Results =>
failed = True
changed = True
rc = 1
cmd = ['python', '/tmp/loganalyzer.py', '--action', 'add_end_marker', '--run_id', 'port_channel_cleanup.2026-03-04-08:34:39']
start = 2026-03-04 08:48:07.817786
end = 2026-03-04 08:50:19.294633
delta = 0:02:11.476847
msg = non-zero return code
stderr =
Traceback (most recent call last):
  File "/tmp/loganalyzer.py", line 877, in <module>
    main(sys.argv[1:])
  File "/tmp/loganalyzer.py", line 858, in main
    analyzer.place_marker(
  File "/tmp/loganalyzer.py", line 262, in place_marker
    raise RuntimeError(
RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.2026-03-04-08:34:39 in /var/log/syslog

Root cause analysis:

The test creates heavy CPU load (16 cores running yes) and then calls config_reload, which restarts all containers. During this period:

_add_end_marker() calls loganalyzer.py --action add_end_marker without --logs, so log_file_list is empty.
place_marker() skips place_marker_to_file() (no files in list) and only calls place_marker_to_syslog(), which writes through the /dev/log Unix domain socket (UDP/SOCK_DGRAM).
The host rsyslogd receives massive syslog traffic from all restarting containers (orchagent, syncd, teamd, etc.). Its UDP receive buffer overflows and the kernel silently drops the marker message.
wait_for_marker() polls /var/log/syslog for 120 seconds but the marker never appears, resulting in RuntimeError.

This is a generic issue affecting any test that uses LogAnalyzer under heavy system load, not just test_po_cleanup_after_reload.

Note: PR #20327 attempted to fix this by stopping swss rsyslogd before config_reload, but config_reload restarts the swss container, bringing rsyslogd back. Even with swss rsyslogd stopped, other containers still generate enough syslog traffic to overflow the UDP buffer.

How did you do it?

Three changes:

_add_end_marker() (tests/common/plugins/loganalyzer/loganalyzer.py): Always pass --logs /var/log/syslog so the end marker is written directly to the file.
_setup_marker() (tests/common/plugins/loganalyzer/loganalyzer.py): Ensure /var/log/syslog is always included in --logs, so the start marker is also written directly to the file.
place_marker() (ansible/roles/test/files/tools/loganalyzer/loganalyzer.py): Remove the place_marker_to_syslog() call. Since callers now always include /var/log/syslog in log_file_list, place_marker_to_file() writes the marker directly via open() + write() + flush(). The unreliable UDP socket path is no longer needed.

Direct file writes (open + append + flush) are immune to UDP buffer overflow — they always succeed regardless of system load.

How did you verify/test it?

Ran test_po_cleanup_after_reload on sn5640 (str4-sn5640-3, t1 topology, 16 vCPUs). Previously failed consistently; passes reliably after the fix.

Any platform specific information?

No. This is a generic loganalyzer issue affecting all platforms.

Supported testbed topology if it's a new test case?

N/A (bug fix)

Documentation

N/A

mssonicbld · 2026-03-06T14:35:56Z

/azp run

azure-pipelines · 2026-03-06T14:36:12Z

Azure Pipelines successfully started running 1 pipeline(s).

… drop When the DUT is under heavy load (e.g. config_reload with CPU stress from test_po_cleanup_after_reload), the host rsyslogd UDP receive buffer can overflow. Markers sent solely through /dev/log (syslog UDP socket) may be silently dropped by the kernel, causing wait_for_marker() to time out with: RuntimeError: cannot find marker end-LogAnalyzer-... in /var/log/syslog Root cause: - _add_end_marker() did not pass --logs to loganalyzer.py on the DUT, so the end marker was only written via place_marker_to_syslog() which sends through the /dev/log UDP socket. - _setup_marker() only passed --logs when additional_files were configured, leaving the default case without a direct file write. - place_marker() in the DUT-side loganalyzer.py always called place_marker_to_syslog() (UDP path) in addition to any file writes. Under heavy syslog traffic (container restarts, orchagent floods), the kernel drops UDP datagrams when rsyslogd's receive buffer is full. The marker never reaches /var/log/syslog, and wait_for_marker() times out after 120 seconds. Fix: - _add_end_marker(): always pass --logs /var/log/syslog so the marker is written directly to the file. - _setup_marker(): ensure /var/log/syslog is always in the --logs list. - place_marker() (DUT-side): remove the place_marker_to_syslog() call since markers are now always written directly to log files, making the unreliable UDP path redundant. Signed-off-by: Ze Gan <[email protected]>

mssonicbld · 2026-03-06T14:37:59Z

/azp run

azure-pipelines · 2026-03-06T14:38:16Z

Azure Pipelines successfully started running 1 pipeline(s).

banidoru

Overall approach is sound — replacing unreliable UDP syslog writes with direct file appends fixes the root cause. Three issues below: a stale docstring referencing removed code, a mutable-default-argument-style bug, and a fragile caller contract.

banidoru · 2026-03-10T03:11:51Z

tests/common/plugins/loganalyzer/loganalyzer.py


+        Always pass --logs /var/log/syslog so that the marker is written
+        directly to the file via place_marker_to_file(), in addition to
+        the normal syslog-socket path (place_marker_to_syslog()).


This docstring says the marker is written "in addition to the normal syslog-socket path (place_marker_to_syslog())", but the ansible-side change removes the place_marker_to_syslog() call entirely. The docstring is now misleading — please update it to reflect that only direct file writes are used.

banidoru · 2026-03-10T03:11:51Z

tests/common/plugins/loganalyzer/loganalyzer.py

+        rsyslogd may drop UDP messages, causing markers sent solely via
+        /dev/log to be silently lost.  Direct file writes are immune to
+        this.  See also: _add_end_marker().
        """


Mutating the caller's log_files list in-place (log_files.append(...)) is a subtle side effect. If a caller passes a list and reuses it later, it will unexpectedly contain /var/log/syslog. Safer to copy first:

log_files = list(log_files) if log_files else []

banidoru · 2026-03-10T03:11:51Z

ansible/roles/test/files/tools/loganalyzer/loganalyzer.py

        '''
-        @summary: Place marker into '/dev/log' and each log file specified.
+        @summary: Place marker directly into each log file specified.
        @param log_file_list : List of file paths, to be applied with marker.


The docstring says "Must include /var/log/syslog (callers are responsible for ensuring this)" but the code doesn't validate or enforce it. Since place_marker_to_syslog() is removed, forgetting /var/log/syslog in the list means the marker silently never reaches syslog. Consider either:

Adding an assertion / auto-append here (defensive), or

Dropping the caller-responsibility note if the pytest-side code already guarantees it.

banidoru

Overall the fix is sound — replacing unreliable UDP syslog writes with direct file appends is the right call for marker reliability. A few concerns remain:

place_marker_to_syslog() is now dead code — should be removed or kept with intent documented.
Direct file writes are vulnerable to log rotation (marker lost if file is rotated between write and read).
_add_end_marker only passes /var/log/syslog but not the other monitored log files, unlike _setup_marker which passes all of them.

banidoru · 2026-03-10T03:13:30Z

ansible/roles/test/files/tools/loganalyzer/loganalyzer.py

        '''

        for log_file in log_file_list:
            self.place_marker_to_file(log_file, marker)


place_marker_to_syslog() is no longer called from place_marker() and there appear to be no other call sites. This makes the method dead code. Consider removing place_marker_to_syslog() entirely to avoid confusion — or if it's intentionally kept as a fallback, add a comment explaining why.

banidoru · 2026-03-10T03:13:30Z

ansible/roles/test/files/tools/loganalyzer/loganalyzer.py

+        Direct file writes are immune to this problem.
        '''

        for log_file in log_file_list:


Direct file writes are vulnerable to log rotation: if logrotate rotates /var/log/syslog between place_marker_to_file() and wait_for_marker(), the marker ends up in the rotated file and wait_for_marker() won't find it. The old syslog-socket approach was immune because rsyslogd handles rotation internally.

This may be rare in practice, but worth noting — consider re-writing the marker after a wait_for_marker() timeout as a retry, or documenting this as a known limitation.

banidoru · 2026-03-10T03:13:30Z

tests/common/plugins/loganalyzer/loganalyzer.py

+        cmd = "python {run_dir}/loganalyzer.py --action add_end_marker --run_id {marker}" \
+              " --logs /var/log/syslog"\
            .format(run_dir=self.dut_run_dir, marker=marker)



_add_end_marker only passes --logs /var/log/syslog, whereas _setup_marker passes all monitored log files plus /var/log/syslog. If the analyzer monitors additional log files, the end marker won't appear in them (it previously did via place_marker_to_syslog() which rsyslogd could route to multiple files). Is this asymmetry intentional?

banidoru

Overall approach is sound — replacing unreliable UDP syslog writes with direct file appends is a pragmatic fix. One behavioral concern below; the rest of the issues I noticed were already flagged by prior reviewers.

banidoru · 2026-03-10T03:14:31Z

tests/common/plugins/loganalyzer/loganalyzer.py

-        if log_files:
-            cmd += " --logs {}".format(','.join(log_files))
+        if not log_files:
+            log_files = []


Behavioral regression when log_files=None: Previously, when no log_files were specified, --logs was omitted entirely, letting the ansible-side script fall through to its own default log file list. Now this always passes --logs /var/log/syslog, which could restrict monitoring to only syslog when the caller expected the full default set.

If callers rely on the ansible script's built-in defaults (e.g. monitoring multiple log files), this change silently narrows the scope. Please verify that all _setup_marker() call sites always pass explicit log_files, or preserve the old no---logs behavior when the original list was empty.

banidoru

All reviewers approved. LGTM.

yutongzhang-microsoft · 2026-03-12T08:32:53Z

ansible/roles/test/files/tools/loganalyzer/loganalyzer.py


        for log_file in log_file_list:
            self.place_marker_to_file(log_file, marker)



place_marker_to_file() has a missing
eturn after the early-exit guard (lines 176-178):

`python
if not len(log_file) or self.is_filename_stdin(log_file):
self.print_diagnostic_message('Log file {} not found. Skip adding marker.'.format(log_file))

<-- no return here; falls through to open(log_file, 'a')

`

For an empty string, open('', 'a') raises FileNotFoundError. With the old code, place_marker_to_syslog() was called unconditionally after the loop, so a file-write failure was silently absorbed. Now that place_marker_to_syslog() is removed, this exception propagates up and the marker is lost entirely. Please add
eturn after the diagnostic message.

yutongzhang-microsoft · 2026-03-12T08:32:53Z

ansible/roles/test/files/tools/loganalyzer/loganalyzer.py


-        self.place_marker_to_syslog(marker)
        if wait_for_marker:
            if self.wait_for_marker(marker) is False:


wait_for_marker() always polls the hardcoded path /var/log/syslog (not the files in log_file_list that were just written to). This creates an implicit contract: log_file_list must always contain /var/log/syslog, otherwise place_marker_to_file() writes the marker to a completely different file and wait_for_marker() times out — the exact same failure mode this PR is trying to fix, just from a different cause.

The callers added in this PR (_add_end_marker, _setup_marker) do guarantee /var/log/syslog is in the list, but there's nothing enforcing this contract for future callers or direct script users. Consider either:

Adding an assertion at the top of place_marker() when wait_for_marker=True: �ssert '/var/log/syslog' in log_file_list

Or updating wait_for_marker() to accept the file list as a parameter and search in the files that were actually written to.

yutongzhang-microsoft · 2026-03-12T08:32:53Z

tests/common/plugins/loganalyzer/loganalyzer.py

+        cmd = "python {run_dir}/loganalyzer.py --action add_end_marker --run_id {marker}" \
+              " --logs /var/log/syslog"\
            .format(run_dir=self.dut_run_dir, marker=marker)



_add_end_marker always passes --logs /var/log/syslog regardless of which log files the LogAnalyzer instance was configured to monitor. This is fine for the _add_end_marker use case (it's only used to anchor the syslog extraction window), but worth noting: the ansible �nalyze action (line 833-834 of ansible's loganalyzer.py) also calls place_marker(log_file_list, end_marker) using the full monitored file list, which will produce a second �nd-LogAnalyzer-{run_id} entry in /var/log/syslog.

Please verify that �nalyze_file() handles duplicate end markers gracefully (i.e., �rr_duplicate_end_marker is not set as an error when the same end marker string appears twice in the file). If it does treat duplicates as an error, this PR could introduce a new failure mode in tests where the analysis log file is the same as the one _add_end_marker writes to.

yutongzhang-microsoft · 2026-03-12T08:32:53Z

tests/common/plugins/loganalyzer/loganalyzer.py

-        if log_files:
-            cmd += " --logs {}".format(','.join(log_files))
+        if not log_files:
+            log_files = []


Nit: log_files is mutated in place here, which is a side effect on the caller's list (already noted elsewhere). But there's also a behavioural change worth double-checking: with the old code, when log_files=None and --logs was omitted, the ansible-side main() at line 844-845 would append system_log_file to log_file_list only for the �nalyze action — not for init. So for init with no --logs, log_file_list was empty and the start marker was written purely via the syslog socket.

With this change, init now receives --logs /var/log/syslog, so place_marker(['/var/log/syslog'], start_marker) is called, which calls place_marker_to_file('/var/log/syslog', ...). This is functionally equivalent only if the file write is reliable and the calling user has write permission to /var/log/syslog. If the loganalyzer script is ever run without root / �dm group privileges, open('/var/log/syslog', 'a') will raise PermissionError — whereas the old syslog-socket path worked for any user.

github-actions bot requested review from ZhaohuiS, nhe-NV and xwjiang-ms March 6, 2026 14:36

Pterosaur force-pushed the fix/loganalyzer-marker-udp-drop branch from 66c400c to e3d7920 Compare March 6, 2026 14:37

Pterosaur added Request for msft-202412 Branch Request for 202511 branch Request to backport a change to 202511 branch labels Mar 6, 2026

mssonicbld added the Cherry Pick Conflict_msft-202412 label Mar 6, 2026

banidoru reviewed Mar 10, 2026

View reviewed changes

banidoru approved these changes Mar 10, 2026

View reviewed changes

Pterosaur marked this pull request as draft March 10, 2026 22:08

yutongzhang-microsoft reviewed Mar 12, 2026

View reviewed changes

vmittal-msft added the Approved for 202511 branch label Mar 16, 2026

Pterosaur closed this Mar 25, 2026

This was referenced Mar 25, 2026

test_po_cleanup: replace LogAnalyzer with direct syslog grep to fix marker drop #23289

Closed

test_po_cleanup: fix loganalyzer marker loss under CPU stress #23291

Open


		for log_file in log_file_list:
		self.place_marker_to_file(log_file, marker)

Conversation

Pterosaur commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of PR

Type of change

Back port request

Approach

What is the motivation for this PR?

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

Uh oh!

mssonicbld commented Mar 6, 2026

Uh oh!

azure-pipelines bot commented Mar 6, 2026

Uh oh!

mssonicbld commented Mar 6, 2026

Uh oh!

azure-pipelines bot commented Mar 6, 2026

Uh oh!

banidoru left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

banidoru left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

banidoru left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

banidoru left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

<-- no return here; falls through to open(log_file, 'a')

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Pterosaur commented Mar 6, 2026 •

edited

Loading