test_po_cleanup: replace LogAnalyzer with direct syslog grep to fix marker drop by Pterosaur · Pull Request #23289 · sonic-net/sonic-mgmt

Pterosaur · 2026-03-25T03:55:46Z

Description of PR

Fix test_po_cleanup_after_reload spurious failure caused by LogAnalyzer marker being dropped under heavy syslog load.

Type of change

Back port request

Approach

What is the motivation for this PR?

test_po_cleanup_after_reload fails consistently with:

RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.xxx in /var/log/syslog

Root cause: The test creates heavy CPU load (16 cores running yes) and then calls config_reload, which restarts all containers. The LogAnalyzer context manager places its start/end markers through the syslog UDP socket (/dev/log). On scale setups (e.g. sn5640 with 448 port channels), the restarting containers generate so much syslog traffic during config_reload that the host rsyslogd UDP receive buffer overflows for minutes (verified >90 seconds). The marker messages are silently dropped, and LogAnalyzer raises the RuntimeError above.
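One way to check for this kind of drop on the host, assuming the loss does occur on a UDP socket as described, is to compare the kernel's UDP counters before and after the reload. The sketch below is not part of this PR; it only parses the standard Linux /proc/net/snmp format, and the helper name parse_udp_counters is hypothetical.

```python
from typing import Dict


def parse_udp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the two 'Udp:' lines of /proc/net/snmp into a name -> value dict."""
    # /proc/net/snmp has one header line and one value line per protocol.
    # 'UdpLite:' lines are skipped because they do not start with 'Udp:'.
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) != 2:
        raise ValueError("expected one header line and one value line for Udp")
    names, values = rows
    return dict(zip(names, map(int, values)))
```

A RcvbufErrors value that grows across the config_reload would be consistent with datagrams being dropped at a full receive buffer, although the counter is host-wide and does not identify which socket dropped them.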

This supersedes PR #22776, which attempted to fix this by writing markers directly to /var/log/syslog; that approach causes log timestamp ordering issues.

How did you do it?

Replace the LogAnalyzer context manager usage in test_po_cleanup_after_reload with a direct approach:

1. Record the syslog line count (wc -l /var/log/syslog) before config_reload.
2. Run config_reload under CPU stress as before.
3. After reload, grep the syslog tail for the expected port-channel cleanup log patterns.

This is immune to UDP buffer overflow since it reads the syslog file directly via tail -n +N.
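The line-count approach can be sketched in plain Python as follows. This is a minimal illustration against a local file, not the code in this PR: the helper names are hypothetical, and the real test would run the equivalent wc/tail/grep commands on the DUT. Note that tail -n +N starts output at line N, so reading only the lines written after a baseline of N lines corresponds to tail -n +(N+1).

```python
import re
from typing import Iterable, List


def count_lines(path: str) -> int:
    """Return the number of lines currently in the file (roughly `wc -l`)."""
    with open(path, "r", errors="replace") as f:
        return sum(1 for _ in f)


def lines_after(path: str, baseline: int) -> List[str]:
    """Return the lines written after the baseline count was taken."""
    with open(path, "r", errors="replace") as f:
        return [line.rstrip("\n")
                for number, line in enumerate(f, start=1)
                if number > baseline]


def missing_patterns(lines: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the regex patterns that match none of the given lines."""
    lines = list(lines)
    return [p for p in patterns
            if not any(re.search(p, line) for line in lines)]
```

A test would call count_lines before the reload, then assert that missing_patterns(lines_after(path, baseline), expected) is empty afterwards. One caveat of this sketch: if the log file is rotated between the two reads, the baseline no longer refers to the same file.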

Also improved cleanup: use try/finally to ensure CPU stress processes are always cleaned up.
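The cleanup guarantee can be sketched as below. This is an illustration only: run and reload_fn are hypothetical callables standing in for the DUT shell helper and config_reload, and the exact stress command is an assumption based on the "16 cores running yes" description.

```python
from typing import Callable


def reload_under_stress(run: Callable[[str], None],
                        reload_fn: Callable[[], None],
                        workers: int = 16) -> None:
    """Start CPU stress, run the reload, and always stop the stress workers."""
    try:
        # Assumed stress command: one background `yes` process per worker.
        run(f"for i in $(seq {workers}); do nohup yes > /dev/null 2>&1 & done")
        reload_fn()
    finally:
        # Executes on success and on any exception raised above;
        # `|| true` tolerates the case where no `yes` process is left.
        run("killall yes || true")
```

Starting the workers inside the try block means the killall also runs if the start command itself fails part-way, and any exception from reload_fn still propagates to the caller after cleanup.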

How did you verify/test it?

Ran on vms70-t1-sn5640-3 (str4-sn5640-3, t1-isolated-d56u1-lag topology, 16 vCPUs, 448 port channels):

Without fix: Test fails consistently with RuntimeError: cannot find marker end-LogAnalyzer-... (verified)
With fix (direct syslog grep, no LogAnalyzer): Passes reliably (1 passed in 14:32)

Intermediate approaches that did NOT work (for context):

killall yes + time.sleep(5) inside the with loganalyzer: block: 5s not enough for syslog buffer to drain
killall yes + 90s polling (send logger messages, check if they appear): syslog UDP buffer stays congested >90 seconds

Any platform specific information?

Tested on Mellanox SN5640. Applies to any platform with enough port channels to generate heavy syslog traffic during config_reload.

Documentation

N/A

mssonicbld · 2026-03-25T03:55:55Z

/azp run

azure-pipelines · 2026-03-25T03:56:01Z

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

linux-foundation-easycla · 2026-03-25T03:56:30Z

The committers listed above are authorized under a signed CLA.

✅ login: Copilot / name: Copilot (f7d3a76)
✅ login: Pterosaur / name: Ze Gan (f7d3a76)

The LogAnalyzer context manager places start/end markers through the syslog UDP socket (/dev/log). On scale setups (e.g. sn5640 with 448 port channels), the config_reload under CPU stress generates such heavy syslog traffic from restarting containers that the host rsyslogd UDP receive buffer overflows for minutes, silently dropping the marker messages and causing a spurious RuntimeError.

Replace the LogAnalyzer usage in test_po_cleanup_after_reload with a direct approach: record the syslog line count before the reload, then grep for the expected port-channel cleanup log patterns afterwards. This is immune to UDP buffer overflow since it reads the syslog file directly.

Also use try/finally to ensure CPU stress processes are cleaned up on any failure path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mssonicbld · 2026-03-25T05:35:46Z

/azp run

github-actions bot requested review from r12f, sdszhang and wangxin March 25, 2026 03:56

Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from 3461843 to f7d3a76 Compare March 25, 2026 05:34

Pterosaur closed this Mar 25, 2026

Pterosaur deleted the zegan/fix_po_cleanup_loganalyzer_marker branch March 25, 2026 05:35

github-actions bot requested a review from saiarcot895 March 25, 2026 05:35

Pterosaur restored the zegan/fix_po_cleanup_loganalyzer_marker branch March 25, 2026 05:35

github-actions bot requested review from ZhaohuiS and yejianquan March 25, 2026 05:35

Pterosaur wants to merge 1 commit into sonic-net:master from Pterosaur:zegan/fix_po_cleanup_loganalyzer_marker
