Skip to content

test_po_cleanup: replace LogAnalyzer with direct syslog grep to fix marker drop#23289

Closed
Pterosaur wants to merge 1 commit intosonic-net:masterfrom
Pterosaur:zegan/fix_po_cleanup_loganalyzer_marker
Closed

test_po_cleanup: replace LogAnalyzer with direct syslog grep to fix marker drop#23289
Pterosaur wants to merge 1 commit intosonic-net:masterfrom
Pterosaur:zegan/fix_po_cleanup_loganalyzer_marker

Conversation

@Pterosaur
Copy link
Copy Markdown
Contributor

Description of PR

Fix test_po_cleanup_after_reload spurious failure caused by LogAnalyzer marker being dropped under heavy syslog load.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

test_po_cleanup_after_reload fails consistently with:

RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.xxx in /var/log/syslog

Root cause: The test creates heavy CPU load (16 cores running yes) and then calls config_reload, which restarts all containers. The LogAnalyzer context manager places start/end markers through the syslog UDP socket (/dev/log). On scale setups (e.g. sn5640 with 448 port channels), the config_reload generates such heavy syslog traffic from restarting containers that the host rsyslogd UDP receive buffer overflows for minutes (verified >90 seconds), silently dropping marker messages and causing a RuntimeError.

This supersedes PR #22776 which attempted to fix this by writing markers directly to /var/log/syslog, but that approach causes log timestamp ordering issues.

How did you do it?

Replace the LogAnalyzer context manager usage in test_po_cleanup_after_reload with a direct approach:

  1. Record the syslog line count (wc -l /var/log/syslog) before config_reload
  2. Run config_reload under CPU stress as before
  3. After reload, grep the syslog tail for the expected port-channel cleanup log patterns

This is immune to UDP buffer overflow since it reads the syslog file directly via tail -n +N.

Also improved cleanup: use try/finally to ensure CPU stress processes are always cleaned up.

How did you verify/test it?

Ran on vms70-t1-sn5640-3 (str4-sn5640-3, t1-isolated-d56u1-lag topology, 16 vCPUs, 448 port channels):

  1. Without fix: Test fails consistently with RuntimeError: cannot find marker end-LogAnalyzer-... (verified)
  2. With fix (direct syslog grep, no LogAnalyzer): Passes reliably (1 passed in 14:32)

Intermediate approaches that did NOT work (for context):

  • killall yes + time.sleep(5) inside the with loganalyzer: block: 5s not enough for syslog buffer to drain
  • killall yes + 90s polling (send logger messages, check if they appear): syslog UDP buffer stays congested >90 seconds

Any platform specific information?

Tested on Mellanox SN5640. Applies to any platform with enough port channels to generate heavy syslog traffic during config_reload.

Documentation

N/A

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla bot commented Mar 25, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Copilot / name: Copilot (f7d3a76)
  • ✅ login: Pterosaur / name: Ze Gan (f7d3a76)

@github-actions github-actions bot requested review from r12f, sdszhang and wangxin March 25, 2026 03:56
The LogAnalyzer context manager places start/end markers through the
syslog UDP socket (/dev/log).  On scale setups (e.g. sn5640 with 448
port channels), the config_reload under CPU stress generates such heavy
syslog traffic from restarting containers that the host rsyslogd UDP
receive buffer overflows for minutes, silently dropping the marker
messages and causing a spurious RuntimeError.

Replace the LogAnalyzer usage in test_po_cleanup_after_reload with a
direct approach: record the syslog line count before the reload, then
grep for the expected port-channel cleanup log patterns afterwards.
This is immune to UDP buffer overflow since it reads the syslog file
directly.

Also use try/finally to ensure CPU stress processes are cleaned up
on any failure path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from 3461843 to f7d3a76 Compare March 25, 2026 05:34
@Pterosaur Pterosaur closed this Mar 25, 2026
@Pterosaur Pterosaur deleted the zegan/fix_po_cleanup_loganalyzer_marker branch March 25, 2026 05:35
@github-actions github-actions bot requested a review from saiarcot895 March 25, 2026 05:35
@Pterosaur Pterosaur restored the zegan/fix_po_cleanup_loganalyzer_marker branch March 25, 2026 05:35
@github-actions github-actions bot requested review from ZhaohuiS and yejianquan March 25, 2026 05:35
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants