test_po_cleanup: fix loganalyzer marker loss under CPU stress#23291
Open
Pterosaur wants to merge 2 commits intosonic-net:masterfrom
Open
test_po_cleanup: fix loganalyzer marker loss under CPU stress#23291Pterosaur wants to merge 2 commits intosonic-net:masterfrom
Pterosaur wants to merge 2 commits intosonic-net:masterfrom
Conversation
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
f7d3a76 to
b2feb6a
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
b2feb6a to
1a016b7
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1a016b7 to
89b0033
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
89b0033 to
9a19155
Compare
Collaborator
|
/azp run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
|
|
9a19155 to
89b0033
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
36c1f96 to
c4c3bc4
Compare
The test_po_cleanup_after_reload test fails with: RuntimeError: cannot find marker end-LogAnalyzer-... in /var/log/syslog Root cause: two issues combine to lose the loganalyzer end marker. 1. The system logrotate timer can fire mid-test while config_reload is generating heavy syslog traffic under CPU stress. When logrotate rotates /var/log/syslog, loganalyzer markers can be pushed beyond syslog.1 into compressed archives (syslog.2.gz) that are not searched by the marker polling logic. 2. Under heavy CPU load (all cores running 'yes'), the host rsyslogd cannot keep up with the syslog traffic and its UDP receive buffer overflows, silently dropping the end marker written via /dev/log. Fix: - Wrap with DisableLogrotateCronContext to stop the logrotate timer and cron job for the duration of the test. - Move 'kill yes' into a try/finally block inside 'with loganalyzer' so CPU stress is removed before loganalyzer __exit__ places the end marker. - After stopping CPU stress, poll /var/log/syslog file size until it stabilizes (two consecutive 10s intervals with no growth), ensuring rsyslogd has drained the socket buffer backlog before the end marker is placed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Ze Gan <ganze718@gmail.com>
c4c3bc4 to
8dbc7bc
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Merged
5 tasks
yxieca
previously approved these changes
Mar 26, 2026
Collaborator
yxieca
left a comment
There was a problem hiding this comment.
AI agent on behalf of Ying. Reviewed; no issues found.
Pterosaur
added a commit
to Pterosaur/sonic-mgmt
that referenced
this pull request
Mar 26, 2026
…PU stress Cherry-pick upstream stop rsyslogd (PR sonic-net#20327 + sonic-net#20969) and add try/finally to ensure CPU stress cleanup + rsyslogd flush before loganalyzer places end marker. Changes: 1. Stop rsyslogd in swss container before config_reload (reduces syslog flood) 2. Use try/finally inside with loganalyzer: to ensure cleanup runs even if config_reload fails 3. Kill CPU stress + flush rsyslogd + sleep 30s before loganalyzer __exit__ places end marker Tested on: vms70-t1-sn5640-3 (str4-sn5640-3, SN5640, 16 CPU, 448 port channels) - PASSED Upstream PR: sonic-net#23291
Pterosaur
added a commit
to Pterosaur/sonic-mgmt
that referenced
this pull request
Mar 26, 2026
…PU stress ### Description Fix `test_po_cleanup_after_reload` failure caused by LogAnalyzer end marker being lost under heavy CPU stress and syslog load. ``` RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.xxx in /var/log/syslog ``` ### Root Cause Two issues combine to lose the loganalyzer end marker: 1. **Logrotate fires mid-test**: The system logrotate timer can trigger during config_reload under CPU stress, rotating `/var/log/syslog`. Markers get pushed beyond `syslog.1` into compressed archives (`syslog.2.gz`) that loganalyzer does not search. 2. **Syslog UDP buffer overflow**: Under heavy CPU load (all cores running `yes`), rsyslogd cannot keep up with syslog traffic and its UDP receive buffer overflows, silently dropping the end marker written via `/dev/log`. ### Fix - Wrap with `DisableLogrotateCronContext` to stop logrotate timer/cron during the test. - Move CPU stress cleanup (`killall yes`) into a `try/finally` block inside `with loganalyzer:` so it runs before `__exit__` places the end marker. - After stopping CPU stress, poll `/var/log/syslog` file size until stable (two consecutive 10s intervals with no growth), ensuring rsyslogd has drained the socket buffer backlog. ### Verification Tested on `vms70-t1-sn5640-3`: **PASSED** (1 passed in 1320s) Upstream PR: sonic-net#23291 ---- #### AI description (iteration 1) #### PR Classification Bug fix in test automation to prevent loganalyzer marker loss under heavy CPU load. #### PR Summary This pull request updates the test for config reload under CPU stress by disabling logrotate and ensuring the loganalyzer markers are preserved during high syslog activity. - In `/tests/pc/test_po_cleanup.py`, a `DisableLogrotateCronContext` is used to prevent log rotation from interfering with marker placement. - The test now includes a try-finally block that kills the CPU stress process and waits for the system log to stabilize before completing. - The imports are updated to include `DisableLogrotateCronContext` and the standard `time` module for the polling mechanism. <!-- GitOpsUserAgent=GitOps.Apps.Server.pullrequestcopilot -->
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
7b8ce8a to
a088184
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Ze Gan <ganze718@gmail.com>
a088184 to
6d05153
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Pterosaur
added a commit
to Azure/sonic-mgmt.msft
that referenced
this pull request
Mar 26, 2026
#1090) ### Description of PR Backport of sonic-net/sonic-mgmt#23291 to msft-202412. Fix `test_po_cleanup_after_reload` failure caused by LogAnalyzer end marker being lost under heavy CPU stress and syslog load. ### Type of change - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [x] Test case improvement ### Approach #### What is the motivation for this PR? `test_po_cleanup_after_reload` fails consistently with: ``` RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.xxx in /var/log/syslog ``` Two issues combine to lose the loganalyzer end marker: 1. **Logrotate fires mid-test**: The system logrotate timer can trigger during `config_reload` under CPU stress, rotating `/var/log/syslog`. Markers get pushed beyond `syslog.1` into compressed archives (`syslog.2.gz`) that loganalyzer does not search. 2. **Syslog UDP buffer overflow**: Under heavy CPU load (all cores running `yes`), rsyslogd cannot keep up with syslog traffic and its UDP receive buffer overflows, silently dropping the end marker written via `/dev/log`. #### How did you do it? Three changes to `test_po_cleanup_after_reload`: 1. **Wrap with `DisableLogrotateCronContext`** — stops the logrotate timer and cron job for the duration of the test, preventing mid-test syslog rotation. 2. **Move CPU stress cleanup into a `try/finally` block inside `with loganalyzer:`** — runs before loganalyzer `__exit__` places the end marker, so rsyslogd is no longer competing for CPU. 3. **Poll `/var/log/syslog` file size until stable** — after stopping CPU stress, wait for two consecutive 10-second intervals with no file growth before allowing the end marker to be placed. This ensures rsyslogd has drained any remaining socket buffer backlog. #### How did you verify/test it? 1. **Without fix:** Test fails consistently with `RuntimeError: cannot find marker end-LogAnalyzer-...` 2. **With fix:** PASSED (1 passed in 1153s / 19:13) #### Any platform specific information? Not platform-specific. ### Documentation N/A Signed-off-by: Ze Gan <ganze718@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of PR
Fix
test_po_cleanup_after_reloadfailure caused by LogAnalyzer end marker being lost under heavy CPU stress and syslog load.Type of change
Back port request
Approach
What is the motivation for this PR?
test_po_cleanup_after_reloadfails consistently with:Two issues combine to lose the loganalyzer end marker:
Logrotate fires mid-test: The system logrotate timer can trigger during
config_reloadunder CPU stress, rotating/var/log/syslog. Markers get pushed beyondsyslog.1into compressed archives (syslog.2.gz) that loganalyzer does not search.Syslog UDP buffer overflow: Under heavy CPU load (all cores running
yes), rsyslogd cannot keep up with syslog traffic and its UDP receive buffer overflows, silently dropping the end marker written via/dev/log.This supersedes PR #22776 which attempted to fix this by writing markers directly to
/var/log/syslog, but that approach causes log timestamp ordering issues.How did you do it?
Three changes to
test_po_cleanup_after_reload:Wrap with
DisableLogrotateCronContext— stops the logrotate timer and cron job for the duration of the test, preventing mid-test syslog rotation.Move CPU stress cleanup into a
try/finallyblock insidewith loganalyzer:—killall yesruns before loganalyzer__exit__places the end marker, so rsyslogd is no longer competing for CPU.Poll
/var/log/syslogfile size until stable — after stopping CPU stress, wait for two consecutive 10-second intervals with no file growth before allowing the end marker to be placed. This ensures rsyslogd has drained any remaining socket buffer backlog.How did you verify/test it?
RuntimeError: cannot find marker end-LogAnalyzer-...Intermediate approaches that did NOT work:
DisableLogrotateCronContextalone: logrotate is only one of two root causes; the syslog buffer overflow also drops the markerkillall yes+ no drain wait: rsyslogd buffer backlog can take minutes to drain after CPU stress endsAny platform specific information?
Not platform-specific. Applies to any setup with enough port channels to generate heavy syslog traffic during config_reload under CPU stress.
Documentation
N/A