Skip to content

test_po_cleanup: fix loganalyzer marker loss under CPU stress#23291

Open
Pterosaur wants to merge 2 commits intosonic-net:masterfrom
Pterosaur:zegan/fix_po_cleanup_loganalyzer_marker
Open

test_po_cleanup: fix loganalyzer marker loss under CPU stress#23291
Pterosaur wants to merge 2 commits intosonic-net:masterfrom
Pterosaur:zegan/fix_po_cleanup_loganalyzer_marker

Conversation

@Pterosaur
Copy link
Copy Markdown
Contributor

@Pterosaur Pterosaur commented Mar 25, 2026

Description of PR

Fix test_po_cleanup_after_reload failure caused by LogAnalyzer end marker being lost under heavy CPU stress and syslog load.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

test_po_cleanup_after_reload fails consistently with:

RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.xxx in /var/log/syslog

Two issues combine to lose the loganalyzer end marker:

  1. Logrotate fires mid-test: The system logrotate timer can trigger during config_reload under CPU stress, rotating /var/log/syslog. Markers get pushed beyond syslog.1 into compressed archives (syslog.2.gz) that loganalyzer does not search.

  2. Syslog UDP buffer overflow: Under heavy CPU load (all cores running yes), rsyslogd cannot keep up with syslog traffic and its UDP receive buffer overflows, silently dropping the end marker written via /dev/log.

This supersedes PR #22776 which attempted to fix this by writing markers directly to /var/log/syslog, but that approach causes log timestamp ordering issues.

How did you do it?

Three changes to test_po_cleanup_after_reload:

  1. Wrap with DisableLogrotateCronContext — stops the logrotate timer and cron job for the duration of the test, preventing mid-test syslog rotation.

  2. Move CPU stress cleanup into a try/finally block inside with loganalyzer:killall yes runs before loganalyzer __exit__ places the end marker, so rsyslogd is no longer competing for CPU.

  3. Poll /var/log/syslog file size until stable — after stopping CPU stress, wait for two consecutive 10-second intervals with no file growth before allowing the end marker to be placed. This ensures rsyslogd has drained any remaining socket buffer backlog.

How did you verify/test it?

  1. Without fix: Test fails consistently with RuntimeError: cannot find marker end-LogAnalyzer-...
  2. With fix: PASSED (1 passed in 1153s / 19:13)

Intermediate approaches that did NOT work:

  • DisableLogrotateCronContext alone: logrotate is only one of two root causes; the syslog buffer overflow also drops the marker
  • killall yes + no drain wait: rsyslogd buffer backlog can take minutes to drain after CPU stress ends

Any platform specific information?

Not platform-specific. Applies to any setup with enough port channels to generate heavy syslog traffic during config_reload under CPU stress.

Documentation

N/A

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from f7d3a76 to b2feb6a Compare March 25, 2026 05:51
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from b2feb6a to 1a016b7 Compare March 25, 2026 05:55
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from 1a016b7 to 89b0033 Compare March 25, 2026 09:16
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur changed the title test_po_cleanup: replace LogAnalyzer with direct syslog grep to fix marker drop test_po_cleanup: fix loganalyzer marker drop under CPU stress Mar 25, 2026
@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from 89b0033 to 9a19155 Compare March 25, 2026 10:15
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla bot commented Mar 25, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Pterosaur / name: Ze Gan (6d05153)

@github-actions github-actions bot requested review from r12f, sdszhang and wangxin March 25, 2026 10:16
@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from 9a19155 to 89b0033 Compare March 25, 2026 10:24
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions github-actions bot requested a review from ZhaohuiS March 25, 2026 13:40
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from 36c1f96 to c4c3bc4 Compare March 25, 2026 13:41
@github-actions github-actions bot requested review from sdszhang and wangxin March 25, 2026 13:42
@Pterosaur Pterosaur changed the title test_po_cleanup: fix loganalyzer marker drop under CPU stress test_po_cleanup: fix loganalyzer marker loss under CPU stress Mar 25, 2026
The test_po_cleanup_after_reload test fails with:
  RuntimeError: cannot find marker end-LogAnalyzer-... in /var/log/syslog

Root cause: two issues combine to lose the loganalyzer end marker.

1. The system logrotate timer can fire mid-test while config_reload is
   generating heavy syslog traffic under CPU stress.  When logrotate
   rotates /var/log/syslog, loganalyzer markers can be pushed beyond
   syslog.1 into compressed archives (syslog.2.gz) that are not
   searched by the marker polling logic.

2. Under heavy CPU load (all cores running 'yes'), the host rsyslogd
   cannot keep up with the syslog traffic and its UDP receive buffer
   overflows, silently dropping the end marker written via /dev/log.

Fix:
- Wrap with DisableLogrotateCronContext to stop the logrotate timer
  and cron job for the duration of the test.
- Move 'kill yes' into a try/finally block inside 'with loganalyzer'
  so CPU stress is removed before loganalyzer __exit__ places the
  end marker.
- After stopping CPU stress, poll /var/log/syslog file size until it
  stabilizes (two consecutive 10s intervals with no growth), ensuring
  rsyslogd has drained the socket buffer backlog before the end marker
  is placed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Ze Gan <ganze718@gmail.com>
@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from c4c3bc4 to 8dbc7bc Compare March 25, 2026 14:14
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

yxieca
yxieca previously approved these changes Mar 26, 2026
Copy link
Copy Markdown
Collaborator

@yxieca yxieca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI agent on behalf of Ying. Reviewed; no issues found.

Pterosaur added a commit to Pterosaur/sonic-mgmt that referenced this pull request Mar 26, 2026
…PU stress

Cherry-pick upstream stop rsyslogd (PR sonic-net#20327 + sonic-net#20969) and add try/finally to ensure CPU stress cleanup + rsyslogd flush before loganalyzer places end marker.

Changes:
1. Stop rsyslogd in swss container before config_reload (reduces syslog flood)
2. Use try/finally inside with loganalyzer: to ensure cleanup runs even if config_reload fails
3. Kill CPU stress + flush rsyslogd + sleep 30s before loganalyzer __exit__ places end marker

Tested on: vms70-t1-sn5640-3 (str4-sn5640-3, SN5640, 16 CPU, 448 port channels) - PASSED

Upstream PR: sonic-net#23291
Pterosaur added a commit to Pterosaur/sonic-mgmt that referenced this pull request Mar 26, 2026
…PU stress

### Description

Fix `test_po_cleanup_after_reload` failure caused by LogAnalyzer end marker being lost under heavy CPU stress and syslog load.

```
RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.xxx in /var/log/syslog
```

### Root Cause

Two issues combine to lose the loganalyzer end marker:

1. **Logrotate fires mid-test**: The system logrotate timer can trigger during config_reload under CPU stress, rotating `/var/log/syslog`. Markers get pushed beyond `syslog.1` into compressed archives (`syslog.2.gz`) that loganalyzer does not search.

2. **Syslog UDP buffer overflow**: Under heavy CPU load (all cores running `yes`), rsyslogd cannot keep up with syslog traffic and its UDP receive buffer overflows, silently dropping the end marker written via `/dev/log`.

### Fix

- Wrap with `DisableLogrotateCronContext` to stop logrotate timer/cron during the test.
- Move CPU stress cleanup (`killall yes`) into a `try/finally` block inside `with loganalyzer:` so it runs before `__exit__` places the end marker.
- After stopping CPU stress, poll `/var/log/syslog` file size until stable (two consecutive 10s intervals with no growth), ensuring rsyslogd has drained the socket buffer backlog.

### Verification

Tested on `vms70-t1-sn5640-3`: **PASSED** (1 passed in 1320s)

Upstream PR: sonic-net#23291

----
#### AI description  (iteration 1)
#### PR Classification
Bug fix in test automation to prevent loganalyzer marker loss under heavy CPU load.

#### PR Summary
This pull request updates the test for config reload under CPU stress by disabling logrotate and ensuring the loganalyzer markers are preserved during high syslog activity.
- In `/tests/pc/test_po_cleanup.py`, a `DisableLogrotateCronContext` is used to prevent log rotation from interfering with marker placement.
- The test now includes a try-finally block that kills the CPU stress process and waits for the system log to stabilize before completing.
- The imports are updated to include `DisableLogrotateCronContext` and the standard `time` module for the polling mechanism.
<!-- GitOpsUserAgent=GitOps.Apps.Server.pullrequestcopilot -->
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from 7b8ce8a to a088184 Compare March 26, 2026 10:13
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ze Gan <ganze718@gmail.com>
@Pterosaur Pterosaur force-pushed the zegan/fix_po_cleanup_loganalyzer_marker branch from a088184 to 6d05153 Compare March 26, 2026 10:16
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Pterosaur added a commit to Azure/sonic-mgmt.msft that referenced this pull request Mar 26, 2026
#1090)

### Description of PR

Backport of sonic-net/sonic-mgmt#23291 to
msft-202412.

Fix `test_po_cleanup_after_reload` failure caused by LogAnalyzer end
marker being lost under heavy CPU stress and syslog load.

### Type of change

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement

### Approach
#### What is the motivation for this PR?

`test_po_cleanup_after_reload` fails consistently with:
```
RuntimeError: cannot find marker end-LogAnalyzer-port_channel_cleanup.xxx in /var/log/syslog
```

Two issues combine to lose the loganalyzer end marker:

1. **Logrotate fires mid-test**: The system logrotate timer can trigger
during `config_reload` under CPU stress, rotating `/var/log/syslog`.
Markers get pushed beyond `syslog.1` into compressed archives
(`syslog.2.gz`) that loganalyzer does not search.

2. **Syslog UDP buffer overflow**: Under heavy CPU load (all cores
running `yes`), rsyslogd cannot keep up with syslog traffic and its UDP
receive buffer overflows, silently dropping the end marker written via
`/dev/log`.

#### How did you do it?

Three changes to `test_po_cleanup_after_reload`:

1. **Wrap with `DisableLogrotateCronContext`** — stops the logrotate
timer and cron job for the duration of the test, preventing mid-test
syslog rotation.

2. **Move CPU stress cleanup into a `try/finally` block inside `with
loganalyzer:`** — runs before loganalyzer `__exit__` places the end
marker, so rsyslogd is no longer competing for CPU.

3. **Poll `/var/log/syslog` file size until stable** — after stopping
CPU stress, wait for two consecutive 10-second intervals with no file
growth before allowing the end marker to be placed. This ensures
rsyslogd has drained any remaining socket buffer backlog.

#### How did you verify/test it?

1. **Without fix:** Test fails consistently with `RuntimeError: cannot
find marker end-LogAnalyzer-...`
2. **With fix:** PASSED (1 passed in 1153s / 19:13)

#### Any platform specific information?

Not platform-specific.

### Documentation

N/A

Signed-off-by: Ze Gan <ganze718@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants