[loganalyzer] Wait for rsyslog readiness before placing markers by 50n1c-rnsft · Pull Request #23294 · sonic-net/sonic-mgmt

50n1c-rnsft · 2026-03-25T06:42:10Z

Description of PR

When config_reload restarts rsyslog, the /dev/log socket may briefly disappear. Markers written via place_marker_to_syslog() during this window are silently lost, causing "cannot find marker end-LogAnalyzer-..." failures in tests such as port_channel_cleanup.

This PR adds a wait_for_rsyslog_ready() method that writes a probe message through /dev/log and polls /var/log/syslog until the probe appears, confirming rsyslog is running and flushing to disk. place_marker_to_syslog() now calls this before writing the actual marker.

Motivation

Observed in CI: after config_reload with CPU stress (nohup yes > /dev/null on all cores), the end marker was lost because rsyslog had not yet recovered when the marker was written. The test waited 120s and timed out.

Approach

Probe-then-write: a lightweight probe message is sent through the same /dev/log → rsyslogd → /var/log/syslog pipeline. Once the probe appears on disk, we know rsyslog is ready.
No ordering change: both probe and marker travel the same rsyslog path, so no new timing issues are introduced relative to container logs (which arrive via RELP or UDP to the same rsyslogd instance).
Graceful degradation: if rsyslog does not recover within the timeout (default 120s), a warning is printed and the marker is written anyway (preserving current behavior).

Files changed

ansible/roles/test/files/tools/loganalyzer/loganalyzer.py
- Added wait_for_rsyslog_ready(timeout=120) method
- Modified place_marker_to_syslog() to call wait_for_rsyslog_ready() before writing

Signed-off-by: 50n1c-rnsft [email protected]

When config_reload restarts rsyslog, the /dev/log socket may briefly disappear. Markers written during this window are silently lost, causing 'cannot find marker end-LogAnalyzer-...' failures. Add wait_for_rsyslog_ready() which writes a probe message via /dev/log and polls /var/log/syslog until the probe appears, confirming rsyslog is running and flushing to disk. Call it from place_marker_to_syslog() before writing the actual marker. This preserves message ordering because both the probe and the marker travel through the same rsyslog pipeline. Signed-off-by: 50n1c-rnsft <[email protected]>

mssonicbld · 2026-03-25T06:42:20Z

/azp run

linux-foundation-easycla · 2026-03-25T06:42:20Z

❌ - login: @50n1c-rnsft / name: 50n1c-rnsft . The commit (c996251) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.

azure-pipelines · 2026-03-25T06:42:32Z

Azure Pipelines successfully started running 1 pipeline(s).

StormLiangMS

@50n1c-rnsft — ⚠️ Good fix for marker loss, but full syslog scan is a concern

Template: ✅ OK
DCO: ✅ signed
CI: ❌ KVM tests failing (t0, t1-lag, t2) — needs investigation

Code Review:

[Important] Full syslog file scan on every probe is O(n) and too slow for production

loganalyzer.py:218-224:

with open(path, 'r') as fp:
    for line in fp:
        if probe in line:
            return True

This reads the entire syslog file line-by-line on every poll iteration. On production systems, /var/log/syslog can be 50-100MB+. With the inner poll loop running every 1 second for up to 10 seconds, and the outer loop retrying for up to 120 seconds, this could read the full syslog file 60+ times.

Suggestion: Use tail or seek to the end of the file before writing the probe, then only scan new lines:

# Before probe: record file position
pos = os.path.getsize(syslog_file) if os.path.exists(syslog_file) else 0
# After probe: only scan from pos onwards
with open(syslog_file, 'r') as fp:
    fp.seek(pos)
    for line in fp:
        if probe in line:
            return True

Or simply use grep:

subprocess.run(['grep', '-q', probe, syslog_file])

[Medium] Also scans syslog.1 — log rotation during probe could cause false timeout

If syslog rotates between probe write and poll, the probe lands in syslog.1. The code handles this by checking both files, which is good. But scanning both files doubles the I/O cost per poll.

[Minor] CI failures need investigation

All 3 KVM test jobs failed. Since this PR modifies loganalyzer.py which is used in setup/teardown of nearly every test, these failures are likely caused by this change (e.g., the 120-second timeout delaying test startup, or the probe mechanism interfering with marker placement). Please check the KVM logs.

Overall: The root cause is real — rsyslog restarts after config_reload can drop markers. The fix approach (probe-and-wait) is correct. But the implementation needs the syslog scan optimization to avoid becoming a bottleneck, and the CI failures need to be resolved.

StormLiangMS

@50n1c-rnsft — Two issues to address:

Full syslog scan is too slow — open(path).read() line-by-line on every 1s poll reads the entire syslog (can be 50-100MB on production). Use fp.seek() to only scan new lines after the probe write, or use tail/grep.
CI failures — All 3 KVM tests (t0, t1-lag, t2) are failing. Since this PR modifies loganalyzer.py which runs in setup/teardown of nearly every test, these failures are very likely caused by this change. Please investigate.

Copilot AI mentioned this pull request Mar 26, 2026

Review PRs opened in the last 48 hours by humans #23330

Draft

StormLiangMS reviewed Mar 27, 2026

View reviewed changes

StormLiangMS requested changes Mar 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[loganalyzer] Wait for rsyslog readiness before placing markers#23294

[loganalyzer] Wait for rsyslog readiness before placing markers#23294
50n1c-rnsft wants to merge 1 commit intosonic-net:masterfrom
50n1c-rnsft:fix/loganalyzer-wait-rsyslog-ready

50n1c-rnsft commented Mar 25, 2026

Uh oh!

mssonicbld commented Mar 25, 2026

Uh oh!

linux-foundation-easycla bot commented Mar 25, 2026

Uh oh!

azure-pipelines bot commented Mar 25, 2026

Uh oh!

StormLiangMS left a comment

Uh oh!

StormLiangMS left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

50n1c-rnsft commented Mar 25, 2026

Description of PR

Motivation

Approach

Files changed

Uh oh!

mssonicbld commented Mar 25, 2026

Uh oh!

linux-foundation-easycla bot commented Mar 25, 2026

Uh oh!

azure-pipelines bot commented Mar 25, 2026

Uh oh!

StormLiangMS left a comment

Choose a reason for hiding this comment

Uh oh!

StormLiangMS left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants