Skip to content

[loganalyzer] Wait for rsyslog readiness before placing markers#23294

Draft
50n1c-rnsft wants to merge 1 commit intosonic-net:masterfrom
50n1c-rnsft:fix/loganalyzer-wait-rsyslog-ready
Draft

[loganalyzer] Wait for rsyslog readiness before placing markers#23294
50n1c-rnsft wants to merge 1 commit intosonic-net:masterfrom
50n1c-rnsft:fix/loganalyzer-wait-rsyslog-ready

Conversation

@50n1c-rnsft
Copy link

Description of PR

When config_reload restarts rsyslog, the /dev/log socket may briefly disappear. Markers written via place_marker_to_syslog() during this window are silently lost, causing "cannot find marker end-LogAnalyzer-..." failures in tests such as port_channel_cleanup.

This PR adds a wait_for_rsyslog_ready() method that writes a probe message through /dev/log and polls /var/log/syslog until the probe appears, confirming rsyslog is running and flushing to disk. place_marker_to_syslog() now calls this before writing the actual marker.

Motivation

Observed in CI: after config_reload with CPU stress (nohup yes > /dev/null on all cores), the end marker was lost because rsyslog had not yet recovered when the marker was written. The test waited 120s and timed out.

Approach

  • Probe-then-write: a lightweight probe message is sent through the same /dev/log → rsyslogd → /var/log/syslog pipeline. Once the probe appears on disk, we know rsyslog is ready.
  • No ordering change: both probe and marker travel the same rsyslog path, so no new timing issues are introduced relative to container logs (which arrive via RELP or UDP to the same rsyslogd instance).
  • Graceful degradation: if rsyslog does not recover within the timeout (default 120s), a warning is printed and the marker is written anyway (preserving current behavior).

Files changed

  • ansible/roles/test/files/tools/loganalyzer/loganalyzer.py
    • Added wait_for_rsyslog_ready(timeout=120) method
    • Modified place_marker_to_syslog() to call wait_for_rsyslog_ready() before writing

Signed-off-by: 50n1c-rnsft [email protected]

When config_reload restarts rsyslog, the /dev/log socket may briefly
disappear. Markers written during this window are silently lost, causing
'cannot find marker end-LogAnalyzer-...' failures.

Add wait_for_rsyslog_ready() which writes a probe message via /dev/log
and polls /var/log/syslog until the probe appears, confirming rsyslog is
running and flushing to disk. Call it from place_marker_to_syslog()
before writing the actual marker.

This preserves message ordering because both the probe and the marker
travel through the same rsyslog pipeline.

Signed-off-by: 50n1c-rnsft <[email protected]>
@mssonicbld
Copy link
Collaborator

/azp run

@linux-foundation-easycla
Copy link

CLA Not Signed

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Collaborator

@StormLiangMS StormLiangMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@50n1c-rnsft⚠️ Good fix for marker loss, but full syslog scan is a concern

Template: ✅ OK
DCO: ✅ signed
CI: ❌ KVM tests failing (t0, t1-lag, t2) — needs investigation

Code Review:

[Important] Full syslog file scan on every probe is O(n) and too slow for production

loganalyzer.py:218-224:

with open(path, 'r') as fp:
    for line in fp:
        if probe in line:
            return True

This reads the entire syslog file line-by-line on every poll iteration. On production systems, /var/log/syslog can be 50-100MB+. With the inner poll loop running every 1 second for up to 10 seconds, and the outer loop retrying for up to 120 seconds, this could read the full syslog file 60+ times.

Suggestion: Use tail or seek to the end of the file before writing the probe, then only scan new lines:

# Before probe: record file position
pos = os.path.getsize(syslog_file) if os.path.exists(syslog_file) else 0
# After probe: only scan from pos onwards
with open(syslog_file, 'r') as fp:
    fp.seek(pos)
    for line in fp:
        if probe in line:
            return True

Or simply use grep:

subprocess.run(['grep', '-q', probe, syslog_file])

[Medium] Also scans syslog.1 — log rotation during probe could cause false timeout

If syslog rotates between probe write and poll, the probe lands in syslog.1. The code handles this by checking both files, which is good. But scanning both files doubles the I/O cost per poll.

[Minor] CI failures need investigation

All 3 KVM test jobs failed. Since this PR modifies loganalyzer.py which is used in setup/teardown of nearly every test, these failures are likely caused by this change (e.g., the 120-second timeout delaying test startup, or the probe mechanism interfering with marker placement). Please check the KVM logs.

Overall: The root cause is real — rsyslog restarts after config_reload can drop markers. The fix approach (probe-and-wait) is correct. But the implementation needs the syslog scan optimization to avoid becoming a bottleneck, and the CI failures need to be resolved.

Copy link
Collaborator

@StormLiangMS StormLiangMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@50n1c-rnsft — Two issues to address:

  1. Full syslog scan is too slowopen(path).read() line-by-line on every 1s poll reads the entire syslog (can be 50-100MB on production). Use fp.seek() to only scan new lines after the probe write, or use tail/grep.

  2. CI failures — All 3 KVM tests (t0, t1-lag, t2) are failing. Since this PR modifies loganalyzer.py which runs in setup/teardown of nearly every test, these failures are very likely caused by this change. Please investigate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants