[loganalyzer] Wait for rsyslog readiness before placing markers #23294
50n1c-rnsft wants to merge 1 commit into sonic-net:master
When config_reload restarts rsyslog, the /dev/log socket may briefly disappear. Markers written during this window are silently lost, causing 'cannot find marker end-LogAnalyzer-...' failures.

Add wait_for_rsyslog_ready(), which writes a probe message via /dev/log and polls /var/log/syslog until the probe appears, confirming rsyslog is running and flushing to disk. Call it from place_marker_to_syslog() before writing the actual marker. This preserves message ordering because both the probe and the marker travel through the same rsyslog pipeline.

Signed-off-by: 50n1c-rnsft <[email protected]>
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
StormLiangMS
left a comment
@50n1c-rnsft —
Template: ✅ OK
DCO: ✅ signed
CI: ❌ KVM tests failing (t0, t1-lag, t2) — needs investigation
Code Review:
[Important] Full syslog file scan on every probe is O(n) and too slow for production
loganalyzer.py:218-224:
    with open(path, 'r') as fp:
        for line in fp:
            if probe in line:
                return True

This reads the entire syslog file line by line on every poll iteration. On production systems, /var/log/syslog can be 50-100MB+. With the inner poll loop running every 1 second for up to 10 seconds, and the outer loop retrying for up to 120 seconds, this could read the full syslog file 60+ times.
Suggestion: Use tail or seek to the end of the file before writing the probe, then only scan new lines:
    # Before probe: record file position
    pos = os.path.getsize(syslog_file) if os.path.exists(syslog_file) else 0
    # After probe: only scan from pos onwards
    with open(syslog_file, 'r') as fp:
        fp.seek(pos)
        for line in fp:
            if probe in line:
                return True

Or simply use grep:

    subprocess.run(['grep', '-q', probe, syslog_file])

[Medium] Also scans syslog.1: log rotation during probe could cause false timeout
If syslog rotates between probe write and poll, the probe lands in syslog.1. The code handles this by checking both files, which is good. But scanning both files doubles the I/O cost per poll.
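One way to combine the two points above (scan only new bytes, and touch syslog.1 only when a rotation actually happened) is sketched below. This is a minimal Python 3 sketch, not the code in this PR: `record_position` and `probe_in_new_lines` are hypothetical helper names, and it assumes the rotated file is an uncompressed rename of the original (for example syslog.1, not syslog.1.gz).

```python
import os


def record_position(path):
    """Return (inode, size) of path, or (None, 0) if it does not exist yet."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None, 0
    return st.st_ino, st.st_size


def probe_in_new_lines(probe, path, rotated_path, start):
    """Look for probe only in data written after `start` was recorded.

    `start` is the (inode, size) pair returned by record_position() just
    before the probe was written. Hypothetical helper, for illustration only.
    """
    inode, offset = start
    cur_inode, cur_size = record_position(path)
    # A different inode or a shrunken file means the log was rotated.
    rotated = cur_inode != inode or cur_size < offset
    if rotated:
        # Tail of the old file (now rotated_path), then the whole new file.
        candidates = [(rotated_path, offset), (path, 0)]
    else:
        candidates = [(path, offset)]
    needle = probe.encode()
    for name, pos in candidates:
        try:
            with open(name, "rb") as fp:
                fp.seek(pos)
                if any(needle in line for line in fp):
                    return True
        except FileNotFoundError:
            continue
    return False
```

In the common case (no rotation) each poll reads only the bytes appended since the probe was written, so the cost no longer depends on the total size of the syslog file.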
[Minor] CI failures need investigation
All 3 KVM test jobs failed. Since this PR modifies loganalyzer.py which is used in setup/teardown of nearly every test, these failures are likely caused by this change (e.g., the 120-second timeout delaying test startup, or the probe mechanism interfering with marker placement). Please check the KVM logs.
Overall: The root cause is real — rsyslog restarts after config_reload can drop markers. The fix approach (probe-and-wait) is correct. But the implementation needs the syslog scan optimization to avoid becoming a bottleneck, and the CI failures need to be resolved.
StormLiangMS
left a comment
@50n1c-rnsft — Two issues to address:
- Full syslog scan is too slow: `open(path).read()` line-by-line on every 1s poll reads the entire syslog (can be 50-100MB on production). Use `fp.seek()` to only scan new lines after the probe write, or use `tail`/`grep`.
- CI failures: All 3 KVM tests (t0, t1-lag, t2) are failing. Since this PR modifies `loganalyzer.py`, which runs in setup/teardown of nearly every test, these failures are very likely caused by this change. Please investigate.
Description of PR
When `config_reload` restarts rsyslog, the `/dev/log` socket may briefly disappear. Markers written via `place_marker_to_syslog()` during this window are silently lost, causing `"cannot find marker end-LogAnalyzer-..."` failures in tests such as `port_channel_cleanup`.

This PR adds a `wait_for_rsyslog_ready()` method that writes a probe message through `/dev/log` and polls `/var/log/syslog` until the probe appears, confirming rsyslog is running and flushing to disk. `place_marker_to_syslog()` now calls this before writing the actual marker.

Motivation

Observed in CI: after `config_reload` with CPU stress (`nohup yes > /dev/null` on all cores), the end marker was lost because rsyslog had not yet recovered when the marker was written. The test waited 120s and timed out.

Approach

The probe and the marker both travel through the same `/dev/log` → rsyslogd → `/var/log/syslog` pipeline. Once the probe appears on disk, we know rsyslog is ready.

Files changed

`ansible/roles/test/files/tools/loganalyzer/loganalyzer.py`
- Adds the `wait_for_rsyslog_ready(timeout=120)` method
- Changes `place_marker_to_syslog()` to call `wait_for_rsyslog_ready()` before writing

Signed-off-by: 50n1c-rnsft [email protected]
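For readers who want the shape of the probe-and-wait loop without opening the diff, here is a rough Python 3 sketch. It is not the code in this PR: `wait_for_probe`, `write_probe`, and `probe_seen` are hypothetical names, and the 10-second attempt window and 1-second poll interval are taken from the review discussion rather than from the diff itself.

```python
import time
import uuid


def wait_for_probe(write_probe, probe_seen, timeout=120.0, attempt=10.0,
                   interval=1.0):
    """Return True once a freshly written probe becomes visible.

    write_probe(text) sends one message into the logging pipeline and may
    raise OSError while the socket is missing. probe_seen(text) reports
    whether the message has reached the log file. Both callables are
    assumptions made for this sketch.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # A new unique probe per attempt, so stale probes never match.
        probe = "rsyslog-probe-{}".format(uuid.uuid4().hex)
        try:
            write_probe(probe)
        except OSError:
            # Socket not back yet: wait briefly and try a new probe.
            time.sleep(interval)
            continue
        attempt_end = min(deadline, time.monotonic() + attempt)
        while True:
            if probe_seen(probe):
                return True
            if time.monotonic() >= attempt_end:
                break  # this probe was probably dropped; send another
            time.sleep(interval)
    return False
```

A caller would pass a function that writes to syslog as `write_probe` and a file check as `probe_seen`, then place the real marker only after this returns True. Because the marker is sent after the probe through the same pipeline, it should not overtake the probe.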