test_po_cleanup: replace LogAnalyzer with direct syslog grep to fix marker drop #23289
Closed
Pterosaur wants to merge 1 commit into sonic-net:master
Collaborator
/azp run
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
The LogAnalyzer context manager places start/end markers through the syslog UDP socket (/dev/log). On scale setups (e.g. sn5640 with 448 port channels), the config_reload under CPU stress generates such heavy syslog traffic from restarting containers that the host rsyslogd UDP receive buffer overflows for minutes, silently dropping the marker messages and causing a spurious RuntimeError.

Replace the LogAnalyzer usage in test_po_cleanup_after_reload with a direct approach: record the syslog line count before the reload, then grep for the expected port-channel cleanup log patterns afterwards. This is immune to UDP buffer overflow since it reads the syslog file directly.

Also use try/finally to ensure CPU stress processes are cleaned up on any failure path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
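The approach in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: it reads a local file instead of running wc -l, tail -n +N and grep on the DUT, and the helper names (count_lines, find_missing_patterns, check_cleanup) and any patterns passed in are hypothetical.

```python
import re


def count_lines(path):
    """Return the current number of lines in the log file (like `wc -l`)."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def find_missing_patterns(path, start_line, patterns):
    """Scan only the lines after `start_line` (like `tail -n +N` with
    N = start_line + 1) and return the patterns that never matched."""
    remaining = {p: re.compile(p) for p in patterns}
    with open(path, "r", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno <= start_line:
                continue  # written before the baseline was taken
            for p in [p for p, rx in remaining.items() if rx.search(line)]:
                del remaining[p]
            if not remaining:
                break  # every expected pattern has been seen
    return sorted(remaining)


def check_cleanup(path, trigger, patterns, cleanup):
    """Record the baseline, run `trigger` (a stand-in for config_reload),
    then scan the new lines. `cleanup` (a stand-in for stopping the CPU
    stress processes) runs on every path because of try/finally."""
    baseline = count_lines(path)
    try:
        trigger()
        return find_missing_patterns(path, baseline, patterns)
    finally:
        cleanup()
```

Because nothing is sent through the syslog socket, no marker message can be lost. One assumption worth stating: the baseline is only valid if the log file is not rotated between the line count and the scan.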
3461843 to f7d3a76
Collaborator
/azp run
Description of PR
Fix test_po_cleanup_after_reload spurious failure caused by LogAnalyzer marker being dropped under heavy syslog load.
Type of change
Back port request
Approach
What is the motivation for this PR?
test_po_cleanup_after_reload fails consistently with a RuntimeError raised by LogAnalyzer.

Root cause: The test creates heavy CPU load (16 cores running yes) and then calls config_reload, which restarts all containers. The LogAnalyzer context manager places start/end markers through the syslog UDP socket (/dev/log). On scale setups (e.g. sn5640 with 448 port channels), config_reload generates such heavy syslog traffic from restarting containers that the host rsyslogd UDP receive buffer overflows for minutes (verified >90 seconds), silently dropping marker messages and causing a RuntimeError.

This supersedes PR #22776, which attempted to fix this by writing markers directly to /var/log/syslog; that approach causes log timestamp ordering issues.
How did you do it?
Replace the LogAnalyzer context manager usage in test_po_cleanup_after_reload with a direct approach:
- Record the syslog line count (wc -l /var/log/syslog) before config_reload
- Afterwards, grep the syslog tail for the expected port-channel cleanup log patterns

This is immune to UDP buffer overflow since it reads the syslog file directly via tail -n +N.

Also improved cleanup: use try/finally to ensure CPU stress processes are always cleaned up.
How did you verify/test it?
Ran on vms70-t1-sn5640-3 (str4-sn5640-3, t1-isolated-d56u1-lag topology, 16 vCPUs, 448 port channels):
- Without this change: RuntimeError: cannot find marker end-LogAnalyzer-... (verified)

Intermediate approaches that did NOT work (for context):
- killall yes + time.sleep(5) inside the with loganalyzer: block: 5s is not enough for the syslog buffer to drain
- killall yes + 90s polling (send logger messages, check if they appear): the syslog UDP buffer stays congested for >90 seconds
Any platform specific information?
Tested on Mellanox SN5640. Applies to any platform with enough port channels to generate heavy syslog traffic during config_reload.
Documentation
N/A