Wait for route programming after stress link flap test #23210
Open
rejithomas-arista wants to merge 1 commit into sonic-net:master from
After link flap stress test, routes are re-advertised but take time to be programmed to ASIC. Without waiting, the next test's sanity check sees routeCheck failing and reboots the DUT. Wait for route_check.py to pass, then poll monit status until it refreshes to OK.
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
### Description of PR
Summary:
Wait for route programming to complete in BGP stress link flap test
teardown to prevent unnecessary DUT reboot by next test's sanity check.
Type of change
Back port request
Approach
What is the motivation for this PR?
The BGP stress link flap test (`test_bgp_stress_link_flap.py`) flaps
DUT interfaces repeatedly to stress BGP reconvergence. In the test's
teardown (the yield fixture's cleanup section), it restores interfaces
and waits for BGP sessions to re-establish (line 431, 600s timeout).
However, BGP session establishment does not mean route programming is
complete. After sessions come up, peers re-advertise routes, and
these routes must be programmed to the ASIC via orchagent -> syncd.
On platforms with synchronous route programming (e.g., VPP where
each route requires an individual API call), this can take several
minutes for hundreds of routes.
The test teardown exits as soon as BGP sessions are up, but routes
are still being programmed. The next test's pre-test sanity check
runs `routeCheck` (via monit), sees a mismatch between APP_DB and
ASIC_DB routes, and triggers a DUT reboot to recover. While the
reboot restores a clean state, it wastes ~5 minutes and marks the
sanity check as failed. The test that caused the stale state should
clean up after itself rather than relying on the framework to reboot.
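To make the failure mode concrete, here is a simplified sketch of what a route mismatch between APP_DB and ASIC_DB amounts to. It is an illustration only, not the logic of the real `route_check.py` (which handles more cases); the function names `route_mismatch` and `in_sync` and the sample prefixes are hypothetical.

```python
from typing import Dict, Iterable, List


def route_mismatch(app_db_routes: Iterable[str],
                   asic_db_routes: Iterable[str]) -> Dict[str, List[str]]:
    """Compare two collections of route prefixes and report the differences.

    Hypothetical helper: a plain set difference over prefix strings.
    """
    app = set(app_db_routes)
    asic = set(asic_db_routes)
    return {
        # Routes the control plane knows about but the ASIC does not have yet.
        "missing_in_asic": sorted(app - asic),
        # Routes still present in the ASIC that APP_DB no longer lists.
        "unexpected_in_asic": sorted(asic - app),
    }


def in_sync(mismatch: Dict[str, List[str]]) -> bool:
    """True when neither side holds a route the other lacks."""
    return not any(mismatch.values())


# Right after BGP sessions come up: routes are advertised, few are programmed.
pending = route_mismatch(
    ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"],
    ["10.0.0.0/24"],
)
print(pending["missing_in_asic"], in_sync(pending))
```

While `missing_in_asic` is non-empty, a sanity check that compares the two databases would report a failure even though every BGP session is established.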
How did you do it?
Added a wait loop in the test's teardown section (after the existing
`check_bgp_session_state` wait at line 431) that:
1. Runs `route_check.py` every 15 seconds, up to 20 attempts (5 min
   window), until it reports no route mismatches
2. Polls `monit status routeCheck` every 30 seconds until monit's cached
   status refreshes to OK. This is needed because monit caches routeCheck
   results and the pre-test sanity check reads monit's status, not
   route_check.py directly
This way the test cleans up after itself rather than leaving stale
route state for the next test to deal with.
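The two-stage wait described above can be sketched roughly as follows. This is not the code from the PR diff: `poll_until`, `routes_in_sync`, `monit_route_check_ok`, `wait_for_route_programming`, and the `run` callable (which stands in for executing a command on the DUT and returning its exit code and output) are hypothetical names. The sketch assumes `route_check.py` exits with 0 when there are no mismatches and that the monit output carries a `status` line reading `OK`. The number of monit polls is not stated in the description, so the value used here is a placeholder.

```python
import time
from typing import Callable, Tuple

# Hypothetical: runs a shell command on the DUT, returns (exit_code, stdout).
RunCmd = Callable[[str], Tuple[int, str]]


def poll_until(check: Callable[[], bool], attempts: int, interval: float,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping `interval` s between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False


def routes_in_sync(run: RunCmd) -> bool:
    # Assumption: exit code 0 means route_check.py found no mismatches.
    exit_code, _ = run("route_check.py")
    return exit_code == 0


def monit_route_check_ok(run: RunCmd) -> bool:
    # Assumption: monit prints a line such as "  status    OK" for the service.
    _, output = run("monit status routeCheck")
    for line in output.splitlines():
        fields = line.split()
        if fields[:1] == ["status"]:
            return fields[1:] == ["OK"]
    return False


def wait_for_route_programming(run: RunCmd,
                               sleep: Callable[[float], None] = time.sleep) -> bool:
    # Stage 1: up to 20 attempts, 15 s apart (roughly a 5 min window).
    if not poll_until(lambda: routes_in_sync(run), attempts=20, interval=15,
                      sleep=sleep):
        return False
    # Stage 2: poll monit every 30 s; the attempt count of 10 is a placeholder.
    return poll_until(lambda: monit_route_check_ok(run), attempts=10,
                      interval=30, sleep=sleep)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; in a test teardown the default `time.sleep` would apply.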
How did you verify/test it?
Ran BGP stress link flap test followed by subsequent tests on VPP
t1-lag topology (32 peers):
- Before the fix: the next test's sanity check reboots the DUT, wasting ~5 min
- After the fix: the next test proceeds normally with no reboot
Any platform specific information?
Most visible on VPP (synchronous route programming) but can affect
any platform under heavy load where route programming lags behind
BGP session establishment.
Supported testbed topology if it's a new test case?
N/A — existing test teardown fix. Tested on t1-lag.
Documentation
N/A