Wait for route programming after stress link flap test by rejithomas-arista · Pull Request #23210 · sonic-net/sonic-mgmt

rejithomas-arista · 2026-03-23T10:51:41Z

### Description of PR

Summary:
Wait for route programming to complete in BGP stress link flap test
teardown to prevent unnecessary DUT reboot by next test's sanity check.

Type of change

Back port request

Approach

What is the motivation for this PR?

The BGP stress link flap test (test_bgp_stress_link_flap.py) flaps
DUT interfaces repeatedly to stress BGP reconvergence. In the test's
teardown (the yield fixture's cleanup section), it restores interfaces
and waits for BGP sessions to re-establish (line 431, 600s timeout).

However, BGP session establishment does not mean route programming is
complete. After sessions come up, peers re-advertise routes, and
these routes must be programmed to the ASIC via orchagent -> syncd.
On platforms with synchronous route programming (e.g., VPP where
each route requires an individual API call), this can take several
minutes for hundreds of routes.

The test teardown exits as soon as BGP sessions are up, but routes
are still being programmed. The next test's pre-test sanity check
runs routeCheck (via monit), sees a mismatch between APP_DB and
ASIC_DB routes, and triggers a DUT reboot to recover. While the
reboot restores a clean state, it wastes ~5 minutes and marks the
sanity check as failed. The test that caused the stale state should
clean up after itself rather than relying on the framework to reboot.
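To make the failure mode concrete, the mismatch can be pictured as a set difference between the routes present in APP_DB and those already programmed in ASIC_DB. The sketch below only illustrates that idea with plain Python sets. It is not the logic of route_check.py, and the function name and the sample prefixes are hypothetical.

```python
def route_mismatch(app_db_routes, asic_db_routes):
    """Compare two collections of route prefixes (illustrative only).

    Returns the prefixes that are in APP_DB but not yet in ASIC_DB,
    and the prefixes that are in ASIC_DB but no longer in APP_DB.
    """
    app = set(app_db_routes)
    asic = set(asic_db_routes)
    return {
        "missing_in_asic": sorted(app - asic),
        "unexpected_in_asic": sorted(asic - app),
    }


# Hypothetical snapshot taken right after BGP sessions come back up:
# both routes are in APP_DB, but only one has reached the ASIC so far.
result = route_mismatch(
    ["10.0.0.0/24", "10.0.1.0/24"],
    ["10.0.0.0/24"],
)
print(result)
```

While programming is still in progress, "missing_in_asic" is non-empty, which is the state the next test's sanity check would observe if teardown returned at this point.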

How did you do it?

Added a wait loop in the test's teardown section (after the existing
check_bgp_session_state wait at line 431) that:

1. Polls route_check.py every 15 seconds, up to 20 attempts (5 min
   window), until it reports no route mismatches.
2. Once route_check passes, polls monit status routeCheck every 30
   seconds until monit's cached status refreshes to OK. This is
   needed because monit caches routeCheck results and the pre-test
   sanity check reads monit's status, not route_check.py directly.

This way the test cleans up after itself rather than leaving stale
route state for the next test to deal with.
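The two-stage wait described above could be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the helper names (poll_until, wait_for_route_programming) are hypothetical, the two check callables stand in for running route_check.py and monit status routeCheck on the DUT, and the attempt cap for the monit stage is an assumption, since the description gives only its 30-second interval.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool], interval: float, attempts: int,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping `interval` seconds
    between tries. Returns True as soon as check() succeeds."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:  # no point sleeping after the last try
            sleep(interval)
    return False


def wait_for_route_programming(route_check_ok: Callable[[], bool],
                               monit_route_check_ok: Callable[[], bool],
                               sleep: Callable[[float], None] = time.sleep,
                               monit_attempts: int = 10) -> bool:
    """Stage 1: wait for the route check to pass.
    Stage 2: wait for monit's cached routeCheck status to catch up."""
    # 15 s interval and 20 attempts come from the PR description.
    if not poll_until(route_check_ok, interval=15, attempts=20, sleep=sleep):
        return False
    # 30 s interval comes from the PR description; monit_attempts is assumed.
    return poll_until(monit_route_check_ok, interval=30,
                      attempts=monit_attempts, sleep=sleep)
```

In a teardown, the first callable would wrap whatever runs route_check.py on the DUT and reports whether it found no mismatches, and the second would wrap the monit status routeCheck query. Passing sleep as a parameter keeps the helper easy to exercise without real delays.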

How did you verify/test it?

Ran BGP stress link flap test followed by subsequent tests on VPP
t1-lag topology (32 peers):

- Without fix: next test's sanity check detects routeCheck failure,
  reboots DUT, wasting ~5 min
- With fix: teardown waits ~2-3 min for routes to converge, next
  test proceeds normally with no reboot

Any platform specific information?

Most visible on VPP (synchronous route programming) but can affect
any platform under heavy load where route programming lags behind
BGP session establishment.

Supported testbed topology if it's a new test case?

N/A — existing test teardown fix. Tested on t1-lag.

Documentation

N/A

After link flap stress test, routes are re-advertised but take time to be programmed to ASIC. Without waiting, the next test's sanity check sees routeCheck failing and reboots the DUT. Wait for route_check.py to pass, then poll monit status until it refreshes to OK.

mssonicbld · 2026-03-23T10:51:48Z

/azp run

azure-pipelines · 2026-03-23T10:52:01Z

Azure Pipelines successfully started running 1 pipeline(s).

github-actions bot requested review from cyw233, lolyu and sanjair-git March 23, 2026 10:52

Copilot AI mentioned this pull request Mar 24, 2026

Review: human-authored PRs opened in past 48 hours (24 PRs) #23231

Draft

rejithomas-arista wants to merge 1 commit into sonic-net:master from rejithomas-arista:master-fix-bgp-stress


2 participants
