Wait for route programming after stress link flap test#23210

Open
rejithomas-arista wants to merge 1 commit into sonic-net:master from rejithomas-arista:master-fix-bgp-stress

@rejithomas-arista

### Description of PR

Summary:
Wait for route programming to complete in the BGP stress link flap
test's teardown, so that the next test's sanity check does not trigger
an unnecessary DUT reboot.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

The BGP stress link flap test (test_bgp_stress_link_flap.py) flaps
DUT interfaces repeatedly to stress BGP reconvergence. In the test's
teardown (the yield fixture's cleanup section), it restores interfaces
and waits for BGP sessions to re-establish (line 431, 600s timeout).

However, BGP session establishment does not mean route programming is
complete. After sessions come up, peers re-advertise routes, and
these routes must be programmed to the ASIC via orchagent -> syncd.
On platforms with synchronous route programming (e.g., VPP where
each route requires an individual API call), this can take several
minutes for hundreds of routes.

The test teardown exits as soon as BGP sessions are up, but routes
are still being programmed. The next test's pre-test sanity check
runs routeCheck (via monit), sees a mismatch between APP_DB and
ASIC_DB routes, and triggers a DUT reboot to recover. While the
reboot restores a clean state, it wastes ~5 minutes and marks the
sanity check as failed. The test that caused the stale state should
clean up after itself rather than relying on the framework to reboot.
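
To make the failure mode concrete, the mismatch that routeCheck reports can be thought of as a set difference between the routes in APP_DB and those in ASIC_DB. The sketch below is only an illustration of that idea: `route_mismatch` and the sample prefixes are hypothetical, and the real route_check.py likely does more than this sketch shows.

```python
def route_mismatch(app_db_routes, asic_db_routes):
    """Return routes present in one database but not the other.

    Hypothetical helper for illustration; not part of sonic-mgmt.
    """
    app = set(app_db_routes)
    asic = set(asic_db_routes)
    return {
        # Advertised and accepted, but not yet programmed to the ASIC.
        "missing_in_asic": sorted(app - asic),
        # Programmed to the ASIC, but no longer present in APP_DB.
        "unexpected_in_asic": sorted(asic - app),
    }


# Illustrative prefixes only: two routes are still waiting to be programmed.
pending = route_mismatch(
    ["10.0.0.0/24", "10.0.1.0/24", "192.168.0.0/24"],
    ["10.0.0.0/24"],
)
```

While `missing_in_asic` is non-empty, a sanity check that runs at that moment would see the mismatch described above, even though every BGP session is already established.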

How did you do it?

Added a wait loop in the test's teardown section (after the existing
check_bgp_session_state wait at line 431) that:

  1. Polls route_check.py every 15 seconds, up to 20 attempts (5 min
    window), until it reports no route mismatches
  2. Once route_check passes, polls monit status routeCheck every 30
    seconds until monit's cached status refreshes to OK — this is
    needed because monit caches routeCheck results and the pre-test
    sanity check reads monit's status, not route_check.py directly

This way the test cleans up after itself rather than leaving stale
route state for the next test to deal with.
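
The two-phase wait described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual diff: `poll_until`, `wait_for_route_programming`, and the two check callables are hypothetical names, and the bound on the monit phase (`monit_attempts`) is an assumption, since the description does not state one. In the real teardown, the callables would presumably run route_check.py and `monit status routeCheck` on the DUT.

```python
import time


def poll_until(check, interval, attempts, sleep=time.sleep):
    """Call check() up to `attempts` times, `interval` seconds apart.

    Returns True as soon as check() is truthy, False if every attempt fails.
    """
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(interval)
    return False


def wait_for_route_programming(route_check_ok, monit_route_check_ok,
                               sleep=time.sleep, monit_attempts=10):
    """Hypothetical two-phase wait mirroring the steps listed above."""
    # Phase 1: poll the route check every 15 s, up to 20 attempts
    # (about a 5 min window), until no mismatches are reported.
    if not poll_until(route_check_ok, interval=15, attempts=20, sleep=sleep):
        return False
    # Phase 2: monit caches the routeCheck result, so keep polling every
    # 30 s until its status refreshes to OK. The attempt bound here is an
    # assumption made for this sketch.
    return poll_until(monit_route_check_ok, interval=30,
                      attempts=monit_attempts, sleep=sleep)
```

Passing the checks in as callables keeps the polling logic independent of how the commands are executed on the DUT, and the injectable `sleep` makes the loop testable without real delays.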

How did you verify/test it?

Ran BGP stress link flap test followed by subsequent tests on VPP
t1-lag topology (32 peers):

  • Without fix: next test's sanity check detects routeCheck failure,
    reboots DUT, wasting ~5 min
  • With fix: teardown waits ~2-3 min for routes to converge, next
    test proceeds normally with no reboot

Any platform specific information?

Most visible on VPP (synchronous route programming) but can affect
any platform under heavy load where route programming lags behind
BGP session establishment.

Supported testbed topology if it's a new test case?

N/A — existing test teardown fix. Tested on t1-lag.

Documentation

N/A

After the link flap stress test, routes are re-advertised but take time to be
programmed to the ASIC. Without waiting, the next test's sanity check sees
routeCheck failing and reboots the DUT. Wait for route_check.py to pass,
then poll monit status until it refreshes to OK.
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
