[action] [PR:22895] Fix test_add_rack flakiness by waiting for BGP convergence before DB comparison #22938

Open
mssonicbld wants to merge 1 commit into sonic-net:202511 from mssonicbld:cherry/202511/22895

@mssonicbld
Collaborator

Description

`test_add_rack` was failing ~1% of PR and baseline runs with:
```
AssertionError: DB compare failed after adding T0 via generic patch updater
```

Root Cause

In `generic_patch_add_t0()`, the DB comparison (config-db, app-db, state-db) ran before BGP sessions were established. After applying a JSON patch to add T0 config (BGP_NEIGHBOR, INTERFACE, PORT, etc.), BGP peers need time to converge and populate app-db route entries. The DB comparison would keep failing until its 5-minute retry window expired because app-db routes had not settled.

The BGP session check was placed after the DB comparison and used a bare `assert` with no retry, so it never had a chance to gate the comparison.

Evidence (Kusto, last 30 days on master baseline+PR)

  • 23 failures, ~90% with `DB compare failed after adding T0 via generic patch updater`
  • Failures span random, unrelated PRs (sonic-buildimage, sonic-mgmt), confirming the test is flaky rather than hitting a regression
  • Fail rate: ~0.6-1.0% on PRTest, ~0.6-2.1% on BaselineTest

Changes

  • **`tests/common/configlet/utils.py`**: Add an `is_bgp_session_established()` helper that returns `True/False` (compatible with `wait_until` retry), with logging
  • **`tests/configlet/util/generic_patch.py`**: Move the BGP convergence check before the DB comparison in `generic_patch_add_t0()`, using `wait_until` retry (up to 5 min). This ensures app-db routes are populated before comparing against the baseline.
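
The reason the helper returns a boolean instead of asserting can be illustrated with a small, self-contained sketch. This is not the code from the PR: `wait_until` below is a local stand-in for the polling helper the PR mentions (its argument order of timeout, interval, delay, condition is an assumption), and `FakeDut` with its `bgp_states()` method is a hypothetical stub standing in for a real device.

```python
import logging
import time

logger = logging.getLogger(__name__)


def wait_until(timeout, interval, delay, condition, *args, **kwargs):
    """Local stand-in for a polling helper (assumed signature).

    Calls `condition` until it returns a truthy value or `timeout`
    seconds elapse, and returns True or False accordingly.
    """
    if delay > 0:
        time.sleep(delay)
    deadline = time.monotonic() + timeout
    while True:
        if condition(*args, **kwargs):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeDut:
    """Hypothetical device stub: one peer stays Active for N polls."""

    def __init__(self, polls_until_up):
        self.polls_left = polls_until_up

    def bgp_states(self):
        if self.polls_left > 0:
            self.polls_left -= 1
            return {"10.0.0.1": "Active", "10.0.0.3": "Established"}
        return {"10.0.0.1": "Established", "10.0.0.3": "Established"}


def is_bgp_session_established(dut):
    """Return True/False rather than asserting, so a poller can retry it."""
    pending = {
        peer: state
        for peer, state in dut.bgp_states().items()
        if state != "Established"
    }
    if pending:
        logger.info("BGP not converged yet: %s", pending)
        return False
    return True
```

With this shape, a call such as `wait_until(300, 10, 0, is_bgp_session_established, dut)` keeps polling for up to 5 minutes, whereas a bare `assert` inside the check would abort on the first unconverged poll.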

Before (broken ordering)

```
Apply patch → 60s pause → DB comparison (5min retry) ❌ → BGP check (no retry) ❌
```

After (fixed ordering)

```
Apply patch → 60s pause → Wait BGP Established (5min retry) ✅ → DB comparison (5min retry) ✅
```
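
The fixed ordering can also be read as a gate: the DB comparison is only attempted once the BGP wait has succeeded. The sketch below is an illustration under stated assumptions, not the PR's implementation. `run_add_t0_checks` and the three callables passed to it are hypothetical names, the BGP failure message is invented for the example, and the 60s pause is omitted for brevity.

```python
def run_add_t0_checks(apply_patch, wait_for_bgp, compare_dbs):
    """Run the post-patch checks in the fixed order.

    All three arguments are hypothetical callables; `wait_for_bgp` and
    `compare_dbs` are expected to return True on success.
    """
    apply_patch()
    # Gate: do not compare DBs until BGP sessions are Established,
    # otherwise app-db route entries may still be missing.
    if not wait_for_bgp():
        raise AssertionError("BGP sessions did not reach Established")
    if not compare_dbs():
        raise AssertionError(
            "DB compare failed after adding T0 via generic patch updater"
        )


# Record the order in which the stubbed steps run.
calls = []


def fake_apply_patch():
    calls.append("apply_patch")


def fake_wait_for_bgp():
    calls.append("wait_for_bgp")
    return True


def fake_compare_dbs():
    calls.append("compare_dbs")
    return True


run_add_t0_checks(fake_apply_patch, fake_wait_for_bgp, fake_compare_dbs)
```

One side effect of gating this way is that a convergence problem surfaces with its own message instead of being reported as a DB mismatch.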

…comparison (sonic-net#22895)

* Fix test_add_rack flakiness by waiting for BGP convergence before DB comparison

The test_add_rack test was failing ~1% of runs with 'DB compare failed
after adding T0 via generic patch updater'. Root cause: DB comparison
ran before BGP sessions established, causing app-db route entry
mismatches.

Changes:
- Add is_bgp_session_established() helper that returns bool for use
  with wait_until retry mechanism
- Move BGP session convergence check BEFORE DB comparison in
  generic_patch_add_t0(), so app-db routes are populated before
  comparing against baseline
- BGP check now retries with wait_until instead of bare assert

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Storm Liang <[email protected]>

* Remove unused chk_bgp_session import to fix flake8 F401

Signed-off-by: Storm Liang <[email protected]>

---------

Signed-off-by: Storm Liang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: mssonicbld <[email protected]>
@mssonicbld
Collaborator Author

Original PR: #22895

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Collaborator

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@radha-danda

/azpw run

@mssonicbld
Collaborator Author

⚠️ Notice: /azpw run only runs failed jobs now. If you want to trigger a whole pipeline run, please rebase your branch or close and reopen the PR.
💡 Tip: You can also use /azpw retry to retry failed jobs directly.

Retrying failed (or canceled) jobs...

@mssonicbld
Collaborator Author

Build not found. Please close and reopen the PR or rebase your branch to trigger a new build.

@radha-danda reopened this Mar 27, 2026
@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@radha-danda

/azpw run

@mssonicbld
Collaborator Author

⚠️ Notice: /azpw run only runs failed jobs now. If you want to trigger a whole pipeline run, please rebase your branch or close and reopen the PR.
💡 Tip: You can also use /azpw retry to retry failed jobs directly.

Retrying failed (or canceled) jobs...

@mssonicbld
Collaborator Author

Retrying failed (or canceled) stages in build 1072424:

✅ Stage Test:

  • Job impacted-area-kvmtest-t2 by Elastictest: retried.
  • Job impacted-area-kvmtest-t1-lag by Elastictest: retried.
  • Job impacted-area-kvmtest-t0 by Elastictest: retried.
