Skip to content

Fix test_add_rack flakiness by waiting for BGP convergence before DB comparison#22895

Merged
StormLiangMS merged 2 commits intosonic-net:masterfrom
StormLiangMS:fix/test-add-rack-bgp-race
Mar 13, 2026
Merged

Fix test_add_rack flakiness by waiting for BGP convergence before DB comparison#22895
StormLiangMS merged 2 commits intosonic-net:masterfrom
StormLiangMS:fix/test-add-rack-bgp-race

Conversation

@StormLiangMS
Copy link
Collaborator

Description

\ est_add_rack\ was failing ~1% of PR and baseline runs with:
\
AssertionError: DB compare failed after adding T0 via generic patch updater
\\

Root Cause

In \generic_patch_add_t0(), the DB comparison (config-db, app-db, state-db) ran before BGP sessions established. After applying a JSON patch to add T0 config (BGP_NEIGHBOR, INTERFACE, PORT, etc.), BGP peers need time to converge and populate app-db route entries. The DB comparison would repeatedly fail its 5-minute retry window because app-db routes hadn't settled.

The BGP session check was placed after the DB comparison and used a bare \�ssert\ with no retry — so it never had a chance to gate the comparison.

Evidence (Kusto, last 30 days on master baseline+PR)

  • 23 failures, ~90% with \DB compare failed after adding T0 via generic patch updater\
  • Failures span random unrelated PRs (sonic-buildimage, sonic-mgmt) confirming it is flaky, not a regression
  • Fail rate: ~0.6-1.0% on PRTest, ~0.6-2.1% on BaselineTest

Changes

  • **\ ests/common/configlet/utils.py**: Add \is_bgp_session_established()\ helper that returns \True/False\ (compatible with \wait_until\ retry), with logging
  • **\ ests/configlet/util/generic_patch.py**: Move BGP convergence check before DB comparison in \generic_patch_add_t0(), using \wait_until\ retry (up to 5 min). This ensures app-db routes are populated before comparing against baseline.

Before (broken ordering)

\
Apply patch → 60s pause → DB comparison (5min retry) ❌ → BGP check (no retry) ❌
\\

After (fixed ordering)

\
Apply patch → 60s pause → Wait BGP Established (5min retry) ✅ → DB comparison (5min retry) ✅
\\

…comparison

The test_add_rack test was failing ~1% of runs with 'DB compare failed
after adding T0 via generic patch updater'. Root cause: DB comparison
ran before BGP sessions established, causing app-db route entry
mismatches.

Changes:
- Add is_bgp_session_established() helper that returns bool for use
  with wait_until retry mechanism
- Move BGP session convergence check BEFORE DB comparison in
  generic_patch_add_t0(), so app-db routes are populated before
  comparing against baseline
- BGP check now retries with wait_until instead of bare assert

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Storm Liang <[email protected]>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Copy link
Collaborator Author

hi @xincunli-sonic @vaibhavhd could you help to review? This is to improve the flakyness of this case.

@StormLiangMS
Copy link
Collaborator Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ZhaohuiS
Copy link
Contributor

@StormLiangMS the format of the description is not good enough, you can ask your agency to use this .github/PULL_REQUEST_TEMPLATE.md to generate PR.

@StormLiangMS StormLiangMS merged commit 560b4f7 into sonic-net:master Mar 13, 2026
16 checks passed
@StormLiangMS StormLiangMS added Request for 202511 branch Request to backport a change to 202511 branch Approved for 202511 branch labels Mar 13, 2026
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 13, 2026
…comparison (sonic-net#22895)

* Fix test_add_rack flakiness by waiting for BGP convergence before DB comparison

The test_add_rack test was failing ~1% of runs with 'DB compare failed
after adding T0 via generic patch updater'. Root cause: DB comparison
ran before BGP sessions established, causing app-db route entry
mismatches.

Changes:
- Add is_bgp_session_established() helper that returns bool for use
  with wait_until retry mechanism
- Move BGP session convergence check BEFORE DB comparison in
  generic_patch_add_t0(), so app-db routes are populated before
  comparing against baseline
- BGP check now retries with wait_until instead of bare assert

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Storm Liang <[email protected]>

* Remove unused chk_bgp_session import to fix flake8 F401

Signed-off-by: Storm Liang <[email protected]>

---------

Signed-off-by: Storm Liang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: mssonicbld <[email protected]>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202511: #22938

selldinesh pushed a commit to selldinesh/sonic-mgmt that referenced this pull request Mar 16, 2026
…comparison (sonic-net#22895)

* Fix test_add_rack flakiness by waiting for BGP convergence before DB comparison

The test_add_rack test was failing ~1% of runs with 'DB compare failed
after adding T0 via generic patch updater'. Root cause: DB comparison
ran before BGP sessions established, causing app-db route entry
mismatches.

Changes:
- Add is_bgp_session_established() helper that returns bool for use
  with wait_until retry mechanism
- Move BGP session convergence check BEFORE DB comparison in
  generic_patch_add_t0(), so app-db routes are populated before
  comparing against baseline
- BGP check now retries with wait_until instead of bare assert

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Storm Liang <[email protected]>

* Remove unused chk_bgp_session import to fix flake8 F401

Signed-off-by: Storm Liang <[email protected]>

---------

Signed-off-by: Storm Liang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: selldinesh <[email protected]>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Mar 17, 2026
…comparison (sonic-net#22895)

* Fix test_add_rack flakiness by waiting for BGP convergence before DB comparison

The test_add_rack test was failing ~1% of runs with 'DB compare failed
after adding T0 via generic patch updater'. Root cause: DB comparison
ran before BGP sessions established, causing app-db route entry
mismatches.

Changes:
- Add is_bgp_session_established() helper that returns bool for use
  with wait_until retry mechanism
- Move BGP session convergence check BEFORE DB comparison in
  generic_patch_add_t0(), so app-db routes are populated before
  comparing against baseline
- BGP check now retries with wait_until instead of bare assert

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Storm Liang <[email protected]>

* Remove unused chk_bgp_session import to fix flake8 F401

Signed-off-by: Storm Liang <[email protected]>

---------

Signed-off-by: Storm Liang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Abhishek <[email protected]>
vrajeshe pushed a commit to vrajeshe/sonic-mgmt that referenced this pull request Mar 23, 2026
…comparison (sonic-net#22895)

* Fix test_add_rack flakiness by waiting for BGP convergence before DB comparison

The test_add_rack test was failing ~1% of runs with 'DB compare failed
after adding T0 via generic patch updater'. Root cause: DB comparison
ran before BGP sessions established, causing app-db route entry
mismatches.

Changes:
- Add is_bgp_session_established() helper that returns bool for use
  with wait_until retry mechanism
- Move BGP session convergence check BEFORE DB comparison in
  generic_patch_add_t0(), so app-db routes are populated before
  comparing against baseline
- BGP check now retries with wait_until instead of bare assert

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Storm Liang <[email protected]>

* Remove unused chk_bgp_session import to fix flake8 F401

Signed-off-by: Storm Liang <[email protected]>

---------

Signed-off-by: Storm Liang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Venkata Gouri Rajesh Etla <[email protected]>
@weiguo-nvidia
Copy link
Contributor

weiguo-nvidia commented Mar 23, 2026

Hi @StormLiangMS ,

This PR unconditionally checks both IPv4 and IPv6 BGP sessions in generic_patch_add_t0():

assert wait_until(DB_COMP_WAIT_TIME, 20, 0, is_bgp_session_established,
                  duthost, tor_data["ip"]["remote"]), \
    "BGP IPv4 session for {} not established before DB comparison".format(tor_data["ip"]["remote"])
assert wait_until(DB_COMP_WAIT_TIME, 20, 0, is_bgp_session_established,
                  duthost, tor_data["ipv6"]["remote"].lower()), \
    "BGP IPv6 session for {} not established before DB comparison".format(tor_data["ipv6"]["remote"].lower())

On IPv6-only topologies (e.g. t1-isolated-v6-d56u1-lag), tor_data["ip"]["remote"] is an empty string because there is no IPv4 BGP neighbor. This causes is_bgp_session_established(duthost, "") to fail for the entire 5-minute retry window, resulting in:

AssertionError: BGP IPv4 session for  not established before DB comparison

I submit PR #23241 to fix the issue

weiguo-nvidia added a commit to weiguo-nvidia/sonic-mgmt that referenced this pull request Mar 24, 2026
PR sonic-net#22895 added BGP session convergence wait before DB comparison in
generic_patch_add_t0(), but unconditionally checks both IPv4 and IPv6
BGP sessions. On IPv6-only topologies (e.g. t1-isolated-v6-d56u1-lag),
tor_data["ip"]["remote"] is empty, causing is_bgp_session_established()
to fail.

Fix by checking whether each neighbor IP exists before waiting for the
BGP session, consistent with the chk_any_bgp_session() approach from
PR sonic-net#21591.

Change-Id: I694fd17b7b49b4ce4ece53e0521773b71f301b70
Signed-off-by: weiguo-nvidia <[email protected]>
yxieca pushed a commit that referenced this pull request Mar 26, 2026
…#23241)

What is the motivation for this PR?\nPR #22895 added BGP session convergence wait before DB comparison in generic_patch_add_t0(), but unconditionally checks both IPv4 and IPv6 BGP sessions. On IPv6-only topologies (e.g. t1-isolated-v6-d56u1-lag), tor_data["ip"]["remote"] is empty, causing is_bgp_session_established() to fail.\n\nHow did you do it?\nFix by checking whether each neighbor IP exists before waiting for the BGP session, consistent with the chk_any_bgp_session() approach from PR #21591.\n\nHow did you verify/test it?\nRegression test pass\n\nSigned-off-by:\nweiguo-nvidia <[email protected]>
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 27, 2026
…comparison (sonic-net#22895)

* Fix test_add_rack flakiness by waiting for BGP convergence before DB comparison

The test_add_rack test was failing ~1% of runs with 'DB compare failed
after adding T0 via generic patch updater'. Root cause: DB comparison
ran before BGP sessions established, causing app-db route entry
mismatches.

Changes:
- Add is_bgp_session_established() helper that returns bool for use
  with wait_until retry mechanism
- Move BGP session convergence check BEFORE DB comparison in
  generic_patch_add_t0(), so app-db routes are populated before
  comparing against baseline
- BGP check now retries with wait_until instead of bare assert

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Storm Liang <[email protected]>

* Remove unused chk_bgp_session import to fix flake8 F401

Signed-off-by: Storm Liang <[email protected]>

---------

Signed-off-by: Storm Liang <[email protected]>
Co-authored-by: Copilot <[email protected]>
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 27, 2026
…sonic-net#23241)

What is the motivation for this PR?\nPR sonic-net#22895 added BGP session convergence wait before DB comparison in generic_patch_add_t0(), but unconditionally checks both IPv4 and IPv6 BGP sessions. On IPv6-only topologies (e.g. t1-isolated-v6-d56u1-lag), tor_data["ip"]["remote"] is empty, causing is_bgp_session_established() to fail.\n\nHow did you do it?\nFix by checking whether each neighbor IP exists before waiting for the BGP session, consistent with the chk_any_bgp_session() approach from PR sonic-net#21591.\n\nHow did you verify/test it?\nRegression test pass\n\nSigned-off-by:\nweiguo-nvidia <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants