[action] [PR:21939] Fix/nonlinear high nexthop dataplane downtime #975

Merged
mssonicbld merged 1 commit into Azure:202412 from mssonicbld:cherry/msft-202412/21939
Jan 23, 2026

Conversation

@mssonicbld
Collaborator

### Description of PR

Summary:
Fixes # (issue)
This PR fixes **excessively high dataplane downtime attributed to nexthop behavior** in the high‑BGP test scenarios.

Nexthop handling in the test logic caused downtime measurements to stay high and inconsistent. This PR corrects nexthop‑related announcement and verification so that:

- Traffic is always tested towards valid, expected nexthops,
- Stale or mis‑mapped nexthops no longer inflate the observed downtime,
- Downtime better reflects the actual behavior.

The fix introduced in [PR #20842](sonic-net/sonic-mgmt#20842) now also resolves the recently found issue where a failed `nexthop_group_member_scale` run pollutes the test environment for subsequent re-runs of the entire testbed.
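
As a rough illustration of the verification change described above, here is a minimal sketch (hypothetical function and variable names, not the actual sonic-mgmt helpers) of restricting the traffic test to nexthops that are still expected, so stale or mis-mapped entries cannot inflate the measured downtime:

```python
def select_valid_nexthops(route_to_nexthop, expected_nexthops):
    """Keep only prefixes whose nexthop is still in the expected set.

    Hypothetical helper: prefixes pointing at stale or mis-mapped nexthops
    are excluded from the traffic test so they do not count as downtime.
    """
    return {
        prefix: nexthop
        for prefix, nexthop in route_to_nexthop.items()
        if nexthop in expected_nexthops
    }


# Example: the stale nexthop 10.0.0.99 is dropped before traffic is sent.
routes = {"192.168.0.0/25": "10.0.0.1", "192.168.0.128/25": "10.0.0.99"}
print(select_valid_nexthops(routes, {"10.0.0.1", "10.0.0.2"}))
# {'192.168.0.0/25': '10.0.0.1'}
```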

Dependency:

- Depends on the fixes introduced in:
  - [PR #21936](sonic-net/sonic-mgmt#21936)

### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
- Measured dataplane downtime remained unexpectedly high when:
  - The number of nexthops increased,
  - The test exercised different nexthop sets or ECMP groups.
- Downtime spikes appeared that did not match the BGP session and route programming timelines.

#### How did you do it?
- Creates a fresh, clean PTF dataplane environment for the nexthop group member scale test, similar to [PR #21936](sonic-net/sonic-mgmt#21936)
- Uses bulk re-announcement of the starting route state, as per the fix introduced by [PR #20842](sonic-net/sonic-mgmt#20842); a sketch follows this list
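
A minimal sketch of how such a setup might look, assuming the `ptfadapter` fixture from sonic-mgmt and a hypothetical `announce_routes_in_bulk` helper standing in for the bulk re-announcement from PR #20842 (this is not the exact code in this PR):

```python
import pytest


@pytest.fixture
def clean_dataplane_with_baseline_routes(ptfadapter, announce_routes_in_bulk, baseline_routes):
    """Start the scale test from a known-good state and restore it afterwards.

    `announce_routes_in_bulk` and `baseline_routes` are hypothetical fixtures
    used here only to illustrate the approach.
    """
    # Flush packets left over from a previous (possibly failed) run so the
    # PTF dataplane starts clean, similar to the approach in PR #21936.
    ptfadapter.dataplane.flush()
    # Re-announce the full starting route set in one bulk operation so the
    # DUT converges back to the expected baseline before traffic is sent.
    announce_routes_in_bulk(baseline_routes)
    yield
    # Restore the baseline on teardown as well, so a failed run does not
    # pollute the environment for later re-runs of the testbed.
    announce_routes_in_bulk(baseline_routes)
```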

#### How did you verify/test it?
- Ran the high‑BGP convergence, flap, and nexthop group member scale tests end‑to‑end with the nexthop fixes applied on:
  - Topology: `t0-isolated-d2u510s2`
  - Platform: Broadcom Arista-7060X6-64PE-B-C512S2

- Verified that the dataplane downtime does not exceed the expected `MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE` of 30 seconds (a sketch of this check follows the results below).

Dataplane downtime results before the fix: 63 seconds > `MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE`

Dataplane downtime results with the fix:
- Shutdown phase: 0.11 seconds, as expected
- Startup phase: 0.14 seconds, as expected
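
For context, a minimal sketch of the pass criterion, using the threshold name from this PR and the measured values above (the helper function name is hypothetical):

```python
MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 30  # seconds


def check_dataplane_downtime(phase, downtime_sec):
    """Fail if the measured dataplane downtime exceeds the allowed limit."""
    assert downtime_sec <= MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE, (
        "{} phase downtime {:.2f}s exceeds the {}s limit".format(
            phase, downtime_sec, MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE
        )
    )


check_dataplane_downtime("Shutdown", 0.11)
check_dataplane_downtime("Startup", 0.14)
```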

This change also fixes the recently found issue where a failed nexthop group member scale run pollutes the FIB on the switch for future re-runs of the testbed.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
@mssonicbld
Collaborator Author

Original PR: sonic-net/sonic-mgmt#21939

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld merged commit 2bb2a53 into Azure:202412 on Jan 23, 2026
11 of 14 checks passed
