[action] [PR:21939] Fix/nonlinear high nexthop dataplane downtime #975

Merged
mssonicbld merged 1 commit into Azure:202412 from mssonicbld:cherry/msft-202412/21939
Jan 23, 2026

Conversation

@mssonicbld
Collaborator

### Description of PR

Summary:
Fixes # (issue)
This PR fixes **excessively high dataplane downtime attributed to nexthop behavior** in the high‑BGP test scenarios.

Nexthop handling in the test logic caused downtime measurements to stay high and inconsistent. This PR corrects nexthop‑related announcement and verification so that:

- Traffic is always tested towards valid, expected nexthops,
- Stale or mis‑mapped nexthops no longer inflate the observed downtime,
- Downtime better reflects the actual behavior.

The fix introduced in [PR #20842](sonic-net/sonic-mgmt#20842) now also resolves the recently found issue where a failed `nexthop_group_member_scale` run pollutes the test environment for subsequent re-runs of the entire testbed.
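
As a rough illustration of the verification change described above, here is a minimal sketch (hypothetical function and variable names, not the actual sonic-mgmt helpers) of restricting the traffic test to nexthops that are still expected, so stale or mis-mapped entries cannot inflate the measured downtime:

```python
def select_valid_nexthops(route_to_nexthop, expected_nexthops):
    """Keep only prefixes whose nexthop is still in the expected set.

    Hypothetical helper: prefixes pointing at stale or mis-mapped nexthops
    are excluded from the traffic test so they do not count as downtime.
    """
    return {
        prefix: nexthop
        for prefix, nexthop in route_to_nexthop.items()
        if nexthop in expected_nexthops
    }


# Example: the stale nexthop 10.0.0.99 is dropped before traffic is sent.
routes = {"192.168.0.0/25": "10.0.0.1", "192.168.0.128/25": "10.0.0.99"}
print(select_valid_nexthops(routes, {"10.0.0.1", "10.0.0.2"}))
# {'192.168.0.0/25': '10.0.0.1'}
```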

Dependency:

- Depends on the fixes introduced in:
  - [PR #21936](sonic-net/sonic-mgmt#21936)

### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
- Measured dataplane downtime remained unexpectedly high when:
  - The number of nexthops increased,
  - The test exercised different nexthop sets or ECMP groups.
- Downtime spikes appeared that did not match the BGP session and route programming timelines.

#### How did you do it?
- Creates a fresh, clean PTF dataplane environment for the nexthop group member scale test, similar to [PR #21936](sonic-net/sonic-mgmt#21936)
- Uses bulk re-announcement of the starting route state, as per the fix introduced by [PR #20842](sonic-net/sonic-mgmt#20842); a sketch follows this list
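
A minimal sketch of how such a setup might look, assuming the `ptfadapter` fixture from sonic-mgmt and a hypothetical `announce_routes_in_bulk` helper standing in for the bulk re-announcement from PR #20842 (this is not the exact code in this PR):

```python
import pytest


@pytest.fixture
def clean_dataplane_with_baseline_routes(ptfadapter, announce_routes_in_bulk, baseline_routes):
    """Start the scale test from a known-good state and restore it afterwards.

    `announce_routes_in_bulk` and `baseline_routes` are hypothetical fixtures
    used here only to illustrate the approach.
    """
    # Flush packets left over from a previous (possibly failed) run so the
    # PTF dataplane starts clean, similar to the approach in PR #21936.
    ptfadapter.dataplane.flush()
    # Re-announce the full starting route set in one bulk operation so the
    # DUT converges back to the expected baseline before traffic is sent.
    announce_routes_in_bulk(baseline_routes)
    yield
    # Restore the baseline on teardown as well, so a failed run does not
    # pollute the environment for later re-runs of the testbed.
    announce_routes_in_bulk(baseline_routes)
```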

#### How did you verify/test it?
- Ran the high‑BGP convergence, flap, and nexthop group member scale tests end‑to‑end with the nexthop fixes applied on:
  - Topology: `t0-isolated-d2u510s2`
  - Platform: Broadcom Arista-7060X6-64PE-B-C512S2

- Verified that the dataplane downtime does not exceed the expected `MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE` of 30 seconds (a sketch of this check follows the results below).

Dataplane downtime results before the fix: 63 seconds > `MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE`

Dataplane downtime results with the fix:
- Shutdown phase: 0.11 seconds, as expected
- Startup phase: 0.14 seconds, as expected
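
For context, a minimal sketch of the pass criterion, using the threshold name from this PR and the measured values above (the helper function name is hypothetical):

```python
MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 30  # seconds


def check_dataplane_downtime(phase, downtime_sec):
    """Fail if the measured dataplane downtime exceeds the allowed limit."""
    assert downtime_sec <= MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE, (
        "{} phase downtime {:.2f}s exceeds the {}s limit".format(
            phase, downtime_sec, MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE
        )
    )


check_dataplane_downtime("Shutdown", 0.11)
check_dataplane_downtime("Startup", 0.14)
```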

This change also fixes the recently found issue where a failed nexthop group member scale run pollutes the FIB on the switch for future re-runs of the testbed.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
@mssonicbld
Collaborator Author

Original PR: sonic-net/sonic-mgmt#21939

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld merged commit 2bb2a53 into Azure:202412 on Jan 23, 2026
11 of 14 checks passed
