[action] [PR:21939] Fix/nonlinear high nexthop dataplane downtime #22204

Merged
vmittal-msft merged 1 commit into sonic-net:202511 from mssonicbld:cherry/202511/21939
Feb 2, 2026

@mssonicbld
Collaborator

Description of PR

Summary:
Fixes # (issue)
This PR fixes excessively high dataplane downtime attributed to nexthop behavior in the high‑BGP test scenarios.

Nexthop handling in the test logic caused downtime measurements to stay high and inconsistent. This PR corrects nexthop‑related announcement and verification so that:

  • Traffic is always tested towards valid, expected nexthops,
  • Stale or mis‑mapped nexthops no longer inflate the observed downtime,
  • Downtime better reflects the actual behavior.
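
To illustrate the stale-nexthop point, here is a minimal sketch in Python. It is not the code from this PR: the function name `partition_nexthops` and the sample addresses are hypothetical. It only shows the idea of separating nexthops that are both announced and expected from ones that are stale or missing, so that traffic can be checked against the valid set alone.

```python
def partition_nexthops(expected, observed):
    """Split nexthops into valid, stale, and missing groups.

    expected: nexthop IPs the test announced (hypothetical input).
    observed: nexthop IPs currently seen for the route (hypothetical input).
    """
    expected_set = set(expected)
    observed_set = set(observed)
    valid = sorted(observed_set & expected_set)    # safe to send traffic towards
    stale = sorted(observed_set - expected_set)    # left over or mis-mapped
    missing = sorted(expected_set - observed_set)  # announced but not seen
    return valid, stale, missing


valid, stale, missing = partition_nexthops(
    expected=["10.0.0.1", "10.0.0.3"],
    observed=["10.0.0.1", "10.0.0.5"],
)
```

A non-empty `stale` list would be a hint that the measured downtime may include traffic sent towards nexthops the test never intended to use.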

The fix introduced in PR #20842 now also resolves a recently found issue in which a failed nexthop_group_member_scale run pollutes the test environment for later re-runs on the same testbed.

Dependency:

Type of change

  • [x] Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
  • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

  • Measured dataplane downtime remained unexpectedly high when:
      • the number of nexthops increased,
      • the test exercised different nexthop sets or ECMP groups.
  • Downtime spikes appeared that did not match the BGP session and route programming timelines.

How did you do it?

  • Set up a fresh, clean PTF dataplane environment for the nexthop group member scale test, similar to PR #21936.
  • Used bulk re-announcement of the starting state, following the fix introduced by PR #20842.
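
The cleanup idea can be sketched roughly as follows. This is an illustrative Python example under stated assumptions, not the implementation in this PR: `FakeDataplane`, `clean_dataplane`, and the `restore_state` callback are hypothetical names, and the stand-in class only mimics a dataplane object with a `flush()` method. The point is that cleanup and restoration of the starting state run even when the test body fails.

```python
from contextlib import contextmanager


class FakeDataplane:
    """Hypothetical stand-in for a PTF dataplane; it only tracks queued packets."""

    def __init__(self):
        self.queued = []

    def flush(self):
        # Drop every packet still sitting in the queue.
        self.queued.clear()


@contextmanager
def clean_dataplane(dataplane, restore_state):
    """Give the test an empty dataplane and always clean up afterwards."""
    dataplane.flush()  # start from an empty packet queue
    try:
        yield dataplane
    finally:
        dataplane.flush()  # drop leftovers even if the test failed
        restore_state()    # e.g. bulk re-announce the starting routes


# Usage: simulate a failing test and confirm nothing leaks into the next run.
restored = []
dp = FakeDataplane()
dp.queued.append("stale-packet")
try:
    with clean_dataplane(dp, lambda: restored.append("starting-state")) as d:
        d.queued.append("in-flight")
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass
```

In a pytest suite the same pattern would typically live in a fixture that yields the dataplane, so that the teardown half runs regardless of the test outcome.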

How did you verify/test it?

  • Ran the high‑BGP convergence, flap, and nexthop group member scale tests end‑to‑end, with the nexthop fixes applied, on:

  • Topology: t0-isolated-d2u510s2

  • Platform: Broadcom Arista-7060X6-64PE-B-C512S2

  • Verified that the dataplane downtime stays within the expected MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE limit of 30 seconds.

Dataplane Downtime results before: 63 seconds > MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE
Dataplane Downtime results now:
Shutdown Phase - 0.11 seconds as expected
Startup Phase - 0.14 seconds as expected
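
As a rough sketch of how the numbers above relate to the limit, the Python example below compares a downtime value against MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE. Only the constant name and its 30-second value come from this PR. The helper names, the packet counters, the constant-send-rate estimate, and the choice of `<=` for the comparison are assumptions made for illustration; the actual test may measure and compare downtime differently.

```python
MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 30  # seconds, limit quoted in this PR


def estimate_downtime(tx_packets, rx_packets, packets_per_second):
    """Estimate downtime in seconds from packet loss at a constant send rate.

    Hypothetical helper: assumes every lost packet corresponds to
    1 / packets_per_second seconds without forwarding.
    """
    if packets_per_second <= 0:
        raise ValueError("packets_per_second must be positive")
    lost = max(tx_packets - rx_packets, 0)
    return lost / packets_per_second


def within_limit(downtime_seconds, limit=MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE):
    """Return True when the downtime does not exceed the limit (assumed <=)."""
    return downtime_seconds <= limit


# Values reported in this PR: 63 s before the fix, 0.11 s and 0.14 s after.
before_ok = within_limit(63)
shutdown_ok = within_limit(0.11)
startup_ok = within_limit(0.14)

# Hypothetical counters: 110 packets lost at 1000 packets per second.
example = estimate_downtime(tx_packets=10000, rx_packets=9890, packets_per_second=1000)
```

With these inputs the pre-fix value fails the check and both post-fix phases pass it, matching the results listed above.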

This also fixes the recently found issue where a failed nexthop group member scale test pollutes the FIB on the switch for future re-runs on the testbed.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
@mssonicbld
Collaborator Author

Original PR: #21939

@mssonicbld
Collaborator Author

/azp run

@github-actions github-actions bot requested review from cyw233 and lolyu January 30, 2026 22:42
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions github-actions bot requested a review from sanjair-git January 30, 2026 22:42
@radha-danda

/azpw run Azure.sonic-mgmt

@mssonicbld
Collaborator Author

/AzurePipelines run Azure.sonic-mgmt

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@vmittal-msft vmittal-msft merged commit 10b2847 into sonic-net:202511 Feb 2, 2026
18 checks passed
lakshmi-nexthop pushed a commit to lakshmi-nexthop/sonic-mgmt that referenced this pull request Feb 11, 2026
…c-net#22204)

* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
Co-authored-by: Priyansh <77935498+PriyanshTratiya@users.noreply.github.com>
Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai>
4 participants