[BGP Scale] Fix NHG Member Scale Announce Routes Convergence Timeout#20842
ccroy-arista wants to merge 1 commit into sonic-net:master
Conversation
Revert the topo_bgp_routes change from PR20238 to enable the test_nexthop_group_member_scale test to progress past the announce routes portion of the test.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Hi Chris, the code change looks good to me, but could you help update the PR description to explain the issue and the fix in a bit more detail? Currently it reads as "We need to revert basically the entire test case", which is not true. The code is essentially improving or simplifying the route announcement code so it can be done faster.
Adding @PriyanshTratiya here for visibility and review.
PriyanshTratiya
left a comment
Looks good overall. The update seems to do a full re-announce using announce_routes, with parallel sending functions saving time over the previous per-route ad hoc restoration. I just have one question: it still doesn't meet the original 30s downtime threshold. In my dev runs:
- Full re‑announce (new code): ~63s downtime during the announce phase.
- Previous ad hoc approach: ~128s downtime in the same phase.
So, the change is a clear improvement (roughly cutting the announce‑phase downtime in half), but we’re still over the legacy 30s target. If we adopt the relaxed threshold proposed in PR #20843, this will pass. Is the higher threshold in #20843 intended to become the new baseline for this test scenario?
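For readers following along, here is a minimal sketch of the two strategies being compared in this comment (bulk parallel re-announce vs. per-route restoration). The `send_commands` helper, the exabgp-style HTTP endpoint, and the data layout are assumptions for illustration, not the actual sonic-mgmt announce_routes implementation:

```python
# Hypothetical sketch of the two announcement strategies discussed above.
# announce_routes_bulk() mirrors the "full re-announce with parallel senders"
# approach; restore_routes_one_by_one() mirrors the older per-route restoration.
from concurrent.futures import ThreadPoolExecutor

import requests


def send_commands(ptf_ip, port, commands):
    """POST a batch of exabgp-style commands (e.g. 'announce route ...') to one peer."""
    url = "http://{}:{}".format(ptf_ip, port)
    resp = requests.post(url, data={"commands": ";".join(commands)})
    resp.raise_for_status()


def announce_routes_bulk(ptf_ip, routes_by_port, max_workers=8):
    """Re-announce every peer's full route set, handling peers in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(send_commands, ptf_ip, port,
                        ["announce route {} next-hop {}".format(prefix, nh)
                         for prefix, nh in routes])
            for port, routes in routes_by_port.items()
        ]
        for fut in futures:
            fut.result()  # surface any per-peer failure


def restore_routes_one_by_one(ptf_ip, routes_by_port):
    """Older ad hoc restoration: one HTTP round trip per route, sequentially."""
    for port, routes in routes_by_port.items():
        for prefix, nh in routes:
            send_commands(ptf_ip, port,
                          ["announce route {} next-hop {}".format(prefix, nh)])
```

The sequential variant pays one round trip per route, which is where the extra announce-phase downtime comes from at this route scale.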
<!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) This PR fixes **excessively high dataplane downtime attributed to nexthop behavior** in the high‑BGP test scenarios Nexthop handling in the test logic caused downtime measurements to stay high and inconsistent. This PR corrects nexthop‑related announcement, and verification so that: - Traffic is always tested towards valid, expected nexthops, - Stale or mis‑mapped nexthops no longer inflate the observed downtime, - Downtime better reflects the actual behavior. The fix put out in [PR #20842](sonic-net/sonic-mgmt#20842) now also fixes the recently found issue where the failed nexthop_group_member_scale pollutes the test environment for future re-runs of the entire testbed. Dependency: - Depends on the fixes introduced in: - [PR #21936 ](sonic-net/sonic-mgmt#21936) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ x ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? - Measured dataplane downtime remained unexpectedly high when: - The number of nexthops increased, - The test exercised different nexthop sets or ECMP groups. - Downtime spikes appeared that did not match the BGP session and route programming timelines. #### How did you do it? - A fresh clean ptf dataplane environment for the nexthop group member scale similar to the [PR #21936](sonic-net/sonic-mgmt#21936) - Uses the bulk reannouncement of the starting state as per the fix introduced by [PR #20842](sonic-net/sonic-mgmt#20842) #### How did you verify/test it? - Ran the high‑BGP convergence, flap, nexthop group member scale tests end‑to‑end with the nexthop fixes applied on: - Topology: `t0-isolated-d2u510s2` - Platform: Broadcom Arista-7060X6-64PE-B-C512S2 - Verified that the dataplane downtime does not fail the expected the MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE of 30 seconds. Dataplane Downtime results before: 63 seconds > MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE Dataplane Downtime results now: Shutdown Phase - 0.11 seconds as expected Startup Phase - 0.14 seconds as expected Also fixes the recently found issue where the failed nexthop group member scale pollutes the FIB on the switch for future re runs of the testbed. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? -->
<!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) This PR fixes **excessively high dataplane downtime attributed to nexthop behavior** in the high‑BGP test scenarios Nexthop handling in the test logic caused downtime measurements to stay high and inconsistent. This PR corrects nexthop‑related announcement, and verification so that: - Traffic is always tested towards valid, expected nexthops, - Stale or mis‑mapped nexthops no longer inflate the observed downtime, - Downtime better reflects the actual behavior. The fix put out in [PR #20842](sonic-net/sonic-mgmt#20842) now also fixes the recently found issue where the failed nexthop_group_member_scale pollutes the test environment for future re-runs of the entire testbed. Dependency: - Depends on the fixes introduced in: - [PR #21936 ](sonic-net/sonic-mgmt#21936) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ x ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? - Measured dataplane downtime remained unexpectedly high when: - The number of nexthops increased, - The test exercised different nexthop sets or ECMP groups. - Downtime spikes appeared that did not match the BGP session and route programming timelines. #### How did you do it? - A fresh clean ptf dataplane environment for the nexthop group member scale similar to the [PR #21936](sonic-net/sonic-mgmt#21936) - Uses the bulk reannouncement of the starting state as per the fix introduced by [PR #20842](sonic-net/sonic-mgmt#20842) #### How did you verify/test it? - Ran the high‑BGP convergence, flap, nexthop group member scale tests end‑to‑end with the nexthop fixes applied on: - Topology: `t0-isolated-d2u510s2` - Platform: Broadcom Arista-7060X6-64PE-B-C512S2 - Verified that the dataplane downtime does not fail the expected the MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE of 30 seconds. Dataplane Downtime results before: 63 seconds > MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE Dataplane Downtime results now: Shutdown Phase - 0.11 seconds as expected Startup Phase - 0.14 seconds as expected Also fixes the recently found issue where the failed nexthop group member scale pollutes the FIB on the switch for future re runs of the testbed. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? -->
Closing this PR as these changes have been directly incorporated into Azure/sonic-mgmt.msft#975. |
Description of PR
Revert the topo_bgp_routes change from PR20238 to enable the test_nexthop_group_member_scale test to progress past the announce routes portion of the test.
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
To remedy the announce routes convergence check, which otherwise fails due to a timeout waiting for routes to converge (a sketch of this kind of check follows).
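For context, the convergence check that was timing out can be thought of as a polling loop with a deadline: compare the routes installed on the DUT against the expected count until either they converge or the timeout expires. A hypothetical sketch, with the helper name and limits chosen for illustration rather than taken from the actual test code:

```python
# Illustrative polling loop for an announce-routes convergence check.
import time


def wait_for_route_convergence(get_installed_route_count, expected_count,
                               timeout=600, poll_interval=10):
    """Return True once the DUT reports at least expected_count routes,
    False if the deadline passes first (i.e. the convergence timeout)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_installed_route_count() >= expected_count:
            return True
        time.sleep(poll_interval)
    return False
```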
How did you do it?
Reverted the topo_bgp_routes portion of PR20238.
How did you verify/test it?
Ran the test against the t0-isolated-d2u510s2 topology, confirming that the test case now passes this step.
Any platform specific information?
Tested on Arista-7060X6-64PE-B-C512S2.