[fast reboot] allow test to replace fast-reboot script on the DUT before fast-rebooting #975
…ore rebooting
- fast-reboot script is an adapted version from 201811 branch. The change is around syncd stop: in 201803 branch, if it is Broadcom platform, request syncd to perform cold shutdown.
- Mellanox 201803 branch has a vlan FDB issue causing all vlan IO to flood. Add a knob allow_vlan_flooding to ignore this symptom and continue with fast-reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
"replace fast-reboot script" is not a good solution for the problems we are facing. We had better not inject code into the image to fix a broken feature.
The Mellanox 201803 problem does not need "replace fast-reboot script". We can separate that fix into a standalone PR.
The statement is not clear to me. If the Broadcom 201803 fast-reboot feature is not perfectly implemented, we should fix it in the 201803 image, and the upgrade path would be: 201803 bad image -> 201803 good image -> 201811 good image.
Summary:
This change enables the fast-reboot upgrade test from the 201803 branch to the 201811 branch.
Type of change
Approach
How did you do it?
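- The fast-reboot script is adapted from the 201811 branch. The change is around syncd stop: on the 201803 branch, if the platform is Broadcom, the script requests syncd to perform a cold shutdown.
- The Mellanox 201803 branch has a vlan FDB issue that causes all vlan IO to flood. Add a knob, allow_vlan_flooding, to ignore this symptom and continue with fast-reboot.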
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
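The two behaviors described above can be illustrated with a minimal Python sketch. It is not the code from this PR: the function names, the `"cold"`/`"default"` return values, and the packet-count argument are hypothetical, chosen only to show the decision logic. The only names taken from the description are the `allow_vlan_flooding` knob, the 201803 branch, and the Broadcom platform.

```python
# Illustrative sketch only. All function names and return values are
# hypothetical; they are not taken from the fast-reboot script or the test.


def syncd_shutdown_type(source_branch: str, asic_vendor: str) -> str:
    """Pick the shutdown request for syncd before fast-reboot.

    Per the PR description: on the 201803 branch, a Broadcom platform
    asks syncd to perform a cold shutdown.
    """
    if source_branch == "201803" and asic_vendor.lower() == "broadcom":
        return "cold"
    # Assumption: every other combination keeps the script's usual behavior,
    # represented here by the placeholder value "default".
    return "default"


def flooding_is_fatal(flooded_vlan_packets: int, allow_vlan_flooding: bool) -> bool:
    """Return True when observed vlan flooding should fail the test."""
    if flooded_vlan_packets == 0:
        return False  # nothing flooded, nothing to report
    # With the knob set, flooding is tolerated and fast-reboot continues.
    return not allow_vlan_flooding


if __name__ == "__main__":
    print(syncd_shutdown_type("201803", "Broadcom"))
    print(flooding_is_fatal(10, allow_vlan_flooding=True))
```

Keeping the knob as an explicit boolean means the flooding symptom is only ignored when the caller opts in, so runs on unaffected images would still treat flooding as a failure.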
How did you verify/test it?
Ran fast reboot from a 201803 branch image to a 201811 branch image.