[fast reboot] allow test to replace fast-reboot script on the DUT before fast-rebooting #975

Merged
lguohan merged 1 commit into sonic-net:master from yxieca:fastreboot
Jun 27, 2019

@yxieca (Collaborator) commented Jun 24, 2019

Summary:
This change enables the fast-reboot upgrade test from the 201803 branch to the 201811 branch.

Type of change

  • [] Bug fix
  • [] Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Approach

How did you do it?

  • The fast-reboot script is adapted from the 201811 branch version. The change concerns the syncd
    stop step: on the 201803 branch, if the platform is Broadcom, request syncd to perform a cold shutdown.
  • The Mellanox 201803 branch has a VLAN FDB issue that causes all VLAN I/O to flood. Add a knob,
    allow_vlan_flooding, to ignore this symptom and continue with fast-reboot.
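
As a rough illustration of the second point, a knob like allow_vlan_flooding could relax a traffic check along the following lines. This is only a sketch under stated assumptions: the `VlanTrafficReport` type, its fields, and the `check_vlan_traffic` function are hypothetical names invented for this example and are not taken from the actual test code; only the knob name comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class VlanTrafficReport:
    """Hypothetical summary of VLAN traffic observed during a fast-reboot."""
    sent: int
    received: int
    flooded: int  # packets seen on ports other than the expected one


def check_vlan_traffic(report, allow_vlan_flooding=False):
    """Return a list of failure messages; an empty list means the check passed."""
    failures = []
    # Packet loss is always a failure, regardless of the knob.
    if report.received < report.sent:
        failures.append(f"lost {report.sent - report.received} VLAN packets")
    # Flooding is a failure only when the knob is off (assumed default).
    if report.flooded > 0 and not allow_vlan_flooding:
        failures.append(f"{report.flooded} VLAN packets were flooded")
    return failures


report = VlanTrafficReport(sent=100, received=100, flooded=40)
strict = check_vlan_traffic(report)
relaxed = check_vlan_traffic(report, allow_vlan_flooding=True)
```

With the knob off, the flooded packets fail the check; with it on, the same report passes, while genuine packet loss would still be reported.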

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

How did you verify/test it?

Ran a fast reboot from a 201803 branch image to a 201811 branch image.

@lguohan lguohan merged commit 49f1035 into sonic-net:master Jun 27, 2019
@qiluo-msft (Contributor) commented

"Replace fast-reboot script" is not a good solution for the problems we are facing. We had better not inject code into the image to fix a broken feature.

> Mellanox 201803 branch has a vlan FDB issue causing all vlan IO to flood. Add a knob
> allow_vlan_flooding to ignore this symptom and continue with fast-reboot.

The Mellanox 201803 problem does not need "replace fast-reboot script". We can move that fix into a standalone PR.

> The change is around syncd stop: in 201803 branch, if it is Broadcom platform, request syncd to perform cold shutdown

This statement is not clear to me. If the Broadcom 201803 fast-reboot feature is not perfectly implemented, we should fix it in the 201803 image, so that the upgrade path is: 201803 bad image -> 201803 good image -> 201811 good image.

neethajohn pushed a commit that referenced this pull request Jun 27, 2019
…ore rebooting (#975)

@yxieca yxieca deleted the fastreboot branch July 8, 2019 15:43
fraserg-arista pushed a commit to fraserg-arista/sonic-mgmt that referenced this pull request Feb 24, 2026
…nic-net#975)

### Description of PR

Summary:
This PR fixes **excessively high dataplane downtime attributed to nexthop behavior** in the high-BGP test scenarios.

Nexthop handling in the test logic caused downtime measurements to stay high and inconsistent. This PR corrects nexthop-related announcement and verification so that:

- Traffic is always tested towards valid, expected nexthops,
- Stale or mis‑mapped nexthops no longer inflate the observed downtime,
- Downtime better reflects the actual behavior.

The fix from [PR sonic-net#20842](sonic-net#20842) now also resolves the recently found issue in which a failed nexthop_group_member_scale run pollutes the test environment for future re-runs of the entire testbed.

Dependency:

- Depends on the fixes introduced in [PR sonic-net#21936](sonic-net#21936)

### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
  - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
- Measured dataplane downtime remained unexpectedly high when:
  - The number of nexthops increased,
  - The test exercised different nexthop sets or ECMP groups.
- Downtime spikes appeared that did not match the BGP session and route programming timelines.

#### How did you do it?
- Set up a fresh, clean PTF dataplane environment for the nexthop group member scale test, similar to [PR sonic-net#21936](sonic-net#21936).
- Use bulk re-announcement of the starting state, as per the fix introduced by [PR sonic-net#20842](sonic-net#20842).

#### How did you verify/test it?
- Ran the high-BGP convergence, flap, and nexthop group member scale tests end-to-end with the nexthop fixes applied on:
  - Topology: `t0-isolated-d2u510s2`
  - Platform: Broadcom Arista-7060X6-64PE-B-C512S2

- Verified that the dataplane downtime does not exceed the expected `MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE` of 30 seconds.

Dataplane downtime before: 63 seconds (> MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE)

Dataplane downtime now:
- Shutdown phase: 0.11 seconds, as expected
- Startup phase: 0.14 seconds, as expected
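
The comparison behind these numbers can be sketched as follows. This is a minimal illustration, not the test's actual code: the `downtime_within_limit` helper and the `results` dictionary are hypothetical, and only the constant name, the 30-second limit, and the measured values come from the text above.

```python
# Limit in seconds, as stated in the PR description.
MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 30


def downtime_within_limit(downtime_seconds,
                          limit=MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE):
    """Return True when the measured downtime does not exceed the limit."""
    return downtime_seconds <= limit


# Measured values quoted above (hypothetical dict layout).
results = {"before": 63.0, "shutdown": 0.11, "startup": 0.14}
verdicts = {phase: downtime_within_limit(value)
            for phase, value in results.items()}
```

Under this check, the 63-second result fails while both phases measured after the fix pass.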

Also fixes the recently found issue where a failed nexthop group member scale test pollutes the FIB on the switch for future re-runs of the testbed.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
…omatically (sonic-net#21781)

#### Why I did it
src/sonic-swss-common
```
* 7aa1a47 - (HEAD -> 202411, origin/202411) Added field for policer counter (sonic-net#975) (18 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
As part of this commit and the previous commit ff6cb6c, the
sonic-utilities submodule for 201911 has been updated to take the following
changes:

 Add support for QSFP-DD cables on 'show' command (sonic-net#989)
 [show] Fix for 'trunk' PortChannel reported as 'routed' port (sonic-net#1002)
 Enable HW watchdog before fast-reboot (sonic-net#977)
 [filter-fdb] Check VLAN Presence When Filter FDB (sonic-net#957) (sonic-net#975)
 [filter-fdb] Fix For Vlan Defined With No CIDR (sonic-net#976)
 [show/config]: combine feature and container feature cli (sonic-net#1015)