[advanced-reboot] improvements and enable CPU/VLAN ARP watchers during warm reboot#890

Merged
yxieca merged 5 commits into sonic-net:master from
stepanblyschak:stepanb/adv-reboot
Apr 30, 2019

@stepanblyschak
Contributor

Description of PR

Summary:

  • Refactor ReloadTest code and move Arista class into separate python module
  • Reuse from_t1 and from_vlan_packet in generate_bidirectional
  • Keep watching CPU/VLAN ARP, etc. while running fast data plane IO - send_in_background()
  • Use a lock so that only one data plane traffic generator runs at a time

Fixes # (issue)

  • CPU/VLAN ARP, etc not watched during warm reboot test

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • [] Test case(new/improvement)

Approach

The idea is to apply a filter on the PTF sockets while send_in_background() is running, in order to achieve stable results.

How did you do it?

When send_in_background() starts, it acquires dataplane_io_lock to guarantee that the data plane watcher does not run at the same time. Before it starts sending traffic, it applies a filter on the PTF port sockets to filter out data plane TCP traffic and ARP requests from the DUT. Otherwise the CPU/VLAN ARP states are unstable.
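The filtering rule described above can be illustrated with a small user-space predicate. This is only a sketch of the rule, not the socket filter this PR attaches: the function name `keep_frame`, the `dut_mac` parameter, and the assumption of untagged Ethernet frames (no 802.1Q header) are all hypothetical. Python 3 is assumed.

```python
import struct

ETH_P_IP = 0x0800
ETH_P_ARP = 0x0806
IPPROTO_TCP = 6
ARP_OP_REQUEST = 1


def keep_frame(frame, dut_mac):
    """Return True if the watcher should still see this frame.

    Hypothetical helper: drops data plane TCP traffic and ARP requests
    sent by the DUT, keeps everything else. Assumes untagged Ethernet.
    """
    if len(frame) < 14:
        return False  # too short to carry an Ethernet header
    src_mac = frame[6:12]
    (ethertype,) = struct.unpack("!H", frame[12:14])

    if ethertype == ETH_P_ARP and len(frame) >= 22:
        # The ARP opcode sits 6 bytes into the ARP header (offset 14 + 6).
        (opcode,) = struct.unpack("!H", frame[20:22])
        if opcode == ARP_OP_REQUEST and src_mac == dut_mac:
            return False

    if ethertype == ETH_P_IP and len(frame) >= 24:
        # The IPv4 protocol field sits 9 bytes into the IP header (offset 14 + 9).
        if frame[23] == IPPROTO_TCP:
            return False

    return True


# Usage with hand-built frames (all addresses are made up).
DUT_MAC = bytes.fromhex("001122334455")
HOST_MAC = bytes.fromhex("66778899aabb")


def build_frame(src_mac, ethertype, payload):
    return b"\xff" * 6 + src_mac + struct.pack("!H", ethertype) + payload


arp_request = struct.pack("!HHBBH", 1, ETH_P_IP, 6, 4, ARP_OP_REQUEST) + b"\x00" * 20
arp_reply = struct.pack("!HHBBH", 1, ETH_P_IP, 6, 4, 2) + b"\x00" * 20
tcp_packet = b"\x45" + b"\x00" * 8 + bytes([IPPROTO_TCP]) + b"\x00" * 10
icmp_packet = b"\x45" + b"\x00" * 8 + bytes([1]) + b"\x00" * 10

dropped_arp = keep_frame(build_frame(DUT_MAC, ETH_P_ARP, arp_request), DUT_MAC)
kept_reply = keep_frame(build_frame(DUT_MAC, ETH_P_ARP, arp_reply), DUT_MAC)
dropped_tcp = keep_frame(build_frame(HOST_MAC, ETH_P_IP, tcp_packet), DUT_MAC)
kept_icmp = keep_frame(build_frame(HOST_MAC, ETH_P_IP, icmp_packet), DUT_MAC)
```

With such a rule in place, the ARP replies and ICMP echo replies that the watcher counts still get through, while the fast TCP stream and DUT-originated ARP requests do not inflate the received counters.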

How did you verify/test it?

Ran the warm reboot test on a Mellanox DUT. CPU/VLAN ARP is watched during the test in parallel with the data plane traffic:

2019-04-25 14:44:09 : Send   100 Received   100 servers->t1
2019-04-25 14:44:10 : Send   500 Received   500 t1->servers
2019-04-25 14:44:11 : Send    10 Received     0 ping DUT
2019-04-25 14:44:11 : Control plane state transition from up to down
2019-04-25 14:44:11 : Dut reboots: reboot start 2019-04-25 14:44:11.734492
2019-04-25 14:44:11 : Sniffer started at 2019-04-25 14:44:11.734957
2019-04-25 14:44:11 : Send     1 Received     1 arp ping
2019-04-25 14:44:12 : Send   100 Received   100 servers->t1
2019-04-25 14:44:14 : Send   500 Received   500 t1->servers
2019-04-25 14:44:15 : Send    10 Received     0 ping DUT
2019-04-25 14:44:15 : Send     1 Received     1 arp ping
2019-04-25 14:44:16 : Send    10 Received     0 ping DUT
2019-04-25 14:44:16 : Send     1 Received     1 arp ping
2019-04-25 14:44:17 : Send    10 Received     0 ping DUT
2019-04-25 14:44:17 : Send     1 Received     1 arp ping
2019-04-25 14:44:17 : Sender started at 2019-04-25 14:44:17.950181
...
2019-04-25 14:45:05 : Send    10 Received     0 ping DUT
2019-04-25 14:45:06 : Send     1 Received     2 arp ping
2019-04-25 14:45:06 : Send    10 Received     0 ping DUT
2019-04-25 14:45:07 : Send     1 Received     2 arp ping
2019-04-25 14:45:07 : Send    10 Received     0 ping DUT
2019-04-25 14:45:08 : Send     1 Received     2 arp ping
2019-04-25 14:45:09 : Send    10 Received   168 ping DUT
2019-04-25 14:45:09 : Control plane state transition from down to up
2019-04-25 14:45:09 : Send     1 Received     2 arp ping
2019-04-25 14:45:10 : Send    10 Received   240 ping DUT
2019-04-25 14:45:11 : Send     1 Received     2 arp ping
2019-04-25 14:45:12 : Send    10 Received   240 ping DUT
2019-04-25 14:45:12 : Send     1 Received     2 arp ping
2019-04-25 14:45:13 : Send    10 Received   240 ping DUT
2019-04-25 14:45:14 : Send     1 Received     2 arp ping
2019-04-25 14:45:15 : Send    10 Received    10 ping DUT
2019-04-25 14:45:15 : Send     1 Received     2 arp ping
2019-04-25 14:45:16 : Send    10 Received    10 ping DUT
2019-04-25 14:45:16 : Send     1 Received     2 arp ping
2019-04-25 14:45:17 : Send    10 Received    10 ping DUT
2019-04-25 14:45:17 : Send     1 Received     2 arp ping
2019-04-25 14:45:18 : Send    10 Received    10 ping DUT
2019-04-25 14:45:19 : Send     1 Received     2 arp ping
2019-04-25 14:45:19 : Send    10 Received    10 ping DUT
2019-04-25 14:45:20 : Send     1 Received     2 arp ping
2019-04-25 14:45:21 : Send    10 Received    10 ping DUT
2019-04-25 14:45:21 : Send     1 Received     2 arp ping
...
2019-04-25 14:48:29 : Send     1 Received     2 arp ping
2019-04-25 14:48:29 : Send    10 Received    10 ping DUT
2019-04-25 14:48:30 : Send     1 Received     2 arp ping
2019-04-25 14:48:31 : Send    10 Received    10 ping DUT
2019-04-25 14:48:32 : Stopping reachability state watch thread.
2019-04-25 14:48:32 : Send     1 Received     2 arp ping
2019-04-25 14:51:00 : Pcap file dumped to /tmp/capture.pcap
2019-04-25 14:51:00 : Packet flow examine started 0:06:48.657896 after the reboot
2019-04-25 14:54:45 : Disruption between packet ID 11359 and 11361. For 0.0065
2019-04-25 14:54:45 : Disruption between packet ID 11364 and 11366. For 0.0053
2019-04-25 14:54:45 : Disruption between packet ID 11844 and 11846. For 0.0060
2019-04-25 14:54:45 : Disruption between packet ID 11849 and 11851. For 0.0079
2019-04-25 14:54:45 : Disruption between packet ID 11854 and 11856. For 0.0060
2019-04-25 14:54:45 : Disruption between packet ID 11859 and 11861. For 0.0060
2019-04-25 14:54:45 : Disruption between packet ID 11864 and 11866. For 0.0075
2019-04-25 14:54:45 : Disruption between packet ID 11869 and 11871. For 0.0071
2019-04-25 14:54:51 : Disruptions happen between 0:01:15.156358 and 0:01:18.176951 after the reboot.
2019-04-25 14:54:51 : Total incoming packets captured 28531
2019-04-25 14:55:41 : Filtered pcap dumped to /tmp/capture_filtered.pcap
2019-04-25 14:55:41 : Packet flow examine finished after 0:04:41.098271
2019-04-25 14:55:41 : The longest disruption lasted 0.008 seconds. 1 packet(s) lost.
2019-04-25 14:55:41 : Total disruptions count is 8. All disruptions lasted 0.052 seconds. Total 8 packet(s) lost
2019-04-25 14:55:41 : Wait until bgp routing is up on all devices
2019-04-25 14:55:41 : Data plane works again. Start time: 2019-04-25 14:45:29.767663
2019-04-25 14:55:41 :
2019-04-25 14:55:41 : ==================================================
2019-04-25 14:55:41 : Report:
2019-04-25 14:55:41 : ==================================================
2019-04-25 14:55:41 : LACP/BGP were down for (extracted from cli):
2019-04-25 14:55:41 : --------------------------------------------------
2019-04-25 14:55:41 : --------------------------------------------------
2019-04-25 14:55:41 : Extracted from VM logs:
2019-04-25 14:55:41 : --------------------------------------------------
2019-04-25 14:55:41 : Summary:
2019-04-25 14:55:41 : --------------------------------------------------
2019-04-25 14:55:41 : Downtime was 0:00:00.007925
2019-04-25 14:55:41 : Reboot time was 0:01:18.033171
2019-04-25 14:55:41 : Expected downtime is less then 0:00:30
2019-04-25 14:55:41 : ==================================================
2019-04-25 14:55:41 : Disabling arp_responder
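As a cross-check of the summary lines in the log above, the per-disruption durations can be totalled from the "Disruption between packet ID" lines. The snippet below is a stand-alone sketch: the regular expression and variable names are illustrative and not taken from the test code.

```python
import re

# The eight disruption lines from the log above, copied verbatim.
LOG = """\
2019-04-25 14:54:45 : Disruption between packet ID 11359 and 11361. For 0.0065
2019-04-25 14:54:45 : Disruption between packet ID 11364 and 11366. For 0.0053
2019-04-25 14:54:45 : Disruption between packet ID 11844 and 11846. For 0.0060
2019-04-25 14:54:45 : Disruption between packet ID 11849 and 11851. For 0.0079
2019-04-25 14:54:45 : Disruption between packet ID 11854 and 11856. For 0.0060
2019-04-25 14:54:45 : Disruption between packet ID 11859 and 11861. For 0.0060
2019-04-25 14:54:45 : Disruption between packet ID 11864 and 11866. For 0.0075
2019-04-25 14:54:45 : Disruption between packet ID 11869 and 11871. For 0.0071
"""

durations = [float(value) for value in re.findall(r"For (\d+\.\d+)", LOG)]

count = len(durations)               # 8 disruptions
total = round(sum(durations), 3)     # 0.052 seconds overall
longest = round(max(durations), 3)   # 0.008 seconds for the longest one

print(count, total, longest)
```

These values agree with the "Total disruptions count is 8", "All disruptions lasted 0.052 seconds" and "The longest disruption lasted 0.008 seconds" lines in the log.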

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

Stepan Blyschak added 4 commits April 25, 2019 17:39
Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
* reuse from_t1 and from_vlan_server generated packets in
  generate_bidirectional
* use tcp instead of udp in generate_bidirectional

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
reachability_watcher threads

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
* Apply a filter on socket before sending fast data plane IO
* Save sniffed packets after the traffic test is done

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
@stepanblyschak
Contributor Author

@yxieca Could you please review and test on your setup as well and check if you get stable CPU/VLAN ARP pings during reboot?

@pavel-shirshov
Contributor

LGTM

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
@yxieca
Collaborator

yxieca commented Apr 28, 2019

The reachability watcher and the fast sender both require the IO lock to send IO. How could the watcher send anything while the fast sender holds the lock for minutes?

Your test log shows otherwise, I must be missing something?

@stepanblyschak
Contributor Author

@yxieca Only ping_data_plane() in the reachability watcher is protected by the lock. Besides, I used a non-blocking acquire, so if the lock cannot be acquired the watcher skips ping_data_plane() and still does pingDut and arpPing.
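A minimal sketch of that pattern follows, assuming a plain `threading.Lock`. The function `watcher_iteration` and its callback parameters are hypothetical stand-ins for the watcher loop body; only `dataplane_io_lock`, `ping_data_plane`, `pingDut` and `arpPing` are names mentioned in this PR.

```python
import threading

dataplane_io_lock = threading.Lock()


def watcher_iteration(ping_data_plane, ping_dut, arp_ping):
    """Run one watcher pass and return the names of the checks that ran."""
    ran = []

    # Non-blocking acquire: returns False immediately if the fast sender
    # currently holds the lock, instead of waiting for minutes.
    if dataplane_io_lock.acquire(False):
        try:
            ping_data_plane()
            ran.append("ping_data_plane")
        finally:
            dataplane_io_lock.release()

    # CPU and VLAN ARP checks are not guarded by the lock, so they always run.
    ping_dut()
    ran.append("ping_dut")
    arp_ping()
    ran.append("arp_ping")
    return ran


def noop():
    pass


# Lock is free: all three checks run.
when_free = watcher_iteration(noop, noop, noop)

# Lock is held (as it would be while send_in_background() is sending):
# the data plane ping is skipped, the other two checks still run.
with dataplane_io_lock:
    when_held = watcher_iteration(noop, noop, noop)
```

This matches the test log in the description, where "ping DUT" and "arp ping" lines keep appearing after "Sender started" while the "servers->t1" and "t1->servers" lines stop.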

@stepanblyschak stepanblyschak marked this pull request as ready for review April 30, 2019 07:11
@yxieca yxieca merged commit e8b86bd into sonic-net:master Apr 30, 2019
yxieca pushed a commit that referenced this pull request Apr 30, 2019
…g warm reboot (#890)

* [advanced-reboot] move Arista class to separate module

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

* [advanced-reboot] use lock to synchronize fast data plane and
reachability_watcher threads

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

* [advanced-reboot] stabilize test when fast data plane send running

* Apply a filter on socket before sending fast data plane IO
* Save sniffed packets after the traffic test is done

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

* [advanced-reboot] refactor fast data plane generator code

* reuse from_t1 and from_vlan_server generated packets in
  generate_bidirectional
* use tcp instead of udp in generate_bidirectional

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

* [advanced-reboot] add space back

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
deerao02 pushed a commit to deerao02/sonic-mgmt that referenced this pull request Dec 18, 2025
### Description of PR

Summary:
BBR feature is not required in t1-isolated-xx setup, skip it for now.

### Type of change


- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
 - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [x] 202505

### Approach
#### What is the motivation for this PR?
BBR feature is not required in t1-isolated-xx setup, skip it to reduce noise.

#### How did you do it?
skip in conditional mark.

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
Submodule src/sonic-utilities d7e8f84cf..8c21fc151:
  > [utility] Filter FDB entries (sonic-net#890)
  > Fix the warm-reboot script to support FRR based warm-reboot (sonic-net#842)

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
[fwutil]: Fix firmware update command. (sonic-net#895)
[utility] Filter FDB entries (sonic-net#890)