
feat: add batch_mode support for bind_fp_ports and unbind_fp_ports #18790

Merged
bingwang-ms merged 1 commit into sonic-net:master from auspham:austinpham/32654908-improve-vmtopology
Jun 18, 2025

Conversation

@auspham
Contributor

@auspham auspham commented Jun 4, 2025

Description of PR

Summary: This PR adds an option to use batch_mode for bind_fp_ports, which improves speed by about 50% when tested with 128 VM neighbors.

Fixes # (issue) 32654908

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

When creating OVS flows, we launch a subprocess for each command and wait for its result before continuing with the next call. This is very inefficient, even with multi-threading.

How did you do it?

This PR changes the multi-threading behavior as follows:

  1. Batch all the OVS flow creation commands that need to run into a single file.
  2. Launch one process that calls ovs-ofctl add-flows on the file, put the process into a queue to be waited on later, and free the thread so it can be used to launch a different batch.
  3. At the end, the main thread waits for all the launched batch processes to finish.

This PR also provides an option to opt in to this feature.
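The batching idea described here can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names (`write_flow_batch`, `build_add_flows_cmd`, `launch_batch`, `wait_all`), the bridge name, and the flow strings are all hypothetical, the thread pool is left out, and the sketch targets Python 3 only (`communicate(timeout=...)` is not available in Python 2).

```python
import subprocess
import tempfile


def write_flow_batch(flows):
    """Write one flow per line to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".flows", delete=False) as f:
        f.write("\n".join(flows) + "\n")
        return f.name


def build_add_flows_cmd(bridge, flow_file):
    """Build a single ovs-ofctl call that loads every flow in the file."""
    return ["ovs-ofctl", "add-flows", bridge, flow_file]


def launch_batch(cmd):
    """Start the command without waiting, so the caller can move on."""
    return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)


def wait_all(pending, timeout=60):
    """Wait for every (cmd, process) pair and collect the failures."""
    failures = []
    for cmd, proc in pending:
        try:
            _, err = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Assumed policy: kill a batch that exceeds the timeout.
            proc.kill()
            proc.communicate()
            failures.append((cmd, "timed out"))
            continue
        if proc.returncode != 0:
            failures.append((cmd, err.decode(errors="replace").strip()))
    return failures
```

A caller would queue one `(cmd, launch_batch(cmd))` pair per bridge and call `wait_all` once at the end, instead of blocking on a separate `ovs-ofctl` call for every flow.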

How did you verify/test it?

Verified on a physical testbed with 128 VMs. With the same settings and 8 threads, the average time dropped from 1 hour 30 minutes to 45 minutes.

The following is a sample run with the same settings and the same number of threads, with batch_mode enabled for renumber topology and unbind topology.

Most of the benefit shows up in Renumber topology, from batching bind_fp_ports.

Before

Wednesday 04 June 2025  03:12:40 +0000 (0:00:00.055)       1:43:10.602 ********
===============================================================================
vm_set : Renumber topology lt2-o128 to VMs. base vm = VM73166 -------- 4998.28s
vm_set : Unbind topology lt2-o128 to VMs. base vm = VM73166 ----------- 557.45s
vm_set : Kill exabgp and ptf_nn_agent processes in PTF container ------ 206.43s
vm_set : Setup vlan port for vlan tunnel ------------------------------- 92.35s
vm_set : Verify that exabgp processes for IPv4 are started ------------- 45.90s
vm_set : Verify that exabgp processes for IPv6 are started ------------- 45.79s
vm_set : Configure exabgp processes for IPv4 on PTF -------------------- 27.62s
vm_set : configure exabgp processes for IPv6 on PTF -------------------- 26.66s
vm_set : Stop ptf container ptf_vms73-2 -------------------------------- 16.49s
vm_set : Run the "apt-get update" as a separate and retryable step ----- 14.26s
vm_set : Create ptf container ptf_vms73-2 ------------------------------ 14.25s
vm_set : Try to login into docker registry ----------------------------- 12.00s
vm_set : Remove ptf container ptf_vms73-2 ------------------------------ 11.54s
vm_set : Set ipv6 route max size of ptf_vms73-2 ------------------------ 11.13s
vm_set : Enable ipv6 for docker container ptf_vms73-2 ------------------ 11.03s
vm_set : Install necessary packages ------------------------------------ 10.58s
vm_set : Announce routes ------------------------------------------------ 9.99s
vm_set : Install necessary packages ------------------------------------- 9.09s
vm_set : Stop PTF portchannel ------------------------------------------- 4.60s
vm_set : Change PTF interface MAC addresses ----------------------------- 4.35s

After

Wednesday 04 June 2025  07:30:07 +0000 (0:00:00.069)       0:52:51.148 ********
===============================================================================
vm_set : Renumber topology lt2-o128 to VMs. base vm = VM73166 -------- 1980.45s
vm_set : Unbind topology lt2-o128 to VMs. base vm = VM73166 ----------- 552.96s
vm_set : Kill exabgp and ptf_nn_agent processes in PTF container ------ 206.56s
vm_set : Setup vlan port for vlan tunnel ------------------------------ 108.52s
vm_set : Verify that exabgp processes for IPv4 are started ------------- 45.76s
vm_set : Verify that exabgp processes for IPv6 are started ------------- 45.45s
vm_set : Configure exabgp processes for IPv4 on PTF -------------------- 27.49s
vm_set : configure exabgp processes for IPv6 on PTF -------------------- 25.96s
vm_set : Stop ptf container ptf_vms73-2 -------------------------------- 16.17s
vm_set : Create ptf container ptf_vms73-2 ------------------------------ 15.02s
vm_set : Remove ptf container ptf_vms73-2 ------------------------------ 12.38s
vm_set : Set ipv6 route max size of ptf_vms73-2 ------------------------ 11.56s
vm_set : Try to login into docker registry ----------------------------- 11.13s
vm_set : Install necessary packages ------------------------------------ 10.50s
vm_set : Install necessary packages ------------------------------------- 8.84s
vm_set : Announce routes ------------------------------------------------ 7.81s
vm_set : Run the "apt-get update" as a separate and retryable step ------ 6.45s
vm_set : Add exabgpv6 supervisor config and start related processes ----- 4.74s
vm_set : Change PTF interface MAC addresses ----------------------------- 4.67s
vm_set : Stop PTF portchannel ------------------------------------------- 4.51s
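To put the two listings side by side, the short script below computes the relative time saved for the two affected tasks and for the elapsed time shown in each header line. It uses only numbers copied from the Before and After output; the helper names `reduction` and `hms_to_seconds` are introduced here for illustration.

```python
def reduction(before_s, after_s):
    """Fraction of the original time that was saved."""
    return (before_s - after_s) / before_s


def hms_to_seconds(text):
    """Convert an H:MM:SS.mmm string to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# (before, after) in seconds, taken from the two listings.
timings = {
    "Renumber topology": (4998.28, 1980.45),
    "Unbind topology": (557.45, 552.96),
    "Elapsed time in header": (
        hms_to_seconds("1:43:10.602"),
        hms_to_seconds("0:52:51.148"),
    ),
}

for name, (before, after) in timings.items():
    print("{}: {:.2f}s -> {:.2f}s ({:.1%} less time)".format(
        name, before, after, reduction(before, after)))
```

For this sample run, that works out to roughly 60% less time for Renumber topology, under 1% for Unbind topology, and about 49% for the elapsed time in the header line.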

Other topology

The only affected operations are renumber topology and unbind topology.

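| topology | no batch | batch |
|----------|----------|-------|
| t0 | ![image](https://github.com/user-attachments/assets/0eabf1a0-f463-4c9a-bf0e-65b8d04fc1eb) | ![image](https://github.com/user-attachments/assets/d565bb82-cebf-417f-aad2-f3539ae5f4b1) |
| t1-64-lag | ![image](https://github.com/user-attachments/assets/b978c5c2-b39d-4299-9f96-02dbbd31fce7) | ![image](https://github.com/user-attachments/assets/176b8d71-f5be-4630-89a3-ed5859c448e0) |
| dualtor-120 | ![image](https://github.com/user-attachments/assets/38622ea6-cce2-4f95-b5e6-e1ff8d16a491) | ![image](https://github.com/user-attachments/assets/b97b2349-b22e-41e0-b77b-1936f4e16caf) |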

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@auspham auspham changed the title feat: add fast_bind support feat: add batch_mode support for bind_fp_ports and unbind_fp_ports Jun 4, 2025
@yejianquan
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham requested a review from lolyu June 6, 2025 00:31
@auspham
Contributor Author

auspham commented Jun 6, 2025

Hi @lolyu, could you help review this one please? Thank you.

Collaborator

@lolyu lolyu left a comment

This is nice!
Could you please help triage on t0/t1/dualtor?

@auspham auspham force-pushed the austinpham/32654908-improve-vmtopology branch from 0da2f51 to b4a8be8 Compare June 10, 2025 01:48
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham force-pushed the austinpham/32654908-improve-vmtopology branch from b4a8be8 to abd1383 Compare June 10, 2025 05:20
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham force-pushed the austinpham/32654908-improve-vmtopology branch from abd1383 to a88952b Compare June 10, 2025 05:23
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham force-pushed the austinpham/32654908-improve-vmtopology branch from a88952b to 586b829 Compare June 10, 2025 05:50
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham force-pushed the austinpham/32654908-improve-vmtopology branch from 586b829 to ce342e7 Compare June 10, 2025 07:02
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham requested a review from lolyu June 10, 2025 07:15
@auspham
Contributor Author

auspham commented Jun 10, 2025

Hi @lolyu, could you help review again? I've added:

  • Timeout functionality
  • Error handling for batch mode
  • Test results for t0, t1-64-lag, dualtor-120
  • Some adjustments to also support Python 2

Signed-off-by: Austin Pham <austinpham@microsoft.com>

adjust logic

Signed-off-by: Austin Pham <austinpham@microsoft.com>

chore: set batchmode

Signed-off-by: Austin Pham <austinpham@microsoft.com>

add support for python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>

fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
@auspham auspham force-pushed the austinpham/32654908-improve-vmtopology branch from ce342e7 to e05b4b8 Compare June 10, 2025 08:29
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld pushed a commit that referenced this pull request Jun 18, 2025
adjust logic



chore: set batchmode



add support for python2



fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
@auspham
Contributor Author

auspham commented Jun 19, 2025

cherry-pick Azure/sonic-mgmt.msft#417 202503

bingwang-ms added a commit to Azure/sonic-mgmt.msft that referenced this pull request Jun 19, 2025
… (#18790) (#417)

Cherry-pick sonic-net/sonic-mgmt#18790

r12f pushed a commit to Azure/sonic-mgmt.msft that referenced this pull request Aug 7, 2025
As PR sonic-net/sonic-mgmt#17647 and
sonic-net/sonic-mgmt#18790 are not included in
202412, some parameters in `vm_topology.py` are not supported. So in
this PR, we removed such unsupported parameters.
nissampa pushed a commit to nissampa/sonic-mgmt_dpu_test that referenced this pull request Aug 7, 2025
adjust logic



chore: set batchmode



add support for python2



fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
@r12f
Contributor

r12f commented Aug 10, 2025

Hi @auspham, would you mind creating a manual cherry-pick to 202412 to resolve the conflict?

@auspham
Contributor Author

auspham commented Aug 10, 2025

@r12f could you help sign off? Thank you. Azure/sonic-mgmt.msft#636

@r12f
Contributor

r12f commented Aug 12, 2025

thanks! Kicked off CI and will follow up

r12f pushed a commit to Azure/sonic-mgmt.msft that referenced this pull request Aug 12, 2025
Cherry-pick sonic-net/sonic-mgmt#18790

Signed-off-by: Austin Pham <austinpham@microsoft.com>
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
adjust logic



chore: set batchmode



add support for python2



fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 16, 2025
adjust logic

chore: set batchmode

add support for python2

fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Dec 16, 2025
adjust logic

chore: set batchmode

add support for python2

fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
Signed-off-by: Aharon Malkin <amalkin@nvidia.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
adjust logic

chore: set batchmode

add support for python2

fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Jan 13, 2026
adjust logic



chore: set batchmode



add support for python2



fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Jan 26, 2026
adjust logic

chore: set batchmode

add support for python2

fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
adjust logic

chore: set batchmode

add support for python2

fix python2

Signed-off-by: Austin Pham <austinpham@microsoft.com>
Signed-off-by: Yael Tzur <ytzur@nvidia.com>
7 participants