[test plan] Test plan for BGP scale test #15702

Merged
Blueve merged 9 commits into sonic-net:master from w1nda:bgp-high-scale-test-plan on May 7, 2025

Conversation

@w1nda
Member

@w1nda w1nda commented Nov 22, 2024

Description of PR

Summary:
Fixes # (issue)
This test plan verifies whether the control and data planes can handle the initialization and flapping of numerous BGP sessions that hold a large number of routes, and estimates the resulting impact.

Related PRs:

  • [bgp-scale-test] Implement bgp scale test cases for sessions flapping, unisolation, nexthop group member change scenarios
  • [testbed] announce routes with routes generation switch, aggregate routes and variable ipv6 address pattern
  • [isolated-topo] disable ipv4 routes generation, add mocked aggregated addresses

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

With numerous BGP sessions holding a lot of routes, any flapping of BGP sessions or routes could place more overhead on the device. We publish this test plan to verify functionality and estimate convergence time.

How did you do it?

Describe three test scenarios and explain how we measure time in the tests.

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@w1nda w1nda requested review from wangxin and yxieca as code owners November 22, 2024 09:28
@w1nda w1nda requested review from Blueve and r12f and removed request for wangxin and yxieca November 22, 2024 09:28
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@w1nda w1nda force-pushed the bgp-high-scale-test-plan branch from 4c1c003 to b191e21 Compare December 19, 2024 14:32

# Setup Configuration
The count of routes from BGP peers is vital. We will leverage exabgp to advertise routes to all BGP peers, and those routes will finally be advertised to the device under test.

When the DUT is a T0, we will use exabgp as follows: first, advertise 511 routes with prefix length 120 to all peer T1 devices to simulate downstream routes (VLAN IPv6 addresses of T0s); second, advertise 15 routes with prefix length 64 to all peer T1 devices to simulate upstream routes (aggregated IPv6 addresses of the T0s' VLANs on T2s). Finally, the DUT T0 will receive those routes from its BGP peers.
Collaborator

it might be better to say - for each neighbor, we will advertise 1k routes in total: 512 /120 and 512 /128.

Collaborator

we will skip the T2 ones here. they won't make a difference but can cause a lot of confusion.

Member Author

Because we have 1 /120 and 1/128 on T0 DUT, I think the routes count are 511 /120 plus 511 /128, right?

Member Author

Is it 511 or 512?

The detailed route scale is described in the table below:
| Topology Type | BGP Routes Count | BGP Nexthop Group Count | BGP Nexthop Group Members Count |
| ------------------------------------------ | --------------------- | ----------------------- | ------------------------------- |
| t0-isolated-d2u254s1, t0-isolated-d2u254s2 | 254 * ( 511 + 511 ) | 254 | 254 |
Collaborator

The huge next hop count is not what the topology provides by default; it is what the mgmt test cases produce. We should move those numbers down to the mgmt test and provide the default numbers here.

Collaborator

Or we can make a new table showing the test as the requirement of Nexthop Group Member Scale Test.

Member Author

@w1nda w1nda Jan 15, 2025

When we deploy the testbed, the script sets up routes by default, and parameters in the topo such as podset_number, tor_number and tor_subnet_number control the route scale, so the routes in this table are the defaults for each topology.
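
As a rough check of the numbers in the table in this thread, the arithmetic can be sketched as below. This is only an illustration: the function and key names are made up for this example and are not part of sonic-mgmt, and the inputs are simply the values quoted in the table row (254 peers, 511 + 511 routes per peer).

```python
def route_scale(peers, routes_120_per_peer, routes_128_per_peer):
    """Compute per-peer and total BGP route counts (illustrative helper)."""
    per_peer = routes_120_per_peer + routes_128_per_peer
    return {
        "routes_per_peer": per_peer,
        # One route entry per (peer, prefix) pair, as in the table's
        # "254 * ( 511 + 511 )" expression.
        "total_bgp_routes": peers * per_peer,
    }


scale = route_scale(peers=254, routes_120_per_peer=511, routes_128_per_peer=511)
print(scale["routes_per_peer"])   # 1022
print(scale["total_bgp_routes"])  # 259588
```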

# Route Configuration Setup
The count of routes from BGP peers is vital. We will leverage exabgp to advertise routes to all BGP peers, and those routes will finally be advertised to the device under test.

When the DUT is a T0, we will use exabgp to advertise 511 routes with prefix length 120 and 511 routes with prefix length 128 to each neighbor T1 device. The prefixes with length 120 mock the VLAN addresses on downstream T0s, and the prefixes with length 128 mock the loopback addresses on downstream T0s.

just to clarify my understanding of the text here.

when the DUT is a T0 - the expectation is that all of the T1 (emulated) are reflecting the same collection of /120 and /128 prefix announcements for a resulting prefix count on the T0 DUT of ~1022 prefixes spread over 256/512 NHs. correct?

Member Author

correct
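
The prefix set discussed in this thread (511 /120 prefixes plus 511 /128 prefixes per T1 neighbor) can be sketched with Python's standard `ipaddress` module. This is a minimal sketch under assumptions: the base networks come from the IPv6 documentation range and the function name is hypothetical; the real address pattern used by the testbed scripts may differ.

```python
import ipaddress
from itertools import islice


def mock_t0_routes(vlan_base="2001:db8:0:1::/64",
                   loopback_base="2001:db8:0:2::/64",
                   count=511):
    """Build `count` /120 prefixes and `count` /128 prefixes.

    The base networks are assumptions (IPv6 documentation range),
    not the addresses the testbed really uses.
    """
    vlan_net = ipaddress.ip_network(vlan_base)
    loopback_net = ipaddress.ip_network(loopback_base)
    # subnets() is lazy, so islice only materializes the first `count` items.
    vlan_prefixes = list(islice(vlan_net.subnets(new_prefix=120), count))
    loopback_prefixes = list(islice(loopback_net.subnets(new_prefix=128), count))
    return vlan_prefixes, loopback_prefixes


vlans, loopbacks = mock_t0_routes()
# Number of routes that would be advertised to each T1 neighbor.
print(len(vlans) + len(loopbacks))
```

The resulting lists could then be fed to whatever announces routes through exabgp; that wiring is left out here because it depends on the testbed tooling.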

### Steps
1. Shut down all ports on the device (shut down the T1 session ports on a T0 DUT, or the T0 session ports on a T1 DUT).
1. Wait for the routes to be stable.
1. Start and keep sending packets with all routes to all portes via ptf.
Contributor

typo: portes => ports :-)

Member Author

fixed, thanks
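
The "wait for routes to be stable" step quoted in this thread could be implemented as a simple polling loop. The sketch below is hypothetical and not taken from the test plan or sonic-mgmt: `wait_until_stable` and its parameters are illustrative names, and `read_count` stands for any callable that returns the current route count on the DUT.

```python
import time


def wait_until_stable(read_count, stable_reads=3, interval=1.0, timeout=60.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll `read_count` until it returns the same value `stable_reads`
    times in a row, then return that value.

    Raises TimeoutError if stability is not reached within `timeout`
    seconds. `sleep` and `clock` are injectable to ease testing.
    """
    deadline = clock() + timeout
    last = read_count()
    streak = 1  # consecutive identical readings seen so far
    while streak < stable_reads:
        if clock() >= deadline:
            raise TimeoutError("route count did not stabilize in time")
        sleep(interval)
        current = read_count()
        if current == last:
            streak += 1
        else:
            last, streak = current, 1
    return last
```

Requiring several identical readings in a row, rather than a single one, is a design choice meant to avoid declaring stability while routes are still being installed between two polls.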

Blueve previously approved these changes May 6, 2025
@Blueve Blueve merged commit f65f2e6 into sonic-net:master May 7, 2025
4 checks passed
@zhangyanzhao zhangyanzhao moved this from 📋 In Plan Features to ✅ Done in SONiC 202505 Release May 9, 2025
@mssonicbld
Collaborator

Cherry-pick PR to msft-202412: Azure/sonic-mgmt.msft#259

nhe-NV pushed a commit to nhe-NV/sonic-mgmt that referenced this pull request May 12, 2025
…or sessions flapping, unisolation, nexthop group member change scenarios (sonic-net#258)

### Description of PR

Summary:
Fixes # (issue)
Implement test plan sonic-net#15702.
Add test cases that verify whether the control and data planes can handle the initialization and flapping of numerous BGP sessions that hold a large number of routes, and that estimate the resulting impact.
### Type of change


- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [x] Test case(new/improvement)

### Back port request
- [ ] 202012
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405

### Approach
#### What is the motivation for this PR?
With numerous BGP sessions holding a lot of routes, any flapping of BGP sessions or routes could place more overhead on the device. We need test cases to verify functionality and estimate convergence time, so we publish this test plan.
#### How did you do it?
Implement sessions flapping test, unisolation test and nexthop group member scale test
#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
Signed-off-by: opcoder0 <110003254+opcoder0@users.noreply.github.com>
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Dec 16, 2025
Signed-off-by: Aharon Malkin <amalkin@nvidia.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Jan 26, 2026
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>