
Add Multi-VRF testbed setup guidance. #23461

Open
yutongzhang-microsoft wants to merge 1 commit into sonic-net:master from yutongzhang-microsoft:yutongzhang/setup_multivrf


@yutongzhang-microsoft
Contributor

Description of PR

Add documentation for Multi-VRF testbed setup, which converges multiple cEOSLab BGP peer
containers into fewer host containers using VRFs to reduce resource consumption on large-scale
topologies.

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Add documentation for Multi-VRF testbed setup, which converges multiple cEOSLab BGP peer
containers into fewer host containers using VRFs to reduce resource consumption on large-scale
topologies.

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.


@yutongzhang-microsoft yutongzhang-microsoft removed the request for review from developfast March 31, 2026 07:43


## Overview
In a standard testbed, each BGP neighbor device maps to its own dedicated cEOSLab container.
Collaborator

It would be better to include a link to the VsSetup md.

Contributor Author

Added a link to the VsSetup md: "see VsSetup for general testbed setup".



## Overview
In a standard testbed, each BGP neighbor device maps to its own dedicated cEOSLab container.
Collaborator

This is not just for cEOS containers; it can also be cSONiC or others. It would be better to make this generic.

Contributor Author

Changed to "neighbor container (e.g. cEOS, cSONiC)".

For large topologies, this requires a large number of containers, placing significant memory
and CPU demands on the host server.

Multi-VRF mode converges the total number of peer switches into the fewest possible number of cEOSLab
Collaborator

Suggest changing "converges total number of peer switches into the fewest possible number of cEOSLab" to "converges large number of peer switches into the fewest possible number of neighbor containers".

Contributor Author

Done

dataplane behavior are unchanged.

## Approach
cEOSLab peers in docker containers may be converged into a smaller number of host peers.
Collaborator

Don't limit the approach to cEOS only.

Contributor Author

Changed to "Neighbor containers (e.g. cEOS, cSONiC)".

dataplane behavior are unchanged.

## Approach
cEOSLab peers in docker containers may be converged into a smaller number of host peers.
Collaborator

Also, it is rare to call them cEOSLab; we usually just call them cEOS.

Contributor Author

Changed throughout the page.

...
use_converged_peers: true
```
This file is read by the `TestbedProcessing.py` script, which sets global variables
Collaborator

We should have a workflow that shows, at a high level, how the process works. For people who don't know about this script or where it sits in the workflow, this section doesn't help at all.

./testbed-cli.sh redeploy-topo <testbed-name> password.txt
```

### Manual convergence (optional)
Collaborator

Is this a step 4, or another way to enable it? If it is the latter, we should use the right heading level:

## Enabling Multi-VRF mode
### Approach #1: Use use_converged_peers flag in testbed.yaml

### Approach #2: Manual converge peer nodes

Contributor Author

@yutongzhang-microsoft Apr 1, 2026

'Manual convergence' is just an alternative way to modify the YAML files: you can either edit them through `TestbedProcessing.py` or manually call `converge_testbed` to achieve the same result.
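
To make the two paths concrete, here is a minimal Python sketch; it is not the actual sonic-mgmt code. It assumes the testbed entry and topology are already parsed into dicts. The names `converge_topology` and `prepare_topology`, the `peers_per_host` parameter, the VRF naming, and the sample VM names are all hypothetical. Only the `use_converged_peers` flag, and the `convergence_data` and `topo_is_multi_vrf` keys mentioned later in this thread, come from the PR discussion.

```python
import copy


def converge_topology(topo, peers_per_host=4):
    """Hypothetical stand-in for the real convergence routine.

    Groups peer VMs into fewer host containers and assigns one VRF per
    original peer. The input topology is left untouched.
    """
    converged = copy.deepcopy(topo)
    vm_names = sorted(topo["VMs"])
    mapping = {}
    for index, vm_name in enumerate(vm_names):
        # The first VM of each group acts as the host container.
        host = vm_names[(index // peers_per_host) * peers_per_host]
        mapping.setdefault(host, {})[vm_name] = f"VRF_{vm_name}"
    converged["convergence_data"] = {"convergence_mapping": mapping}
    converged["topo_is_multi_vrf"] = True
    return converged


def prepare_topology(testbed_entry, topo):
    """Flag-driven path: converge only when the testbed entry asks for it."""
    if testbed_entry.get("use_converged_peers", False):
        return converge_topology(topo)
    return topo


# Six made-up peer VMs.
topo = {"VMs": {f"ARISTA{n:02d}T1": {"vm_offset": n - 1} for n in range(1, 7)}}

# Path 1: the flag in the testbed entry triggers convergence.
flagged = prepare_topology({"use_converged_peers": True}, topo)
# Path 2: call the convergence routine by hand.
manual = converge_topology(topo)
```

In this sketch both paths produce the same converged topology, which is the point of the reply above: the flag only decides who calls the convergence routine.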

```

### Step 3: Redeploy topology
```buildoutcfg
Collaborator

what is buildoutcfg?

should this be bash?

BGP neighborship.

## Known Limitations
+ cEOSLab instances do not allow for the creation of interfaces with interface-IDs greater
Collaborator

same here, don't limit us to cEOS. cSONiC can also support this mode.


## Test Library Changes
Test libraries needed to be made aware of the new underlying structure of cEOSLab
containers, VRFs, and BGP adjacencies. In many cases this was done by reference to the
Collaborator

this section is too vague. we should call out the patterns and examples here.

also call out what changes we need in the common infra to support this mode.

@yxieca
Collaborator

yxieca commented Mar 31, 2026

AI agent on behalf of Ying.

Issues found: docs-only.


@github-actions github-actions bot requested review from developfast and r12f April 1, 2026 02:16
@yutongzhang-microsoft
Contributor Author

Thanks for the review! Here's a summary of the changes made to address all comments:

| Comment | Change |
| --- | --- |
| Add a link to VsSetup doc | Added a reference link to README.testbed.VsSetup.md in the Overview |
| Not just cEOS, make it generic (line 12) | Changed to neighbor container (e.g. cEOS, cSONiC) throughout |
| "converges large number... neighbor containers" (line 16) | Updated wording as suggested |
| Don't limit approach to cEOS (line 23) | Rewrote Approach section with generic container language |
| Use "cEOS" not "cEOSLab" (line 23) | Replaced all occurrences of cEOSLab with cEOS |
| Hard to picture, draw a graph (line 25) | Added ASCII diagrams for both Standard and Multi-VRF topologies |
| Draw a graph (line 28) | Same as above |
| Show a high-level workflow (line 62) | Added a 4-step workflow overview before the detailed steps |
| Restructure headings (line 80) | Reorganized into Approach #1 (use flag) and Approach #2 (manual convergence) |
| What is buildoutcfg? Should be bash (line 76) | Changed all code blocks from buildoutcfg to yaml or bash |
| Don't limit to cEOS (line 99) | Updated Known Limitations to use generic container language |
| Too vague, add patterns and examples (line 92) | Expanded Test Library Changes with detailed patterns, nbrhosts fixture changes, BGP library updates, and a list of common infra changes required |

@yutongzhang-microsoft yutongzhang-microsoft force-pushed the yutongzhang/setup_multivrf branch from dfd3d36 to f88fd54 Compare April 1, 2026 02:25

@yutongzhang-microsoft
Contributor Author

Made two additional updates based on feedback:

1. Enabling Multi-VRF Mode — restructured

Removed the Approach #1 / Approach #2 split. The section is now a straightforward Step 1–3 workflow. The manual convergence section has been rewritten to clarify that it is a fallback for cases where TestbedProcessing.py is not used (or not available), not an alternative approach.

2. Test Library Changes — rewritten based on actual code (PR #22171)

The section now describes each file changed with accurate details:

  • ansible/ceos_topo_converger.py (new): rewrites the topology YAML in-place, merges peer VMs into fewer host containers, injects convergence_data metadata, and sets topo_is_multi_vrf: true.
  • ansible/TestbedProcessing.py: reads use_converged_peers flag and invokes converge_testbed() to rewrite the topology file before ansible processing.
  • ansible/library/topo_facts.py: get_vm_list() and get_vlans() now read from convergence_data when topo_is_multi_vrf is set.
  • ansible/library/testbed_vm_info.py: builds the VRF-to-VM-name mapping from convergence_data.convergence_mapping in multi-VRF mode.
  • tests/conftest.py `nbrhosts` fixture: populates each neighbor entry with is_multi_vrf_peer and a multi_vrf_data dict (vrf, intf_config, intf_offset, primary_host, primary_host_asn, ptf_bp_config, etc.).
  • tests/bgp/bgp_helpers.py: VRF-aware route checks (check_routes_on_from_neighbor, check_other_neigh, check_propagate_route), VRF-aware PTF port resolution (get_ptf_recv_port), and convergence_data-based offset lookup (get_vm_offset, get_vm_name_list).
  • tests/bgp/conftest.py: graceful-restart config scoped to router bgp <asn> / vrf <vrf> for multi-VRF peers; concurrent config tasks limited to 1 to avoid EOS session races.
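
The VRF-aware pattern described for the test libraries can be sketched as follows. This is an illustrative sketch only: the `is_multi_vrf_peer` and `multi_vrf_data["vrf"]` keys come from the list above, while the helper name `bgp_summary_command`, the sample neighbor entries, and the exact CLI strings are assumptions rather than code from PR #22171.

```python
def bgp_summary_command(neighbor):
    """Return the BGP summary CLI for a neighbor entry (hypothetical helper).

    A converged peer lives inside a VRF on a shared host container, so the
    command is scoped to that VRF; a standalone peer uses the default VRF.
    """
    if neighbor.get("is_multi_vrf_peer", False):
        vrf = neighbor["multi_vrf_data"]["vrf"]
        return f"show ip bgp summary vrf {vrf}"
    return "show ip bgp summary"


# Sample entries shaped like the fields listed above; all values are made up.
nbrhosts = {
    "ARISTA01T1": {"is_multi_vrf_peer": False},
    "ARISTA02T1": {
        "is_multi_vrf_peer": True,
        "multi_vrf_data": {"vrf": "ARISTA02T1", "primary_host": "ARISTA01T1"},
    },
}

commands = {name: bgp_summary_command(entry) for name, entry in nbrhosts.items()}
```

Keeping the branch inside one helper means a test body can stay identical for standard and Multi-VRF testbeds; only the neighbor entry decides which form of the command is issued.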


@yutongzhang-microsoft yutongzhang-microsoft force-pushed the yutongzhang/setup_multivrf branch from 22c7979 to 9b45a50 Compare April 1, 2026 02:54

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
@yutongzhang-microsoft yutongzhang-microsoft force-pushed the yutongzhang/setup_multivrf branch from e721d49 to 1630995 Compare April 1, 2026 08:25

@yxieca
Collaborator

yxieca commented Apr 1, 2026

AI agent on behalf of Ying.

Issues found:

  • Docs-only change (policy: do not approve)
