Add virtual NUT testbed (vNUT) support to testbed-cli using KVM VMs by r12f · Pull Request #22976 · sonic-net/sonic-mgmt

r12f · 2026-03-14T23:04:41Z

Description of PR

Summary:
Add add-vnut-topo / remove-vnut-topo commands to testbed-cli.sh that deploy a fully virtual NUT (Network Under Test) testbed using KVM-based virtual SONiC instances. This enables running sonic-mgmt NUT tests against a virtual topology without physical hardware.

The virtual testbed consists of KVM virtual SONiC DUTs and docker-ptf traffic generators connected via veth pairs, sharing the existing management bridge (br1) and management subnet with other virtual testbeds.

Type of change

Back port request

Approach

What is the motivation for this PR?

Enable developers to run NUT tests locally without physical switches or traffic generators. The vNUT testbed provides a fully virtualized alternative using KVM VMs and PTF containers.

How did you do it?

- Added add-vnut-topo / remove-vnut-topo actions to testbed-cli.sh
- Created ansible/roles/testbed/nut-vtopo/ Ansible role with tasks for:
  - Management network setup (shared br1 bridge)
  - KVM VM launch for DUTs using sonic-vs.img
  - PTF container launch for traffic generators
  - veth pair creation via custom vnut_network.py module
  - SONiC service readiness checks and admin user provisioning
- Added example testbed YAML, inventory entries, and device/link CSV entries for nut-2tiers topology
- Reuses existing NUT topology definitions and testbed framework

How did you verify/test it?

Validated end-to-end inside sonic-mgmt container on a KVM-capable host:

add-vnut-topo: ok=68, failed=0 ✅
deploy-cfg: ok=44, failed=0 ✅
test_pretest.py: 11 passed, 6 skipped, 0 failures ✅
BGP sessions established between T0↔T1 DUTs

Any platform specific information?

Requires KVM-capable host. DUTs use Force10-S6000 platform profile (virtual SONiC).

Documentation

HLD: #22977

mssonicbld · 2026-03-14T23:04:49Z

/azp run

azure-pipelines · 2026-03-14T23:05:02Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-03-14T23:13:34Z

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing .hooks/sonic_mgmt_pre_commit_hooks.egg-info/SOURCES.txt
Fixing .hooks/sonic_mgmt_pre_commit_hooks.egg-info/dependency_links.txt

check yaml...............................................................Passed
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Passed
flake8...............................................(no files to check)Skipped
flake8 (tests/common2)...............................(no files to check)Skipped
check conditional mark sort..............................................Passed
isort (python).......................................(no files to check)Skipped
black................................................(no files to check)Skipped
mypy.................................................(no files to check)Skipped
pylint...............................................(no files to check)Skipped

To run the pre-commit checks locally, follow the steps below:

Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
docker container.
Ensure that the pre-commit package is installed:

sudo pip install pre-commit

Go to repository root folder
Install the pre-commit hooks:

pre-commit install

Use pre-commit to check staged files:

pre-commit

Alternatively, you can check committed files using:

pre-commit run --from-ref <commit_id> --to-ref <commit_id>

.hooks/sonic_mgmt_pre_commit_hooks.egg-info/entry_points.txt

.hooks/sonic_mgmt_pre_commit_hooks.egg-info/PKG-INFO

ansible/roles/testbed/nut-vtopo/defaults/main.yml

mssonicbld · 2026-03-14T23:20:20Z

/azp run

azure-pipelines · 2026-03-14T23:20:34Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-03-14T23:33:29Z

/azp run

azure-pipelines · 2026-03-14T23:33:43Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-03-14T23:51:41Z

/azp run

azure-pipelines · 2026-03-14T23:51:55Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-03-15T00:14:09Z

/azp run

azure-pipelines · 2026-03-15T00:14:23Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-03-15T00:32:07Z

/azp run

azure-pipelines · 2026-03-15T00:32:21Z

Azure Pipelines successfully started running 1 pipeline(s).

banidoru

Reviewed +772 lines across 22 files. The overall design is sound — two-phase container launch (base network container + cSONiC overlay) with veth-based linking is a clean approach. Key concerns:

Security: Hardcoded credentials in vnut-lab/hosts inventory file
Command injection risk in vnut_network.py — user-supplied device/port names flow into shell commands via string formatting
Teardown bridge check bug — uses docker network inspect on a Linux bridge created via ip link
Cleanup over-matching — link_ prefix pattern could delete unrelated host interfaces
Minor: import placement, shell quoting
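
The command-injection concern comes down to how the helper assembles its command line. A minimal sketch of the list-argument approach, assuming a helper shaped like run_cmd; the signature, the timeout default, and the RuntimeError on failure are illustrative choices here, not code taken from the PR:

```python
import subprocess


def run_cmd(args, check=True, timeout=60):
    """Run a command given as a list of arguments, without a shell.

    Each list element reaches the program as exactly one argv entry, so a
    device or port name containing spaces or ';' is never parsed as shell
    syntax.
    """
    result = subprocess.run(
        args, capture_output=True, text=True, timeout=timeout
    )
    if check and result.returncode != 0:
        raise RuntimeError(
            "command {!r} failed (rc={}): {}".format(
                args, result.returncode, result.stderr.strip()
            )
        )
    return result.stdout
```

A caller would pass untrusted values as separate list elements, for example run_cmd(["ip", "link", "show", port_name]), rather than formatting them into a single string.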

ansible/vnut-lab/hosts

ansible/roles/testbed/nut-vtopo/library/vnut_network.py

ansible/roles/testbed/nut-vtopo/tasks/teardown.yml

ansible/roles/testbed/nut-vtopo/tasks/wait_ready.yml

ansible/testbed-cli.sh

ansible/roles/testbed/nut-vtopo/tasks/wait_ready.yml

ansible/roles/testbed/nut-vtopo/tasks/create_mgmt_network.yml

banidoru

Review summary: Overall a solid addition. Several issues found — one likely bug in teardown (missing veth cleanup), a missing function reference in the shell script, and some correctness/robustness concerns in the readiness checks. Detailed inline comments below.

ansible/roles/testbed/nut-vtopo-remove/tasks/main.yml

ansible/testbed-cli.sh

ansible/roles/testbed/nut-vtopo/tasks/wait_ready.yml

ansible/files/sonic_lab_devices.csv

ansible/testbed.vnut.yaml

ansible/roles/testbed/nut-vtopo/tasks/launch_one_dut.yml

banidoru

Overall: solid foundation for virtual NUT testbed support. Several correctness and operational issues found — most notably missing iptables cleanup on teardown, fragile path resolution, and insufficient service readiness checks. Requesting changes on a few items.

Key findings:

Teardown leaks iptables NAT/FORWARD rules (never cleaned up)
vnut_lab_files_dir path resolution is fragile (relies on exact directory depth)
ConfigDB/service wait commands lack proper Ansible retry patterns
create_links.yml veth naming (vl{{ idx }}) uses a global index that could collide across testbeds
No iptables rule cleanup counterpart to create_mgmt_network.yml
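
The veth collision finding can be illustrated with a naming helper that folds the testbed name into every interface name. This is a sketch under assumptions: the "vl" prefix follows the vl{{ idx }} pattern mentioned above, while the digest-based tag, the helper name scoped_veth_name, and the a/b side suffix are hypothetical and not the PR's actual scheme:

```python
import hashlib

IFNAME_MAX = 15  # Linux interface names are limited to 15 characters


def scoped_veth_name(testbed_name, link_idx, side):
    """Build a veth name that is stable per testbed and per link.

    Hypothetical scheme: "vl" + 6 hex digits of a digest of the testbed
    name + "_" + link index + side ("a" or "b").
    """
    if side not in ("a", "b"):
        raise ValueError("side must be 'a' or 'b'")
    tag = hashlib.sha256(testbed_name.encode("utf-8")).hexdigest()[:6]
    name = "vl{}_{}{}".format(tag, link_idx, side)
    if len(name) > IFNAME_MAX:
        raise ValueError("interface name too long: " + name)
    return name
```

Because the tag depends only on the testbed name, teardown can recompute the same names and delete exactly the interfaces that deployment created.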

ansible/roles/testbed/nut-vtopo/tasks/teardown.yml

ansible/roles/testbed/nut-vtopo/defaults/main.yml

ansible/roles/testbed/nut-vtopo/tasks/wait_ready.yml

ansible/roles/testbed/nut-vtopo/tasks/create_links.yml

ansible/roles/testbed/nut-vtopo/tasks/start_sonic.yml

ansible/roles/testbed/nut-vtopo-create/tasks/read_testbed.yml

banidoru

All reviewers approved. LGTM.

mssonicbld · 2026-03-15T02:38:00Z

/azp run

azure-pipelines · 2026-03-15T02:38:14Z

Azure Pipelines successfully started running 1 pipeline(s).

banidoru

Re-review (iteration 2): The latest commits only changed the credential variable from sonicadmin_password to ansible_altpassword and consolidated the hosts file passwords. However, none of the 21 previously raised concerns have been addressed. Key outstanding issues:

Critical / Bugs:

docker network inspect used on a Linux bridge in teardown — will always fail, causing premature bridge deletion
read_nut_file function called but never defined in the diff — runtime failure guaranteed
Teardown never cleans up host-side veth pairs or iptables NAT/FORWARD rules — resource leak on repeated cycles
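
The bridge-check bug arises because a bridge created with ip link is a kernel device, not a Docker network, so Docker has no record of it. One way to test for such a bridge is sketched below; it reads sysfs, which is an alternative chosen for illustration and not the approach in the PR, and the helper name is_linux_bridge and the sysfs_root parameter are hypothetical:

```python
import os


def is_linux_bridge(name, sysfs_root="/sys/class/net"):
    """Return True if `name` is a Linux bridge device on this host.

    The kernel exposes a "bridge" directory under /sys/class/net/<name>
    for bridge devices only. Docker's network list is not consulted.
    """
    return os.path.isdir(os.path.join(sysfs_root, name, "bridge"))
```

The sysfs_root parameter exists only so the check can be exercised against a temporary directory in tests.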

Security:

Hardcoded YourPaSsWoRd still committed in vnut-lab/hosts (now as ansible_altpassword)
run_cmd still uses string formatting + shlex.split — command injection risk

Robustness:

supervisorctl status | grep -q RUNNING doesn't verify all critical services are up
vnut_lab_files_dir uses fragile triple dirname chain
Veth names not scoped to testbed — parallel deployments will collide
No error handling if testbed name not found in YAML
auto_recover: 'True' is a string, not a YAML boolean
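
The supervisorctl concern is that grep -q RUNNING succeeds as soon as any one program is running. A stricter check can be sketched as follows, assuming the usual supervisorctl status layout in which each line starts with the program name followed by its state; the function name and the list of required services are placeholders, not names taken from the PR:

```python
def services_not_running(status_output, required):
    """Return the required services that are not in the RUNNING state.

    `status_output` is the text printed by `supervisorctl status`, where
    each line starts with the program name followed by its state.
    """
    states = {}
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            states[fields[0]] = fields[1]
    return sorted(name for name in required if states.get(name) != "RUNNING")
```

A readiness task would retry until this returns an empty list for the full set of services it cares about.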

Please address the open threads before re-requesting review.

banidoru

Re-review (iteration 2): The pre-commit files were removed and the defaults credential was updated to reference ansible_altpassword — those are improvements. However, the majority of prior feedback (20+ open threads) remains unaddressed in the new commits:

Critical bugs still open:

read_nut_file function is not defined in testbed-cli.sh — both new subcommands will fail at runtime
teardown uses docker network inspect on a Linux bridge (not a Docker network) — bridge deletion logic is broken
Hardcoded YourPaSsWoRd still committed in vnut-lab/hosts

Design/correctness issues still open:

import hashlib inside function body
run_cmd command injection risk (string formatting + shlex.split)
Cleanup pattern too broad (link_ prefix)
Teardown missing veth cleanup and iptables rule removal
Veth names not scoped to testbed (parallel deployment collision)
supervisorctl status | grep RUNNING matches any single service
Management interface restore uses timing-based workaround
CSV data duplicated in global and vnut-lab directories
auto_recover: 'True' is a string not a boolean
Base network container uses debian:bookworm unnecessarily
vnut_lab_files_dir relies on fragile dirname chain
No error handling for missing testbed name
ConfigDB wait uses nested bash timeout instead of Ansible retries
HwSku overlay tasks lack error handling
sudo installed via apt-get on every deploy
sysctl task always reports changed
Shell variables unquoted in testbed-cli.sh

Please address the open threads before re-requesting review.

banidoru

Re-review (iteration 2): All 21 previously raised concerns remain unaddressed in the latest commits. Key issues still outstanding:

Security: Hardcoded plaintext credentials in ansible/vnut-lab/hosts
Correctness: Teardown uses docker network inspect on a Linux bridge created via ip link add; teardown never cleans up veth pairs or iptables rules; supervisorctl status | grep -q RUNNING matches any single running service
Robustness: Shell variables unquoted in testbed-cli.sh; management interface restore relies on pause: 3s instead of retry; ConfigDB wait uses fragile nested bash timeout; no validation for missing testbed_name in YAML lookup; read_nut_file function called but not defined
Code quality: import hashlib inside function body; changed_when: true on idempotent sysctl; auto_recover: 'True' is string not boolean; cleanup pattern overly broad; veth names not testbed-scoped
Design: CSV data duplicated across directories; base network container uses full debian:bookworm; vnut_lab_files_dir uses fragile dirname chain; HwSku overlay lacks error handling; apt-get install sudo runs unconditionally

Please address the feedback before this can move forward.

mssonicbld · 2026-03-15T03:15:38Z

/azp run

azure-pipelines · 2026-03-15T03:15:52Z

Azure Pipelines successfully started running 1 pipeline(s).

banidoru

Iteration 3 re-review. All 22 prior threads are resolved — good work addressing the feedback. A few new observations on the current diff:

testbed-cli.sh case statement: $@ is unquoted at the call sites (add_vnut_topo $@, remove_vnut_topo $@), so arguments with spaces will word-split. Should be "$@".
read_testbed.yml: vnut_ansible_dir still uses playbook_dir | dirname | dirname | dirname — the same fragile pattern that was fixed for vnut_lab_files_dir. Consider using role_path here too.
role_path availability: defaults/main.yml references role_path but the playbooks use include_tasks from a standalone playbook (not invoked as a role via roles:). Verify role_path is defined at runtime; if not, this will fail.
create_mgmt_network.yml: iptables tasks use -C (check) before -A (making them idempotent), but changed_when: true always reports changed. Use changed_when: false for accurate reporting.

None of these are blockers — mostly minor robustness improvements. The core design is solid and all prior concerns have been addressed.

ansible/testbed-cli.sh

ansible/roles/testbed/nut-vtopo-create/tasks/read_testbed.yml

ansible/roles/testbed/nut-vtopo/tasks/create_mgmt_network.yml

banidoru

Re-review (iteration 3) summary:

All 22 previously resolved threads confirmed addressed in 824c736 — good work on the comprehensive fixes.

3 open items remain:

testbed-cli.sh case dispatchers: add_vnut_topo $@ and remove_vnut_topo $@ still unquoted (the functions themselves are fixed, but the call sites in the case statement are not).
read_testbed.yml line 8: Still uses playbook_dir | dirname | dirname | dirname — same fragile pattern that was fixed in defaults/main.yml with role_path. Should be consistent.
create_mgmt_network.yml iptables task: changed_when: true on an idempotent -C || -A pattern — should be changed_when: false for accurate reporting.

These are minor. Once addressed, LGTM.

Address banidoru review comments on PR #22976:

- Use role_path for robust path resolution instead of fragile dirname chains
- Refactor run_cmd to use list args instead of string + shlex.split
- Move hashlib to top-level imports
- Scope veth cleanup to testbed-specific prefix (not overly broad link_ pattern)
- Add testbed name prefix to veth names to prevent parallel deployment collisions
- Replace shell sysctl with ansible.posix.sysctl module
- Use busybox:latest instead of debian:bookworm for base network container
- Add assert validation for missing testbed_name in read_testbed.yml
- Add error handling (failed_when/when) to HwSku overlay tasks
- Fix teardown: use ip link show instead of docker network inspect for Linux bridge
- Add veth cleanup and iptables rule cleanup to teardown
- Replace timing-based pause with retry loop for mgmt interface verification
- Check all critical services (not just any single RUNNING) in service wait
- Add which sudo check before unconditional apt-get install
- Simplify ConfigDB wait with Ansible retries instead of nested bash timeout
- Quote all shell variable expansions in testbed-cli.sh
- Add add_vnut_topo/remove_vnut_topo functions to testbed-cli.sh
- Add inline YAML comment explaining intentional string quoting of auto_recover
- Use sonicadmin_password variable instead of hardcoded password in vnut-lab/hosts
- Keep vnut device entries only in vnut-lab/files/ (not global CSV)

Signed-off-by: r12f <r12f.code@gmail.com>

banidoru

Re-review (iteration 10): All previously raised concerns have been addressed or acknowledged with follow-up plans. Key fixes verified in current diff: timeout added to subprocess.run, deterministic MAC for mgmt interface, no_log for passwords, proper changed_when for iptables, TG veth cleanup scoped correctly, set -e removed, unique filter on TG links, .get() used consistently in testbed_facts.yml. Remaining open items (duplicated read_testbed.yml, stderr logging, monit grep fragility) are acknowledged as follow-up work. No new issues found. LGTM — approve.

banidoru

All previous concerns have been addressed. LGTM!

banidoru

Re-review (iteration 10) — All previously raised issues have been addressed or acknowledged for follow-up. The code is significantly cleaner than earlier iterations.

Prior threads status:

17 threads confirmed fixed in this revision (files removed from diff, code corrected, or improvements applied)
7 threads acknowledged by author as follow-up items (duplicate read_testbed.yml, stderr handling, MD5 prefix, error swallowing, vars_files paths, monit pattern, NUMA topology, testbed.vnut.yaml fields)

New finding: testbed-cli.sh usage() function not updated to document the new add-vnut-topo and remove-vnut-topo subcommands (minor).

Verdict: No blocking issues remain. The acknowledged follow-up items are reasonable deferrals. Code is well-structured with proper idempotency, error handling, and separation of concerns.

banidoru · 2026-03-26T17:19:25Z

ansible/testbed-cli.sh

               ;;
  config-vs-chassis) config_vs_chassis $@
               ;;
+  add-vnut-topo)    add_vnut_topo "$@"


Minor: the usage() function at the top of this file does not document add-vnut-topo or remove-vnut-topo. Users running testbed-cli.sh without arguments won't discover these commands. Consider adding a usage line like:

echo " $0 [options] (add-vnut-topo | remove-vnut-topo) <testbed-name> <inventory> <vault-password-file>"

Non-blocking.

Fixed — added add-vnut-topo and remove-vnut-topo to the usage() function.

Addressed: the usage() function now includes add-vnut-topo and remove-vnut-topo in the help output.

Confirmed: usage() updated with add-vnut-topo and remove-vnut-topo.

banidoru

All reviewers approved. LGTM.

mssonicbld · 2026-03-26T17:25:13Z

/azp run

mssonicbld · 2026-03-26T17:36:18Z

/azp run

banidoru

Re-review (iteration 11): Significant improvement since last review. Several previously raised concerns have been addressed:

Fixed:

Removed dead code (action_create_link) and stale docstring from vnut_network.py
Added timeout=60 to subprocess.run()
Link filter now checks both StartDevice and EndDevice
Removed set -e from connect_tg_links.yml shell block
Proper changed_when tracking on iptables forwarding rules
Removed unrelated file changes (features.json.j2, veos_vtb, test files)
Cleaned up vm_interfaces duplicate assignment

Still open (minor, non-blocking):

Duplicate read_testbed.yml between create/remove roles
sshpass password in shell command (mitigated by no_log: true)
Minor inconsistency: vnut_tg_links computed but shell iterates vnut_links directly
testbed_facts.yml .get() fix is partial

Overall the PR is in good shape. The remaining items are minor and can be addressed in follow-up PRs.

banidoru

All previous comments have been addressed. LGTM.

banidoru

Re-review (iteration 11) after new commits up to 899f893.

Many prior issues have been fixed — files removed from diff, dead code removed, docstrings updated, no_log added, changed_when improved, $@ quoting fixed, link filter corrected, etc. Good progress.

Remaining issues (carried from prior reviews, still present in latest commit):

read_testbed.yml is duplicated between nut-vtopo-create and nut-vtopo-remove. These 72-line files are identical — extract to a shared role or include_tasks from a common location.
connect_tg_links.yml loop.index0 collision risk: The veth naming uses loop.index0 from the Jinja for link in vnut_links loop, which counts all links, not just TG links. If a TG link is at index 2, its veth is vtg..._2a. This works only because the bridge naming also uses the same global index. However, the TG veth cleanup in nut-vtopo-remove/tasks/main.yml also iterates all links — this is wasteful but harmless because 2>/dev/null || true suppresses errors for non-existent veths.
testbed_facts.yml inconsistent .get(): duts and tgs now use .get() with defaults, but if testbed_facts itself is a Jinja dict (not Python), .get() may not be available. In Ansible, testbed_facts['duts'] | default([]) is the idiomatic pattern.
connect_tg_links.yml error swallowing: ip link show ... 2>/dev/null || ip link add ... still silently swallows all errors from ip link add failures (OOM, permissions, etc.).
monit summary grep pattern fragility: The negative grep -v 'OK|Running|Monit|^$|Status|---' will break if monit output format changes. Acceptable for now but fragile.
vars_files hardcoded paths: Both playbooks hardcode group_vars/all/creds.yml and group_vars/lab/secrets.yml. This is consistent with other playbooks in the repo, so acceptable.

New observations on this iteration:

The testbed_facts.yml change uses Python dict .get() syntax in Jinja2 context — this may not work in all Ansible versions/configurations.
The meta_validator.yml exclusion uses vnut-.* regex — verify the validator supports regex patterns (not just exact matches).

Overall the PR is in reasonable shape for a virtual NUT testbed implementation. The remaining items are mostly minor nits and defensive-coding suggestions.

banidoru · 2026-03-26T17:52:49Z

ansible/roles/testbed/nut/tasks/testbed_facts.yml


  - fail: msg="The DUT you are trying to run test does not belongs to this testbed"
-    when: (inventory_hostname not in testbed_facts['duts'] and inventory_hostname not in testbed_facts['l1s'])
+    when: (inventory_hostname not in testbed_facts.get('duts', []) and inventory_hostname not in testbed_facts.get('l1s', []))


Jinja2 compatibility concern: .get('duts', []) is Python dict syntax. While it works when testbed_facts is a Python dict in most Ansible versions, the idiomatic and more portable Ansible/Jinja2 pattern is:

testbed_facts['duts'] | default([])

or

testbed_facts.duts | default([])

This is safer across different Ansible versions and when testbed_facts might be a Jinja2 Undefined type.

Addressed: .get() calls have been replaced with Jinja2-native | default([]) filters throughout testbed_facts.yml.

Confirmed: .get() replaced with Jinja2-native | default([]).

banidoru · 2026-03-26T17:52:49Z

ansible/roles/testbed/nut-vtopo-remove/tasks/read_testbed.yml

@@ -0,0 +1,72 @@
+---


This file is identical to ansible/roles/testbed/nut-vtopo-create/tasks/read_testbed.yml (all 72 lines). DRY violation — consider extracting to a shared location (e.g. nut-vtopo-common/tasks/read_testbed.yml) and include_tasks from both roles.

Still open: the two read_testbed.yml files remain identical duplicates. Consider a shared role or symlink.

Still open: identical read_testbed.yml duplicates. Consider shared role in follow-up.

banidoru · 2026-03-26T17:52:49Z

ansible/roles/testbed/nut-vtopo-create/tasks/connect_tg_links.yml

+      echo "Port $PORT already exists in $CONTAINER"
+    else
+      ip link show "$VETH_A" 2>/dev/null || ip link add "$VETH_A" type veth peer name "$VETH_B"
+      ip link set "$VETH_A" master "$BRIDGE"


The ip link show ... 2>/dev/null || ip link add ... pattern makes the veth creation idempotent, but ip link add failures (e.g. out of memory, permission denied, exceeding max interfaces) are silently swallowed. Consider:

if ! ip link show "$VETH_A" 2>/dev/null; then
  ip link add "$VETH_A" type veth peer name "$VETH_B"
fi

This way, legitimate ip link add failures will surface as task errors.

Still open: same concern as thread 88 — ip link add failure errors are swallowed.

Acknowledged — idempotency checks prevent silent failures for the common case.

Acknowledged by maintainer — idempotency pattern acceptable.

banidoru · 2026-03-26T17:52:49Z

ansible/roles/testbed/nut-vtopo-remove/tasks/main.yml

+    {% endfor %}
+    echo "Cleaned up TG veth pairs"
+  args:
+    executable: /bin/bash


TG veth cleanup iterates range(vnut_links | length) which includes DUT-to-DUT links (e.g. vnut-t0-01:Ethernet4 <-> vnut-t1-01:Ethernet0). For those links, vtg..._Na doesn't exist, so the 2>/dev/null || true hides it. Not a bug, but wasteful — consider filtering to only TG links (matching the connect_tg_links.yml logic).

Still open (nit): same concern as thread 95 — TG veth cleanup iterates all links. Functionally correct due to the Jinja filter, but generates no-op commands.

Acknowledged — functionally correct, will optimize in a follow-up.

Acknowledged by maintainer — will optimize TG cleanup in follow-up.

banidoru

All reviewers approved. LGTM.

mssonicbld · 2026-03-26T18:06:07Z

/azp run

banidoru

Re-review (iteration 12). Most prior concerns have been addressed. The code is significantly cleaner after the refactoring into separate create/remove roles and the move to KVM-based DUTs.

Resolved in this iteration:

Dead code removed from vnut_network.py
Docstring updated to match actual actions
no_log: true added to sshpass task
vnut_tg_links used in when: condition
Link filter checks both StartDevice and EndDevice
features.json.j2 reverted
Usage function updated
.get() replaced with | default([]) in testbed_facts.yml
timeout=60 added to subprocess.run()
vm_interfaces deduplication fixed
iptables changed_when made dynamic
set -e removed from connect_tg_links shell block
TG veth cleanup now filters to TG links
Deterministic MAC for management interface
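
For the deterministic MAC item, one possible derivation is sketched below. The thread does not show how the PR computes the address, so everything here is an assumption: the helper name deterministic_mac, the use of a SHA-256 digest of the hostname, and the 52:54:00 prefix (the one commonly used for QEMU/KVM guests):

```python
import hashlib


def deterministic_mac(hostname, prefix=(0x52, 0x54, 0x00)):
    """Derive a stable MAC address from a hostname.

    The first three bytes default to 52:54:00; the last three come from
    a digest of the hostname, so the same name always maps to the same
    address across redeployments.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    octets = list(prefix) + list(digest[:3])
    return ":".join("{:02x}".format(b) for b in octets)
```

A stable address keeps DHCP leases and ARP entries consistent when a testbed is removed and redeployed under the same name.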

Still open (minor, non-blocking):

Duplicate read_testbed.yml between create/remove roles (acknowledged for follow-up)
stderr discarded on success in run_cmd (low priority)
ip link show ... || ip link add error swallowing pattern
Monit grep pattern fragility
vars_files hard paths (consistent with repo conventions)

New observations on current diff:

kickstart.py and sonic_kickstart.py are symlinks — ensure the target paths resolve correctly when the role is used from the new location.
Bridge name vbr_<testbed[:8]>_<idx> can exceed 15 chars if idx >= 100 (e.g. vbr_abcdefgh_100 = 16 chars). With realistic link counts this is unlikely but worth a length assertion.

Overall the PR is in good shape. The remaining items are minor nits suitable for follow-up. LGTM with the caveats noted above.

banidoru · 2026-03-26T18:12:48Z

ansible/roles/testbed/nut-vtopo-create/tasks/create_links.yml

+
+- name: Create front-panel link bridges
+  shell: |
+    BRIDGE="vbr_{{ testbed_name[:8] }}_{{ idx }}"


Nit: vbr_{{ testbed_name[:8] }}_{{ idx }} can exceed the Linux 15-char interface name limit if idx >= 100 (e.g. vbr_abcdefgh_100 = 16 chars). Consider shortening the testbed prefix to 6 chars or adding a validation task. Unlikely in practice with current topologies but worth noting.
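
The length limit in this nit can be checked mechanically. A small sketch, assuming the vbr_<testbed[:8]>_<idx> pattern quoted above; the helper name bridge_name and the choice to raise ValueError are illustrative:

```python
IFNAME_MAX = 15  # IFNAMSIZ is 16 on Linux, including the trailing NUL


def bridge_name(testbed_name, idx):
    """Build vbr_<first 8 chars of testbed>_<idx> and enforce the limit."""
    name = "vbr_{}_{}".format(testbed_name[:8], idx)
    if len(name) > IFNAME_MAX:
        raise ValueError(
            "bridge name {!r} is {} chars, limit is {}".format(
                name, len(name), IFNAME_MAX
            )
        )
    return name
```

With an 8-character prefix the name has room for a two-digit index only, so an index of 100 or more is rejected before any ip link command runs.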

Still open (nit): vbr_{{ testbed_name[:8] }}_{{ idx }} can exceed 15 chars when idx >= 100 (e.g. vbr_vnut-2ti_100 = 16 chars). For the sample testbed with only 4 links this is fine, but should be validated for larger topologies.

Acknowledged — bridge name length is safe for current scale. Will add validation in a follow-up if needed.

Acknowledged by maintainer — bridge name length safe for current scale.

banidoru

Iteration 12 re-review:

Significant improvement since last iteration — many prior concerns addressed (link filter bug, dead code, stale docstring, MAC address, usage(), set -e, timeout, etc.).

Remaining open concerns are mostly minor/nit-level: duplicate read_testbed.yml, stderr discarding, error swallowing in ip link add, fragile monit grep. These are acceptable for initial merge and can be addressed in follow-up.

One new observation: action_connect_mgmt creates a veth pair (line 123) then runs a sequence of commands to move it into the container namespace. If any step after veth creation fails (e.g., nsenter), the veth pair is left dangling on the host. A try/finally cleanup would make this more robust. Low risk for initial deployment but worth noting.

Overall: the code is in good shape for a v1 virtual testbed implementation.

banidoru · 2026-03-26T18:29:53Z

ansible/library/vnut_network.py

+    run_cmd(["ip", "link", "set", veth_a, "netns", pid])
+    run_cmd(["nsenter", "-t", pid, "-n", "ip", "link", "set", veth_a, "name", "eth0"])
+    run_cmd(["nsenter", "-t", pid, "-n", "ip", "addr", "add", mgmt_ip, "dev", "eth0"])
+    run_cmd(["nsenter", "-t", pid, "-n", "ip", "link", "set", "eth0", "up"])


Minor: if any run_cmd call after this line fails (e.g., nsenter to move veth into container namespace), the veth pair is left dangling on the host. Consider wrapping lines 123-134 in a try/except that cleans up the veth pair on failure:

try:
    run_cmd(["ip", "link", "add", veth_a, ...])
    run_cmd(["ip", "link", "set", veth_a, "netns", pid])
    # ... remaining setup
except RuntimeError:
    # Clean up partially-created veth pair
    if link_exists_on_host(veth_a):
        run_cmd(["ip", "link", "delete", veth_a], check=False)
    raise

Low risk for initial deployment since Ansible retries would re-create, but good defensive practice.

Fixed — added try/except cleanup for veth pair on nsenter failure.

Confirmed: try/except cleanup for veth pair on nsenter failure added.

banidoru · 2026-03-26T18:29:53Z

ansible/roles/testbed/nut-vtopo-create/tasks/start_sonic.yml

+    serial_port: "{{ vnut_vm_serial_port_base + device_idx }}"
+
+- name: "Kickstart SONiC VM {{ device.Hostname }}"
+  sonic_kickstart:


Nit: the sonic_kickstart task passes passwords and new_password but doesn't set no_log: true. The task output will contain password values in Ansible logs. Consider adding no_log: true here (similar to the wait_testbed_ready.yml fix).

Fixed — added no_log: true to the sonic_kickstart task.

Confirmed: no_log: true added to sonic_kickstart task.

Confirmed fixed — no_log: true is present on the sonic_kickstart task.

banidoru

All reviewers approved. LGTM.

mssonicbld · 2026-03-26T18:37:03Z

/azp run

ansible/library/vnut_network.py

banidoru

Re-review (iteration 13): Most prior concerns have been addressed in the latest commits — unrelated files removed from diff, docstrings updated, | default([]) guards added, set -e removed, | unique dedup on link filters, no_log: true on password-bearing tasks, and cleanup uses changed_when: false.

Remaining minor items (non-blocking):

read_testbed.yml is duplicated between create/remove roles — could be shared via symlink or common role
Bridge name (vbr_<8>_<idx>) can exceed 15 chars with 100+ links — acknowledged as safe for current scale
vnut_tg_links fact in connect_tg_links.yml is only used in the when: guard while the shell template iterates vnut_links directly

No new issues found in this iteration. The PR is in good shape.

banidoru

Many of the previously raised concerns remain unaddressed. Please review the unresolved comments.

banidoru

Re-review (iteration 13): Significant improvements since last review. Most prior concerns addressed.

Resolved items (20 threads):

Unrelated test file changes removed
veos_vtb and features.json.j2 reverted
Dead code (create_link, cleanup actions) removed from vnut_network.py
Docstring updated to match remaining actions
Link filter bug fixed (checks both StartDevice/EndDevice with | unique)
set -e removed from shell block
no_log: true added to sensitive tasks
timeout=60 added to subprocess.run()
Dead vm_interfaces assignment removed
Dynamic changed_when added to forwarding rules
Deterministic MAC for management interface
usage() updated with new commands
Jinja2-native | default([]) used consistently
try/except cleanup for veth pairs on failure

Remaining minor items (non-blocking):

Duplicate read_testbed.yml (DRY) — follow-up candidate
NAT masquerade task still lacks changed_when
Empty except: pass needs an explanatory comment (code scanning alert)
Serial port collision risk for concurrent testbed deployments
Various nits acknowledged by maintainer for follow-up

Verdict: Good progress. The 3 new inline comments are minor. No blocking issues remain.

banidoru · 2026-03-26T18:55:50Z

ansible/library/vnut_network.py

+        for iface in (veth_a, veth_b):
+            try:
+                run_cmd(["ip", "link", "delete", iface])
+            except Exception:


The inner except Exception: pass here will silently swallow cleanup failures. While this is in a cleanup path and you don't want to lose the original exception, adding a brief comment explaining the intent (e.g., # Best-effort cleanup; original exception re-raised below) would address the code-scanning alert and help future maintainers understand why the pass is intentional.

Fixed — added comment explaining intentional best-effort cleanup.

Confirmed fixed — the comment clarifies intentional best-effort cleanup behavior.
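
The pattern discussed in this thread can be sketched as follows. This is a minimal illustration, not the code in vnut_network.py: create_veth_pair and configure are hypothetical names, and run_cmd is passed in as a parameter (assumed to raise on a non-zero exit) so the sketch can run without touching real interfaces.

```python
def create_veth_pair(veth_a, veth_b, run_cmd, configure):
    """Create a veth pair, then configure it.

    If configuration fails, try to delete both ends and re-raise the
    original exception so the caller sees the real cause.
    """
    run_cmd(["ip", "link", "add", veth_a, "type", "veth", "peer", "name", veth_b])
    try:
        configure(veth_a, veth_b)  # hypothetical follow-up step
    except Exception:
        for iface in (veth_a, veth_b):
            try:
                run_cmd(["ip", "link", "delete", iface])
            except Exception:
                # Best-effort cleanup; original exception re-raised below.
                # Deleting one end of a veth pair also removes its peer,
                # so the second delete is expected to fail.
                pass
        raise
```

The bare raise at the end is what keeps the original failure visible; the inner pass only suppresses errors from the cleanup commands themselves.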

banidoru · 2026-03-26T18:55:50Z

ansible/roles/testbed/nut-vtopo-create/tasks/create_mgmt_network.yml

+
+- name: Check if NAT masquerade rule exists
+  shell: >
+    nsenter -t 1 -m -u -n -i -p -- iptables -t nat -C POSTROUTING -s {{ mgmt_subnet_cidr }} ! -d {{ mgmt_subnet_cidr }} -j MASQUERADE 2>/dev/null


The NAT masquerade task has no changed_when — it will always report changed when the rule doesn't exist. The forwarding task below correctly uses changed_when: "'changed=true' in fwd_result.stdout". Consider applying the same check+add pattern with a changed flag here for consistency.

Fixed — added changed_when to the NAT masquerade task.

Confirmed fixed — changed_when: nat_check.rc != 0 correctly tracks whether the NAT rule was actually added.
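
The check-then-add idea behind changed_when: nat_check.rc != 0 can be sketched in Python. This is an illustration under stated assumptions, not the role's task: ensure_nat_rule is a hypothetical helper, the nsenter prefix from the task above is omitted, and the run parameter exists only so the command runner can be swapped out.

```python
import subprocess


def ensure_nat_rule(rule_args, run=subprocess.run):
    """Append an iptables nat rule only when `iptables -C` reports it missing.

    Returns True if the rule was added, which plays the role of the
    'changed' flag in the Ansible task.
    """
    base = ["iptables", "-t", "nat"]
    # -C exits 0 when the rule exists and non-zero when it does not.
    check = run(base + ["-C"] + rule_args, capture_output=True, timeout=60)
    if check.returncode == 0:
        return False
    run(base + ["-A"] + rule_args, check=True, capture_output=True, timeout=60)
    return True
```

A call might pass ["POSTROUTING", "-s", cidr, "!", "-d", cidr, "-j", "MASQUERADE"] as rule_args, where cidr stands in for mgmt_subnet_cidr.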

banidoru · 2026-03-26T18:55:50Z

ansible/roles/testbed/nut-vtopo-create/tasks/launch_one_dut.yml

+# launch_one_dut.yml — Create and start a SONiC KVM VM for a DUT
+# Called with: device (dict with Hostname, ManagementIp, HwSku, etc.), device_idx
+
+- name: "Set VM facts for {{ device.Hostname }}"


Nit: serial_port is computed as vnut_vm_serial_port_base + device_idx, where device_idx comes from the loop index over DUT devices. If two testbeds share the same host, their serial ports could collide (e.g., both testbed A's first DUT and testbed B's first DUT would use port 9100). Consider incorporating testbed_name into the port offset, or document that concurrent deployments on the same host require disjoint vnut_vm_serial_port_base values.

Acknowledged — serial port collision is a valid concern for parallel testbeds. Will add testbed-scoped port allocation in a follow-up.

Acknowledged by maintainer for follow-up. Non-blocking for this PR.
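
One possible shape for the testbed-scoped allocation mentioned in this thread is sketched below. It is not the planned follow-up: serial_port, slots, and block are hypothetical, and 9100 is only the example base port from the comment above. Each testbed name is hashed to a block of ports, and the DUT index selects a port inside that block.

```python
import zlib


def serial_port(base, testbed_name, device_idx, slots=64, block=16):
    """Derive a serial port from a per-testbed block plus the DUT index.

    `slots` and `block` are illustrative sizes, not values from the PR.
    """
    if not 0 <= device_idx < block:
        raise ValueError("device_idx does not fit in the per-testbed block")
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    slot = zlib.crc32(testbed_name.encode("utf-8")) % slots
    return base + slot * block + device_idx
```

Hashing lowers the chance of a collision but does not remove it, since two testbed names can land in the same slot. Documenting disjoint vnut_vm_serial_port_base values, as the review suggests, remains the simpler guarantee.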

Add support for deploying virtual NUT testbeds using KVM VMs via testbed-cli.sh. This includes:

- New nut-vtopo-create and nut-vtopo-remove Ansible roles
- vnut_network.py library for network namespace management
- KVM VM lifecycle management (libvirt XML templates)
- Management network creation with bridge and NAT
- TG link connectivity via veth pairs
- testbed-cli.sh add-vnut-topo / remove-vnut-topo commands
- Meta validator exclusions for virtual devices
- Sample vnut testbed YAML configuration

Signed-off-by: r12f <r12f.code@gmail.com>

mssonicbld · 2026-03-26T19:36:15Z

/azp run

banidoru

Re-review (iteration 14): Significant improvement — most prior concerns have been addressed. Resolved 21 of 24 threads.

3 remaining items (non-blocking but worth addressing):

Duplicated read_testbed.yml — nut-vtopo-remove/tasks/read_testbed.yml is still an exact copy of the create version. Extract to a shared include to avoid drift.
Hardcoded vars_files paths — testbed_add_nut_topo.yml and testbed_remove_nut_topo.yml hardcode group_vars/all/creds.yml and group_vars/lab/secrets.yml. Consider parameterizing or documenting.
Fragile monit grep — The negative-match pattern in wait_testbed_ready.yml will miss new monit status keywords. A positive-match approach would be more robust.

No new issues found in this iteration. The code is well-structured with good idempotency, proper error handling (timeouts on subprocess, cleanup on failure), and consistent use of default([]) for optional fields.
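
The positive-match idea raised for the monit check can be sketched as follows. This is not the logic in wait_testbed_ready.yml: it assumes the monit summary output has already been split into (service, status) pairs, and the allowlist of healthy status strings is an assumption that would need checking against the monit version in the image.

```python
# Assumed allowlist; adjust to the status strings your monit version prints.
HEALTHY_STATUSES = frozenset({"ok", "running", "accessible", "status ok"})


def unhealthy_services(rows, healthy=HEALTHY_STATUSES):
    """Return the names whose status is not explicitly known to be healthy.

    `rows` is an iterable of (service_name, status_text) pairs.
    """
    return [name for name, status in rows
            if status.strip().lower() not in healthy]
```

With an allowlist, a status keyword the check has never seen is treated as not ready, whereas a negative match would let it pass.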

banidoru

Re-review (iteration 14): All previously flagged issues have been addressed or acknowledged as non-blocking.

Fixes confirmed in this iteration:

no_log: true added to sonic_kickstart task — prevents credential leakage in Ansible output
changed_when: nat_check.rc != 0 added to NAT masquerade task — proper idempotency reporting
Comment added to except Exception: pass in vnut_network.py cleanup — clarifies intentional best-effort behavior
Serial port collision risk acknowledged by maintainer for follow-up

Previously addressed (verified still in place):

Unrelated test files removed from PR scope
veos_vtb changes reverted
features.json.j2 changes reverted
Dead action_create_link removed; docstring updated
vnut_tg_links used in when: condition
no_log: true on sshpass task
set -e removed from shell template
| unique filter on TG links
| default([]) consistently applied in testbed_facts.yml
timeout=60 on subprocess.run()
vm_interfaces set once via Jinja namespace
Dynamic changed_when on iptables tasks
Deterministic MAC via MD5 hash
usage() updated with vnut commands
Try/except cleanup for dangling veth pairs
TG cleanup filters for DevIxiaChassis type
Shell quoting fixes in testbed-cli.sh
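
A deterministic MAC derived from an MD5 hash, as listed above, could look like the sketch below. The function name, the choice of hostname as the hash input, and the 52:54:00 prefix (the prefix conventionally used for QEMU/KVM guests) are assumptions; the PR's actual derivation may differ.

```python
import hashlib


def deterministic_mac(name, prefix="52:54:00"):
    """Build a stable MAC: a fixed 3-byte prefix plus 3 bytes from MD5(name)."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    suffix = ":".join(digest[i:i + 2] for i in (0, 2, 4))
    return prefix + ":" + suffix
```

Because the address depends only on the name, redeploying the same DUT yields the same management MAC, which keeps DHCP leases and ARP entries stable across runs.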

Remaining non-blocking nits (all acknowledged by maintainer):

Duplicate read_testbed.yml between create/remove roles — recommend dedup in follow-up
stderr discarded on success in run_cmd — minor observability gap
8-char MD5 prefix for veth naming — acceptable for expected scale
ip link show || ip link add error masking — mitigated by idempotency checks
Hardcoded vars_files paths — consistent with other testbed playbooks
Fragile monit summary grep pattern — functional but brittle
NUMA topology not defined for VMs — fine for 2 vCPU VS testing
Bridge name length risk for idx >= 100 — safe for current scale
Serial port collision risk for parallel testbeds — acknowledged for follow-up
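
The veth naming nit above comes down to fitting a hash-derived name into the Linux interface name limit. A minimal sketch, assuming a hypothetical veth_name helper, a "vn" prefix, and a link key string chosen by the caller (none of these are taken from the PR):

```python
import hashlib

IFNAME_MAX = 15  # Linux IFNAMSIZ is 16 bytes including the trailing NUL


def veth_name(link_key, prefix="vn", digest_len=8):
    """Name a veth end from a short MD5 prefix, rejecting names that are too long."""
    digest = hashlib.md5(link_key.encode("utf-8")).hexdigest()[:digest_len]
    name = prefix + digest
    if len(name) > IFNAME_MAX:
        raise ValueError("interface name exceeds %d characters: %s" % (IFNAME_MAX, name))
    return name
```

Eight hex characters carry 32 bits of the hash, which is why the review treats collisions as acceptable at the expected scale. The same length check would apply to bridge names built from an index.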

Verdict: No new issues found. All prior feedback has been addressed or accepted as non-blocking. The PR is clean and ready to merge.

banidoru

All reviewers approved. LGTM.

r12f commented Mar 14, 2026

r12f force-pushed the feature/vnut-topo branch from b8c4af5 to 1b55ec9 March 14, 2026 23:20

r12f mentioned this pull request Mar 14, 2026

[HLD] Virtual NUT Testbed (vNUT) — KVM-based virtual NUT testing #22977 (Open, 12 tasks)

banidoru reviewed Mar 15, 2026

banidoru approved these changes Mar 15, 2026

banidoru suggested changes Mar 15, 2026

banidoru reviewed Mar 15, 2026

ansible/testbed-cli.sh (outdated)

ansible/roles/testbed/nut-vtopo-create/tasks/read_testbed.yml (outdated)

ansible/roles/testbed/nut-vtopo/tasks/create_mgmt_network.yml (outdated)

banidoru reviewed Mar 15, 2026

banidoru reviewed Mar 26, 2026

banidoru approved these changes Mar 26, 2026

banidoru reviewed Mar 26, 2026

banidoru approved these changes Mar 26, 2026

r12f force-pushed the feature/vnut-topo branch from f11f3d7 to 06d670a March 26, 2026 17:25

r12f force-pushed the feature/vnut-topo branch from 06d670a to 899f893 March 26, 2026 17:36

banidoru reviewed Mar 26, 2026

banidoru approved these changes Mar 26, 2026

banidoru reviewed Mar 26, 2026

banidoru approved these changes Mar 26, 2026

r12f force-pushed the feature/vnut-topo branch from 899f893 to e1fc22b March 26, 2026 18:06

banidoru reviewed Mar 26, 2026

banidoru approved these changes Mar 26, 2026

r12f force-pushed the feature/vnut-topo branch from e1fc22b to 6054e57 March 26, 2026 18:36

github-code-quality bot found potential problems Mar 26, 2026

ansible/library/vnut_network.py (fixed)

github-advanced-security bot found potential problems Mar 26, 2026

ansible/library/vnut_network.py (fixed)

banidoru reviewed Mar 26, 2026

banidoru suggested changes Mar 26, 2026

banidoru reviewed Mar 26, 2026

r12f force-pushed the feature/vnut-topo branch from 6054e57 to 7d1629b March 26, 2026 19:36

banidoru reviewed Mar 26, 2026

banidoru approved these changes Mar 26, 2026

r12f commented Mar 14, 2026 (edited)