Skip to content

Add virtual NUT testbed (vNUT) support to testbed-cli using KVM VMs#22976

Open
r12f wants to merge 1 commit intomasterfrom
feature/vnut-topo
Open

Add virtual NUT testbed (vNUT) support to testbed-cli using KVM VMs#22976
r12f wants to merge 1 commit intomasterfrom
feature/vnut-topo

Conversation

@r12f
Copy link
Collaborator

@r12f r12f commented Mar 14, 2026

Description of PR

Summary:
Add add-vnut-topo / remove-vnut-topo commands to testbed-cli.sh that deploy a fully virtual NUT (Network Under Test) testbed using KVM-based virtual SONiC instances. This enables running sonic-mgmt NUT tests against a virtual topology without physical hardware.

The virtual testbed consists of KVM virtual SONiC DUTs and docker-ptf traffic generators connected via veth pairs, sharing the existing management bridge (br1) and management subnet with other virtual testbeds.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Enable developers to run NUT tests locally without physical switches or traffic generators. The vNUT testbed provides a fully virtualized alternative using KVM VMs and PTF containers.

How did you do it?

  • Added add-vnut-topo / remove-vnut-topo actions to testbed-cli.sh
  • Created ansible/roles/testbed/nut-vtopo/ Ansible role with tasks for:
    • Management network setup (shared br1 bridge)
    • KVM VM launch for DUTs using sonic-vs.img
    • PTF container launch for traffic generators
    • veth pair creation via custom vnut_network.py module
    • SONiC service readiness checks and admin user provisioning
  • Added example testbed YAML, inventory entries, and device/link CSV entries for nut-2tiers topology
  • Reuses existing NUT topology definitions and testbed framework

How did you verify/test it?

Validated end-to-end inside sonic-mgmt container on a KVM-capable host:

  • add-vnut-topo: ok=68, failed=0 ✅
  • deploy-cfg: ok=44, failed=0 ✅
  • test_pretest.py: 11 passed, 6 skipped, 0 failures ✅
  • BGP sessions established between T0↔T1 DUTs

Any platform specific information?

Requires KVM-capable host. DUTs use Force10-S6000 platform profile (virtual SONiC).

Documentation

HLD: #22977

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing .hooks/sonic_mgmt_pre_commit_hooks.egg-info/SOURCES.txt
Fixing .hooks/sonic_mgmt_pre_commit_hooks.egg-info/dependency_links.txt

check yaml...............................................................Passed
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Passed
flake8...............................................(no files to check)Skipped
flake8 (tests/common2)...............................(no files to check)Skipped
check conditional mark sort..............................................Passed
isort (python).......................................(no files to check)Skipped
black................................................(no files to check)Skipped
mypy.................................................(no files to check)Skipped
pylint...............................................(no files to check)Skipped

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
    the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  1. Go to repository root folder
  2. Install the pre-commit hooks:
pre-commit install
  1. Use pre-commit to check staged file:
pre-commit
  1. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

@r12f r12f force-pushed the feature/vnut-topo branch from b8c4af5 to 1b55ec9 Compare March 14, 2026 23:20
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed +772 lines across 22 files. The overall design is sound — two-phase container launch (base network container + cSONiC overlay) with veth-based linking is a clean approach. Key concerns:

  1. Security: Hardcoded credentials in vnut-lab/hosts inventory file
  2. Command injection risk in vnut_network.py — user-supplied device/port names flow into shell commands via string formatting
  3. Teardown bridge check bug — uses docker network inspect on a Linux bridge created via ip link
  4. Cleanup over-matchinglink_ prefix pattern could delete unrelated host interfaces
  5. Minor: import placement, shell quoting

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary: Overall a solid addition. Several issues found — one likely bug in teardown (missing veth cleanup), a missing function reference in the shell script, and some correctness/robustness concerns in the readiness checks. Detailed inline comments below.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall: solid foundation for virtual NUT testbed support. Several correctness and operational issues found — most notably missing iptables cleanup on teardown, fragile path resolution, and insufficient service readiness checks. Requesting changes on a few items.

Key findings:

  • Teardown leaks iptables NAT/FORWARD rules (never cleaned up)
  • vnut_lab_files_dir path resolution is fragile (relies on exact directory depth)
  • ConfigDB/service wait commands lack proper Ansible retry patterns
  • create_links.yml veth naming (vl{{ idx }}) uses a global index that could collide across testbeds
  • No iptables rule cleanup counterpart to create_mgmt_network.yml

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All reviewers approved. LGTM.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 2): The latest commits only changed the credential variable from sonicadmin_password to ansible_altpassword and consolidated the hosts file passwords. However, none of the 21 previously raised concerns have been addressed. Key outstanding issues:

Critical / Bugs:

  • docker network inspect used on a Linux bridge in teardown — will always fail, causing premature bridge deletion
  • read_nut_file function called but never defined in the diff — runtime failure guaranteed
  • Teardown never cleans up host-side veth pairs or iptables NAT/FORWARD rules — resource leak on repeated cycles

Security:

  • Hardcoded YourPaSsWoRd still committed in vnut-lab/hosts (now as ansible_altpassword)
  • run_cmd still uses string formatting + shlex.split — command injection risk

Robustness:

  • supervisorctl status | grep -q RUNNING doesn't verify all critical services are up
  • vnut_lab_files_dir uses fragile triple dirname chain
  • Veth names not scoped to testbed — parallel deployments will collide
  • No error handling if testbed name not found in YAML
  • auto_recover: 'True' is a string, not a YAML boolean

Please address the open threads before re-requesting review.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 2): The pre-commit files were removed and the defaults credential was updated to reference ansible_altpassword — those are improvements. However, the majority of prior feedback (20+ open threads) remains unaddressed in the new commits:

Critical bugs still open:

  • read_nut_file function is not defined in testbed-cli.sh — both new subcommands will fail at runtime
  • teardown uses docker network inspect on a Linux bridge (not a Docker network) — bridge deletion logic is broken
  • Hardcoded YourPaSsWoRd still committed in vnut-lab/hosts

Design/correctness issues still open:

  • import hashlib inside function body
  • run_cmd command injection risk (string formatting + shlex.split)
  • Cleanup pattern too broad (link_ prefix)
  • Teardown missing veth cleanup and iptables rule removal
  • Veth names not scoped to testbed (parallel deployment collision)
  • supervisorctl status | grep RUNNING matches any single service
  • Management interface restore uses timing-based workaround
  • CSV data duplicated in global and vnut-lab directories
  • auto_recover: 'True' is a string not a boolean
  • Base network container uses debian:bookworm unnecessarily
  • vnut_lab_files_dir relies on fragile dirname chain
  • No error handling for missing testbed name
  • ConfigDB wait uses nested bash timeout instead of Ansible retries
  • HwSku overlay tasks lack error handling
  • sudo installed via apt-get on every deploy
  • sysctl task always reports changed
  • Shell variables unquoted in testbed-cli.sh

Please address the open threads before re-requesting review.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 2): All 21 previously raised concerns remain unaddressed in the latest commits. Key issues still outstanding:

  • Security: Hardcoded plaintext credentials in ansible/vnut-lab/hosts
  • Correctness: Teardown uses docker network inspect on a Linux bridge created via ip link add; teardown never cleans up veth pairs or iptables rules; supervisorctl status | grep -q RUNNING matches any single running service
  • Robustness: Shell variables unquoted in testbed-cli.sh; management interface restore relies on pause: 3s instead of retry; ConfigDB wait uses fragile nested bash timeout; no validation for missing testbed_name in YAML lookup; read_nut_file function called but not defined
  • Code quality: import hashlib inside function body; changed_when: true on idempotent sysctl; auto_recover: 'True' is string not boolean; cleanup pattern overly broad; veth names not testbed-scoped
  • Design: CSV data duplicated across directories; base network container uses full debian:bookworm; vnut_lab_files_dir uses fragile dirname chain; HwSku overlay lacks error handling; apt-get install sudo runs unconditionally

Please address the feedback before this can move forward.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iteration 3 re-review. All 22 prior threads are resolved — good work addressing the feedback. A few new observations on the current diff:

  1. testbed-cli.sh case statement: $@ is unquoted at the call sites (add_vnut_topo $@, remove_vnut_topo $@), so arguments with spaces will word-split. Should be "$@".
  2. read_testbed.yml: vnut_ansible_dir still uses playbook_dir | dirname | dirname | dirname — the same fragile pattern that was fixed for vnut_lab_files_dir. Consider using role_path here too.
  3. role_path availability: defaults/main.yml references role_path but the playbooks use include_tasks from a standalone playbook (not invoked as a role via roles:). Verify role_path is defined at runtime; if not, this will fail.
  4. create_mgmt_network.yml: iptables tasks use -C (check) before -A (making them idempotent), but changed_when: true always reports changed. Use changed_when: false for accurate reporting.

None of these are blockers — mostly minor robustness improvements. The core design is solid and all prior concerns have been addressed.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 3) summary:

All 22 previously resolved threads confirmed addressed in 824c736 — good work on the comprehensive fixes.

3 open items remain:

  1. testbed-cli.sh case dispatchers: add_vnut_topo $@ and remove_vnut_topo $@ still unquoted (the functions themselves are fixed, but the call sites in the case statement are not).

  2. read_testbed.yml line 8: Still uses playbook_dir | dirname | dirname | dirname — same fragile pattern that was fixed in defaults/main.yml with role_path. Should be consistent.

  3. create_mgmt_network.yml iptables task: changed_when: true on an idempotent -C || -A pattern — should be changed_when: false for accurate reporting.

These are minor. Once addressed, LGTM.

r12f added a commit that referenced this pull request Mar 15, 2026
Address banidoru review comments on PR #22976:
- Use role_path for robust path resolution instead of fragile dirname chains
- Refactor run_cmd to use list args instead of string + shlex.split
- Move hashlib to top-level imports
- Scope veth cleanup to testbed-specific prefix (not overly broad link_ pattern)
- Add testbed name prefix to veth names to prevent parallel deployment collisions
- Replace shell sysctl with ansible.posix.sysctl module
- Use busybox:latest instead of debian:bookworm for base network container
- Add assert validation for missing testbed_name in read_testbed.yml
- Add error handling (failed_when/when) to HwSku overlay tasks
- Fix teardown: use ip link show instead of docker network inspect for Linux bridge
- Add veth cleanup and iptables rule cleanup to teardown
- Replace timing-based pause with retry loop for mgmt interface verification
- Check all critical services (not just any single RUNNING) in service wait
- Add which sudo check before unconditional apt-get install
- Simplify ConfigDB wait with Ansible retries instead of nested bash timeout
- Quote all shell variable expansions in testbed-cli.sh
- Add add_vnut_topo/remove_vnut_topo functions to testbed-cli.sh
- Add inline YAML comment explaining intentional string quoting of auto_recover
- Use sonicadmin_password variable instead of hardcoded password in vnut-lab/hosts
- Keep vnut device entries only in vnut-lab/files/ (not global CSV)

Signed-off-by: r12f <r12f.code@gmail.com>
Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 10): All previously raised concerns have been addressed or acknowledged with follow-up plans. Key fixes verified in current diff: timeout added to subprocess.run, deterministic MAC for mgmt interface, no_log for passwords, proper changed_when for iptables, TG veth cleanup scoped correctly, set -e removed, unique filter on TG links, .get() used consistently in testbed_facts.yml. Remaining open items (duplicated read_testbed.yml, stderr logging, monit grep fragility) are acknowledged as follow-up work. No new issues found. LGTM — approve.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All previous concerns have been addressed. LGTM!

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 10) — All previously raised issues have been addressed or acknowledged for follow-up. The code is significantly cleaner than earlier iterations.

Prior threads status:

  • 17 threads confirmed fixed in this revision (files removed from diff, code corrected, or improvements applied)
  • 7 threads acknowledged by author as follow-up items (duplicate read_testbed.yml, stderr handling, MD5 prefix, error swallowing, vars_files paths, monit pattern, NUMA topology, testbed.vnut.yaml fields)

New finding: testbed-cli.sh usage() function not updated to document the new add-vnut-topo and remove-vnut-topo subcommands (minor).

Verdict: No blocking issues remain. The acknowledged follow-up items are reasonable deferrals. Code is well-structured with proper idempotency, error handling, and separation of concerns.

;;
config-vs-chassis) config_vs_chassis $@
;;
add-vnut-topo) add_vnut_topo "$@"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: the usage() function at the top of this file does not document add-vnut-topo or remove-vnut-topo. Users running testbed-cli.sh without arguments won't discover these commands. Consider adding a usage line like:

echo "    $0 [options] (add-vnut-topo | remove-vnut-topo) <testbed-name> <inventory> <vault-password-file>"

Non-blocking.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added add-vnut-topo and remove-vnut-topo to the usage() function.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed: the usage() function now includes add-vnut-topo and remove-vnut-topo in the help output.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed: usage() updated with add-vnut-topo and remove-vnut-topo.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All reviewers approved. LGTM.

@r12f r12f force-pushed the feature/vnut-topo branch from f11f3d7 to 06d670a Compare March 26, 2026 17:25
@mssonicbld
Copy link
Collaborator

/azp run

@r12f r12f force-pushed the feature/vnut-topo branch from 06d670a to 899f893 Compare March 26, 2026 17:36
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 11): Significant improvement since last review. Several previously raised concerns have been addressed:

Fixed:

  • Removed dead code (action_create_link) and stale docstring from vnut_network.py
  • Added timeout=60 to subprocess.run()
  • Link filter now checks both StartDevice and EndDevice
  • Removed set -e from connect_tg_links.yml shell block
  • Proper changed_when tracking on iptables forwarding rules
  • Removed unrelated file changes (features.json.j2, veos_vtb, test files)
  • Cleaned up vm_interfaces duplicate assignment

Still open (minor, non-blocking):

  • Duplicate read_testbed.yml between create/remove roles
  • sshpass password in shell command (mitigated by no_log: true)
  • Minor inconsistency: vnut_tg_links computed but shell iterates vnut_links directly
  • testbed_facts.yml .get() fix is partial

Overall the PR is in good shape. The remaining items are minor and can be addressed in follow-up PRs.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All previous comments have been addressed. LGTM.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 11) after new commits up to 899f893.

Many prior issues have been fixed — files removed from diff, dead code removed, docstrings updated, no_log added, changed_when improved, $@ quoting fixed, link filter corrected, etc. Good progress.

Remaining issues (carried from prior reviews, still present in latest commit):

  1. read_testbed.yml is duplicated between nut-vtopo-create and nut-vtopo-remove. These 72-line files are identical — extract to a shared role or include_tasks from a common location.
  2. connect_tg_links.yml loop.index0 collision risk: The veth naming uses loop.index0 from the Jinja for link in vnut_links loop, which counts all links, not just TG links. If a TG link is at index 2, its veth is vtg..._2a. This works only because the bridge naming also uses the same global index. However, the TG veth cleanup in nut-vtopo-remove/tasks/main.yml also iterates all links — this is wasteful but harmless because 2>/dev/null || true suppresses errors for non-existent veths.
  3. testbed_facts.yml inconsistent .get(): duts and tgs now use .get() with defaults, but if testbed_facts itself is a Jinja dict (not Python), .get() may not be available. In Ansible, testbed_facts['duts'] | default([]) is the idiomatic pattern.
  4. connect_tg_links.yml error swallowing: ip link show ... 2>/dev/null || ip link add ... still silently swallows all errors from ip link add failures (OOM, permissions, etc.).
  5. monit summary grep pattern fragility: The negative grep -v 'OK|Running|Monit|^$|Status|---' will break if monit output format changes. Acceptable for now but fragile.
  6. vars_files hardcoded paths: Both playbooks hardcode group_vars/all/creds.yml and group_vars/lab/secrets.yml. This is consistent with other playbooks in the repo, so acceptable.

New observations on this iteration:

  • The testbed_facts.yml change uses Python dict .get() syntax in Jinja2 context — this may not work in all Ansible versions/configurations.
  • The meta_validator.yml exclusion uses vnut-.* regex — verify the validator supports regex patterns (not just exact matches).

Overall the PR is in reasonable shape for a virtual NUT testbed implementation. The remaining items are mostly minor nits and defensive-coding suggestions.


- fail: msg="The DUT you are trying to run test does not belongs to this testbed"
when: (inventory_hostname not in testbed_facts['duts'] and inventory_hostname not in testbed_facts['l1s'])
when: (inventory_hostname not in testbed_facts.get('duts', []) and inventory_hostname not in testbed_facts.get('l1s', []))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jinja2 compatibility concern: .get('duts', []) is Python dict syntax. While it works when testbed_facts is a Python dict in most Ansible versions, the idiomatic and more portable Ansible/Jinja2 pattern is:

testbed_facts['duts'] | default([])

or

testbed_facts.duts | default([])

This is safer across different Ansible versions and when testbed_facts might be a Jinja2 Undefined type.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed: .get() calls have been replaced with Jinja2-native | default([]) filters throughout testbed_facts.yml.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed: .get() replaced with Jinja2-native | default([]).

@@ -0,0 +1,72 @@
---

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is identical to ansible/roles/testbed/nut-vtopo-create/tasks/read_testbed.yml (all 72 lines). DRY violation — consider extracting to a shared location (e.g. nut-vtopo-common/tasks/read_testbed.yml) and include_tasks from both roles.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still open: the two read_testbed.yml files remain identical duplicates. Consider a shared role or symlink.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still open: identical read_testbed.yml duplicates. Consider shared role in follow-up.

echo "Port $PORT already exists in $CONTAINER"
else
ip link show "$VETH_A" 2>/dev/null || ip link add "$VETH_A" type veth peer name "$VETH_B"
ip link set "$VETH_A" master "$BRIDGE"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ip link show ... 2>/dev/null || ip link add ... pattern makes the veth creation idempotent, but ip link add failures (e.g. out of memory, permission denied, exceeding max interfaces) are silently swallowed. Consider:

if ! ip link show "$VETH_A" 2>/dev/null; then
  ip link add "$VETH_A" type veth peer name "$VETH_B"
fi

This way, legitimate ip link add failures will surface as task errors.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still open: same concern as thread 88 — ip link add failure errors are swallowed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — idempotency checks prevent silent failures for the common case.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged by maintainer — idempotency pattern acceptable.

{% endfor %}
echo "Cleaned up TG veth pairs"
args:
executable: /bin/bash

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TG veth cleanup iterates range(vnut_links | length) which includes DUT-to-DUT links (e.g. vnut-t0-01:Ethernet4 <-> vnut-t1-01:Ethernet0). For those links, vtg..._Na doesn't exist, so the 2>/dev/null || true hides it. Not a bug, but wasteful — consider filtering to only TG links (matching the connect_tg_links.yml logic).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still open (nit): same concern as thread 95 — TG veth cleanup iterates all links. Functionally correct due to the Jinja filter, but generates no-op commands.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — functionally correct, will optimize in a follow-up.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged by maintainer — will optimize TG cleanup in follow-up.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All reviewers approved. LGTM.

@r12f r12f force-pushed the feature/vnut-topo branch from 899f893 to e1fc22b Compare March 26, 2026 18:06
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 12). Most prior concerns have been addressed. The code is significantly cleaner after the refactoring into separate create/remove roles and the move to KVM-based DUTs.

Resolved in this iteration:

  • Dead code removed from vnut_network.py
  • Docstring updated to match actual actions
  • no_log: true added to sshpass task
  • vnut_tg_links used in when: condition
  • Link filter checks both StartDevice and EndDevice
  • features.json.j2 reverted
  • Usage function updated
  • .get() replaced with | default([]) in testbed_facts.yml
  • timeout=60 added to subprocess.run()
  • vm_interfaces deduplication fixed
  • iptables changed_when made dynamic
  • set -e removed from connect_tg_links shell block
  • TG veth cleanup now filters to TG links
  • Deterministic MAC for management interface

Still open (minor, non-blocking):

  • Duplicate read_testbed.yml between create/remove roles (acknowledged for follow-up)
  • stderr discarded on success in run_cmd (low priority)
  • ip link show ... || ip link add error swallowing pattern
  • Monit grep pattern fragility
  • vars_files hard paths (consistent with repo conventions)

New observations on current diff:

  1. kickstart.py and sonic_kickstart.py are symlinks — ensure the target paths resolve correctly when the role is used from the new location.
  2. Bridge name vbr_<testbed[:8]>_<idx> can exceed 15 chars if idx >= 100 (e.g. vbr_abcdefgh_100 = 16 chars). With realistic link counts this is unlikely but worth a length assertion.

Overall the PR is in good shape. The remaining items are minor nits suitable for follow-up. LGTM with the caveats noted above.


- name: Create front-panel link bridges
shell: |
BRIDGE="vbr_{{ testbed_name[:8] }}_{{ idx }}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: vbr_{{ testbed_name[:8] }}_{{ idx }} can exceed the Linux 15-char interface name limit if idx >= 100 (e.g. vbr_abcdefgh_100 = 16 chars). Consider shortening the testbed prefix to 6 chars or adding a validation task. Unlikely in practice with current topologies but worth noting.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still open (nit): vbr_{{ testbed_name[:8] }}_{{ idx }} can exceed 15 chars when idx >= 100 (e.g. vbr_vnut-2ti_100 = 16 chars). For the sample testbed with only 4 links this is fine, but should be validated for larger topologies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — bridge name length is safe for current scale. Will add validation in a follow-up if needed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged by maintainer — bridge name length safe for current scale.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iteration 12 re-review:

Significant improvement since last iteration — many prior concerns addressed (link filter bug, dead code, stale docstring, MAC address, usage(), set -e, timeout, etc.).

Remaining open concerns are mostly minor/nit-level: duplicate read_testbed.yml, stderr discarding, error swallowing in ip link add, fragile monit grep. These are acceptable for initial merge and can be addressed in follow-up.

One new observation: action_connect_mgmt creates a veth pair (line 123) then runs a sequence of commands to move it into the container namespace. If any step after veth creation fails (e.g., nsenter), the veth pair is left dangling on the host. A try/finally cleanup would make this more robust. Low risk for initial deployment but worth noting.

Overall: the code is in good shape for a v1 virtual testbed implementation.

run_cmd(["ip", "link", "set", veth_a, "netns", pid])
run_cmd(["nsenter", "-t", pid, "-n", "ip", "link", "set", veth_a, "name", "eth0"])
run_cmd(["nsenter", "-t", pid, "-n", "ip", "addr", "add", mgmt_ip, "dev", "eth0"])
run_cmd(["nsenter", "-t", pid, "-n", "ip", "link", "set", "eth0", "up"])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: if any run_cmd call after this line fails (e.g., nsenter to move veth into container namespace), the veth pair is left dangling on the host. Consider wrapping lines 123-134 in a try/except that cleans up the veth pair on failure:

try:
    run_cmd(["ip", "link", "add", veth_a, ...])
    run_cmd(["ip", "link", "set", veth_a, "netns", pid])
    # ... remaining setup
except RuntimeError:
    # Clean up partially-created veth pair
    if link_exists_on_host(veth_a):
        run_cmd(["ip", "link", "delete", veth_a], check=False)
    raise

Low risk for initial deployment since Ansible retries would re-create, but good defensive practice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added try/except cleanup for veth pair on nsenter failure.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed: try/except cleanup for veth pair on nsenter failure added.

serial_port: "{{ vnut_vm_serial_port_base + device_idx }}"

- name: "Kickstart SONiC VM {{ device.Hostname }}"
sonic_kickstart:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the sonic_kickstart task passes passwords and new_password but doesn't set no_log: true. The task output will contain password values in Ansible logs. Consider adding no_log: true here (similar to the wait_testbed_ready.yml fix).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added no_log: true to the sonic_kickstart task.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed: no_log: true added to sonic_kickstart task.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed fixed — no_log: true is present on the sonic_kickstart task.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All reviewers approved. LGTM.

@r12f r12f force-pushed the feature/vnut-topo branch from e1fc22b to 6054e57 Compare March 26, 2026 18:36
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 13): Most prior concerns have been addressed in the latest commits — unrelated files removed from diff, docstrings updated, | default([]) guards added, set -e removed, | unique dedup on link filters, no_log: true on password-bearing tasks, and cleanup uses changed_when: false.

Remaining minor items (non-blocking):

  • read_testbed.yml is duplicated between create/remove roles — could be shared via symlink or common role
  • Bridge name (vbr_<8>_<idx>) can exceed 15 chars with 100+ links — acknowledged as safe for current scale
  • vnut_tg_links fact in connect_tg_links.yml is only used in the when: guard while the shell template iterates vnut_links directly

No new issues found in this iteration. The PR is in good shape.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of the previously raised concerns remain unaddressed. Please review the unresolved comments.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 13): Significant improvements since last review. Most prior concerns addressed.

Resolved items (20 threads):

  • Unrelated test file changes removed
  • veos_vtb and features.json.j2 reverted
  • Dead code (create_link, cleanup actions) removed from vnut_network.py
  • Docstring updated to match remaining actions
  • Link filter bug fixed (checks both StartDevice/EndDevice with | unique)
  • set -e removed from shell block
  • no_log: true added to sensitive tasks
  • timeout=60 added to subprocess.run()
  • Dead vm_interfaces assignment removed
  • Dynamic changed_when added to forwarding rules
  • Deterministic MAC for management interface
  • usage() updated with new commands
  • Jinja2-native | default([]) used consistently
  • try/except cleanup for veth pairs on failure

Remaining minor items (non-blocking):

  • Duplicate read_testbed.yml (DRY) — follow-up candidate
  • NAT masquerade task still lacks changed_when
  • Empty except: pass needs an explanatory comment (code scanning alert)
  • Serial port collision risk for concurrent testbed deployments
  • Various nits acknowledged by maintainer for follow-up

Verdict: Good progress. The 3 new inline comments are minor. No blocking issues remain.

for iface in (veth_a, veth_b):
try:
run_cmd(["ip", "link", "delete", iface])
except Exception:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inner except Exception: pass here will silently swallow cleanup failures. While this is in a cleanup path and you don't want to lose the original exception, adding a brief comment explaining the intent (e.g., # Best-effort cleanup; original exception re-raised below) would address the code-scanning alert and help future maintainers understand why the pass is intentional.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added comment explaining intentional best-effort cleanup.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed fixed — the comment clarifies intentional best-effort cleanup behavior.


- name: Check if NAT masquerade rule exists
shell: >
nsenter -t 1 -m -u -n -i -p -- iptables -t nat -C POSTROUTING -s {{ mgmt_subnet_cidr }} ! -d {{ mgmt_subnet_cidr }} -j MASQUERADE 2>/dev/null

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NAT masquerade task has no changed_when — it will always report changed when the rule doesn't exist. The forwarding task below correctly uses changed_when: "'changed=true' in fwd_result.stdout". Consider applying the same check+add pattern with a changed flag here for consistency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added changed_when to the NAT masquerade task.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed fixed — changed_when: nat_check.rc != 0 correctly tracks whether the NAT rule was actually added.

# launch_one_dut.yml — Create and start a SONiC KVM VM for a DUT
# Called with: device (dict with Hostname, ManagementIp, HwSku, etc.), device_idx

- name: "Set VM facts for {{ device.Hostname }}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: serial_port is computed as vnut_vm_serial_port_base + device_idx, where device_idx comes from the loop index over DUT devices. If two testbeds share the same host, their serial ports could collide (e.g., both testbed A's first DUT and testbed B's first DUT would use port 9100). Consider incorporating testbed_name into the port offset, or document that concurrent deployments on the same host require disjoint vnut_vm_serial_port_base values.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — serial port collision is a valid concern for parallel testbeds. Will add testbed-scoped port allocation in a follow-up.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged by maintainer for follow-up. Non-blocking for this PR.

Add support for deploying virtual NUT testbeds using KVM VMs via
testbed-cli.sh. This includes:
- New nut-vtopo-create and nut-vtopo-remove Ansible roles
- vnut_network.py library for network namespace management
- KVM VM lifecycle management (libvirt XML templates)
- Management network creation with bridge and NAT
- TG link connectivity via veth pairs
- testbed-cli.sh add-vnut-topo / remove-vnut-topo commands
- Meta validator exclusions for virtual devices
- Sample vnut testbed YAML configuration

Signed-off-by: r12f <r12f.code@gmail.com>
@r12f r12f force-pushed the feature/vnut-topo branch from 6054e57 to 7d1629b Compare March 26, 2026 19:36
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 14): Significant improvement — most prior concerns have been addressed. Resolved 21 of 24 threads.

3 remaining items (non-blocking but worth addressing):

  1. Duplicated read_testbed.ymlnut-vtopo-remove/tasks/read_testbed.yml is still an exact copy of the create version. Extract to a shared include to avoid drift.
  2. Hardcoded vars_files pathstestbed_add_nut_topo.yml and testbed_remove_nut_topo.yml hardcode group_vars/all/creds.yml and group_vars/lab/secrets.yml. Consider parameterizing or documenting.
  3. Fragile monit grep — The negative-match pattern in wait_testbed_ready.yml will miss new monit status keywords. A positive-match approach would be more robust.

No new issues found in this iteration. The code is well-structured with good idempotency, proper error handling (timeouts on subprocess, cleanup on failure), and consistent use of default([]) for optional fields.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (iteration 14): All previously flagged issues have been addressed or acknowledged as non-blocking.

Fixes confirmed in this iteration:

  • no_log: true added to sonic_kickstart task — prevents credential leakage in Ansible output
  • changed_when: nat_check.rc != 0 added to NAT masquerade task — proper idempotency reporting
  • Comment added to except Exception: pass in vnut_network.py cleanup — clarifies intentional best-effort behavior
  • Serial port collision risk acknowledged by maintainer for follow-up

Previously addressed (verified still in place):

  • Unrelated test files removed from PR scope
  • veos_vtb changes reverted
  • features.json.j2 changes reverted
  • Dead action_create_link removed; docstring updated
  • vnut_tg_links used in when: condition
  • no_log: true on sshpass task
  • set -e removed from shell template
  • | unique filter on TG links
  • | default([]) consistently applied in testbed_facts.yml
  • timeout=60 on subprocess.run()
  • vm_interfaces set once via Jinja namespace
  • Dynamic changed_when on iptables tasks
  • Deterministic MAC via MD5 hash
  • usage() updated with vnut commands
  • Try/except cleanup for dangling veth pairs
  • TG cleanup filters for DevIxiaChassis type
  • Shell quoting fixes in testbed-cli.sh

Remaining non-blocking nits (all acknowledged by maintainer):

  • Duplicate read_testbed.yml between create/remove roles — recommend dedup in follow-up
  • stderr discarded on success in run_cmd — minor observability gap
  • 8-char MD5 prefix for veth naming — acceptable for expected scale
  • ip link show || ip link add error masking — mitigated by idempotency checks
  • Hardcoded vars_files paths — consistent with other testbed playbooks
  • Fragile monit summary grep pattern — functional but brittle
  • NUMA topology not defined for VMs — fine for 2 vCPU VS testing
  • Bridge name length risk for idx >= 100 — safe for current scale
  • Serial port collision risk for parallel testbeds — acknowledged for follow-up

Verdict: No new issues found. All prior feedback has been addressed or accepted as non-blocking. The PR is clean and ready to merge.

Copy link

@banidoru banidoru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All reviewers approved. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants