Add virtual NUT testbed (vNUT) support to testbed-cli using KVM VMs#22976
Add virtual NUT testbed (vNUT) support to testbed-cli using KVM VMs#22976
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
The pre-commit check detected issues in the files touched by this pull request. Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
|
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
banidoru
left a comment
There was a problem hiding this comment.
Reviewed +772 lines across 22 files. The overall design is sound — two-phase container launch (base network container + cSONiC overlay) with veth-based linking is a clean approach. Key concerns:
- Security: Hardcoded credentials in
vnut-lab/hostsinventory file - Command injection risk in
vnut_network.py— user-supplied device/port names flow into shell commands via string formatting - Teardown bridge check bug — uses
docker network inspecton a Linux bridge created viaip link - Cleanup over-matching —
link_prefix pattern could delete unrelated host interfaces - Minor: import placement, shell quoting
banidoru
left a comment
There was a problem hiding this comment.
Review summary: Overall a solid addition. Several issues found — one likely bug in teardown (missing veth cleanup), a missing function reference in the shell script, and some correctness/robustness concerns in the readiness checks. Detailed inline comments below.
banidoru
left a comment
There was a problem hiding this comment.
Overall: solid foundation for virtual NUT testbed support. Several correctness and operational issues found — most notably missing iptables cleanup on teardown, fragile path resolution, and insufficient service readiness checks. Requesting changes on a few items.
Key findings:
- Teardown leaks iptables NAT/FORWARD rules (never cleaned up)
vnut_lab_files_dirpath resolution is fragile (relies on exact directory depth)- ConfigDB/service wait commands lack proper Ansible retry patterns
create_links.ymlveth naming (vl{{ idx }}) uses a global index that could collide across testbeds- No iptables rule cleanup counterpart to
create_mgmt_network.yml
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 2): The latest commits only changed the credential variable from sonicadmin_password to ansible_altpassword and consolidated the hosts file passwords. However, none of the 21 previously raised concerns have been addressed. Key outstanding issues:
Critical / Bugs:
docker network inspectused on a Linux bridge in teardown — will always fail, causing premature bridge deletionread_nut_filefunction called but never defined in the diff — runtime failure guaranteed- Teardown never cleans up host-side veth pairs or iptables NAT/FORWARD rules — resource leak on repeated cycles
Security:
- Hardcoded
YourPaSsWoRdstill committed invnut-lab/hosts(now asansible_altpassword) run_cmdstill uses string formatting +shlex.split— command injection risk
Robustness:
supervisorctl status | grep -q RUNNINGdoesn't verify all critical services are upvnut_lab_files_diruses fragile tripledirnamechain- Veth names not scoped to testbed — parallel deployments will collide
- No error handling if testbed name not found in YAML
auto_recover: 'True'is a string, not a YAML boolean
Please address the open threads before re-requesting review.
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 2): The pre-commit files were removed and the defaults credential was updated to reference ansible_altpassword — those are improvements. However, the majority of prior feedback (20+ open threads) remains unaddressed in the new commits:
Critical bugs still open:
read_nut_filefunction is not defined in testbed-cli.sh — both new subcommands will fail at runtime- teardown uses
docker network inspecton a Linux bridge (not a Docker network) — bridge deletion logic is broken - Hardcoded
YourPaSsWoRdstill committed invnut-lab/hosts
Design/correctness issues still open:
import hashlibinside function bodyrun_cmdcommand injection risk (string formatting + shlex.split)- Cleanup pattern too broad (
link_prefix) - Teardown missing veth cleanup and iptables rule removal
- Veth names not scoped to testbed (parallel deployment collision)
supervisorctl status | grep RUNNINGmatches any single service- Management interface restore uses timing-based workaround
- CSV data duplicated in global and vnut-lab directories
auto_recover: 'True'is a string not a boolean- Base network container uses debian:bookworm unnecessarily
vnut_lab_files_dirrelies on fragile dirname chain- No error handling for missing testbed name
- ConfigDB wait uses nested bash timeout instead of Ansible retries
- HwSku overlay tasks lack error handling
- sudo installed via apt-get on every deploy
- sysctl task always reports changed
- Shell variables unquoted in testbed-cli.sh
Please address the open threads before re-requesting review.
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 2): All 21 previously raised concerns remain unaddressed in the latest commits. Key issues still outstanding:
- Security: Hardcoded plaintext credentials in
ansible/vnut-lab/hosts - Correctness: Teardown uses
docker network inspecton a Linux bridge created viaip link add; teardown never cleans up veth pairs or iptables rules;supervisorctl status | grep -q RUNNINGmatches any single running service - Robustness: Shell variables unquoted in
testbed-cli.sh; management interface restore relies onpause: 3sinstead of retry; ConfigDB wait uses fragile nested bash timeout; no validation for missingtestbed_namein YAML lookup;read_nut_filefunction called but not defined - Code quality:
import hashlibinside function body;changed_when: trueon idempotent sysctl;auto_recover: 'True'is string not boolean; cleanup pattern overly broad; veth names not testbed-scoped - Design: CSV data duplicated across directories; base network container uses full
debian:bookworm;vnut_lab_files_diruses fragile dirname chain; HwSku overlay lacks error handling;apt-get install sudoruns unconditionally
Please address the feedback before this can move forward.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
banidoru
left a comment
There was a problem hiding this comment.
Iteration 3 re-review. All 22 prior threads are resolved — good work addressing the feedback. A few new observations on the current diff:
testbed-cli.shcase statement:$@is unquoted at the call sites (add_vnut_topo $@,remove_vnut_topo $@), so arguments with spaces will word-split. Should be"$@".read_testbed.yml:vnut_ansible_dirstill usesplaybook_dir | dirname | dirname | dirname— the same fragile pattern that was fixed forvnut_lab_files_dir. Consider usingrole_pathhere too.role_pathavailability:defaults/main.ymlreferencesrole_pathbut the playbooks useinclude_tasksfrom a standalone playbook (not invoked as a role viaroles:). Verifyrole_pathis defined at runtime; if not, this will fail.create_mgmt_network.yml: iptables tasks use-C(check) before-A(making them idempotent), butchanged_when: truealways reports changed. Usechanged_when: falsefor accurate reporting.
None of these are blockers — mostly minor robustness improvements. The core design is solid and all prior concerns have been addressed.
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 3) summary:
All 22 previously resolved threads confirmed addressed in 824c736 — good work on the comprehensive fixes.
3 open items remain:
-
testbed-cli.sh case dispatchers:
add_vnut_topo $@andremove_vnut_topo $@still unquoted (the functions themselves are fixed, but the call sites in the case statement are not). -
read_testbed.yml line 8: Still uses
playbook_dir | dirname | dirname | dirname— same fragile pattern that was fixed indefaults/main.ymlwithrole_path. Should be consistent. -
create_mgmt_network.yml iptables task:
changed_when: trueon an idempotent-C || -Apattern — should bechanged_when: falsefor accurate reporting.
These are minor. Once addressed, LGTM.
Address banidoru review comments on PR #22976: - Use role_path for robust path resolution instead of fragile dirname chains - Refactor run_cmd to use list args instead of string + shlex.split - Move hashlib to top-level imports - Scope veth cleanup to testbed-specific prefix (not overly broad link_ pattern) - Add testbed name prefix to veth names to prevent parallel deployment collisions - Replace shell sysctl with ansible.posix.sysctl module - Use busybox:latest instead of debian:bookworm for base network container - Add assert validation for missing testbed_name in read_testbed.yml - Add error handling (failed_when/when) to HwSku overlay tasks - Fix teardown: use ip link show instead of docker network inspect for Linux bridge - Add veth cleanup and iptables rule cleanup to teardown - Replace timing-based pause with retry loop for mgmt interface verification - Check all critical services (not just any single RUNNING) in service wait - Add which sudo check before unconditional apt-get install - Simplify ConfigDB wait with Ansible retries instead of nested bash timeout - Quote all shell variable expansions in testbed-cli.sh - Add add_vnut_topo/remove_vnut_topo functions to testbed-cli.sh - Add inline YAML comment explaining intentional string quoting of auto_recover - Use sonicadmin_password variable instead of hardcoded password in vnut-lab/hosts - Keep vnut device entries only in vnut-lab/files/ (not global CSV) Signed-off-by: r12f <r12f.code@gmail.com>
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 10): All previously raised concerns have been addressed or acknowledged with follow-up plans. Key fixes verified in current diff: timeout added to subprocess.run, deterministic MAC for mgmt interface, no_log for passwords, proper changed_when for iptables, TG veth cleanup scoped correctly, set -e removed, unique filter on TG links, .get() used consistently in testbed_facts.yml. Remaining open items (duplicated read_testbed.yml, stderr logging, monit grep fragility) are acknowledged as follow-up work. No new issues found. LGTM — approve.
banidoru
left a comment
There was a problem hiding this comment.
All previous concerns have been addressed. LGTM!
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 10) — All previously raised issues have been addressed or acknowledged for follow-up. The code is significantly cleaner than earlier iterations.
Prior threads status:
- 17 threads confirmed fixed in this revision (files removed from diff, code corrected, or improvements applied)
- 7 threads acknowledged by author as follow-up items (duplicate read_testbed.yml, stderr handling, MD5 prefix, error swallowing, vars_files paths, monit pattern, NUMA topology, testbed.vnut.yaml fields)
New finding: testbed-cli.sh usage() function not updated to document the new add-vnut-topo and remove-vnut-topo subcommands (minor).
Verdict: No blocking issues remain. The acknowledged follow-up items are reasonable deferrals. Code is well-structured with proper idempotency, error handling, and separation of concerns.
| ;; | ||
| config-vs-chassis) config_vs_chassis $@ | ||
| ;; | ||
| add-vnut-topo) add_vnut_topo "$@" |
There was a problem hiding this comment.
Minor: the usage() function at the top of this file does not document add-vnut-topo or remove-vnut-topo. Users running testbed-cli.sh without arguments won't discover these commands. Consider adding a usage line like:
echo " $0 [options] (add-vnut-topo | remove-vnut-topo) <testbed-name> <inventory> <vault-password-file>"
Non-blocking.
There was a problem hiding this comment.
Fixed — added add-vnut-topo and remove-vnut-topo to the usage() function.
There was a problem hiding this comment.
Addressed: the usage() function now includes add-vnut-topo and remove-vnut-topo in the help output.
There was a problem hiding this comment.
Confirmed: usage() updated with add-vnut-topo and remove-vnut-topo.
|
/azp run |
|
/azp run |
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 11): Significant improvement since last review. Several previously raised concerns have been addressed:
Fixed:
- Removed dead code (
action_create_link) and stale docstring fromvnut_network.py - Added
timeout=60tosubprocess.run() - Link filter now checks both
StartDeviceandEndDevice - Removed
set -efromconnect_tg_links.ymlshell block - Proper
changed_whentracking on iptables forwarding rules - Removed unrelated file changes (
features.json.j2,veos_vtb, test files) - Cleaned up
vm_interfacesduplicate assignment
Still open (minor, non-blocking):
- Duplicate
read_testbed.ymlbetween create/remove roles sshpasspassword in shell command (mitigated byno_log: true)- Minor inconsistency:
vnut_tg_linkscomputed but shell iteratesvnut_linksdirectly testbed_facts.yml.get()fix is partial
Overall the PR is in good shape. The remaining items are minor and can be addressed in follow-up PRs.
banidoru
left a comment
There was a problem hiding this comment.
All previous comments have been addressed. LGTM.
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 11) after new commits up to 899f893.
Many prior issues have been fixed — files removed from diff, dead code removed, docstrings updated, no_log added, changed_when improved, $@ quoting fixed, link filter corrected, etc. Good progress.
Remaining issues (carried from prior reviews, still present in latest commit):
read_testbed.ymlis duplicated betweennut-vtopo-createandnut-vtopo-remove. These 72-line files are identical — extract to a shared role orinclude_tasksfrom a common location.connect_tg_links.ymlloop.index0collision risk: The veth naming usesloop.index0from the Jinjafor link in vnut_linksloop, which counts all links, not just TG links. If a TG link is at index 2, its veth isvtg..._2a. This works only because the bridge naming also uses the same global index. However, the TG veth cleanup innut-vtopo-remove/tasks/main.ymlalso iterates all links — this is wasteful but harmless because2>/dev/null || truesuppresses errors for non-existent veths.testbed_facts.ymlinconsistent.get():dutsandtgsnow use.get()with defaults, but iftestbed_factsitself is a Jinja dict (not Python),.get()may not be available. In Ansible,testbed_facts['duts'] | default([])is the idiomatic pattern.connect_tg_links.ymlerror swallowing:ip link show ... 2>/dev/null || ip link add ...still silently swallows all errors fromip link addfailures (OOM, permissions, etc.).monit summarygrep pattern fragility: The negative grep-v 'OK|Running|Monit|^$|Status|---'will break if monit output format changes. Acceptable for now but fragile.vars_fileshardcoded paths: Both playbooks hardcodegroup_vars/all/creds.ymlandgroup_vars/lab/secrets.yml. This is consistent with other playbooks in the repo, so acceptable.
New observations on this iteration:
- The
testbed_facts.ymlchange uses Python dict.get()syntax in Jinja2 context — this may not work in all Ansible versions/configurations. - The
meta_validator.ymlexclusion usesvnut-.*regex — verify the validator supports regex patterns (not just exact matches).
Overall the PR is in reasonable shape for a virtual NUT testbed implementation. The remaining items are mostly minor nits and defensive-coding suggestions.
|
|
||
| - fail: msg="The DUT you are trying to run test does not belongs to this testbed" | ||
| when: (inventory_hostname not in testbed_facts['duts'] and inventory_hostname not in testbed_facts['l1s']) | ||
| when: (inventory_hostname not in testbed_facts.get('duts', []) and inventory_hostname not in testbed_facts.get('l1s', [])) |
There was a problem hiding this comment.
Jinja2 compatibility concern: .get('duts', []) is Python dict syntax. While it works when testbed_facts is a Python dict in most Ansible versions, the idiomatic and more portable Ansible/Jinja2 pattern is:
testbed_facts['duts'] | default([])or
testbed_facts.duts | default([])This is safer across different Ansible versions and when testbed_facts might be a Jinja2 Undefined type.
There was a problem hiding this comment.
Addressed: .get() calls have been replaced with Jinja2-native | default([]) filters throughout testbed_facts.yml.
There was a problem hiding this comment.
Confirmed: .get() replaced with Jinja2-native | default([]).
| @@ -0,0 +1,72 @@ | |||
| --- | |||
There was a problem hiding this comment.
This file is identical to ansible/roles/testbed/nut-vtopo-create/tasks/read_testbed.yml (all 72 lines). DRY violation — consider extracting to a shared location (e.g. nut-vtopo-common/tasks/read_testbed.yml) and include_tasks from both roles.
There was a problem hiding this comment.
Still open: the two read_testbed.yml files remain identical duplicates. Consider a shared role or symlink.
There was a problem hiding this comment.
Still open: identical read_testbed.yml duplicates. Consider shared role in follow-up.
| echo "Port $PORT already exists in $CONTAINER" | ||
| else | ||
| ip link show "$VETH_A" 2>/dev/null || ip link add "$VETH_A" type veth peer name "$VETH_B" | ||
| ip link set "$VETH_A" master "$BRIDGE" |
There was a problem hiding this comment.
The ip link show ... 2>/dev/null || ip link add ... pattern makes the veth creation idempotent, but ip link add failures (e.g. out of memory, permission denied, exceeding max interfaces) are silently swallowed. Consider:
if ! ip link show "$VETH_A" 2>/dev/null; then
ip link add "$VETH_A" type veth peer name "$VETH_B"
fiThis way, legitimate ip link add failures will surface as task errors.
There was a problem hiding this comment.
Still open: same concern as thread 88 — ip link add failure errors are swallowed.
There was a problem hiding this comment.
Acknowledged — idempotency checks prevent silent failures for the common case.
There was a problem hiding this comment.
Acknowledged by maintainer — idempotency pattern acceptable.
| {% endfor %} | ||
| echo "Cleaned up TG veth pairs" | ||
| args: | ||
| executable: /bin/bash |
There was a problem hiding this comment.
TG veth cleanup iterates range(vnut_links | length) which includes DUT-to-DUT links (e.g. vnut-t0-01:Ethernet4 <-> vnut-t1-01:Ethernet0). For those links, vtg..._Na doesn't exist, so the 2>/dev/null || true hides it. Not a bug, but wasteful — consider filtering to only TG links (matching the connect_tg_links.yml logic).
There was a problem hiding this comment.
Still open (nit): same concern as thread 95 — TG veth cleanup iterates all links. Functionally correct due to the Jinja filter, but generates no-op commands.
There was a problem hiding this comment.
Acknowledged — functionally correct, will optimize in a follow-up.
There was a problem hiding this comment.
Acknowledged by maintainer — will optimize TG cleanup in follow-up.
|
/azp run |
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 12). Most prior concerns have been addressed. The code is significantly cleaner after the refactoring into separate create/remove roles and the move to KVM-based DUTs.
Resolved in this iteration:
- Dead code removed from
vnut_network.py - Docstring updated to match actual actions
no_log: trueadded to sshpass taskvnut_tg_linksused inwhen:condition- Link filter checks both
StartDeviceandEndDevice features.json.j2reverted- Usage function updated
.get()replaced with| default([])in testbed_facts.ymltimeout=60added to subprocess.run()- vm_interfaces deduplication fixed
- iptables
changed_whenmade dynamic set -eremoved from connect_tg_links shell block- TG veth cleanup now filters to TG links
- Deterministic MAC for management interface
Still open (minor, non-blocking):
- Duplicate
read_testbed.ymlbetween create/remove roles (acknowledged for follow-up) - stderr discarded on success in run_cmd (low priority)
ip link show ... || ip link adderror swallowing pattern- Monit grep pattern fragility
- vars_files hard paths (consistent with repo conventions)
New observations on current diff:
kickstart.pyandsonic_kickstart.pyare symlinks — ensure the target paths resolve correctly when the role is used from the new location.- Bridge name
vbr_<testbed[:8]>_<idx>can exceed 15 chars if idx >= 100 (e.g.vbr_abcdefgh_100= 16 chars). With realistic link counts this is unlikely but worth a length assertion.
Overall the PR is in good shape. The remaining items are minor nits suitable for follow-up. LGTM with the caveats noted above.
|
|
||
| - name: Create front-panel link bridges | ||
| shell: | | ||
| BRIDGE="vbr_{{ testbed_name[:8] }}_{{ idx }}" |
There was a problem hiding this comment.
Nit: vbr_{{ testbed_name[:8] }}_{{ idx }} can exceed the Linux 15-char interface name limit if idx >= 100 (e.g. vbr_abcdefgh_100 = 16 chars). Consider shortening the testbed prefix to 6 chars or adding a validation task. Unlikely in practice with current topologies but worth noting.
There was a problem hiding this comment.
Still open (nit): vbr_{{ testbed_name[:8] }}_{{ idx }} can exceed 15 chars when idx >= 100 (e.g. vbr_vnut-2ti_100 = 16 chars). For the sample testbed with only 4 links this is fine, but should be validated for larger topologies.
There was a problem hiding this comment.
Acknowledged — bridge name length is safe for current scale. Will add validation in a follow-up if needed.
There was a problem hiding this comment.
Acknowledged by maintainer — bridge name length safe for current scale.
banidoru
left a comment
There was a problem hiding this comment.
Iteration 12 re-review:
Significant improvement since last iteration — many prior concerns addressed (link filter bug, dead code, stale docstring, MAC address, usage(), set -e, timeout, etc.).
Remaining open concerns are mostly minor/nit-level: duplicate read_testbed.yml, stderr discarding, error swallowing in ip link add, fragile monit grep. These are acceptable for initial merge and can be addressed in follow-up.
One new observation: action_connect_mgmt creates a veth pair (line 123) then runs a sequence of commands to move it into the container namespace. If any step after veth creation fails (e.g., nsenter), the veth pair is left dangling on the host. A try/finally cleanup would make this more robust. Low risk for initial deployment but worth noting.
Overall: the code is in good shape for a v1 virtual testbed implementation.
ansible/library/vnut_network.py
Outdated
| run_cmd(["ip", "link", "set", veth_a, "netns", pid]) | ||
| run_cmd(["nsenter", "-t", pid, "-n", "ip", "link", "set", veth_a, "name", "eth0"]) | ||
| run_cmd(["nsenter", "-t", pid, "-n", "ip", "addr", "add", mgmt_ip, "dev", "eth0"]) | ||
| run_cmd(["nsenter", "-t", pid, "-n", "ip", "link", "set", "eth0", "up"]) |
There was a problem hiding this comment.
Minor: if any run_cmd call after this line fails (e.g., nsenter to move veth into container namespace), the veth pair is left dangling on the host. Consider wrapping lines 123-134 in a try/except that cleans up the veth pair on failure:
try:
run_cmd(["ip", "link", "add", veth_a, ...])
run_cmd(["ip", "link", "set", veth_a, "netns", pid])
# ... remaining setup
except RuntimeError:
# Clean up partially-created veth pair
if link_exists_on_host(veth_a):
run_cmd(["ip", "link", "delete", veth_a], check=False)
raiseLow risk for initial deployment since Ansible retries would re-create, but good defensive practice.
There was a problem hiding this comment.
Fixed — added try/except cleanup for veth pair on nsenter failure.
There was a problem hiding this comment.
Confirmed: try/except cleanup for veth pair on nsenter failure added.
| serial_port: "{{ vnut_vm_serial_port_base + device_idx }}" | ||
|
|
||
| - name: "Kickstart SONiC VM {{ device.Hostname }}" | ||
| sonic_kickstart: |
There was a problem hiding this comment.
Nit: the sonic_kickstart task passes passwords and new_password but doesn't set no_log: true. The task output will contain password values in Ansible logs. Consider adding no_log: true here (similar to the wait_testbed_ready.yml fix).
There was a problem hiding this comment.
Fixed — added no_log: true to the sonic_kickstart task.
There was a problem hiding this comment.
Confirmed: no_log: true added to sonic_kickstart task.
There was a problem hiding this comment.
Confirmed fixed — no_log: true is present on the sonic_kickstart task.
|
/azp run |
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 13): Most prior concerns have been addressed in the latest commits — unrelated files removed from diff, docstrings updated, | default([]) guards added, set -e removed, | unique dedup on link filters, no_log: true on password-bearing tasks, and cleanup uses changed_when: false.
Remaining minor items (non-blocking):
read_testbed.ymlis duplicated between create/remove roles — could be shared via symlink or common role- Bridge name (
vbr_<8>_<idx>) can exceed 15 chars with 100+ links — acknowledged as safe for current scale vnut_tg_linksfact inconnect_tg_links.ymlis only used in thewhen:guard while the shell template iteratesvnut_linksdirectly
No new issues found in this iteration. The PR is in good shape.
banidoru
left a comment
There was a problem hiding this comment.
Many of the previously raised concerns remain unaddressed. Please review the unresolved comments.
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 13): Significant improvements since last review. Most prior concerns addressed.
Resolved items (20 threads):
- Unrelated test file changes removed
veos_vtbandfeatures.json.j2reverted- Dead code (
create_link,cleanupactions) removed fromvnut_network.py - Docstring updated to match remaining actions
- Link filter bug fixed (checks both StartDevice/EndDevice with
| unique) set -eremoved from shell blockno_log: trueadded to sensitive taskstimeout=60added tosubprocess.run()- Dead
vm_interfacesassignment removed - Dynamic
changed_whenadded to forwarding rules - Deterministic MAC for management interface
usage()updated with new commands- Jinja2-native
| default([])used consistently - try/except cleanup for veth pairs on failure
Remaining minor items (non-blocking):
- Duplicate
read_testbed.yml(DRY) — follow-up candidate - NAT masquerade task still lacks
changed_when - Empty
except: passneeds an explanatory comment (code scanning alert) - Serial port collision risk for concurrent testbed deployments
- Various nits acknowledged by maintainer for follow-up
Verdict: Good progress. The 3 new inline comments are minor. No blocking issues remain.
| for iface in (veth_a, veth_b): | ||
| try: | ||
| run_cmd(["ip", "link", "delete", iface]) | ||
| except Exception: |
There was a problem hiding this comment.
The inner except Exception: pass here will silently swallow cleanup failures. While this is in a cleanup path and you don't want to lose the original exception, adding a brief comment explaining the intent (e.g., # Best-effort cleanup; original exception re-raised below) would address the code-scanning alert and help future maintainers understand why the pass is intentional.
There was a problem hiding this comment.
Fixed — added comment explaining intentional best-effort cleanup.
There was a problem hiding this comment.
Confirmed fixed — the comment clarifies intentional best-effort cleanup behavior.
|
|
||
| - name: Check if NAT masquerade rule exists | ||
| shell: > | ||
| nsenter -t 1 -m -u -n -i -p -- iptables -t nat -C POSTROUTING -s {{ mgmt_subnet_cidr }} ! -d {{ mgmt_subnet_cidr }} -j MASQUERADE 2>/dev/null |
There was a problem hiding this comment.
The NAT masquerade task has no changed_when — it will always report changed when the rule doesn't exist. The forwarding task below correctly uses changed_when: "'changed=true' in fwd_result.stdout". Consider applying the same check+add pattern with a changed flag here for consistency.
There was a problem hiding this comment.
Fixed — added changed_when to the NAT masquerade task.
There was a problem hiding this comment.
Confirmed fixed — changed_when: nat_check.rc != 0 correctly tracks whether the NAT rule was actually added.
| # launch_one_dut.yml — Create and start a SONiC KVM VM for a DUT | ||
| # Called with: device (dict with Hostname, ManagementIp, HwSku, etc.), device_idx | ||
|
|
||
| - name: "Set VM facts for {{ device.Hostname }}" |
There was a problem hiding this comment.
Nit: serial_port is computed as vnut_vm_serial_port_base + device_idx, where device_idx comes from the loop index over DUT devices. If two testbeds share the same host, their serial ports could collide (e.g., both testbed A's first DUT and testbed B's first DUT would use port 9100). Consider incorporating testbed_name into the port offset, or document that concurrent deployments on the same host require disjoint vnut_vm_serial_port_base values.
There was a problem hiding this comment.
Acknowledged — serial port collision is a valid concern for parallel testbeds. Will add testbed-scoped port allocation in a follow-up.
There was a problem hiding this comment.
Acknowledged by maintainer for follow-up. Non-blocking for this PR.
Add support for deploying virtual NUT testbeds using KVM VMs via testbed-cli.sh. This includes: - New nut-vtopo-create and nut-vtopo-remove Ansible roles - vnut_network.py library for network namespace management - KVM VM lifecycle management (libvirt XML templates) - Management network creation with bridge and NAT - TG link connectivity via veth pairs - testbed-cli.sh add-vnut-topo / remove-vnut-topo commands - Meta validator exclusions for virtual devices - Sample vnut testbed YAML configuration Signed-off-by: r12f <r12f.code@gmail.com>
|
/azp run |
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 14): Significant improvement — most prior concerns have been addressed. Resolved 21 of 24 threads.
3 remaining items (non-blocking but worth addressing):
- Duplicated read_testbed.yml —
nut-vtopo-remove/tasks/read_testbed.ymlis still an exact copy of the create version. Extract to a shared include to avoid drift. - Hardcoded vars_files paths —
testbed_add_nut_topo.ymlandtestbed_remove_nut_topo.ymlhardcodegroup_vars/all/creds.ymlandgroup_vars/lab/secrets.yml. Consider parameterizing or documenting. - Fragile monit grep — The negative-match pattern in
wait_testbed_ready.ymlwill miss new monit status keywords. A positive-match approach would be more robust.
No new issues found in this iteration. The code is well-structured with good idempotency, proper error handling (timeouts on subprocess, cleanup on failure), and consistent use of default([]) for optional fields.
banidoru
left a comment
There was a problem hiding this comment.
Re-review (iteration 14): All previously flagged issues have been addressed or acknowledged as non-blocking.
Fixes confirmed in this iteration:
no_log: trueadded tosonic_kickstarttask — prevents credential leakage in Ansible outputchanged_when: nat_check.rc != 0added to NAT masquerade task — proper idempotency reporting- Comment added to
except Exception: passinvnut_network.pycleanup — clarifies intentional best-effort behavior - Serial port collision risk acknowledged by maintainer for follow-up
Previously addressed (verified still in place):
- Unrelated test files removed from PR scope
veos_vtbchanges revertedfeatures.json.j2changes reverted- Dead
action_create_linkremoved; docstring updated vnut_tg_linksused inwhen:conditionno_log: trueonsshpasstaskset -eremoved from shell template| uniquefilter on TG links| default([])consistently applied intestbed_facts.ymltimeout=60onsubprocess.run()vm_interfacesset once via Jinja namespace- Dynamic
changed_whenon iptables tasks - Deterministic MAC via MD5 hash
usage()updated with vnut commands- Try/except cleanup for dangling veth pairs
- TG cleanup filters for
DevIxiaChassistype - Shell quoting fixes in
testbed-cli.sh
Remaining non-blocking nits (all acknowledged by maintainer):
- Duplicate
read_testbed.ymlbetween create/remove roles — recommend dedup in follow-up stderrdiscarded on success inrun_cmd— minor observability gap- 8-char MD5 prefix for veth naming — acceptable for expected scale
ip link show || ip link adderror masking — mitigated by idempotency checks- Hardcoded
vars_filespaths — consistent with other testbed playbooks - Fragile
monit summarygrep pattern — functional but brittle - NUMA topology not defined for VMs — fine for 2 vCPU VS testing
- Bridge name length risk for
idx >= 100— safe for current scale - Serial port collision risk for parallel testbeds — acknowledged for follow-up
Verdict: No new issues found. All prior feedback has been addressed or accepted as non-blocking. The PR is clean and ready to merge.
Description of PR
Summary:
Add
add-vnut-topo/remove-vnut-topocommands totestbed-cli.shthat deploy a fully virtual NUT (Network Under Test) testbed using KVM-based virtual SONiC instances. This enables running sonic-mgmt NUT tests against a virtual topology without physical hardware.The virtual testbed consists of KVM virtual SONiC DUTs and
docker-ptftraffic generators connected via veth pairs, sharing the existing management bridge (br1) and management subnet with other virtual testbeds.Type of change
Back port request
Approach
What is the motivation for this PR?
Enable developers to run NUT tests locally without physical switches or traffic generators. The vNUT testbed provides a fully virtualized alternative using KVM VMs and PTF containers.
How did you do it?
add-vnut-topo/remove-vnut-topoactions totestbed-cli.shansible/roles/testbed/nut-vtopo/Ansible role with tasks for:br1bridge)sonic-vs.imgvnut_network.pymodulenut-2tierstopologyHow did you verify/test it?
Validated end-to-end inside sonic-mgmt container on a KVM-capable host:
add-vnut-topo: ok=68, failed=0 ✅deploy-cfg: ok=44, failed=0 ✅test_pretest.py: 11 passed, 6 skipped, 0 failures ✅Any platform specific information?
Requires KVM-capable host. DUTs use
Force10-S6000platform profile (virtual SONiC).Documentation
HLD: #22977