Upgrade smartswitch via gNOI testcases#22393
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
|
/azp run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
hdwhdw
left a comment
There was a problem hiding this comment.
AI Agent on behalf of Dawei:
Thanks for adding the SmartSwitch gNOI upgrade test coverage — this fills an important gap. A few thoughts below, mostly around one design suggestion and a couple of smaller items.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
vaibhavhd
left a comment
There was a problem hiding this comment.
Added non blocking comments.
|
|
||
| ss_target_index = request.config.getoption("ss_target_index") # int | ||
| ss_target_indices = request.config.getoption("ss_target_indices") # "0,1,2,3" | ||
| ss_reboot_ready_timeout = 600 |
There was a problem hiding this comment.
Non blocking - please justify why 600 seconds is set to reboot ready timeout? What factors govern or influence this decision?
| if ss_target_index in (None, ""): | ||
| ss_target_index = 3 | ||
| if not ss_reboot_ready_timeout: | ||
| ss_reboot_ready_timeout = 1200 |
There was a problem hiding this comment.
Non blocking - Why bumping to 1200s?
| # Best-effort cleanup (may run on NPU; harmless) | ||
| duthost.shell(f"rm -f {dut_image_path}", module_ignore_errors=True) |
There was a problem hiding this comment.
Considering that test has to represent what happens in production (real worls), keep in mind if you do need this step (even if this is optional) you need to
- make an API out of it (as the client may not have ability to run CLIs)
- formulize it as a step of upgrade MoP.
| return {"transfer_resp": transfer_resp, "setpkg_resp": setpkg_resp, "reboot_resp": reboot_resp} | ||
|
|
||
|
|
||
| def perform_gnoi_upgrade_smartswitch_dpu(duthost, tbinfo, ptf_gnoi, cfg: GnoiUpgradeConfig) -> Dict: |
There was a problem hiding this comment.
Optional - Considering the concurrent call has a name perform_gnoi_upgrade_smartswitch_dpus_parallel, a better name for this function should be perform_gnoi_upgrade_smartswitch_single_dpu?
| if cfg.allow_fail: | ||
| return {"transfer_resp": transfer_resp, "setpkg_resp": setpkg_resp, "reboot_resp": reboot_resp} | ||
|
|
||
| ok = _wait_gnoi_time_ready(ptf_gnoi, md, cfg) |
There was a problem hiding this comment.
@v-cshekar , as a next step from here, please include your changes (generic perform_reboot call in this section)
| "SetPackage via gNOI System.SetPackage (streaming): filename=%s version=%s activate=%s", | ||
| local_path, version, activate, | ||
| ) | ||
| self.grpc_client.configure_max_time(3600) # allow long SetPackage |
There was a problem hiding this comment.
grpc client and server connections dropped when "set_package" which is image installation. It complains about too many pings and "ENHANCE_YOUR_CALM" and disconnected, considering it's client/server side session time out issue, added the max_time and keep_alive parameters in the grpc command.
| local_path, version, activate, | ||
| ) | ||
| self.grpc_client.configure_max_time(3600) # allow long SetPackage | ||
| self.grpc_client.configure_keepalive_time(300) # 5 min keepalive (less aggressive |
There was a problem hiding this comment.
Same here - why is this step really needed?
There was a problem hiding this comment.
We really need to understand this step clearly - if this is client side need, or server's limitation -- @hdwhdw
There was a problem hiding this comment.
If this is client side need, we need to update the MoP that we have for our clients.
There was a problem hiding this comment.
We maybe need some changes on server side to determine when the installation is completed.
|
|
||
|
|
||
| @pytest.mark.device_type("smartswitch") | ||
| def test_upgrade_multiple_dpus_via_gnoi_parallel( |
There was a problem hiding this comment.
This testcase is not yet tested. The results are trusted only from single DPU upgrade.
Still allowing this to go in in favor of having test coverage out there.
The new test verifies that a gNOI-triggered upgrade results in an observable system reboot by checking expected reboot indicators (e.g. service downtime and CLI session interruption). This helps catch cases where an upgrade completes without actually triggering a full reboot. This test improves coverage for SmartSwitch upgrade scenarios and prevents false positives where upgrades appear successful but reboot is skipped. Tests - Passed - tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_one_dpu_via_gnoi[str3-8102-07] ✓ 50% Skipped - tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_multiple_dpus_via_gnoi_parallel[str3-8102-07] Signed-off-by: Abhishek <[email protected]>
The new test verifies that a gNOI-triggered upgrade results in an observable system reboot by checking expected reboot indicators (e.g. service downtime and CLI session interruption). This helps catch cases where an upgrade completes without actually triggering a full reboot. This test improves coverage for SmartSwitch upgrade scenarios and prevents false positives where upgrades appear successful but reboot is skipped. Tests - Passed - tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_one_dpu_via_gnoi[str3-8102-07] ✓ 50% Skipped - tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_multiple_dpus_via_gnoi_parallel[str3-8102-07] Signed-off-by: Venkata Gouri Rajesh Etla <[email protected]>
Description of PR
The new test verifies that a gNOI-triggered upgrade results in an observable system reboot by checking expected reboot indicators (e.g. service downtime and CLI session interruption). This helps catch cases where an upgrade completes without actually triggering a full reboot.
This test improves coverage for SmartSwitch upgrade scenarios and prevents false positives where upgrades appear successful but reboot is skipped.
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
Address test gap of smartswitch cold upgrade via gNOI.
How did you do it?
Added new testcases for smartswitch upgrade via gNOI.
How did you verify/test it?
Any platform specific information?
No
Supported testbed topology if it's a new test case?
smartswitch
Documentation