
Upgrade smartswitch via gNOI testcases #22393

Merged
vaibhavhd merged 13 commits into sonic-net:master from ryanzhu706:dpu_upgrade
Mar 16, 2026

Conversation

@ryanzhu706
Contributor

@ryanzhu706 ryanzhu706 commented Feb 12, 2026

Description of PR

The new test verifies that a gNOI-triggered upgrade results in an observable system reboot by checking expected reboot indicators (e.g. service downtime and CLI session interruption). This helps catch cases where an upgrade completes without actually triggering a full reboot.
This test improves coverage for SmartSwitch upgrade scenarios and prevents false positives where upgrades appear successful but reboot is skipped.
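
One way to make "observable reboot" concrete is to compare the boot timestamp of the device before and after the upgrade. The following is a minimal sketch under stated assumptions, not the check used in this PR: it assumes the timestamps come from `uptime -s` (format `YYYY-MM-DD HH:MM:SS`), and the helper names `parse_boot_time` and `reboot_observed` are hypothetical.

```python
from datetime import datetime, timedelta

# Timestamp format printed by `uptime -s` (assumed source of the boot time).
BOOT_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_boot_time(text):
    """Parse a boot timestamp string into a datetime."""
    return datetime.strptime(text.strip(), BOOT_TIME_FORMAT)


def reboot_observed(boot_before, boot_after, tolerance=timedelta(seconds=5)):
    """Return True if the boot time moved forward by more than `tolerance`.

    The tolerance absorbs the small jitter that the reported boot time can
    show between two reads on a system that did not reboot.
    """
    return parse_boot_time(boot_after) - parse_boot_time(boot_before) > tolerance
```

A test could capture the boot time before issuing the gNOI reboot and assert `reboot_observed(before, after)` once the device is reachable again; an upgrade that silently skipped the reboot would leave the two values equal and fail the assertion.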

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Address the test gap for smartswitch cold upgrade via gNOI.

How did you do it?

Added new testcases for smartswitch upgrade via gNOI.

How did you verify/test it?

ryanzhu@sonic-mgmt-int:/data/sonic-mgmt-int/tests$ sudo ./run_tests.sh -i ../ansible/str3,../ansible/veos -n vms66-t1-8102-7 -u -m individual -c upgrade_path/test_upgrade_smart_switch_gnoi.py -u -e "--upgrade_type=cold --base_image_list=http://10.201.148.43/pipelines/Networking-acs-buildimage-Official/pensando/internal-202506/sonic-pensando.bin --target_image_list=http://10.201.148.43/pipelines/Networking-acs-buildimage-Official/pensando/internal-202506/sonic-pensando.bin --target_version=SONiC-OS-internal-202506.154741893-16d12968f4" -d str3-8102-07 -e "--skip_sanity"
pkill: pattern that searches for process name longer than 15 characters will result in zero matches
Try `pkill -f' option to match against the complete command line.
=== Clearing pytest cache ===
=== Running tests individually ===
Running: python3 -m pytest upgrade_path/test_upgrade_smart_switch_gnoi.py --inventory ../ansible/str3,../ansible/veos --host-pattern str3-8102-07 --dpu-pattern None --testbed vms66-t1-8102-7 --testbed_file /data/sonic-mgmt-int/ansible/testbed.yaml --log-cli-level warning --log-file-level debug --kube_master unset --showlocals --assert plain --show-capture no -rav --allow_recover --ignore=ptftests --ignore=acstests --ignore=saitests --ignore=scripts --ignore=k8s --ignore=sai_qualify --log-file logs/upgrade_path/test_upgrade_smart_switch_gnoi.log --junitxml=logs/upgrade_path/test_upgrade_smart_switch_gnoi.xml --upgrade_type=cold --base_image_list=http://10.201.148.43/pipelines/Networking-acs-buildimage-Official/pensando/internal-202506/sonic-pensando.bin --target_image_list=http://10.201.148.43/pipelines/Networking-acs-buildimage-Official/pensando/internal-202506/sonic-pensando.bin --target_version=SONiC-OS-internal-202506.154741893-16d12968f4 --skip_sanity
Test session starts (platform: linux, Python 3.12.3, pytest 9.0.2, pytest-sugar 1.1.1)
ansible: 2.18.12
rootdir: /data/sonic-mgmt-int
configfile: pyproject.toml
plugins: stress-1.0.1, ansible-25.12.0, sugar-1.1.1, metadata-3.1.1, plus-0.8.1, xdist-3.8.0, cov-7.0.0, allure-pytest-2.15.3, html-4.2.0, repeat-0.9.4
collecting ... [WARNING]: While constructing a mapping from /data/sonic-mgmt-int/ansible/str3,
line 146, column 9, found a duplicate dict key (pdu-85). Using last defined
value only.
[WARNING]: Found variable using reserved name: serial
[WARNING]: While constructing a mapping from /data/sonic-mgmt-int/ansible/str3,
line 146, column 9, found a duplicate dict key (pdu-85). Using last defined
value only.
[WARNING]: Found variable using reserved name: serial
[WARNING]: While constructing a mapping from /data/sonic-mgmt-int/ansible/str3,
line 146, column 9, found a duplicate dict key (pdu-85). Using last defined
value only.
[WARNING]: Found variable using reserved name: serial
[WARNING]: While constructing a mapping from /data/sonic-mgmt-int/ansible/str3,
line 146, column 9, found a duplicate dict key (pdu-85). Using last defined
value only.
[WARNING]: Found variable using reserved name: serial
[WARNING]: While constructing a mapping from /data/sonic-mgmt-int/ansible/str3,
line 146, column 9, found a duplicate dict key (pdu-85). Using last defined
value only.
[WARNING]: Found variable using reserved name: serial
collected 2 items


--------------------------------------------------------- live log teardown ---------------------------------------------------------
WARNING  tests.common.plugins.memory_utilization.memory_utilization:memory_utilization.py:64 Skipping memory check for docker-radv due to zero value

WARNING  tests.common.plugins.memory_utilization.memory_utilization:memory_utilization.py:64 Skipping memory check for docker-teamd due to zero value
 tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_one_dpu_via_gnoi[str3-8102-07] ✓                   50% █████
 tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_multiple_dpus_via_gnoi_parallel[str3-8102-07] s   100% ██████████

Any platform specific information?

No

Supported testbed topology if it's a new test case?

smartswitch

Documentation

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@vvolam vvolam requested a review from gpunathilell February 26, 2026 21:36
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@ryanzhu706 ryanzhu706 changed the title from "Added DPU upgrade testcases." to "Upgrade smartswitch via gNOI testcases" Feb 27, 2026
@hdwhdw hdwhdw self-requested a review March 1, 2026 06:01
Contributor

@hdwhdw hdwhdw left a comment

AI Agent on behalf of Dawei:

Thanks for adding the SmartSwitch gNOI upgrade test coverage — this fills an important gap. A few thoughts below, mostly around one design suggestion and a couple of smaller items.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ryanzhu706 ryanzhu706 requested a review from hdwhdw March 10, 2026 18:33
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ryanzhu706 ryanzhu706 requested a review from vaibhavhd March 11, 2026 00:27
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions github-actions bot requested review from rawal01 and xwjiang-ms March 16, 2026 17:05
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@vaibhavhd vaibhavhd left a comment

Added non-blocking comments.


ss_target_index = request.config.getoption("ss_target_index") # int
ss_target_indices = request.config.getoption("ss_target_indices") # "0,1,2,3"
ss_reboot_ready_timeout = 600
Contributor

Non-blocking: please justify why the reboot-ready timeout is set to 600 seconds. What factors govern or influence this value?

if ss_target_index in (None, ""):
    ss_target_index = 3
if not ss_reboot_ready_timeout:
    ss_reboot_ready_timeout = 1200
Contributor

Non-blocking: why bump the timeout to 1200 s?
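
The two timeout questions above concern a fallback default and a bounded wait. A minimal sketch of that pattern follows, assuming the option arrives as `None`, an empty string, or a number of seconds. The names `resolve_timeout` and `poll_until` are hypothetical and are not the helpers used in this PR.

```python
import time


def resolve_timeout(option_value, default_seconds):
    """Return the configured timeout, or the default when the option is unset."""
    if option_value in (None, ""):
        return default_seconds
    return int(option_value)


def poll_until(condition, timeout_s, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call `condition` until it returns True or `timeout_s` seconds elapse.

    `clock` and `sleep` are injectable so the loop can be unit-tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Keeping the default in a single place (the `default_seconds` argument) makes it easier to document why a particular value such as 600 or 1200 seconds was chosen, for example the slowest observed DPU boot plus a margin.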

Comment on lines +102 to +103
# Best-effort cleanup (may run on NPU; harmless)
duthost.shell(f"rm -f {dut_image_path}", module_ignore_errors=True)
Contributor

Considering that the test has to represent what happens in production (the real world), keep in mind that if you do need this step (even if it is optional), you need to

  1. make an API out of it (as the client may not have the ability to run CLIs)
  2. formalize it as a step of the upgrade MoP.

return {"transfer_resp": transfer_resp, "setpkg_resp": setpkg_resp, "reboot_resp": reboot_resp}


def perform_gnoi_upgrade_smartswitch_dpu(duthost, tbinfo, ptf_gnoi, cfg: GnoiUpgradeConfig) -> Dict:
Contributor

Optional: since the concurrent variant is named perform_gnoi_upgrade_smartswitch_dpus_parallel, would perform_gnoi_upgrade_smartswitch_single_dpu be a better name for this function?

if cfg.allow_fail:
    return {"transfer_resp": transfer_resp, "setpkg_resp": setpkg_resp, "reboot_resp": reboot_resp}

ok = _wait_gnoi_time_ready(ptf_gnoi, md, cfg)
Contributor

@v-cshekar, as a next step from here, please include your changes (the generic perform_reboot call) in this section.

"SetPackage via gNOI System.SetPackage (streaming): filename=%s version=%s activate=%s",
local_path, version, activate,
)
self.grpc_client.configure_max_time(3600) # allow long SetPackage
Contributor

Why is this needed?

Contributor Author

The gRPC connection between client and server dropped during "set_package", which is the image installation step. The error complained about too many pings and reported "ENHANCE_YOUR_CALM" before disconnecting. Since this looks like a client/server session timeout issue, I added the max_time and keep_alive parameters to the gRPC command.
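
For context, gRPC servers that enforce a keepalive policy can close a connection with a GOAWAY carrying ENHANCE_YOUR_CALM ("too many pings") when a client pings more often than the server allows. The sketch below is only a sanity check on the two client timers under assumed values: the 300-second server minimum and the expected installation duration are assumptions, not values confirmed for this gNOI server, and `check_client_timers` is a hypothetical helper.

```python
def check_client_timers(max_time_s, keepalive_s,
                        expected_install_s, server_min_ping_interval_s=300.0):
    """Return a list of problems with the client-side timer settings.

    - max_time_s must cover the whole SetPackage call, otherwise the client
      gives up before installation finishes.
    - keepalive_s must not be shorter than the server's minimum accepted
      ping interval (assumed here), otherwise the server may drop the
      connection for pinging too often.
    """
    problems = []
    if max_time_s < expected_install_s:
        problems.append("max_time is shorter than the expected installation time")
    if keepalive_s < server_min_ping_interval_s:
        problems.append("keepalive is more frequent than the server allows")
    return problems
```

With the values from the snippet under review (3600 and 300) and an assumed 20-minute installation, `check_client_timers(3600, 300, 1200)` returns an empty list. Whether the server's real limit is 300 seconds is exactly the open question in this thread.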

local_path, version, activate,
)
self.grpc_client.configure_max_time(3600) # allow long SetPackage
self.grpc_client.configure_keepalive_time(300) # 5 min keepalive (less aggressive
Contributor

Same here - why is this step really needed?

Contributor

We really need to understand this step clearly: is this a client-side need or a server-side limitation? @hdwhdw

Contributor

If this is a client-side need, we need to update the MoP that we have for our clients.

Contributor Author

We may need some changes on the server side to determine when the installation has completed.



@pytest.mark.device_type("smartswitch")
def test_upgrade_multiple_dpus_via_gnoi_parallel(
Contributor

This testcase has not been tested yet. Only the single-DPU upgrade results can be trusted.

Still allowing this to go in, in favor of having the test coverage available.
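
Because the parallel path is untested, it may help to picture its general shape. The following is a minimal sketch, not the PR's perform_gnoi_upgrade_smartswitch_dpus_parallel: it assumes the DPU indices arrive as a comma-separated string such as "0,1,2,3" and that a per-DPU upgrade callable is supplied by the caller. `parse_indices` and `upgrade_in_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_indices(raw):
    """Turn a string like "0,1,2,3" into a list of integers."""
    return [int(part) for part in raw.split(",") if part.strip()]


def upgrade_in_parallel(indices, upgrade_one):
    """Run `upgrade_one(index)` for every index concurrently.

    Returns a dict mapping each index to ("ok", result) or ("failed", exc),
    so one failing DPU does not hide the outcome of the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(indices))) as pool:
        futures = {idx: pool.submit(upgrade_one, idx) for idx in indices}
        for idx, future in futures.items():
            try:
                results[idx] = ("ok", future.result())
            except Exception as exc:  # report per-DPU failures individually
                results[idx] = ("failed", exc)
    return results
```

Collecting a per-index status rather than raising on the first error makes it possible to assert on every DPU once the parallel test is eventually run on hardware.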

@vaibhavhd vaibhavhd merged commit c9fe85a into sonic-net:master Mar 16, 2026
15 of 16 checks passed
@vaibhavhd vaibhavhd added the "Request for 202511 branch" label (request to backport a change to the 202511 branch) Mar 16, 2026
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Mar 17, 2026
The new test verifies that a gNOI-triggered upgrade results in an observable system reboot by checking expected reboot indicators (e.g. service downtime and CLI session interruption). This helps catch cases where an upgrade completes without actually triggering a full reboot.
This test improves coverage for SmartSwitch upgrade scenarios and prevents false positives where upgrades appear successful but reboot is skipped.

Tests -
Passed -
 tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_one_dpu_via_gnoi[str3-8102-07] ✓                   50%
Skipped -
 tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_multiple_dpus_via_gnoi_parallel[str3-8102-07]

Signed-off-by: Abhishek <[email protected]>
vrajeshe pushed a commit to vrajeshe/sonic-mgmt that referenced this pull request Mar 23, 2026
The new test verifies that a gNOI-triggered upgrade results in an observable system reboot by checking expected reboot indicators (e.g. service downtime and CLI session interruption). This helps catch cases where an upgrade completes without actually triggering a full reboot.
This test improves coverage for SmartSwitch upgrade scenarios and prevents false positives where upgrades appear successful but reboot is skipped.

Tests -
Passed -
 tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_one_dpu_via_gnoi[str3-8102-07] ✓                   50%
Skipped -
 tests/upgrade_path/test_upgrade_smart_switch_gnoi.py::test_upgrade_multiple_dpus_via_gnoi_parallel[str3-8102-07]

Signed-off-by: Venkata Gouri Rajesh Etla <[email protected]>
@ryanzhu706 ryanzhu706 deleted the dpu_upgrade branch March 24, 2026 22:49

Labels

Request for 202511 branch (request to backport a change to the 202511 branch)

5 participants