Module graceful shutdown support #4031

Open
rameshraghupathy wants to merge 56 commits into sonic-net:master from rameshraghupathy:graceful-shutdown

@rameshraghupathy rameshraghupathy commented Aug 13, 2025

Provide support for SmartSwitch DPU module graceful shutdown.

Description:

  • Single source of truth for transitions

    • All components now use sonic_platform_base.module_base.ModuleBase helpers:

      • set_module_state_transition(db, name, transition_type)
      • clear_module_state_transition(db, name)
      • get_module_state_transition(db, name) -> dict
      • is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
    • Eliminates duplicated logic and race-prone direct Redis writes.

  • Correct table everywhere

    • Standardized on CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).
    • HLD mismatch addressed in code (HLD fix tracked separately).
  • Ownership & lifecycle

    • The initiator of an operation (startup/shutdown/reboot) sets:

      • state_transition_in_progress=True
      • transition_type=<op>
      • transition_start_time=<utc-iso8601>
    • The platform (set_admin_state()) is responsible for clearing:

      • state_transition_in_progress=False
      • optionally transition_end_time=<epoch> (or similar end stamp).
    • CLI pre-clears only when a prior transition is timed out.

  • Timeouts & policy

    • Timeouts are read only from the platform JSON at /usr/share/sonic/device/{plat}/platform.json; if no value is found there, built-in constants are used.

    • Typical production values used:

      • startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
    • Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy. It is implemented inside the platform's set_admin_state(), not in ModuleBase.

  • Boot behavior

    • chassisd on start:

      1. Clears stale flags once (centralized sweep).
      2. Runs set_initial_dpu_admin_state() which marks transitions via ModuleBase before calling platform set_admin_state().
      3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
  • gNOI shutdown daemon

    • Listens on CHASSIS_MODULE_TABLE and triggers only when:

      • state_transition_in_progress=True and transition_type=shutdown.
    • Never clears the flag (ownership stays with the platform).

    • Uses bounded RPC timeouts and Redis access that works with both swsssdk and swsscommon.

  • CLI (config chassis modules …)

    • Uses ModuleBase APIs for all set/get/timeout checks.
    • If a previous transition is stuck (is_module_state_transition_timed_out() returns True), the CLI auto-clears it and then proceeds.
    • Sets transition at the start of startup/shutdown; platform clears on completion.
    • Fabric card flow retained; edits are surgical.
  • Redis robustness

    • Helpers handle both stacks (swsssdk/swsscommon); no hset(mapping=...) usage.
    • Consistent HGETALL/HSET paths; resilient to connector differences.
  • Race reduction & consistency

    • Centralized writes prevent multi-writer races.
    • All transition writes include transition_start_time; clears may add an end stamp.
    • Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
  • Change scope

    • Minimal, targeted diffs.
    • No background tasks added, no broad refactors beyond transition handling.
    • Behavior changes are limited to making transition semantics correct and uniform across repos.
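
The transition lifecycle described above can be sketched with an in-memory model. This is only an illustration of the stated semantics, not the `sonic_platform_base` implementation: `FakeStateDb`, `begin_cli_transition`, `should_trigger_gnoi_shutdown`, and the optional `now` parameter are hypothetical names added for the example, and treating an unreadable start time as timed out is an assumption.

```python
from datetime import datetime, timedelta, timezone

TABLE = "CHASSIS_MODULE_TABLE"


class FakeStateDb:
    """In-memory stand-in for a STATE_DB connector (hypothetical)."""

    def __init__(self):
        self._hashes = {}

    def hset(self, key, field, value):
        self._hashes.setdefault(key, {})[field] = str(value)

    def hgetall(self, key):
        return dict(self._hashes.get(key, {}))


def _key(name):
    return f"{TABLE}|{name}"


def set_module_state_transition(db, name, transition_type, now=None):
    """Initiator side: mark the transition and stamp its start time."""
    start = (now or datetime.now(timezone.utc)).isoformat()
    db.hset(_key(name), "state_transition_in_progress", "True")
    db.hset(_key(name), "transition_type", transition_type)
    db.hset(_key(name), "transition_start_time", start)


def clear_module_state_transition(db, name):
    """Platform side: clear the flag once the operation completes."""
    db.hset(_key(name), "state_transition_in_progress", "False")


def get_module_state_transition(db, name):
    return db.hgetall(_key(name))


def is_module_state_transition_timed_out(db, name, timeout_secs, now=None):
    fields = get_module_state_transition(db, name)
    if fields.get("state_transition_in_progress") != "True":
        return False
    try:
        start = datetime.fromisoformat(fields["transition_start_time"])
    except (KeyError, ValueError):
        return True  # assumption: an unreadable start time counts as stale
    now = now or datetime.now(timezone.utc)
    return (now - start).total_seconds() > timeout_secs


def begin_cli_transition(db, name, transition_type, timeout_secs, now=None):
    """CLI side: pre-clear only a timed-out transition, then set the new one."""
    fields = get_module_state_transition(db, name)
    if fields.get("state_transition_in_progress") == "True":
        if not is_module_state_transition_timed_out(db, name, timeout_secs, now):
            return False  # a live transition is still owned by someone else
        clear_module_state_transition(db, name)
    set_module_state_transition(db, name, transition_type, now)
    return True


def should_trigger_gnoi_shutdown(fields):
    """Daemon side: react only to an in-progress shutdown; never clear the flag."""
    return (fields.get("state_transition_in_progress") == "True"
            and fields.get("transition_type") == "shutdown")
```

In this model a second request within the timeout window is refused, while a request arriving after the window clears the stale flag and proceeds, which is the pre-clear behavior the description attributes to the CLI.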

HLD: sonic-net/SONiC#1991
sonic-platform-common: sonic-net/sonic-platform-common#567
sonic-host-services: sonic-net/sonic-host-services#255
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667

How to verify it
Issue the "config chassis modules shutdown DPUx" command.
Verify that the DPU module shut down gracefully by checking the logs in /var/log/syslog on both the NPU and the DPU.
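
Alongside the syslog check, the flag itself could be polled until the platform clears it. The following is a minimal sketch under assumptions: `wait_for_transition_clear` and its `read_fields` callable are hypothetical, and the 180-second default is simply the shutdown timeout quoted in the description.

```python
import time


def wait_for_transition_clear(read_fields, timeout_secs=180.0, poll_secs=2.0,
                              clock=time.monotonic, sleep=time.sleep):
    """Poll until state_transition_in_progress is no longer "True".

    read_fields is a caller-supplied function returning the module's hash
    fields as a dict. Returns True if the flag cleared before the deadline,
    False if the deadline passed first.
    """
    deadline = clock() + timeout_secs
    while True:
        fields = read_fields()
        if fields.get("state_transition_in_progress") != "True":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_secs)
```

The clock and sleep functions are injectable so the loop can be exercised without real waiting. On a device, `read_fields` would wrap whatever STATE_DB read is available there.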

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

wangxin commented Aug 18, 2025

/azp run

vvolam commented Aug 18, 2025

@rameshraghupathy Could you please update the PR description with more detail on the code changes being made?

@vvolam vvolam requested a review from Copilot August 20, 2025 22:43
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

qiluo-msft pushed a commit to sonic-net/sonic-host-services that referenced this pull request Nov 21, 2025
Provide support for SmartSwitch DPU module graceful shutdown.
mssonicbld added a commit to mssonicbld/sonic-host-services that referenced this pull request Dec 3, 2025
Provide support for SmartSwitch DPU module graceful shutdown.
mssonicbld added a commit to sonic-net/sonic-host-services that referenced this pull request Dec 3, 2025
Provide support for SmartSwitch DPU module graceful shutdown.
mssonicbld added a commit to mssonicbld/sonic-utilities that referenced this pull request Dec 19, 2025
HLD: https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md
These changes build upon enhancements in sonic-net#4031

This PR adds CLI support and visibility for module-level graceful transitions (startup/shutdown/reboot) to align with the SmartSwitch/DPU lifecycle work.

#### What I did

- Added support to view module transition states (startup, shutdown, reboot) through CLI.
- Integrated with STATE_DB CHASSIS_MODULE_TABLE to display transition status, type, and elapsed time.
- Enhanced user experience with readable durations and exit codes for automation.
- Implemented comprehensive unit tests for transition visibility, parsing, and error handling.

#### How I did it

- Added a helper class to read STATE_DB entries:
    - state_transition_in_progress
    - transition_type
    - transition_start_time
- Implemented robust error handling for missing or malformed DB entries.
- Added pytest-based unit tests using mocked state_db_connector.
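
A helper of the kind described above might look like the following sketch. It is not the code from the PR: `ModuleTransitionReader`, `TransitionStatus`, `format_duration`, and the injected `fetch` callable are hypothetical, and the start time is assumed to be stored as epoch seconds, as in the sample output in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class TransitionStatus:
    in_progress: bool
    transition_type: Optional[str]
    elapsed_secs: Optional[int]


def format_duration(secs):
    """Render a second count as a short human-readable duration."""
    minutes, seconds = divmod(int(secs), 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}h {minutes}m {seconds}s"
    if minutes:
        return f"{minutes}m {seconds}s"
    return f"{seconds}s"


class ModuleTransitionReader:
    """Reads transition fields for a module, tolerating bad or missing data."""

    def __init__(self, fetch: Callable[[str], Optional[Mapping[str, str]]]):
        # fetch(key) returns the hash fields for a STATE_DB key, or None.
        self._fetch = fetch

    def read(self, module, now_epoch):
        fields = self._fetch(f"CHASSIS_MODULE_TABLE|{module}") or {}
        flag = str(fields.get("state_transition_in_progress", ""))
        in_progress = flag.lower() == "true"
        transition_type = fields.get("transition_type") or None
        elapsed = None
        try:
            start = int(fields["transition_start_time"])
            elapsed = max(0, int(now_epoch) - start)
        except (KeyError, ValueError, TypeError):
            pass  # missing or malformed start time: report elapsed as unknown
        return TransitionStatus(in_progress, transition_type, elapsed)
```

Passing the lookup in as a callable keeps the reader independent of any particular connector, which also makes it straightforward to unit test with a plain dictionary in place of a mocked database.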

#### How to verify it
- Build and install the updated sonic-utilities package on DUT
- Check Redis entries: `redis-cli -n 6 hgetall "CHASSIS_MODULE_TABLE|DPU0"`
- Run the module startup/shutdown commands
- Run unit tests

#### Sample output when state_transition_in_progress is set
An error is reported when a transition is already in progress on the same module.

$ sudo config chassis modules shutdown DPU2;redis-cli -n 6 hgetall 'CHASSIS_MODULE_TABLE|DPU2';sudo reboot -d DPU2;redis-cli -n 6 hgetall 'CHASSIS_MODULE_TABLE|DPU2'
Shutting down chassis module DPU2
 1) "desc"
 2) "NVIDIA XXXXXX DPU"
 3) "slot"
 4) "N/A"
 5) "oper_status"
 6) "Online"
 7) "serial"
 8) "XXXXXXXXXX"
 9) "transition_in_progress"
10) "True"
11) "transition_type"
12) "shutdown"
13) "transition_start_time"
14) "1763059401"
True
2025-11-13 18:43:22 - User requested rebooting device dpu2 ...
2025-11-13 18:43:23 - INFO: DPU dpu2 is in 'Online' state before reboot.
2025-11-13 18:43:23 - ERROR: state_transition_in_progress flag is already set for dpu2
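
For scripted checks, the numbered `redis-cli hgetall` output shown above can be folded into a dictionary. This is a rough sketch with hypothetical helper names (`parse_hgetall`, `start_time_utc`). It assumes the alternating field/value layout of the sample and reads `transition_start_time` as epoch seconds.

```python
import re
from datetime import datetime, timezone

# Matches lines such as ` 9) "transition_in_progress"` and captures the text.
_LINE = re.compile(r'^\s*\d+\)\s+"(.*)"\s*$')


def parse_hgetall(text):
    """Pair up alternating field/value lines; other lines are ignored."""
    values = [m.group(1) for m in map(_LINE.match, text.splitlines()) if m]
    return dict(zip(values[0::2], values[1::2]))


def start_time_utc(fields):
    """Interpret transition_start_time as epoch seconds in UTC."""
    return datetime.fromtimestamp(int(fields["transition_start_time"]),
                                  tz=timezone.utc)
```

Applied to the sample, the start time 1763059401 decodes to 2025-11-13 18:43:21 UTC, which lines up with the reboot log lines that follow if those timestamps are in UTC. Note that the sample prints the flag as `transition_in_progress`, so a script should look up the field name it actually finds.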

#### Previous command output (if the output of a command-line utility has changed)

#### New command output (if the output of a command-line utility has changed)
$ reboot -d DPU1
True
2025-11-17 17:56:10 - User requested rebooting device dpu1 ...
2025-11-17 17:56:11 - INFO: DPU dpu1 is in 'Online' state before reboot.
2025-11-17 17:56:12 - INFO: Rebooting dpu1, ip:1X9.XXX.X00.2 gnmi_port:50XXX
2025-11-17 17:56:53 - INFO: dpu1 halted the services successfully
2025-11-17 17:58:50 - INFO: Rebooting dpu1 with reboot_type:DPU...
mssonicbld added a commit that referenced this pull request Jan 6, 2026
…4168)
@KrisNey-MSFT

Hi @rameshraghupathy, is this one ready to go? I see some checks for conflict resolution...
