[gnoi-shutdown-daemon] Skip gNOI shutdown for already powered-off DPUs #352

Open
vvolam wants to merge 2 commits into sonic-net:master from vvolam:fix-error-log

vvolam commented Mar 6, 2026

What I did

Added an operational status check in gnoi_shutdown_daemon.py _handle_transition() to skip the gNOI Reboot HALT sequence when the DPU is already not Online (e.g. Offline, PoweredDown).

Why I did it

Fixes sonic-net/sonic-buildimage#25889

When DPUs are configured with admin_status: "down" and are already powered off, a config reload or reboot repopulates CONFIG_DB, which triggers gnoi-shutdown-daemon to attempt a gNOI Reboot HALT command on DPUs that are already offline — producing error logs:

ERR gnoi-shutdown-daemon[12171]: DPU0: Reboot command failed
ERR gnoi-shutdown-daemon[12171]: DPU0: Failed to send Reboot command

How I verified it

  • Added unit tests for the new behavior:
    • test_handle_transition_dpu_already_offline — verifies skip when DPU is in Offline state
    • test_handle_transition_dpu_powered_down — verifies skip when DPU is in PoweredDown state
    • test_handle_transition_oper_status_check_exception — verifies graceful fallback when the oper status check raises an exception
  • Updated existing tests to mock get_oper_status() returning Online so they continue testing the normal gNOI shutdown flow
  • Manually verified the logic with mock tests
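
The test cases listed above might look roughly like the following sketch. It is not the PR's actual test file: `should_skip_shutdown` and `make_chassis` are hypothetical stand-ins written only for this illustration, and the status strings are assumed to be "Online", "Offline", and "PoweredDown".

```python
import unittest
from unittest import mock


def should_skip_shutdown(chassis, dpu_name):
    """Hypothetical stand-in for the daemon's oper-status gate."""
    try:
        module = chassis.get_module(chassis.get_module_index(dpu_name))
        return module.get_oper_status() in ("Offline", "PoweredDown")
    except Exception:
        # If the platform API fails, fall back to the normal gNOI shutdown path.
        return False


def make_chassis(oper_status=None, side_effect=None):
    """Build a mocked chassis whose single module reports the given status."""
    chassis = mock.MagicMock()
    chassis.get_module_index.return_value = 0
    module = chassis.get_module.return_value
    module.get_oper_status.return_value = oper_status
    module.get_oper_status.side_effect = side_effect
    return chassis


class OperStatusGateTest(unittest.TestCase):
    def test_dpu_already_offline(self):
        self.assertTrue(should_skip_shutdown(make_chassis("Offline"), "DPU0"))

    def test_dpu_powered_down(self):
        self.assertTrue(should_skip_shutdown(make_chassis("PoweredDown"), "DPU0"))

    def test_online_keeps_normal_flow(self):
        self.assertFalse(should_skip_shutdown(make_chassis("Online"), "DPU0"))

    def test_oper_status_check_exception(self):
        chassis = make_chassis(side_effect=RuntimeError("platform API error"))
        self.assertFalse(should_skip_shutdown(chassis, "DPU0"))
```

Such a module can be run with `python -m unittest`; mocking `get_oper_status()` to return Online is what keeps the pre-existing tests on the normal shutdown path.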

Details if related

The fix queries get_oper_status() via the platform chassis API at the start of _handle_transition(). If the DPU is not Online, the method logs a notice, clears the halt flag, and returns success — avoiding any gNOI RPC calls to unreachable DPUs. The check is wrapped in try/except so that if the platform API fails, the daemon falls back to the existing behavior.
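
The flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the daemon's code: `FakeModule`, `FakeChassis`, and `handle_transition` are hypothetical names, the status string "Online" is an assumption, and the real gNOI sequence and halt-flag handling are replaced by callbacks.

```python
from typing import Optional

ONLINE = "Online"  # assumed status string; the real daemon would use platform API values


class FakeModule:
    """Hypothetical stand-in for a platform module object."""

    def __init__(self, oper_status: str, raise_on_query: bool = False):
        self._oper_status = oper_status
        self._raise_on_query = raise_on_query

    def get_oper_status(self) -> str:
        if self._raise_on_query:
            raise RuntimeError("platform API unavailable")
        return self._oper_status


class FakeChassis:
    """Hypothetical stand-in for the platform chassis API."""

    def __init__(self, modules: dict):
        self._names = list(modules)
        self._modules = modules

    def get_module_index(self, name: str) -> int:
        return self._names.index(name) if name in self._names else -1

    def get_module(self, index: int) -> Optional[FakeModule]:
        return self._modules[self._names[index]]


def handle_transition(chassis, dpu_name, run_gnoi_halt, clear_halt_flag, log=print) -> bool:
    """Skip the gNOI HALT sequence when the DPU is known not to be Online."""
    try:
        index = chassis.get_module_index(dpu_name)
        module = chassis.get_module(index) if index >= 0 else None
        if module is not None:
            status = module.get_oper_status()
            if status != ONLINE:
                log(f"{dpu_name}: already in '{status}' state, skipping gNOI shutdown")
                clear_halt_flag(dpu_name)
                return True
    except Exception as exc:
        # Platform API failure: keep the pre-existing behavior.
        log(f"{dpu_name}: oper status check failed ({exc}); proceeding with gNOI shutdown")
    return run_gnoi_halt(dpu_name)
```

In this sketch an unknown DPU name or a failing status query both fall through to `run_gnoi_halt`, mirroring the fallback the description mentions.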

Check DPU operational status before attempting gNOI Reboot HALT.
If the DPU is not Online (e.g. Offline, PoweredDown), skip the
gNOI shutdown sequence and clear the halt flag directly. This
avoids error logs when config reload or reboot is issued while
DPUs are already in the down state.

Fixes: sonic-net/sonic-buildimage#25889
Signed-off-by: Vasundhara Volam <[email protected]>
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Pull request overview

This PR updates the gnoi-shutdown-daemon flow to avoid attempting a gNOI Reboot (HALT) on DPUs that are already not operationally Online, reducing noisy error logs during config reload/reboot scenarios.

Changes:

  • Add an early operational-status check in GnoiRebootHandler._handle_transition() to skip the gNOI shutdown sequence when the DPU is not Online.
  • Update existing unit tests to mock get_oper_status() as Online to preserve coverage of the normal shutdown path.
  • Add new unit tests covering Offline, PoweredDown, and exception fallback behavior for the oper-status check.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
scripts/gnoi_shutdown_daemon.py | Adds a chassis/module oper-status gate before initiating gNOI shutdown, and keeps fallback behavior on platform API errors.
tests/gnoi_shutdown_daemon_test.py | Extends test coverage for the new skip behavior and updates existing tests for the new oper-status dependency.

- Use ModuleBase constants (MODULE_STATUS_OFFLINE, MODULE_STATUS_POWERED_DOWN)
  instead of hardcoded strings for oper_status checks
- Only skip gNOI shutdown for Offline/PoweredDown states; other non-Online
  states like Fault still proceed with the shutdown attempt
- Propagate _clear_halt_flag() return value in early-return path
- Add unit tests for new skip behavior and clear-halt failure propagation

Signed-off-by: Vasundhara Volam <[email protected]>
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

vvolam requested a review from dgsudharsan March 19, 2026 04:33
Comment on lines +134 to +150
module_index = self._chassis.get_module_index(dpu_name)
if module_index >= 0:
    module = self._chassis.get_module(module_index)
    if module is not None:
        oper_status = module.get_oper_status()
        if oper_status in (ModuleBase.MODULE_STATUS_OFFLINE,
                           ModuleBase.MODULE_STATUS_POWERED_DOWN):
            logger.log_notice(
                f"{dpu_name}: DPU is already in '{oper_status}' state, "
                "skipping gNOI shutdown sequence"
            )
            cleared = self._clear_halt_flag(dpu_name)
            if not cleared:
                logger.log_warning(
                    f"{dpu_name}: Failed to clear halt flag while skipping gNOI shutdown"
                )
            return cleared

hdwhdw commented Mar 19, 2026

Consider extracting this logic into a smaller function like this; it avoids nesting and makes the code easier to read:

def _should_skip_gnoi_shutdown(self, dpu_name: str) -> Optional[bool]:
    """
    Returns:
      True  -> DPU is known to be offline/powered down; skip gNOI shutdown.
      False -> DPU is known to be online (or some other state where we proceed).
      None  -> Cannot determine status; caller should proceed with gNOI shutdown.
    """
    module_index = self._chassis.get_module_index(dpu_name)
    if module_index < 0:
        return None

    module = self._chassis.get_module(module_index)
    if module is None:
        return None

    oper_status = module.get_oper_status()
    return oper_status in (
        ModuleBase.MODULE_STATUS_OFFLINE,
        ModuleBase.MODULE_STATUS_POWERED_DOWN,
    )
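
One way a caller could consume the three-valued result documented in the docstring above is sketched below. This is an illustration only: `handle_transition` and its callback parameters are hypothetical, and treating an exception from the status query as "unknown" is an assumption carried over from the PR description's try/except fallback.

```python
from typing import Callable, Optional


def handle_transition(
    dpu_name: str,
    should_skip: Callable[[str], Optional[bool]],
    clear_halt_flag: Callable[[str], bool],
    run_gnoi_shutdown: Callable[[str], bool],
) -> bool:
    """Dispatch on True (skip), False (proceed), or None (unknown, proceed)."""
    try:
        skip = should_skip(dpu_name)
    except Exception:
        # A platform API failure is treated like an unknown status.
        skip = None

    if skip is True:
        # DPU is already down: only clear the halt flag and report whether that worked.
        return clear_halt_flag(dpu_name)

    # False or None: run the normal gNOI shutdown sequence.
    return run_gnoi_shutdown(dpu_name)
```

Comparing with `is True` keeps `None` and `False` on the same path, so an undetermined status never suppresses the shutdown attempt.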

Development

Successfully merging this pull request may close these issues.

Bug: [Smartswitch] Error log from gnoi-shutdown-daemon after setting admin state to down on a DPU which is powered off
