[gnoi-shutdown-daemon] Skip gNOI shutdown for already powered-off DPUs#352
Open
vvolam wants to merge 2 commits intosonic-net:masterfrom
Open
[gnoi-shutdown-daemon] Skip gNOI shutdown for already powered-off DPUs#352vvolam wants to merge 2 commits intosonic-net:masterfrom
vvolam wants to merge 2 commits intosonic-net:masterfrom
Conversation
Check DPU operational status before attempting gNOI Reboot HALT. If the DPU is not Online (e.g. Offline, PoweredDown), skip the gNOI shutdown sequence and clear the halt flag directly. This avoids error logs when config reload or reboot is issued while DPUs are already in the down state. Fixes: sonic-net/sonic-buildimage#25889 Signed-off-by: Vasundhara Volam <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
12 tasks
There was a problem hiding this comment.
Pull request overview
This PR updates the gnoi-shutdown-daemon flow to avoid attempting a gNOI Reboot (HALT) on DPUs that are already not operationally Online, reducing noisy error logs during config reload/reboot scenarios.
Changes:
- Add an early operational-status check in
GnoiRebootHandler._handle_transition()to skip the gNOI shutdown sequence when the DPU is notOnline. - Update existing unit tests to mock
get_oper_status()asOnlineto preserve coverage of the normal shutdown path. - Add new unit tests covering
Offline,PoweredDown, and exception fallback behavior for the oper-status check.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
scripts/gnoi_shutdown_daemon.py |
Adds a chassis/module oper-status gate before initiating gNOI shutdown, and keeps fallback behavior on platform API errors. |
tests/gnoi_shutdown_daemon_test.py |
Extends test coverage for the new skip behavior and updates existing tests for the new oper-status dependency. |
dgsudharsan
requested changes
Mar 17, 2026
- Use ModuleBase constants (MODULE_STATUS_OFFLINE, MODULE_STATUS_POWERED_DOWN) instead of hardcoded strings for oper_status checks - Only skip gNOI shutdown for Offline/PoweredDown states; other non-Online states like Fault still proceed with the shutdown attempt - Propagate _clear_halt_flag() return value in early-return path - Add unit tests for new skip behavior and clear-halt failure propagation Signed-off-by: Vasundhara Volam <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
dgsudharsan
approved these changes
Mar 19, 2026
hdwhdw
reviewed
Mar 19, 2026
Comment on lines
+134
to
+150
| module_index = self._chassis.get_module_index(dpu_name) | ||
| if module_index >= 0: | ||
| module = self._chassis.get_module(module_index) | ||
| if module is not None: | ||
| oper_status = module.get_oper_status() | ||
| if oper_status in (ModuleBase.MODULE_STATUS_OFFLINE, | ||
| ModuleBase.MODULE_STATUS_POWERED_DOWN): | ||
| logger.log_notice( | ||
| f"{dpu_name}: DPU is already in '{oper_status}' state, " | ||
| "skipping gNOI shutdown sequence" | ||
| ) | ||
| cleared = self._clear_halt_flag(dpu_name) | ||
| if not cleared: | ||
| logger.log_warning( | ||
| f"{dpu_name}: Failed to clear halt flag while skipping gNOI shutdown" | ||
| ) | ||
| return cleared |
Contributor
There was a problem hiding this comment.
Consider extracting these logic into smaller functions like this, it avoid nesting and make it easier to read:
def _should_skip_gnoi_shutdown(self, dpu_name: str) -> Optional[bool]:
"""
Returns:
True -> DPU is known to be offline/powered down; skip gNOI shutdown.
False -> DPU is known to be online (or some other state where we proceed).
None -> Cannot determine status; caller should proceed with gNOI shutdown.
"""
module_index = self._chassis.get_module_index(dpu_name)
if module_index < 0:
return None
module = self._chassis.get_module(module_index)
if module is None:
return None
oper_status = module.get_oper_status()
return oper_status in (
ModuleBase.MODULE_STATUS_OFFLINE,
ModuleBase.MODULE_STATUS_POWERED_DOWN,
)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What I did
Added an operational status check in
gnoi_shutdown_daemon.py_handle_transition()to skip the gNOI Reboot HALT sequence when the DPU is already not Online (e.g. Offline, PoweredDown).Why I did it
Fixes sonic-net/sonic-buildimage#25889
When DPUs are configured with
admin_status: "down"and are already powered off, a config reload or reboot repopulates CONFIG_DB, which triggersgnoi-shutdown-daemonto attempt a gNOI Reboot HALT command on DPUs that are already offline — producing error logs:How I verified it
test_handle_transition_dpu_already_offline— verifies skip when DPU is inOfflinestatetest_handle_transition_dpu_powered_down— verifies skip when DPU is inPoweredDownstatetest_handle_transition_oper_status_check_exception— verifies graceful fallback when the oper status check raises an exceptionget_oper_status()returningOnlineso they continue testing the normal gNOI shutdown flowDetails if related
The fix queries
get_oper_status()via the platform chassis API at the start of_handle_transition(). If the DPU is notOnline, the method logs a notice, clears the halt flag, and returns success — avoiding any gNOI RPC calls to unreachable DPUs. The check is wrapped in try/except so that if the platform API fails, the daemon falls back to the existing behavior.