Bug: [Smartswitch] Module state transition: nested lock / deadlock #26282

@gpunathilell

Is it platform specific

generic

Importance or Severity

Critical

Description of the bug

get_module_state_transition() in module_base.py enters with self._transition_operation_lock() (exclusive fcntl.flock on /var/lock/<module>_transition.lock), then—when the transition flag is stale and past the configured timeout—calls clear_module_state_transition(), which enters the same lock again in a nested with self._transition_operation_lock().

This lock is not reentrant. The thread already holds the exclusive lock from the outer with, and the inner open() creates a new open file description, so its flock(LOCK_EX) must wait until the outer lock is released. The outer critical section cannot finish until clear_module_state_transition() returns, so the inner call never completes → self-deadlock, and the caller hangs indefinitely.
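The non-reentrant behaviour can be checked in isolation. The following is a minimal sketch, not the code from module_base.py: it assumes Linux and a local filesystem, uses a temporary file in place of /var/lock/<module>_transition.lock, and the function name second_flock_blocks is hypothetical. It adds LOCK_NB to the inner call so that the attempt fails immediately instead of hanging.

```python
import fcntl
import os
import tempfile


def second_flock_blocks(lock_path):
    """Return True if a nested open() + flock(LOCK_EX) in the same process would block."""
    with open(lock_path, "w") as outer:
        # Outer critical section: take the exclusive lock.
        fcntl.flock(outer, fcntl.LOCK_EX)
        # A second open() yields a new open file description for the same file.
        with open(lock_path, "w") as inner:
            try:
                # LOCK_NB turns "would wait forever" into an immediate error.
                fcntl.flock(inner, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                return True
            # Only reached if the nested lock was granted.
            fcntl.flock(inner, fcntl.LOCK_UN)
            return False


# Temporary stand-in for the real lock file (assumption for this sketch).
fd, path = tempfile.mkstemp(suffix="_transition.lock")
os.close(fd)
would_block = second_flock_blocks(path)
os.remove(path)
print(would_block)
```

Without LOCK_NB, the inner flock() call would wait on a lock that only the waiting thread itself can release, which is the hang described above.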

A common user-visible trigger is the smartswitch CLI path that checks the transition flag before applying config:

  • config chassis modules startup DPU0 → config/chassis_modules.py → ModuleHelper.get_module_state_transition() → platform ModuleBase.get_module_state_transition() (see utilities_common/module.py).

When STATE_DB still has transition_in_progress set with an expired timeout, the timeout branch runs, hits the nested lock, and the command can block forever instead of returning or clearing the flag.

Steps to Reproduce

  1. Use a smartswitch platform where config chassis modules startup <DPU> calls ModuleHelper.get_module_state_transition() (see config/chassis_modules.py, startup command).
  2. In STATE_DB, set CHASSIS_MODULE_TABLE|<DPU>|transition_in_progress to "True" with valid transition_start_time and transition_type such that now − start_time is greater than the timeout for that type (_load_transition_timeouts() / platform.json DPU transition timeouts).
  3. Run: config chassis modules startup DPU0 (adjust name to match your module).
  4. Observe the process hang while executing get_module_state_transition() → timeout path → clear_module_state_transition() (nested lock).

Actual Behavior and Expected Behavior

| | Actual | Expected |
|---|---|---|
| CLI / API | config chassis modules startup DPU0 (and any caller of get_module_state_transition in the stale-timeout case) can block forever. | The check completes: stale transition fields are cleared (or the method returns) and the CLI returns promptly. |
| Locking | Same thread holds the transition lock, then tries to take it again inside clear_module_state_transition() → deadlock. | Timeout handling deletes the transition hash fields under the existing lock, without re-entering _transition_operation_lock() (e.g. inline hdel in get_module_state_transition, or reuse a private helper that assumes the lock is already held). |
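One possible shape for the "private helper that assumes the lock is already held" option is sketched below. This is an illustration under stated assumptions, not the actual ModuleBase code: the class ModuleSketch, the helper _clear_transition_locked, the single timeout_s value, the float-seconds start time, and the boolean return values are all hypothetical, and a plain dict stands in for the STATE_DB hash.

```python
import contextlib
import fcntl
import os
import tempfile
import time


class ModuleSketch:
    """Hypothetical stand-in for ModuleBase; self.state replaces the STATE_DB hash."""

    TRANSITION_FIELDS = (
        "transition_in_progress",
        "transition_type",
        "transition_start_time",
    )

    def __init__(self, lock_path, timeout_s):
        self._lock_path = lock_path
        self._timeout_s = timeout_s  # assumed: one timeout for every transition type
        self.state = {}

    @contextlib.contextmanager
    def _transition_operation_lock(self):
        # Exclusive, non-reentrant file lock, as described in the report.
        with open(self._lock_path, "w") as lock_file:
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                yield
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)

    def _clear_transition_locked(self):
        # Caller must already hold the transition lock; never takes it here.
        for field in self.TRANSITION_FIELDS:
            self.state.pop(field, None)

    def clear_module_state_transition(self):
        # Public entry point: takes the lock once, then delegates.
        with self._transition_operation_lock():
            self._clear_transition_locked()
        return True

    def get_module_state_transition(self):
        with self._transition_operation_lock():
            if self.state.get("transition_in_progress") != "True":
                return False
            age = time.time() - float(self.state["transition_start_time"])
            if age > self._timeout_s:
                # Stale flag: clear under the lock already held, no nested lock.
                self._clear_transition_locked()
                return False
            return True


# Usage: a flag that started 60 s ago with a 1 s timeout is treated as stale.
fd, path = tempfile.mkstemp(suffix="_transition.lock")
os.close(fd)
module = ModuleSketch(path, timeout_s=1.0)
module.state = {
    "transition_in_progress": "True",
    "transition_type": "startup",
    "transition_start_time": str(time.time() - 60),
}
stale_result = module.get_module_state_transition()
os.remove(path)
```

Keeping the lock acquisition only in the public methods means each call path takes the file lock exactly once, so the timeout branch can clear the fields without waiting on itself.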

Relevant log output

Output of show version, show techsupport

202511_RC latest hash

Attach files (if any)

No response
