Description
Is it platform specific
generic
Importance or Severity
Critical
Description of the bug
get_module_state_transition() in module_base.py enters with self._transition_operation_lock() (an exclusive fcntl.flock on /var/lock/<module>_transition.lock). When the transition flag is stale and past the configured timeout, it then calls clear_module_state_transition(), which takes the same lock again in a nested with self._transition_operation_lock().
The lock is not reentrant. The thread already holds the exclusive lock from the outer with, and the inner open() creates a new open file description, so the inner flock(LOCK_EX) must wait until the outer lock is released. The outer critical section cannot finish until clear_module_state_transition() returns, so the inner call never completes. The result is a self-deadlock: the caller hangs indefinitely.
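The conflict can be shown outside SONiC with a minimal sketch. It assumes the lock helper opens the lock file on every entry, as described above; transition_lock and nested_acquire_would_block are hypothetical names, not the module_base.py code. The inner acquisition uses LOCK_NB so the probe reports the conflict instead of hanging.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def transition_lock(path, blocking=True):
    """Hypothetical stand-in: open the lock file and take an exclusive flock."""
    with open(path, "w") as lock_file:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(lock_file, flags)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def nested_acquire_would_block(path):
    """Return True if re-acquiring the lock in the same thread would block."""
    with transition_lock(path):
        try:
            # Second open() yields a new open file description, so this flock
            # conflicts with the one held by the outer context manager.
            with transition_lock(path, blocking=False):
                return False
        except BlockingIOError:
            return True


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "DPU0_transition.lock")
    print("nested acquire would block:", nested_acquire_would_block(lock_path))
```

With blocking=True on the inner call, which is what the nested with effectively does, the same conflict turns into the indefinite hang described above.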
A common user-visible trigger is the smartswitch CLI path that checks the transition flag before applying config:
config chassis modules startup DPU0 → config/chassis_modules.py → ModuleHelper.get_module_state_transition() → platform ModuleBase.get_module_state_transition() (see utilities_common/module.py).
When STATE_DB still has transition_in_progress set with an expired timeout, the timeout branch runs, hits the nested lock, and the command can block forever instead of returning or clearing the flag.
Steps to Reproduce
- Use a smartswitch platform where config chassis modules startup <DPU> calls ModuleHelper.get_module_state_transition() (see config/chassis_modules.py, startup command).
- In STATE_DB, set CHASSIS_MODULE_TABLE|<DPU>|transition_in_progress to "True" with valid transition_start_time and transition_type such that now − start_time is greater than the timeout for that type (_load_transition_timeouts() / platform.json DPU transition timeouts).
- Run: config chassis modules startup DPU0 (adjust name to match your module).
- Observe the process hang while executing get_module_state_transition() → timeout path → clear_module_state_transition() (nested lock).
Actual Behavior and Expected Behavior
| | Actual | Expected |
|---|---|---|
| CLI / API | config chassis modules startup DPU0 (and any caller of get_module_state_transition in the stale-timeout case) can block forever. | The check completes: stale transition fields are cleared (or the method returns) and the CLI returns promptly. |
| Locking | Same thread holds the transition lock, then tries to take it again inside clear_module_state_transition() → deadlock. | Timeout handling deletes the transition hash fields under the existing lock, without re-entering _transition_operation_lock() (e.g. inline hdel in get_module_state_transition, or reuse a private helper that assumes the lock is already held). |
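One possible shape for the expected locking behavior is sketched below, assuming the private-helper variant. Everything here is illustrative: ModuleTransitionSketch, _clear_transition_locked, the in-memory dict standing in for the STATE_DB hash, the threading.Lock standing in for the flock-based lock, the epoch-seconds start time, and the boolean return value are all assumptions, not the actual module_base.py implementation.

```python
import threading
import time
from contextlib import contextmanager


class ModuleTransitionSketch:
    """Hypothetical stand-in for the platform module class."""

    TRANSITION_FIELDS = (
        "transition_in_progress",
        "transition_type",
        "transition_start_time",
    )

    def __init__(self, timeout_seconds):
        self._lock = threading.Lock()  # non-reentrant, like the flock-based lock
        self._state = {}               # stand-in for the STATE_DB hash
        self._timeout = timeout_seconds

    @contextmanager
    def _transition_operation_lock(self):
        with self._lock:
            yield

    def _clear_transition_locked(self):
        """Delete the transition fields. The caller must already hold the lock."""
        for field in self.TRANSITION_FIELDS:
            self._state.pop(field, None)

    def clear_module_state_transition(self):
        # Public entry point: takes the lock once, then delegates.
        with self._transition_operation_lock():
            self._clear_transition_locked()

    def get_module_state_transition(self, now=None):
        now = time.time() if now is None else now
        with self._transition_operation_lock():
            if self._state.get("transition_in_progress") != "True":
                return False
            # Assumption: start time is stored as epoch seconds.
            start = float(self._state.get("transition_start_time", now))
            if now - start > self._timeout:
                # Stale flag: clear under the lock already held, no re-entry.
                self._clear_transition_locked()
                return False
            return True
```

The public clear_module_state_transition() keeps its locking contract for external callers, while the timeout branch reuses the lock it already owns, so the stale-flag path returns instead of waiting on itself.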
Relevant log output
Output of show version, show techsupport
202511_RC latest hash
Attach files (if any)
No response