Bug: Race condition in PHYSICAL_ENTITY_INFO table cleanup during daemon shutdown #25652

@ymd-arista

Is it platform specific

generic

Importance or Severity

Medium

Description of the bug

Currently, sensormond and thermalctld delete all entries in the PHYSICAL_ENTITY_INFO table of STATE_DB, not only the entries they own, when their class objects are garbage collected during shutdown. This creates a race condition in which one daemon's data can be inadvertently deleted by another daemon.

Meanwhile, psud does not delete its entries from PHYSICAL_ENTITY_INFO at all during shutdown. This gap was masked by thermalctld and sensormond deleting all entries from the table.
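One possible way to scope cleanup to a daemon's own entries is to record every key the daemon writes and delete only those keys on shutdown. The sketch below is a minimal illustration under stated assumptions, not the actual daemon code: `FakeTable` is an in-memory stand-in for the STATE_DB table, and `EntityOwner`, the key names, and the `position_in_parent` field are hypothetical names chosen for the example.

```python
class FakeTable:
    """In-memory stand-in for a STATE_DB table (hypothetical, illustration only)."""

    def __init__(self):
        self.rows = {}

    def set(self, key, fields):
        self.rows[key] = dict(fields)

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self.rows.pop(key, None)

    def keys(self):
        return list(self.rows)


class EntityOwner:
    """Remembers the keys one daemon wrote so cleanup touches only those keys."""

    def __init__(self, table):
        self.table = table
        self.owned = set()

    def publish(self, key, fields):
        self.table.set(key, fields)
        self.owned.add(key)

    def cleanup(self):
        # Delete only the entries this owner created, never the whole table.
        for key in self.owned:
            self.table.delete(key)
        self.owned.clear()


table = FakeTable()
thermal = EntityOwner(table)  # plays the role of thermalctld
psu = EntityOwner(table)      # plays the role of psud

# Field name is an assumption made for this example.
thermal.publish("FAN 1", {"position_in_parent": "1"})
psu.publish("PSU 1", {"position_in_parent": "1"})

thermal.cleanup()
remaining = table.keys()  # the PSU entry is still present
```

With this shape, psud would also call `cleanup()` on its own owner object at shutdown, so it no longer relies on another daemon to remove its entries.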

Steps to Reproduce

  1. Change PSU_INFO_UPDATE_PERIOD_SECS from 3 seconds to 100 seconds. This widens the window and shows that the race condition can still occur even though psud updates PSU_INFO periodically.
  2. Restart the thermalctld daemon.
  3. When thermalctld stops, it deletes all PHYSICAL_ENTITY_INFO entries, including the PSU entries owned by psud.
  4. During the 100-second window before psud repopulates its data, SNMP queries fail to retrieve PSU entity information.
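The failure in steps 3 and 4 can be modelled in memory. This is a rough sketch that assumes a missing entry is read back as an empty string, which is what the error message further down suggests. The table layout, the `read_position` helper, and the `position_in_parent` field name are hypothetical.

```python
def read_position(table, key):
    # Assumption: a missing row or field is returned as an empty string.
    return table.get(key, {}).get("position_in_parent", "")


# Shared table holding entries written by two different daemons.
table = {
    "PSU 1": {"position_in_parent": "1"},  # written by psud
    "FAN 1": {"position_in_parent": "2"},  # written by thermalctld
}

before = int(read_position(table, "PSU 1"))  # works while the entry exists

# Models thermalctld deleting every entry in the table at shutdown.
table.clear()

try:
    int(read_position(table, "PSU 1"))
    error = None
except ValueError as exc:
    # Same message as the ValueError in the reported traceback.
    error = str(exc)
```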

You should see an error with the following signature:

ERROR: MIBUpdater.start() caught an unexpected exception during update_data()
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/ax_interface/mib.py", line 51, in start
    self.reinit_data()
  File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 347, in reinit_data
    raise Exception(exceptions)
Exception: [ValueError("invalid literal for int() with base 10: ''")]
ERROR: invalid literal for int() with base 10: ''
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 332, in reinit_data
    updater.reinit_data()
  File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 703, in reinit_data
    self._update_entity_cache(name)
  File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 893, in _update_entity_cache
    psu_position = int(psu_position)
    ^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
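Independent of how the daemons clean up their entries, the reader side of this traceback could tolerate an empty value instead of raising. The helper below is only a sketch of that idea. `parse_position` is a hypothetical name and is not part of sonic_ax_impl, and what the caller should do on `None` (skip the entity, retry later) is left open.

```python
def parse_position(raw):
    """Return the position as an int, or None when the value is empty or not numeric."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        # '' (entry missing) and None both land here instead of propagating.
        return None


present = parse_position("2")
missing = parse_position("")
absent = parse_position(None)
```

This would only hide the symptom in the SNMP agent. The PSU entries would still be missing from PHYSICAL_ENTITY_INFO until psud repopulates them.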

Actual Behavior and Expected Behavior

Expected Behavior:

  1. No exceptions or errors should appear in the syslog when any daemon is shut down.
  2. Shutting down one daemon should not affect the data populated by other daemons.

