Is it platform specific
generic
Importance or Severity
Medium
Description of the bug
Currently, sensormond and thermalctld delete all entries in the PHYSICAL_ENTITY_INFO table of STATE_DB when their class objects are garbage collected during shutdown. This creates race conditions where one daemon's data can be inadvertently deleted by another daemon.
Meanwhile, psud does not delete its entries from PHYSICAL_ENTITY_INFO at all during shutdown. This gap was masked by thermalctld/sensormond deleting all entries from the table.
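One way to picture the ownership problem is the minimal sketch below. It is not the daemons' actual code: `FakeTable` is a hypothetical in-memory stand-in for the PHYSICAL_ENTITY_INFO table, `OwnedEntityWriter` is a hypothetical helper, and the keys and field names are illustrative. It only shows the idea that each daemon could track the keys it wrote and delete just those on shutdown.

```python
# Hypothetical in-memory stand-in for the PHYSICAL_ENTITY_INFO table in STATE_DB.
class FakeTable:
    def __init__(self):
        self._rows = {}

    def set(self, key, fields):
        self._rows[key] = dict(fields)

    def delete(self, key):
        self._rows.pop(key, None)

    def keys(self):
        return list(self._rows)


class OwnedEntityWriter:
    """Hypothetical helper: remembers the keys one daemon wrote,
    so that cleanup removes only those keys and nothing else."""

    def __init__(self, table):
        self._table = table
        self._owned = set()

    def publish(self, key, fields):
        self._table.set(key, fields)
        self._owned.add(key)

    def cleanup(self):
        # Delete only the entries this writer created.
        for key in sorted(self._owned):
            self._table.delete(key)
        self._owned.clear()


table = FakeTable()
thermalctld = OwnedEntityWriter(table)  # stands in for thermalctld
psud = OwnedEntityWriter(table)         # stands in for psud

# Illustrative keys and fields, not taken from a real device.
thermalctld.publish("FAN 1", {"position_in_parent": "1"})
psud.publish("PSU 1", {"position_in_parent": "1"})

# Simulated thermalctld shutdown: the PSU entry must survive.
thermalctld.cleanup()
remaining = table.keys()
```

With this approach, `remaining` still contains the PSU entry after the simulated thermalctld shutdown, and psud removes its own entry only when it shuts down itself.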
Steps to Reproduce
- Modify PSU_INFO_UPDATE_PERIOD_SECS from 3 seconds to 100 seconds (this widens the window to show that the race condition can still occur even though PSU_INFO is updated periodically).
- Restart the thermalctld daemon.
- When thermalctld stops, it deletes all PHYSICAL_ENTITY_INFO entries (including PSU entries owned by psud).
- During the 100-second window before psud re-populates its data, SNMP queries will fail to retrieve PSU entity information.
You would see an error with the signature below:
ERROR: MIBUpdater.start() caught an unexpected exception during update_data()
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/ax_interface/mib.py", line 51, in start
self.reinit_data()
File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 347, in reinit_data
raise Exception(exceptions)
Exception: [ValueError("invalid literal for int() with base 10: ''")]
ERROR: invalid literal for int() with base 10: ''
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 332, in reinit_data
updater.reinit_data()
File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 703, in reinit_data
self._update_entity_cache(name)
File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 893, in _update_entity_cache
psu_position = int(psu_position)
^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
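The ValueError at the end of the traceback is what Python raises when int() receives an empty string, which is consistent with the PSU position field being empty after the entries were deleted. The sketch below reproduces that message and shows one tolerant way to parse such a field. `parse_position` is a hypothetical helper for illustration, not a function in the SNMP agent, and this is not a proposal for how the agent should be changed.

```python
def parse_position(raw):
    """Hypothetical helper: return raw as an int, or None when the
    value is missing, empty, or not numeric."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None


# Reproduce the message from the traceback with an empty string.
try:
    int("")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Here `error_message` holds the same text as the traceback, while `parse_position("")` returns None instead of raising.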
Actual Behavior and Expected Behavior
Expected Behavior:
- We should not see exceptions or errors in the syslog when we shut down any daemon.
- Shutting down one daemon shouldn't affect the data populated by other daemons.
Relevant log output
Output of show version, show techsupport
Attach files (if any)
No response