[Mellanox] Thermal algorithm module enhancement #15

Closed
jianyuewu wants to merge 1 commit into master from thermal_updater

@jianyuewu jianyuewu commented Aug 26, 2025

Enhance thermal updater algorithm for FW mode.

Why I did it

thermalctld reads sensor temperatures and thresholds in parallel with thermal_updater and hw-management-sync-service. To avoid this duplicated reading, we disable hw-management-sync-service and use thermal_updater to update the thermal-related hw-management files.

How I did it

  1. Make the thermal platform API read from /var/run/hw-management/ files directly.
  2. Modify SmartswitchThermalUpdater.start() to use the parent start() directly.
  3. Remove SmartswitchThermalUpdater.stop().
  4. Enhance ThermalUpdater:
    • FW mode: Read from SDK fs.
    • SW mode: Read from EEPROM.
  5. Disable the hw-mgmt sync service.
  6. Add a maximum attempt limit to reduce unnecessary retry time.
  7. Add a sysfs readiness check for the thermal updater.
  8. Remove the asics_init_done dependency from the platform ready check.
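
Items 6 and 7 can be illustrated with a minimal sketch. It is not the code from this PR: the helper names `read_int_with_retry` and `wait_until_ready`, their parameters, and the default attempt counts are all hypothetical, chosen only to show how a bounded retry and a readiness poll might look.

```python
import time
from pathlib import Path


def read_int_with_retry(path, max_attempts=3, delay_s=0.0, default=None):
    """Read an integer from a file, giving up after max_attempts tries.

    Hypothetical helper: the name, signature, and defaults are assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return int(Path(path).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not yet holding a number.
            if attempt + 1 < max_attempts and delay_s > 0:
                time.sleep(delay_s)
    return default


def wait_until_ready(paths, max_attempts=5, delay_s=0.0):
    """Return True once every path exists, or False after max_attempts polls.

    Hypothetical readiness check; the real updater may test other conditions.
    """
    for attempt in range(max_attempts):
        if all(Path(p).exists() for p in paths):
            return True
        if attempt + 1 < max_attempts and delay_s > 0:
            time.sleep(delay_s)
    return False
```

Capping the number of attempts bounds the worst-case wait at roughly `max_attempts * delay_s`, so a file that never appears cannot stall the caller indefinitely.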

How to verify it

Check that the hw-management/thermal files are updated on both a normal switch and a smart switch:

  1. Test SW mode thermal.
  2. Test FW mode thermal.
  3. Test reboot.
  4. Test config reload.
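
One way to check that the files are being updated is to look at their modification times. The sketch below assumes the updater rewrites each file periodically; the helper name `stale_files` and the age threshold are hypothetical and not part of this PR.

```python
import time
from pathlib import Path


def stale_files(directory, max_age_s, now=None):
    """Return names of regular files in directory not modified within max_age_s.

    Hypothetical verification helper; `now` can be injected for testing.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(Path(directory).iterdir()):
        # stat() follows symlinks, so linked files report the target's mtime.
        if entry.is_file() and now - entry.stat().st_mtime > max_age_s:
            stale.append(entry.name)
    return stale
```

For example, `stale_files("/var/run/hw-management/thermal", max_age_s=60)` would list files untouched for more than a minute. That directory path is an assumption combined from the locations named in this description, and an empty list after each of the four tests would suggest the updater is running.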

Which release branch to backport (provide reason below if selected)

  • 202412
  • 202505

Tested branch (Please provide the tested image version)

  • 202412
  • 202505

@jianyuewu jianyuewu changed the base branch from master to test_max_cable_len August 26, 2025 06:42
@jianyuewu jianyuewu changed the base branch from test_max_cable_len to master August 26, 2025 06:42
@jianyuewu jianyuewu force-pushed the thermal_updater branch 5 times, most recently from fb3cf2d to 0ac929d Compare September 4, 2025 03:29
@jianyuewu jianyuewu force-pushed the thermal_updater branch 2 times, most recently from 4d1dfa2 to 68ad0bd Compare September 9, 2025 07:37
@jianyuewu jianyuewu force-pushed the thermal_updater branch 7 times, most recently from a0ae54a to 208c727 Compare September 30, 2025 10:13
@jianyuewu jianyuewu force-pushed the thermal_updater branch 2 times, most recently from ef4f86b to bf5eddb Compare October 9, 2025 08:45
@jianyuewu jianyuewu force-pushed the thermal_updater branch 7 times, most recently from c631a5e to c80aaec Compare October 11, 2025 02:51
Background:
thermalctld reads sensor temperatures and thresholds in parallel with
thermal_updater and hw-management-sync-service.
Having all of these entities read module temperature over I2C must be avoided.
Reading the same information from several entities also increases CPU utilization.
In SW mode: Behavior is unchanged.
In FW mode: SONiC is also responsible for reading from the SDK and updating
the hw-management sysfs.

Changes are:
Read module temperature/threshold from the SDK sysfs.
Add a cache in FW mode for temperature info reads.
Disable the hw-mgmt sync service.
Make the thermal platform API read from /var/run/hw-management/ files directly.
Add a maximum attempt limit to reduce unnecessary retry time.
Add a sysfs readiness check for the thermal updater.
Remove the asics_init_done dependency from the platform ready check.
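
The FW-mode cache mentioned above could take the form of a small time-to-live cache, sketched here under stated assumptions. The class name `TtlCache`, its interface, and the expiry policy are hypothetical; the commit does not describe how its cache is keyed or invalidated.

```python
import time


class TtlCache:
    """Cache values per key and reload them once they are older than ttl_s.

    Hypothetical sketch; `clock` is injectable so expiry can be tested.
    """

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (timestamp, value)

    def get(self, key, loader):
        """Return the cached value for key, calling loader() if it expired."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]
        value = loader()
        self._entries[key] = (now, value)
        return value
```

With a cache like this, repeated temperature queries inside the TTL window are served from memory instead of re-reading the SDK sysfs, which is the kind of duplicate reading the background section aims to reduce.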
@jianyuewu jianyuewu closed this Oct 24, 2025
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…tically (sonic-net#660)

#### Why I did it
src/sonic-sairedis
```
* 058ed4c - (HEAD -> 202412, origin/HEAD, origin/202412) [code sync] Merge code from sonic-net/sonic-sairedis:202411 to 202412 (#15) (24 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…omatically (sonic-net#684)

#### Why I did it
src/sonic-swss-common
```
* cb7c9d7 - (HEAD -> 202412, origin/HEAD, origin/202412) [code sync] Merge code from sonic-net/sonic-swss-common:202411 to 202412 (#15) (21 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…tomatically (sonic-net#683)

#### Why I did it
src/sonic-linux-kernel
```
* 88b7f08 - (HEAD -> 202412, origin/HEAD, origin/202412) [optoe] Reset page select byte to 0 before upper memory access on page 0h (sonic-net#464) (#15) (21 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…test HEAD automatically (sonic-net#1146)

#### Why I did it
src/sonic-platform-daemons
```
* 72c1f36 - (HEAD -> 202412, origin/202412) [xcvrd] do not wait state change while calling cmis.set_lpmode (#15) (21 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…tomatically (sonic-net#1498)

#### Why I did it
src/sonic-gnmi
```
* 3679372 - (HEAD -> 202412, origin/202412) Add SHOW implementation for interface transceiver error-status. (#18) (4 hours ago) [mssonicbld]
* 45d679a - Add show watermark telemetry interval implementation (#16) (19 hours ago) [mssonicbld]
* 57d0b6f - Simplify option support for all SHOW paths (#15) (23 hours ago) [mssonicbld]
* 7dd2615 - Add support for show int error (#14) (24 hours ago) [mssonicbld]
* d8e0216 - Add SHOW implementation for interface counters (#11) (25 hours ago) [mssonicbld]
* 6c56f41 - [202412] Manual cherrypick for adding support for RATES tables in Counters DB so that PRE_FEC/POST_FEC_BER via ST (#13) (26 hours ago) [Zain Budhwani]
```
#### How I did it
#### How to verify it
#### Description for the changelog