[Mellanox] Thermal algo changes for optical module #16

Closed
jianyuewu wants to merge 5 commits into thermal_updater from thermal_algo_vendor

@jianyuewu jianyuewu commented Oct 22, 2025

Why I did it

At 40°C ambient temperature with the current FW and SW, some modules have a greater than 7.6% probability of reaching 75°C, which triggers false temperature warnings.
This PR adds vendor-specific temperature threshold support to eliminate the false warnings while keeping temperature telemetry accurate for monitoring.

How I did it

Implemented a new API for vendor-specific temperature offset adjustments:

  1. New API:
    • Add get_vendor_info() API with caching support.
  2. Smart Module Detection:
    • Cache vendor information (Manufacturer + Part Number) for each module.
    • Skip redundant vendor info updates when the same module is replugged.
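
The caching idea above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `VendorInfoCache` class, the `read_eeprom` callable, and the dictionary keys are hypothetical, and only the name `get_vendor_info()` comes from the description.

```python
from typing import Callable, Dict, Tuple

VendorInfo = Tuple[str, str]  # (manufacturer, part_number)


class VendorInfoCache:
    """Hypothetical per-module cache of vendor info (assumed design)."""

    def __init__(self, read_eeprom: Callable[[int], Dict[str, str]]):
        # read_eeprom is an assumed callable that returns the module's
        # manufacturer and part number; the real source may differ.
        self._read_eeprom = read_eeprom
        self._cache: Dict[int, VendorInfo] = {}

    def get_vendor_info(self, module_index: int) -> VendorInfo:
        # Only touch the (slow) EEPROM read on a cache miss.
        if module_index not in self._cache:
            data = self._read_eeprom(module_index)
            self._cache[module_index] = (data["manufacturer"], data["part_number"])
        return self._cache[module_index]

    def invalidate(self, module_index: int) -> None:
        # Drop the entry, for example when the module is unplugged.
        self._cache.pop(module_index, None)
```

With a cache like this, repeated lookups for the same module cost one underlying read until the entry is invalidated.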

How to verify it

  1. Plug in optical module -> Verify vendor info sent to HW-MGMT.
  2. Unplug and replug same module -> Verify no redundant vendor info update.
  3. Replace with different module -> Verify new vendor info sent.

Which release branch to backport (provide reason below if selected)

  • 202412
  • 202505

Tested branch (Please provide the tested image version)

TBD, will test 202412.

A picture of a cute animal (not mandatory but encouraged)

    /\_/\  
   ( o.o ) 
    > ^ <
   /|   |\
  (_|   |_)
   Cool Cat~

@jianyuewu jianyuewu force-pushed the thermal_algo_vendor branch 5 times, most recently from 30c5d40 to 136cf8c Compare October 23, 2025 10:10
@jianyuewu jianyuewu marked this pull request as draft October 24, 2025 03:42
@jianyuewu jianyuewu force-pushed the thermal_algo_vendor branch from 3ed36fe to 563098c Compare October 27, 2025 06:54
@jianyuewu jianyuewu force-pushed the thermal_updater branch 2 times, most recently from 77d57e2 to 9437107 Compare October 29, 2025 08:23
@jianyuewu jianyuewu force-pushed the thermal_updater branch 3 times, most recently from 3cd39f3 to a5095b5 Compare November 11, 2025 07:30
@jianyuewu jianyuewu force-pushed the thermal_algo_vendor branch 2 times, most recently from ef75be8 to 65968b9 Compare November 14, 2025 12:19
@jianyuewu jianyuewu force-pushed the thermal_updater branch 2 times, most recently from b2a2401 to 6668f35 Compare November 17, 2025 07:56
@jianyuewu jianyuewu changed the title Thermal algo changes for optical module [Mellanox] Thermal algo changes for optical module Nov 19, 2025
@jianyuewu jianyuewu force-pushed the thermal_algo_vendor branch from 840b3b5 to 9e5ff9a Compare December 1, 2025 08:39
@jianyuewu jianyuewu force-pushed the thermal_algo_vendor branch from 9e5ff9a to 280ce2d Compare December 1, 2025 09:31
jianyuewu and others added 5 commits December 2, 2025 14:28
Background:
In parallel with the thermal updater and the hw-management sync service,
thermalctld also reads the sensor temperatures and sensor thresholds.
All of these entities read module temperature over I2C, which needs to be avoided.
Having different entities read the same information also increases CPU utilization.
In SW mode: Keep the behavior as it is.
In FW mode: SONiC is also responsible for reading from the SDK and updating
the hw-management sysfs.

Changes are:
Read module temperature/threshold from the SDK sysfs.
Add a cache in FW mode when getting temperature info.
Disable the hw-mgmt sync service.
Thermal platform API reads from /var/run/hw-management/ files directly.
Add a maximum attempt limit to reduce unnecessary retry time.
Add a sysfs readiness check for the thermal updater.
Remove the asics_init_done dependency from the platform ready check.
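
A bounded retry when reading a value from a file could look like the sketch below. It is an assumption-laden illustration rather than the commit's code: the function name `read_int_with_retry`, its parameters, and the defaults are hypothetical, and the file is treated as holding a single integer.

```python
import time
from pathlib import Path
from typing import Optional


def read_int_with_retry(path: str, max_attempts: int = 3,
                        delay_s: float = 0.1) -> Optional[int]:
    """Read an integer from a file, giving up after max_attempts tries.

    Hypothetical helper: name, defaults, and return convention are assumed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return int(Path(path).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not yet holding a valid number.
            if attempt < max_attempts:
                time.sleep(delay_s)
    # The caller decides how to treat "no reading available".
    return None
```

Capping the attempts keeps a missing or not-yet-ready file from stalling the caller for an unbounded time.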

Signed-off-by: Jianyue Wu <[email protected]>
On first detection or module replacement, if the serial number (SN) has changed,
call vendor_data_set_module() with the manufacturer (MFG) and part number (PN)
to send the vendor info to hw-management.

Sample output:
NOTICE pmon#thermalctld: Module 0 vendor info updated \
- manufacturer: NVIDIA part_number: MCP4Y10-N001
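
The serial-number check described in this commit message can be sketched as below. Only the name vendor_data_set_module() comes from the message; its signature here is assumed, and the `VendorInfoNotifier` class and `on_module_present` method are hypothetical.

```python
from typing import Callable, Dict


class VendorInfoNotifier:
    """Hypothetical helper: send vendor info only when the SN changes."""

    def __init__(self, vendor_data_set_module: Callable[[int, str, str], None]):
        # Assumed signature: (module_index, manufacturer, part_number).
        self._set_module = vendor_data_set_module
        self._last_sn: Dict[int, str] = {}

    def on_module_present(self, index: int, serial_number: str,
                          manufacturer: str, part_number: str) -> bool:
        # Same SN as last time means the same module was replugged: skip.
        if self._last_sn.get(index) == serial_number:
            return False
        # First detection or a different module: push the vendor info.
        self._set_module(index, manufacturer, part_number)
        self._last_sn[index] = serial_number
        return True
```

The method returns True only when an update was actually sent, which makes the "no redundant update on replug" case easy to check in a test.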

Signed-off-by: Jianyue Wu <[email protected]>
Register clean_thermal_data() with atexit in start() instead of calling
it directly. This moves thermal data cleanup from initialization to
termination, ensuring proper cleanup when thermalctld exits.
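
A minimal sketch of the registration pattern, assuming placeholder bodies: only the names clean_thermal_data() and start() come from the message, and the `events` list exists purely so the behavior can be observed.

```python
import atexit

events = []  # Illustration only: records when cleanup actually runs.


def clean_thermal_data() -> None:
    # Placeholder body; the real cleanup logic is not shown in this PR text.
    events.append("thermal data cleaned")


def start() -> None:
    # Register the cleanup to run at interpreter exit instead of
    # calling it immediately during initialization.
    atexit.register(clean_thermal_data)
```

Note that atexit handlers run on normal interpreter termination; they are not called if the process is killed by a signal that Python does not handle or exits through os._exit().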

Signed-off-by: Jianyue Wu <[email protected]>
@jianyuewu jianyuewu force-pushed the thermal_algo_vendor branch from 280ce2d to 6d00a71 Compare December 2, 2025 07:49
@jianyuewu jianyuewu closed this Dec 3, 2025
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…tically (sonic-net#673)

#### Why I did it
src/sonic-sairedis
```
* f727bb5 - (HEAD -> 202412, origin/HEAD, origin/202412) [code sync] Merge code from sonic-net/sonic-sairedis:202411 to 202412 (#16) (55 minutes ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…omatically (sonic-net#690)

#### Why I did it
src/sonic-swss-common
```
* e787abe - (HEAD -> 202412, origin/HEAD, origin/202412) [code sync] Merge code from sonic-net/sonic-swss-common:202411 to 202412 (#16) (21 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…tomatically (sonic-net#689)

#### Why I did it
src/sonic-linux-kernel
```
* 771ce48 - (HEAD -> 202412, origin/HEAD, origin/202412) [optoe] Reset page select byte to 0 before upper memory access on page 0h (sonic-net#464) (#16) (21 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…test HEAD automatically (sonic-net#1148)

#### Why I did it
src/sonic-platform-daemons
```
* 5016ded - (HEAD -> 202412, origin/202412) [xcvrd] Optimize module initialization performance (sonic-net#611) (#16) (11 minutes ago) [Junchao-Mellanox]
```
#### How I did it
#### How to verify it
#### Description for the changelog
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…tomatically (sonic-net#1498)

#### Why I did it
src/sonic-gnmi
```
* 3679372 - (HEAD -> 202412, origin/202412) Add SHOW implementation for interface transceiver error-status. (#18) (4 hours ago) [mssonicbld]
* 45d679a - Add show watermark telemetry interval implementation (#16) (19 hours ago) [mssonicbld]
* 57d0b6f - Simplify option support for all SHOW paths (#15) (23 hours ago) [mssonicbld]
* 7dd2615 - Add support for show int error (#14) (24 hours ago) [mssonicbld]
* d8e0216 - Add SHOW implementation for interface counters (#11) (25 hours ago) [mssonicbld]
* 6c56f41 - [202412] Manual cherrypick for adding support for RATES tables in Counters DB so that PRE_FEC/POST_FEC_BER via ST (#13) (26 hours ago) [Zain Budhwani]
```
#### How I did it
#### How to verify it
#### Description for the changelog
jianyuewu pushed a commit that referenced this pull request Jan 15, 2026
… sensor errors (sonic-net#24783)

- Why I did it
Fix transient errors during bfb install on smartswitch platform.

ERR pmon#sensord: Error getting sensor data: mp2975/#16: Kernel interface error

- How I did it
Use pre-shutdown procedures before doing a reboot

- How to verify it
Installation of bfb image on dpu from switch shouldn't cause errors

Signed-off-by: Hemanth Kumar Tirupati <[email protected]>
jianyuewu pushed a commit that referenced this pull request Feb 3, 2026
… sensor errors (sonic-net#25276)

<!--
 Please make sure you've read and understood our contributing guidelines:
 https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

 ** Make sure all your commits include a signature generated with `git commit -s` **

 If this is a bug fix, make sure your description includes "fixes #xxxx", or
 "closes #xxxx" or "resolves #xxxx"

 Please provide the following information:
-->

#### Why I did it
Fix transient errors during bfb install on smartswitch platform.

```
ERR pmon#sensord: Error getting sensor data: mp2975/#16: Kernel interface error
```
##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
Use pre-shutdown procedures before doing a reboot

#### How to verify it
Installation of bfb image on dpu from switch shouldn't cause errors
<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 202205
- [ ] 202211
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)