[Mellanox] feed module info to hw-management#24957

Merged
liat-grozovik merged 9 commits into sonic-net:master from jianyuewu:master_innolight_thermal_algo on Jan 13, 2026
@jianyuewu
Contributor

jianyuewu commented Dec 31, 2025

Dependency

For 202511 branch, there is still one dependency:

  1. The cherry-pick from master currently hits a merge conflict, which will resolve automatically once this PR is merged: [Mellanox] Fix issue: sfp.get_temperature_info cannot detect SFP replacement #24688

HW-MGMT version 7.0050.2930 has been merged into the 202511 branch, so it is no longer an external dependency.

Why I did it

At 40°C ambient temperature with current FW+SW, some modules have >7.6% probability of reaching 75°C, which triggers false temperature warnings.
This PR implements vendor-specific temperature threshold support to eliminate false warnings while maintaining accurate temperature telemetry for monitoring purposes.

How I did it

Implemented new API for vendor-specific temperature offset adjustments:

  1. New API:
    • Add get_vendor_info() API with caching support.

  2. Smart Module Detection:
    • Cache vendor information (Manufacturer + Part Number) for each module.
    • Skip redundant vendor info updates when the same module is replugged.
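
As a rough illustration of the caching idea in the list above, here is a minimal Python sketch. It is not the code from this PR: the class name `VendorInfoCache`, the `read_eeprom` callback, and the `invalidate()` method are all assumptions made for the example. Only the `get_vendor_info()` name and the (manufacturer, part number) pair come from the description.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (manufacturer, part_number), as described in the PR text
VendorInfo = Tuple[str, str]


class VendorInfoCache:
    """Hypothetical per-module cache; names are illustrative, not from the PR."""

    def __init__(self, read_eeprom: Callable[[int], Optional[VendorInfo]]):
        # read_eeprom is an assumed callback that performs the slow module read
        self._read_eeprom = read_eeprom
        self._cache: Dict[int, VendorInfo] = {}

    def get_vendor_info(self, module_index: int) -> Optional[VendorInfo]:
        # Serve from the cache when possible to avoid a repeated slow read
        cached = self._cache.get(module_index)
        if cached is not None:
            return cached
        info = self._read_eeprom(module_index)
        if info is not None:
            self._cache[module_index] = info
        return info

    def invalidate(self, module_index: int) -> None:
        # Assumed hook for module removal, so the next call re-reads the module
        self._cache.pop(module_index, None)


# Usage with a fake reader that records how often it is called
reads: List[int] = []


def fake_read(module_index: int) -> Optional[VendorInfo]:
    reads.append(module_index)
    return ("NVIDIA", "MCP4Y10-N001")


cache = VendorInfoCache(fake_read)
first = cache.get_vendor_info(0)
second = cache.get_vendor_info(0)  # served from the cache, no second read
```

In this sketch the fake reader is called once for the two lookups; after `cache.invalidate(0)` the next lookup would read the module again.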

How to verify it

  1. Plug in optical module -> Verify vendor info sent to HW-MGMT.
  2. Unplug and replug same module -> Verify no redundant vendor info update.
  3. Replace with different module -> Verify new vendor info sent.

Which release branch to backport (provide reason below if selected)

  • 202412
  • 202511

Tested branch (Please provide the tested image version)

202412

A picture of a cute animal (not mandatory but encouraged)

    /\_/\  
   ( o.o ) 
    > ^ <
   /|   |\
  (_|   |_)
   Cool Cat~

On first detection or module replacement, if the serial number (SN) has changed,
call vendor_data_set_module() with the manufacturer (MFG) and part number (PN)
to send the vendor info to hw-management.

Sample output:
NOTICE pmon#thermalctld: Module 0 vendor info updated \
- manufacturer: NVIDIA part_number: MCP4Y10-N001

Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
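
The serial-number check described in the commit message above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the merged implementation: `ModuleVendorTracker`, `on_module_present()`, and the `send_vendor_info` callback are hypothetical names, and the callback only stands in for `vendor_data_set_module()`, whose real signature is not shown in this conversation. The log text mirrors the sample output.

```python
from typing import Callable, Dict, List, Tuple


class ModuleVendorTracker:
    """Hypothetical helper: push vendor info only when the module's SN changes."""

    def __init__(self,
                 send_vendor_info: Callable[[int, str, str], None],
                 log: Callable[[str], None] = print):
        # send_vendor_info stands in for vendor_data_set_module(); the
        # (index, manufacturer, part_number) argument order is an assumption
        self._send = send_vendor_info
        self._log = log
        self._last_sn: Dict[int, str] = {}

    def on_module_present(self, index: int, serial: str,
                          manufacturer: str, part_number: str) -> bool:
        # Same serial number as last time: same module replugged, nothing to do
        if self._last_sn.get(index) == serial:
            return False
        # First detection or a different module: send MFG and PN, remember SN
        self._send(index, manufacturer, part_number)
        self._last_sn[index] = serial
        self._log(f"Module {index} vendor info updated - "
                  f"manufacturer: {manufacturer} part_number: {part_number}")
        return True


# Usage: replay the three verification steps with recording fakes
sent: List[Tuple[int, str, str]] = []
messages: List[str] = []
tracker = ModuleVendorTracker(lambda i, m, p: sent.append((i, m, p)),
                              log=messages.append)

plugged = tracker.on_module_present(0, "SN-A", "NVIDIA", "MCP4Y10-N001")
replugged = tracker.on_module_present(0, "SN-A", "NVIDIA", "MCP4Y10-N001")
# "SN-B" and "EXAMPLE-PN" are made-up values for a different module
replaced = tracker.on_module_present(0, "SN-B", "NVIDIA", "EXAMPLE-PN")
```

The tracker deliberately keeps the last serial number across removal, so replugging the same module sends nothing, while a different module triggers a new update.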
@jianyuewu jianyuewu requested a review from lguohan as a code owner December 31, 2025 06:50
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jianyuewu
Contributor Author

++ cat /tmp/tmp.RaKi3fI9Wm
WARNING: Image format was not specified for './sonic-installer.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
kvm: -serial telnet:127.0.0.1:9000,server: info: QEMU waiting for connection on: disconnected:telnet:127.0.0.1:9000,server=on
[W][17:04:19.190765] pw.conf      | [          conf.c: 1182 try_load_conf()] can't load config client.conf: No such file or directory
[E][17:04:19.190780] pw.conf      | [          conf.c: 1215 pw_conf_load_conf_for_context()] can't load config client.conf: No such file or directory
+ on_exit
+ rm -f /tmp/tmp.RaKi3fI9Wm
[  FAIL LOG END  ] [ target/sonic-vs.img.gz ]
make: *** [slave.mk:1450: target/sonic-vs.img.gz] Error 1
make[1]: *** [Makefile.work:621: target/sonic-vs.img.gz] Error 2
make[1]: Leaving directory '/data/vss/_work/1/s'
make: *** [Makefile:51: target/sonic-vs.img.gz] Error 2

##[error]Bash exited with code '2'.
Finishing: Build sonic image

The failure does not seem related to this change; the VM log shows: can't load config client.conf: No such file or directory

@jianyuewu
Contributor Author

@r12f @prgeor Could you help review this PR for the Innolight feature? Thank you~ BTW, the 202511 branch also depends on #24688.

@liat-grozovik liat-grozovik merged commit 20b3670 into sonic-net:master Jan 13, 2026
13 checks passed
@r12f
Contributor

r12f commented Feb 1, 2026

Hi @jianyuewu, would you mind helping cherry-pick this to 202412?

@jianyuewu
Contributor Author

@r12f , it is already in 202412: Azure/sonic-buildimage-msft@732a61d 😊

FengPan-Frank pushed a commit to FengPan-Frank/sonic-buildimage that referenced this pull request Mar 6, 2026
Signed-off-by: Feng Pan <fenpan@microsoft.com>
@vmittal-msft
Contributor

@jianyuewu @liat-grozovik there is a cherry-pick conflict for 202511. Please raise a PR for 202511.

PriyanshTratiya pushed a commit to PriyanshTratiya/sonic-buildimage that referenced this pull request Mar 16, 2026
Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
@PriyanshTratiya

Hi @jianyuewu @liat-grozovik, could you help review this manual cherry-pick PR, #26212, as part of our effort to get it into 202511? From your description, the dependency PR #24688 has been merged, so this PR should be good to go. There was a conflict that was fixed manually, hence the need for your review. Thanks!

@jianyuewu
Contributor Author

@PriyanshTratiya It is very good, thanks indeed👍Review done😊

@jianyuewu jianyuewu deleted the master_innolight_thermal_algo branch March 17, 2026 02:14
@jianyuewu
Contributor Author

jianyuewu commented Mar 17, 2026

Ah, so sorry, I missed the previous comments about the cherry-pick PR to 202511. Thanks indeed for the help👍

vmittal-msft pushed a commit that referenced this pull request Mar 17, 2026
Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
Co-authored-by: Jianyue Wu <jianyuew@nvidia.com>
dprital pushed a commit that referenced this pull request Mar 19, 2026
Signed-off-by: dprital <drorp@nvidia.com>