[Mellanox] feed module info to hw-management #24957
liat-grozovik merged 9 commits into sonic-net:master
On first detection or module replacement, if the serial number (SN) has changed, call vendor_data_set_module() with the manufacturer (MFG) and part number (PN) to send the vendor info to hw-management.
Sample output:
NOTICE pmon#thermalctld: Module 0 vendor info updated - manufacturer: NVIDIA part_number: MCP4Y10-N001
Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
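The commit message describes the trigger in prose; a minimal sketch of that logic is shown below. Only vendor_data_set_module() is named in the source, and its real signature is not shown there, so it is injected here as a plain callable taking (module index, manufacturer, part number). The `ModuleEeprom` and `VendorInfoFeeder` names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ModuleEeprom:
    """Hypothetical container for the EEPROM fields used in this sketch."""
    serial_number: str
    manufacturer: str
    part_number: str


class VendorInfoFeeder:
    """Pushes vendor info only when a slot's serial number changes."""

    def __init__(self, set_module: Callable[[int, str, str], None]):
        # set_module stands in for vendor_data_set_module(); the argument
        # order (index, manufacturer, part_number) is an assumption.
        self._set_module = set_module
        self._last_sn: Dict[int, str] = {}

    def on_module_present(self, index: int, eeprom: ModuleEeprom) -> bool:
        """Return True if vendor info was sent for this slot."""
        if self._last_sn.get(index) == eeprom.serial_number:
            # Same physical module as last time: nothing to send.
            return False
        self._set_module(index, eeprom.manufacturer, eeprom.part_number)
        self._last_sn[index] = eeprom.serial_number
        return True
```

The last seen SN is kept across removal, so replugging the same module does not trigger a second call, while a module with a different SN in the same slot does.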
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
++ cat /tmp/tmp.RaKi3fI9Wm
WARNING: Image format was not specified for './sonic-installer.img' and probing guessed raw.
Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
Specify the 'raw' format explicitly to remove the restrictions.
kvm: -serial telnet:127.0.0.1:9000,server: info: QEMU waiting for connection on: disconnected:telnet:127.0.0.1:9000,server=on
[W][17:04:19.190765] pw.conf | [ conf.c: 1182 try_load_conf()] can't load config client.conf: No such file or directory
[E][17:04:19.190780] pw.conf | [ conf.c: 1215 pw_conf_load_conf_for_context()] can't load config client.conf: No such file or directory
+ on_exit
+ rm -f /tmp/tmp.RaKi3fI9Wm
[ FAIL LOG END ] [ target/sonic-vs.img.gz ]
make: *** [slave.mk:1450: target/sonic-vs.img.gz] Error 1
make[1]: *** [Makefile.work:621: target/sonic-vs.img.gz] Error 2
make[1]: Leaving directory '/data/vss/_work/1/s'
make: *** [Makefile:51: target/sonic-vs.img.gz] Error 2
##[error]Bash exited with code '2'.
Finishing: Build sonic image
The failure seems unrelated to this change; the VM log shows "client.conf: No such file or directory". |
|
Hi @jianyuewu, would you mind helping to cherry-pick this to 202412? |
|
@r12f, it is already in 202412: Azure/sonic-buildimage-msft@732a61d 😊 |
- Why I did it
At 40°C ambient temperature with current FW+SW, some modules have >7.6% probability of reaching 75°C, which triggers false temperature warnings. This PR implements vendor-specific temperature threshold support to eliminate false warnings while maintaining accurate temperature telemetry for monitoring purposes.
- How I did it
Implemented new API for vendor-specific temperature offset adjustments:
New API: Add get_vendor_info() API with caching support.
Smart Module Detection: Cache vendor information (Manufacturer + Part Number) for each module. Skip redundant vendor info updates when the same module is replugged.
- How to verify it
Plug in optical module -> Verify vendor info sent to Nvidia API.
Unplug and replug same module -> Verify no redundant vendor info update.
Replace with different module -> Verify new vendor info sent.
Signed-off-by: Feng Pan <fenpan@microsoft.com>
|
@jianyuewu @liat-grozovik there is a cherry-pick conflict for 202511. Please raise a PR for 202511. |
(Same commit message as above.) Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
|
Hi @jianyuewu @liat-grozovik, can you help review this manual cherry-pick PR, #26212, as part of our effort to get it into 202511? From your description, the dependency PR #24688 has been merged, so this PR should be good to go. There was a conflict that was fixed manually, hence the need for your review. Thanks! |
@PriyanshTratiya It looks very good, thanks indeed 👍 Review done 😊 |
|
Ah, so sorry, I missed the previous comments about the cherry-pick PR to 202511. Thanks indeed for the help 👍 |
(Same commit message as above.) Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com> Co-authored-by: Jianyue Wu <jianyuew@nvidia.com>
(Same commit message as above.) Signed-off-by: dprital <drorp@nvidia.com>
Dependency
For the 202511 branch, the one remaining dependency was HW-MGMT version 7.0050.2930.
It has been merged into the 202511 branch, so it is no longer an external dependency.
Why I did it
At 40°C ambient temperature with current FW+SW, some modules have >7.6% probability of reaching 75°C, which triggers false temperature warnings.
This PR implements vendor-specific temperature threshold support to eliminate false warnings while maintaining accurate temperature telemetry for monitoring purposes.
How I did it
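Implemented a new API for vendor-specific temperature offset adjustments:
New API: Add get_vendor_info() API with caching support.
Smart Module Detection: Cache vendor information (Manufacturer + Part Number) for each module. Skip redundant vendor info updates when the same module is replugged.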
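How to verify it
Plug in optical module -> Verify vendor info sent to Nvidia API.
Unplug and replug same module -> Verify no redundant vendor info update.
Replace with different module -> Verify new vendor info sent.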
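The caching behaviour this PR describes (a cached get_vendor_info() and no redundant update when the same module is replugged) could look roughly like the sketch below. This is not the code from the PR: apart from the get_vendor_info() name, everything here (`Module`, `VendorInfoPublisher`, `invalidate`, the `read_eeprom` and `send` callables) is a hypothetical stand-in, and `send` only represents whatever call actually delivers the data to the Nvidia API.

```python
from typing import Callable, Dict, Optional, Tuple

VendorInfo = Tuple[str, str]  # (manufacturer, part_number)


class Module:
    """Hypothetical module object with a cached get_vendor_info()."""

    def __init__(self, index: int, read_eeprom: Callable[[], VendorInfo]):
        self.index = index
        self._read_eeprom = read_eeprom  # assumed EEPROM reader
        self._vendor_info: Optional[VendorInfo] = None

    def get_vendor_info(self) -> VendorInfo:
        # Read the EEPROM once, then serve the cached value.
        if self._vendor_info is None:
            self._vendor_info = self._read_eeprom()
        return self._vendor_info

    def invalidate(self) -> None:
        # Called on unplug so the next plug-in re-reads the EEPROM.
        self._vendor_info = None


class VendorInfoPublisher:
    """Sends vendor info only when it differs from what was last sent."""

    def __init__(self, send: Callable[[int, str, str], None]):
        self._send = send
        self._published: Dict[int, VendorInfo] = {}

    def publish_if_changed(self, module: Module) -> bool:
        info = module.get_vendor_info()
        if self._published.get(module.index) == info:
            return False  # same manufacturer + part number: skip
        self._send(module.index, *info)
        self._published[module.index] = info
        return True
```

Under these assumptions the three verification cases map onto return values: the first plug-in returns True, a replug of the same module (after `invalidate()`) returns False, and a module with a different manufacturer or part number returns True again.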
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
202412