Add support for liquid cooling inside hardware checker #24124

Merged
qiluo-msft merged 2 commits into sonic-net:master from yuazhe:master_lc_system_health on Dec 4, 2025

Conversation

@yuazhe
Contributor

@yuazhe yuazhe commented Sep 26, 2025

Why I did it

Following the liquid cooling HLD (sonic-net/SONiC#2032), I created this PR to add a check inside the hardware checker that monitors liquid cooling device status.

Work item tracking
  • Microsoft ADO (number only):

How I did it

How to verify it

Which release branch to backport (provide reason below if selected)

  • 202205
  • 202211
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@yuazhe yuazhe requested a review from lguohan as a code owner September 26, 2025 06:53
@linux-foundation-easycla

linux-foundation-easycla bot commented Sep 26, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yuazhe yuazhe force-pushed the master_lc_system_health branch from 9ca9219 to 814ae84 Compare September 29, 2025 05:51

@yuazhe yuazhe force-pushed the master_lc_system_health branch from 814ae84 to 4c6f08c Compare September 29, 2025 05:54

@keboliu
Collaborator

keboliu commented Oct 22, 2025

@judyjoseph would you please help to review this PR?

@judyjoseph
Contributor

@qiluo-msft are we still OK to use the eventd infrastructure? Remember, we found some memory leaks there.

@judyjoseph
Contributor

> @qiluo-msft are we still OK to use the eventd infrastructure? Remember, we found some memory leaks there.

@qiluo-msft can you comment? Thanks.

@qiluo-msft qiluo-msft requested a review from zbud-msft November 21, 2025 18:43
Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for monitoring liquid cooling devices in the system health checker, implementing functionality defined in the liquid cooling HLD. The implementation follows existing patterns for hardware monitoring (ASIC, PSU, fan) and introduces an opt-in configuration mechanism through include_devices.

Key Changes:

  • Added liquid cooling leak detection monitoring in the hardware checker with event publishing capabilities
  • Introduced include_devices configuration field to enable optional device monitoring features
  • Refactored event publisher lifecycle management to initialize/deinitialize within function scope
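The opt-in behavior and leak check summarized above can be illustrated with a minimal sketch. This is not the PR's code: only the `include_devices` field name and the `liquid_cooling` value come from the review summary. The list-of-strings shape of the field, the function names, the sensor names, and the `(status, message)` result shape are all assumptions made for illustration.

```python
import json

# Hypothetical config fragment. Only the `include_devices` key and the
# `liquid_cooling` value appear in the PR summary; the list shape is assumed.
CONFIG_TEXT = '{"include_devices": ["liquid_cooling"]}'


def is_device_included(config, device):
    """Return True only when the device was explicitly opted in."""
    return device in config.get("include_devices", [])


def check_liquid_cooling(config, leak_sensors):
    """Map each leak sensor to a (status, message) pair.

    `leak_sensors` is a hypothetical dict of sensor name -> bool
    (True means a leak was detected). Returns an empty dict when
    liquid cooling monitoring has not been enabled in the config.
    """
    if not is_device_included(config, "liquid_cooling"):
        return {}
    results = {}
    for name in sorted(leak_sensors):
        if leak_sensors[name]:
            results[name] = ("Not OK", f"{name} is leaking")
        else:
            results[name] = ("OK", "")
    return results


config = json.loads(CONFIG_TEXT)
report = check_liquid_cooling(config, {"leakage1": False, "leakage2": True})
```

Because the check is skipped unless the device is listed, platforms without liquid cooling hardware would see no change in behavior under this sketch.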

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/system-health/health_checker/hardware_checker.py | Adds _check_liquid_cooling_status() method to monitor leak sensors and publish events; includes event publisher helper method |
| src/system-health/health_checker/config.py | Adds include_devices configuration field to support opt-in device monitoring |
| src/system-health/health_checker/service_checker.py | Refactors event publisher to initialize/deinitialize within method scope instead of as instance variable |
| src/system-health/health_checker/system_health_monitoring_config.json | Adds include_devices field to production configuration |
| src/system-health/tests/test_system_health.py | Adds test coverage for liquid cooling status checking with multiple leak scenarios and configuration validation |
| src/system-health/tests/system_health_monitoring_config.json | Adds include_devices field with liquid_cooling enabled for test configuration |
| src/system-health/health_checker/utils.py | Removes extraneous blank line (code cleanup) |
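The publisher lifecycle change described for service_checker.py (initialize and deinitialize within the method rather than holding the handle on the instance) might look roughly like the following sketch. It uses a stand-in `FakePublisher` class instead of the real event library so that it runs anywhere; the class, the `scoped_publisher` helper, and the source and tag strings are hypothetical and do not reflect the PR's actual identifiers.

```python
from contextlib import contextmanager


class FakePublisher:
    """Stand-in for an event publisher handle; records published events."""

    def __init__(self, source):
        self.source = source
        self.published = []
        self.closed = False

    def publish(self, tag, params):
        if self.closed:
            raise RuntimeError("publisher already deinitialized")
        self.published.append((tag, dict(params)))

    def close(self):
        self.closed = True


@contextmanager
def scoped_publisher(source, factory=FakePublisher):
    """Create a publisher for one block and always release it afterwards."""
    publisher = factory(source)
    try:
        yield publisher
    finally:
        # Runs on normal exit and on exceptions, so no handle outlives the call.
        publisher.close()


def report_leak(sensor_name, leaking):
    """Publish one leak event using a publisher that lives only for this call."""
    with scoped_publisher("example-source") as pub:
        pub.publish(
            "example-leak-event",
            {"sensor": sensor_name, "leaking": str(leaking).lower()},
        )
    return pub  # returned only so the caller can inspect the stand-in


pub = report_leak("leakage1", True)
```

Scoping the handle to the call means nothing is kept open between health check cycles, at the cost of setting up the publisher each time an event is sent.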
Comments suppressed due to low confidence (1)

src/system-health/health_checker/hardware_checker.py:12

  • The class docstring is outdated and should include liquid cooling as part of the hardware checks being performed. Consider updating it to: "Check system hardware status. For now, it checks ASIC, PSU, fan and liquid cooling status."
    """
    Check system hardware status. For now, it checks ASIC, PSU and fan status.

@judyjoseph
Contributor

@qiluo-msft could you review again

@judyjoseph
Contributor

@yuazhe can you update the branch

@yuazhe
Contributor Author

yuazhe commented Dec 3, 2025

> @yuazhe can you update the branch

Updating the branch will cause the checks to be re-run, which should be unnecessary because the change can be merged cleanly.

@qiluo-msft
Collaborator

@yuazhe Could you please reduce the force-pushes in this PR? We need to review the code between commits to understand what the new changes are. Force-pushing destroys history and makes code review inefficient.

@yuazhe
Contributor Author

yuazhe commented Dec 4, 2025

> @yuazhe Could you please reduce the force-pushes in this PR? We need to review the code between commits to understand what the new changes are. Force-pushing destroys history and makes code review inefficient.

My bad. I thought I had to keep the PR as a single commit at all times, but it turns out that during the merge the CI automatically combines the commits into one. I will stop doing that; thanks for the reminder.

Signed-off-by: Yuanzhe, Liu <yualiu@nvidia.com>
@yuazhe yuazhe force-pushed the master_lc_system_health branch from 63e2bf8 to 1efe5e1 Compare December 4, 2025 08:42
@qiluo-msft qiluo-msft merged commit a61dc07 into sonic-net:master Dec 4, 2025
23 checks passed
hdwhdw pushed a commit to hdwhdw/sonic-buildimage that referenced this pull request Dec 4, 2025
kewei-arista pushed a commit to kewei-arista/sonic-buildimage that referenced this pull request Dec 8, 2025
xwjiang-ms pushed a commit to xwjiang-ms/sonic-buildimage that referenced this pull request Dec 22, 2025
Signed-off-by: xiaweijiang <xiaweijiang@microsoft.com>
jasonbridges pushed a commit to jasonbridges/sonic-buildimage that referenced this pull request Jan 22, 2026
FengPan-Frank pushed a commit to FengPan-Frank/sonic-buildimage that referenced this pull request Mar 6, 2026
Signed-off-by: Feng Pan <fenpan@microsoft.com>
7 participants