[system-health] Add support for monitoring system health #4835
Merged
liat-grozovik merged 65 commits into sonic-net:master on Oct 12, 2020
…le; 2. set system led in library instead of daemon
jleveque
suggested changes
Sep 14, 2020
jleveque
suggested changes
Sep 15, 2020
jleveque
reviewed
Sep 15, 2020
jleveque
previously approved these changes
Sep 15, 2020
Contributor
jleveque
left a comment
Looks good to me. Please wait for other reviewers.
keboliu
previously approved these changes
Sep 15, 2020
Collaborator
@sujinmkang, does this look OK to you? If so, would you please approve?
sujinmkang
previously approved these changes
Sep 21, 2020
fe2f1be
Collaborator
retest this please
Collaborator
Author
retest vs please
sujinmkang
previously approved these changes
Sep 25, 2020
liat-grozovik
previously approved these changes
Oct 4, 2020
Collaborator
liat-grozovik
left a comment
and it comes with tests :-)
Collaborator
@Junchao-Mellanox, could you please resolve the conflicts? Once all tests pass, we can merge.
466f983
Collaborator
Author
@liat-grozovik resolved.
keboliu
approved these changes
Oct 12, 2020
liat-grozovik
approved these changes
Oct 12, 2020
santhosh-kt pushed a commit to santhosh-kt/sonic-buildimage that referenced this pull request on Feb 25, 2021
)
* system health first commit
* system health daemon first commit
* Finish healthd
* Changes due to lower layer logic change
* Get ASIC temperature from TEMPERATURE_INFO table
* Add system health make rule and service files
* fix bugs found during manual test
* Change make file to install system-health library to host
* Set system LED to blink on bootup time
* Caught exceptions in system health checker to make it more robust
* fix issue that fan/psu presence will always be true
* fix issue for external checker
* move system-health service to right after rc-local service
* Set system-health service start after database service
* Get system up time via /proc/uptime
* Provide more information in stat for CLI to use
* fix typo
* Set default category to External for external checker
* If external checker reported OK, save it to stat too
* Trim string for external checker output
* fix issue: PSU voltage check always return OK
* Add unit test cases for system health library
* Fix LGTM warnings
* fix demo comments: 1. get boot up timeout from monit configuration file; 2. set system led in library instead of daemon
* Remove boot_timeout configuration because it will get from monit config file
* Fix argument miss
* fix unit test failure
* fix issue: summary status is not correct
* Fix format issues found in code review
* rename th to threshold to make it clearer
* Fix review comment: 1. add a .dep file for system health; 2. deprecated daemon_base and uses sonic-py-common instead
* Fix unit test failure
* Fix LGTM alert
* Fix LGTM alert
* Fix review comments
* Fix review comment
* 1. Add relevant comments for system health; 2. rename external_checker to user_define_checker
* Ignore check for unknown service type
* Fix unit test issue
* Rename user define checker to user defined checker
* Rename user_define_checkers to user_defined_checkers for configuration file
* Rename file user_define_checker.py -> user_defined_checker.py
* Fix typo
* Adjust import order for config.py
Co-authored-by: Joe LeVeque <[email protected]>
* Adjust import order for src/system-health/health_checker/hardware_checker.py
Co-authored-by: Joe LeVeque <[email protected]>
* Adjust import order for src/system-health/scripts/healthd
Co-authored-by: Joe LeVeque <[email protected]>
* Adjust import orders in src/system-health/tests/test_system_health.py
* Fix typo
* Add new line after import
* If system health configuration file not exist, healthd should exit
* Fix indent and enable pytest coverage
* Fix typo
* Fix typo
* Remove global logger and use log functions inherited from super class
* Change info level logger to notice level
Co-authored-by: Joe LeVeque <[email protected]>
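Two of the commits above mention reading the system uptime from /proc/uptime and treating boot-up time specially. As a rough illustration only (this is not the code from the PR), the idea could be sketched as below. The function names and the `boot_timeout_seconds` parameter are hypothetical; the only fact relied on is that /proc/uptime contains two numbers, the first being seconds since boot.

```python
def parse_uptime(text):
    """Return uptime in seconds from the contents of /proc/uptime.

    The file holds two floats: seconds since boot, then aggregate idle seconds.
    """
    return float(text.split()[0])


def read_uptime(path="/proc/uptime"):
    """Read and parse the uptime file (Linux only)."""
    with open(path) as f:
        return parse_uptime(f.read())


def is_booting(uptime_seconds, boot_timeout_seconds):
    """True while the system is still inside the boot-up window.

    `boot_timeout_seconds` is a hypothetical parameter for this sketch.
    """
    return uptime_seconds < boot_timeout_seconds
```

A caller on a Linux host might use `is_booting(read_uptime(), 300)` to decide whether failures should be reported yet; the value 300 is an arbitrary example, not a value taken from the PR.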
- Why I did it
The system health feature requires a service that collects system status and manages the system status LED.
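To make the role of such a service concrete, here is a minimal sketch of how collected status might be reduced to a single summary and mapped to an LED state. It is an illustration under stated assumptions, not the implementation in this PR: the layout of `stats`, the function names, and the LED state names are all hypothetical.

```python
OK = "OK"
NOT_OK = "Not OK"


def summarize(stats):
    """Return OK only if every checked object in every category is OK.

    `stats` maps category -> object name -> {"status": ...}; this layout
    is an assumption made for the sketch.
    """
    for objects in stats.values():
        for info in objects.values():
            if info.get("status") != OK:
                return NOT_OK
    return OK


def choose_led_state(summary, booting):
    """Map the summary to an LED state (state names are hypothetical)."""
    if booting:
        return "blinking"
    return "normal" if summary == OK else "fault"


# Example input: one failing fan makes the whole summary Not OK.
stats = {
    "Hardware": {"PSU 1": {"status": OK}, "fan1": {"status": NOT_OK}},
    "External": {"my_checker": {"status": OK}},
}
summary = summarize(stats)
led_state = choose_led_state(summary, booting=False)
```

Keeping the summary logic separate from the LED mapping means the same summary can also be reported to a CLI without touching the LED code.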
- How I did it
- How to verify it
Manually tested on SN2700. Regression-tested on SN2700, SN4600C, SN4700, SN3800, and SN2100.
- Description for the changelog