[system-health] Add support for monitoring system health #4835
Merged: liat-grozovik merged 65 commits into sonic-net:master from Junchao-Mellanox:system-health on Oct 12, 2020
Commits
f3d3fb5 system health first commit (Junchao-Mellanox)
63623a7 system health daemon first commit (Junchao-Mellanox)
e988130 Finish healthd (Junchao-Mellanox)
7ed33df Changes due to lower layer logic change (Junchao-Mellanox)
fd301e6 Get ASIC temperature from TEMPERATURE_INFO table (Junchao-Mellanox)
77d57cc Add system health make rule and service files (Junchao-Mellanox)
ae00266 fix bugs found during manual test (Junchao-Mellanox)
ad8a740 Change make file to install system-health library to host (Junchao-Mellanox)
cf861fe Set system LED to blink on bootup time (Junchao-Mellanox)
7eb6082 Caught exceptions in system health checker to make it more robust (Junchao-Mellanox)
91c43f0 fix issue that fan/psu presence will always be true (Junchao-Mellanox)
509fa5c fix issue for external checker (Junchao-Mellanox)
d88515d move system-health service to right after rc-local service (Junchao-Mellanox)
a198cc5 Set system-health service start after database service (Junchao-Mellanox)
30b4668 Get system up time via /proc/uptime (Junchao-Mellanox)
8fea891 Provide more information in stat for CLI to use (Junchao-Mellanox)
0134052 fix typo (Junchao-Mellanox)
f1def48 Set default category to External for external checker (Junchao-Mellanox)
7123b8e If external checker reported OK, save it to stat too (Junchao-Mellanox)
d68a43c Trim string for external checker output (Junchao-Mellanox)
b24c6f8 fix issue: PSU voltage check always return OK (Junchao-Mellanox)
d9d125d Add unit test cases for system health library (Junchao-Mellanox)
465efa7 Fix LGTM warnings (Junchao-Mellanox)
8ca8a26 Merge branch 'master' into system-health (Junchao-Mellanox)
cd17e6b fix demo comments: 1. get boot up timeout from monit configuration fi… (Junchao-Mellanox)
3fbff53 Remove boot_timeout configuration because it will get from monit conf… (Junchao-Mellanox)
a9dcb26 Fix argument miss (Junchao-Mellanox)
da272cc fix unit test failure (Junchao-Mellanox)
622cb3e fix issue: summary status is not correct (Junchao-Mellanox)
084c2e2 Fix format issues found in code review (Junchao-Mellanox)
f84cdd9 rename th to threshold to make it clearer (Junchao-Mellanox)
0a5ed17 Merge branch 'master' into system-health (Junchao-Mellanox)
0c1b6ff Fix review comment: 1. add a .dep file for system health; 2. deprecat… (Junchao-Mellanox)
e1c62f7 Fix unit test failure (Junchao-Mellanox)
1092779 Fix LGTM alert (Junchao-Mellanox)
866c0d3 Fix LGTM alert (Junchao-Mellanox)
c237886 Merge branch 'master' into system-health (Junchao-Mellanox)
a05ca87 Merge branch 'master' into system-health (Junchao-Mellanox)
fbfd654 Merge branch 'system-health' of github.com:Junchao-Mellanox/sonic-bui… (Junchao-Mellanox)
7dc033b Fix review comments (Junchao-Mellanox)
3c722e1 Fix review comment (Junchao-Mellanox)
911b6aa Merge branch 'master' into system-health (Junchao-Mellanox)
035cec9 1. Add relevant comments for system health; 2. rename external_checke… (Junchao-Mellanox)
30235fc Merge branch 'system-health' of github.com:Junchao-Mellanox/sonic-bui… (Junchao-Mellanox)
183ddcc Ignore check for unknown service type (Junchao-Mellanox)
451a395 Fix unit test issue (Junchao-Mellanox)
011b3af Rename user define checker to user defined checker (Junchao-Mellanox)
a30d9b5 Rename user_define_checkers to user_defined_checkers for configuratio… (Junchao-Mellanox)
001141c Renmae file user_define_checker.py -> user_defined_checker.py (Junchao-Mellanox)
8ad6dc7 Fix typo (Junchao-Mellanox)
12eef05 Adjust import order for config.py (Junchao-Mellanox)
14808cf Adjust import order for src/system-health/health_checker/hardware_che… (Junchao-Mellanox)
610fb49 Adjust import order for src/system-health/scripts/healthd (Junchao-Mellanox)
6d0ae4c Adjust import orders in src/system-health/tests/test_system_health.py (Junchao-Mellanox)
aece158 Fix typo (Junchao-Mellanox)
8812061 Add new line after import (Junchao-Mellanox)
8ea2ab5 If system health configuration file not exist, healthd should exit (Junchao-Mellanox)
d4c2df4 Merge branch 'master' into system-health (Junchao-Mellanox)
9de4127 Fix indent and enable pytest coverage (Junchao-Mellanox)
c9f09b0 Fix typo (Junchao-Mellanox)
ae9a476 Fix typo (Junchao-Mellanox)
78a2dc6 Remove global logger and use log functions inherited from super class (Junchao-Mellanox)
cb7f5d2 Change info level logger to notice level (Junchao-Mellanox)
fe2f1be Merge branch 'master' into system-health (Junchao-Mellanox)
466f983 Merge branch 'master' into system-health (Junchao-Mellanox)
8 changes: 4 additions & 4 deletions
device/mellanox/x86_64-mlnx_msn2010-r0/system_health_monitoring_config.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,11 +1,11 @@ | ||
| { | ||
| "services_to_ignore": [], | ||
| "devices_to_ignore": ["psu.voltage", "psu.temperature"], | ||
| "external_checkers": [], | ||
| "user_defined_checkers": [], | ||
| "polling_interval": 60, | ||
| "led_color": { | ||
| "fault": "orange", | ||
| "normal": "green", | ||
| "booting": "orange_blink" | ||
| } | ||
| } |
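A minimal sketch of how a consumer might resolve an LED color from a file like the one above, falling back to built-in defaults when a status is not overridden. The `resolve_led_color` helper and the inline JSON string are illustrative assumptions, not code from this PR; the default values mirror `DEFAULT_LED_CONFIG` in the PR's `Config` class.

```python
import json

# Defaults mirroring DEFAULT_LED_CONFIG in the PR's Config class.
DEFAULT_LED_CONFIG = {
    "fault": "red",
    "normal": "green",
    "booting": "orange_blink",
}


def resolve_led_color(config_data, status):
    """Return the configured LED color for `status`, else the default (hypothetical helper)."""
    led_colors = (config_data or {}).get("led_color", {})
    return led_colors.get(status, DEFAULT_LED_CONFIG[status])


# Inline stand-in for a platform's system_health_monitoring_config.json.
platform_config = json.loads("""
{
    "polling_interval": 60,
    "led_color": {"fault": "orange", "normal": "green"}
}
""")

print(resolve_led_color(platform_config, "fault"))    # orange (platform override)
print(resolve_led_color(platform_config, "booting"))  # orange_blink (default)
print(resolve_led_color(None, "fault"))               # red (no config loaded)
```

Keeping the defaults in code means a platform file only has to list the statuses whose color differs.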
8 changes: 4 additions & 4 deletions
device/mellanox/x86_64-mlnx_msn2700-r0/system_health_monitoring_config.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,11 +1,11 @@ | ||
| { | ||
| "services_to_ignore": [], | ||
| "devices_to_ignore": ["psu.voltage"], | ||
| "external_checkers": [], | ||
| "user_defined_checkers": [], | ||
| "polling_interval": 60, | ||
| "led_color": { | ||
| "fault": "orange", | ||
| "normal": "green", | ||
| "booting": "orange_blink" | ||
| } | ||
| } |
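List-valued keys such as `devices_to_ignore` can be normalized into sets before use, which removes duplicates and makes membership checks cheap. The sketch below follows the behavior of `_get_list_data` in this PR's `Config` class (a list becomes a set, anything else yields `None`); the standalone `get_list_data` function and the inline JSON are assumptions made for illustration.

```python
import json


def get_list_data(config_data, key):
    """Return the list stored under `key` as a set, or None if absent or not a list."""
    data = config_data.get(key)
    if isinstance(data, list):
        return set(data)
    return None


# Inline stand-in for a platform configuration file (illustrative values).
config_data = json.loads("""
{
    "services_to_ignore": [],
    "devices_to_ignore": ["psu.voltage", "psu.voltage"],
    "polling_interval": 60
}
""")

print(get_list_data(config_data, "devices_to_ignore"))      # {'psu.voltage'}
print(get_list_data(config_data, "polling_interval"))       # None (not a list)
print(get_list_data(config_data, "user_defined_checkers"))  # None (key absent)
```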
8 changes: 4 additions & 4 deletions
device/mellanox/x86_64-mlnx_msn2700_simx-r0/system_health_monitoring_config.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,11 +1,11 @@ | ||
| { | ||
| "services_to_ignore": [], | ||
| "devices_to_ignore": ["psu","asic","fan"], | ||
| "external_checkers": [], | ||
| "user_defined_checkers": [], | ||
| "polling_interval": 60, | ||
| "led_color": { | ||
| "fault": "orange", | ||
| "normal": "green", | ||
| "booting": "orange_blink" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| [Unit] | ||
| Description=SONiC system health monitor | ||
| Requires=database.service updategraph.service | ||
| After=database.service updategraph.service | ||
| | ||
| [Service] | ||
| ExecStart=/usr/local/bin/healthd | ||
| Restart=always | ||
| | ||
| [Install] | ||
| WantedBy=multi-user.target |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| SPATH := $($(SYSTEM_HEALTH)_SRC_PATH) | ||
| DEP_FILES := $(SONIC_COMMON_FILES_LIST) rules/system-health.mk rules/system-health.dep | ||
| DEP_FILES += $(SONIC_COMMON_BASE_FILES_LIST) | ||
| DEP_FILES += $(shell git ls-files $(SPATH)) | ||
| | ||
| $(SYSTEM_HEALTH)_CACHE_MODE := GIT_CONTENT_SHA | ||
| $(SYSTEM_HEALTH)_DEP_FLAGS := $(SONIC_COMMON_FLAGS_LIST) | ||
| $(SYSTEM_HEALTH)_DEP_FILES := $(DEP_FILES) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| # system health python2 wheel | ||
| | ||
| SYSTEM_HEALTH = system_health-1.0-py2-none-any.whl | ||
| $(SYSTEM_HEALTH)_SRC_PATH = $(SRC_PATH)/system-health | ||
| $(SYSTEM_HEALTH)_PYTHON_VERSION = 2 | ||
| $(SYSTEM_HEALTH)_DEPENDS = $(SONIC_PY_COMMON_PY2) $(SWSSSDK_PY2) $(SONIC_CONFIG_ENGINE) | ||
| SONIC_PYTHON_WHEELS += $(SYSTEM_HEALTH) | ||
| | ||
| export system_health_py2_wheel_path="$(addprefix $(PYTHON_WHEELS_PATH)/,$(SYSTEM_HEALTH))" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| */deb_dist/ | ||
| */dist/ | ||
| */build/ | ||
| */*.tar.gz | ||
| */*.egg-info | ||
| */.cache/ | ||
| *.pyc | ||
| */__pycache__/ |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| from . import hardware_checker | ||
| from . import service_checker |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,144 @@ | ||
| import json | ||
| import os | ||
| | ||
| from sonic_py_common import device_info | ||
| | ||
| | ||
| class Config(object): | ||
| """ | ||
| Manage configuration of system health. | ||
| """ | ||
| | ||
| # Default system health check interval | ||
| DEFAULT_INTERVAL = 60 | ||
| | ||
| # Default boot up timeout. After a reboot, system health waits for this timeout before starting to work. | ||
| DEFAULT_BOOTUP_TIMEOUT = 300 | ||
| | ||
| # Default LED configuration. Different platforms have different LED capabilities. This configuration allows vendors to | ||
| # override the default behavior. | ||
| DEFAULT_LED_CONFIG = { | ||
| 'fault': 'red', | ||
| 'normal': 'green', | ||
| 'booting': 'orange_blink' | ||
| } | ||
| | ||
| # System health configuration file name | ||
| CONFIG_FILE = 'system_health_monitoring_config.json' | ||
| | ||
| # Monit service configuration file path | ||
| MONIT_CONFIG_FILE = '/etc/monit/monitrc' | ||
| | ||
| # Monit service start delay configuration entry | ||
| MONIT_START_DELAY_CONFIG = 'with start delay' | ||
| | ||
| def __init__(self): | ||
| """ | ||
| Constructor. Initialize all configuration entries to their default values in case there is no configuration file. | ||
| """ | ||
| self.platform_name = device_info.get_platform() | ||
| self._config_file = os.path.join('/usr/share/sonic/device/', self.platform_name, Config.CONFIG_FILE) | ||
| self._last_mtime = None | ||
| self.config_data = None | ||
| self.interval = Config.DEFAULT_INTERVAL | ||
| self.ignore_services = None | ||
| self.ignore_devices = None | ||
| self.user_defined_checkers = None | ||
| | ||
| def config_file_exists(self): | ||
| return os.path.exists(self._config_file) | ||
| | ||
| def load_config(self): | ||
| """ | ||
| Load the configuration file from disk. | ||
| 1. If there is no configuration file, current config entries are reset to their default values | ||
| 2. Only read the configuration file if last_mtime has changed, for better performance | ||
| 3. If there are any format issues in the configuration file, current config entries are reset to their default values | ||
| :return: | ||
| """ | ||
| if not self.config_file_exists(): | ||
| if self._last_mtime is not None: | ||
| self._reset() | ||
| return | ||
| | ||
| mtime = os.stat(self._config_file).st_mtime | ||
| if mtime != self._last_mtime: | ||
| try: | ||
| self._last_mtime = mtime | ||
| with open(self._config_file, 'r') as f: | ||
| self.config_data = json.load(f) | ||
| | ||
| self.interval = self.config_data.get('polling_interval', Config.DEFAULT_INTERVAL) | ||
| self.ignore_services = self._get_list_data('services_to_ignore') | ||
| self.ignore_devices = self._get_list_data('devices_to_ignore') | ||
| self.user_defined_checkers = self._get_list_data('user_defined_checkers') | ||
| except Exception: | ||
| self._reset() | ||
| | ||
| def _reset(self): | ||
| """ | ||
| Reset current configuration entry to default value | ||
| :return: | ||
| """ | ||
| self._last_mtime = None | ||
| self.config_data = None | ||
| self.interval = Config.DEFAULT_INTERVAL | ||
| self.ignore_services = None | ||
| self.ignore_devices = None | ||
| self.user_defined_checkers = None | ||
| | ||
| def get_led_color(self, status): | ||
| """ | ||
| Get desired LED color according to the input status | ||
| :param status: System health status | ||
| :return: String LED color | ||
| """ | ||
| if self.config_data and 'led_color' in self.config_data: | ||
| if status in self.config_data['led_color']: | ||
| return self.config_data['led_color'][status] | ||
| | ||
| return self.DEFAULT_LED_CONFIG[status] | ||
| | ||
| def get_bootup_timeout(self): | ||
| """ | ||
| Get boot up timeout from monit configuration file. | ||
| 1. If monit configuration file does not exist, return default value | ||
| 2. If there is any exception while parsing monit config, return default value | ||
| :return: Integer timeout value | ||
| """ | ||
| if not os.path.exists(Config.MONIT_CONFIG_FILE): | ||
| return self.DEFAULT_BOOTUP_TIMEOUT | ||
| | ||
| try: | ||
| with open(Config.MONIT_CONFIG_FILE) as f: | ||
| lines = f.readlines() | ||
| for line in lines: | ||
| if not line: | ||
| continue | ||
| | ||
| line = line.strip() | ||
| if not line: | ||
| continue | ||
| | ||
| pos = line.find('#') | ||
| if pos == 0: | ||
| continue | ||
| | ||
| # Strip a trailing comment; slicing with pos == -1 would drop the last character | ||
| line = line.split('#', 1)[0] | ||
| pos = line.find(Config.MONIT_START_DELAY_CONFIG) | ||
| if pos != -1: | ||
| return int(line[pos + len(Config.MONIT_START_DELAY_CONFIG):].strip()) | ||
| except Exception: | ||
| return self.DEFAULT_BOOTUP_TIMEOUT | ||
| | ||
| def _get_list_data(self, key): | ||
| """ | ||
| Get list type configuration data by key and remove duplicate element. | ||
| :param key: Key of the configuration entry | ||
| :return: A set of configuration data if key exists | ||
| """ | ||
| if key in self.config_data: | ||
| data = self.config_data[key] | ||
| if isinstance(data, list): | ||
| return set(data) | ||
| return None |
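The start-delay lookup in `get_bootup_timeout` can be exercised without a real monit installation by parsing text directly. The following is a sketch under stated assumptions, not the PR's implementation: `parse_start_delay` is a hypothetical helper, the sample monitrc text is illustrative, and the two constants copy the values from `Config`. It also returns the default when no line contains the entry, and it leaves lines without a `#` intact.

```python
MONIT_START_DELAY_CONFIG = "with start delay"  # same marker string as in Config
DEFAULT_BOOTUP_TIMEOUT = 300                   # same default as in Config


def parse_start_delay(monitrc_text, default=DEFAULT_BOOTUP_TIMEOUT):
    """Return the monit start delay found in the text, or `default` if none is found."""
    for raw_line in monitrc_text.splitlines():
        # Drop any trailing comment. str.partition leaves the line intact when
        # there is no '#', unlike slicing with the -1 returned by str.find.
        line = raw_line.partition("#")[0].strip()
        pos = line.find(MONIT_START_DELAY_CONFIG)
        if pos == -1:
            continue
        value = line[pos + len(MONIT_START_DELAY_CONFIG):].strip()
        try:
            return int(value)
        except ValueError:
            return default
    return default


# Illustrative monitrc content, not taken from a real device.
sample = """
# set daemon 120 with start delay 10
set daemon 60
    with start delay 300
set logfile syslog  # trailing comment
"""

print(parse_start_delay(sample))                           # 300
print(parse_start_delay("set daemon 60", default=42))      # 42 (no entry found)
print(parse_start_delay("with start delay 240 # booting")) # 240
```

Taking the file content as a string keeps the parsing logic separate from file access, so the same function can be unit tested with small inline samples.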