Adjust system health HLD due to output of 'monit summary -B' command change #887
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,58 +5,53 @@ | |
| | Rev | Date | Author | Change Description | | ||
| |:---:|:-----------:|:------------------:|-----------------------------------| | ||
| | 0.1 | | Kebo Liu | Initial version | | ||
|
|
||
| | 0.2 | | Junchao Chen | Check service status without monit| | ||
|
|
||
|
|
||
| ## 1. Overview of the system health monitor | ||
|
|
||
| System health monitor is intended to monitor both critical services and peripheral device status and leverage system log, system status LED to and CLI command output to indicate the system status. | ||
|
|
||
| In current SONiC implementation, already have Monit which is monitoring the critical services status and also have a set of daemons(psud, thermaltcld, etc.) inside PMON collecting the peripheral devices status. | ||
|
|
||
| System health monitoring service will not monitor the critical services or devices directly, it will reuse the result of Monit and PMON daemons to summary the current status and decide the color of the system health LED. | ||
|
|
||
| ### 1.1 Services under Monit monitoring | ||
|
|
||
| For the Monit, now below services and file system is under monitoring: | ||
|
|
||
| admin@sonic# monit summary -B | ||
| Monit 5.20.0 uptime: 1h 6m | ||
| Service Name Status Type | ||
| sonic Running System | ||
| rsyslog Running Process | ||
| telemetry Running Process | ||
| dialout_client Running Process | ||
| syncd Running Process | ||
| orchagent Running Process | ||
| portsyncd Running Process | ||
| neighsyncd Running Process | ||
| vrfmgrd Running Process | ||
| vlanmgrd Running Process | ||
| intfmgrd Running Process | ||
| portmgrd Running Process | ||
| buffermgrd Running Process | ||
| nbrmgrd Running Process | ||
| vxlanmgrd Running Process | ||
| snmpd Running Process | ||
| snmp_subagent Running Process | ||
| sflowmgrd Running Process | ||
| lldpd_monitor Running Process | ||
| lldp_syncd Running Process | ||
| lldpmgrd Running Process | ||
| redis_server Running Process | ||
| zebra Running Process | ||
| fpmsyncd Running Process | ||
| bgpd Running Process | ||
| staticd Running Process | ||
| bgpcfgd Running Process | ||
| root-overlay Accessible Filesystem | ||
| var-log Accessible Filesystem | ||
|
|
||
|
|
||
| By default any above services or file systems is not in good status will be considered as fault condition. | ||
|
|
||
| ### 1.2 Peripheral devices status which could impact the system health status | ||
| System health monitor is intended to monitor both critical services/processes and peripheral device status and leverage system log, system status LED to and CLI command output to indicate the system status. | ||
|
|
||
| In the current SONiC implementation, the Monit service can monitor the file system as well as the status of customized check scripts, so the system health monitor can rely on the Monit service to monitor these items. There is also a set of daemons (psud, thermalctld, etc.) inside PMON collecting the peripheral devices status. | ||
|
||
|
|
||
| System health monitor needs to monitor the critical service/processes status and reuse the result of Monit service/PMON daemons to summary the current status and decide the color of the system health LED. | ||
|
||
|
|
||
| ### 1.1 Monitor critical services/processes | ||
|
|
||
| #### 1.1.1 Monitor critical services | ||
|
Contributor
We have a script
Contributor
Author
System health borrows some of the code in "container_checker". What do you mean by borrow? |
||
|
|
||
| 1. Read FEATURE table in CONFIG_DB, any feature with state "enabled" or "always_enabled" is expected to run in the system | ||
|
||
| 2. Get running services via docker tool (Use python docker library to get running containers) | ||
| 3. Compare result of #1 and result of #2, any difference will be considered as fault condition | ||
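The three steps above amount to a set comparison. A minimal sketch in Python, assuming the FEATURE table has already been read into a dict and the running container names have already been collected (for example through the python docker library). The function `find_service_faults` and its argument names are hypothetical and are not taken from the implementation:

```python
def find_service_faults(feature_table, running_containers):
    """Return the sorted names of services whose state is unexpected.

    feature_table: dict mapping feature name -> attribute dict, as it
        might be read from the FEATURE table in CONFIG_DB (assumed shape).
    running_containers: iterable of running container names (assumed to
        be collected beforehand).
    """
    # Step 1: features that are expected to run.
    expected = {
        name
        for name, attrs in feature_table.items()
        if attrs.get("state") in ("enabled", "always_enabled")
    }
    # Step 2: what is actually running.
    running = set(running_containers)
    # Step 3: any difference, in either direction, is reported as a fault.
    return sorted(expected ^ running)
```

The symmetric difference is used here because the text says "any difference" is a fault; a checker that only cares about missing services would use `expected - running` instead.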
|
|
||
| #### 1.1.2 Monitor critical processes | ||
|
|
||
| 1. Read FEATURE table in CONFIG_DB, any feature with state "enabled" or "always_enabled" is expected to run in the system | ||
|
||
| 2. Get the critical process list for each running service by reading the file /etc/supervisor/critical_processes (Use `docker inspect <container_name> --format "{{.GraphDriver.Data.MergedDir}}"` to get the base directory for a container) | ||
|
||
| 3. For each container, use "supervisorctl status" to get its critical process status; any critical process that is not in "RUNNING" status will be considered a fault condition. | ||
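Step 3 can be illustrated with a small parser. This is a sketch only: it assumes the text output of "supervisorctl status" has already been captured for one container, that each line starts with the process name followed by its state, and that the critical process names have already been extracted from the critical_processes file. The function name `find_process_faults` is hypothetical, and treating a process that is absent from the output as a fault is an assumption of this sketch:

```python
def find_process_faults(status_output, critical_processes):
    """Return the sorted critical processes that are not RUNNING.

    status_output: captured text of "supervisorctl status" (assumed to
        list one process per line as "<name> <STATE> <details...>").
    critical_processes: iterable of critical process names (assumed to
        be parsed from the critical_processes file beforehand).
    """
    states = {}
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # First column is the process name, second is its state.
            states[fields[0]] = fields[1]
    # A process missing from the output has no state and is also reported.
    return sorted(
        name for name in critical_processes if states.get(name) != "RUNNING"
    )
```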
|
|
||
| ### 1.2 Services under Monit monitoring | ||
|
|
||
| For the Monit, now below programs and file systems are under monitoring: | ||
|
|
||
| ``` | ||
| admin@sonic:~$ sudo monit summary -B | ||
| Monit 5.20.0 uptime: 22h 56m | ||
| Service Name Status Type | ||
| sonic Running System | ||
| rsyslog Running Process | ||
| root-overlay Accessible Filesystem | ||
| var-log Accessible Filesystem | ||
| routeCheck Status ok Program | ||
| diskCheck Status ok Program | ||
| container_checker Status ok Program | ||
| vnetRouteCheck Status ok Program | ||
| container_memory_telemetry Status ok Program | ||
| ``` | ||
|
|
||
| By default, any of the above programs or file systems that is not in good status will be considered a fault condition. | ||
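As an illustration of how the "monit summary -B" output shown above could be turned into a fault list, here is a minimal Python sketch. It assumes the first two lines are the banner and the column header, that the service name and the type are single tokens, and that the only healthy statuses are the three that appear in the sample ("Running", "Accessible", "Status ok"). The names `HEALTHY_STATUSES`, `parse_monit_summary` and `find_monit_faults` are hypothetical:

```python
# Statuses treated as healthy; taken from the sample output only (assumption).
HEALTHY_STATUSES = {"Running", "Accessible", "Status ok"}


def parse_monit_summary(text):
    """Parse "monit summary -B" text into (name, status, type) tuples."""
    entries = []
    # Skip the "Monit x.y.z uptime: ..." banner and the column header.
    for line in text.splitlines()[2:]:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        name, service_type = fields[0], fields[-1]
        # The status may contain a space, e.g. "Status ok".
        status = " ".join(fields[1:-1])
        entries.append((name, status, service_type))
    return entries


def find_monit_faults(text):
    """Return the names of entries whose status is not a healthy one."""
    return [
        name
        for name, status, _ in parse_monit_summary(text)
        if status not in HEALTHY_STATUSES
    ]
```

Splitting on whitespace and taking the last token as the type keeps the parser independent of column widths, which is relevant here because the PR exists precisely because this output format changed.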
|
|
||
|
||
| ### 1.3 Peripheral devices status which could impact the system health status | ||
|
|
||
| - Any fan is missing/broken | ||
| - Fan speed is below minimal range | ||
|
||
|
|
@@ -65,9 +60,9 @@ By default any above services or file systems is not in good status will be cons | |
| - PSU is in bad status | ||
| - ASIC temperature is too hot | ||
|
||
|
|
||
| ### 1.3 Customization of monitored critical services and devices | ||
| ### 1.4 Customization of monitored critical services and devices | ||
|
|
||
| #### 1.3.1 Ignore some of monitored critical services and devices | ||
| #### 1.4.1 Ignore some of monitored critical services and devices | ||
| The list of monitored critical services and devices can be customized by a configuration file, the user can rule out some services or device sensors status from the monitor list. System health monitor will load this configuration file at next run and ignore the services or devices during the routine check. | ||
| ```json | ||
| { | ||
|
|
@@ -91,12 +86,12 @@ The filter string is case sensitive. Currently, it support following filters: | |
| - <psu_name>.temperature: ignore temperature check for a specific PSU | ||
| - <psu_name>.voltage: ignore voltage check for a specific PSU | ||
|
|
||
| The default filter is to filter nothing. Unknown filters will be silently ignored. The "serivces_to_ignore" and "devices_to_ignore" section must be an string array or it will use default filter. | ||
| The default filter is to filter nothing. Unknown filters will be silently ignored. The "services_to_ignore" and "devices_to_ignore" section must be an string array or it will use default filter. | ||
|
|
||
| This configuration file will be platform specific and shall be added to the platform folder(/usr/share/sonic/device/{platform_name}/system_health_monitoring_config.json). | ||
|
|
||
| #### 1.3.2 Extend the monitoring with adding user specific program to Monit | ||
| Monit support to check program(scripts) exit status, if user want to monitor something that beyond critical serives or some special device not included in the above list, they can provide a specific scripts and add it to Monit check list, then the result can also be collected by the system health monitor. It requires 2 steps to add an external checker. | ||
| #### 1.4.2 Extend the monitoring with adding user specific program to Monitor | ||
|
||
| Monit support to check program(scripts) exit status, if user want to monitor something that beyond critical services or some special device not included in the above list, they can provide a specific scripts and add it to Monit check list, then the result can also be collected by the system health monitor. It requires 2 steps to add an external checker. | ||
|
||
|
|
||
| 1. Prepare program whose command line output must qualify: | ||
|
|
||
|
|
@@ -130,9 +125,9 @@ The configuration shall be: | |
| } | ||
| ``` | ||
|
|
||
| ### 1.4 system status LED color definition | ||
| ### 1.5 system status LED color definition | ||
|
|
||
| default system status LED color definition is like | ||
| default system status LED color definition is like | ||
|
|
||
| | Color | Status | Description | | ||
| |:----------------:|:-------------:|:-----------------------:| | ||
|
|
@@ -156,12 +151,14 @@ Considering that different vendors platform may have different LED color capabil | |
|
|
||
| ## 2. System health monitor service business logic | ||
|
|
||
| System health monitor daemon will running on the host, periodically(every 60s) check the "monit summary" command output and PSU, fan, thermal status which stored in the state DB, if anything wrong with the services monitored by monit or peripheral devices, system status LED will be set to fault status. When fault condition relieved, system status will be set to normal status. | ||
| The system health monitor daemon will run on the host and periodically (every 60s) check the critical services/processes status, the "monit summary" command output, and the PSU, fan, and thermal status stored in the state DB. If anything is wrong with them, the system status LED will be set to fault status. When the fault condition is relieved, the system status will be set to normal status. | ||
|
||
|
|
||
| Before the switch boot up finish, the system health monitoring service shall be able to know the switch is in boot up status(see open question 1). | ||
| Before the switch boot up finish, the system health monitoring service shall get the monit service startup delay and make sure monit service run first. | ||
|
||
|
|
||
| If monit service is not avalaible, will consider system in fault condition. | ||
| FAN/PSU/ASIC data not available will also considered as fault conditon. | ||
| Empty FEATURE table will be considered as fault condition. | ||
|
Contributor
How do we define the term
Contributor
Author
|
||
| A service with invalid critical_processes file will be considered as fault condition. | ||
|
||
| If monit service is not available, will consider system in fault condition. | ||
|
||
| FAN/PSU/ASIC data not available will also considered as fault condition. | ||
|
||
| Incomplete data in the DB will also be considered as fault condition, e.g., PSU voltage data is there but threshold data not available. | ||
|
|
||
| Monit, thermalctld and psud will raise syslog messages when a fault condition is encountered, so the system health monitor will only generate general syslog messages in these situations to avoid redundancy. For example, when a fault condition is met, "system health status change to fault" can be printed, and "system health status change to normal" when it has recovered. | ||
|
|
@@ -173,7 +170,7 @@ this service will be started after system boot up(after database.service and upd | |
| System health service will populate system health data to STATE db. A new table "SYSTEM_HEALTH_INFO" will be created to STATE db. | ||
|
|
||
| ; Defines information for a system health | ||
| key = SYSTEM_HEALTH_INFO ; health information for the switch | ||
| key = SYSTEM_HEALTH_INFO ; health information for the switch | ||
| ; field = value | ||
| summary = STRING ; summary status for the switch | ||
| <item_name> = STRING ; an entry for a service or device | ||
|
|
@@ -244,7 +241,7 @@ Add a new "show system-health" command line to the system | |
| system-health Show system health status | ||
| ... | ||
|
|
||
| "show system-health" CLI has three sub command, "summary" and "detail" and "monitor-list". With command "summary" will give brief outpt of system health status while "detail" will be more verbose. | ||
| "show system-health" CLI has three sub command, "summary" and "detail" and "monitor-list". With command "summary" will give brief output of system health status while "detail" will be more verbose. | ||
| "monitor-list" command will list all the services and devices under monitoring. | ||
|
|
||
| admin@sonic# show system-health ? | ||
|
|
@@ -281,7 +278,7 @@ When something is wrong | |
|
|
||
| The "detail" sub command output will give the status of all the services and devices which are under monitoring, and the ignored service/device list will also be displayed. | ||
|
|
||
| "moniter-list" will give a name list of services and devices exclude the ones in the ignore list. | ||
| "monitor-list" will give a name list of services and devices exclude the ones in the ignore list. | ||
|
|
||
| When the CLI is called, it will directly analyze the "monit summary" output and the state DB entries to present a summary of the system health status. The status analysis logic of the CLI shall be aligned/shared with the logic in the system health service. | ||
|
|
||
|
|
@@ -300,20 +297,8 @@ Fault condition and CLI output string table | |
| | FAN data is not available in the DB|FAN data is not available| | ||
| | ASIC data is not available in the DB|ASIC data is not available| | ||
|
|
||
| See open question 2 for adding configuration CLIs. | ||
|
|
||
| ## 6. System health monitor test plan | ||
|
|
||
| 1. If some critical service missed, check the CLI output, the LED color and error shall be as expected. | ||
| 2. Simulate PSU/FAN/ASIC and related sensor failure via mock sysfs and check the CLI output, the LED color and error shall be as expected. | ||
| 3. Change the monitor service/device list then check whether the system health monitor service works as expected; also check whether the result of "show system-health monitor-list" aligned. | ||
|
|
||
| ## 7. Open Questions | ||
|
|
||
| 1. How to determine the SONiC system is in boot up stage? The current design is to compare the system up time with a "boot_timeout" value. The system up time is got from "cat /proc/uptime". The default "boot_timeout" is 300 seconds and can be configured by configuration. System health service will not do any check until SONiC system finish booting. | ||
|
|
||
| ```json | ||
| { | ||
| "boot_timeout": 300 | ||
| } | ||
| ``` | ||
| 3. Change the monitor service/device list then check whether the system health monitor service works as expected; also check whether the result of "show system-health monitor-list" aligned. | ||
is intended to monitor both critical services/processes and peripheral device status--->is designed to monitor critical services, critical processes and peripheral device status?
and leverage system log, system status LED to--->by analyzing system log, system status LED and?
Actually, system health does not analyze "system log, system status LED"; instead, it writes error logs to the system log to let the user know what is happening, and it changes the system status LED color according to the system status.