-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[doc] Monitoring and Auto-mitigating the unhealthy of docker containers in SONiC #564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 23 commits
15c53ce
689c5a7
2b31fef
6a2c01a
ac56da8
e294a9c
9546882
8f157ec
752dad0
38d6cab
df37188
6d04987
9724d9e
fe17999
5d3bdfa
c948aa2
c5c0191
4023874
7a84612
9941852
58c1f79
da03448
9884fc2
e0f0d96
1dc3a96
0124b94
0774344
a852c35
a5d094b
93826e4
0e84f87
965fc61
5c69e6e
a040c34
a28459a
8b270be
7710e1b
6d73a9d
d4c4fd4
f36f5ef
2edc3d4
d041d78
f94d019
02bd31c
fa20bea
e4d9a8d
7c917c0
3fad48f
eb30432
a84bfdf
8a908c2
7056a9f
9b30502
dc80bcb
7ed89b7
702e4d8
91c5d9b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,276 @@ | ||
| # Monitoring and Auto-Mitigating the Unhealthy of Docker Containers in SONiC | ||
|
|
||
| # High Level Design Document | ||
| #### Rev 0.1 | ||
|
|
||
| # Table of Contents | ||
| * [List of Tables](#list-of-tables) | ||
| * [Revision](#revision) | ||
| * [About this Manual](#about-this-manual) | ||
| * [Scope](#scope) | ||
| * [Defintions/Abbreviation](#definitionsabbreviation) | ||
| * [1 Feature Overview](#1-feature-overview) | ||
| - [1.1 Requirements](#11-requirements) | ||
| - [1.1.1 Functional Requirements](#111-functional-requirements) | ||
| - [1.1.2 Configuration and Management Requirements](#112-configuration-and-management-requirements) | ||
| - [1.1.3 Scalability Requirements](#113-scalability-requirements) | ||
| - [1.2 Design](#12-design) | ||
| - [1.2.1 Basic Approach](#121-basic-approach) | ||
| * [2 Functionality](#2-functionality) | ||
| - [2.1 Target Deployment Use Cases](#21-target-deployment-use-cases) | ||
| - [2.2 Functional Description](#22-functional-description) | ||
| - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) | ||
| - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage) | ||
| - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container) | ||
| - [2.2.3.1 CLI (and usage example)](#2231-cli-and-usage-example) | ||
| - [2.2.3.1.1 Show the Status of Auto-restart](#22311-show-the-status-of-auto-restart) | ||
| - [2.2.3.1.2 Configure the Status of Auto-restart](#22312-configure-the-status-of-auto-restart) | ||
| - [2.2.3.1.3 CONTAINER_FEATURE Table](#22313-container_feature-table) | ||
|
|
||
| # List of Tables | ||
| * [Table 1: Abbreviations](#definitionsabbreviation) | ||
|
|
||
| # Revision | ||
| | Rev | Date | Author | Change Description | | ||
| |:---:|:----------:|:----------------------:|---------------------------| | ||
| | 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | Initial version | | ||
|
|
||
| # About this Manual | ||
| This document presents the design and implementation of feature to monitor and auto-mitigate | ||
| the unhealthy of docker containers in SONiC. | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| # Scope | ||
| This document describes the high level design of feature to monitor and auto-mitigate | ||
| the unhealthy of docker containers. | ||
|
|
||
| # Definitions/Abbreviation | ||
| | Abbreviation | Description | | ||
| |--------------|------------------------------| | ||
| | Config DB | SONiC Configuration Database | | ||
| | CLI | Command Line Interface | | ||
|
|
||
| # 1 Feature Overview | ||
| SONiC is a collection of various switch applications which are held in docker containers | ||
| such as BGP container and SNMP container. Each application usually includes several processes which are | ||
| working together to provide and receive the services from other modules. As such, the healthy of | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| critical processes in each docker container are the key not only for the docker | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| container working correctly but also for the intended functionalities of entire SONiC switch. | ||
| On the other hand, profiling the resource usages and performance of each docker | ||
| container are also important for us to understand whether this container is in healthy state | ||
| or not and furtherly to provide us with deep insight about networking traffic. | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| The main purpose of this feature includes two parts: the first part is to monitor the | ||
| running status of each process and critical resource usage such as CPU, memory and disk | ||
| of each docker container. | ||
| The second part is docker containers can be automatically shut down and | ||
| restarted if one of critical processes running in the container exits unexpectedly. Restarting | ||
| the entire container ensures that configuration is reloaded and all processes in the container | ||
| get restarted, thus increasing the likelihood of entering a healthy state. | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| We implemented this feature by employing the existing Monit and supervisord system tools. | ||
| 1. We used Monit system tool to detect whether a process is running or not and whether | ||
| the resource usage of a docker container is beyond the pre-defined threshold. | ||
| 2. We leveraged the mechanism of event listener in supervisord to auto-restart a docker container | ||
| if one of its critical processes exited unexpectedly. | ||
| 3. We also added a knob to make this auto-restart feature dynamically configurable. | ||
| Specifically users can run CLI to configure this feature residing in Config_DB as | ||
| enabled/disabled status. | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| ## 1.1 Requirements | ||
|
|
||
| ### 1.1.1 Functional Requirements | ||
| 1. The Monit must provide the ability to generate an alert when a critical process is not | ||
| running. | ||
| 2. The Monit must provide the ability to generate an alert when the resource usage of | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| a docker contaier is larger than the pre-defined threshold. | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| 3. The event listener in supervisord must receive the signal when a critical process in | ||
| a docker container crashed or exited unexpectedly and then restart this docker | ||
| container. | ||
| 4. CONFIG_DB can be configured to enable/disable this auto-restart feature for each docker | ||
| container.. | ||
| 5. Users can access this auto-restart information via the CLI utility | ||
| 1. Users can see current auto-restart status for docker containers. | ||
| 2. Users can configure auto-restart status for a specific docker container. | ||
|
|
||
| ### 1.1.2 Configuration and Management Requirements | ||
| Configuration of the auto-restart feature can be done via: | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| 1. init_cfg.json | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| 2. CLI | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| ### 1.1.3 Scalability Requirements | ||
| `Place holder` | ||
|
|
||
| ## 1.2 Design | ||
|
|
||
| ### 1.2.1 Basic Approach | ||
| Monitoring the running status of critical processes and resource usage of docker containers | ||
| are heavily depended on the Monit system tool. Since Monit already provided the mechanism | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| to check whether a process is running or not, it will be straightforward to integrate this into monitoring | ||
| the critical processes in SONiC. However, Monit only gives the method to monitor the resource | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| usage per process level not container level. As such, monitoring the resource usage of a docker | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| container will be an interesting and challenging problem. In our design, we adopted the way | ||
| that Monit will check the returned value of a script which reads the resource usage of docker | ||
| container, compares it with pre-defined threshold and then exited. | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| We employed the mechanism of event listener in supervisord to achieve auto-restarting of docker | ||
| container. Currently supervisord will monitor the running status of each process in SONiC | ||
| docker containers. If one critical process exited unexpectedly, supervisord will catch such signal | ||
| and send it to event listener. Then event listener will kill the process supervisord and | ||
| the entire docker container will be shut down and restarted. | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| # 2 Functionality | ||
| ## 2.1 Target Deployment Use Cases | ||
| This feature is used to perform the following functions: | ||
| 1. Monit will write an alert message into syslog if one if critical process exited unexpectedly. | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| 2. Monit will write an alert message into syslog if the usage of memory is larger than the | ||
| pre-defined threshold for a docker container. | ||
| 3. A docker container will auto-restart if one of its critical processes crashed or exited | ||
| unexpectedly. | ||
|
|
||
| ## 2.2 Functional Description | ||
|
|
||
|
|
||
| ### 2.2.1 Monitoring Critical Processes | ||
| Monit has implemented the mechanism to monitor whether a process is running or not. In detail, | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| Monit will periodically read the target processes from configuration file and tries to match | ||
| those process with the processes tree in Linux kernel. | ||
|
|
||
| Below is an example of Monit configuration file to monitor the critical processes in lldp | ||
| container. | ||
|
|
||
| */etc/monit/conf.d/monit_lldp* | ||
| ```bash | ||
| ############################################################################### | ||
| # Monit configuration file for lldp container | ||
| # Process list: | ||
| # lldpd | ||
| # lldp_syncd | ||
| # lldpmgrd | ||
| ############################################################################### | ||
| check process lldp_monitor matching "lldpd: " | ||
| if does not exit for 5 times within 5 cycles then alert | ||
| check process lldp_syncd matching "python2 -m lldp_syncd" | ||
| if does not exit for 5 times within 5 cycles then alert | ||
| check process lldpmgrd matching "python /usr/bin/lldpmgrd" | ||
| if does not exit for 5 times within 5 cycles then alert | ||
| ``` | ||
|
|
||
| ### 2.2.2 Monitoring Critical Resource Usage | ||
| Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage | ||
| such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring | ||
| in the container level. Thus we propose a new design to achieve such monitoring based on Monit. | ||
| Specifically Monit will monitor a script and check its exit status. This script | ||
|
||
| will correspondingly read the resource usage of docker containers, compare it with | ||
| pre-defined threshold and then return a value. The value 0 signified that | ||
| the resource usage is less than threshold and non-zero means Monit will send an alert since | ||
| current usage is larger than threshold. | ||
|
|
||
| Below is an example of Monit configuration file for lldp container to pass the pre-defined | ||
| threshold (bytes) to the script and check the exiting value. | ||
|
|
||
| ```bash | ||
| check program container_memory_lldp with path "/usr/bin/memory_checker lldp 104857600" | ||
| if status != 0 then alert | ||
| ``` | ||
|
|
||
| ### 2.2.3 Auto-restart Docker Container | ||
| The design principle behind this auto-restart feature is docker containers can be automatically shut down and | ||
| restarted if one of critical processes running in the container exits unexpectedly. Restarting | ||
| the entire container ensures that configuration is reloaded and all processes in the container | ||
| get restarted, thus increasing the likelihood of entering a healthy state. | ||
|
|
||
| Currently SONiC used superviord system tool to manage the processes in each | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| docker container. Actually auto-restarting docker container is based on the process | ||
| monitoring/notification framework. Specifically | ||
| if the state of process changes for example from running to exited, | ||
| an event notification `PROCESS_STATE_STOPPED` will be emitted by supervisord. | ||
| This event will be received by event listener. If the exited process is critical | ||
| one, then the event listener will terminate supervisord and the container will be shut down | ||
| and restarted. | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| We also introduced a knob which can enable or disable this auto-restart feature | ||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| dynamically according to the requirement of users. In detail, we created a table | ||
| named `CONTAINER_FEATURE` in Config_DB and this table includes the status of | ||
| auto-restart feature for each docker container. Users can easily use CLI to | ||
| check and configure the corresponding docker container status. | ||
|
|
||
|
|
||
| #### 2.2.3.1 CLI (and usage example) | ||
| The CLI tool will provide the following functionality: | ||
| 1. Show current status of auto-restart feature for docker containers. | ||
| 2. Configure the status of a specific docker container. | ||
|
|
||
| ##### 2.2.3.1.1 Show the Status of Auto-restart | ||
| ``` | ||
| admin@sonic:~$ show container feature autorestart | ||
| Container Name Status | ||
| -------------------- -------- | ||
| database disabled | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can database container is consistent with data after auto restart ? |
||
| lldp disabled | ||
| radv disabled | ||
| pmon disabled | ||
| sflow enabled | ||
| snmp enabled | ||
| telemetry enabled | ||
| bgp disabled | ||
| dhcp_relay disabled | ||
| rest-api enabled | ||
| teamd disabled | ||
| syncd enabled | ||
| swss disabled | ||
| ``` | ||
|
|
||
| ##### 2.2.3.1.2 Configure the Status of Auto-restart | ||
| ``` | ||
| admin@sonic:~$ sudo config container feature autorestart database enabled | ||
| ``` | ||
|
|
||
|
|
||
| ##### 2.2.3.1.3 CONTAINER_FEATURE Table | ||
| Example: | ||
| ``` | ||
| { | ||
| "CONTAINER_FEATURE": { | ||
| "database": { | ||
| "auto_restart": "enabled", | ||
jleveque marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| }, | ||
| "lldp": { | ||
| "auto_restart": "disabled", | ||
| }, | ||
| "radv": { | ||
| "auto_restart": "disabled", | ||
| }, | ||
| "pmon": { | ||
| "auto_restart": "disabled", | ||
| }, | ||
| "sflow": { | ||
| "auto_restart": "enabled", | ||
| }, | ||
| "snmp": { | ||
| "auto_restart": "enabled", | ||
| }, | ||
| "telemetry": { | ||
| "auto_restart": "enabled", | ||
| }, | ||
| "bgp": { | ||
| "auto_restart": "disabled", | ||
| }, | ||
| "dhcp_relay": { | ||
| "auto_restart": "disabled", | ||
| }, | ||
| "rest-api": { | ||
| "auto_restart": "enabled", | ||
| }, | ||
| "teamd": { | ||
| "auto_restart": "disabled", | ||
| }, | ||
| "syncd": { | ||
| "auto_restart": "enabled", | ||
| }, | ||
| "swss": { | ||
| "auto_restart": "disabled", | ||
| }, | ||
|
|
||
jleveque marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| } | ||
| } | ||
| ``` | ||
Uh oh!
There was an error while loading. Please reload this page.