[doc] Monitoring and Auto-mitigating the unhealthy of docker containers in SONiC#564
yozhao101 wants to merge 57 commits into sonic-net:master
the running status of critical process and resource usage. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
feature. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
jleveque left a comment:
One new review comment added and one old comment is still unaddressed.
Resource Usage. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
auto-restart and warm re-boot. Add a paragraph to introduce how can we use Monit to monitor multiple processes with the same command. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
| }, | ||
| "lldp": { | ||
| "auto_restart": "disabled", | ||
| "high_mem_alert": "104857600", |
Should the thresholds be human-readable? Would it be possible to express the threshold as a percentage?
| }, | ||
| "snmp": { | ||
| "auto_restart": "enabled", | ||
| "high_mem_alert": "157286400", |
How do you determine which threshold values to configure? Do you have any recommendations?
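To make the two questions above concrete, here is a minimal sketch of how the byte values in the quoted config could be rendered in human-readable units or as a percentage of total memory. The helper names `human_readable` and `as_percent` and the 4 GiB total are assumptions made for illustration; they are not part of the proposed design.

```python
# Thresholds copied from the config snippets under review (values are bytes).
THRESHOLDS = {"lldp": 104857600, "snmp": 157286400}


def human_readable(num_bytes):
    """Format a byte count using binary (1024-based) units."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit this sketch knows about.
        if value < 1024 or unit == "GiB":
            return f"{value:.0f} {unit}"
        value /= 1024


def as_percent(num_bytes, total_bytes):
    """Express a byte threshold as a percentage of total memory."""
    return 100.0 * num_bytes / total_bytes


# Hypothetical device with 4 GiB of RAM (an assumption, not from the PR).
TOTAL_BYTES = 4 * 1024**3

for name, threshold in THRESHOLDS.items():
    print(name, human_readable(threshold),
          f"{as_percent(threshold, TOTAL_BYTES):.2f}%")
```

With these inputs, 104857600 bytes is exactly 100 MiB and 157286400 bytes is exactly 150 MiB, so the configured values are round numbers once expressed in binary units.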
| admin@sonic:~$ show container feature autorestart | ||
| Container Name Status | ||
| -------------------- -------- | ||
| database disabled |
Will the database container remain consistent with its data after an auto-restart?
| container stops, the systemd service which manages the container will also stop, but it is | ||
| configured to automatically restart the service, thus it will restart the container. | ||
| We also introduced a configuration option which can enable or disable this auto-restart feature |
How does auto-restart work with dynamically loaded Docker containers?
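The enable/disable option mentioned in the quoted text can be pictured as a per-container lookup. The sketch below reuses the `auto_restart` values from the config snippets quoted earlier in this review. The function name `auto_restart_enabled` and the choice to treat unknown containers as disabled are assumptions for illustration, and the sketch does not answer how dynamically loaded containers would be registered.

```python
# Per-container settings, copied from the config snippets quoted in this review.
FEATURES = {
    "lldp": {"auto_restart": "disabled", "high_mem_alert": "104857600"},
    "snmp": {"auto_restart": "enabled", "high_mem_alert": "157286400"},
}


def auto_restart_enabled(features, container):
    """Return True only when the container is explicitly marked 'enabled'.

    Containers missing from the table are treated as disabled; this default
    is an assumption of the sketch, not something stated in the design.
    """
    entry = features.get(container, {})
    return entry.get("auto_restart") == "enabled"


print(auto_restart_enabled(FEATURES, "snmp"))
print(auto_restart_enabled(FEATURES, "lldp"))
```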
| Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage | ||
| such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring | ||
| in the container level. Thus we propose a new design to achieve such monitoring based on Monit. | ||
| Specifically Monit will monitor a script and check its exit status. This script |
Can this script detect a hang, an infinite loop, or a deadlock in the processes or threads inside the container?
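The quoted design has Monit run a script and act on its exit status. A minimal sketch of such a script is shown below, assuming the memory figure comes from `docker stats --no-stream --format "{{.MemUsage}}"`, which prints a string such as `50.3MiB / 7.78GiB`. The function names and the argument order are hypothetical, and the script in the actual PR may work differently. This sketch only compares memory usage against a threshold; it makes no attempt to detect hangs or deadlocks.

```python
import re
import subprocess

# Binary unit suffixes as they appear in `docker stats` memory output.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_mem_usage(text):
    """Convert the 'used' part of a string like '50.3MiB / 7.78GiB' to bytes."""
    used = text.split("/")[0].strip()
    match = re.fullmatch(r"([0-9.]+)\s*(B|KiB|MiB|GiB)", used)
    if match is None:
        raise ValueError(f"unrecognized memory usage: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def exit_status(used_bytes, threshold_bytes):
    """Return 1 when usage is larger than the threshold, 0 otherwise."""
    return 1 if used_bytes > threshold_bytes else 0


def main(argv):
    """Hypothetical entry point: argv = [script, container_name, threshold_bytes]."""
    container, threshold = argv[1], int(argv[2])
    output = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        capture_output=True, text=True, check=True,
    ).stdout
    return exit_status(parse_mem_usage(output), threshold)
```

A wrapper would end with `sys.exit(main(sys.argv))` so that the process exit code carries the result; a non-zero status is what the monitoring side would treat as a threshold violation.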
| 1. Monit must provide the ability to generate an alert when a critical process has not | ||
| been alive for 5 minutes. | ||
| 2. Monit must provide the ability to generate an alert when the resource usage of | ||
| a docker container is larger than the pre-defined threshold. |
Can this be saved in the DB for trend analysis of container resource utilization?
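Requirement 1 in the quoted text (alert when a critical process has not been alive for 5 minutes) amounts to tracking how long each process has been continuously down. The sketch below illustrates that logic only; it is not how Monit implements it, and the class name `DownTimeTracker` and the process name `lldpd` are chosen here for illustration.

```python
ALERT_AFTER_SECONDS = 5 * 60  # the 5-minute window from requirement 1


class DownTimeTracker:
    """Remember when each process was first seen down and flag long outages."""

    def __init__(self):
        self._down_since = {}

    def observe(self, process, is_running, now):
        """Record one poll result; return True when an alert should fire."""
        if is_running:
            # A running process clears any outage that was being timed.
            self._down_since.pop(process, None)
            return False
        # Keep the timestamp of the first poll at which the process was down.
        started = self._down_since.setdefault(process, now)
        return now - started >= ALERT_AFTER_SECONDS


tracker = DownTimeTracker()
for t in (0, 60, 299, 300):
    print(t, tracker.observe("lldpd", False, t))
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the logic easy to exercise without waiting five minutes.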
What's the plan for this feature? Is it proceeding?
@ben-gale: Yes, it is proceeding. Most of the infrastructure is already in place in the master and 201911 branches.
Thanks, Joe. What is the timeline for the code PRs to master?
@yozhao101: Can you please add a comment here with links to all the related PRs in sonic-buildimage and sonic-utilities thus far?
Yes, I will update this with links to the PRs.
This document introduces three features which we plan to deploy in SONiC:
1. We propose to employ Monit to monitor the running status of critical processes in docker containers. The PR for this proposal in the public SONiC repo is as follows: sonic-net/sonic-buildimage#3940
2. We propose to employ the process monitoring/notification framework of supervisord to implement the auto-restart feature of docker containers. The PRs for this proposal in the public SONiC repo are as follows:
[process monitoring/notification framework] https://github.com/Azure/sonic-buildimage/pull/2852/files
[Syncd] https://github.com/Azure/sonic-buildimage/pull/3534/files
[CLI to check the state of autorestart feature of each container]
by Supervisord and high memory restart. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
8498931 to 8837dc2
This document introduces the motivation and design for monitoring and auto-mitigating unhealthy docker containers in SONiC.