[config] Stop/restart Monit when stopping/restarting services#1063
jleveque wants to merge 2 commits into sonic-net:master from jleveque:restart_monit_reload
# First, we stop Monit to prevent any false alarms as other services stop
click.echo("Stopping Monit ...")
clicommon.run_command("systemctl stop monit", display_cmd=False)
Please read my findings in the PR description.
I still have some concerns about this approach. I think it might be better to put Monit pause/unpause into each service file's pre-start or pre-stop script, so that we pause the Monit job before we stop a service and unpause it after the service has started.
@lguohan: One issue with the service file approach is that we monitor individual processes inside the containers. Therefore, the service file would need to pause/resume monitoring for each process. This would not only involve a lot of maintenance, but it might also pose issues if someone wants to swap a container (e.g., FRR BGP for Quagga BGP), as the processes which need to be monitored may differ. However, one potential solution I can think of is to use the "service groups" concept in Monit. We can collect all processes from a container as a group and name the group after the container; then we can pause/resume monitoring for the entire group with one command in the service file. The idea of using Monit groups was recently suggested by @abdosi here. @yozhao101 is going to research the use of Monit groups. In the meantime, I believe I should close this PR and open a new PR to reload the Monit config at the end of the config (re)load procedure, to ensure Monit can pick up a new hostname in the event it was changed during the config (re)load operation. Agreed?
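To make the group idea concrete, here is a minimal sketch of how a service file hook could build the pause/resume command for a whole container. It assumes every process of a container is declared in the Monit config with a `group` named after that container, and it relies on Monit's `-g` option to scope `monitor`/`unmonitor` to a group. The helper name `monit_group_command` is hypothetical and not part of this PR.

```python
def monit_group_command(action, container):
    """Build a Monit CLI call that acts on every service in one group.

    Assumption: the Monit config tags each process of `container`
    with `group <container>`, so the group name equals the container name.
    """
    if action not in ("monitor", "unmonitor"):
        raise ValueError("unsupported Monit action: {}".format(action))
    # `monit -g <group> <action>` applies the action to the named group only.
    return ["monit", "-g", container, action]


# Pause monitoring before stopping the container, resume after starting it.
pause_cmd = monit_group_command("unmonitor", "bgp")
resume_cmd = monit_group_command("monitor", "bgp")
```

The returned argument lists could be passed to `subprocess.check_call` from a pre-stop or post-start hook. Note that resuming this way would still start monitoring immediately, so the start-up race described in the PR description would need separate handling.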
When performing `config load`, `config reload` or `config load_minigraph`, services which are monitored by Monit are stopped, so Monit may falsely alert that a critical process is not running. Thus, we need to stop or pause Monit while the services are stopped and ensure Monit resumes monitoring once the services are started again.

I also investigated the `monit [un]monitor all` commands. However, when calling `monit monitor all`, Monit begins monitoring immediately, while many services/containers are still starting up, so there is still the potential for false alarms. Therefore, I settled on restarting the Monit service, which causes Monit to respect the configured 5-minute delay before it begins monitoring. Restarting also allows Monit to pick up the current hostname in case it changed during the config reload, so the command `monit status $HOSTNAME` properly shows system stats for the device.

Note that we cannot use the existing `execute_systemctl()` function for Monit, because it checks the service name against the list of SONiC generated services.
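The ordering described above can be sketched as follows. This is an illustration under stated assumptions rather than the PR's actual code: `reload_with_monit_paused`, its `services` argument and the injectable `run` callable are hypothetical names, and the real change calls `clicommon.run_command` from inside the existing config commands.

```python
import subprocess


def run_command(command):
    """Run a command given as an argument list, raising on failure."""
    subprocess.check_call(command)


def reload_with_monit_paused(services, run=run_command):
    """Stop Monit, bounce the given services, then restart Monit.

    `services` is a hypothetical list of systemd unit names in start
    order; it stands in for the real SONiC service list.
    """
    # Stop Monit first so it does not alert while services are down.
    run(["systemctl", "stop", "monit"])

    # Stop services in reverse order, then start them in order.
    for service in reversed(services):
        run(["systemctl", "stop", service])
    for service in services:
        run(["systemctl", "start", service])

    # Restart Monit (instead of `monit monitor all`) so that it honours
    # its configured start delay and picks up the current hostname.
    run(["systemctl", "restart", "monit"])
```

Passing `run` as a parameter keeps the sequence testable without systemd: a caller can substitute a function that records the commands instead of executing them.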