Skip to content

[config] Stop/restart Monit when stopping/restarting services#1063

Closed
jleveque wants to merge 2 commits intosonic-net:masterfrom
jleveque:restart_monit_reload
Closed

[config] Stop/restart Monit when stopping/restarting services#1063
jleveque wants to merge 2 commits intosonic-net:masterfrom
jleveque:restart_monit_reload

Conversation

@jleveque
Copy link
Contributor

@jleveque jleveque commented Aug 21, 2020

When performing config load, config reload or config load_minigraph, services which are monitored by Monit are stopped, causing the potential for Monit to falsely alert that a critical process is not running. Thus, we need to stop or pause Monit while the services are stopped and ensure Monit begins monitoring once the services are started again.

I also investigated the monit [un]monitor all commands. However, when calling monit monitor all, Monit begins monitoring immediately, yet many services/containers are still starting up. Therefore, there is still the potential for false alarms. Therefore, I settled on restarting the Monit service which causes Monit to respect the configured 5-minute delay before it begins monitoring, and has the added benefit of allowing Monit to pick up the current hostname, in case the hostname changed during the config reload, which ensures Monit has up-to-date information, which allows the command monit status $HOSTNAME to properly show system stats for the device.

Note that we cannot use the existing execute_systemctl() function for Monit, because it checks the service name against the list of SONiC generated services.

@jleveque jleveque changed the title [config] Stop and restart Monit when stopping/restarting services [config] Stop/restart Monit when stopping/restarting services Aug 25, 2020

# First, we stop Monit to prevent any false alarms as other services stop
click.echo("Stopping Monit ...")
clicommon.run_command("systemctl stop monit", display_cmd=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is pause ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please read my findings in the PR dscription.

@lguohan
Copy link
Contributor

lguohan commented Sep 18, 2020

I still have some concerns on this approach. i think it might be better to put monit pause/unpause into each service files prestart or prestop script so that before we stop a job, we pause the monit and after we have started the job then we unpause the monit job.

@jleveque
Copy link
Contributor Author

jleveque commented Sep 18, 2020

@lguohan: One issue with the service file approach is that we monitor individual processes inside the containers. Therefore, the service file would need to pause/resume monitoring for each process. This would not only involve a lot of maintenance, but it might also pose issues if someone wants to swap a container (i.e., FRR BGP for Quagga BGP), as the processes which need to monitor may differ.

However, one potential solution I can think of is to use the "service groups" concept in Monit. We can collect all processes from a container as a group, and name the group after the container, then we can pause/resume monitoring for the entire group with one command in the service file. The idea of using Monit groups was recently suggested by @abdosi here. @yozhao101 is going to research the use of Monit groups.

In the meantime, I believe I should close this PR and open a new PR to reload Monit config at the end of the config (re)load procedure to ensure Monit can pick up a new hostname in the event it was changed during the config (re)load operation. Agreed?

@jleveque
Copy link
Contributor Author

I have opened a new PR to reload the Monit config after any potential hostname changes here: #1132.

I am closing this PR. @yozhao will investigate using the Monit "service groups" concept and we will see if we can stop/start monitoring a service group in the respective service files.

@jleveque jleveque closed this Sep 26, 2020
@jleveque jleveque deleted the restart_monit_reload branch September 26, 2020 00:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants