[services] Restart SwSS service upon unexpected critical process exit #2845
Merged
lguohan merged 11 commits into sonic-net:master from jleveque:restart_swss on May 1, 2019
added 11 commits
April 29, 2019 18:00
…ore than 3 times in 20 minutes
…stalls systemd 232 (>= v230)
…es' file inside container
lguohan
approved these changes
May 1, 2019
prsunny
reviewed
May 1, 2019
vrfmgrd
nbrmgrd
vxlanmgrd
intfsyncd
Contributor
intfsyncd is no longer present.
Contributor
Author
Thanks! I meant to delete that line, but forgot. It won't cause any issues, but I'll open a new PR to remove it soon.
MichelMoriniaux
pushed a commit
to criteo-forks/sonic-buildimage
that referenced
this pull request
May 28, 2019
…sonic-net#2845) * [service] Restart SwSS Docker container if orchagent exits unexpectedly * Configure systemd to stop restarting swss if it attempts to restart more than 3 times in 20 minutes * Move supervisor-proc-exit-listener script * [docker-dhcp-relay] Enhance wait_for_intf.sh.j2 to utilize STATEDB * Ensure dependent services stop/start/restart with SwSS * Change 'StartLimitInterval' to 'StartLimitIntervalSec', as Stretch installs systemd 232 (>= v230) * Also update journald.conf options * Remove 'PartOf' option from unit files * Add '$(SUPERVISOR_PROC_EXIT_LISTENER_SCRIPT)' to new shared docker-orchagent makefile * Make supervisor-proc-exit-listener script read from 'critical_processes' file inside container * Update critical_processes file for swss container
jleveque
added a commit
that referenced
this pull request
Jul 30, 2019
seiferteric
pushed a commit
to project-arlo/sonic-buildimage
that referenced
this pull request
Oct 14, 2019
message from community commit below: [services] Restart SwSS service upon unexpected critical process exit (sonic-net#2845) * [service] Restart SwSS Docker container if orchagent exits unexpectedly * Configure systemd to stop restarting swss if it attempts to restart more than 3 times in 20 minutes * Move supervisor-proc-exit-listener script * [docker-dhcp-relay] Enhance wait_for_intf.sh.j2 to utilize STATEDB * Ensure dependent services stop/start/restart with SwSS * Change 'StartLimitInterval' to 'StartLimitIntervalSec', as Stretch installs systemd 232 (>= v230) * Also update journald.conf options * Remove 'PartOf' option from unit files * Add '$(SUPERVISOR_PROC_EXIT_LISTENER_SCRIPT)' to new shared docker-orchagent makefile * Make supervisor-proc-exit-listener script read from 'critical_processes' file inside container * Update critical_processes file for swss container Change-Id: Ifd2383a4a3f6edfdf4d1ceffbd60e879673d7647
lguohan
pushed a commit
that referenced
this pull request
Nov 9, 2019
…cal process in syncd container exits unexpectedly (#3534) Add the same mechanism I developed for the SwSS service in #2845 to the syncd service. However, in order to cause the SwSS service to also exit and restart in this situation, I developed a docker-wait-any program which the SwSS service uses to wait for either the swss or syncd containers to exit.
zhenggen-xu
pushed a commit
to zhenggen-xu/sonic-buildimage
that referenced
this pull request
Jan 10, 2020
…cal process in syncd container exits unexpectedly (sonic-net#3534) Add the same mechanism I developed for the SwSS service in sonic-net#2845 to the syncd service. However, in order to cause the SwSS service to also exit and restart in this situation, I developed a docker-wait-any program which the SwSS service uses to wait for either the swss or syncd containers to exit.
sonic-otn
pushed a commit
to sonic-otn/sonic-buildimage
that referenced
this pull request
Sep 20, 2023
…lly (sonic-net#15785) #### Why I did it src/sonic-swss ``` * 776af62 - (HEAD -> master, origin/master, origin/HEAD) [CodeQL]: Use dependencies with relevant versions in azp template. (sonic-net#2845) (4 hours ago) [Nazarii Hnydyn] ``` #### How I did it #### How to verify it #### Description for the changelog
mssonicbld
added a commit
that referenced
this pull request
Sep 25, 2023
…lly (#16642) #### Why I did it src/sonic-swss ``` * 0584d35b - (HEAD -> 202305, origin/202305) Revert "Support type7 encoded CAK key for macsec in config_db (#2892)" (3 minutes ago) [stormliang] * 7097cf2b - Revert "[teamd]: Clean teamd process if LAG creation fails (#2888)" (3 days ago) [stormliang] * a0eb0d07 - Support type7 encoded CAK key for macsec in config_db (#2892) (4 days ago) [judyjoseph] * c7e5f10e - [teamd]: Clean teamd process if LAG creation fails (#2888) (4 days ago) [Lawrence Lee] * f30b6107 - [CodeQL]: Use dependencies with relevant versions in azp template. (#2845) (4 days ago) [Nazarii Hnydyn] ``` #### How I did it #### How to verify it #### Description for the changelog
yxieca
pushed a commit
that referenced
this pull request
Oct 5, 2023
…lly (#16532) src/sonic-swss * de7186c6 - (HEAD -> 202205, origin/202205) [202205][CodeQL]: Use dependencies with relevant versions in azp template. (#2905) (13 days ago) [Nazarii Hnydyn] * 106dd9ed - [CodeQL]: Use dependencies with relevant versions in azp template. (#2845) (3 weeks ago) [Nazarii Hnydyn]
lixiaoyuner
pushed a commit
to lixiaoyuner/sonic-buildimage
that referenced
this pull request
Feb 6, 2024
…sonic-buildimage into internal Fix conflict for rsyslog. Skip partial DNS unit test in internal branch after confirmed with Gang. Related work items: sonic-net#113, sonic-net#131, sonic-net#132, sonic-net#134, sonic-net#321, sonic-net#331, sonic-net#381, sonic-net#382, sonic-net#2525, sonic-net#2676, sonic-net#2698, sonic-net#2737, sonic-net#2789, sonic-net#2839, sonic-net#2845, sonic-net#2850, sonic-net#2882, sonic-net#2885, sonic-net#2887, sonic-net#2890, sonic-net#2895, sonic-net#13338, sonic-net#14105, sonic-net#15142, sonic-net#15223, sonic-net#15456, sonic-net#15487, sonic-net#15520, sonic-net#15726, sonic-net#15727, sonic-net#15758, sonic-net#15764, sonic-net#15765, sonic-net#15772, sonic-net#15779, sonic-net#15782, sonic-net#15785, sonic-net#15797, sonic-net#15798, sonic-net#15810, sonic-net#15811, sonic-net#15821
- What I did
Restart the SwSS service (and its dependent services) if any critical process running in the swss container exits abnormally.
- How I did it
* 'supervisor-proc-exit-listener' event listener plugin for Supervisor in the SwSS Docker container, which in turn loads a list of critical processes to monitor for unexpected exits.
* … 'systemctl reset-failed' [we should probably also call this command in 'config load_minigraph' before restarting services]
* … 'systemctl stop swss.service' and 'systemctl restart swss.service'). However, this will not cause them to start with SwSS (when calling 'systemctl start swss.service'). This functionality is enabled with the addition of the "WantedBy=" option.
* … 'ip' commands, now check STATE_DB for interface entries with "state" == "ok"
* 'supervisor-proc-exit-listener' script resides in files/scripts/ so that the same script can be installed in multiple Docker containers. To add this solution to another container, one simply needs to do the following:
  * … '/etc/supervisor/critical_processes' file to the container specifying all critical processes, one per line
- How it Works
* … 'supervisor-proc-exit-listener' will send a SIGTERM signal to Supervisor, causing it to exit as well
- How to verify it
Send a signal to one of the critical processes to cause it to appear to exit abnormally (e.g., 'pkill -11 orchagent'). Ensure the swss, syncd, teamd, snmp, dhcp_relay, radv and telemetry services get restarted per the above details.
NOTE: My updates to systemd dependencies in this PR also fix #2752
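For readers unfamiliar with Supervisor event listeners, the behaviour described under "How it Works" could look roughly like the sketch below. It is not the script from this PR. It assumes Supervisor's documented listener protocol (a READY handshake, a header line of key:value tokens, and a payload of 'len' characters) and assumes the listener acts only on PROCESS_STATE_EXITED events that Supervisor marks as unexpected. The function names are hypothetical; only the '/etc/supervisor/critical_processes' path and the SIGTERM-to-Supervisor step come from the description above.

```python
import os
import signal
import sys

# Path named in the PR description; one critical process name per line.
CRITICAL_PROCESSES_FILE = "/etc/supervisor/critical_processes"


def parse_critical_processes(text):
    """Return the set of process names listed one per line, skipping blanks."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def parse_tokens(line):
    """Parse Supervisor's space-separated 'key:value' tokens into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def is_unexpected_critical_exit(headers, payload, critical):
    """True if the event reports an unexpected exit of a critical process."""
    if headers.get("eventname") != "PROCESS_STATE_EXITED":
        return False
    data = parse_tokens(payload)
    # Assumption: 'expected:0' is how an abnormal exit is recognised.
    return data.get("expected") == "0" and data.get("processname") in critical


def run_listener(stdin, stdout, critical, terminate_supervisor):
    """Serve events until stdin closes, calling terminate_supervisor() on a hit."""
    while True:
        stdout.write("READY\n")  # tell Supervisor we can accept an event
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:  # Supervisor closed the pipe
            return
        headers = parse_tokens(header_line)
        # 'len' is a byte count; this equals the character count for ASCII payloads.
        payload = stdin.read(int(headers["len"]))
        if is_unexpected_critical_exit(headers, payload, critical):
            terminate_supervisor()
        stdout.write("RESULT 2\nOK")  # acknowledge the event
        stdout.flush()


def main():
    """Hypothetical entry point; call this when installed as the listener command."""
    with open(CRITICAL_PROCESSES_FILE) as f:
        critical = parse_critical_processes(f.read())
    # Supervisor is the listener's parent process, so SIGTERM goes to the parent PID.
    run_listener(sys.stdin, sys.stdout, critical,
                 lambda: os.kill(os.getppid(), signal.SIGTERM))
```

Passing the terminate action in as a callback keeps the event-handling logic testable without a running Supervisor. Once Supervisor exits, the container stops, and the systemd dependencies described above take over restarting swss and its dependent services.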