Teamd :: fix for cleaning up the teamd processes correctly on teamd docker stop #1159
judyjoseph merged 7 commits into sonic-net:master
When the teamd docker receives a stop signal, only the processes started by supervisord get the SIGTERM, so this fix propagates the signal to the teamd processes via the signal handler in the teamsyncd process.
@judyjoseph - this change would address a bunch of issues we've been seeing in teamsyncd during config-reload and docker-restart. Thanks again.
pavel-shirshov left a comment
please check comments
I did testing with config reload and config loadminigraph and saw that things are OK. These commands explicitly do docker stop and docker start in sequence, starting with the swss docker. What is the use case of docker restart? Are you referring to "docker restart teamd"? Is this command supported? Currently I see swss, syncd, and teamd as tightly bound, and I think the system will go to an inconsistent state if we restart teamd alone.
I was referring to "systemctl restart teamd". The teamd docker upgrade (sonic_installer upgrade_docker teamd docker-teamd.gz --cleanup_image -y) issues this command internally.
… and checking for the flag in the main loop. This way the cleanup handler is not run in the signal handler context, and more logs can be added since signal safety is no longer a concern.
Teammgrd tracks the PIDs of the teamd processes and sends the SIGTERM signal when the teamd docker is stopped.
pavel-shirshov left a comment
Can you please answer my questions?
cfgmgr/teammgr.cpp
Outdated
    }
    else
    {
        SWSS_LOG_WARN("Cound not find PID corresponding to LaG %s ", it.c_str());
Can you please make it an error?
Also, it's better to say "Can't send TERM signal to LAG %s. PID wasn't found"
pavel-shirshov left a comment
Looks good to me.
    bool received_sigterm = false;

    void sig_handler(int signo)
check signo == SIGTERM first?
The signal handler is now registered only for SIGTERM, so it will be called only on a SIGTERM signal.
    bool received_sigterm = false;

    void sig_handler(int signo)
the same
The signal handler is now registered only for SIGTERM, so it will be called only on a SIGTERM signal.
    {
        if(received_sigterm)
        {
            sync.cleanTeamSync();
Inconsistent indentation.
…ocker stop (#1159)
* Send explicit signal to the teamd processes when the teamd docker exits. When the teamd docker receives a stop signal, only the processes started by supervisord get the SIGTERM, so this fix propagates the signal to the teamd processes via the signal handler in the teamsyncd process.
* Updates to take care of boundary conditions in the teamsyncd signal handler.
* Better way of signal handling: set a flag in the signal handler and check for the flag in the main loop. This way the cleanup handler is not run in the signal handler context, and more logs can be added since signal safety is no longer a concern.
* Updated the logic so that teammgrd controls the lifecycle of teamd. Teammgrd tracks the PIDs of the teamd processes and sends the SIGTERM signal when the teamd docker is stopped.
* Minor change in the function definition
* Updates based on the comments
* Minor update in teammgr.cpp
* Added the test case for Port Channel cleanup. The test makes sure that the Port-channel kernel devices are cleaned up. This test case tracks the fixes done by PRs sonic-net/sonic-swss#1407 and sonic-net/sonic-swss#1159. Signed-off-by: Abhishek Dosi <[email protected]>
* Address Review Comments. Signed-off-by: Abhishek Dosi <[email protected]>
sonic-net#1159) * Global and Interface commands for IPv6 Link local feature * SONiC CLI per interface configuration command to enable and disable the IPv6 link-local address mode when addresses are not configured manually. Signed-off-by: Akhilesh Samineni <[email protected]>
What I did
The changes implemented in the final fix are:
(1) In teammgrd: While adding LAG, save the mapping of lag <-> teamd PID
(2) In teammgrd: While removing LAG, remove the mapping of lag <-> teamd PID
(3) Introduce SIGTERM handlers in both teammgrd and teamsyncd
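The bookkeeping in steps (1) to (3) could be sketched roughly as below. This is only an illustration under stated assumptions: `LagPidTracker` and its method names are hypothetical and are not the actual teammgrd classes or functions touched by this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical stand-in for the LAG <-> teamd PID bookkeeping; not the real TeamMgr.
class LagPidTracker
{
public:
    // (1) When a LAG is added, remember the PID of the teamd serving it.
    void addLag(const std::string &lag, pid_t pid) { m_lagPids[lag] = pid; }

    // (2) When a LAG is removed, drop the mapping.
    void removeLag(const std::string &lag) { m_lagPids.erase(lag); }

    // (3) After SIGTERM has been flagged, send a signal to every tracked teamd.
    // Returns how many processes were signalled successfully.
    int terminateAll(int signo = SIGTERM)
    {
        int sent = 0;
        for (const auto &entry : m_lagPids)
        {
            // Never call kill() with pid <= 0: that would target process groups.
            if (entry.second > 0 && kill(entry.second, signo) == 0)
            {
                ++sent;
            }
        }
        m_lagPids.clear();
        return sent;
    }

    std::size_t size() const { return m_lagPids.size(); }

private:
    std::map<std::string, pid_t> m_lagPids;
};
```

Passing signal number 0 to `terminateAll()` only checks that each tracked PID exists and sends nothing, which makes it possible to exercise the sketch without terminating any process.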
Why I did it
(i) Teamsyncd segfault
(ii) Teamsyncd getting the NETLINK messages with the older IFINDEXs
(iii) "TeamPortSync: Unable to init team socket" error messages
Root cause
“docker stop” sends two signals:
SIGTERM to the ENTRYPOINT process, which in our case is the supervisord process.
After 10 sec, SIGKILL to all processes to stop the container.
So when docker stop is issued, supervisord sends SIGTERM to all of its processes. teamsyncd and teammgrd get the signal correctly, but I find that none of the teamd processes get it, either because they are daemonized and running in the background OR because teamd is not started directly by supervisord.
The teamd processes are killed later by the SIGKILL sent as a last resort to stop the docker. SIGKILL cannot be handled, hence there is no cleanup and the portchannel interfaces remain in the kernel.
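That SIGKILL cannot be handled is standard POSIX behaviour, and a small check makes it concrete. The sketch below is illustrative only; `on_signal` and `can_install_handler` are hypothetical names and are not part of the teamd or swss code.

```cpp
#include <cassert>
#include <signal.h>

// Empty handler used only to test whether installation is permitted.
extern "C" void on_signal(int /*signo*/) {}

// Returns true if a handler can be installed for the given signal.
bool can_install_handler(int signo)
{
    struct sigaction sa = {};
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    // sigaction() fails with EINVAL for SIGKILL and SIGSTOP.
    return sigaction(signo, &sa, nullptr) == 0;
}
```

A process can therefore run cleanup code on SIGTERM but gets no chance to do so on SIGKILL, which is why the teamd processes have to receive SIGTERM before the grace period expires.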
How I verified it
Tried config reload back to back multiple times and verified that the Portchannel interfaces are cleaned up in the kernel and that the teamd processes are killed correctly.
When the teamd docker is started again, I don't see random NETLINK create/delete messages with the older Portchannel ifindices (which were seen earlier without this fix).
Other possible fixes which I tried out, but decided not to take
The other fix options that were tried are:
(1) Use the fork/exec way to spawn the teamd processes
(a) Saw instabilities, with not all teamd instances coming up when “config reload” is done back to back.
(b) Even with teamd staying in the foreground, those processes were not getting the SIGTERM on “docker stop”, so not daemonizing teamd doesn't serve the purpose.
(2) In the teammgrd signal handler, directly invoke the existing API teammgrd:removeLag(), which
cleans up the teamd processes. This was working fine, but the cleanup was taking a long
time, around 5 sec to clean up all the teamd processes, which is very high
considering it runs in a signal handler.
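For contrast with option (2), the flag-based approach described in the commit messages keeps the handler trivial and moves the slow cleanup into the main loop. The following is a minimal sketch of that pattern, not the PR's actual code: `install_sigterm_handler` and `run_until_sigterm` are hypothetical helpers, and the flag is declared as `volatile sig_atomic_t` here (an assumption made for signal safety), whereas the reviewed diff shows a plain `bool`.

```cpp
#include <cassert>
#include <signal.h>

// Written only by the handler, read only by the main loop.
static volatile sig_atomic_t received_sigterm = 0;

// The handler does nothing except set the flag, which is async-signal-safe.
extern "C" void sig_handler(int /*signo*/)
{
    received_sigterm = 1;
}

// Registers the handler for SIGTERM only; returns true on success.
bool install_sigterm_handler()
{
    struct sigaction sa = {};
    sa.sa_handler = sig_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, nullptr) == 0;
}

// Hypothetical main loop: runs `work` until SIGTERM is flagged, then runs
// `cleanup` once in normal (non-handler) context, where logging is safe.
template <typename Work, typename Cleanup>
int run_until_sigterm(Work work, Cleanup cleanup)
{
    int iterations = 0;
    while (!received_sigterm)
    {
        work();
        ++iterations;
    }
    cleanup();
    return iterations;
}
```

Because the cleanup runs outside the handler, it can take as long as it needs (within the container's stop grace period) and can call functions that are not async-signal-safe.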