[ledd] Use select() with timeout for AppDB notifications#16
jleveque merged 1 commit into sonic-net:master from OrdnanceNetworks:fix-ledd-signal-handling
sonic-ledd/scripts/ledd
      if state != swsscommon.Select.OBJECT:
    -     log_warning("sel.select() did not return swsscommon.Select.OBJECT")
    +     if state != swsscommon.Select.TIMEOUT:
    +         log_warning("sel.select() did not return swsscommon.Select.OBJECT")
Can you please add a comment describing why you need the changes? Otherwise it's hard to understand why we need them.
Do you mean the entire change or just this selected part?
If we didn't check for swsscommon.Select.TIMEOUT, the log would be flooded with useless "did not return swsscommon.Select.OBJECT" messages in the timeout case.
Select will time out frequently, since most of the time there is no port event.
The previous behavior was to return from select only when a port event occurred. That blocks signals, preventing ledd from exiting gracefully on SIGTERM.
My main concern here is that it's not clear from the code why you needed to introduce a timeout.
What was the reason for the timeout? I think it's better to have a comment explaining that we introduced the timeout because select was blocking UNIX signals.
Then why doesn't the C++ code that uses a timeout say so in a comment?
Is the commit message description not enough? Anyone working on the code can use the standard approach (e.g. git-log(1), git-blame(1), git-describe(1), etc.) to find the commit and a detailed description of the problem.
If we really want to add a comment, it might be better to put it near the function call, because checking the return value serves a completely different purpose: skipping unwanted log messages.
I think a simple comment above the select() call on line 209, mentioning that we call select() with a timeout value to prevent indefinite blocking and enable graceful shutdown via SIGTERM, should suffice.
@pavel-shirshov: Do you agree?
I checked the sonic-swss repo and found that all select TIMEOUTs there are used to perform some action.
In the code you introduced, it's not clear why we need TIMEOUTs. I would not be surprised if someone removed the TIMEOUT code as redundant.
I'm sorry, I put my comment on the wrong line in the code.
Accepted. Thanks.
    # Do not flood log when select times out
    if state != swsscommon.Select.TIMEOUT:
        log_warning("sel.select() did not return swsscommon.Select.OBJECT")
    continue
Suggest refactoring:
```
if state == swsscommon.Select.TIMEOUT:
    continue
elif state != swsscommon.Select.OBJECT:
    log_warning("sel.select() did not return swsscommon.Select.OBJECT")
    continue
```
Accepted. Thanks.
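To make the effect of the accepted refactoring concrete, here is a small sketch of the same branching outside the loop. `SelectState` is a hypothetical stand-in for the `swsscommon.Select` result codes (the numeric values are arbitrary), and `handle_state` is an illustrative helper, not part of ledd. A list is used in place of `log_warning` so the behavior can be inspected.

```python
from enum import Enum


class SelectState(Enum):
    """Hypothetical stand-in for swsscommon.Select result codes."""
    OBJECT = 0
    ERROR = 1
    TIMEOUT = 2


def handle_state(state, warnings):
    """Return True when the caller should go on to process an event."""
    if state == SelectState.TIMEOUT:
        # Expected and frequent: skip silently so the log is not flooded.
        return False
    elif state != SelectState.OBJECT:
        # Anything else that is not OBJECT is unexpected and worth a warning.
        warnings.append("sel.select() did not return swsscommon.Select.OBJECT")
        return False
    return True
```

Handling the timeout first keeps the common case quiet and leaves the warning for results that are neither a timeout nor an object.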
This reverts commit 3b1f0ef.
Otherwise ledd ignores signals other than SIGKILL, which makes it impossible to
use __del__() destructors in LedClass implementations and delays pmon
docker container shutdown by up to 10s.
Here is output from /var/log/supervisor/supervisord.log after
"systemctl stop pmon":
2018-05-26 10:40:36,323 WARN received SIGTERM indicating exit request
2018-05-26 10:40:36,323 INFO waiting for rsyslogd, ledd to die
2018-05-26 10:40:39,327 INFO waiting for rsyslogd, ledd to die
2018-05-26 10:40:42,330 INFO waiting for rsyslogd, ledd to die
2018-05-26 10:40:45,335 INFO waiting for rsyslogd, ledd to die
Note that according to docker-stop(1), the default time to wait before retrying
with the KILL signal is 10s.
Steps to reproduce:
1) # docker exec -ti pmon bash
2) # kill -TERM $(pgrep ledd)
3) # kill -INT $(pgrep ledd)
4) # kill -0 $(pgrep ledd) && echo 'alive'
   alive
5) # kill -KILL $(pgrep ledd)
The process survives the TERM and INT signals and is killed only in step 5) by KILL.
Other C++ code already uses SELECT_TIMEOUT = 1000 to return control
to the main loop and check the state.
Signed-off-by: Sergey Popovich [email protected]