Skip to content

[ledd] Use select() with timeout for AppDB notifications#16

Merged
jleveque merged 1 commit intosonic-net:masterfrom
OrdnanceNetworks:fix-ledd-signal-handling
May 31, 2018
Merged

[ledd] Use select() with timeout for AppDB notifications#16
jleveque merged 1 commit intosonic-net:masterfrom
OrdnanceNetworks:fix-ledd-signal-handling

Conversation

@serhepopovych
Copy link
Contributor

Otherwise ledd ignores signals other than SIGKILL making impossible to
use __del__() destructors in LedClass implementations and delays pmon
docker container shutdown up to 10s.

Here is output from /var/log/supervisor/supervisord.log after
"systemctl stop pmon":

2018-05-26 10:40:36,323 WARN received SIGTERM indicating exit request
2018-05-26 10:40:36,323 INFO waiting for rsyslogd, ledd to die
2018-05-26 10:40:39,327 INFO waiting for rsyslogd, ledd to die
2018-05-26 10:40:42,330 INFO waiting for rsyslogd, ledd to die
2018-05-26 10:40:45,335 INFO waiting for rsyslogd, ledd to die

Note that according to docker-stop(1) default time to wait before retry
with KILL signal is 10s.

Steps to reproduce:

# docker exec -ti pmon bash

# kill -TERM $(pgrep ledd)
# kill -INT $(pgrep ledd)

# kill -0 $(pgrep ledd) && echo 'alive'
alive

# kill -KILL $(pgrep ledd)

Process survives TERM and INT signals, and killed only in 5) by KILL.

Other C++ code already uses SELECT_TIMEOUT = 1000 to return control
into main loop and checks for state.

Signed-off-by: Sergey Popovich [email protected]

if state != swsscommon.Select.OBJECT:
log_warning("sel.select() did not return swsscommon.Select.OBJECT")
if state != swsscommon.Select.TIMEOUT:
log_warning("sel.select() did not return swsscommon.Select.OBJECT")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a comment describing why you need the changes? Otherwise it's hard to understand why do we need them.

Copy link
Contributor Author

@serhepopovych serhepopovych May 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean entire change or just this selected part?

If we didn't check for swsscommon.Select.TIMEOUT log will be flooded with useless messages "did not return swsscommon.Select.OBJECT" for timeout case.

Select will timeout frequently as there will be no event for port.

Previous behavior was to return from select only when port event occurs. That blocks signals preventing ledd from graceful exit from SIGTERM.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern here is that it's not clear from the code why you needed to introduce timeout here?
What was the reason to have timeout. I think it's better to have a comment explaining that we introduced the timeout because of the select blocking UNIX signals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why C++ code that uses timeout does not tell that in comment?

Is it not enough commit message description? Anyone working on code can use standard approach (e.g. git-log(1), git-blame(1), git-describe(1), etc) to find commit and detailed description of the problem.

If we really want to add comment it might be better to do near function call, because checking return serves completely different purpose: skip unwanted logging messages.

Copy link
Contributor

@jleveque jleveque May 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a simple comment above the select() call on line 209 mentioning that we call select() with a timeout value to to prevent indefinite blocking an enable graceful shutdown via SIGTERM should suffice.

@pavel-shirshov: Do you agree?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked sonic-swss repo and I found that all select TIMEOUTs there are used to do some actions.
In the code you introduces it's not clear why do we need TIMEOUTs. I would not by surprised if someone would remove the TIMEOUT code as redundant.
I'm sorry I put my comment on the wrong line in the code.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jleveque yes, I agree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepted. Thanks.

# Do not flood log when select times out
if state != swsscommon.Select.TIMEOUT:
log_warning("sel.select() did not return swsscommon.Select.OBJECT")
continue
Copy link
Contributor

@qiluo-msft qiluo-msft May 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest refactoring:

if state == swsscommon.Select.TIMEOUT:
    continue
elif state != swsscommon.Select.OBJECT:
    log_warning("sel.select() did not return swsscommon.Select.OBJECT")
    continue
``` #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepted. Thanks.

Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As comment.

Otherwise ledd ignores signals other than SIGKILL making impossible to
use __del__() destructors in LedClass implementations and delays pmon
docker container shutdown up to 10s.

Here is output from /var/log/supervisor/supervisord.log after
"systemctl stop pmon":

  2018-05-26 10:40:36,323 WARN received SIGTERM indicating exit request
  2018-05-26 10:40:36,323 INFO waiting for rsyslogd, ledd to die
  2018-05-26 10:40:39,327 INFO waiting for rsyslogd, ledd to die
  2018-05-26 10:40:42,330 INFO waiting for rsyslogd, ledd to die
  2018-05-26 10:40:45,335 INFO waiting for rsyslogd, ledd to die

Note that according to docker-stop(1) default time to wait before retry
with KILL signal is 10s.

Steps to reproduce:

  # docker exec -ti pmon bash

  # kill -TERM $(pgrep ledd)
  # kill -INT $(pgrep ledd)

  # kill -0 $(pgrep ledd) && echo 'alive'
  alive

  # kill -KILL $(pgrep ledd)

Process survives TERM and INT signals, and killed only by KILL.

Other C++ code already uses SELECT_TIMEOUT = 1000 to return control
into main loop and checks for state.

Signed-off-by: Sergey Popovich <[email protected]>
@jleveque jleveque merged commit ce83d58 into sonic-net:master May 31, 2018
vdahiya12 pushed a commit to vdahiya12/sonic-platform-daemons that referenced this pull request Apr 4, 2022
vvolam pushed a commit to vvolam/sonic-platform-daemons that referenced this pull request Jun 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants