[hostcfgd] Configure service auto-restart in hostcfgd. #5744
renukamanavalan merged 23 commits into sonic-net:master from
|
retest mellanox please |
|
retest vsimage please |
e729226 to
e4447ee
Compare
Before this change, a process running inside every SONiC container dealt with the FEATURE table 'auto_restart' field and, depending on its value, decided whether the container had to be killed or not. If killed, the service auto-restart mechanism restarts the container. This change moves the logic from the container to the host daemon, hostcfgd.
* hostcfgd refactoring - move feature handling into another class.
* override systemd service Restart= setting from hostcfgd.
* remove code that deals with FEATURE table from supervisor-proc-exit-listener.
* remove default systemd Restart=always.
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
e4447ee to
6f47365
Compare
|
This pull request introduces 1 alert when merging 6f47365 into 261a81d - view on LGTM.com new alerts:
|
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
| start_cmds.append("sudo systemctl start {}.{}".format(feature_name_suffix, feature_suffixes[-1])) | ||
| for cmd in start_cmds: | ||
| syslog.syslog(syslog.LOG_INFO, "Running cmd: '{}'".format(cmd)) | ||
| try: |
Can we enhance run_cmd() and use it here, so that it returns the error code and also logs the error read from the exception?
Thanks, yes, please check the enhanced run_cmd()
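For readers following the thread, a minimal sketch of what an enhanced run_cmd() along these lines might look like is given here. It is not the code from this PR: the `raise_exception` parameter is an assumption made for illustration, and only the `log_err` flag and the log message format are taken from the diff fragment quoted later in this conversation.

```python
import subprocess
import syslog


def run_cmd(cmd, log_err=True, raise_exception=False):
    """Run a shell command and return its exit code (0 on success).

    Sketch only: raise_exception is a hypothetical parameter, not
    confirmed by the PR discussion.
    """
    try:
        # check_output captures stdout (with stderr merged in), so the
        # output is available on the exception if the command fails.
        subprocess.check_output(cmd, shell=True, text=True,
                                stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as err:
        if log_err:
            syslog.syslog(syslog.LOG_ERR,
                          "{} - failed: return code - {}, output:\n{}"
                          .format(err.cmd, err.returncode, err.output))
        if raise_exception:
            raise
        return err.returncode
    return 0
```

A caller could then write `rc = run_cmd(cmd)` inside the `for cmd in start_cmds:` loop and stop on a non-zero `rc` instead of wrapping every call in its own try/except.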
| stop_cmds.append("sudo systemctl mask {}.{}".format(feature_name_suffix, suffix)) | ||
| for cmd in stop_cmds: | ||
| syslog.syslog(syslog.LOG_INFO, "Running cmd: '{}'".format(cmd)) | ||
| try: |
Same comment as before to use run_cmd()
…it in feature handler code Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
|
@stepanblyschak: This PR appears to change the container behavior upon critical process exit. Currently, |
|
@jleveque |
…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
d918e64 to
79c2866
Compare
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
|
@lguohan: A few Azure Pipelines check builds are stuck in the "Expected — Waiting for status to be reported" state, and I cannot re-trigger these tests. There are a few PRs like this and I've even tried closing/reopening the PRs to no avail. Can you please help here? |
|
/AzurePipleines run |
|
Closing and reopening PR in hopes of getting stuck Azure Pipelines jobs running. |
| syslog.syslog(syslog.LOG_INFO, "Feature '{}' service is '{}'" | ||
| .format(feature_name, invariant_state)) | ||
| entry = self.config_db.get_entry('FEATURE', feature_name) | ||
| entry['state'] = invariant_state |
@stepanblyschak Following my previous comment: if state here is always_disabled and invariant_state is always_enabled, this code will update the feature's state field to invariant_state. However, the code at lines 761 ~ 764 will then disable the feature. So I think line 758 should be entry['state'] = state, right?
Not sure I get this comment. Are you saying there is a bug in the original code?
Could you please point me to a document describing feature state transitions?
@yozhao101 Do you still have this comment?
@yozhao101 could you please check if the last commit addresses your comment?
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
|
/AzurePipelines run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
a6f9c94 to
774781d
Compare
| if cached_feature.state is None: | ||
| enable = feature.state in ("always_enabled", "enabled") | ||
| disable = feature.state in ("always_disabled", "disabled") | ||
| elif cached_feature.state == ("always_enabled", "always_disabled"): |
Use in instead of ==: elif cached_feature.state in ("always_enabled", "always_disabled"):
Thanks for noticing this!
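To make the reason for this fix concrete: comparing a string to a tuple with `==` is always False in Python, so the original `elif` branch could never be taken. The small sketch below shows the difference; the helper names are hypothetical and exist only for this illustration.

```python
# The two "invariant" states named in the reviewed diff.
INVARIANT_STATES = ("always_enabled", "always_disabled")


def is_invariant_buggy(state):
    # A str is never equal to a tuple, so this is False for every state.
    return state == INVARIANT_STATES


def is_invariant(state):
    # Membership test: True when state is one of the tuple's elements.
    return state in INVARIANT_STATES


if __name__ == "__main__":
    for state in ("always_enabled", "always_disabled", "enabled"):
        print(state, is_invariant_buggy(state), is_invariant(state))
```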
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@tahmed-dev can you please provide your approval? |
|
@yozhao101 can you please provide your review/approval ASAP? |
|
@jleveque, can you please provide your review or approval? |
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
jleveque left a comment
Looks good from my perspective. @yozhao101: Can you please review again to make sure all your concerns have been addressed?
Thanks, Joe and Renuka ! I am checking ... |
| except Exception as err: | ||
| if log_err: | ||
| syslog.syslog(syslog.LOG_ERR, "{} - failed: return code - {}, output:\n{}" | ||
| .format(err.cmd, err.returncode, err.output)) |
It looks like we only have output from child process if it was captured by run() or check_output(). Otherwise, None. Please see: https://docs.python.org/3/library/subprocess.html#subprocess.CalledProcessError.output
This issue is relevant for existing code as well - https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-host-services/scripts/hostcfgd#L66. This PR does not intend to fix this issue.
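The point about `CalledProcessError.output` can be checked with a short standalone script. This is a sketch that does not use any hostcfgd code: it runs a child Python process that prints a line and exits with status 2, once through `subprocess.check_call()` (nothing captured) and once through `subprocess.check_output()` (stdout captured). The helper name `failure_details` is hypothetical.

```python
import subprocess
import sys

# A child process that writes to stdout and then fails with exit code 2.
FAILING_CMD = [sys.executable, "-c",
               "import sys; print('boom'); sys.exit(2)"]


def failure_details(runner):
    """Run FAILING_CMD via runner; return (returncode, output) of the error."""
    try:
        runner(FAILING_CMD)
    except subprocess.CalledProcessError as err:
        return err.returncode, err.output
    return 0, None


# check_call() does not capture output, so err.output is None.
not_captured = failure_details(
    lambda cmd: subprocess.check_call(cmd, stdout=subprocess.DEVNULL))

# check_output() captures stdout, so err.output holds what the child printed.
captured = failure_details(
    lambda cmd: subprocess.check_output(cmd, text=True))

if __name__ == "__main__":
    print("check_call:  ", not_captured)
    print("check_output:", captured)
```

So a log line built from `err.output` is only informative when the command was run through a capturing call.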
|
@yozhao101 can you please resolve/provide your reviews and approval ? |
|
Waiting for build to succeed. @stepanblyschak, if you can, please ping me, when build succeeds. |
Same error appears again - https://dev.azure.com/mssonic/build/_build/results?buildId=19697&view=logs&j=88ce9a53-729c-5fa9-7b6e-3d98f2488e3f&t=8d99be27-49d0-54d0-99b1-cfc0d47f0318&l=527 |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Before this change, a process running inside every SONiC container dealt with the FEATURE table 'auto_restart' field and, depending on its value, decided whether the container had to be killed or not. If killed, the service auto-restart mechanism restarts the container. This change moves the logic from the container to the host daemon, hostcfgd. The 'auto_restart' handling is kept in supervisor-proc-exit-listener, but it is no longer required for a container that wants to support the auto-restart feature.
* hostcfgd refactoring - move feature handling into another class.
* override systemd service Restart= setting from hostcfgd.
* remove default systemd Restart=always.
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
- Why I did it
Remove the need to deal with container orchestration logic from the container itself. Leave this logic to the orchestrator - host OS.
- How I did it
hostcfgd configures 'Restart=' value for systemd service.
- How to verify it
root@r-tigon-11:/home/admin# sudo config feature autorestart lldp enabled
root@r-tigon-11:/home/admin# show feature status | grep lldp
lldp enabled enabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Exited (0) 20 seconds ago lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Up 5 seconds lldp
root@r-tigon-11:/home/admin# sudo config feature autorestart lldp disabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Up 35 seconds lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Exited (0) 3 seconds ago lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Exited (0) 39 seconds ago lldp
root@r-tigon-11:/home/admin#
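As an illustration of the approach in the description (hostcfgd configuring the 'Restart=' value for a systemd service), here is a minimal sketch that writes a systemd drop-in file. It is not the code from this PR: the function names, the drop-in file name `auto_restart.conf`, and the mapping of 'enabled' to `Restart=always` and anything else to `Restart=no` are assumptions. The example writes into a temporary directory rather than a real systemd configuration path.

```python
import os
import tempfile


def render_restart_dropin(auto_restart):
    """Return drop-in text for a FEATURE 'auto_restart' value.

    Assumed mapping: 'enabled' -> Restart=always, anything else -> Restart=no.
    """
    restart = "always" if auto_restart == "enabled" else "no"
    return "[Service]\nRestart={}\n".format(restart)


def write_restart_dropin(systemd_dir, feature, auto_restart):
    """Write <systemd_dir>/<feature>.service.d/auto_restart.conf.

    The file name is hypothetical; returns the path that was written.
    """
    dropin_dir = os.path.join(systemd_dir, "{}.service.d".format(feature))
    os.makedirs(dropin_dir, exist_ok=True)
    path = os.path.join(dropin_dir, "auto_restart.conf")
    with open(path, "w") as conf:
        conf.write(render_restart_dropin(auto_restart))
    return path


if __name__ == "__main__":
    # Use a scratch directory so the sketch never touches the host config.
    with tempfile.TemporaryDirectory() as root:
        dropin_path = write_restart_dropin(root, "lldp", "enabled")
        with open(dropin_path) as conf:
            print(conf.read())
```

On a real host, systemd only picks up a new or changed drop-in after `systemctl daemon-reload`, so a daemon using this approach would need to trigger a reload after writing the file.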
- Which release branch to backport (provide reason below if selected)
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)