[hostcfgd] Configure service auto-restart in hostcfgd. by stepanblyschak · Pull Request #5744 · sonic-net/sonic-buildimage

stepanblyschak · 2020-10-29T14:12:07Z

Before this change, a process running inside every SONiC container read the FEATURE table 'auto_restart' field and, depending on its value, decided whether the container had to be killed.
If the container was killed, the service auto-restart mechanism restarted it.
This change moves that logic from the container to the host daemon, hostcfgd.
The 'auto_restart' handling is kept in supervisor-proc-exit-listener, but it is no longer required for a container that wants to support the auto-restart feature.

hostcfgd refactoring - move feature handling into another class.
override the systemd service Restart= setting from hostcfgd.
remove the default systemd Restart=always.

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

- Why I did it

Remove the need to deal with container orchestration logic from the container itself. Leave this logic to the orchestrator - host OS.

- How I did it

hostcfgd configures 'Restart=' value for systemd service.
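
As a rough illustration of this approach, the sketch below maps a FEATURE table 'auto_restart' value to a systemd Restart= setting and writes it into a drop-in file for the feature's service unit. The drop-in file name `auto_restart.conf`, the helper names, and the `unit_dir` parameter are assumptions made for the example and may not match what this PR implements; a real daemon would also need to run `systemctl daemon-reload` for the override to take effect.

```python
import os
import tempfile


def render_restart_dropin(auto_restart):
    """Return drop-in contents for a FEATURE 'auto_restart' value."""
    # Anything other than "enabled" is treated as "do not restart".
    restart = "always" if auto_restart == "enabled" else "no"
    return "[Service]\nRestart={}\n".format(restart)


def write_restart_dropin(unit_dir, feature_name, auto_restart):
    """Write <unit_dir>/<feature>.service.d/auto_restart.conf (hypothetical name)."""
    dropin_dir = os.path.join(unit_dir, "{}.service.d".format(feature_name))
    os.makedirs(dropin_dir, exist_ok=True)
    path = os.path.join(dropin_dir, "auto_restart.conf")
    with open(path, "w") as dropin:
        dropin.write(render_restart_dropin(auto_restart))
    return path


if __name__ == "__main__":
    # Use a temporary directory instead of /etc/systemd/system for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        dropin_path = write_restart_dropin(tmp, "lldp", "enabled")
        with open(dropin_path) as dropin:
            print(dropin.read())
```

Keeping the rendering separate from the file write makes the mapping easy to unit-test without touching the real systemd configuration directory.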

- How to verify it

root@r-tigon-11:/home/admin# sudo config feature autorestart lldp enabled
root@r-tigon-11:/home/admin# show feature status | grep lldp
lldp            enabled   enabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 20 seconds ago                       lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Up 5 seconds                            lldp
root@r-tigon-11:/home/admin# sudo config feature autorestart lldp disabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Up 35 seconds                           lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 3 seconds ago                       lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c        docker-lldp:latest                   "/usr/bin/docker-lld…"   2 days ago          Exited (0) 39 seconds ago                       lldp
root@r-tigon-11:/home/admin#

- Which release branch to backport (provide reason below if selected)

201811
201911
202006
202012

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

stepanblyschak · 2020-10-30T00:14:21Z

retest mellanox please

stepanblyschak · 2020-10-30T00:14:27Z

retest vsimage please

files/image_config/hostcfgd/hostcfgd

Before this change, a process running inside every SONiC container dealt with the FEATURE table 'auto_restart' field and, depending on the value, decided whether a container has to be killed or not. If killed, the service auto-restart mechanism restarts the container. This change moves the logic from the container to the host daemon, hostcfgd.
* hostcfgd refactoring - move feature handling into another class.
* override systemd service Restart= setting from hostcfgd.
* remove code that deals with FEATURE table from supervisor-proc-exit-listener.
* remove default systemd Restart=always.
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>


lgtm-com · 2020-11-16T21:48:57Z

This pull request introduces 1 alert when merging 6f47365 into 261a81d - view on LGTM.com

new alerts:

1 for Unused import

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

rajendra-dendukuri · 2021-01-15T16:08:57Z

src/sonic-host-services/scripts/hostcfgd

+                start_cmds.append("sudo systemctl start {}.{}".format(feature_name_suffix, feature_suffixes[-1]))
+                for cmd in start_cmds:
+                    syslog.syslog(syslog.LOG_INFO, "Running cmd: '{}'".format(cmd))
+                    try:


Can we enhance run_cmd() and use it here, so that it returns the error code as well as logs the error read from the exception?

Thanks, yes, please check the enhanced run_cmd()
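
For readers following the thread, a minimal sketch of what such a helper could look like is given below. The `log_err` and `raise_exception` parameters are assumptions suggested by the discussion, not the exact signature merged in this PR.

```python
import subprocess
import syslog


def run_cmd(cmd, log_err=True, raise_exception=False):
    """Run a shell command and return its exit code (0 on success)."""
    syslog.syslog(syslog.LOG_INFO, "Running cmd: '{}'".format(cmd))
    try:
        subprocess.check_call(cmd, shell=True)
        return 0
    except subprocess.CalledProcessError as err:
        if log_err:
            syslog.syslog(syslog.LOG_ERR,
                          "'{}' failed: return code {}".format(err.cmd, err.returncode))
        if raise_exception:
            # Let the caller decide how to react to the failure.
            raise
        return err.returncode
```

With a helper like this, the feature handler can loop over its start or stop commands and call `run_cmd(cmd, raise_exception=True)` instead of repeating the try/except and logging at every call site.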

rajendra-dendukuri · 2021-01-15T16:09:18Z

src/sonic-host-services/scripts/hostcfgd

+                    stop_cmds.append("sudo systemctl mask {}.{}".format(feature_name_suffix, suffix))
+                for cmd in stop_cmds:
+                    syslog.syslog(syslog.LOG_INFO, "Running cmd: '{}'".format(cmd))
+                    try:


Same comment as before to use run_cmd()

…it in feature handler code Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

stepanblyschak · 2021-01-22T16:26:59Z

@lguohan @jleveque

jleveque · 2021-01-22T19:45:15Z

@stepanblyschak: This PR appears to change the container behavior upon critical process exit. Currently, supervisor-proc-exit-listener will only stop the container if auto-restart is enabled. With this PR, it appears the container will always stop, but it will only be restarted if auto-restart is enabled. Can you explain why you believe this behavior is preferred?

stepanblyschak · 2021-01-25T12:16:16Z

@jleveque
This has been discussed before over email and with the App.Ext sub-group community.
Considering the possibility of running 3rd party dockerized applications on SONiC, the auto-restart functionality is highly beneficial. However, the current approach limits this feature to SONiC containers whose developers explicitly coded support for it. Thus, the choice was to make auto-restart available for free to any container that is managed by systemd.
On the other hand, there is a change in container lifetime when a critical process dies. Consider an arbitrary container, over which the user or container orchestrator has no control, whose entrypoint process exits unexpectedly. Usually in microservices there is only one process running in a container, and if this process dies, so does the container. Given that, I do not see a benefit in keeping the container running while one or more critical daemons have exited, other than for debugging, and for that a simple tweak in the critical_processes file is enough to keep the container running.
If you still have a requirement to control this behaviour from CONFIG DB, I suggest that we have a CONTAINER_FEATURE table to control it. We can declare that 3rd party dockers have to explicitly add support for that functionality, but not for auto-restart.

…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

src/sonic-host-services/scripts/hostcfgd

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

jleveque · 2021-02-17T00:16:17Z

@lguohan: A few Azure Pipelines check builds are stuck in the "Expected — Waiting for status to be reported" state, and I cannot re-trigger these tests. There are a few PRs like this and I've even tried closing/reopening the PRs to no avail. Can you please help here?

liat-grozovik · 2021-02-25T10:55:41Z

/AzurePipleines run

liat-grozovik · 2021-03-12T07:37:07Z

@lguohan tests have been stuck for two weeks. Can you reset them? The AzurePipelines run command is not working.
@jleveque any further comments, or can this be approved?

jleveque · 2021-03-12T18:19:23Z

Closing and reopening PR in hopes of getting stuck Azure Pipelines jobs running.

yozhao101 · 2021-05-05T01:41:11Z

src/sonic-host-services/scripts/hostcfgd

-                syslog.syslog(syslog.LOG_INFO, "Feature '{}' service is '{}'"
-                              .format(feature_name, invariant_state))
-                entry = self.config_db.get_entry('FEATURE', feature_name)
-                entry['state'] = invariant_state


@stepanblyschak Following my previous comment: if the state here is always_disabled and invariant_state is always_enabled, this code will update the feature's state field to invariant_state. However, the code at lines 761-764 will disable the feature. So I think the code at line 758 should be entry['state'] = state, right?

Not sure I get this comment. Are you saying there is a bug in the original code?
Could you please point me to a document describing feature state transitions?

@yozhao101 Does this comment still apply?

@yozhao101 could you please check if the last commit addresses your comment?

…restart_cfg

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

renukamanavalan · 2021-05-10T16:12:49Z

/AzurePipelines run

azure-pipelines · 2021-05-10T16:13:09Z

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

tahmed-dev · 2021-05-21T17:32:52Z

src/sonic-host-services/scripts/hostcfgd

+        if cached_feature.state is None:
+            enable = feature.state in ("always_enabled", "enabled")
+            disable = feature.state in ("always_disabled", "disabled")
+        elif cached_feature.state == ("always_enabled", "always_disabled"):


Use in instead of ==: elif cached_feature.state in ("always_enabled", "always_disabled"):

Thanks for noticing this!
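
To make the difference concrete, here is a small standalone example; the helper name `is_invariant_state` is hypothetical and used only for illustration. Comparing a string to a tuple with == is always False, so the branch in the diff above could never be taken.

```python
INVARIANT_STATES = ("always_enabled", "always_disabled")


def is_invariant_state(state):
    """Membership test: True when state is one of the invariant states."""
    return state in INVARIANT_STATES


# Equality between a str and a tuple never holds, whatever the string is.
buggy_check = "always_enabled" == INVARIANT_STATES

if __name__ == "__main__":
    print(buggy_check)
    print(is_invariant_state("always_enabled"))
    print(is_invariant_state("enabled"))
```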

src/sonic-host-services/scripts/hostcfgd

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

liat-grozovik · 2021-06-07T04:25:11Z

/azp run

azure-pipelines · 2021-06-07T04:25:29Z

Azure Pipelines successfully started running 1 pipeline(s).

renukamanavalan · 2021-06-14T16:27:15Z

@tahmed-dev can you please provide your approval?

renukamanavalan · 2021-06-14T16:28:09Z

@yozhao101 can you please provide your review/approval ASAP?

renukamanavalan · 2021-06-14T16:30:17Z

@jleveque, can you please provide your review or approval?

src/sonic-host-services/scripts/hostcfgd

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

jleveque

Looks good from my perspective. @yozhao101: Can you please review again to make sure all your concerns have been addressed?

yozhao101 · 2021-06-15T17:50:41Z

Looks good from my perspective. @yozhao101: Can you please review again to make sure all your concerns have been addressed?

Thanks, Joe and Renuka ! I am checking ...

yozhao101 · 2021-06-15T18:51:53Z

src/sonic-host-services/scripts/hostcfgd

+    except Exception as err:
+        if log_err:
+            syslog.syslog(syslog.LOG_ERR, "{} - failed: return code - {}, output:\n{}"
+                  .format(err.cmd, err.returncode, err.output))


It looks like err.output only holds the child process output if it was captured by run() or check_output(); otherwise it is None. Please see: https://docs.python.org/3/library/subprocess.html#subprocess.CalledProcessError.output

This issue applies to the existing code as well - https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-host-services/scripts/hostcfgd#L66. This PR does not intend to fix it.
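
The behaviour described in this comment can be reproduced with the standard library alone. The sketch below is not code from hostcfgd; the function names and the failing child command are made up for the demonstration. It shows that CalledProcessError.output is None when the command is run with check_call(), and populated when it is run with check_output().

```python
import subprocess
import sys


def failing_cmd():
    """A child process that prints one line and exits with status 2."""
    return [sys.executable, "-c", "import sys; print('boom'); sys.exit(2)"]


def output_from_check_call():
    """check_call() does not capture output, so err.output stays None."""
    try:
        subprocess.check_call(failing_cmd(), stdout=subprocess.DEVNULL)
    except subprocess.CalledProcessError as err:
        return err.output


def output_from_check_output():
    """check_output() captures stdout and attaches it to the exception."""
    try:
        subprocess.check_output(failing_cmd(), stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as err:
        return err.output


if __name__ == "__main__":
    print(output_from_check_call())
    print(output_from_check_output())
```

A logging helper that formats err.output therefore only prints something useful if the command was run through a capturing call.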

src/sonic-host-services/scripts/hostcfgd

renukamanavalan · 2021-06-24T14:14:11Z

@yozhao101 can you please resolve/provide your reviews and approval ?

renukamanavalan · 2021-06-24T16:36:04Z

Waiting for build to succeed. @stepanblyschak, if you can, please ping me, when build succeeds.

stepanblyschak · 2021-06-25T16:35:22Z

tests/hostcfgd/hostcfgd_radius_test.py:10: in <module>
    from parameterized import parameterized
E   ModuleNotFoundError: No module named 'parameterized'

Same error appears again - https://dev.azure.com/mssonic/build/_build/results?buildId=19697&view=logs&j=88ce9a53-729c-5fa9-7b6e-3d98f2488e3f&t=8d99be27-49d0-54d0-99b1-cfc0d47f0318&l=527

liat-grozovik · 2021-06-28T15:33:38Z

/azp run

azure-pipelines · 2021-06-28T15:33:56Z

Azure Pipelines successfully started running 1 pipeline(s).


lguohan reviewed Oct 30, 2020


lguohan requested a review from jleveque October 30, 2020 11:16

stepanblyschak force-pushed the auto_restart_cfg branch from e729226 to e4447ee Compare November 2, 2020 09:32

stepanblyschak added 2 commits November 16, 2020 23:25

add feature config parsing test

6f47365

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

stepanblyschak force-pushed the auto_restart_cfg branch from e4447ee to 6f47365 Compare November 16, 2020 21:32

stepanblyschak added 2 commits November 16, 2020 23:54

[supervisor-proc-exit-listener] remove unused import

bfbcfa8

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

Merge branch 'master' of github.com:azure/sonic-buildimage into auto_…

7e6e8f5

…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

lguohan force-pushed the master branch from 3690c1a to 512eb6b Compare December 25, 2020 18:34

Merge branch 'master' of github.com:azure/sonic-buildimage into auto_…

6e91c39

…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

stepanblyschak mentioned this pull request Jan 8, 2021

SONiC Application Extension PR Tracker #6398

Open

rajendra-dendukuri suggested changes Jan 15, 2021


[hostcfgd] enhance run_cmd with ability to raise exception and reuse …

285f7a3

…it in feature handler code Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

Merge branch 'master' of github.com:azure/sonic-buildimage into auto_…

79c2866

…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>

stepanblyschak force-pushed the auto_restart_cfg branch from d918e64 to 79c2866 Compare January 27, 2021 12:21

stepanblyschak added the appext Application Extension label Jan 27, 2021

jleveque suggested changes Jan 29, 2021


[hostcfgd] remove unneded feature_handle method

9a82a82

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

jleveque requested a review from lguohan February 11, 2021 21:18

jleveque closed this Mar 12, 2021

yozhao101 reviewed May 5, 2021


stepanblyschak added 2 commits May 5, 2021 12:35

Merge branch 'master' of github.com:azure/sonic-buildimage into auto_…

12ca5ce

…restart_cfg

fix review comments

f9c53f1

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

fix feature states transitions

774781d

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

stepanblyschak force-pushed the auto_restart_cfg branch from a6f9c94 to 774781d Compare May 21, 2021 15:23

tahmed-dev reviewed May 21, 2021


fix condition

1c6472c

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

jleveque reviewed Jun 14, 2021


rename feature -> feature_handler

32df167

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

jleveque approved these changes Jun 15, 2021


yozhao101 reviewed Jun 15, 2021


tahmed-dev approved these changes Jun 15, 2021


yozhao101 approved these changes Jun 24, 2021


renukamanavalan merged commit 9ce7c6d into sonic-net:master Jun 29, 2021

yozhao101 mentioned this pull request May 24, 2022

[hostcfgd] Initialize Restart= in feature's systemd config by the value of auto_restart in CONFIG_DB #10915

Merged

6 tasks
