Add Process Reboot Cause Service as Upholds of Database Service #18772

Closed
xincunli-sonic wants to merge 4 commits into sonic-net:master from xincunli-sonic:xincun/add-upholds-for-process-reboot

@xincunli-sonic
Contributor

Why I did it

Addressing this issue: Fixed determine/process reboot-cause service dependency

Work item tracking
  • Microsoft ADO (number only): 26209780

How I did it

Add Upholds for process-reboot-cause.service

How to verify it

  1. Before the change
admin@str3-msn4600c-acs-05:~$ show reboot-cause history 
Name                 Cause                                              Time                             User    Comment
-------------------  -------------------------------------------------  -------------------------------  ------  ---------
2024_04_23_19_09_23  reboot                                             Tue 23 Apr 2024 07:07:23 PM UTC  admin   N/A
2024_04_23_07_02_00  Unknown (First boot of SONiC version 20231110.09)  N/A                              N/A     N/A
2024_04_23_06_48_50  reboot                                             Tue 23 Apr 2024 06:48:02 AM UTC  admin   N/A
2024_04_23_06_43_13  reboot                                             Tue 23 Apr 2024 06:42:25 AM UTC  admin   N/A
2024_04_23_06_08_24  Unknown                                            N/A                              N/A     N/A
2024_04_22_21_29_03  reboot                                             Mon 22 Apr 2024 09:28:16 PM UTC  admin   N/A
2024_04_22_21_22_52  reboot                                             Mon 22 Apr 2024 09:22:05 PM UTC  admin   N/A
2024_04_22_21_16_52  reboot                                             Mon 22 Apr 2024 09:16:05 PM UTC  admin   N/A
2024_04_22_21_10_47  Watchdog                                           N/A                              N/A     Unknown
2024_04_22_21_04_44  soft-reboot                                        Mon 22 Apr 2024 09:04:24 PM UTC  admin   N/A
  2. Add Upholds in database
admin@str3-msn4600c-acs-05:~$ systemctl cat database.service 
# /lib/systemd/system/database.service
[Unit]
Description=Database container

Wants=database-chassis.service
After=database-chassis.service
Requires=docker.service
After=docker.service
After=rc-local.service
Upholds=process-reboot-cause.service
StartLimitIntervalSec=1200
StartLimitBurst=3

[Service]
User=root
ExecStartPre=/usr/local/bin/database.sh start
ExecStart=/usr/local/bin/database.sh wait
ExecStop=/usr/local/bin/database.sh stop
RestartSec=30

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/database.service.d/auto_restart.conf
[Service]
Restart=always
  3. Stop database
admin@str3-msn4600c-acs-05:~$ docker stop database 
database

admin@str3-msn4600c-acs-05:~$ docker ps -a
CONTAINER ID   IMAGE                                COMMAND                  CREATED       STATUS                     PORTS     NAMES
1de2b92d4c59   docker-snmp:latest                   "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           snmp
ba6a737312f9   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           mgmt-framework
8a4389c45bdf   docker-lldp:latest                   "/usr/bin/docker-lld…"   2 hours ago   Up 2 hours                           lldp
90a875824801   docker-sonic-gnmi:latest             "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           gnmi
ce19ec771ac5   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   2 hours ago   Up 2 hours                           pmon
c456c8548ee6   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   2 hours ago   Up 2 hours                           radv
c521123a450c   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           syncd
03b4d87822fa   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   2 hours ago   Up 2 hours                           bgp
226ae26a0614   docker-teamd:latest                  "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           teamd
bc689d1e75c5   docker-orchagent:latest              "/usr/bin/docker-ini…"   2 hours ago   Up 2 hours                           swss
5cdc86b679f8   docker-eventd:latest                 "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           eventd
34fe36f3428f   docker-database:latest               "/usr/local/bin/dock…"   2 hours ago   Exited (0) 6 seconds ago             database

admin@str3-msn4600c-acs-05:~$ show reboot-cause history
Traceback (most recent call last):
  File "/usr/local/bin/show", line 5, in <module>
    from show.main import cli
  File "/usr/local/lib/python3.11/dist-packages/show/main.py", line 325, in <module>
    if is_gearbox_configured():
       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/show/main.py", line 266, in is_gearbox_configured
    app_db.connect(app_db.APPL_DB)
  File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 1986, in connect
    return _swsscommon.SonicV2Connector_Native_connect(self, db_name, retry_on)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Unable to connect to redis: Cannot assign requested address
admin@str3-msn4600c-acs-05:~$ systemctl status process-reboot-cause.
Unit process-reboot-cause..service could not be found.
admin@str3-msn4600c-acs-05:~$ systemctl status process-reboot-cause.service 
× process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: failed (Result: start-limit-hit) since Tue 2024-04-23 21:27:53 UTC; 41s ago
   Duration: 62ms
TriggeredBy: ● process-reboot-cause.timer
    Process: 83673 ExecStart=/usr/local/bin/process-reboot-cause (code=exited, status=0/SUCCESS)
   Main PID: 83673 (code=exited, status=0/SUCCESS)
  4. Restart database
admin@str3-msn4600c-acs-05:~$ docker start database 
database
admin@str3-msn4600c-acs-05:~$ docker ps
CONTAINER ID   IMAGE                                COMMAND                  CREATED       STATUS          PORTS     NAMES
1de2b92d4c59   docker-snmp:latest                   "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                snmp
ba6a737312f9   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                mgmt-framework
8a4389c45bdf   docker-lldp:latest                   "/usr/bin/docker-lld…"   2 hours ago   Up 2 hours                lldp
90a875824801   docker-sonic-gnmi:latest             "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                gnmi
ce19ec771ac5   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   2 hours ago   Up 2 hours                pmon
c456c8548ee6   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   2 hours ago   Up 2 hours                radv
03b4d87822fa   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   2 hours ago   Up 2 hours                bgp
226ae26a0614   docker-teamd:latest                  "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                teamd
bc689d1e75c5   docker-orchagent:latest              "/usr/bin/docker-ini…"   2 hours ago   Up 2 hours                swss
34fe36f3428f   docker-database:latest               "/usr/local/bin/dock…"   2 hours ago   Up 22 seconds             database
admin@str3-msn4600c-acs-05:~$ show reboot-cause history
Name                 Cause                                              Time                             User    Comment
-------------------  -------------------------------------------------  -------------------------------  ------  ---------
2024_04_23_19_09_23  reboot                                             Tue 23 Apr 2024 07:07:23 PM UTC  admin   N/A
2024_04_23_07_02_00  Unknown (First boot of SONiC version 20231110.09)  N/A                              N/A     N/A
2024_04_23_06_48_50  reboot                                             Tue 23 Apr 2024 06:48:02 AM UTC  admin   N/A
2024_04_23_06_43_13  reboot                                             Tue 23 Apr 2024 06:42:25 AM UTC  admin   N/A
2024_04_23_06_08_24  Unknown                                            N/A                              N/A     N/A
2024_04_22_21_29_03  reboot                                             Mon 22 Apr 2024 09:28:16 PM UTC  admin   N/A
2024_04_22_21_22_52  reboot                                             Mon 22 Apr 2024 09:22:05 PM UTC  admin   N/A
2024_04_22_21_16_52  reboot                                             Mon 22 Apr 2024 09:16:05 PM UTC  admin   N/A
2024_04_22_21_10_47  Watchdog                                           N/A                              N/A     Unknown
2024_04_22_21_04_44  soft-reboot                                        Mon 22 Apr 2024 09:04:24 PM UTC  admin   N/A

admin@str3-msn4600c-acs-05:~$ systemctl status process-reboot-cause.service 
× process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: failed (Result: start-limit-hit) since Tue 2024-04-23 21:27:53 UTC; 1min 32s ago
   Duration: 62ms
TriggeredBy: ● process-reboot-cause.timer
    Process: 83673 ExecStart=/usr/local/bin/process-reboot-cause (code=exited, status=0/SUCCESS)
   Main PID: 83673 (code=exited, status=0/SUCCESS)
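
Step 2 above edits the packaged unit file directly. As a hedged alternative for reproducing the test, the same directive could be placed in a drop-in next to the existing auto_restart.conf. This is only a sketch: the function name write_upholds_dropin and the file name upholds.conf are hypothetical and not part of this PR, and the root argument exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# write_upholds_dropin ROOT UNIT UPHELD
# Writes ROOT/etc/systemd/system/UNIT.d/upholds.conf containing an
# Upholds= dependency from UNIT to UPHELD.
# Assumption: "upholds.conf" is an illustrative file name, not one used by SONiC.
write_upholds_dropin() {
    root="$1"
    unit="$2"
    upheld="$3"
    dir="${root}/etc/systemd/system/${unit}.d"

    # Create the drop-in directory if it does not exist yet.
    mkdir -p "$dir" || return 1

    # Upholds= belongs in the [Unit] section of the upholding unit.
    printf '[Unit]\nUpholds=%s\n' "$upheld" > "${dir}/upholds.conf"
}
```

On a real system the root argument would be empty, and `systemctl daemon-reload` is needed afterwards so that systemd picks up the new drop-in.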

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305
  • 202405
  • 202411

@xincunli-sonic xincunli-sonic requested a review from lguohan as a code owner April 23, 2024 21:47
@prgeor
Contributor

prgeor commented Apr 23, 2024

@anamehra please review for T2 chassis

@xincunli-sonic
Contributor Author

@saiarcot895 Would you mind reviewing this change? Upholds was used incorrectly in this PR: sonic-net/sonic-host-services#100

@liushilongbuaa
Contributor

/azp run Azure.sonic-buildimage

@azure-pipelines

Commenter does not have sufficient privileges for PR 18772 in repo sonic-net/sonic-buildimage

@xumia
Collaborator

xumia commented Apr 25, 2024

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@anamehra
Contributor

anamehra commented May 1, 2024

@anamehra please review for T2 chassis

Hi @prgeor, has this change been tested on multi-asic?

@xincunli-sonic
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xincunli-sonic
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xincunli-sonic
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xincunli-sonic
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xincunli-sonic
Contributor Author

/azpw run Azure.sonic-buildimage (Test kvmtest-multi-asic-t1-lag by Elastictest)

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage (Test kvmtest-multi-asic-t1-lag by Elastictest)

@azure-pipelines

No pipelines are associated with this pull request.

@anamehra
Contributor

anamehra commented May 2, 2024

@prgeor, I am seeing the following error if I add this Upholds line to the database.service file on the LC. I did not build a fresh image; I edited the file on a router and rebooted, which should not make any difference.

root@sfd-t2-lc2:/home/cisco# systemctl status process-reboot-cause.service
× process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
Active: failed (Result: start-limit-hit) since Thu 2024-05-02 20:49:45 UTC; 5s ago
Duration: 48ms
TriggeredBy: ● process-reboot-cause.timer
Process: 1609294 ExecStart=/usr/local/bin/process-reboot-cause (code=exited, status=0/SUCCESS)
Main PID: 1609294 (code=exited, status=0/SUCCESS)

May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Unit needs to be started because active unit database.service upholds it, but not starting since we tried this too often recently. Will retry later.

Could you please try any multi-asic system at your end?
Thanks
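
One possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: systemd documents Upholds= as a continuous dependency, so while database.service is active, process-reboot-cause.service is started again whenever it is found inactive or failed. The status output shows the service exiting with status 0 after tens of milliseconds, so it would be restarted repeatedly until it hits the start rate limit (StartLimitIntervalSec=/StartLimitBurst=, by default 5 starts within 10 seconds unless overridden). The helper below is a minimal sketch for counting how many start requests systemd rejected; the function name count_rate_limited is hypothetical.

```shell
#!/bin/sh
# count_rate_limited UNIT
# Reads journal lines on stdin and prints how many start requests for UNIT
# were rejected with "Start request repeated too quickly."
count_rate_limited() {
    unit="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, so mask the
    # exit status to keep the function usable under "set -e".
    grep -c -F "${unit}: Start request repeated too quickly." || true
}

# Example invocation on a live system (not executed here):
#   journalctl -u process-reboot-cause.service -b --no-pager \
#       | count_rate_limited process-reboot-cause.service
```

To inspect or clear the state by hand, `systemctl show -p Result --value process-reboot-cause.service` prints the last result (start-limit-hit in the output above), and `systemctl reset-failed process-reboot-cause.service` resets the failed state and its start rate counter.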

@anamehra
Contributor

Hi @prgeor , any input on my comment above? Thanks

@anamehra
Contributor

anamehra commented Jun 5, 2024

@abdosi , for your viz
This is needed for chassis

@abdosi
Contributor

abdosi commented Jun 11, 2024

@xincunli-sonic / @prgeor this change is not working for chassis. After making the change as described by @anamehra, we see the issue below after an LC reboot. I don't think this is straightforward to fix for multi-asic, as there are multiple database services.

Can we merge PR #17406 for master/202405? It has been tested on 202305 and 202205 images and looks like a stable fix.

@anamehra I wonder whether this issue is caused by the timer unit of process-reboot-cause. In your PR #17406 the timer service appears to be removed. Do we need to do the same in the context of this PR?

admin@str2-xxxx-lc1-2:/var/log$ sudo systemctl status process-reboot-cause.service
× process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: failed (Result: start-limit-hit) since Tue 2024-06-11 07:25:50 UTC; 4s ago
   Duration: 57ms
TriggeredBy: ● process-reboot-cause.timer
    Process: 2868400 ExecStart=/usr/local/bin/process-reboot-cause (code=exited, status=0/SUCCESS)
   Main PID: 2868400 (code=exited, status=0/SUCCESS)

Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Unit needs to be started because active unit database@2.service upholds it, but not starting since we tried this too often recently. Will retry later.
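
The last log line names database@2.service as the upholding unit, whereas the earlier report named database.service. A minimal sketch for listing which units systemd reports as upholders, which may help when comparing single-asic and multi-asic logs; the function name upholding_units is hypothetical, and the sed pattern assumes the message wording shown in the log above.

```shell
#!/bin/sh
# upholding_units
# Reads journal lines on stdin and prints the distinct unit names that appear
# in "because active unit <name> upholds it" messages.
upholding_units() {
    # Capture the unit name between "active unit " and " upholds it".
    sed -n 's/.*because active unit \([^ ]*\) upholds it.*/\1/p' \
        | LC_ALL=C sort -u
}

# Example invocation on a live system (not executed here):
#   journalctl -u process-reboot-cause.service -b --no-pager | upholding_units
```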

@anamehra
Contributor

Hi @abdosi, removing the timer does not help either.

@arlakshm
Contributor

@xincunli-sonic, can you please resolve the comments on this PR? Also, please confirm whether these changes will work on multi-asic platforms.

@prgeor
Contributor

prgeor commented Jul 18, 2024

@anamehra @abdosi can you tell me whether you are testing this change on a master image for multi-asic? Which SONiC version are you testing on the chassis platform?

@abdosi
Contributor

abdosi commented Oct 23, 2024

This change is not working on the multi-asic/chassis subsystem. We had another approach to fix this, which is also merged in master, so I am closing this PR.

@abdosi abdosi closed this Oct 23, 2024