
[multi-asic][warm-reboot] Support warm-reboot on Multi-ASIC systems #100

Closed
stepanblyschak wants to merge 21 commits into master from masic-warm-reboot

stepanblyschak (Owner) commented Oct 17, 2025

What I did

Implement warm-reboot script support for Multi-ASIC systems.

How I did it

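Modified the warm-reboot script so that the per-ASIC steps (clearing reboot states, pausing orchagent, stopping services, pre-shutdown, database backup) run for each ASIC, as the log below shows.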
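
In the verification log, the per-ASIC lines for a given step appear in varying order, which suggests the steps run concurrently across ASICs. The following is only a minimal sketch of that pattern, not the PR's actual code: NUM_ASICS, asic_debug, for_each_asic and stop_service_instance are hypothetical names, and the real service stop is left as a comment.

```shell
#!/usr/bin/env bash
# Sketch only: run one warm-reboot step for every ASIC concurrently.
# All names below are hypothetical and do not come from the PR.

NUM_ASICS="${NUM_ASICS:-4}"

# Print a timestamped message prefixed with the ASIC name.
asic_debug() {
    local asic_id="$1"; shift
    echo "$(date) asic${asic_id}: $*"
}

# Run "$@" once per ASIC in the background, appending the ASIC index as the
# last argument, then wait for all of them. Returns non-zero if any failed.
for_each_asic() {
    local pids=() rc=0 asic_id pid
    for ((asic_id = 0; asic_id < NUM_ASICS; asic_id++)); do
        "$@" "$asic_id" &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || rc=1
    done
    return "$rc"
}

# Example step: stand-in for stopping a per-ASIC service instance.
stop_service_instance() {
    local service="$1" asic_id="$2"
    asic_debug "$asic_id" "Stopping ${service}@${asic_id} ..."
    # A real script would stop the unit here, for example:
    #   systemctl stop "${service}@${asic_id}"
    asic_debug "$asic_id" "Stopped ${service}@${asic_id}"
}

for_each_asic stop_service_instance swss
```

Because each ASIC runs in its own background job, the order of the output lines is not deterministic, which matches the interleaving seen in the log.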

How to verify it

Verified on Multi-ASIC KVM with 4 ASICs:

admin@sonic:~$ sudo warm-reboot -vf
Fri Oct 24 01:29:24 PM UTC 2025 Starting warm-reboot
Fri Oct 24 01:29:25 PM UTC 2025 Saving counters folder before warmboot...
Fri Oct 24 01:29:27 PM UTC 2025 Loading kernel without secure boot
Fri Oct 24 01:29:28 PM UTC 2025 Cleared reboot states
Fri Oct 24 01:29:28 PM UTC 2025 asic0: Cleared reboot states
Fri Oct 24 01:29:28 PM UTC 2025 asic1: Cleared reboot states
Fri Oct 24 01:29:28 PM UTC 2025 asic3: Cleared reboot states
Fri Oct 24 01:29:28 PM UTC 2025 asic2: Cleared reboot states
Fri Oct 24 01:29:28 PM UTC 2025 asic0: Pausing orchagent ...
Fri Oct 24 01:29:28 PM UTC 2025 asic1: Pausing orchagent ...
Fri Oct 24 01:29:28 PM UTC 2025 asic2: Pausing orchagent ...
Fri Oct 24 01:29:28 PM UTC 2025 asic3: Pausing orchagent ...
Fri Oct 24 01:29:28 PM UTC 2025 asic2: Orchagent paused successfully
Fri Oct 24 01:29:28 PM UTC 2025 asic1: Orchagent paused successfully
Fri Oct 24 01:29:28 PM UTC 2025 asic3: Orchagent paused successfully
Fri Oct 24 01:29:28 PM UTC 2025 asic0: Orchagent paused successfully
Fri Oct 24 01:29:28 PM UTC 2025 Collecting logs to check ssd health before warm-reboot...
Fri Oct 24 01:29:28 PM UTC 2025 Stopping lldp ...
Fri Oct 24 01:29:28 PM UTC 2025 Stopped lldp
Fri Oct 24 01:29:29 PM UTC 2025 asic0: Stopping lldp@0 ...
Fri Oct 24 01:29:29 PM UTC 2025 asic3: Stopping lldp@3 ...
Fri Oct 24 01:29:29 PM UTC 2025 asic2: Stopping lldp@2 ...
Fri Oct 24 01:29:29 PM UTC 2025 asic1: Stopping lldp@1 ...
Fri Oct 24 01:29:29 PM UTC 2025 asic2: Stopped lldp@2
Fri Oct 24 01:29:29 PM UTC 2025 asic3: Stopped lldp@3
Fri Oct 24 01:29:29 PM UTC 2025 asic0: Stopped lldp@0
Fri Oct 24 01:29:29 PM UTC 2025 asic1: Stopped lldp@1
Fri Oct 24 01:29:29 PM UTC 2025 Stopping radv ...
Fri Oct 24 01:29:30 PM UTC 2025 Stopped radv
Fri Oct 24 01:29:30 PM UTC 2025 asic0: Stopping bgp@0 ...
Fri Oct 24 01:29:30 PM UTC 2025 asic3: Stopping bgp@3 ...
Fri Oct 24 01:29:30 PM UTC 2025 asic1: Stopping bgp@1 ...
Fri Oct 24 01:29:30 PM UTC 2025 asic2: Stopping bgp@2 ...
Fri Oct 24 01:29:34 PM UTC 2025 asic1: Stopped bgp@1
Fri Oct 24 01:29:34 PM UTC 2025 asic2: Stopped bgp@2
Fri Oct 24 01:29:35 PM UTC 2025 asic3: Stopped bgp@3
Fri Oct 24 01:29:35 PM UTC 2025 asic0: Stopped bgp@0
Fri Oct 24 01:29:35 PM UTC 2025 asic3: Stopping swss@3 ...
Fri Oct 24 01:29:35 PM UTC 2025 asic2: Stopping swss@2 ...
Fri Oct 24 01:29:35 PM UTC 2025 asic1: Stopping swss@1 ...
Fri Oct 24 01:29:35 PM UTC 2025 asic0: Stopping swss@0 ...
Fri Oct 24 01:29:35 PM UTC 2025 asic1: Stopped swss@1
Fri Oct 24 01:29:35 PM UTC 2025 asic0: Stopped swss@0
Fri Oct 24 01:29:35 PM UTC 2025 asic3: Stopped swss@3
Fri Oct 24 01:29:35 PM UTC 2025 asic2: Stopped swss@2
Fri Oct 24 01:29:35 PM UTC 2025 asic0: Initialize pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic2: Initialize pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic3: Initialize pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic1: Initialize pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic0: Requesting pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic2: Requesting pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic3: Requesting pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic1: Requesting pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic2: Waiting for pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic1: Waiting for pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic3: Waiting for pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic0: Waiting for pre-shutdown ...
Fri Oct 24 01:29:35 PM UTC 2025 asic2: Pre-shutdown succeeded, state: pre-shutdown-succeeded ...
Fri Oct 24 01:29:35 PM UTC 2025 asic1: Pre-shutdown succeeded, state: pre-shutdown-succeeded ...
Fri Oct 24 01:29:35 PM UTC 2025 asic3: Pre-shutdown succeeded, state: pre-shutdown-succeeded ...
Fri Oct 24 01:29:35 PM UTC 2025 asic0: Pre-shutdown succeeded, state: pre-shutdown-succeeded ...
Fri Oct 24 01:29:35 PM UTC 2025 asic0: Stopping teamd@0 ...
Fri Oct 24 01:29:35 PM UTC 2025 asic1: Stopping teamd@1 ...
Fri Oct 24 01:29:35 PM UTC 2025 asic3: Stopping teamd@3 ...
Fri Oct 24 01:29:35 PM UTC 2025 asic2: Stopping teamd@2 ...
Fri Oct 24 01:29:36 PM UTC 2025 asic1: Stopped teamd@1
Fri Oct 24 01:29:36 PM UTC 2025 asic3: Stopped teamd@3
Fri Oct 24 01:29:36 PM UTC 2025 asic0: Stopped teamd@0
Fri Oct 24 01:29:36 PM UTC 2025 asic2: Stopped teamd@2
Fri Oct 24 01:29:36 PM UTC 2025 asic0: Stopping syncd@0 ...
Fri Oct 24 01:29:36 PM UTC 2025 asic2: Stopping syncd@2 ...
Fri Oct 24 01:29:36 PM UTC 2025 asic1: Stopping syncd@1 ...
Fri Oct 24 01:29:36 PM UTC 2025 asic3: Stopping syncd@3 ...
Fri Oct 24 01:29:39 PM UTC 2025 asic2: Stopped syncd@2
Fri Oct 24 01:29:39 PM UTC 2025 asic0: Stopped syncd@0
Fri Oct 24 01:29:40 PM UTC 2025 asic1: Stopped syncd@1
Fri Oct 24 01:29:40 PM UTC 2025 asic3: Stopped syncd@3
Fri Oct 24 01:29:41 PM UTC 2025 asic0: Backing up database ...
Fri Oct 24 01:29:41 PM UTC 2025 asic3: Backing up database ...
Fri Oct 24 01:29:41 PM UTC 2025 asic2: Backing up database ...
Fri Oct 24 01:29:41 PM UTC 2025 asic1: Backing up database ...
Fri Oct 24 01:29:41 PM UTC 2025 Backing up database ...
Successfully copied 60.9kB to /host/warmboot2
Successfully copied 60.9kB to /host/warmboot3
Successfully copied 62kB to /host/warmboot0
Successfully copied 62.5kB to /host/warmboot1
Successfully copied 15.4kB to /host/warmboot
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket
Fri Oct 24 01:29:56 PM UTC 2025 Enabling Watchdog before warm-reboot
Fri Oct 24 01:29:56 PM UTC 2025 Rebooting with /sbin/kexec -e to SONiC-OS-master.0-87f0a41a4 ...

On boot, SAI started in warm boot mode on all four ASICs (SAI_BOOT_TYPE: 1):

admin@sonic:~$ sudo zless /var/log/syslog | grep SAI_BOOT_TYPE
2025 Oct 24 13:30:35.481528 sonic NOTICE syncd2#syncd: :- profileGetValue: SAI_BOOT_TYPE: 1
2025 Oct 24 13:30:35.488698 sonic NOTICE syncd3#syncd: :- profileGetValue: SAI_BOOT_TYPE: 1
2025 Oct 24 13:30:35.694937 sonic NOTICE syncd0#syncd: :- profileGetValue: SAI_BOOT_TYPE: 1
2025 Oct 24 13:30:35.702988 sonic NOTICE syncd1#syncd: :- profileGetValue: SAI_BOOT_TYPE: 1
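
The manual grep above could be turned into a pass/fail check. This is a sketch under assumptions: check_warm_boot is a hypothetical helper, and it relies on the syslog line format shown above (a syncdN#syncd tag and a line ending in SAI_BOOT_TYPE: 1).

```shell
#!/usr/bin/env bash
# Sketch: confirm that every expected syncd instance logged SAI_BOOT_TYPE: 1.
# check_warm_boot is a hypothetical helper, not part of SONiC.

# Reads syslog lines on stdin; $1 is the expected number of ASICs.
# Succeeds only if syncd0..syncd(N-1) each reported SAI_BOOT_TYPE: 1.
check_warm_boot() {
    local num_asics="$1" lines asic_id rc=0
    lines="$(grep 'SAI_BOOT_TYPE')"
    for ((asic_id = 0; asic_id < num_asics; asic_id++)); do
        if ! grep -q "syncd${asic_id}#syncd: .*SAI_BOOT_TYPE: 1\$" <<<"$lines"; then
            echo "asic${asic_id}: no warm boot record found" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

For the 4-ASIC setup above, it might be used as: sudo cat /var/log/syslog | check_warm_boot 4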

Tested on real single-ASIC hardware:

root@sonic:/home/admin# sudo warm-reboot -v
Fri Oct 24 04:41:37 PM IDT 2025 Starting warm-reboot
Fri Oct 24 04:41:50 PM IDT 2025 Prepare MLNX ASIC to fastfast-reboot: install new FW if required
Fri Oct 24 04:41:54 PM IDT 2025 Loading kernel without secure boot
Fri Oct 24 04:41:55 PM IDT 2025 Cleared reboot states
Fri Oct 24 04:41:55 PM IDT 2025 Starting lag_keepalive to send LACPDUs ...
ERROR: There are port channels/peer devices that failed the probe: ['PortChannel102', 'PortChannel105', 'PortChannel103', 'PortChannel106', 'PortChannel108', 'PortChannel104', 'PortChannel107', 'PortChannel101']
Fri Oct 24 04:42:11 PM IDT 2025 Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available
Fri Oct 24 04:42:11 PM IDT 2025 Pausing orchagent ...
Fri Oct 24 04:42:11 PM IDT 2025 Orchagent paused successfully
Fri Oct 24 04:42:11 PM IDT 2025 Collecting logs to check ssd health before fastfast-reboot...
Fri Oct 24 04:42:11 PM IDT 2025 Stopping lldp ...
Fri Oct 24 04:42:13 PM IDT 2025 Stopped lldp
Fri Oct 24 04:42:13 PM IDT 2025 Stopping lldp ...
Fri Oct 24 04:42:14 PM IDT 2025 Stopped lldp
Fri Oct 24 04:42:14 PM IDT 2025 Stopping pmon ...
Fri Oct 24 04:42:23 PM IDT 2025 Stopped pmon
Fri Oct 24 04:42:23 PM IDT 2025 Stopping radv ...
Fri Oct 24 04:42:24 PM IDT 2025 Stopped radv
Fri Oct 24 04:42:24 PM IDT 2025 Stopping what-just-happened ...
Fri Oct 24 04:42:25 PM IDT 2025 Stopped what-just-happened
Fri Oct 24 04:42:25 PM IDT 2025 Stopping bgp ...
Fri Oct 24 04:42:33 PM IDT 2025 Stopped bgp
Fri Oct 24 04:42:33 PM IDT 2025 Stopping swss ...
Fri Oct 24 04:42:34 PM IDT 2025 Stopped swss
Fri Oct 24 04:42:34 PM IDT 2025 Initialize pre-shutdown ...
Fri Oct 24 04:42:35 PM IDT 2025 Requesting pre-shutdown ...
Fri Oct 24 04:42:35 PM IDT 2025 Waiting for pre-shutdown ...
Fri Oct 24 04:42:38 PM IDT 2025 Pre-shutdown succeeded, state: pre-shutdown-succeeded ...
Fri Oct 24 04:42:38 PM IDT 2025 Stopping teamd ...
Fri Oct 24 04:42:39 PM IDT 2025 Stopped teamd
Fri Oct 24 04:42:39 PM IDT 2025 Stopping syncd ...
Fri Oct 24 04:42:44 PM IDT 2025 Stopped syncd
Fri Oct 24 04:42:44 PM IDT 2025 Backing up database ...
Successfully copied 3MB to /host/warmboot
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket
Fri Oct 24 04:42:53 PM IDT 2025 Enabling Watchdog before fastfast-reboot
Watchdog armed for 180 seconds
Fri Oct 24 04:42:53 PM IDT 2025 Rebooting with /sbin/kexec -e to SONiC-OS-master.0-87f0a41a4 ...

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

debug "Stopping $service_name ..."

# TODO: These exceptions for nat, sflow, lldp


Do you plan to remove this TODO?

stepanblyschak (Owner, Author)

@oleksandrivantsiv The TODO existed before this change; I don't plan to handle it here.

timeout --foreground 30 python3 ${LAG_KEEPALIVE_SCRIPT} --fork-into-background --namespace "$NETNS"

# give the lag_keepalive script a chance to send some LACPDUs
sleep 5


How can we know that 5 secs is enough?

stepanblyschak (Owner, Author)

@oleksandrivantsiv This sleep existed before this change. You are right that it might not be enough, but I don't plan to change it here.
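
For reference, one common alternative to a fixed sleep is to poll a condition with a deadline. The sketch below is not proposed for this PR: wait_until is a hypothetical helper, and the thread does not say what condition would show that lag_keepalive has sent its LACPDUs, so the command passed to it is left to the caller.

```shell
#!/usr/bin/env bash
# Sketch: poll a condition with a deadline instead of sleeping a fixed time.
# wait_until is a hypothetical helper; the condition to poll is an assumption.

# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds (returns 0) or the
# timeout elapses (returns 1).
wait_until() {
    local timeout="$1"; shift
    local deadline=$((SECONDS + timeout))
    until "$@"; do
        if ((SECONDS >= deadline)); then
            return 1
        fi
        sleep 1
    done
}
```

A caller would pass the timeout and a command that succeeds once the awaited state is reached, for example: wait_until 5 test -e /path/to/marker (the marker path is a placeholder).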

oleksandrivantsiv commented:

Besides some minor comments, LGTM.
