[platform/reboot] Fix the reboot stuck issue #1104
Closed
wangxin wants to merge 1 commit into sonic-net:master from wangxin:reboot-stuck
Two issues caused the reboot to get stuck:
1. Some switches may need more time to go down after the reboot command is issued.
2. If the switch does not go down in time, terminating the multiprocessing.Process could hang. The terminate() call sends SIGTERM, which is not able to terminate the asynchronous process.

The fix:
1. Increase the timeout for waiting for the switch to go down.
2. Replace multiprocessing.Process with multiprocessing.pool.ThreadPool.
3. Try to kill any reboot process on the switch if the switch does not go down in time.
4. On switches that reboot fast, the wait_for module may fail to detect that the switch was down for rebooting. Ignore the error from waiting for the switch to go down.
5. Use uptime to verify whether the switch has rebooted.

Signed-off-by: Xin Wang <xinw@mellanox.com>
This PR may conflict with #1079. Closing it for now. I will create a new one based on PR #1079 after it is merged.
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
…submodule head (sonic-net#11761) linkmgrd: * 476f85e 2022-08-17 | Update linkmgr health after getting default route update (sonic-net#117) (HEAD -> 202205, github/202205) [Longxiang Lyu] * fc589e9 2022-08-17 | Use `table` to toggle peer forwarding state (sonic-net#108) (sonic-net#120) [Longxiang Lyu] * bcb5a56 2022-08-17 | Fix azure pipeline (sonic-net#118) (sonic-net#121) [Longxiang Lyu] swss: * ef3a601 2022-08-17 | [muxorch] Returning true if nbr in skip_neighbor_ in isNeighborActive() (sonic-net#2415) (HEAD -> 202205) [Nikola Dancejic] sairedis: * aed01cd 2022-08-12 | Fix: missing sonic-db-cli in docker-sonic-vs image (sonic-net#1072) (sonic-net#1104) (github/202205) [Hua Liu] platform-daemon: * 5a68073 2022-08-01 | Xcvrd changes to support 400G ZR configuration (sonic-net#270) (HEAD -> 202205) [Prince George] swsssdk: * ca785a2 2022-06-01 | Remove sonic-db-cli (sonic-net#122) (HEAD -> 202205, origin/202205) [Hua Liu] Signed-off-by: Ying Xie <ying.xie@microsoft.com> Signed-off-by: Ying Xie <ying.xie@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
Submodule src/sonic-swss 2529d79..15652b2: > [mirrororch]: Add retry logic when deleting referenced mirror session (sonic-net#1104) Submodule src/sonic-utilities 0cfa942..c049e54: > [neighbor_advertiser]: Add sleep in setting mirror session and ACL rules (sonic-net#714) > [warm/fast reboot] continue executing when killing docker failed (sonic-net#713) > [ecnconfig] Validate input WRED parameters (sonic-net#579) Signed-off-by: Ying Xie <ying.xie@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
update the environment variable in the teardown (sonic-net#1101) Fix for show interface portchannel now working on 201911 (sonic-net#1105) [201911]show interface counters for multi ASIC devices (sonic-net#1104) Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
update the environment variable in the teardown (sonic-net#1101) Fix for show interface portchannel now working on 201911 (sonic-net#1105) Revert "Pfcstat (sonic-net#1097)" Revert " [201911]show interface counters for multi ASIC devices (sonic-net#1104)" Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
Revert "Revert " [201911]show interface counters for multi ASIC devices (sonic-net#1104)"" Revert "Revert "Pfcstat (sonic-net#1097)"" [show] Fix 'show int neighbor expected' (sonic-net#1106) Update argument for docker exec it->i (sonic-net#1118) Update to make config load/reload backward compatible. (sonic-net#1115) Handling deletion of Port Channel before deletion of its members (sonic-net#1062) Skip default route present in ASIC-DB but not in APP-DB. (sonic-net#1107) [CLI][PFCWD][Multi-ASIC] Added multi ASIC support to 'pfcwd' CLI (sonic-net#1102) [201911] Multi asic platform config interface portchannel, show transceiver (sonic-net#1087) [drop counters] Fix configuration for counters with lowercase names (sonic-net#1103) Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Description of PR
Summary:
Fixes # (issue)
The test_reboot.py script may hang indefinitely while rebooting the SONiC switch.
Reason:
Some switches may need more time to go down after the reboot
command is issued. If the switch does not go down in time, terminating the
multiprocessing.Process could hang. The terminate() call sends SIGTERM,
which is not able to terminate the asynchronous process.
The fix:
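1. Increase the timeout for waiting for the switch to go down.
2. Replace multiprocessing.Process with multiprocessing.pool.ThreadPool.
3. Try to kill any reboot process on the switch if the switch does not go down
in time.
4. On switches that reboot fast, the wait_for module may fail to detect
that the switch was down for rebooting. Ignore the error from waiting
for the switch to go down.
5. Use uptime to verify whether the switch has rebooted.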
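For illustration, the switch from multiprocessing.Process to multiprocessing.pool.ThreadPool could look roughly like the sketch below. It is not the code from this PR: `run_with_timeout` and `fake_reboot` are hypothetical names, and `fake_reboot` only sleeps to stand in for the blocking reboot call. The idea is that the caller waits on the result with a timeout and gives up cleanly instead of calling terminate() on a separate process.

```python
import time
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool


def run_with_timeout(func, args=(), timeout=5.0):
    """Run func(*args) in a worker thread and wait at most `timeout` seconds.

    Returns (True, result) if the call finished in time, else (False, None).
    """
    pool = ThreadPool(processes=1)
    try:
        async_result = pool.apply_async(func, args)
        try:
            return True, async_result.get(timeout=timeout)
        except TimeoutError:
            # The worker thread is still busy; stop waiting for it.
            return False, None
    finally:
        # Stop the pool's helper threads; a busy worker thread is not joined.
        pool.terminate()


def fake_reboot(delay):
    """Hypothetical stand-in for the blocking reboot command."""
    time.sleep(delay)
    return "done"
```

With this shape, `run_with_timeout(fake_reboot, (0.5,), timeout=0.05)` returns without waiting for the slow call, and the caller can then try to kill the reboot process on the switch.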
Type of change
Approach
How did you do it?
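1. Increase the timeout for waiting for the switch to go down.
2. Replace multiprocessing.Process with multiprocessing.pool.ThreadPool.
3. Try to kill any reboot process on the switch if the switch does not go down
in time.
4. On switches that reboot fast, the wait_for module may fail to detect
that the switch was down for rebooting. Ignore the error from waiting
for the switch to go down.
5. Use uptime to verify whether the switch has rebooted.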
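One possible way to verify a reboot with uptime is to compare the boot timestamp before and after the reboot. The sketch below assumes the test collects the output of `uptime -s` (which prints the boot time as `YYYY-MM-DD HH:MM:SS`) from the switch at both points; `parse_boot_time`, `has_rebooted`, and the 5-second tolerance are hypothetical and not taken from this PR.

```python
from datetime import datetime, timedelta


def parse_boot_time(uptime_s_output):
    """Parse the output of `uptime -s`, e.g. '2020-01-01 12:00:00'."""
    return datetime.strptime(uptime_s_output.strip(), "%Y-%m-%d %H:%M:%S")


def has_rebooted(boot_before, boot_after, tolerance=timedelta(seconds=5)):
    """Return True if the boot time moved forward by more than `tolerance`.

    The tolerance (an assumed value) absorbs small differences in the
    reported boot time on a switch that did not actually reboot.
    """
    return boot_after - boot_before > tolerance
```

Because this check does not depend on catching the window in which the switch is unreachable, it also works on switches that reboot too quickly for wait_for to notice.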
How did you verify/test it?
Tested on Mellanox platform.
Any platform specific information?
No.
Supported testbed topology if it's a new test case?
NA
Documentation