
[platform/reboot] Fix the reboot stuck issue #1104

Closed
wangxin wants to merge 1 commit into sonic-net:master from wangxin:reboot-stuck

Conversation

@wangxin
Collaborator

@wangxin wangxin commented Sep 9, 2019

Description of PR

Summary:
Fixes # (issue)
The test_reboot.py script may hang indefinitely while rebooting the SONiC switch.

Reason:
Some switches may need more time to go down after the reboot
command is issued. If the switch does not go down in time, terminating the
multiprocessing.Process can hang. The terminate() call sends SIGTERM,
which is not able to terminate the asynchronous process.
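
To illustrate the timing problem, here is a minimal sketch of a polling helper that waits for a condition with a configurable timeout. The name `wait_until` and its parameters are hypothetical and are not taken from this PR or from the test framework. It returns False on timeout instead of raising, so the caller can decide how to proceed.

```python
import time


def wait_until(predicate, timeout, interval=0.01):
    """Poll predicate() until it returns True or timeout seconds pass.

    Hypothetical helper for illustration only. Returns True if the
    condition was met in time, False otherwise (it never raises on
    timeout, so the caller decides what to do next).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

A caller could pass a check such as "the switch no longer answers on its management port" as the predicate, and choose a timeout long enough for slower platforms.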

The fix:

  1. Increase the timeout for waiting for the switch to go down.
  2. Replace multiprocessing.Process with multiprocessing.pool.ThreadPool.
  3. Try to kill any reboot process on the switch if the switch does not go
    down in time.
  4. On switches that reboot quickly, the wait_for module may fail to detect
    that the switch went down for the reboot. Ignore errors from waiting
    for the switch to go down.
  5. Use uptime to verify whether the switch has rebooted.
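
As a rough sketch of item 2, the snippet below runs a blocking call in a ThreadPool worker and waits for its result with a timeout. The PR's actual diff is not shown on this page, so `run_with_timeout` and its parameters are assumptions made for illustration. Only the standard-library calls (`ThreadPool.apply_async`, `AsyncResult.get`, `multiprocessing.TimeoutError`) are real.

```python
import time
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool


def run_with_timeout(func, args=(), timeout=1.0):
    """Run func(*args) in a worker thread and wait at most timeout seconds.

    Hypothetical helper, not the PR's code. Returns (True, result) if the
    call finished in time, or (False, None) if the wait timed out.
    """
    pool = ThreadPool(processes=1)
    async_result = pool.apply_async(func, args)
    try:
        return True, async_result.get(timeout=timeout)
    except TimeoutError:
        # The worker thread may still be running; the caller moves on
        # instead of blocking on it.
        return False, None
    finally:
        pool.terminate()
```

With this shape, a timeout surfaces as an ordinary return value, so the test can go on to its recovery steps, such as item 3.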

Type of change

  • Bug fix
  • [] Testbed and Framework(new/improvement)
  • [] Test case(new/improvement)

Approach

How did you do it?

  1. Increase the timeout for waiting for the switch to go down.
  2. Replace multiprocessing.Process with multiprocessing.pool.ThreadPool.
  3. Try to kill any reboot process on the switch if the switch does not go
    down in time.
  4. On switches that reboot quickly, the wait_for module may fail to detect
    that the switch went down for the reboot. Ignore errors from waiting
    for the switch to go down.
  5. Use uptime to verify whether the switch has rebooted.
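
Item 5 could look something like the sketch below. It assumes the uptime is read from the first field of `/proc/uptime` on the switch and compared against the time elapsed since the reboot command was issued. Both function names are hypothetical, and the PR may read or compare uptime differently.

```python
def parse_uptime_seconds(proc_uptime_text):
    """Return the uptime in seconds from the contents of /proc/uptime.

    The first whitespace-separated field of /proc/uptime is the system
    uptime in seconds.
    """
    return float(proc_uptime_text.split()[0])


def has_rebooted(uptime_seconds, seconds_since_reboot_cmd):
    """Hypothetical check: the switch rebooted if it has been up for less
    time than has elapsed since the reboot command was issued."""
    return uptime_seconds < seconds_since_reboot_cmd
```

Because this check depends only on the uptime value, it still works when the switch reboots too quickly for the "went down" phase to be observed (item 4).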

How did you verify/test it?

Tested on Mellanox platform.

Any platform specific information?

No.

Supported testbed topology if it's a new test case?

NA

Documentation

Two issues caused the reboot to get stuck:
1. Some switches may need more time to go down after the reboot
command is issued.
2. If the switch does not go down in time, terminating the
multiprocessing.Process can hang. The terminate() call sends SIGTERM,
which is not able to terminate the asynchronous process.

The fix:
1. Increase the timeout for waiting for the switch to go down.
2. Replace multiprocessing.Process with multiprocessing.pool.ThreadPool.
3. Try to kill any reboot process on the switch if the switch does not go
down in time.
4. On switches that reboot quickly, the wait_for module may fail to detect
that the switch went down for the reboot. Ignore errors from waiting
for the switch to go down.
5. Use uptime to verify whether the switch has rebooted.

Signed-off-by: Xin Wang <xinw@mellanox.com>
@wangxin
Collaborator Author

wangxin commented Sep 10, 2019

This PR may conflict with #1079, so I am closing it for now. I will create a new one based on PR #1079 after it is merged.

@wangxin wangxin closed this Sep 10, 2019
@wangxin wangxin deleted the reboot-stuck branch September 26, 2019 12:36
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
…submodule head (sonic-net#11761)

linkmgrd:
* 476f85e 2022-08-17 | Update linkmgr health after getting default route update (sonic-net#117) (HEAD -> 202205, github/202205) [Longxiang Lyu]
* fc589e9 2022-08-17 | Use `table` to toggle peer forwarding state (sonic-net#108) (sonic-net#120) [Longxiang Lyu]
* bcb5a56 2022-08-17 | Fix azure pipeline (sonic-net#118) (sonic-net#121) [Longxiang Lyu]

swss:
* ef3a601 2022-08-17 | [muxorch] Returning true if nbr in skip_neighbor_ in isNeighborActive() (sonic-net#2415) (HEAD -> 202205) [Nikola Dancejic]

sairedis:
* aed01cd 2022-08-12 | Fix: missing sonic-db-cli in docker-sonic-vs image (sonic-net#1072) (sonic-net#1104) (github/202205) [Hua Liu]

platform-daemon:
* 5a68073 2022-08-01 | Xcvrd changes to support 400G ZR configuration (sonic-net#270) (HEAD -> 202205) [Prince George]

swsssdk:
* ca785a2 2022-06-01 | Remove sonic-db-cli (sonic-net#122) (HEAD -> 202205, origin/202205) [Hua Liu]

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
Submodule src/sonic-swss 2529d79..15652b2:
  > [mirrororch]: Add retry logic when deleting referenced mirror session (sonic-net#1104)

Submodule src/sonic-utilities 0cfa942..c049e54:
  > [neighbor_advertiser]: Add sleep in setting mirror session and ACL rules (sonic-net#714)
  > [warm/fast reboot] continue executing when killing docker failed (sonic-net#713)
  > [ecnconfig] Validate input WRED parameters (sonic-net#579)

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
update the environment variable in the teardown (sonic-net#1101)
Fix for show interface portchannel now working on 201911 (sonic-net#1105)
[201911]show interface counters for multi ASIC devices (sonic-net#1104)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
update the environment variable in the teardown (sonic-net#1101)
Fix for show interface portchannel now working on 201911 (sonic-net#1105)
Revert "Pfcstat (sonic-net#1097)"
Revert " [201911]show interface counters for multi ASIC devices (sonic-net#1104)"

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
Revert "Revert " [201911]show interface counters for multi ASIC devices (sonic-net#1104)""
Revert "Revert "Pfcstat (sonic-net#1097)""
[show] Fix 'show int neighbor expected' (sonic-net#1106)
Update argument for docker exec it->i (sonic-net#1118)
Update to make config load/reload backward compatible. (sonic-net#1115)
Handling deletion of Port Channel before deletion of its members (sonic-net#1062)
Skip default route present in ASIC-DB but not in APP-DB. (sonic-net#1107)
[CLI][PFCWD][Multi-ASIC] Added multi ASIC support to 'pfcwd' CLI (sonic-net#1102)
[201911] Multi asic platform config interface portchannel, show transceiver (sonic-net#1087)
[drop counters] Fix configuration for counters with lowercase names (sonic-net#1103)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>