
[test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work #13974

Merged: wangxin merged 1 commit into sonic-net:master from lizhijianrd:test-ro-disk-pdu-reboot on Aug 5, 2024

Conversation

@lizhijianrd (Contributor) commented Aug 4, 2024

Description of PR

Summary:
On some platforms, the DUT cannot be recovered from the read-only (RO) disk state by reboot (e.g., on Nokia-7215, we saw the reboot blocked by systemd-journald.service). To avoid the DUT getting stuck in the RO disk state, this PR introduces a power-cycle as the final approach to recover the DUT.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

On some platforms, the DUT cannot be recovered from the read-only (RO) disk state by reboot (e.g., on Nokia-7215, we saw the reboot blocked by systemd-journald.service). To avoid the DUT getting stuck in the RO disk state, this PR introduces a power-cycle as the final approach to recover the DUT.

How did you do it?

If the reboot fails to recover the DUT from the RO disk state, try a power-cycle to recover it.
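The flow above can be sketched as follows. This is a minimal, illustrative sketch, not the actual sonic-mgmt code: `reboot` and `power_cycle` here are hypothetical stand-ins for the real `do_reboot` helper and the PDU controller, and the retry count mirrors the 3 attempts visible in the logs below.

```python
# Sketch of the recovery flow this PR adds (names are illustrative,
# not the actual sonic-mgmt helpers): retry the regular reboot a few
# times, and fall back to a PDU power-cycle only after every attempt
# fails.
def recover_dut(reboot, power_cycle, max_retries=3):
    for attempt in range(max_retries):
        try:
            reboot()
            return "reboot"  # regular reboot brought the DUT back
        except Exception as exc:
            print("DUT did not go down, exception: %s attempt:%d/%d"
                  % (exc, attempt, max_retries))
    # Final approach: power-cycle via PDU to restore the disk to RW state.
    power_cycle()
    return "pdu"
```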

How did you verify/test it?

Verified on a Nokia-7215 M0 testbed. The test passed with the logs below:

tacacs/test_ro_disk.py::test_ro_disk[dut-7215-4]
-------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------
10:02:17 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:0/3
10:04:02 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:1/3
10:05:24 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:2/3
10:05:44 test_ro_disk.do_reboot                   L0095 ERROR  | Failed to reboot DUT after 3 retries
10:05:44 test_ro_disk.test_ro_disk                L0262 WARNING| Failed to reboot dut-7215-4, try PDU reboot to restore disk RW state
PASSED                                                                                                                                                                  [100%]

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@wangxin wangxin merged commit 92b4e79 into sonic-net:master Aug 5, 2024
@lizhijianrd lizhijianrd deleted the test-ro-disk-pdu-reboot branch August 5, 2024 02:59
@lizhijianrd (Contributor, Author) commented:
@yxieca @bingwang-ms Can you please help add the approval tag for the 202311 and 202405 branches? Thanks!

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Aug 5, 2024
…sn't work (sonic-net#13974)

@mssonicbld
Collaborator

Cherry-pick PR to 202405: #13985

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Aug 5, 2024
…sn't work (sonic-net#13974)

@mssonicbld
Collaborator

Cherry-pick PR to 202311: #13986

mssonicbld pushed a commit that referenced this pull request Aug 5, 2024
…sn't work (#13974)

lizhijianrd added a commit to lizhijianrd/sonic-mgmt that referenced this pull request Aug 7, 2024
…sn't work (sonic-net#13974)

@lizhijianrd
Contributor Author

Manually backported to 202311 to resolve a PR test issue: #14011

StormLiangMS pushed a commit that referenced this pull request Aug 8, 2024
…boot doesn't work (#14011)

* Add module `platform_tests/test_kdump.py` into PR test (#12732)

What is the motivation for this PR?
To adapt to the KVM testbed, there were two issues in the previous pdu_controller init.py script:
* conn_graph_facts is None on a KVM testbed, which raises KeyError when fetching values via conn_graph_facts["xxx"].
* inv_mgr has no function called get_host_list.

So in this PR, I fix these issues:
* Use the get method on the conn_graph_facts dict to avoid KeyError, defaulting to {} when the key does not exist.
* The branch if hostname not in device_pdu_links or hostname not in device_pdu_info is unnecessary: we use conn_graph_facts to get the PDU links and PDU info first, so reaching that branch means the host does not exist in the csv file and there is no need to fall back to the inventory. The code in that branch is removed in this PR.

How did you do it?
* Use the get method on the conn_graph_facts dict to avoid KeyError, defaulting to {} when the key does not exist.
* Remove the code in the branch if hostname not in device_pdu_links or hostname not in device_pdu_info.
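The dict.get change described above can be sketched like this. The variable names (conn_graph_facts, device_pdu_links, device_pdu_info) mirror the commit message; the surrounding code is an assumption, not the actual pdu_controller source.

```python
# Sketch of the fix: dict.get with a {} default avoids KeyError when
# the connection graph has no entry for a key (e.g. on a KVM testbed
# where conn_graph_facts carries no PDU data).
conn_graph_facts = {}  # e.g. empty on a KVM testbed

# Before the fix, conn_graph_facts["device_pdu_links"] raised KeyError.
device_pdu_links = conn_graph_facts.get("device_pdu_links", {})
device_pdu_info = conn_graph_facts.get("device_pdu_info", {})

hostname = "dut-7215-4"  # hypothetical DUT name for illustration
# With the defaults above, a missing host simply yields no PDU data;
# the inventory fallback branch is no longer needed.
pdu_links = device_pdu_links.get(hostname, {})
pdu_info = device_pdu_info.get(hostname, {})
```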

* [test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work (#13974)


---------

Co-authored-by: Yutong Zhang <[email protected]>
StormLiangMS pushed a commit that referenced this pull request Sep 1, 2024
…oot (#14343)

What is the motivation for this PR?
In PR #13974, I introduced a PDU reboot to recover the DUT from the RO-disk state when a regular sudo reboot fails to do so. However, the do_reboot function may raise pytest_ansible.errors.AnsibleConnectionFailure, which is not handled. In that case, the PDU reboot part cannot be executed and the DUT cannot be recovered.
In this PR, I enhance the test case to ensure the PDU reboot is always executed when the regular reboot fails.

How did you do it?
Handle pytest_ansible.errors.AnsibleConnectionFailure in the do_reboot function.
Add a try-except block around do_reboot to ensure that no matter what exception is raised, the PDU reboot can always be executed to recover the DUT.

How did you verify/test it?
Verified by running test_ro_disk on Nokia-7215 testbeds.
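The hardening described in this follow-up can be sketched as below. This is a hedged illustration with hypothetical names, not the actual test_ro_disk code: the key point is the broad except around do_reboot so that any exception, including connection failures raised by the Ansible layer, still falls through to the PDU power-cycle.

```python
# Sketch of the try-except hardening: whatever do_reboot raises
# (including connection-level errors such as AnsibleConnectionFailure),
# the PDU reboot still runs to restore the disk to RW state.
def recover_with_fallback(do_reboot, pdu_reboot, log=print):
    try:
        do_reboot()
        return True  # regular reboot succeeded, no PDU action needed
    except Exception as exc:  # broad on purpose: never skip the fallback
        log("Failed to reboot DUT, try PDU reboot to restore "
            "disk RW state: %s" % exc)
    pdu_reboot()
    return False
```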
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Sep 1, 2024
…oot (sonic-net#14343)

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Sep 1, 2024
…oot (sonic-net#14343)

mssonicbld pushed a commit that referenced this pull request Sep 1, 2024
…oot (#14343)

mssonicbld pushed a commit that referenced this pull request Sep 2, 2024
…oot (#14343)

hdwhdw pushed a commit to hdwhdw/sonic-mgmt that referenced this pull request Sep 20, 2024
…oot (sonic-net#14343)

arista-hpandya pushed a commit to arista-hpandya/sonic-mgmt that referenced this pull request Oct 2, 2024
…sn't work (sonic-net#13974)

arista-hpandya pushed a commit to arista-hpandya/sonic-mgmt that referenced this pull request Oct 2, 2024
…oot (sonic-net#14343)

vikshaw-Nokia pushed a commit to vikshaw-Nokia/sonic-mgmt that referenced this pull request Oct 23, 2024
…sn't work (sonic-net#13974)

vikshaw-Nokia pushed a commit to vikshaw-Nokia/sonic-mgmt that referenced this pull request Oct 23, 2024
…oot (sonic-net#14343)
