[test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work #13974
Merged
wangxin merged 1 commit into sonic-net:master on Aug 5, 2024
Blueve approved these changes on Aug 5, 2024
wangxin approved these changes on Aug 5, 2024
Contributor, Author
@yxieca @bingwang-ms Could you please add the approval tags for the 202311 and 202405 branches? Thanks!
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request on Aug 5, 2024

…sn't work (sonic-net#13974)

What is the motivation for this PR?
On some platforms, the DUT cannot be recovered from the RO-disk state by a reboot (e.g., on Nokia-7215, we saw the reboot blocked by systemd-journald.service). To avoid leaving the DUT stuck in the RO-disk state, this PR introduces a power-cycle as the final approach to recover the DUT.

How did you do it?
If the reboot fails to recover the DUT from the RO-disk state, try a power-cycle to recover the DUT.

How did you verify/test it?
Verified on a Nokia-7215 M0 testbed. The test passed with the logs below:

tacacs/test_ro_disk.py::test_ro_disk[dut-7215-4]
-------------------------------- live log call --------------------------------
10:02:17 test_ro_disk.do_reboot L0089 ERROR | DUT did not go down, exception: run module command failed, Ansible Results => {"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:0/3
10:04:02 test_ro_disk.do_reboot L0089 ERROR | DUT did not go down, exception: run module command failed, Ansible Results => {"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:1/3
10:05:24 test_ro_disk.do_reboot L0089 ERROR | DUT did not go down, exception: run module command failed, Ansible Results => {"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:2/3
10:05:44 test_ro_disk.do_reboot L0095 ERROR | Failed to reboot DUT after 3 retries
10:05:44 test_ro_disk.test_ro_disk L0262 WARNING| Failed to reboot dut-7215-4, try PDU reboot to restore disk RW state
PASSED
Collaborator
Cherry-pick PR to 202405: #13985
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request on Aug 5, 2024: …sn't work (sonic-net#13974)
Collaborator
Cherry-pick PR to 202311: #13986
mssonicbld pushed a commit that referenced this pull request on Aug 5, 2024: …sn't work (#13974)
lizhijianrd added a commit to lizhijianrd/sonic-mgmt that referenced this pull request on Aug 7, 2024: …sn't work (sonic-net#13974)
Contributor, Author
Manually backported to 202311 to resolve a PR test issue: #14011
StormLiangMS pushed a commit that referenced this pull request on Aug 8, 2024

…boot doesn't work (#14011)

* Add module `platform_tests/test_kdump.py` into PR test (#12732)

What is the motivation for this PR?
To adapt to the kvm testbed, this PR fixes two issues in the previous pdu_controller __init__.py script:
* conn_graph_facts is None on the kvm testbed, which raises a KeyError when a value is read with conn_graph_facts["xxx"].
* inv_mgr has no function called get_host_list.

How did you do it?
* Use the get method to read values from the conn_graph_facts dict to avoid the KeyError, and default to {} if the key does not exist.
* Remove the code in the branch "if hostname not in device_pdu_links or hostname not in device_pdu_info". That branch is unnecessary: pdu links and pdu info are read from conn_graph_facts first, so a hostname missing from device_pdu_links or device_pdu_info means the host does not exist in the csv file, and there is no need to fall back to the inventory.

* [test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work (#13974)

Co-authored-by: Yutong Zhang
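The conn_graph_facts fix described in the #12732 part of this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name get_pdu_facts is hypothetical, and using "device_pdu_links" and "device_pdu_info" as dictionary keys is an assumption based on the variable names quoted in the message.

```python
from typing import Any, Dict, Optional, Tuple


def get_pdu_facts(
    conn_graph_facts: Optional[Dict[str, Any]], hostname: str
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (pdu_links, pdu_info) for hostname, or two empty dicts.

    Hypothetical helper: the key names below are assumptions.
    """
    # Treat a missing facts object (e.g. on a kvm testbed) as an empty dict.
    facts = conn_graph_facts or {}
    # dict.get avoids a KeyError when the key is absent and defaults to {}.
    device_pdu_links = facts.get("device_pdu_links", {})
    device_pdu_info = facts.get("device_pdu_info", {})
    # A host missing from either mapping is not in the csv file,
    # so there is nothing further to look up.
    if hostname not in device_pdu_links or hostname not in device_pdu_info:
        return {}, {}
    return device_pdu_links[hostname], device_pdu_info[hostname]
```

With this shape, a testbed that provides no connection graph simply yields empty PDU data instead of failing during fixture setup.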
StormLiangMS pushed a commit that referenced this pull request on Sep 1, 2024

…oot (#14343)

What is the motivation for this PR?
In PR #13974, I introduced a PDU reboot to recover the DUT from the RO-disk state when a regular sudo reboot fails to do so. However, the do_reboot function may raise pytest_ansible.errors.AnsibleConnectionFailure, which was not handled. In that case, the PDU reboot is never executed and the DUT cannot be recovered. This PR enhances the test case to ensure the PDU reboot is always executed when the regular reboot fails.

How did you do it?
Handle pytest_ansible.errors.AnsibleConnectionFailure in the do_reboot function. Add a try-except block around do_reboot so that, no matter what exception is raised, the PDU reboot can always be executed to recover the DUT.

How did you verify/test it?
Verified by running test_ro_disk on Nokia-7215 testbeds.
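The two layers of exception handling described in this commit message might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the PR: ConnectionFailure is a local stand-in for pytest_ansible.errors.AnsibleConnectionFailure so the example runs without pytest-ansible, and the function signatures and the restore_rw_state name are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


class ConnectionFailure(Exception):
    """Stand-in for pytest_ansible.errors.AnsibleConnectionFailure (assumption)."""


def do_reboot(run_reboot: Callable[[], None]) -> bool:
    """Return True if the regular reboot succeeded, False on a connection failure."""
    try:
        run_reboot()
        return True
    except ConnectionFailure as exc:
        # First layer: the known failure mode is handled inside do_reboot.
        logger.error("Reboot failed with connection error: %s", exc)
        return False


def restore_rw_state(run_reboot: Callable[[], None], pdu_reboot: Callable[[], None]) -> bool:
    """Try a regular reboot; always fall back to a PDU reboot if it did not succeed."""
    rebooted = False
    try:
        rebooted = do_reboot(run_reboot)
    except Exception as exc:
        # Second layer: any unexpected exception must not skip the PDU fallback.
        logger.warning("do_reboot raised %r, falling back to PDU reboot", exc)
    if not rebooted:
        pdu_reboot()
    return rebooted
```

Catching the broad Exception class is normally discouraged, but here it serves the stated goal: the recovery step has to run regardless of how the regular reboot failed.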
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request on Sep 1, 2024: …oot (sonic-net#14343)
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request on Sep 1, 2024: …oot (sonic-net#14343)
mssonicbld pushed a commit that referenced this pull request on Sep 1, 2024: …oot (#14343)
mssonicbld pushed a commit that referenced this pull request on Sep 2, 2024: …oot (#14343)
hdwhdw pushed a commit to hdwhdw/sonic-mgmt that referenced this pull request on Sep 20, 2024: …oot (sonic-net#14343)
arista-hpandya pushed a commit to arista-hpandya/sonic-mgmt that referenced this pull request on Oct 2, 2024: …sn't work (sonic-net#13974)
arista-hpandya pushed a commit to arista-hpandya/sonic-mgmt that referenced this pull request on Oct 2, 2024: …oot (sonic-net#14343)
vikshaw-Nokia pushed a commit to vikshaw-Nokia/sonic-mgmt that referenced this pull request on Oct 23, 2024: …sn't work (sonic-net#13974)
vikshaw-Nokia pushed a commit to vikshaw-Nokia/sonic-mgmt that referenced this pull request on Oct 23, 2024: …oot (sonic-net#14343)
Description of PR
Summary:
On some platforms, the DUT cannot be recovered from the RO-disk state by a reboot (e.g., on Nokia-7215, we saw the reboot blocked by systemd-journald.service). To avoid leaving the DUT stuck in the RO-disk state, this PR introduces a power-cycle as the final approach to recover the DUT.
Type of change
Back port request
Approach
What is the motivation for this PR?
On some platforms, the DUT cannot be recovered from the RO-disk state by a reboot (e.g., on Nokia-7215, we saw the reboot blocked by systemd-journald.service). To avoid leaving the DUT stuck in the RO-disk state, this PR introduces a power-cycle as the final approach to recover the DUT.
How did you do it?
If the reboot fails to recover the DUT from the RO-disk state, try a power-cycle to recover the DUT.
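The approach above, a bounded number of reboot attempts followed by a power-cycle as the last resort, can be sketched as follows. This is a minimal sketch rather than the test's actual implementation: recover_rw_disk, the two callables, and the use of RuntimeError for a failed reboot are all hypothetical names and assumptions chosen for illustration.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def recover_rw_disk(
    reboot: Callable[[], None],
    power_cycle: Callable[[], None],
    retries: int = 3,
) -> str:
    """Try a regular reboot up to `retries` times, then fall back to a power-cycle.

    Returns "reboot" or "power-cycle" to show which path was taken.
    """
    for attempt in range(retries):
        try:
            reboot()
            return "reboot"
        except RuntimeError as exc:
            # Assumption: a failed reboot surfaces as RuntimeError in this sketch.
            logger.error(
                "DUT did not go down, exception: %s attempt:%d/%d", exc, attempt, retries
            )
    logger.warning("Failed to reboot DUT after %d retries, trying power-cycle", retries)
    power_cycle()
    return "power-cycle"
```

Passing the reboot and power-cycle actions in as callables keeps the retry policy independent of how the testbed actually reaches the DUT or its PDU.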
How did you verify/test it?
Verified on a Nokia-7215 M0 testbed. The test passed; the logs are recorded in the commit message above.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation