
[test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work #13974

Merged: wangxin merged 1 commit into sonic-net:master from lizhijianrd:test-ro-disk-pdu-reboot on Aug 5, 2024

Conversation

@lizhijianrd (Contributor) commented Aug 4, 2024

Description of PR

Summary:
On some platforms, the DUT cannot be recovered from the read-only (RO) disk state by reboot (e.g., on Nokia-7215, we saw the reboot blocked by systemd-journald.service). To avoid the DUT getting stuck in the RO disk state, this PR introduces a power-cycle as the final approach to recover the DUT.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

On some platforms, the DUT cannot be recovered from the read-only (RO) disk state by reboot (e.g., on Nokia-7215, we saw the reboot blocked by systemd-journald.service). To avoid the DUT getting stuck in the RO disk state, this PR introduces a power-cycle as the final approach to recover the DUT.

How did you do it?

If the reboot fails to recover the DUT from the RO disk state, try a power-cycle to recover it.
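The flow above can be sketched as follows. This is a minimal, illustrative sketch, not the actual sonic-mgmt code: `reboot` and `power_cycle` here are hypothetical stand-ins for the real `do_reboot` helper and the PDU controller, and the retry count mirrors the 3 attempts visible in the logs below.

```python
# Sketch of the recovery flow this PR adds (names are illustrative,
# not the actual sonic-mgmt helpers): retry the regular reboot a few
# times, and fall back to a PDU power-cycle only after every attempt
# fails.
def recover_dut(reboot, power_cycle, max_retries=3):
    for attempt in range(max_retries):
        try:
            reboot()
            return "reboot"  # regular reboot brought the DUT back
        except Exception as exc:
            print("DUT did not go down, exception: %s attempt:%d/%d"
                  % (exc, attempt, max_retries))
    # Final approach: power-cycle via PDU to restore the disk to RW state.
    power_cycle()
    return "pdu"
```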

How did you verify/test it?

Verified on a Nokia-7215 M0 testbed. The test passed with the logs below:

tacacs/test_ro_disk.py::test_ro_disk[dut-7215-4]
-------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------
10:02:17 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:0/3
10:04:02 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:1/3
10:05:24 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:2/3
10:05:44 test_ro_disk.do_reboot                   L0095 ERROR  | Failed to reboot DUT after 3 retries
10:05:44 test_ro_disk.test_ro_disk                L0262 WARNING| Failed to reboot dut-7215-4, try PDU reboot to restore disk RW state
PASSED                                                                                                                                                                  [100%]

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@wangxin wangxin merged commit 92b4e79 into sonic-net:master Aug 5, 2024
@lizhijianrd lizhijianrd deleted the test-ro-disk-pdu-reboot branch August 5, 2024 02:59
@lizhijianrd (Contributor, Author) commented:
@yxieca @bingwang-ms Can you please help add the approval tag for the 202311 and 202405 branches? Thanks!

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Aug 5, 2024
…sn't work (sonic-net#13974)

@mssonicbld
Collaborator

Cherry-pick PR to 202405: #13985

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Aug 5, 2024
…sn't work (sonic-net#13974)

@mssonicbld
Collaborator

Cherry-pick PR to 202311: #13986

mssonicbld pushed a commit that referenced this pull request Aug 5, 2024
…sn't work (#13974)

lizhijianrd added a commit to lizhijianrd/sonic-mgmt that referenced this pull request Aug 7, 2024
…sn't work (sonic-net#13974)

@lizhijianrd
Contributor Author

Manually backported to 202311 to resolve a PR test issue: #14011

StormLiangMS pushed a commit that referenced this pull request Aug 8, 2024
…boot doesn't work (#14011)

* Add module `platform_tests/test_kdump.py` into PR test (#12732)

What is the motivation for this PR?
To adapt to the KVM testbed, there were two issues in the previous pdu_controller init.py script:
* conn_graph_facts is None on a KVM testbed, which raises KeyError when fetching values via conn_graph_facts["xxx"].
* inv_mgr has no function called get_host_list.

So in this PR, I fix these issues:
* Use the get method on the conn_graph_facts dict to avoid KeyError, defaulting to {} when the key does not exist.
* The branch if hostname not in device_pdu_links or hostname not in device_pdu_info is unnecessary: we use conn_graph_facts to get the PDU links and PDU info first, so reaching that branch means the host does not exist in the csv file and there is no need to fall back to the inventory. The code in that branch is removed in this PR.

How did you do it?
* Use the get method on the conn_graph_facts dict to avoid KeyError, defaulting to {} when the key does not exist.
* Remove the code in the branch if hostname not in device_pdu_links or hostname not in device_pdu_info.
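The dict.get change described above can be sketched like this. The variable names (conn_graph_facts, device_pdu_links, device_pdu_info) mirror the commit message; the surrounding code is an assumption, not the actual pdu_controller source.

```python
# Sketch of the fix: dict.get with a {} default avoids KeyError when
# the connection graph has no entry for a key (e.g. on a KVM testbed
# where conn_graph_facts carries no PDU data).
conn_graph_facts = {}  # e.g. empty on a KVM testbed

# Before the fix, conn_graph_facts["device_pdu_links"] raised KeyError.
device_pdu_links = conn_graph_facts.get("device_pdu_links", {})
device_pdu_info = conn_graph_facts.get("device_pdu_info", {})

hostname = "dut-7215-4"  # hypothetical DUT name for illustration
# With the defaults above, a missing host simply yields no PDU data;
# the inventory fallback branch is no longer needed.
pdu_links = device_pdu_links.get(hostname, {})
pdu_info = device_pdu_info.get(hostname, {})
```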

* [test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work (#13974)


---------

Co-authored-by: Yutong Zhang <[email protected]>
StormLiangMS pushed a commit that referenced this pull request Sep 1, 2024
…oot (#14343)

What is the motivation for this PR?
In PR #13974, I introduced a PDU reboot to recover the DUT from the RO-disk state when a regular sudo reboot fails to do so. However, the do_reboot function may raise pytest_ansible.errors.AnsibleConnectionFailure, which is not handled. In that case, the PDU reboot part cannot be executed and the DUT cannot be recovered.
In this PR, I enhance the test case to ensure the PDU reboot is always executed when the regular reboot fails.

How did you do it?
Handle pytest_ansible.errors.AnsibleConnectionFailure in the do_reboot function.
Add a try-except block around do_reboot to ensure that no matter what exception is raised, the PDU reboot can always be executed to recover the DUT.

How did you verify/test it?
Verified by running test_ro_disk on Nokia-7215 testbeds.
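The hardening described in this follow-up can be sketched as below. This is a hedged illustration with hypothetical names, not the actual test_ro_disk code: the key point is the broad except around do_reboot so that any exception, including connection failures raised by the Ansible layer, still falls through to the PDU power-cycle.

```python
# Sketch of the try-except hardening: whatever do_reboot raises
# (including connection-level errors such as AnsibleConnectionFailure),
# the PDU reboot still runs to restore the disk to RW state.
def recover_with_fallback(do_reboot, pdu_reboot, log=print):
    try:
        do_reboot()
        return True  # regular reboot succeeded, no PDU action needed
    except Exception as exc:  # broad on purpose: never skip the fallback
        log("Failed to reboot DUT, try PDU reboot to restore "
            "disk RW state: %s" % exc)
    pdu_reboot()
    return False
```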
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Sep 1, 2024
…oot (sonic-net#14343)

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Sep 1, 2024
…oot (sonic-net#14343)

mssonicbld pushed a commit that referenced this pull request Sep 1, 2024
…oot (#14343)

mssonicbld pushed a commit that referenced this pull request Sep 2, 2024
…oot (#14343)

hdwhdw pushed a commit to hdwhdw/sonic-mgmt that referenced this pull request Sep 20, 2024
…oot (sonic-net#14343)

arista-hpandya pushed a commit to arista-hpandya/sonic-mgmt that referenced this pull request Oct 2, 2024
…sn't work (sonic-net#13974)

arista-hpandya pushed a commit to arista-hpandya/sonic-mgmt that referenced this pull request Oct 2, 2024
…oot (sonic-net#14343)

vikshaw-Nokia pushed a commit to vikshaw-Nokia/sonic-mgmt that referenced this pull request Oct 23, 2024
…sn't work (sonic-net#13974)

vikshaw-Nokia pushed a commit to vikshaw-Nokia/sonic-mgmt that referenced this pull request Oct 23, 2024
…oot (sonic-net#14343)
