
[202311][test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work #14011

Merged
StormLiangMS merged 2 commits into sonic-net:202311 from lizhijianrd:backport-202311-test-ro-disk-240807 on Aug 8, 2024

Conversation

@lizhijianrd
Contributor

Backport #13974 and #12732 to 202311.

What is the motivation for this PR?

On some platforms, the DUT cannot be recovered from the read-only (RO) disk state by reboot (e.g., on Nokia-7215 we saw the reboot blocked by systemd-journald.service). To avoid leaving the DUT stuck in the RO disk state, this PR introduces a power-cycle as the final approach to recover the DUT.

How did you do it?

If reboot fails to recover the DUT from the RO disk state, try a power-cycle to recover the DUT.
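
Conceptually, the recovery flow looks like the sketch below. This is a hedged illustration rather than the actual sonic-mgmt code: the callables reboot_dut, dut_is_up, and power_cycle_dut stand in for harness helpers, and the retry count mirrors the attempt:N/3 counter visible in the test logs further down.

```python
import logging
import time

MAX_REBOOT_RETRIES = 3  # mirrors the "attempt:N/3" counter in the logs below


def recover_dut_to_rw(reboot_dut, dut_is_up, power_cycle_dut, hostname):
    """Try to reboot up to MAX_REBOOT_RETRIES times; if the DUT never comes
    back, fall back to a PDU power-cycle as the last resort.

    All arguments are illustrative callables/values supplied by the caller;
    this is not the sonic-mgmt API.
    """
    for attempt in range(MAX_REBOOT_RETRIES):
        try:
            reboot_dut()
        except Exception as err:
            logging.error("DUT did not go down, exception: %s attempt:%d/%d",
                          err, attempt, MAX_REBOOT_RETRIES)
            continue
        if dut_is_up():
            return True
    logging.warning("Failed to reboot %s, try PDU reboot to restore disk RW state",
                    hostname)
    power_cycle_dut()  # PDU outlets off, short pause, outlets back on
    time.sleep(60)     # allow the DUT time to boot after the power loss
    return dut_is_up()
```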

How did you verify/test it?

Verified on a Nokia-7215 M0 testbed; the test passed with the logs below:

tacacs/test_ro_disk.py::test_ro_disk[dut-7215-4]
-------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------
10:02:17 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:0/3
10:04:02 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:1/3
10:05:24 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:2/3
10:05:44 test_ro_disk.do_reboot                   L0095 ERROR  | Failed to reboot DUT after 3 retries
10:05:44 test_ro_disk.test_ro_disk                L0262 WARNING| Failed to reboot dut-7215-4, try PDU reboot to restore disk RW state
PASSED                                                                                                                                                                  [100%]
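
For reference, one way to confirm that the root filesystem is back in RW state is to inspect the mount options in /proc/mounts. The sketch below is an assumption-laden illustration, not code from this PR; run_cmd is a hypothetical callable that executes a shell command on the DUT and returns its stdout.

```python
def root_fs_is_rw(run_cmd):
    """Return True if "/" is mounted read-write on the DUT.

    run_cmd is a hypothetical helper (e.g., an Ansible wrapper in a real
    harness) that runs a shell command and returns its stdout as a string.
    """
    for line in run_cmd("cat /proc/mounts").splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, ...
        if len(fields) >= 4 and fields[1] == "/":
            return "rw" in fields[3].split(",")
    return False
```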

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework (new/improvement)
  • Test case (new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405


yutongzhang-microsoft and others added 2 commits August 7, 2024 05:21
What is the motivation for this PR?
To adapt to the KVM testbed: there are two issues in the previous pdu_controller __init__.py script:
* conn_graph_facts carries no data on a KVM testbed, and fetching values with conn_graph_facts["xxx"] generates a KeyError.
* inv_mgr has no method called get_host_list.

This PR fixes these issues:
* Use the dict get method on conn_graph_facts to avoid the KeyError, defaulting to {} when the key does not exist.
* The code in the branch if hostname not in device_pdu_links or hostname not in device_pdu_info is unnecessary: we use conn_graph_facts to get the PDU links and PDU info first, so reaching that branch means the host does not exist in the CSV file and there is no need to pull the info from the inventory. This PR removes the code in that branch.

How did you do it?
* Use the dict get method on conn_graph_facts to avoid the KeyError, defaulting to {} when the key does not exist (see the sketch after this list).
* Remove the code in the branch if hostname not in device_pdu_links or hostname not in device_pdu_info.
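
A minimal sketch of the .get() change described above, assuming only what the commit message states: the key names device_pdu_links and device_pdu_info come from that description, and the surrounding code is illustrative rather than the actual pdu_controller source.

```python
# On a KVM testbed, conn_graph_facts may carry no PDU entries at all.
conn_graph_facts = {}

# Before: direct indexing raises KeyError when the key is missing.
# device_pdu_links = conn_graph_facts["device_pdu_links"]

# After: .get() with a {} default, so later lookups simply find nothing
# instead of crashing.
device_pdu_links = conn_graph_facts.get("device_pdu_links", {})
device_pdu_info = conn_graph_facts.get("device_pdu_info", {})

# With empty dicts, "hostname not in device_pdu_links" is trivially True,
# which is why the inventory-fallback branch could be removed.
```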
[test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work (sonic-net#13974)

@lizhijianrd
Contributor Author

@yxieca Can you please help merge this backport PR? Thanks!

Collaborator

@StormLiangMS left a comment


LGTM

@StormLiangMS merged commit ccdd916 into sonic-net:202311 on Aug 8, 2024
@lizhijianrd deleted the backport-202311-test-ro-disk-240807 branch on August 8, 2024, 06:21