
[202311][test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work #14011

Merged
StormLiangMS merged 2 commits into sonic-net:202311 from lizhijianrd:backport-202311-test-ro-disk-240807 on Aug 8, 2024

Conversation

@lizhijianrd
Contributor

Backport #13974 and #12732 to 202311.

What is the motivation for this PR?

On some platforms, the DUT cannot be recovered from the read-only (RO) disk state by reboot (e.g., on Nokia-7215 we saw the reboot blocked by systemd-journald.service). To avoid leaving the DUT stuck in the RO disk state, this PR introduces a power-cycle as the final approach to recover the DUT.

How did you do it?

If reboot fails to recover the DUT from the RO disk state, try a power-cycle to recover the DUT.
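
Conceptually, the recovery flow looks like the sketch below. This is a hedged illustration rather than the actual sonic-mgmt code: the callables reboot_dut, dut_is_up, and power_cycle_dut stand in for harness helpers, and the retry count mirrors the attempt:N/3 counter visible in the test logs further down.

```python
import logging
import time

MAX_REBOOT_RETRIES = 3  # mirrors the "attempt:N/3" counter in the logs below


def recover_dut_to_rw(reboot_dut, dut_is_up, power_cycle_dut, hostname):
    """Try to reboot up to MAX_REBOOT_RETRIES times; if the DUT never comes
    back, fall back to a PDU power-cycle as the last resort.

    All arguments are illustrative callables/values supplied by the caller;
    this is not the sonic-mgmt API.
    """
    for attempt in range(MAX_REBOOT_RETRIES):
        try:
            reboot_dut()
        except Exception as err:
            logging.error("DUT did not go down, exception: %s attempt:%d/%d",
                          err, attempt, MAX_REBOOT_RETRIES)
            continue
        if dut_is_up():
            return True
    logging.warning("Failed to reboot %s, try PDU reboot to restore disk RW state",
                    hostname)
    power_cycle_dut()  # PDU outlets off, short pause, outlets back on
    time.sleep(60)     # allow the DUT time to boot after the power loss
    return dut_is_up()
```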

How did you verify/test it?

Verified on a Nokia-7215 M0 testbed; the test passed with the logs below:

tacacs/test_ro_disk.py::test_ro_disk[dut-7215-4]
-------------------------------------------------------------------------------- live log call --------------------------------------------------------------------------------
10:02:17 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:0/3
10:04:02 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:1/3
10:05:24 test_ro_disk.do_reboot                   L0089 ERROR  | DUT did not go down, exception: run module command failed, Ansible Results =>
{"failed": true, "msg": "Timeout (62s) waiting for privilege escalation prompt: "} attempt:2/3
10:05:44 test_ro_disk.do_reboot                   L0095 ERROR  | Failed to reboot DUT after 3 retries
10:05:44 test_ro_disk.test_ro_disk                L0262 WARNING| Failed to reboot dut-7215-4, try PDU reboot to restore disk RW state
PASSED                                                                                                                                                                  [100%]
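
For reference, one way to confirm that the root filesystem is back in RW state is to inspect the mount options in /proc/mounts. The sketch below is an assumption-laden illustration, not code from this PR; run_cmd is a hypothetical callable that executes a shell command on the DUT and returns its stdout.

```python
def root_fs_is_rw(run_cmd):
    """Return True if "/" is mounted read-write on the DUT.

    run_cmd is a hypothetical helper (e.g., an Ansible wrapper in a real
    harness) that runs a shell command and returns its stdout as a string.
    """
    for line in run_cmd("cat /proc/mounts").splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, ...
        if len(fields) >= 4 and fields[1] == "/":
            return "rw" in fields[3].split(",")
    return False
```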

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework (new/improvement)
  • Test case (new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405


yutongzhang-microsoft and others added 2 commits August 7, 2024 05:21
What is the motivation for this PR?
To adapt to the KVM testbed: there are two issues in the previous pdu_controller __init__.py script:
* conn_graph_facts carries no data on a KVM testbed, and fetching values with conn_graph_facts["xxx"] generates a KeyError.
* inv_mgr has no method called get_host_list.

This PR fixes these issues:
* Use the dict get method on conn_graph_facts to avoid the KeyError, defaulting to {} when the key does not exist.
* The code in the branch if hostname not in device_pdu_links or hostname not in device_pdu_info is unnecessary: we use conn_graph_facts to get the PDU links and PDU info first, so reaching that branch means the host does not exist in the CSV file and there is no need to pull the info from the inventory. This PR removes the code in that branch.

How did you do it?
* Use the dict get method on conn_graph_facts to avoid the KeyError, defaulting to {} when the key does not exist (see the sketch after this list).
* Remove the code in the branch if hostname not in device_pdu_links or hostname not in device_pdu_info.
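
A minimal sketch of the .get() change described above, assuming only what the commit message states: the key names device_pdu_links and device_pdu_info come from that description, and the surrounding code is illustrative rather than the actual pdu_controller source.

```python
# On a KVM testbed, conn_graph_facts may carry no PDU entries at all.
conn_graph_facts = {}

# Before: direct indexing raises KeyError when the key is missing.
# device_pdu_links = conn_graph_facts["device_pdu_links"]

# After: .get() with a {} default, so later lookups simply find nothing
# instead of crashing.
device_pdu_links = conn_graph_facts.get("device_pdu_links", {})
device_pdu_info = conn_graph_facts.get("device_pdu_info", {})

# With empty dicts, "hostname not in device_pdu_links" is trivially True,
# which is why the inventory-fallback branch could be removed.
```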
[test_ro_disk] Recover DUT to RW state by power-cycle when reboot doesn't work (sonic-net#13974)

@lizhijianrd
Contributor Author

@yxieca Can you please help merge this backport PR? Thanks!

Collaborator

@StormLiangMS left a comment


LGTM

@StormLiangMS merged commit ccdd916 into sonic-net:202311 on Aug 8, 2024
@lizhijianrd deleted the backport-202311-test-ro-disk-240807 branch on August 8, 2024, 06:21