Add more retry times when upgrade image #6810

Closed
ZhaohuiS wants to merge 2 commits into sonic-net:master from ZhaohuiS:fix/add_retry_times_upgrade_image

ZhaohuiS (Contributor) commented Nov 11, 2022

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205

Approach

What is the motivation for this PR?

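The image upgrade may fail because it can take more than 5 minutes.
For example, upgrading a Mellanox testbed with the master image sometimes takes more than 5 minutes.
It is not clear why the default timeout is 5 minutes.
In the log below, the task starts at 11:14:05 and the host is reported unreachable at 11:19:06, about 5 minutes later: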

2022-11-11T11:14:05.9143210Z Friday 11 November 2022  11:14:05 +0000 (0:00:00.026)       0:00:01.760 ******* 
2022-11-11T11:19:06.4983686Z fatal: [str-msn2700-02]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to 10.3.147.45 closed.", "unreachable": true}

How did you do it?

Per the Ansible documentation on asynchronous actions:
https://docs.ansible.com/ansible/latest/user_guide/playbooks_async.html#:~:text=If%20you%20want%20to%20run,longer%20than%20its%20async%20value
Use async and poll to run the upgrade script asynchronously, with the timeout set to 500s.
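
For illustration only, a minimal sketch of what a task using async and poll could look like. The task name, the script path, and the poll interval are hypothetical placeholders and are not taken from upgrade_sonic.yml; only the 500s async value comes from the description above.

```yaml
# Illustrative sketch, not the actual change in this PR.
# The task name and script path below are hypothetical placeholders.
- name: Upgrade image asynchronously (example)
  ansible.builtin.shell: /tmp/upgrade_image.sh   # placeholder script path
  async: 500   # maximum runtime in seconds before Ansible gives up on the job
  poll: 10     # assumed interval: check the job status every 10 seconds
  register: upgrade_result   # hypothetical variable name
```

With poll set to a positive value the playbook still waits for the task to finish, but Ansible starts the job in the background on the host and checks on it periodically instead of holding one connection open for the whole run.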

How did you verify/test it?

Run ansible-playbook upgrade_sonic.yml

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
wangxin (Collaborator) commented Nov 11, 2022

If increasing the retries from 5 to 10 can work around this, it indicates that the upgrade has a chance of completing within 5 minutes when we retry more times.

The more fundamental issue appears to be that if any Ansible module needs more than 5 minutes to complete, Ansible forcibly terminates the task before it finishes. This sounds like a new fundamental issue.

So I don't think simply increasing the number of retries is the correct way to fix this issue.

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
ZhaohuiS (Contributor, Author) commented

The root cause was introduced by sonic-net/sonic-buildimage#12109, which changed the default SSH timeout from 15 minutes to 5 minutes.
Closing this PR, as the issue will be addressed by a different fix.

ZhaohuiS closed this Nov 17, 2022