[action] [PR:21407] Workaround for "dead worker" after docker-sonic-mgmt upgrade#21414

Merged
StormLiangMS merged 1 commit into sonic-net:202505 from mssonicbld:cherry/202505/21407
Nov 25, 2025

Conversation

@mssonicbld
Collaborator

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
  • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

After the docker-sonic-mgmt image is upgraded, some tests failed while creating the `nbrhosts` fixture with the error below:

```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image upgrades Ansible from 2.13 to 2.18. While creating the `nbrhosts` fixture, a thread pool is used to speed up initializing a large number of neighbors. For the t0-sonic topology, SONiC VMs are used as neighbors, so the fixture needs to initialize multiple `SonicHost` objects. In `SonicHost.__init__`, parallel threads are used again to speed up running multiple fact-gathering commands on the device. Ansible 2.18 does not handle this nested parallelism properly; possibly a task queue manager is checking the state of workers belonging to task queue managers created by other threads.
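To make the nested parallelism concrete, here is a minimal sketch of the pattern described above. The function and command names are illustrative stand-ins, not the actual sonic-mgmt code: `run_command` represents an Ansible module call (each of which spins up its own TaskQueueManager with worker processes), the inner pool represents `SonicHost.__init__` fact gathering, and the outer pool represents the `nbrhosts` fixture.

```python
from concurrent.futures import ThreadPoolExecutor

def run_command(host, cmd):
    # Stand-in for an Ansible module call; in the real framework each call
    # creates its own TaskQueueManager with worker processes.
    return f"{host}:{cmd}"

def gather_facts(host):
    # Inner parallelism: analogous to SonicHost.__init__ running several
    # fact-gathering commands concurrently on one device.
    cmds = ["show version", "show interfaces status", "show ip bgp summary"]
    with ThreadPoolExecutor(max_workers=len(cmds)) as inner:
        return list(inner.map(lambda c: run_command(host, c), cmds))

def init_nbrhosts(vm_names):
    # Outer parallelism: analogous to the nbrhosts fixture initializing many
    # neighbors at once, so multiple TaskQueueManagers can be alive in
    # different threads simultaneously.
    with ThreadPoolExecutor(max_workers=8) as outer:
        return dict(zip(vm_names, outer.map(gather_facts, vm_names)))
```

With Ansible 2.13 this layered fan-out worked; with 2.18 the dead-worker check in one TaskQueueManager can apparently trip over workers owned by another thread's manager.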

How did you do it?

Because of this issue, PR testing fails easily. To stop the bleeding and unblock PR testing, this change adds a threading lock around neighbor host initialization. I'll try to dig deeper to find the exact root cause and come back with a better fix.
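A minimal sketch of the lock workaround, with hypothetical names (the real fixture and class structure in sonic-mgmt differ): a module-level `threading.Lock` serializes the Ansible-heavy part of each `SonicHost` initialization, so only one thread drives a TaskQueueManager at a time while the rest of the fixture setup still overlaps.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module-level lock guarding neighbor host initialization.
_init_lock = threading.Lock()

class SonicHost:
    def __init__(self, name):
        # Serialize the critical section so only one thread at a time runs
        # the Ansible-backed initialization.
        with _init_lock:
            self.name = name
            self.facts = self._gather_facts()

    def _gather_facts(self):
        # Placeholder for the parallel command execution on the device.
        return {"hostname": self.name}

def init_nbrhosts(vm_names):
    # The fixture keeps its thread pool; the lock simply makes the
    # initialization critical section effectively sequential.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return {h.name: h for h in pool.map(SonicHost, vm_names)}
```

This trades some of the fixture's parallel speedup for stability until the underlying Ansible 2.18 interaction is root-caused.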

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image upgrades Ansible from 2.13 to 2.18.
While creating the `nbrhosts` fixture, a thread pool is used to speed up initializing a large number of neighbors. For the t0-sonic topology, SONiC VMs are used as neighbors, so the fixture needs to initialize multiple `SonicHost` objects. In `SonicHost.__init__`, parallel threads are used again to speed up running multiple fact-gathering commands on the device.
Ansible 2.18 does not handle this nested parallelism properly; possibly a task queue manager is checking the state of workers belonging to task queue managers created by other threads.

Because of this issue, PR testing fails easily. To stop the bleeding and unblock PR testing, this change adds a threading lock around neighbor host initialization.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
@mssonicbld
Collaborator Author

Original PR: #21407

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS StormLiangMS merged commit 60fa628 into sonic-net:202505 Nov 25, 2025
11 of 17 checks passed
justin-wong-ce pushed a commit to justin-wong-ce/sonic-mgmt that referenced this pull request Nov 25, 2025
justin-wong-ce pushed a commit to justin-wong-ce/sonic-mgmt that referenced this pull request Nov 25, 2025
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
…tomatically (sonic-net#21414)

[submodule] Update submodule sonic-linux-kernel to the latest HEAD automatically

3 participants