[action] [PR:21407] Workaround for "dead worker" after docker-sonic-mgmt upgrade#21414

Merged
StormLiangMS merged 1 commit into sonic-net:202505 from mssonicbld:cherry/202505/21407
Nov 25, 2025

Conversation

@mssonicbld
Collaborator

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
  • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

After the docker-sonic-mgmt image is upgraded, some tests failed while creating the `nbrhosts` fixture with the error below:

```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image upgrades Ansible from 2.13 to 2.18. While creating the `nbrhosts` fixture, a thread pool is used to speed up initializing a large number of neighbors. For the t0-sonic topology, SONiC VMs are used as neighbors, so the fixture needs to initialize multiple `SonicHost` objects. In `SonicHost.__init__`, parallel threads are used again to speed up running multiple fact-gathering commands on the device. Ansible 2.18 does not handle this nested parallelism properly; possibly a task queue manager is checking the state of workers belonging to task queue managers created by other threads.
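To make the nested parallelism concrete, here is a minimal sketch of the pattern described above. The function and command names are illustrative stand-ins, not the actual sonic-mgmt code: `run_command` represents an Ansible module call (each of which spins up its own TaskQueueManager with worker processes), the inner pool represents `SonicHost.__init__` fact gathering, and the outer pool represents the `nbrhosts` fixture.

```python
from concurrent.futures import ThreadPoolExecutor

def run_command(host, cmd):
    # Stand-in for an Ansible module call; in the real framework each call
    # creates its own TaskQueueManager with worker processes.
    return f"{host}:{cmd}"

def gather_facts(host):
    # Inner parallelism: analogous to SonicHost.__init__ running several
    # fact-gathering commands concurrently on one device.
    cmds = ["show version", "show interfaces status", "show ip bgp summary"]
    with ThreadPoolExecutor(max_workers=len(cmds)) as inner:
        return list(inner.map(lambda c: run_command(host, c), cmds))

def init_nbrhosts(vm_names):
    # Outer parallelism: analogous to the nbrhosts fixture initializing many
    # neighbors at once, so multiple TaskQueueManagers can be alive in
    # different threads simultaneously.
    with ThreadPoolExecutor(max_workers=8) as outer:
        return dict(zip(vm_names, outer.map(gather_facts, vm_names)))
```

With Ansible 2.13 this layered fan-out worked; with 2.18 the dead-worker check in one TaskQueueManager can apparently trip over workers owned by another thread's manager.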

How did you do it?

Because of this issue, PR testing fails easily. To stop the bleeding and unblock PR testing, this change adds a threading lock around neighbor host initialization. I'll try to dig deeper to find the exact root cause and come back with a better fix.
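A minimal sketch of the lock workaround, with hypothetical names (the real fixture and class structure in sonic-mgmt differ): a module-level `threading.Lock` serializes the Ansible-heavy part of each `SonicHost` initialization, so only one thread drives a TaskQueueManager at a time while the rest of the fixture setup still overlaps.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module-level lock guarding neighbor host initialization.
_init_lock = threading.Lock()

class SonicHost:
    def __init__(self, name):
        # Serialize the critical section so only one thread at a time runs
        # the Ansible-backed initialization.
        with _init_lock:
            self.name = name
            self.facts = self._gather_facts()

    def _gather_facts(self):
        # Placeholder for the parallel command execution on the device.
        return {"hostname": self.name}

def init_nbrhosts(vm_names):
    # The fixture keeps its thread pool; the lock simply makes the
    # initialization critical section effectively sequential.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return {h.name: h for h in pool.map(SonicHost, vm_names)}
```

This trades some of the fixture's parallel speedup for stability until the underlying Ansible 2.18 interaction is root-caused.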

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image upgrades Ansible from 2.13 to 2.18.
While creating the `nbrhosts` fixture, a thread pool is used to speed up initializing a large number of neighbors. For the t0-sonic topology, SONiC VMs are used as neighbors, so the fixture needs to initialize multiple `SonicHost` objects. In `SonicHost.__init__`, parallel threads are used again to speed up running multiple fact-gathering commands on the device.
Ansible 2.18 does not handle this nested parallelism properly; possibly a task queue manager is checking the state of workers belonging to task queue managers created by other threads.

Because of this issue, PR testing fails easily. To stop the bleeding and unblock PR testing, this change adds a threading lock around neighbor host initialization.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
@mssonicbld
Collaborator Author

Original PR: #21407

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS StormLiangMS merged commit 60fa628 into sonic-net:202505 Nov 25, 2025
11 of 17 checks passed
justin-wong-ce pushed a commit to justin-wong-ce/sonic-mgmt that referenced this pull request Nov 25, 2025
justin-wong-ce pushed a commit to justin-wong-ce/sonic-mgmt that referenced this pull request Nov 25, 2025
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
…tomatically (sonic-net#21414)

[submodule] Update submodule sonic-linux-kernel to the latest HEAD automatically

3 participants