
Workaround for "dead worker" after docker-sonic-mgmt upgrade#21407

Merged
StormLiangMS merged 1 commit into sonic-net:master from wangxin:dead-worker-workaround
Nov 25, 2025

Conversation

@wangxin
Collaborator

@wangxin wangxin commented Nov 24, 2025

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

After the docker-sonic-mgmt image was upgraded, some tests failed while creating the `nbrhosts` fixture with the error below:

```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has Ansible upgraded from 2.13 to 2.18. While creating the `nbrhosts` fixture, a thread pool is used to improve the performance of initializing a large number of neighbors. For the t0-sonic topology, SONiC VMs are used as neighbors, so the `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In the `__init__` of `SonicHost`, parallel threads are again used to speed up the execution of multiple commands on the device to gather various facts. Ansible 2.18 is unable to handle this complicated scenario properly: possibly one task queue manager is checking the state of workers belonging to task queue managers created by other threads.

How did you do it?

Because of this issue, PR testing fails easily. To stop the bleeding and unblock PR testing, this change adds a threading lock around initializing neighbor hosts. I'll dig deeper to find the exact root cause and come back with a better fix.
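The shape of the workaround can be sketched as below. This is a hypothetical illustration, not the actual sonic-mgmt code: the lock, function, and dict stand-in for `SonicHost` are all made-up names; only the pattern (a shared lock serializing the Ansible-heavy initialization inside a thread-pool fan-out) reflects the change.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Shared lock serializing neighbor host initialization
# (hypothetical sketch; names do not match the sonic-mgmt code).
_nbr_init_lock = threading.Lock()

def init_neighbor(name):
    # Only one thread at a time runs the Ansible-heavy initialization,
    # so nested task queue managers never execute concurrently.
    with _nbr_init_lock:
        return {"name": name, "facts": "gathered"}  # stand-in for SonicHost(name)

def build_nbrhosts(names):
    # The fixture still fans out over a thread pool; the lock collapses
    # the critical section to one initializer at a time.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(init_neighbor, names))
```

The fan-out structure is kept so the rest of the fixture is untouched; the lock simply trades parallel initialization speed for stability until a proper fix lands.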

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

After the docker-sonic-mgmt image was upgraded, some tests failed while creating the `nbrhosts` fixture with the error below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has Ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, a thread pool is used to improve the performance of initializing a large number of neighbors. For the t0-sonic topology, SONiC VMs are used as neighbors, so the `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In the `__init__` of `SonicHost`, parallel threads are again used to speed up the execution of multiple commands on the device to gather various facts.
Ansible 2.18 is unable to handle this complicated scenario properly: possibly one task queue manager is checking the state of workers belonging to task queue managers created by other threads.

Because of this issue, PR testing fails easily. To stop the bleeding and unblock PR testing, this change adds a threading lock around initializing neighbor hosts.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
@wangxin wangxin requested a review from a team as a code owner November 24, 2025 10:41
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Collaborator

@StormLiangMS StormLiangMS left a comment


LGTM

@StormLiangMS StormLiangMS merged commit 639854a into sonic-net:master Nov 25, 2025
19 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Nov 25, 2025
@mssonicbld
Collaborator

Cherry-pick PR to 202505: #21414

wangxin added a commit to wangxin/sonic-mgmt that referenced this pull request Nov 25, 2025
According to sonic-net#19263, Python 3.12 enforces more rigorous checks around fork() in multi-threaded programs.
After the docker-sonic-mgmt image was upgraded to Ubuntu 24.04, Python and Ansible were upgraded too. With Python 3.12 and Ansible 2.18 in the new docker-sonic-mgmt, the `nbrhosts` fixture, which depends on concurrent.futures, may fail with an error like the one below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced a threading lock to temporarily work around the issue.

A better fix is to use the SafeThreadPoolExecutor introduced in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverts the threading lock of PR sonic-net#21407 and updates the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
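The general idea of a "safe" executor can be sketched as below. This is a hypothetical illustration only: the class name `SafeExecutorSketch` and its locking strategy are assumptions, not the actual SafeThreadPoolExecutor from sonic-net#19263, whose implementation may differ.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class SafeExecutorSketch(ThreadPoolExecutor):
    """Hypothetical sketch of a 'safe' thread pool: each submitted task
    runs under a shared lock so that work which may fork() (such as
    spawning Ansible workers) never races with other pool threads.
    The real SafeThreadPoolExecutor in sonic-mgmt may differ."""

    _fork_lock = threading.Lock()

    def submit(self, fn, *args, **kwargs):
        def guarded():
            # Serialize fork-sensitive work without changing the caller's
            # submit()/result() interface.
            with SafeExecutorSketch._fork_lock:
                return fn(*args, **kwargs)
        return super().submit(guarded)
```

The advantage over a lock in the fixture itself is that callers keep the plain executor interface, and the fork-safety concern lives in one reusable place.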
StormLiangMS pushed a commit that referenced this pull request Nov 25, 2025
StormLiangMS pushed a commit that referenced this pull request Nov 25, 2025
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Nov 25, 2025
justin-wong-ce pushed a commit to justin-wong-ce/sonic-mgmt that referenced this pull request Nov 25, 2025
justin-wong-ce pushed a commit to justin-wong-ce/sonic-mgmt that referenced this pull request Nov 25, 2025
StormLiangMS pushed a commit that referenced this pull request Nov 26, 2025
vikumarks pushed a commit to vikumarks/sonic-mgmt that referenced this pull request Dec 1, 2025
vikumarks pushed a commit to vikumarks/sonic-mgmt that referenced this pull request Dec 1, 2025
@vmittal-msft vmittal-msft added the "Request for 202511 branch" (request to backport a change to the 202511 branch) and "Approved for 202511 branch" labels Dec 4, 2025
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Dec 4, 2025
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Dec 4, 2025
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Dec 16, 2025
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Dec 16, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced a threading lock to temporarily work around the issue.

A better fix is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverts the threading lock of PR sonic-net#21407 and updates the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.
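The real SafeThreadPoolExecutor lives in sonic-mgmt (sonic-net#19263); the sketch below only illustrates the usage pattern the fixture switches to, under the assumption that the executor joins its workers deterministically and surfaces worker exceptions instead of dropping them. `SafeExecutorSketch`, `init_all`, and the host names are illustrative, not the library's API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Illustrative sketch only -- not the actual SafeThreadPoolExecutor.
# It joins workers on exit and never swallows exceptions raised inside
# the with-block, which is the behavior the fixture relies on.
class SafeExecutorSketch(ThreadPoolExecutor):
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.shutdown(wait=True)  # join all workers before leaving the block
        return False              # propagate any in-flight exception

def init_all(names, factory):
    results = {}
    with SafeExecutorSketch(max_workers=8) as pool:
        futures = {pool.submit(factory, n): n for n in names}
        for fut in as_completed(futures):
            # .result() re-raises any exception from the worker thread,
            # so a failed neighbor init fails the fixture loudly.
            results[futures[fut]] = fut.result()
    return results

hosts = init_all(["vm0100", "vm0101"], lambda n: {"name": n})
print(sorted(hosts))  # ['vm0100', 'vm0101']
```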

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Jan 13, 2026
…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, thread pool is used to improve the performance for initializing large number of neighbors. For the t0-sonic topology, sonic VM is used as neighbor. The `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In __init__ of SonicHost, parallel thread is again being used to boost the execution of multiple commands on the device to gather various facts.
For new ansible 2.18, it is not able to handle the complicated scenario properly. Possibly the task queue manager is checking state of workers of other task queue manager created by other threads.

Because of this issue, PR testing easily fail. To stop bleeding and unblock PR testing, this change added a threading lock for initializing neighbor hosts.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Jan 13, 2026
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR sonic-net#21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
yifan-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Jan 14, 2026
…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, thread pool is used to improve the performance for initializing large number of neighbors. For the t0-sonic topology, sonic VM is used as neighbor. The `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In __init__ of SonicHost, parallel thread is again being used to boost the execution of multiple commands on the device to gather various facts.
For new ansible 2.18, it is not able to handle the complicated scenario properly. Possibly the task queue manager is checking state of workers of other task queue manager created by other threads.

Because of this issue, PR testing easily fail. To stop bleeding and unblock PR testing, this change added a threading lock for initializing neighbor hosts.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: YiFan Wang <yifan@nexthop.ai>
yifan-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Jan 14, 2026
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR sonic-net#21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: YiFan Wang <yifan@nexthop.ai>
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Jan 20, 2026
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR sonic-net#21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
mssonicbld pushed a commit that referenced this pull request Jan 20, 2026
According to #19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR #21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in #19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR #21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
PriyanshTratiya pushed a commit to PriyanshTratiya/sonic-mgmt that referenced this pull request Jan 21, 2026
…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, thread pool is used to improve the performance for initializing large number of neighbors. For the t0-sonic topology, sonic VM is used as neighbor. The `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In __init__ of SonicHost, parallel thread is again being used to boost the execution of multiple commands on the device to gather various facts.
For new ansible 2.18, it is not able to handle the complicated scenario properly. Possibly the task queue manager is checking state of workers of other task queue manager created by other threads.

Because of this issue, PR testing easily fail. To stop bleeding and unblock PR testing, this change added a threading lock for initializing neighbor hosts.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
PriyanshTratiya pushed a commit to PriyanshTratiya/sonic-mgmt that referenced this pull request Jan 21, 2026
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR sonic-net#21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
lakshmi-nexthop pushed a commit to lakshmi-nexthop/sonic-mgmt that referenced this pull request Jan 28, 2026
…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, thread pool is used to improve the performance for initializing large number of neighbors. For the t0-sonic topology, sonic VM is used as neighbor. The `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In __init__ of SonicHost, parallel thread is again being used to boost the execution of multiple commands on the device to gather various facts.
For new ansible 2.18, it is not able to handle the complicated scenario properly. Possibly the task queue manager is checking state of workers of other task queue manager created by other threads.

Because of this issue, PR testing easily fail. To stop bleeding and unblock PR testing, this change added a threading lock for initializing neighbor hosts.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai>
lakshmi-nexthop pushed a commit to lakshmi-nexthop/sonic-mgmt that referenced this pull request Jan 28, 2026
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR sonic-net#21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai>
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Jan 29, 2026
…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, thread pool is used to improve the performance for initializing large number of neighbors. For the t0-sonic topology, sonic VM is used as neighbor. The `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In __init__ of SonicHost, parallel thread is again being used to boost the execution of multiple commands on the device to gather various facts.
For new ansible 2.18, it is not able to handle the complicated scenario properly. Possibly the task queue manager is checking state of workers of other task queue manager created by other threads.

Because of this issue, PR testing easily fail. To stop bleeding and unblock PR testing, this change added a threading lock for initializing neighbor hosts.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Jan 29, 2026
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR sonic-net#21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, thread pool is used to improve the performance for initializing large number of neighbors. For the t0-sonic topology, sonic VM is used as neighbor. The `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In __init__ of SonicHost, parallel thread is again being used to boost the execution of multiple commands on the device to gather various facts.
For new ansible 2.18, it is not able to handle the complicated scenario properly. Possibly the task queue manager is checking state of workers of other task queue manager created by other threads.

Because of this issue, PR testing easily fail. To stop bleeding and unblock PR testing, this change added a threading lock for initializing neighbor hosts.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Yael Tzur <ytzur@nvidia.com>
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR sonic-net#21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Yael Tzur <ytzur@nvidia.com>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Feb 6, 2026
…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, thread pool is used to improve the performance for initializing large number of neighbors. For the t0-sonic topology, sonic VM is used as neighbor. The `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In __init__ of SonicHost, parallel thread is again being used to boost the execution of multiple commands on the device to gather various facts.
For new ansible 2.18, it is not able to handle the complicated scenario properly. Possibly the task queue manager is checking state of workers of other task queue manager created by other threads.

Because of this issue, PR testing easily fail. To stop bleeding and unblock PR testing, this change added a threading lock for initializing neighbor hosts.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Feb 6, 2026
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR sonic-net#21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
lakshmi-nexthop pushed a commit to lakshmi-nexthop/sonic-mgmt that referenced this pull request Feb 11, 2026
…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, thread pool is used to improve the performance for initializing large number of neighbors. For the t0-sonic topology, sonic VM is used as neighbor. The `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In __init__ of SonicHost, parallel thread is again being used to boost the execution of multiple commands on the device to gather various facts.
For new ansible 2.18, it is not able to handle the complicated scenario properly. Possibly the task queue manager is checking state of workers of other task queue manager created by other threads.

Because of this issue, PR testing easily fail. To stop bleeding and unblock PR testing, this change added a threading lock for initializing neighbor hosts.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai>
lakshmi-nexthop pushed a commit to lakshmi-nexthop/sonic-mgmt that referenced this pull request Feb 11, 2026
According to sonic-net#19263, python 3.12 enforces more rigorous check around fork() in multiple-threaded programs.
After the docker-sonic-mgmt image is upgraded to Ubuntu 24.04. python and ansible are upgraded too. With python 3.12 and ansible 2.18 in new docker-sonic-mgmt, the nbrhosts fixture depends on concurrent.futures may fail with error like below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced threading lock to temporarily workaround the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverted the threading lock of PR sonic-net#21407 and updated the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request Feb 13, 2026
…et#21407)

After the docker-sonic-mgmt image is upgraded, some tests failed while creating `nbrhosts` fixture with below error:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

The new docker-sonic-mgmt image has ansible upgraded from 2.13 to 2.18.
While creating the `nbrhosts` fixture, a thread pool is used to improve performance when initializing a large number of neighbors. For the t0-sonic topology, sonic VMs are used as neighbors, so the `nbrhosts` fixture needs to initialize multiple `SonicHost` objects. In `SonicHost.__init__`, parallel threads are used again to speed up running multiple commands on the device to gather various facts.
The new ansible 2.18 cannot handle this nested scenario properly; possibly one task queue manager checks the state of workers belonging to another task queue manager created by a different thread.

Because of this issue, PR testing fails frequently. To stop the bleeding and unblock PR testing, this change adds a threading lock around neighbor host initialization.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>