[action] [PR:21423] Fix dead worker issue by using SafeThreadPoolExecutor #21428

Merged
StormLiangMS merged 1 commit into sonic-net:202505 from mssonicbld:cherry/202505/21423
Nov 26, 2025

Conversation

@mssonicbld
Collaborator

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
  • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

According to #19263, Python 3.12 enforces more rigorous checks around fork() in multi-threaded programs. After the docker-sonic-mgmt image was upgraded to Ubuntu 24.04, Python and Ansible were upgraded too. With Python 3.12 and Ansible 2.18 in the new docker-sonic-mgmt image, the nbrhosts fixture, which depends on concurrent.futures, may fail with an error like the one below:

self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state

PR #21407 introduced a threading lock as a temporary workaround for the issue.
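
As a general illustration of what a threading-lock workaround looks like (this is not the code from PR #21407, which is not shown here), the sketch below serializes a section of work that several pool threads would otherwise run concurrently. The names `_init_lock`, `init_host`, `events`, and the host names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module-level lock; not taken from PR #21407.
_init_lock = threading.Lock()
events = []


def init_host(name):
    # Only one thread at a time may run the guarded section, so work that is
    # unsafe to run concurrently is effectively serialized.
    with _init_lock:
        events.append(("start", name))
        events.append(("end", name))
    return name


with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(init_host, ["vm0", "vm1", "vm2", "vm3"]))
```

The cost of this approach is that the guarded section no longer runs in parallel, which is one reason a lock is usually treated as a stopgap.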

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in #19263 to initialize the nbrhosts objects.

How did you do it?

This change reverts the threading lock from PR #21407 and updates the nbrhosts fixture to use the new SafeThreadPoolExecutor.
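
A minimal sketch of the pattern, assuming a fixture that builds one object per neighbor in a thread pool. The standard-library `ThreadPoolExecutor` stands in for SafeThreadPoolExecutor, whose interface is not shown in this PR; `make_neighbor`, `init_neighbors`, and the host names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def make_neighbor(name):
    # Hypothetical stand-in for constructing one neighbor host object.
    return {"name": name, "ready": True}


def init_neighbors(names, executor_cls=ThreadPoolExecutor, max_workers=8):
    """Build one object per neighbor name concurrently, keyed by name."""
    neighbors = {}
    with executor_cls(max_workers=max_workers) as executor:
        # Submit every construction first so they can overlap.
        futures = {name: executor.submit(make_neighbor, name) for name in names}
        for name, future in futures.items():
            # result() re-raises any exception raised in the worker thread.
            neighbors[name] = future.result()
    return neighbors


nbrs = init_neighbors(["ARISTA01T1", "ARISTA02T1"])
```

Passing the executor class as a parameter is only for illustration; it keeps the stand-in swappable for whichever executor the fixture actually uses, provided that class accepts `max_workers` and works as a context manager.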

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

According to sonic-net#19263, Python 3.12 enforces more rigorous checks around fork() in multi-threaded programs.
After the docker-sonic-mgmt image was upgraded to Ubuntu 24.04, Python and Ansible were upgraded too. With Python 3.12 and Ansible 2.18 in the new docker-sonic-mgmt image, the nbrhosts fixture, which depends on concurrent.futures, may fail with an error like the one below:
```
self = <ansible.plugins.strategy.linear.StrategyModule object at 0x7596c07986e0>
iterator = <ansible.executor.play_iterator.PlayIterator object at 0x7596c09b2a80>

    def _wait_on_pending_results(self, iterator):
        '''
        Wait for the shared counter to drop to zero, using a short sleep
        between checks to ensure we don't spin lock
        '''

        ret_results = []

        display.debug("waiting for pending results...")
        while self._pending_results > 0 and not self._tqm._terminated:

            if self._tqm.has_dead_workers():
>               raise AnsibleError("A worker was found in a dead state")
E               ansible.errors.AnsibleError: A worker was found in a dead state
```

PR sonic-net#21407 introduced a threading lock as a temporary workaround for the issue.

A better way to fix the issue is to use the SafeThreadPoolExecutor updated in sonic-net#19263 to initialize the `nbrhosts` objects.

This change reverts the threading lock from PR sonic-net#21407 and updates the `nbrhosts` fixture to use the new SafeThreadPoolExecutor.

Signed-off-by: Xin Wang <[email protected]>
@mssonicbld
Collaborator Author

Original PR: #21423

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS StormLiangMS merged commit f162748 into sonic-net:202505 Nov 26, 2025
13 of 17 checks passed