Conversation

@zzstoatzz
Collaborator

@zzstoatzz zzstoatzz commented Oct 7, 2025

Summary

Fixes a deadlock that occurs on Linux when using multiprocessing after prefect_test_harness().

Issue

On Linux, multiprocessing.Process() defaults to fork(), which copies all process state including locks from background threads. When prefect_test_harness() exits, background threads (WorkerThread, EventLoopThread, and services like APILogWorker and EventsWorker) remain alive with their locks. When fork() copies these locks into the child process, the locks remain in inconsistent states (held by threads that don't exist in the child), causing deadlocks.
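
The inherited-lock problem described above can be reproduced without Prefect. The following is a minimal sketch, assuming a POSIX system where `os.fork()` is available; `child_sees_lock_held` and `hold_lock` are hypothetical names used only for illustration. A background thread holds a plain `threading.Lock`, the process forks, and the child finds the lock still held even though the owning thread does not exist there. A timeout is used so the sketch reports the condition instead of hanging.

```python
import os
import threading


def child_sees_lock_held() -> bool:
    """Fork while a background thread holds a lock; report what the child sees."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def hold_lock() -> None:
        # Background thread takes the lock and keeps it until told to let go.
        with lock:
            holding.set()
            release.wait()

    thread = threading.Thread(target=hold_lock, daemon=True)
    thread.start()
    holding.wait()  # make sure the lock is held before forking

    read_fd, write_fd = os.pipe()
    # Python 3.12 and later emit a DeprecationWarning here, because forking a
    # multi-threaded process is exactly the hazard being demonstrated.
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied, but the lock was copied
        # in its held state and no thread here will ever release it.
        acquired = lock.acquire(timeout=0.5)
        os.write(write_fd, b"0" if acquired else b"1")
        os._exit(0)

    # Parent: collect the child's report, then clean up the helper thread.
    os.close(write_fd)
    report = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    release.set()
    thread.join()
    return report == b"1"
```

Without the timeout, the child's `lock.acquire()` would block forever, which matches the hang reported in the linked issue.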

Solution

Registers os.register_at_fork() handlers that reset thread and service state in child processes after fork(), ensuring clean state without inherited locks from dead threads.
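
As a rough illustration of this approach (not the code from this PR), the sketch below registers an `after_in_child` handler with `os.register_at_fork()` that replaces a lock and clears a flag in the child. The `Service` class and `child_has_clean_state` helper are hypothetical stand-ins; only `os.register_at_fork` and the `reset_for_fork` method name come from the PR text.

```python
import os
import threading


class Service:
    """Hypothetical stand-in for a background service guarded by a lock."""

    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.started = False

    def reset_for_fork(self) -> None:
        # Replace the lock rather than releasing it: the copy in the child may
        # be held by a thread that was not carried across fork().
        self.lock = threading.Lock()
        self.started = False


service = Service()

# Runs in the child process only, immediately after fork() returns there.
os.register_at_fork(after_in_child=service.reset_for_fork)


def child_has_clean_state() -> bool:
    """Fork while the lock is held and check that the child starts clean."""
    service.lock.acquire()
    service.started = True
    try:
        pid = os.fork()
        if pid == 0:
            # Child: the handler has already swapped in a fresh, unlocked lock.
            ok = service.lock.acquire(blocking=False) and not service.started
            os._exit(0 if ok else 1)
        _, status = os.waitpid(pid, 0)
    finally:
        # Parent state is untouched by the handler, so release the original lock.
        service.lock.release()
    return os.WEXITSTATUS(status) == 0
```

The handler does not affect the parent, so services that are still running there keep their locks and state.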

Changes

  • Add fork handlers to prefect._internal.concurrency.threads module to reset WorkerThread and EventLoopThread state
  • Add fork handlers to prefect._internal.concurrency.services module to reset QueueService instances
  • Add regression test in tests/testing/test_utilites.py that reproduces the reported issue

Test Notes

The regression test uses os._exit(0) in the worker function despite the underscore prefix. This is necessary because:

  • On Linux with fork(), the child process inherits Prefect's logging/event state
  • Normal exit (return or sys.exit()) triggers Python cleanup that fails with inherited state, causing exitcode=1
  • os._exit() bypasses cleanup and is documented for use "in the child process after os.fork()" - which is exactly this scenario
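
To make the "bypasses cleanup" point concrete, here is a small sketch, assuming a POSIX system with `os.fork()`; `child_cleanup_output` is a hypothetical helper. The child registers an `atexit` handler and wraps its exit in `try`/`finally`, then calls `os._exit(0)`. Neither cleanup path runs, so nothing is written to the pipe.

```python
import atexit
import os


def child_cleanup_output() -> tuple[bytes, int]:
    """Return whatever the child's cleanup code wrote, plus its exit status."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        # Registered in the child only; would run on a normal interpreter exit.
        atexit.register(os.write, write_fd, b"atexit ran")
        try:
            os._exit(0)  # exits immediately: no atexit handlers, no finally
        finally:
            os.write(write_fd, b"finally ran")

    # Parent: read until EOF, which arrives once the child has exited.
    os.close(write_fd)
    output = os.read(read_fd, 1024)
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    return output, os.WEXITSTATUS(status)
```

This is also why `os._exit()` is a blunt tool: any cleanup the child genuinely needs (flushing buffers, for example) is skipped as well.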

Test Results

  • All existing concurrency tests pass (171 tests)
  • Original issue reproduction now passes on both macOS and Linux
  • macOS: Tests that previously passed still pass
  • Linux: Tests that hung indefinitely now pass in ~12s

Closes #19112

🤖 Generated with Claude Code

@github-actions github-actions bot added the bug Something isn't working label Oct 7, 2025
@codspeed-hq

codspeed-hq bot commented Oct 7, 2025

CodSpeed Performance Report

Merging #19116 will not alter performance

Comparing fix/multiprocessing-after-test-harness (595af27) with main (ee25381)

Summary

✅ 2 untouched

zzstoatzz and others added 4 commits October 7, 2025 13:25
This PR fixes a deadlock that occurs on Linux when using multiprocessing
after `prefect_test_harness()`. The issue was that background threads
(WorkerThread, EventLoopThread, and QueueServices like APILogWorker and
EventsWorker) with locks remain alive when the test harness exits. On
Linux, `multiprocessing.Process()` defaults to `fork()`, which copies
these lock states into the child process, causing deadlocks.

The fix registers `os.register_at_fork()` handlers that reset thread
and service state in child processes after fork(), ensuring clean state
without inherited locks.

Changes:
- Add fork handlers to `prefect._internal.concurrency.threads` module
- Add fork handlers to `prefect._internal.concurrency.services` module
- Add smoke tests to verify fork handlers are registered

Closes #19112

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Move fork reset logic into class methods to avoid accessing protected members
- Consolidate thread tracking into single WeakSet
- Remove broad exception handling in favor of cleaner class methods
- All type errors fixed, all tests still pass
Remove underscore prefix from reset_for_fork methods to avoid
reportPrivateUsage errors when calling from module-level fork handlers.

Verified with: uv run pyright -p pyrightconfig-ci.json --level error
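
A minimal sketch of the pattern these two commits describe, under stated assumptions: `TrackedWorker` and `reset_all_for_fork` are hypothetical names, and the real `WorkerThread` and `EventLoopThread` classes certainly hold more state than this. The idea is that instances register themselves in a single `weakref.WeakSet`, each exposes a public `reset_for_fork()` method, and a module-level fork handler calls into the class instead of touching private attributes.

```python
import os
import threading
import weakref


class TrackedWorker:
    """Hypothetical stand-in for WorkerThread / EventLoopThread."""

    # Weak references, so tracking never keeps a finished worker alive.
    _instances: "weakref.WeakSet[TrackedWorker]" = weakref.WeakSet()

    def __init__(self, name: str) -> None:
        self.name = name
        self._lock = threading.Lock()
        self._started = False
        TrackedWorker._instances.add(self)

    def start(self) -> None:
        with self._lock:
            self._started = True

    @property
    def started(self) -> bool:
        return self._started

    def reset_for_fork(self) -> None:
        # Public name, so a module-level handler can call it without
        # reaching into private members.
        self._lock = threading.Lock()
        self._started = False

    @classmethod
    def reset_all_for_fork(cls) -> None:
        # Copy to a list first: the WeakSet can shrink during iteration.
        for instance in list(cls._instances):
            instance.reset_for_fork()


# Module-level registration; the handler runs only in forked children.
os.register_at_fork(after_in_child=TrackedWorker.reset_all_for_fork)
```

Calling `TrackedWorker.reset_all_for_fork()` directly has the same effect the handler would have in a child, which makes the reset logic testable without forking.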
@zzstoatzz zzstoatzz force-pushed the fix/multiprocessing-after-test-harness branch from 9a04c62 to 6ac8315 Compare October 7, 2025 18:25
zzstoatzz and others added 2 commits October 7, 2025 13:48
…eadlock

Moved the test from internal concurrency tests to testing utilities where it
belongs, and replaced the useless smoke tests with a real test that reproduces
the actual issue: using multiprocessing.Process() after prefect_test_harness().

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
On Linux with fork(), the child process inherits Prefect's logging and event
state. When the worker function returns normally, Python's cleanup tries to
flush logs/close connections, which can fail since those resources are in an
inconsistent state after fork. Using os._exit(0) bypasses cleanup entirely.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Comment on lines 19 to 36
def _multiprocessing_worker():
    """
    Worker function for multiprocessing test. Must be at module level for pickling.
    Explicitly exits with code 0 to avoid any Python cleanup issues in the forked process.
    """
    import os
    import sys

    try:
        # Explicitly exit with success code instead of relying on return value cleanup
        os._exit(0)
    except Exception as e:
        print(f"Worker error: {e}", file=sys.stderr)
        import traceback

        traceback.print_exc()
        os._exit(1)
Collaborator Author

i don't know if this is a good thing ™️ to do

@zzstoatzz
Collaborator Author

zzstoatzz commented Oct 7, 2025

this does appear to solve the issue... but it feels rather hacky. which might be fine because it's only a testing util. but it might be better interpreted as a signal that we should make prefect_test_harness less hacky

zzstoatzz and others added 3 commits October 7, 2025 15:05
multiprocessing's _bootstrap() explicitly handles SystemExit raised by
sys.exit(), making it the public API for setting exit codes. This avoids
using the private os._exit() function.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
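
The `SystemExit` handling mentioned in the commit message above can be observed with a short sketch. The names `worker` and `run_and_get_exitcode` are hypothetical, and the fork start method is requested explicitly because the default depends on platform and Python version.

```python
import multiprocessing
import sys


def worker() -> None:
    # sys.exit() raises SystemExit, which multiprocessing catches in the
    # child and turns into the process exit code.
    sys.exit(3)


def run_and_get_exitcode() -> int:
    # Assumes a POSIX system where the "fork" start method is available.
    ctx = multiprocessing.get_context("fork")
    process = ctx.Process(target=worker)
    process.start()
    process.join()
    return process.exitcode
```

Unlike `os._exit()`, this path still runs multiprocessing's normal shutdown steps in the child, which is the trade-off the next commit in this PR revisits.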
On Linux with fork(), the child process inherits Prefect's logging/event
state. Normal exit (return or sys.exit) triggers Python cleanup that fails
with this inherited state, causing exitcode=1. os._exit() bypasses cleanup
and is documented for use "in the child process after os.fork()".

Despite the underscore prefix, this is the intended API for this scenario.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@github-actions
Contributor

This pull request is stale because it has been open 14 days with no activity. To keep this pull request open remove stale label or comment.

@zzstoatzz zzstoatzz marked this pull request as ready for review October 30, 2025 20:14
Member

@desertaxle desertaxle left a comment

Makes sense to me!

Comment on lines +49 to +50
# Might fail in certain contexts (e.g., if already in a child process)
pass
Member

Can we add a debug log here for breadcrumbs in case this is ever an issue?

Collaborator Author

oops, i missed this, will add in a follow up PR

T = TypeVar("T", infer_variance=True)

# Track all active instances for fork handling
_active_instances: weakref.WeakSet[WorkerThread | EventLoopThread] = weakref.WeakSet()
Member

Good use of weakref

@zzstoatzz zzstoatzz merged commit 79232b9 into main Oct 30, 2025
59 checks passed
@zzstoatzz zzstoatzz deleted the fix/multiprocessing-after-test-harness branch October 30, 2025 22:12
zzstoatzz added a commit that referenced this pull request Oct 31, 2025
closes #19116

this PR adds a debug log when fork handler registration fails, as requested in review feedback from the original PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
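
A hedged sketch of what such a follow-up could look like. This is not the code from the follow-up commit: `register_fork_handler` and `reset_state_in_child` are hypothetical names, and catching `Exception` is an assumption, since the conversation does not show which exception types the real handler guards against.

```python
import logging
import os

logger = logging.getLogger(__name__)


def reset_state_in_child() -> None:
    """Hypothetical placeholder for the real reset logic."""


def register_fork_handler() -> bool:
    """Register the fork handler, leaving a debug breadcrumb on failure."""
    try:
        os.register_at_fork(after_in_child=reset_state_in_child)
    except Exception:
        # Assumption: the exception types seen in practice are not shown in
        # the PR. One known case is AttributeError on platforms where
        # os.register_at_fork does not exist.
        logger.debug("Could not register fork handler", exc_info=True)
        return False
    return True
```

Logging at debug level keeps normal output quiet while still leaving a trace if registration ever fails silently.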
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

multiprocessing after a prefect test hangs

3 participants