fix(metal): per-stream locking for concurrent inference thread safety #3247
Closed
rsnow wants to merge 2 commits into ml-explore:main
added 2 commits on March 12, 2026 15:57
Replace unsynchronized DeviceStream mutable state access with fine-grained per-stream locking to eliminate SIGSEGV/SIGABRT crashes during concurrent Metal inference.

Architecture:
- SubmissionEpoch: RAII wrapper for buffer/encoder/temporaries/sequence
- StreamOpLock: [[nodiscard]] scoped lock with with_fence_state() chaining
- DebugOwner: thread-ID based ownership tracking (replaces buggy try_lock)
- Narrow lock boundaries: host prep outside lock, short lock for Metal only
- Lock ordering enforced by construction: op_mtx -> fence_mtx
- FenceImpl::count and Event::value_ made atomic
- stream_map_ protected by shared_mutex

Tested: 355/355 requests across 3 model architectures (Mamba-2 hybrid, dense transformer, GLM) at up to 20 concurrent streams. Zero crashes.

Fixes: ml-explore#3216, ml-explore#2067, ml-explore#3078
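The "lock ordering enforced by construction" item in the commit message above can be illustrated with a minimal C++ sketch. It reuses the names `StreamOpLock`, `with_fence_state()`, `op_mtx`, and `fence_mtx` from the commit message, but the `StreamState` struct, its fields, and every function body are assumptions made for illustration; this is not the code in the PR.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Hypothetical stand-in for per-stream state; field names are illustrative.
struct StreamState {
  std::mutex op_mtx;      // guards encoder/buffer state
  std::mutex fence_mtx;   // guards fence bookkeeping
  int encoded_ops = 0;    // mutated only while op_mtx is held
  int fence_signals = 0;  // mutated only while op_mtx and fence_mtx are held
};

// Scoped lock over op_mtx. The only path to fence_mtx in this sketch is
// with_fence_state(), so fence_mtx is always acquired after op_mtx.
class [[nodiscard]] StreamOpLock {
 public:
  explicit StreamOpLock(StreamState& s) : state_(s), op_lock_(s.op_mtx) {}

  StreamState& state() { return state_; }

  // Runs fn while holding op_mtx (already held) and then fence_mtx.
  template <typename Fn>
  auto with_fence_state(Fn&& fn) {
    std::lock_guard<std::mutex> fence_lock(state_.fence_mtx);
    return std::forward<Fn>(fn)(state_);
  }

 private:
  StreamState& state_;
  std::unique_lock<std::mutex> op_lock_;  // released when the lock object dies
};

// Example submission path: mutate op state, then signal a fence.
int submit_and_signal(StreamState& s) {
  StreamOpLock lock(s);           // acquires op_mtx
  lock.state().encoded_ops += 1;  // mutation under op_mtx only
  return lock.with_fence_state([](StreamState& st) {
    return ++st.fence_signals;    // mutation under op_mtx, then fence_mtx
  });
}
```

Because no code in the sketch locks `fence_mtx` on its own, two threads cannot take the two mutexes in opposite orders, which is the usual source of lock-order deadlocks.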
Six tests covering:
- Concurrent matmuls on separate streams (4 threads)
- Mixed operation types (matmul, reduction, elementwise, softmax)
- Numerical correctness verification under concurrency
- Sustained pressure (4 threads × 20 iterations)
- Cross-stream data dependencies (shared input)
- High concurrency (8 threads, varied workloads)

Uses deterministic inputs to isolate Metal stream thread safety from unrelated mx.random global state concurrency issues.
Collaborator
I have put some thoughts about thread safety in #3078 (comment). Basically, I think we should not try to achieve thread safety for arrays in different threads, at least not in the first try. For more practical targets (e.g. #3078, which I think is what projects like vllm-mlx or omlx do), we shouldn't need a complex solution like the one in this PR.
Author
Understandable, it is deep and complex. We've had good luck with it locally and will likely continue to run it, so if this gets revisited, we'd be happy to report results from our continued soak.
jundot added a commit to jundot/omlx that referenced this pull request on Mar 18, 2026
pin mlx==0.31.1 in venvstacks to match _mlx_source patch base. add build_patched_libmlx() that builds libmlx.dylib from _mlx_source (per-stream-lock fix for Metal thread safety, see ml-explore/mlx#3247) and replaces it in site-packages while keeping the PyPI metallib intact. enabled by default in both build.py and build_release.py. opt out with --no-mlx-patch for stock PyPI libmlx. closes #300, relates to #173
JianShan-1214 pushed a commit to JianShan-1214/omlx that referenced this pull request on Mar 18, 2026
pin mlx==0.31.1 in venvstacks to match _mlx_source patch base. add build_patched_libmlx() that builds libmlx.dylib from _mlx_source (per-stream-lock fix for Metal thread safety, see ml-explore/mlx#3247) and replaces it in site-packages while keeping the PyPI metallib intact. enabled by default in both build.py and build_release.py. opt out with --no-mlx-patch for stock PyPI libmlx. closes jundot#300, relates to jundot#173
Summary
Replace unsynchronized `DeviceStream` mutable state access with fine-grained per-stream locking to eliminate SIGSEGV/SIGABRT crashes during concurrent Metal inference.

Related issues: #3216, #2067, #3078
Problem
The Metal backend's `DeviceStream` holds mutable state (command buffer, encoder, temporaries, fence counters) that is accessed without synchronization during concurrent GPU evaluations. This causes crashes in Metal's command encoder lifecycle, particularly during multi-stream inference with large models on Apple Silicon.

PR #2104 proposed a global mutex, but that's too coarse: it serializes all GPU work and risks deadlocks.
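To make the contrast with a single global mutex concrete, here is a minimal C++ sketch of per-stream locking. `StreamState`, `StreamTable`, and `run_workers` are hypothetical names invented for this example, and the counter stands in for the real command buffer and encoder state; none of this is taken from the MLX source.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical per-stream state: each stream owns its own mutex.
struct StreamState {
  std::mutex op_mtx;
  long submitted = 0;  // stands in for command buffer/encoder state
};

// Hypothetical stream registry guarded by a reader/writer lock.
class StreamTable {
 public:
  // Returns the state for a stream index, creating it on first use.
  StreamState& get(int index) {
    {
      // Fast path: many threads may look up existing streams at once.
      std::shared_lock<std::shared_mutex> read(map_mtx_);
      auto it = streams_.find(index);
      if (it != streams_.end()) return *it->second;
    }
    // Slow path: exclusive lock only while inserting a new stream.
    std::unique_lock<std::shared_mutex> write(map_mtx_);
    auto& slot = streams_[index];
    if (!slot) slot = std::make_unique<StreamState>();
    return *slot;
  }

 private:
  std::shared_mutex map_mtx_;
  std::unordered_map<int, std::unique_ptr<StreamState>> streams_;
};

// One worker thread per stream; each locks only its own stream's mutex,
// so submissions to different streams never wait on each other.
long run_workers(int num_streams, int iterations) {
  StreamTable table;
  std::vector<std::thread> workers;
  for (int i = 0; i < num_streams; ++i) {
    workers.emplace_back([&table, i, iterations] {
      for (int k = 0; k < iterations; ++k) {
        StreamState& s = table.get(i);
        std::lock_guard<std::mutex> lock(s.op_mtx);
        ++s.submitted;
      }
    });
  }
  for (auto& w : workers) w.join();

  // All workers have joined, so reading without the lock is safe here.
  long total = 0;
  for (int i = 0; i < num_streams; ++i) total += table.get(i).submitted;
  return total;
}
```

With a global mutex, every `++s.submitted` above would contend on the same lock regardless of stream; here contention is limited to threads that share a stream.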
Approach
Three-domain architecture with narrow critical sections:
- `op_mtx`: encoder commands, buffer rotation, temporary tracking. Short hold, only Metal mutations.
- `fence_mtx`: fence signaling and waiting. Separate lock to avoid blocking submitters on sync.

Key design elements:
- `SubmissionEpoch`: RAII struct wrapping buffer/encoder/temporaries/sequence for atomic rotation
- `StreamOpLock`: `[[nodiscard]]` scoped lock with `with_fence_state()` chaining for ordered acquisition
- `DebugOwner`: thread-ID based ownership tracking, replacing the previous `try_lock()` assert (which had UB under POSIX and false negatives under contention)
- Lock ordering enforced by construction: `op_mtx` via `StreamOpLock`, then `fence_mtx` via `with_fence_state()`
- `FenceImpl::count` and `Event::value_` made `std::atomic` for lock-free fast paths
- `stream_map_` protected by `std::shared_mutex` for concurrent reads

Testing
Unit tests (included in this PR)
6 new tests in `python/tests/test_concurrent_eval.py`.
Integration testing (external, on M3 Ultra 256GB)
Files changed (6)
- `mlx/backend/metal/device.h`: `SubmissionEpoch`, `StreamOpLock`, `DebugOwner`, per-stream mutexes
- `mlx/backend/metal/device.cpp`: narrow lock scopes in `new_encoder()`, `commit_command_buffer()`, `end_encoding()`
- `mlx/backend/metal/eval.cpp`: `StreamOpLock` acquisition around Metal submission, host prep outside lock
- `mlx/backend/metal/event.cpp`: atomic `value_` access
- `mlx/backend/metal/fence.cpp`: `fence_mtx` for signal/wait, atomic `FenceImpl::count`
- `mlx/event.h`: `std::atomic` for `Event::value_`

Notes
`mx.random` has a separate thread-safety issue (global PRNG state, unprotected). Unit tests use deterministic inputs to isolate the Metal stream behavior from that unrelated bug.

Checklist
- `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes