fix(metal): per-stream locking for concurrent inference thread safety #3247

Closed

rsnow wants to merge 2 commits into ml-explore:main from rsnow:fix/per-stream-lock-v2

Conversation

@rsnow rsnow commented Mar 12, 2026

Summary

Replace unsynchronized DeviceStream mutable state access with fine-grained per-stream locking to eliminate SIGSEGV/SIGABRT crashes during concurrent Metal inference.

Related issues: #3216, #2067, #3078

Problem

The Metal backend's DeviceStream holds mutable state (command buffer, encoder, temporaries, fence counters) that is accessed without synchronization during concurrent GPU evaluations. This causes crashes in Metal's command encoder lifecycle, particularly during multi-stream inference with large models on Apple Silicon.

PR #2104 proposed a global mutex but that's too coarse — it serializes all GPU work and risks deadlocks.

Approach

Three-domain architecture with narrow critical sections:

  1. Host preparation (unlocked) — graph traversal, kernel lookup, shape specialization. Pure reads, no stream state mutation.
  2. Stream submission (per-stream op_mtx) — encoder commands, buffer rotation, temporary tracking. Short hold, only Metal mutations.
  3. Dependency state (fence_mtx) — fence signaling and waiting. Separate lock to avoid blocking submitters on sync.

Key design elements:

  • SubmissionEpoch — RAII struct wrapping buffer/encoder/temporaries/sequence for atomic rotation
  • StreamOpLock — [[nodiscard]] scoped lock with with_fence_state() chaining for ordered acquisition
  • DebugOwner — thread-ID based ownership tracking, replacing the previous try_lock() assert (which had UB under POSIX and false negatives under contention)
  • Lock ordering enforced by construction — op_mtx via StreamOpLock, then fence_mtx via with_fence_state()
  • FenceImpl::count and Event::value_ made std::atomic for lock-free fast paths
  • stream_map_ protected by std::shared_mutex for concurrent reads

Testing

Unit tests (included in this PR)

6 new tests in python/tests/test_concurrent_eval.py:

  • Concurrent matmuls on separate streams (4 threads)
  • Mixed operation types (matmul, reduction, elementwise, softmax)
  • Numerical correctness verification under concurrency
  • Sustained pressure (4 threads × 20 iterations)
  • Cross-stream data dependencies (shared input, separate consumer streams)
  • High concurrency (8 threads, varied workloads)

Integration testing (external, on M3 Ultra 256GB)

  • 355/355 requests across 3 model architectures (Mamba-2 hybrid, dense transformer, GLM-4) at up to 20 concurrent streams — zero crashes, zero restarts
  • Ramp test scaling: 3 concurrent → 15s, 6 → 20s, 9 → 24s, 12 → 29s (linear, no cliff)
  • 48-request torture test at 12→16→20 concurrent with 4096 tokens each — clean throughout
  • oMLX server PID unchanged across all test sessions

Files changed (6)

  • mlx/backend/metal/device.h — SubmissionEpoch, StreamOpLock, DebugOwner, per-stream mutexes
  • mlx/backend/metal/device.cpp — narrow lock scopes in new_encoder(), commit_command_buffer(), end_encoding()
  • mlx/backend/metal/eval.cpp — StreamOpLock acquisition around Metal submission, host prep outside lock
  • mlx/backend/metal/event.cpp — atomic value_ access
  • mlx/backend/metal/fence.cpp — fence_mtx for signal/wait, atomic FenceImpl::count
  • mlx/event.h — std::atomic for Event::value_

Notes

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

fang added 2 commits March 12, 2026 15:57
Replace unsynchronized DeviceStream mutable state access with fine-grained
per-stream locking to eliminate SIGSEGV/SIGABRT crashes during concurrent
Metal inference.

Architecture:
- SubmissionEpoch: RAII wrapper for buffer/encoder/temporaries/sequence
- StreamOpLock: [[nodiscard]] scoped lock with with_fence_state() chaining
- DebugOwner: thread-ID based ownership tracking (replaces buggy try_lock)
- Narrow lock boundaries: host prep outside lock, short lock for Metal only
- Lock ordering enforced by construction: op_mtx -> fence_mtx
- FenceImpl::count and Event::value_ made atomic
- stream_map_ protected by shared_mutex

Tested: 355/355 requests across 3 model architectures (Mamba-2 hybrid,
dense transformer, GLM) at up to 20 concurrent streams. Zero crashes.

Fixes: ml-explore#3216, ml-explore#2067, ml-explore#3078
Six tests covering:
- Concurrent matmuls on separate streams (4 threads)
- Mixed operation types (matmul, reduction, elementwise, softmax)
- Numerical correctness verification under concurrency
- Sustained pressure (4 threads × 20 iterations)
- Cross-stream data dependencies (shared input)
- High concurrency (8 threads, varied workloads)

Uses deterministic inputs to isolate Metal stream thread safety
from unrelated mx.random global state concurrency issues.
@zcbenz
Collaborator

zcbenz commented Mar 14, 2026

I have put some thoughts about thread safety in #3078 (comment): basically, I think we should not try to achieve thread safety for arrays in different threads, at least not in the first try. For more practical targets (e.g. #3078, which I think is what projects like vllm-mlx or omlx do), we shouldn't need a complex solution like the one in this PR.

@zcbenz zcbenz closed this Mar 14, 2026
@rsnow
Author

rsnow commented Mar 15, 2026

Understandable; it is deep and complex. We've had good luck with it locally and will likely continue to run it, so if this gets revisited, we'd be happy to report results from our continued soak.

jundot added a commit to jundot/omlx that referenced this pull request Mar 18, 2026
pin mlx==0.31.1 in venvstacks to match _mlx_source patch base.
add build_patched_libmlx() that builds libmlx.dylib from
_mlx_source (per-stream-lock fix for Metal thread safety,
see ml-explore/mlx#3247) and replaces it in site-packages
while keeping the PyPI metallib intact.

enabled by default in both build.py and build_release.py.
opt out with --no-mlx-patch for stock PyPI libmlx.

closes #300, relates to #173
JianShan-1214 pushed a commit to JianShan-1214/omlx that referenced this pull request Mar 18, 2026
pin mlx==0.31.1 in venvstacks to match _mlx_source patch base.
add build_patched_libmlx() that builds libmlx.dylib from
_mlx_source (per-stream-lock fix for Metal thread safety,
see ml-explore/mlx#3247) and replaces it in site-packages
while keeping the PyPI metallib intact.

enabled by default in both build.py and build_release.py.
opt out with --no-mlx-patch for stock PyPI libmlx.

closes jundot#300, relates to jundot#173