fix(metal): per-stream locking for concurrent inference thread safety #3247

Closed

rsnow wants to merge 2 commits into ml-explore:main from rsnow:fix/per-stream-lock-v2

Conversation

@rsnow rsnow commented Mar 12, 2026

Summary

Replace unsynchronized DeviceStream mutable state access with fine-grained per-stream locking to eliminate SIGSEGV/SIGABRT crashes during concurrent Metal inference.

Related issues: #3216, #2067, #3078

Problem

The Metal backend's DeviceStream holds mutable state (command buffer, encoder, temporaries, fence counters) that is accessed without synchronization during concurrent GPU evaluations. This causes crashes in Metal's command encoder lifecycle, particularly during multi-stream inference with large models on Apple Silicon.

PR #2104 proposed a global mutex but that's too coarse — it serializes all GPU work and risks deadlocks.

Approach

Three-domain architecture with narrow critical sections:

  1. Host preparation (unlocked) — graph traversal, kernel lookup, shape specialization. Pure reads, no stream state mutation.
  2. Stream submission (per-stream op_mtx) — encoder commands, buffer rotation, temporary tracking. Short hold, only Metal mutations.
  3. Dependency state (fence_mtx) — fence signaling and waiting. Separate lock to avoid blocking submitters on sync.

Key design elements:

  • SubmissionEpoch — RAII struct wrapping buffer/encoder/temporaries/sequence for atomic rotation
  • StreamOpLock — [[nodiscard]] scoped lock with with_fence_state() chaining for ordered acquisition
  • DebugOwner — thread-ID based ownership tracking, replacing the previous try_lock() assert (which had UB under POSIX and false negatives under contention)
  • Lock ordering enforced by construction — op_mtx via StreamOpLock, then fence_mtx via with_fence_state()
  • FenceImpl::count and Event::value_ made std::atomic for lock-free fast paths
  • stream_map_ protected by std::shared_mutex for concurrent reads

Testing

Unit tests (included in this PR)

6 new tests in python/tests/test_concurrent_eval.py:

  • Concurrent matmuls on separate streams (4 threads)
  • Mixed operation types (matmul, reduction, elementwise, softmax)
  • Numerical correctness verification under concurrency
  • Sustained pressure (4 threads × 20 iterations)
  • Cross-stream data dependencies (shared input, separate consumer streams)
  • High concurrency (8 threads, varied workloads)

Integration testing (external, on M3 Ultra 256GB)

  • 355/355 requests across 3 model architectures (Mamba-2 hybrid, dense transformer, GLM-4) at up to 20 concurrent streams — zero crashes, zero restarts
  • Ramp test scaling: 3 concurrent → 15s, 6 → 20s, 9 → 24s, 12 → 29s (linear, no cliff)
  • 48-request torture test at 12→16→20 concurrent with 4096 tokens each — clean throughout
  • oMLX server PID unchanged across all test sessions

Files changed (6)

  • mlx/backend/metal/device.h — SubmissionEpoch, StreamOpLock, DebugOwner, per-stream mutexes
  • mlx/backend/metal/device.cpp — narrow lock scopes in new_encoder(), commit_command_buffer(), end_encoding()
  • mlx/backend/metal/eval.cpp — StreamOpLock acquisition around Metal submission, host prep outside lock
  • mlx/backend/metal/event.cpp — atomic value_ access
  • mlx/backend/metal/fence.cpp — fence_mtx for signal/wait, atomic FenceImpl::count
  • mlx/event.h — std::atomic for Event::value_

Notes

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

fang added 2 commits March 12, 2026 15:57
Replace unsynchronized DeviceStream mutable state access with fine-grained
per-stream locking to eliminate SIGSEGV/SIGABRT crashes during concurrent
Metal inference.

Architecture:
- SubmissionEpoch: RAII wrapper for buffer/encoder/temporaries/sequence
- StreamOpLock: [[nodiscard]] scoped lock with with_fence_state() chaining
- DebugOwner: thread-ID based ownership tracking (replaces buggy try_lock)
- Narrow lock boundaries: host prep outside lock, short lock for Metal only
- Lock ordering enforced by construction: op_mtx -> fence_mtx
- FenceImpl::count and Event::value_ made atomic
- stream_map_ protected by shared_mutex

Tested: 355/355 requests across 3 model architectures (Mamba-2 hybrid,
dense transformer, GLM) at up to 20 concurrent streams. Zero crashes.

Fixes: ml-explore#3216, ml-explore#2067, ml-explore#3078
Six tests covering:
- Concurrent matmuls on separate streams (4 threads)
- Mixed operation types (matmul, reduction, elementwise, softmax)
- Numerical correctness verification under concurrency
- Sustained pressure (4 threads × 20 iterations)
- Cross-stream data dependencies (shared input)
- High concurrency (8 threads, varied workloads)

Uses deterministic inputs to isolate Metal stream thread safety
from unrelated mx.random global state concurrency issues.
@zcbenz
Collaborator

zcbenz commented Mar 14, 2026

I have put some thoughts about thread safety in #3078 (comment): basically, I think we should not try to achieve thread safety for arrays in different threads, at least not in the first try. For more practical targets (e.g. #3078, which I think is what projects like vllm-mlx or omlx do), we shouldn't need a complex solution like the one in this PR.

@zcbenz zcbenz closed this Mar 14, 2026
@rsnow
Author

rsnow commented Mar 15, 2026

Understandable; it is deep and complex. We've had good luck with it locally and will likely continue to run it, so if this gets revisited, we'd be happy to report results from our continued soak.

jundot added a commit to jundot/omlx that referenced this pull request Mar 18, 2026
pin mlx==0.31.1 in venvstacks to match _mlx_source patch base.
add build_patched_libmlx() that builds libmlx.dylib from
_mlx_source (per-stream-lock fix for Metal thread safety,
see ml-explore/mlx#3247) and replaces it in site-packages
while keeping the PyPI metallib intact.

enabled by default in both build.py and build_release.py.
opt out with --no-mlx-patch for stock PyPI libmlx.

closes #300, relates to #173
JianShan-1214 pushed a commit to JianShan-1214/omlx that referenced this pull request Mar 18, 2026
pin mlx==0.31.1 in venvstacks to match _mlx_source patch base.
add build_patched_libmlx() that builds libmlx.dylib from
_mlx_source (per-stream-lock fix for Metal thread safety,
see ml-explore/mlx#3247) and replaces it in site-packages
while keeping the PyPI metallib intact.

enabled by default in both build.py and build_release.py.
opt out with --no-mlx-patch for stock PyPI libmlx.

closes jundot#300, relates to jundot#173