forked from allenai/olmocr
Add MLX-VLM backend support for Apple Silicon #1
Merged
Introduce InferenceBackend abstract base class with implementations for vLLM (NVIDIA GPUs) and MLX-VLM (Apple Silicon). This abstraction enables olmOCR to support multiple inference backends through a unified interface.
Key components:
- BackendConfig dataclass for unified backend configuration
- VLLMBackend: OpenAI Chat Completions API, guided decoding support
- MLXVLMBackend: OpenAI Responses API, lazy model loading
- get_backend() factory for backend instantiation
Each backend handles:
- Server process lifecycle management
- Health check endpoints
- Request/response format translation
- Model-specific validation
Breaking changes: None (additive change)
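To make the shape of this abstraction concrete, here is a minimal sketch. The names (BackendConfig, InferenceBackend, VLLMBackend, MLXVLMBackend, get_backend) follow the commit message, but the fields, signatures, and payload details are illustrative assumptions, not the actual olmocr/backends.py code.

```python
# Illustrative sketch only -- not the actual olmocr/backends.py implementation.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BackendConfig:
    """Unified configuration handed to every backend (illustrative fields)."""
    model: str
    port: int
    mlx_quantization: Optional[str] = None
    mlx_kv_bits: Optional[int] = None


class InferenceBackend(ABC):
    """Interface the pipeline codes against, regardless of the serving engine."""

    def __init__(self, config: BackendConfig) -> None:
        self.config = config

    @abstractmethod
    def get_endpoint_path(self) -> str:
        """HTTP path that inference requests are POSTed to."""

    @abstractmethod
    def build_request(self, prompt: str, image_b64: str) -> dict[str, Any]:
        """Translate a page prompt + base64 image into the backend's payload."""

    @abstractmethod
    def parse_response(self, response: dict[str, Any]) -> str:
        """Pull the generated text out of the backend's response payload."""


class VLLMBackend(InferenceBackend):
    """NVIDIA GPU backend speaking the OpenAI Chat Completions API."""

    def get_endpoint_path(self) -> str:
        return "/v1/chat/completions"

    def build_request(self, prompt: str, image_b64: str) -> dict[str, Any]:
        return {
            "model": "olmocr",  # model is pre-loaded at server startup under this served name
            "messages": [{"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ]}],
        }

    def parse_response(self, response: dict[str, Any]) -> str:
        return response["choices"][0]["message"]["content"]


class MLXVLMBackend(InferenceBackend):
    """Apple Silicon backend speaking the OpenAI Responses API."""

    def get_endpoint_path(self) -> str:
        return "/responses"

    def build_request(self, prompt: str, image_b64: str) -> dict[str, Any]:
        return {
            "model": self.config.model,  # actual path: the model is lazy-loaded on first request
            "input": [{"role": "user", "content": [
                {"type": "input_text", "text": prompt},
                {"type": "input_image", "image_url": f"data:image/png;base64,{image_b64}"},
            ]}],
        }

    def parse_response(self, response: dict[str, Any]) -> str:
        return response["output"][0]["content"][0]["text"]


def get_backend(name: str, config: BackendConfig) -> InferenceBackend:
    """Factory mapping a backend name to its implementation."""
    backends = {"vllm": VLLMBackend, "mlx-vlm": MLXVLMBackend}
    if name not in backends:
        raise ValueError(f"Unknown backend {name!r}; expected 'vllm' or 'mlx-vlm'")
    return backends[name](config)
```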
Replace hardcoded vLLM logic with backend-agnostic implementation
using the new InferenceBackend abstraction. This enables seamless
switching between vLLM and MLX-VLM backends.
Key changes:
- Thread backend instance through processing pipeline
- Use backend.build_request() and backend.parse_response()
- Dynamic endpoint paths via backend.get_endpoint_path()
- Backend-aware platform checks (skip CUDA for MLX)
- Fix critical model path handling bug:
* vLLM: Use served name "olmocr" (model pre-loaded at startup)
* MLX-VLM: Use actual model path (lazy loading on first request)
- Add CLI support: --backend, --mlx_quantization, --mlx_kv_bits
- Backend-specific port defaults (vLLM: 30024, MLX: 8000)
CLI additions:
- --backend {vllm,mlx-vlm}: Select inference backend
- --custom_prompt: Override default OCR prompt
- --mlx_quantization: MLX model quantization (4bit, 8bit, etc.)
- --mlx_kv_bits: MLX KV-cache quantization bits
Breaking changes: None (default behavior unchanged)
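A minimal sketch of what the backend-agnostic request path could look like in the worker; the function name, aiohttp usage, and parameters here are illustrative assumptions rather than the actual pipeline.py wiring.

```python
# Illustrative sketch of a backend-agnostic request path; the real pipeline's
# HTTP client and surrounding retry/error handling may differ.
from typing import Any

import aiohttp


async def query_backend(backend: Any, session: aiohttp.ClientSession, base_url: str,
                        prompt: str, image_b64: str) -> str:
    payload = backend.build_request(prompt, image_b64)    # backend-specific JSON body
    url = base_url + backend.get_endpoint_path()          # /v1/chat/completions or /responses
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
    return backend.parse_response(data)                   # backend-specific field extraction
```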
Extend PipelineConfig with backend configuration options and platform-specific validation for the MLX-VLM backend.
New configuration fields:
- backend: str = "vllm" - Select inference backend
- mlx_quantization: Optional[str] - MLX quantization (4bit, 8bit, etc.)
- mlx_kv_bits: Optional[int] - KV-cache quantization bits (1, 2, 4, 8)
Validation:
- Ensure backend is "vllm" or "mlx-vlm"
- MLX-specific checks in __post_init__:
  * Verify platform is macOS (Darwin)
  * Verify architecture is ARM64/Apple Silicon
  * Check mlx-vlm package installation
Provides early, clear error messages when attempting to use the MLX backend on unsupported platforms.
Breaking changes: None (additive with safe defaults)
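A compact sketch of the kind of checks described above; the function name is hypothetical, and the actual logic lives in PipelineConfig.__post_init__ in olmocr/config.py.

```python
# Illustrative sketch of the platform validation described in the commit message.
import importlib.util
import platform


def validate_backend_choice(backend: str) -> None:
    if backend not in ("vllm", "mlx-vlm"):
        raise ValueError(f"backend must be 'vllm' or 'mlx-vlm', got {backend!r}")
    if backend != "mlx-vlm":
        return
    # MLX-specific checks: macOS + Apple Silicon + mlx-vlm installed
    if platform.system() != "Darwin":
        raise RuntimeError("The mlx-vlm backend requires macOS (Darwin).")
    if platform.machine() != "arm64":
        raise RuntimeError("The mlx-vlm backend requires Apple Silicon (arm64).")
    if importlib.util.find_spec("mlx_vlm") is None:
        raise RuntimeError("mlx-vlm is not installed; run: pip install olmocr[mlx]")
```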
Add convert_to_mlx.py utility that wraps mlx_vlm.convert to simplify
converting olmOCR models from HuggingFace to MLX format.
Features:
- Convert models from HuggingFace Hub or local paths
- Support for quantization (4-bit, 8-bit with configurable group size)
- Platform validation (macOS + Apple Silicon only)
- Optional upload to HuggingFace Hub
- Clear usage instructions and progress logging
Command-line interface:
python -m olmocr.convert_to_mlx MODEL --output PATH [--quantize 4]
Usage example:
python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
--output ~/models/olmocr-mlx --quantize 4 --group-size 64
Implementation details:
- Calls mlx_vlm.convert() directly with q_bits and q_group_size
- Default group size: 64 (same as mlx-community models)
- Validates Apple Silicon before attempting conversion
Dependencies: Requires mlx-vlm>=0.3.5 (installed via olmocr[mlx])
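A rough sketch of the core call this utility makes, assuming mlx-vlm>=0.3.5. The keyword names follow the commit message (q_bits, q_group_size); check mlx_vlm.convert's signature for your installed version, as it may differ.

```python
# Sketch of the conversion core (assumption: mlx_vlm.convert accepts these keywords).
from pathlib import Path

from mlx_vlm import convert

convert(
    "allenai/olmOCR-2-7B-1025",                             # HuggingFace Hub ID or local path
    mlx_path=str(Path("~/models/olmocr-mlx").expanduser()),  # output directory
    quantize=True,
    q_bits=4,          # 4-bit quantization
    q_group_size=64,   # default group size, same as mlx-community models
)
```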
Update dependencies to support both vLLM and MLX-VLM backends.
Changes:
- Upgrade transformers: 4.55.2 → 4.57.0+
  * Ensures compatibility with latest HuggingFace models
  * Required for both training and inference backends
- Add MLX optional dependency group:
  * mlx-vlm>=0.3.5 for Apple Silicon inference
  * Install with: pip install olmocr[mlx]
- Add CLI entry point:
  * olmocr = "olmocr.pipeline:cli"
  * Enables `olmocr` command after installation
Breaking changes: None (transformers upgrade is compatible)
Add detailed documentation for using olmOCR with the MLX-VLM backend on Apple Silicon Macs. Integrated into the Sphinx documentation site.
Location: docs/source/mlx-backend.md (added to Getting Started section)
Contents:
- Overview of MLX-VLM vs vLLM backends
- System requirements (M1/M2/M3/M4, macOS 12.0+, 16GB+ RAM)
- Installation instructions
- Quick start guide with pre-quantized models
- Configuration options and CLI flags
- Model selection guide (4-bit vs 8-bit quantization)
- Performance optimization tips
- Troubleshooting section
- API differences between vLLM and MLX-VLM
- Current limitations and workarounds
- Performance benchmarks on different Mac models
Key information:
- Default port: 8000 (vs 30024 for vLLM)
- API endpoint: /responses (vs /v1/chat/completions for vLLM)
- No guided decoding support (uses post-validation instead)
- Pre-quantized models available:
  * mlx-community/olmOCR-2-7B-1025-mlx-4bit (~2GB)
  * mlx-community/olmOCR-2-7B-1025-mlx-8bit (~4GB)
Target audience: Users with Apple Silicon Macs wanting on-device inference without cloud costs or NVIDIA GPU requirements.
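For readers who want to poke at the server directly, here is a hypothetical smoke test against a locally running MLX-VLM server using the defaults documented above (port 8000, the /responses endpoint, and a pre-quantized mlx-community model). The payload shape is an assumption based on the OpenAI Responses API, so treat it as an illustration rather than the documented interface.

```python
# Hypothetical smoke test -- endpoint and port follow the documentation above;
# the exact payload accepted by the server may differ.
import base64

import requests

with open("page.png", "rb") as f:  # any rendered PDF page image
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "mlx-community/olmOCR-2-7B-1025-mlx-4bit",
    "input": [{"role": "user", "content": [
        {"type": "input_text", "text": "Transcribe this page to markdown."},
        {"type": "input_image", "image_url": f"data:image/png;base64,{image_b64}"},
    ]}],
}

resp = requests.post("http://localhost:8000/responses", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["output"][0]["content"][0]["text"])
```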
Ignore workspace/ directory used for test runs and pipeline output. Similar to existing localworkspace/* entry.
Update system requirements to require macOS 15.0+ instead of 12.0+. This reflects the tested and recommended minimum version for MLX-VLM backend support.
Summary
This PR adds MLX-VLM backend support for olmOCR, enabling efficient inference on Apple Silicon Macs (M1/M2/M3/M4) without requiring NVIDIA GPUs or cloud services.
Motivation
Changes
1. Backend Abstraction Layer (olmocr/backends.py) - 458 lines
- InferenceBackend abstract base class for multi-backend support
- VLLMBackend implementation for NVIDIA GPUs
- MLXVLMBackend implementation for Apple Silicon
Key architectural decision: Each backend handles its own API format differences:
- vLLM: /v1/chat/completions endpoint, pre-loads model with name "olmocr"
- MLX-VLM: /responses endpoint, lazy-loads model using actual path on first request

2. Pipeline Integration (olmocr/pipeline.py) - +145/-50 lines

3. Configuration (olmocr/config.py) - +41 lines
- backend field: "vllm" (default) or "mlx-vlm"
- mlx_quantization, mlx_kv_bits
4. Model Conversion Utility (olmocr/convert_to_mlx.py) - 218 lines
Usage:
python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
    --output ~/models/olmocr-mlx --quantize 4 --group-size 64

5. Dependencies (pyproject.toml, uv.lock)
- Optional MLX group: pip install olmocr[mlx]
- New olmocr command (CLI entry point)

6. Documentation (docs/source/mlx-backend.md) - 482 lines
Comprehensive guide covering installation, quick start, configuration, model selection, performance tips, and troubleshooting.

7. Gitignore (.gitignore)
- Add workspace/* to ignore test pipeline output

Pre-quantized Models
Ready-to-use models available on HuggingFace:
- mlx-community/olmOCR-2-7B-1025-mlx-4bit (~2GB, fast)
- mlx-community/olmOCR-2-7B-1025-mlx-8bit (~4GB, better quality)

Usage Example
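The example itself is not preserved in this extract; based on the CLI flags introduced in this PR and olmOCR's standard pipeline invocation, a run on Apple Silicon would plausibly look like the following (paths and the PDF name are placeholders):

python -m olmocr.pipeline ./workspace --markdown \
    --backend mlx-vlm \
    --model mlx-community/olmOCR-2-7B-1025-mlx-4bit \
    --pdfs path/to/document.pdf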
Testing
Tested on:
API Differences
| | vLLM | MLX-VLM |
| --- | --- | --- |
| Endpoint | /v1/chat/completions | /responses |
| Response text field | choices[0].message.content | output[0].content[0].text |

Limitations
Breaking Changes
None - all changes are additive with safe defaults. Existing vLLM usage continues to work unchanged.
Next Steps
Related Issues
Addresses user requests for Apple Silicon support and cloud-free local inference.