Conversation

@aryasaatvik (Owner)

Summary

This PR adds MLX-VLM backend support for olmOCR, enabling efficient inference on Apple Silicon Macs (M1/M2/M3/M4) without requiring NVIDIA GPUs or cloud services.

Motivation

  • Enable on-device inference for users with Apple Silicon Macs
  • Reduce cloud inference costs for development and testing
  • Provide privacy-focused local processing option
  • Support users without access to NVIDIA GPU infrastructure

Changes

1. Backend Abstraction Layer (olmocr/backends.py) - 458 lines

  • New abstract base class InferenceBackend for multi-backend support
  • VLLMBackend implementation for NVIDIA GPUs
  • MLXVLMBackend implementation for Apple Silicon
  • Backend-specific request/response formatting
  • Automatic server health checking and startup

Key architectural decision: Each backend handles its own API format differences (see the sketch after this list):

  • vLLM: /v1/chat/completions endpoint, pre-loads model with name "olmocr"
  • MLX-VLM: /responses endpoint, lazy-loads model using actual path on first request
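
As an illustration, a minimal sketch of the abstraction (the class and method names follow this PR; the request and response shapes are approximations of the two API formats, not the exact implementation):

from abc import ABC, abstractmethod

class InferenceBackend(ABC):
    @abstractmethod
    def get_endpoint_path(self) -> str: ...

    @abstractmethod
    def build_request(self, model: str, messages: list) -> dict: ...

    @abstractmethod
    def parse_response(self, body: dict) -> str: ...

class VLLMBackend(InferenceBackend):
    def get_endpoint_path(self) -> str:
        return "/v1/chat/completions"

    def build_request(self, model: str, messages: list) -> dict:
        # vLLM pre-loads the model at startup and serves it under the name "olmocr"
        return {"model": "olmocr", "messages": messages}

    def parse_response(self, body: dict) -> str:
        return body["choices"][0]["message"]["content"]

class MLXVLMBackend(InferenceBackend):
    def get_endpoint_path(self) -> str:
        return "/responses"

    def build_request(self, model: str, messages: list) -> dict:
        # MLX-VLM lazy-loads the model from its actual path on the first request
        return {"model": model, "input": messages}

    def parse_response(self, body: dict) -> str:
        return body["output"][0]["content"][0]["text"]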

2. Pipeline Integration (olmocr/pipeline.py) - +145/-50 lines

  • Integrated backend abstraction into main inference pipeline
  • Backend-agnostic request building and response parsing
  • Model download happens before server startup
  • Correct model path handling for each backend
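
Roughly, the pipeline stays backend-agnostic by delegating the endpoint, the payload, and the response parsing to the backend instance; a hedged sketch (aiohttp and the variable names here are purely illustrative, not the actual pipeline code):

import aiohttp

async def query_backend(backend, base_url: str, model: str, messages: list) -> str:
    # The backend decides the endpoint, the request body, and how to read the reply
    url = base_url + backend.get_endpoint_path()
    payload = backend.build_request(model=model, messages=messages)
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            return backend.parse_response(await resp.json())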

3. Configuration (olmocr/config.py) - +41 lines

  • Added backend field: "vllm" (default) or "mlx-vlm"
  • Added MLX-specific options: mlx_quantization, mlx_kv_bits
  • Platform validation for MLX backend (sketched after this list):
    • Checks for macOS (Darwin)
    • Validates Apple Silicon architecture (arm64/aarch64)
    • Verifies mlx-vlm package installation
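
A sketch of what those checks amount to (the real validation lives in the config's __post_init__, per the commit notes below; exact error messages may differ):

import importlib.util
import platform

def validate_mlx_backend() -> None:
    # macOS only
    if platform.system() != "Darwin":
        raise ValueError("The mlx-vlm backend requires macOS")
    # Apple Silicon only
    if platform.machine().lower() not in ("arm64", "aarch64"):
        raise ValueError("The mlx-vlm backend requires Apple Silicon (arm64/aarch64)")
    # mlx-vlm must be importable
    if importlib.util.find_spec("mlx_vlm") is None:
        raise ImportError("mlx-vlm is not installed; install with: pip install olmocr[mlx]")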

4. Model Conversion Utility (olmocr/convert_to_mlx.py) - 218 lines

  • CLI tool for converting HuggingFace models to MLX format
  • Supports 4-bit and 8-bit quantization
  • Configurable group size for quantization (default: 64)
  • Platform validation and clear error messages
  • Progress logging for multi-step conversion process

Usage:

python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
  --output ~/models/olmocr-mlx --quantize 4 --group-size 64
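
Under the hood this is a thin wrapper around mlx-vlm's conversion routine; a hedged sketch of the equivalent call (keyword names follow the commit notes below, but verify the import path and signature against your installed mlx-vlm version):

import os
from mlx_vlm.convert import convert  # import path may differ across mlx-vlm versions

# Roughly equivalent to the CLI invocation above: convert and 4-bit quantize
convert(
    "allenai/olmOCR-2-7B-1025",
    mlx_path=os.path.expanduser("~/models/olmocr-mlx"),
    quantize=True,
    q_bits=4,          # 4-bit weights
    q_group_size=64,   # same group size used by the mlx-community models
)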

5. Dependencies (pyproject.toml, uv.lock)

  • Upgraded transformers: 4.55.2 → 4.57.0+
  • Added MLX optional dependency: pip install olmocr[mlx]
  • Requires mlx-vlm>=0.3.5
  • Added CLI entry point: olmocr command

6. Documentation (docs/source/mlx-backend.md) - 482 lines

Comprehensive guide covering:

  • System requirements and installation
  • Quick start with pre-quantized models
  • Configuration options and CLI usage
  • Model selection (4-bit vs 8-bit trade-offs)
  • Performance optimization tips
  • Troubleshooting common issues
  • API differences between backends
  • Performance benchmarks on different Mac models

7. Gitignore (.gitignore)

  • Added workspace/* to ignore test pipeline output

Pre-quantized Models

Ready-to-use models available on HuggingFace:

  • mlx-community/olmOCR-2-7B-1025-mlx-4bit (~2GB, fast)
  • mlx-community/olmOCR-2-7B-1025-mlx-8bit (~4GB, better quality)

Usage Example

# Install with MLX support
pip install olmocr[mlx]

# Run with 4-bit quantized model
olmocr ~/workspace \
  --pdfs sample.pdf \
  --backend mlx-vlm \
  --model mlx-community/olmOCR-2-7B-1025-mlx-4bit

Testing

Tested on:

  • macOS 15.2 (Sequoia)
  • Apple M4 Pro
  • MLX-VLM 0.3.5
  • Pre-quantized 4-bit and 8-bit models

API Differences

| Feature         | vLLM                        | MLX-VLM                     |
|-----------------|-----------------------------|-----------------------------|
| Endpoint        | /v1/chat/completions        | /responses                  |
| Model loading   | Pre-load at startup         | Lazy load on first request  |
| Request format  | OpenAI Chat Completions     | OpenAI Responses            |
| Response path   | choices[0].message.content  | output[0].content[0].text   |
| Guided decoding | ✅ Yes                      | ❌ No (post-validation)     |
| Default port    | 30024                       | 8000                        |

Limitations

  • macOS + Apple Silicon only: MLX-VLM requires M-series chips
  • No guided decoding: Responses are validated after generation and may require retries (see the sketch after this list)
  • Single GPU: No multi-GPU support (uses unified memory)
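
For the guided-decoding gap, a hedged sketch of the post-validation idea (validate_output and MAX_RETRIES are hypothetical names, not from this PR):

import aiohttp

MAX_RETRIES = 3  # hypothetical retry budget

async def generate_with_validation(backend, base_url, model, messages, validate_output):
    # Without guided decoding the server may return malformed output,
    # so validate after generation and retry a few times if needed.
    url = base_url + backend.get_endpoint_path()
    payload = backend.build_request(model=model, messages=messages)
    async with aiohttp.ClientSession() as session:
        for _ in range(MAX_RETRIES):
            async with session.post(url, json=payload) as resp:
                text = backend.parse_response(await resp.json())
            if validate_output(text):  # e.g. parses as the expected page schema
                return text
    raise RuntimeError("Response failed validation after retries")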

Breaking Changes

None - all changes are additive with safe defaults. Existing vLLM usage continues to work unchanged.

Next Steps

  • Integration testing on different Mac models (M1/M2/M3)
  • Benchmark performance comparison vs vLLM
  • Consider adding MLX backend to CI/CD (if GitHub Actions supports M-series runners)

Related Issues

Addresses user requests for Apple Silicon support and cloud-free local inference.

Introduce InferenceBackend abstract base class with implementations
for vLLM (NVIDIA GPUs) and MLX-VLM (Apple Silicon). This abstraction
enables olmOCR to support multiple inference backends through a
unified interface.

Key components:
- BackendConfig dataclass for unified backend configuration
- VLLMBackend: OpenAI Chat Completions API, guided decoding support
- MLXVLMBackend: OpenAI Responses API, lazy model loading
- get_backend() factory for backend instantiation
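
A rough sketch of how these two pieces might look (the field list and signatures are guesses based on this description; see olmocr/backends.py for the actual code):

from dataclasses import dataclass
from typing import Optional

# InferenceBackend, VLLMBackend, MLXVLMBackend: see the abstraction sketch in
# the Changes section above (actual definitions live in olmocr/backends.py).

@dataclass
class BackendConfig:
    model: str
    port: int
    mlx_quantization: Optional[str] = None   # e.g. "4bit" or "8bit"
    mlx_kv_bits: Optional[int] = None        # KV-cache quantization bits

def get_backend(name: str) -> InferenceBackend:
    # Factory mapping the --backend flag to an implementation
    backends = {"vllm": VLLMBackend, "mlx-vlm": MLXVLMBackend}
    if name not in backends:
        raise ValueError(f"Unknown backend: {name!r}")
    return backends[name]()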

Each backend handles:
- Server process lifecycle management
- Health check endpoints
- Request/response format translation
- Model-specific validation

Breaking changes: None (additive change)

Replace hardcoded vLLM logic with backend-agnostic implementation
using the new InferenceBackend abstraction. This enables seamless
switching between vLLM and MLX-VLM backends.

Key changes:
- Thread backend instance through processing pipeline
- Use backend.build_request() and backend.parse_response()
- Dynamic endpoint paths via backend.get_endpoint_path()
- Backend-aware platform checks (skip CUDA for MLX)
- Fix critical model path handling bug:
  * vLLM: Use served name "olmocr" (model pre-loaded at startup)
  * MLX-VLM: Use actual model path (lazy loading on first request)
- Add CLI support: --backend, --mlx_quantization, --mlx_kv_bits
- Backend-specific port defaults (vLLM: 30024, MLX: 8000)

CLI additions:
- --backend {vllm,mlx-vlm}: Select inference backend
- --custom_prompt: Override default OCR prompt
- --mlx_quantization: MLX model quantization (4bit, 8bit, etc.)
- --mlx_kv_bits: MLX KV-cache quantization bits

Breaking changes: None (default behavior unchanged)

Extend PipelineConfig with backend configuration options and
platform-specific validation for MLX-VLM backend.

New configuration fields:
- backend: str = "vllm" - Select inference backend
- mlx_quantization: Optional[str] - MLX quantization (4bit, 8bit, etc.)
- mlx_kv_bits: Optional[int] - KV-cache quantization bits (1, 2, 4, 8)

Validation:
- Ensure backend is "vllm" or "mlx-vlm"
- MLX-specific checks in __post_init__:
  * Verify platform is macOS (Darwin)
  * Verify architecture is ARM64/Apple Silicon
  * Check mlx-vlm package installation

Provides early, clear error messages when attempting to use
MLX backend on unsupported platforms.

Breaking changes: None (additive with safe defaults)

Add convert_to_mlx.py utility that wraps mlx_vlm.convert to simplify
converting olmOCR models from HuggingFace to MLX format.

Features:
- Convert models from HuggingFace Hub or local paths
- Support for quantization (4-bit, 8-bit with configurable group size)
- Platform validation (macOS + Apple Silicon only)
- Optional upload to HuggingFace Hub
- Clear usage instructions and progress logging

Command-line interface:
  python -m olmocr.convert_to_mlx MODEL --output PATH [--quantize 4]

Usage example:
  python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
    --output ~/models/olmocr-mlx --quantize 4 --group-size 64

Implementation details:
- Calls mlx_vlm.convert() directly with q_bits and q_group_size
- Default group size: 64 (same as mlx-community models)
- Validates Apple Silicon before attempting conversion

Dependencies: Requires mlx-vlm>=0.3.5 (installed via olmocr[mlx])

Update dependencies to support both vLLM and MLX-VLM backends.

Changes:
- Upgrade transformers: 4.55.2 → 4.57.0+
  * Ensures compatibility with latest HuggingFace models
  * Required for both training and inference backends

- Add MLX optional dependency group:
  * mlx-vlm>=0.3.5 for Apple Silicon inference
  * Install with: pip install olmocr[mlx]

- Add CLI entry point:
  * olmocr = "olmocr.pipeline:cli"
  * Enables `olmocr` command after installation

Breaking changes: None (transformers upgrade is compatible)

Add detailed documentation for using olmOCR with MLX-VLM backend
on Apple Silicon Macs. Integrated into Sphinx documentation site.

Location: docs/source/mlx-backend.md (added to Getting Started section)

Contents:
- Overview of MLX-VLM vs vLLM backends
- System requirements (M1/M2/M3/M4, macOS 12.0+, 16GB+ RAM)
- Installation instructions
- Quick start guide with pre-quantized models
- Configuration options and CLI flags
- Model selection guide (4-bit vs 8-bit quantization)
- Performance optimization tips
- Troubleshooting section
- API differences between vLLM and MLX-VLM
- Current limitations and workarounds
- Performance benchmarks on different Mac models

Key information:
- Default port: 8000 (vs 30024 for vLLM)
- API endpoint: /responses (vs /v1/chat/completions for vLLM)
- No guided decoding support (uses post-validation instead)
- Pre-quantized models available:
  * mlx-community/olmOCR-2-7B-1025-mlx-4bit (~2GB)
  * mlx-community/olmOCR-2-7B-1025-mlx-8bit (~4GB)

Target audience: Users with Apple Silicon Macs wanting on-device
inference without cloud costs or NVIDIA GPU requirements.

Ignore workspace/ directory used for test runs and pipeline output.
Similar to existing localworkspace/* entry.

Update system requirements to require macOS 15.0+ instead of 12.0+.
This reflects the tested and recommended minimum version for MLX-VLM
backend support.

@aryasaatvik merged commit 3192ed9 into main on Oct 29, 2025