Conversation

@aryasaatvik (Owner)

Summary

This PR adds MLX-VLM backend support for olmOCR, enabling efficient inference on Apple Silicon Macs (M1/M2/M3/M4) without requiring NVIDIA GPUs or cloud services.

Motivation

  • Enable on-device inference for users with Apple Silicon Macs
  • Reduce cloud inference costs for development and testing
  • Provide privacy-focused local processing option
  • Support users without access to NVIDIA GPU infrastructure

Changes

1. Backend Abstraction Layer (olmocr/backends.py) - 458 lines

  • New abstract base class InferenceBackend for multi-backend support
  • VLLMBackend implementation for NVIDIA GPUs
  • MLXVLMBackend implementation for Apple Silicon
  • Backend-specific request/response formatting
  • Automatic server health checking and startup

Key architectural decision: Each backend handles its own API format differences (see the sketch after this list):

  • vLLM: /v1/chat/completions endpoint, pre-loads model with name "olmocr"
  • MLX-VLM: /responses endpoint, lazy-loads model using actual path on first request
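
As an illustration, a minimal sketch of the abstraction (the class and method names follow this PR; the request and response shapes are approximations of the two API formats, not the exact implementation):

from abc import ABC, abstractmethod

class InferenceBackend(ABC):
    @abstractmethod
    def get_endpoint_path(self) -> str: ...

    @abstractmethod
    def build_request(self, model: str, messages: list) -> dict: ...

    @abstractmethod
    def parse_response(self, body: dict) -> str: ...

class VLLMBackend(InferenceBackend):
    def get_endpoint_path(self) -> str:
        return "/v1/chat/completions"

    def build_request(self, model: str, messages: list) -> dict:
        # vLLM pre-loads the model at startup and serves it under the name "olmocr"
        return {"model": "olmocr", "messages": messages}

    def parse_response(self, body: dict) -> str:
        return body["choices"][0]["message"]["content"]

class MLXVLMBackend(InferenceBackend):
    def get_endpoint_path(self) -> str:
        return "/responses"

    def build_request(self, model: str, messages: list) -> dict:
        # MLX-VLM lazy-loads the model from its actual path on the first request
        return {"model": model, "input": messages}

    def parse_response(self, body: dict) -> str:
        return body["output"][0]["content"][0]["text"]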

2. Pipeline Integration (olmocr/pipeline.py) - +145/-50 lines

  • Integrated backend abstraction into main inference pipeline
  • Backend-agnostic request building and response parsing
  • Model download happens before server startup
  • Correct model path handling for each backend
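
Roughly, the pipeline stays backend-agnostic by delegating the endpoint, the payload, and the response parsing to the backend instance; a hedged sketch (aiohttp and the variable names here are purely illustrative, not the actual pipeline code):

import aiohttp

async def query_backend(backend, base_url: str, model: str, messages: list) -> str:
    # The backend decides the endpoint, the request body, and how to read the reply
    url = base_url + backend.get_endpoint_path()
    payload = backend.build_request(model=model, messages=messages)
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            return backend.parse_response(await resp.json())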

3. Configuration (olmocr/config.py) - +41 lines

  • Added backend field: "vllm" (default) or "mlx-vlm"
  • Added MLX-specific options: mlx_quantization, mlx_kv_bits
  • Platform validation for MLX backend (sketched after this list):
    • Checks for macOS (Darwin)
    • Validates Apple Silicon architecture (arm64/aarch64)
    • Verifies mlx-vlm package installation
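
A sketch of what those checks amount to (the real validation lives in the config's __post_init__, per the commit notes below; exact error messages may differ):

import importlib.util
import platform

def validate_mlx_backend() -> None:
    # macOS only
    if platform.system() != "Darwin":
        raise ValueError("The mlx-vlm backend requires macOS")
    # Apple Silicon only
    if platform.machine().lower() not in ("arm64", "aarch64"):
        raise ValueError("The mlx-vlm backend requires Apple Silicon (arm64/aarch64)")
    # mlx-vlm must be importable
    if importlib.util.find_spec("mlx_vlm") is None:
        raise ImportError("mlx-vlm is not installed; install with: pip install olmocr[mlx]")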

4. Model Conversion Utility (olmocr/convert_to_mlx.py) - 218 lines

  • CLI tool for converting HuggingFace models to MLX format
  • Supports 4-bit and 8-bit quantization
  • Configurable group size for quantization (default: 64)
  • Platform validation and clear error messages
  • Progress logging for multi-step conversion process

Usage:

python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
  --output ~/models/olmocr-mlx --quantize 4 --group-size 64
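
Under the hood this is a thin wrapper around mlx-vlm's conversion routine; a hedged sketch of the equivalent call (keyword names follow the commit notes below, but verify the import path and signature against your installed mlx-vlm version):

import os
from mlx_vlm.convert import convert  # import path may differ across mlx-vlm versions

# Roughly equivalent to the CLI invocation above: convert and 4-bit quantize
convert(
    "allenai/olmOCR-2-7B-1025",
    mlx_path=os.path.expanduser("~/models/olmocr-mlx"),
    quantize=True,
    q_bits=4,          # 4-bit weights
    q_group_size=64,   # same group size used by the mlx-community models
)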

5. Dependencies (pyproject.toml, uv.lock)

  • Upgraded transformers: 4.55.2 → 4.57.0+
  • Added MLX optional dependency: pip install olmocr[mlx]
  • Requires mlx-vlm>=0.3.5
  • Added CLI entry point: olmocr command

6. Documentation (docs/source/mlx-backend.md) - 482 lines

Comprehensive guide covering:

  • System requirements and installation
  • Quick start with pre-quantized models
  • Configuration options and CLI usage
  • Model selection (4-bit vs 8-bit trade-offs)
  • Performance optimization tips
  • Troubleshooting common issues
  • API differences between backends
  • Performance benchmarks on different Mac models

7. Gitignore (.gitignore)

  • Added workspace/* to ignore test pipeline output

Pre-quantized Models

Ready-to-use models available on HuggingFace:

  • mlx-community/olmOCR-2-7B-1025-mlx-4bit (~2GB, fast)
  • mlx-community/olmOCR-2-7B-1025-mlx-8bit (~4GB, better quality)

Usage Example

# Install with MLX support
pip install olmocr[mlx]

# Run with 4-bit quantized model
olmocr ~/workspace \
  --pdfs sample.pdf \
  --backend mlx-vlm \
  --model mlx-community/olmOCR-2-7B-1025-mlx-4bit

Testing

Tested on:

  • macOS 15.2 (Sequoia)
  • Apple M4 Pro
  • MLX-VLM 0.3.5
  • Pre-quantized 4-bit and 8-bit models

API Differences

| Feature         | vLLM                        | MLX-VLM                     |
|-----------------|-----------------------------|-----------------------------|
| Endpoint        | /v1/chat/completions        | /responses                  |
| Model loading   | Pre-load at startup         | Lazy load on first request  |
| Request format  | OpenAI Chat Completions     | OpenAI Responses            |
| Response path   | choices[0].message.content  | output[0].content[0].text   |
| Guided decoding | ✅ Yes                      | ❌ No (post-validation)     |
| Default port    | 30024                       | 8000                        |

Limitations

  • macOS + Apple Silicon only: MLX-VLM requires M-series chips
  • No guided decoding: Responses are validated after generation and may require retries (see the sketch after this list)
  • Single GPU: No multi-GPU support (uses unified memory)
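
For the guided-decoding gap, a hedged sketch of the post-validation idea (validate_output and MAX_RETRIES are hypothetical names, not from this PR):

import aiohttp

MAX_RETRIES = 3  # hypothetical retry budget

async def generate_with_validation(backend, base_url, model, messages, validate_output):
    # Without guided decoding the server may return malformed output,
    # so validate after generation and retry a few times if needed.
    url = base_url + backend.get_endpoint_path()
    payload = backend.build_request(model=model, messages=messages)
    async with aiohttp.ClientSession() as session:
        for _ in range(MAX_RETRIES):
            async with session.post(url, json=payload) as resp:
                text = backend.parse_response(await resp.json())
            if validate_output(text):  # e.g. parses as the expected page schema
                return text
    raise RuntimeError("Response failed validation after retries")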

Breaking Changes

None - all changes are additive with safe defaults. Existing vLLM usage continues to work unchanged.

Next Steps

  • Integration testing on different Mac models (M1/M2/M3)
  • Benchmark performance comparison vs vLLM
  • Consider adding MLX backend to CI/CD (if GitHub Actions supports M-series runners)

Related Issues

Addresses user requests for Apple Silicon support and cloud-free local inference.

Introduce InferenceBackend abstract base class with implementations
for vLLM (NVIDIA GPUs) and MLX-VLM (Apple Silicon). This abstraction
enables olmOCR to support multiple inference backends through a
unified interface.

Key components:
- BackendConfig dataclass for unified backend configuration
- VLLMBackend: OpenAI Chat Completions API, guided decoding support
- MLXVLMBackend: OpenAI Responses API, lazy model loading
- get_backend() factory for backend instantiation
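
A rough sketch of how these two pieces might look (the field list and signatures are guesses based on this description; see olmocr/backends.py for the actual code):

from dataclasses import dataclass
from typing import Optional

# InferenceBackend, VLLMBackend, MLXVLMBackend: see the abstraction sketch in
# the Changes section above (actual definitions live in olmocr/backends.py).

@dataclass
class BackendConfig:
    model: str
    port: int
    mlx_quantization: Optional[str] = None   # e.g. "4bit" or "8bit"
    mlx_kv_bits: Optional[int] = None        # KV-cache quantization bits

def get_backend(name: str) -> InferenceBackend:
    # Factory mapping the --backend flag to an implementation
    backends = {"vllm": VLLMBackend, "mlx-vlm": MLXVLMBackend}
    if name not in backends:
        raise ValueError(f"Unknown backend: {name!r}")
    return backends[name]()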

Each backend handles:
- Server process lifecycle management
- Health check endpoints
- Request/response format translation
- Model-specific validation

Breaking changes: None (additive change)

Replace hardcoded vLLM logic with backend-agnostic implementation
using the new InferenceBackend abstraction. This enables seamless
switching between vLLM and MLX-VLM backends.

Key changes:
- Thread backend instance through processing pipeline
- Use backend.build_request() and backend.parse_response()
- Dynamic endpoint paths via backend.get_endpoint_path()
- Backend-aware platform checks (skip CUDA for MLX)
- Fix critical model path handling bug:
  * vLLM: Use served name "olmocr" (model pre-loaded at startup)
  * MLX-VLM: Use actual model path (lazy loading on first request)
- Add CLI support: --backend, --mlx_quantization, --mlx_kv_bits
- Backend-specific port defaults (vLLM: 30024, MLX: 8000)

CLI additions:
- --backend {vllm,mlx-vlm}: Select inference backend
- --custom_prompt: Override default OCR prompt
- --mlx_quantization: MLX model quantization (4bit, 8bit, etc.)
- --mlx_kv_bits: MLX KV-cache quantization bits

Breaking changes: None (default behavior unchanged)

Extend PipelineConfig with backend configuration options and
platform-specific validation for MLX-VLM backend.

New configuration fields:
- backend: str = "vllm" - Select inference backend
- mlx_quantization: Optional[str] - MLX quantization (4bit, 8bit, etc.)
- mlx_kv_bits: Optional[int] - KV-cache quantization bits (1, 2, 4, 8)

Validation:
- Ensure backend is "vllm" or "mlx-vlm"
- MLX-specific checks in __post_init__:
  * Verify platform is macOS (Darwin)
  * Verify architecture is ARM64/Apple Silicon
  * Check mlx-vlm package installation

Provides early, clear error messages when attempting to use
MLX backend on unsupported platforms.

Breaking changes: None (additive with safe defaults)

Add convert_to_mlx.py utility that wraps mlx_vlm.convert to simplify
converting olmOCR models from HuggingFace to MLX format.

Features:
- Convert models from HuggingFace Hub or local paths
- Support for quantization (4-bit, 8-bit with configurable group size)
- Platform validation (macOS + Apple Silicon only)
- Optional upload to HuggingFace Hub
- Clear usage instructions and progress logging

Command-line interface:
  python -m olmocr.convert_to_mlx MODEL --output PATH [--quantize 4]

Usage example:
  python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
    --output ~/models/olmocr-mlx --quantize 4 --group-size 64

Implementation details:
- Calls mlx_vlm.convert() directly with q_bits and q_group_size
- Default group size: 64 (same as mlx-community models)
- Validates Apple Silicon before attempting conversion

Dependencies: Requires mlx-vlm>=0.3.5 (installed via olmocr[mlx])

Update dependencies to support both vLLM and MLX-VLM backends.

Changes:
- Upgrade transformers: 4.55.2 → 4.57.0+
  * Ensures compatibility with latest HuggingFace models
  * Required for both training and inference backends

- Add MLX optional dependency group:
  * mlx-vlm>=0.3.5 for Apple Silicon inference
  * Install with: pip install olmocr[mlx]

- Add CLI entry point:
  * olmocr = "olmocr.pipeline:cli"
  * Enables `olmocr` command after installation

Breaking changes: None (transformers upgrade is compatible)

Add detailed documentation for using olmOCR with MLX-VLM backend
on Apple Silicon Macs. Integrated into Sphinx documentation site.

Location: docs/source/mlx-backend.md (added to Getting Started section)

Contents:
- Overview of MLX-VLM vs vLLM backends
- System requirements (M1/M2/M3/M4, macOS 12.0+, 16GB+ RAM)
- Installation instructions
- Quick start guide with pre-quantized models
- Configuration options and CLI flags
- Model selection guide (4-bit vs 8-bit quantization)
- Performance optimization tips
- Troubleshooting section
- API differences between vLLM and MLX-VLM
- Current limitations and workarounds
- Performance benchmarks on different Mac models

Key information:
- Default port: 8000 (vs 30024 for vLLM)
- API endpoint: /responses (vs /v1/chat/completions for vLLM)
- No guided decoding support (uses post-validation instead)
- Pre-quantized models available:
  * mlx-community/olmOCR-2-7B-1025-mlx-4bit (~2GB)
  * mlx-community/olmOCR-2-7B-1025-mlx-8bit (~4GB)

Target audience: Users with Apple Silicon Macs wanting on-device
inference without cloud costs or NVIDIA GPU requirements.

Ignore workspace/ directory used for test runs and pipeline output.
Similar to existing localworkspace/* entry.

Update system requirements to require macOS 15.0+ instead of 12.0+.
This reflects the tested and recommended minimum version for MLX-VLM
backend support.

@aryasaatvik merged commit 3192ed9 into main on Oct 29, 2025