
Conversation


@aryasaatvik commented Oct 29, 2025

Summary

This PR adds three improvements to olmOCR:

  1. Fix markdown output path bug for absolute PDF paths (f3198d2)
  2. Add programmatic Python API with custom prompt support (2b3530e)
  3. Add MLX-VLM backend support for Apple Silicon (3192ed9); resolves #33 (macOS support)

1. Fix Markdown Output Path Bug

Commit: f3198d2

Fixes a bug where markdown output paths were generated incorrectly for absolute PDF paths: os.path.join() discards the workspace prefix when given an absolute path, so markdown files ended up next to the source PDFs instead of under workspace/markdown/.

Changes:

  • Updated path handling in olmocr/pipeline.py
  • Added comprehensive test coverage in tests/test_pipeline.py

Files changed: 2 files (+94/-2)

2. Add Programmatic Python API

Commit: 2b3530e

Adds a clean programmatic Python API for olmOCR, making it easier to integrate into Python applications without using the CLI.

Features:

  • New PipelineConfig dataclass for type-safe configuration
  • run_pipeline() async function for programmatic use
  • Support for custom OCR prompts
  • Exported through olmocr.__init__.py for easy imports

Usage:

import asyncio
from olmocr import run_pipeline, PipelineConfig

config = PipelineConfig(
    workspace="./workspace",
    pdfs=["document.pdf"],
    custom_prompt="Extract all text, preserving formatting...",
    markdown=True
)

asyncio.run(run_pipeline(config))

Files changed: 4 files (+283/-68)

3. Add MLX-VLM Backend Support for Apple Silicon

Commit: 3192ed9

Adds MLX-VLM backend support for olmOCR, enabling efficient inference on Apple Silicon Macs (M1/M2/M3/M4) without requiring NVIDIA GPUs or cloud services.

Motivation

  • Enable on-device inference for users with Apple Silicon Macs
  • Reduce cloud inference costs for development and testing
  • Provide privacy-focused local processing option
  • Support users without access to NVIDIA GPU infrastructure

Changes

Backend Abstraction Layer (olmocr/backends.py) - 458 lines

  • New abstract base class InferenceBackend for multi-backend support
  • VLLMBackend implementation for NVIDIA GPUs
  • MLXVLMBackend implementation for Apple Silicon
  • Backend-specific request/response formatting
  • Automatic server health checking and startup

Key architectural decision: Each backend handles its own API format differences:

  • vLLM: /v1/chat/completions endpoint, pre-loads model with name "olmocr"
  • MLX-VLM: /responses endpoint, lazy-loads model using actual path on first request
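Roughly, the abstraction looks like the sketch below. The class and method names (InferenceBackend, VLLMBackend, MLXVLMBackend, BackendConfig, get_backend, build_request, parse_response, get_endpoint_path) come from this PR; the bodies and signatures are illustrative assumptions, not the exact contents of olmocr/backends.py.

```python
# Illustrative sketch of the backend abstraction; details are assumptions.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BackendConfig:
    model: str                  # HF repo id or local model path
    port: Optional[int] = None  # None -> backend default (vLLM: 30024, MLX: 8000)


class InferenceBackend(ABC):
    """Uniform interface the pipeline codes against for any backend."""

    def __init__(self, config: BackendConfig) -> None:
        self.config = config

    @abstractmethod
    def get_endpoint_path(self) -> str:
        """Path appended to the server base URL."""

    @abstractmethod
    def build_request(self, prompt: str, image_url: str, max_tokens: int) -> dict[str, Any]:
        """Translate a page query into this backend's request body."""

    @abstractmethod
    def parse_response(self, body: dict[str, Any]) -> str:
        """Extract the generated text from this backend's response body."""


class VLLMBackend(InferenceBackend):
    """OpenAI Chat Completions format; model pre-loaded under served name "olmocr"."""

    def get_endpoint_path(self) -> str:
        return "/v1/chat/completions"  # later adjusted in this PR to "/chat/completions"

    def build_request(self, prompt: str, image_url: str, max_tokens: int) -> dict[str, Any]:
        return {
            "model": "olmocr",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]}],
        }

    def parse_response(self, body: dict[str, Any]) -> str:
        return body["choices"][0]["message"]["content"]


class MLXVLMBackend(InferenceBackend):
    """OpenAI Responses format; model lazy-loaded from its actual path on first request."""

    def get_endpoint_path(self) -> str:
        return "/responses"

    def build_request(self, prompt: str, image_url: str, max_tokens: int) -> dict[str, Any]:
        # Request shape is an assumption based on the OpenAI Responses API.
        return {
            "model": self.config.model,
            "max_output_tokens": max_tokens,
            "input": [{"role": "user", "content": [
                {"type": "input_text", "text": prompt},
                {"type": "input_image", "image_url": image_url},
            ]}],
        }

    def parse_response(self, body: dict[str, Any]) -> str:
        return body["output"][0]["content"][0]["text"]


def get_backend(name: str, config: BackendConfig) -> InferenceBackend:
    """Factory used by the pipeline to select a backend by name."""
    registry = {"vllm": VLLMBackend, "mlx-vlm": MLXVLMBackend}
    if name not in registry:
        raise ValueError(f"Unknown backend {name!r}; expected 'vllm' or 'mlx-vlm'")
    return registry[name](config)
```

The pipeline only ever calls the three abstract methods, so adding another backend means adding one subclass and one registry entry.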

Pipeline Integration (olmocr/pipeline.py) - +145/-50 lines

  • Integrated backend abstraction into main inference pipeline
  • Backend-agnostic request building and response parsing
  • Model download happens before server startup
  • Correct model path handling for each backend

Configuration (olmocr/config.py) - +41 lines

  • Added backend field: "vllm" (default) or "mlx-vlm"
  • Added MLX-specific options: mlx_quantization, mlx_kv_bits
  • Platform validation for MLX backend:
    • Checks for macOS (Darwin)
    • Validates Apple Silicon architecture (arm64/aarch64)
    • Verifies mlx-vlm package installation
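A minimal sketch of that validation, assuming it lives in PipelineConfig.__post_init__ as described in the commit messages below. Only the backend-related fields are shown, and the error messages are illustrative:

```python
# Illustrative sketch of the MLX platform validation; not the exact config.py code.
import importlib.util
import platform
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineConfig:
    backend: str = "vllm"                     # "vllm" or "mlx-vlm"
    mlx_quantization: Optional[str] = None    # e.g. "4bit", "8bit"
    mlx_kv_bits: Optional[int] = None         # 1, 2, 4, or 8

    def __post_init__(self) -> None:
        if self.backend not in ("vllm", "mlx-vlm"):
            raise ValueError(f"backend must be 'vllm' or 'mlx-vlm', got {self.backend!r}")
        if self.backend == "mlx-vlm":
            if platform.system() != "Darwin":
                raise RuntimeError("The mlx-vlm backend requires macOS (Darwin).")
            if platform.machine().lower() not in ("arm64", "aarch64"):
                raise RuntimeError("The mlx-vlm backend requires Apple Silicon (arm64).")
            if importlib.util.find_spec("mlx_vlm") is None:
                raise RuntimeError("mlx-vlm is not installed; run: pip install olmocr[mlx]")
```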

Model Conversion Utility (olmocr/convert_to_mlx.py) - 218 lines

  • CLI tool for converting HuggingFace models to MLX format
  • Supports 4-bit and 8-bit quantization
  • Configurable group size for quantization (default: 64)
  • Platform validation and clear error messages
  • Progress logging for multi-step conversion process

Usage:

python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
  --output ~/models/olmocr-mlx --quantize 4 --group-size 64

Dependencies (pyproject.toml, uv.lock)

  • Upgraded transformers: 4.55.2 → 4.57.0+
  • Added MLX optional dependency: pip install olmocr[mlx]
  • Requires mlx-vlm>=0.3.5
  • Added CLI entry point: olmocr command

Documentation (docs/source/mlx-backend.md) - 482 lines

Comprehensive guide covering:

  • System requirements: macOS 15.0+, Apple Silicon
  • Installation instructions
  • Quick start with pre-quantized models
  • Configuration options and CLI usage
  • Model selection (4-bit vs 8-bit quantization)
  • Performance optimization tips
  • Troubleshooting common issues
  • API differences between vLLM and MLX-VLM
  • Performance benchmarks on different Mac models

Pre-quantized Models

Ready-to-use models available on HuggingFace:

  • mlx-community/olmOCR-2-7B-1025-mlx-4bit
  • mlx-community/olmOCR-2-7B-1025-mlx-8bit

Usage Example

# Install with MLX support
pip install olmocr[mlx]

# Run with 4-bit quantized model
olmocr ~/workspace \
  --pdfs sample.pdf \
  --backend mlx-vlm \
  --model mlx-community/olmOCR-2-7B-1025-mlx-4bit

API Differences

| Feature | vLLM | MLX-VLM |
| --- | --- | --- |
| Endpoint | `/v1/chat/completions` | `/responses` |
| Model loading | Pre-load at startup | Lazy load on first request |
| Request format | OpenAI Chat Completions | OpenAI Responses |
| Response path | `choices[0].message.content` | `output[0].content[0].text` |
| Guided decoding | ✅ Yes | ❌ No (post-validation) |
| Default port | 30024 | 8000 |

Limitations

  • macOS 15.0+ on Apple Silicon only: MLX-VLM requires M-series chips
  • No guided decoding: Responses validated after generation, may require retries
  • Single GPU: No multi-GPU support (uses unified memory)

Testing

Tested on:

  • macOS 15.2 (Sequoia)
  • Apple M4 Pro
  • MLX-VLM 0.3.5
  • Pre-quantized 4-bit and 8-bit models

Files changed: 9 files (+1414/-81)


Overall Changes

Total: 16 files changed, +9,484 insertions, -151 deletions

Breaking Changes

None - all changes are additive with safe defaults. Existing vLLM usage continues to work unchanged.

Testing

All changes have been tested locally. New test coverage added for:

  • Markdown output path handling
  • Programmatic API usage
  • MLX backend functionality on Apple Silicon

Related Issues

  • Addresses user requests for Apple Silicon support
  • Enables cloud-free local inference
  • Improves programmatic API usability

When using --markdown with absolute PDF paths, markdown files were
incorrectly written to the source PDF directory instead of the workspace.
This occurred because os.path.join(workspace, "markdown", "/absolute/path")
discards the workspace prefix when given an absolute path.

Changes:
- Extract only parent directory name from absolute paths to make them relative
- Example: /path/to/pdfs/2008/file.pdf -> workspace/markdown/2008/file.md
- Add comprehensive test suite (TestMarkdownPathHandling) with 4 test cases
- Tests cover various path depths, edge cases, and document the original bug

This ensures markdown files are stored in workspace/markdown/ as documented,
while preserving the folder structure of input PDFs.
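A minimal sketch of the fix (the helper name and exact structure are illustrative; the real change is inline in olmocr/pipeline.py):

```python
# Illustrative sketch: relativize an absolute PDF path before joining it under
# workspace/markdown/, so os.path.join does not discard the workspace prefix.
import os


def markdown_output_path(workspace: str, pdf_path: str) -> str:
    # os.path.join(workspace, "markdown", "/abs/path.pdf") would return
    # "/abs/path.pdf" outright, which is the bug being fixed here.
    base = os.path.splitext(os.path.basename(pdf_path))[0] + ".md"
    if os.path.isabs(pdf_path):
        # Keep only the immediate parent directory name to preserve grouping,
        # e.g. /path/to/pdfs/2008/file.pdf -> workspace/markdown/2008/file.md
        parent = os.path.basename(os.path.dirname(pdf_path))
        return os.path.join(workspace, "markdown", parent, base)
    return os.path.join(workspace, "markdown", os.path.dirname(pdf_path), base)


assert markdown_output_path("workspace", "/path/to/pdfs/2008/file.pdf") == \
    os.path.join("workspace", "markdown", "2008", "file.md")
```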
Add a type-safe, composable Python API for running the olmOCR pipeline
programmatically, eliminating the need for subprocess calls.

Key features:
- New PipelineConfig dataclass with 25+ configuration options
- run_pipeline() async function as main programmatic entry point
- Custom prompt support via custom_prompt parameter
- Full backward compatibility - CLI unchanged, delegates to shared impl
- All existing features maintained (retries, batching, server management)

Example usage:
```python
import asyncio
from olmocr import run_pipeline, PipelineConfig

config = PipelineConfig(
    workspace="./workspace",
    pdfs=["doc1.pdf", "doc2.pdf"],
    custom_prompt="Extract text from this legal document...",
    markdown=True,
    workers=10
)
asyncio.run(run_pipeline(config))
```

Changes:
- Created olmocr/config.py with PipelineConfig dataclass
- Extracted _main_impl() from main() to share logic between CLI and API
- Added run_pipeline() as programmatic entry point
- Added _config_to_args() helper to convert config to argparse.Namespace
- Added custom_prompt parameter to build_page_query()
- Threaded custom prompt through process_page() call stack
- Updated __init__.py to export PipelineConfig and run_pipeline
- Updated test mocks to accept custom_prompt parameter

Backward compatibility:
- CLI interface unchanged - main() delegates to _main_impl()
- All default behaviors preserved
- All existing flags and options work identically
- Custom prompt optional - defaults to original prompt if not provided
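A sketch of the _config_to_args() idea mentioned in the change list above. The dataclass here is trimmed to a few fields (the real PipelineConfig has 25+ options), and the helper body is an assumption:

```python
# Illustrative sketch of _config_to_args(): project the dataclass onto an
# argparse.Namespace so the existing CLI-oriented internals can be reused
# unchanged. Not the actual olmocr/pipeline.py code.
import argparse
import dataclasses
from typing import Optional


@dataclasses.dataclass
class PipelineConfig:  # trimmed to a few fields for the example
    workspace: str
    markdown: bool = False
    custom_prompt: Optional[str] = None


def _config_to_args(config: PipelineConfig) -> argparse.Namespace:
    # _main_impl() can then read args.workspace, args.custom_prompt, etc.
    # exactly as it would after argparse parsing of a CLI invocation.
    return argparse.Namespace(**dataclasses.asdict(config))


args = _config_to_args(PipelineConfig(workspace="./workspace", markdown=True))
print(args.workspace, args.markdown, args.custom_prompt)  # ./workspace True None
```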

Update dependency lock file to reflect current package state.

* feat(backends): add inference backend abstraction layer

Introduce InferenceBackend abstract base class with implementations
for vLLM (NVIDIA GPUs) and MLX-VLM (Apple Silicon). This abstraction
enables olmOCR to support multiple inference backends through a
unified interface.

Key components:
- BackendConfig dataclass for unified backend configuration
- VLLMBackend: OpenAI Chat Completions API, guided decoding support
- MLXVLMBackend: OpenAI Responses API, lazy model loading
- get_backend() factory for backend instantiation

Each backend handles:
- Server process lifecycle management
- Health check endpoints
- Request/response format translation
- Model-specific validation

Breaking changes: None (additive change)

* feat(pipeline): integrate backend abstraction for multi-backend support

Replace hardcoded vLLM logic with backend-agnostic implementation
using the new InferenceBackend abstraction. This enables seamless
switching between vLLM and MLX-VLM backends.

Key changes:
- Thread backend instance through processing pipeline
- Use backend.build_request() and backend.parse_response()
- Dynamic endpoint paths via backend.get_endpoint_path()
- Backend-aware platform checks (skip CUDA for MLX)
- Fix critical model path handling bug:
  * vLLM: Use served name "olmocr" (model pre-loaded at startup)
  * MLX-VLM: Use actual model path (lazy loading on first request)
- Add CLI support: --backend, --mlx_quantization, --mlx_kv_bits
- Backend-specific port defaults (vLLM: 30024, MLX: 8000)

CLI additions:
- --backend {vllm,mlx-vlm}: Select inference backend
- --custom_prompt: Override default OCR prompt
- --mlx_quantization: MLX model quantization (4bit, 8bit, etc.)
- --mlx_kv_bits: MLX KV-cache quantization bits

Breaking changes: None (default behavior unchanged)
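Sketched, the backend-agnostic call looks roughly like this; httpx is used purely for illustration, and the real pipeline's retry and post-validation logic is omitted:

```python
# Illustrative only: how process_page can stay backend-agnostic by delegating
# request building, endpoint selection, and response parsing to the backend.
from typing import Any

import httpx  # stand-in HTTP client for this sketch


async def query_backend(
    backend: Any,    # an InferenceBackend instance returned by get_backend()
    base_url: str,   # e.g. "http://localhost:30024/v1" for the internal vLLM server
    prompt: str,
    image_url: str,
    max_tokens: int = 8000,
) -> str:
    request_body = backend.build_request(prompt, image_url, max_tokens)
    url = base_url.rstrip("/") + backend.get_endpoint_path()
    async with httpx.AsyncClient(timeout=300) as client:
        response = await client.post(url, json=request_body)
        response.raise_for_status()
        return backend.parse_response(response.json())
```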

* feat(config): add backend selection and platform validation

Extend PipelineConfig with backend configuration options and
platform-specific validation for MLX-VLM backend.

New configuration fields:
- backend: str = "vllm" - Select inference backend
- mlx_quantization: Optional[str] - MLX quantization (4bit, 8bit, etc.)
- mlx_kv_bits: Optional[int] - KV-cache quantization bits (1, 2, 4, 8)

Validation:
- Ensure backend is "vllm" or "mlx-vlm"
- MLX-specific checks in __post_init__:
  * Verify platform is macOS (Darwin)
  * Verify architecture is ARM64/Apple Silicon
  * Check mlx-vlm package installation

Provides early, clear error messages when attempting to use
MLX backend on unsupported platforms.

Breaking changes: None (additive with safe defaults)

* feat(mlx): add model conversion utility for MLX format

Add convert_to_mlx.py utility that wraps mlx_vlm.convert to simplify
converting olmOCR models from HuggingFace to MLX format.

Features:
- Convert models from HuggingFace Hub or local paths
- Support for quantization (4-bit, 8-bit with configurable group size)
- Platform validation (macOS + Apple Silicon only)
- Optional upload to HuggingFace Hub
- Clear usage instructions and progress logging

Command-line interface:
  python -m olmocr.convert_to_mlx MODEL --output PATH [--quantize 4]

Usage example:
  python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
    --output ~/models/olmocr-mlx --quantize 4 --group-size 64

Implementation details:
- Calls mlx_vlm.convert() directly with q_bits and q_group_size
- Default group size: 64 (same as mlx-community models)
- Validates Apple Silicon before attempting conversion

Dependencies: Requires mlx-vlm>=0.3.5 (installed via olmocr[mlx])
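For reference, a hedged sketch of the wrapper's core call. The commit message above says the utility calls mlx_vlm.convert() with q_bits and q_group_size; the keyword names and call shape below mirror that description and should be verified against the installed mlx-vlm release:

```python
# Illustrative core of convert_to_mlx; argument names are assumptions based on
# the commit message and may not match every mlx-vlm release.
import platform

import mlx_vlm  # installed via: pip install olmocr[mlx]


def convert_model(hf_path: str, output_path: str, quantize: int = 4, group_size: int = 64) -> None:
    if platform.system() != "Darwin" or platform.machine().lower() not in ("arm64", "aarch64"):
        raise RuntimeError("MLX conversion requires macOS on Apple Silicon.")
    mlx_vlm.convert(
        hf_path,                  # e.g. "allenai/olmOCR-2-7B-1025" or a local path
        mlx_path=output_path,     # where the converted MLX weights are written
        quantize=True,
        q_bits=quantize,          # 4 or 8
        q_group_size=group_size,  # default 64, matching the mlx-community models
    )
```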

* build: upgrade transformers and add MLX optional dependency

Update dependencies to support both vLLM and MLX-VLM backends.

Changes:
- Upgrade transformers: 4.55.2 → 4.57.0+
  * Ensures compatibility with latest HuggingFace models
  * Required for both training and inference backends

- Add MLX optional dependency group:
  * mlx-vlm>=0.3.5 for Apple Silicon inference
  * Install with: pip install olmocr[mlx]

- Add CLI entry point:
  * olmocr = "olmocr.pipeline:cli"
  * Enables `olmocr` command after installation

Breaking changes: None (transformers upgrade is compatible)

* docs: add comprehensive MLX backend guide

Add detailed documentation for using olmOCR with MLX-VLM backend
on Apple Silicon Macs. Integrated into Sphinx documentation site.

Location: docs/source/mlx-backend.md (added to Getting Started section)

Contents:
- Overview of MLX-VLM vs vLLM backends
- System requirements (M1/M2/M3/M4, macOS 12.0+, 16GB+ RAM)
- Installation instructions
- Quick start guide with pre-quantized models
- Configuration options and CLI flags
- Model selection guide (4-bit vs 8-bit quantization)
- Performance optimization tips
- Troubleshooting section
- API differences between vLLM and MLX-VLM
- Current limitations and workarounds
- Performance benchmarks on different Mac models

Key information:
- Default port: 8000 (vs 30024 for vLLM)
- API endpoint: /responses (vs /v1/chat/completions for vLLM)
- No guided decoding support (uses post-validation instead)
- Pre-quantized models available:
  * mlx-community/olmOCR-2-7B-1025-mlx-4bit (~2GB)
  * mlx-community/olmOCR-2-7B-1025-mlx-8bit (~4GB)

Target audience: Users with Apple Silicon Macs wanting on-device
inference without cloud costs or NVIDIA GPU requirements.

* chore: add workspace/ to .gitignore

Ignore workspace/ directory used for test runs and pipeline output.
Similar to existing localworkspace/* entry.

* docs: update minimum macOS version to 15.0+ (Sequoia)

Update system requirements to require macOS 15.0+ instead of 12.0+.
This reflects the tested and recommended minimum version for MLX-VLM
backend support.
@aryasaatvik (Author) commented:

Used about 16 GB of peak memory with the 8-bit quant (--backend mlx-vlm --model mlx-community/olmOCR-2-7B-1025-mlx-8bit):

2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - FINAL METRICS SUMMARY
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - ================================================================================
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Total elapsed time: 105.75 seconds
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Total Server Input tokens: 10,524
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Total Server Output tokens: 3,543
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Finished input tokens: 10,524
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Finished output tokens: 3,543
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Completed pages: 7
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Failed pages: 0
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Page Failure rate: 0.00%
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO -
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Pages finished by attempt number:
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Attempt 0: 7 pages (100.0%) - Cumulative: 7 (100.0%)
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Server Input tokens/sec rate: 99.52
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Server Output tokens/sec rate: 33.50
2025-10-30 01:21:23,214 - olmocr.pipeline - INFO - Finished Input tokens/sec rate: 99.52
2025-10-30 01:21:23,214 - olmocr.pipeline - INFO - Finished Output tokens/sec rate: 33.50

Change VLLMBackend.get_endpoint_path() from "/v1/chat/completions"
to "/chat/completions" to avoid double /v1 in URL construction.

This fixes external vLLM servers (like DeepInfra) which were getting
404 errors due to malformed URLs:
- Before: https://api.deepinfra.com/v1/openai/v1/chat/completions (404)
- After:  https://api.deepinfra.com/v1/openai/chat/completions (works)

Internal vLLM servers still work correctly because the base URL already
includes /v1 (set in pipeline.py:1326):
- Internal: http://localhost:30024/v1/chat/completions ✅
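A quick sketch of the resulting URL construction (the helper is hypothetical; it just illustrates why the shorter endpoint path composes correctly with both base URLs):

```python
# Illustrative: why returning "/chat/completions" (not "/v1/chat/completions")
# from VLLMBackend.get_endpoint_path() avoids the doubled /v1 for external servers.
def build_url(base_url: str, endpoint_path: str) -> str:
    return base_url.rstrip("/") + endpoint_path


# Internal server: the base URL already ends in /v1 (set in pipeline.py)
assert build_url("http://localhost:30024/v1", "/chat/completions") == \
    "http://localhost:30024/v1/chat/completions"

# External server (e.g. DeepInfra): the base URL carries its own prefix
assert build_url("https://api.deepinfra.com/v1/openai", "/chat/completions") == \
    "https://api.deepinfra.com/v1/openai/chat/completions"
```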
Remove global pdf_render_max_workers BoundedSemaphore that was created
at module import time, causing RuntimeError when running batched
extractions with multiple event loops.

Solution:
- Create semaphore in _main_impl() where event loop exists
- Pass through call chain: worker -> process_pdf -> process_page
- Each event loop now gets its own semaphore instance

This fixes: RuntimeError: <BoundedSemaphore> is bound to a different event loop

Fixes batched processing in bharatlex and other multi-loop scenarios.
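A minimal sketch of the pattern (names are illustrative; the real code threads the semaphore through worker -> process_pdf -> process_page as described above):

```python
# Illustrative: create the render semaphore inside the running event loop
# instead of at module import time, so each asyncio.run() gets its own instance
# and no "bound to a different event loop" RuntimeError can occur.
import asyncio


async def process_pdf(pdf_path: str, render_semaphore: asyncio.BoundedSemaphore) -> None:
    async with render_semaphore:   # limits concurrent PDF rendering
        await asyncio.sleep(0)     # placeholder for the real rendering/inference work


async def _main_impl(pdf_paths: list[str], max_render_workers: int = 8) -> None:
    # Created here, inside the event loop, rather than as a module-level global.
    render_semaphore = asyncio.BoundedSemaphore(max_render_workers)
    await asyncio.gather(*(process_pdf(p, render_semaphore) for p in pdf_paths))


# Two separate event loops: each run creates its own semaphore.
asyncio.run(_main_impl(["a.pdf"]))
asyncio.run(_main_impl(["b.pdf"]))
```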
…or improved concurrency handling

- Introduced max_concurrent_work_items parameter in PipelineConfig to control the number of work items processed concurrently, enhancing flexibility for local GPU and external API usage.
- Updated _main_impl to utilize this new parameter, allowing for dynamic semaphore configuration based on user input.
- Added command-line argument support for max_concurrent_work_items to facilitate user configuration.
…nt processing

- Introduced max_tokens parameter in PipelineConfig to specify the maximum tokens generated per page, allowing for better handling of dense documents.
- Updated process_page function to utilize max_tokens from command-line arguments, defaulting to 8000 if not specified.
- Added command-line argument support for max_tokens to improve user configurability.
@jakep-allenai (Collaborator) commented:

I don't think I can merge in a 10k line diff here.

@Systemcluster commented:

> I don't think I can merge in a 10k line diff here.

Considering that 8k of those lines are in uv.lock, do the other changes look reasonable?

@jakep-allenai (Collaborator) commented:

No sorry, it makes too many changes. The overall pipeline.py is currently only 1,400 lines. Tripling the complexity is not something we can take on. You could support a new backend with 0 lines of code by simply giving instructions to launch an OpenAI-API-compatible endpoint, and then pointing the current pipeline to it with the --server argument.
