
Conversation


@aryasaatvik commented Oct 29, 2025

Summary

This PR adds three improvements to olmOCR:

  1. Fix markdown output path bug for absolute PDF paths (f3198d2)
  2. Add programmatic Python API with custom prompt support (2b3530e)
  3. Add MLX-VLM backend support for Apple Silicon (3192ed9); resolves #33 (macOS support)

1. Fix Markdown Output Path Bug

Commit: f3198d2

Fixes a bug where markdown output paths were generated incorrectly for absolute PDF paths: os.path.join() discards the workspace prefix when given an absolute path, so markdown files ended up next to the source PDFs instead of under workspace/markdown/.

Changes:

  • Updated path handling in olmocr/pipeline.py
  • Added comprehensive test coverage in tests/test_pipeline.py

Files changed: 2 files (+94/-2)

2. Add Programmatic Python API

Commit: 2b3530e

Adds a clean programmatic Python API for olmOCR, making it easier to integrate into Python applications without using the CLI.

Features:

  • New PipelineConfig dataclass for type-safe configuration
  • run_pipeline() async function for programmatic use
  • Support for custom OCR prompts
  • Exported through olmocr.__init__.py for easy imports

Usage:

import asyncio
from olmocr import run_pipeline, PipelineConfig

config = PipelineConfig(
    workspace="./workspace",
    pdfs=["document.pdf"],
    custom_prompt="Extract all text, preserving formatting...",
    markdown=True
)

asyncio.run(run_pipeline(config))

Files changed: 4 files (+283/-68)

3. Add MLX-VLM Backend Support for Apple Silicon

Commit: 3192ed9

Adds MLX-VLM backend support for olmOCR, enabling efficient inference on Apple Silicon Macs (M1/M2/M3/M4) without requiring NVIDIA GPUs or cloud services.

Motivation

  • Enable on-device inference for users with Apple Silicon Macs
  • Reduce cloud inference costs for development and testing
  • Provide privacy-focused local processing option
  • Support users without access to NVIDIA GPU infrastructure

Changes

Backend Abstraction Layer (olmocr/backends.py) - 458 lines

  • New abstract base class InferenceBackend for multi-backend support
  • VLLMBackend implementation for NVIDIA GPUs
  • MLXVLMBackend implementation for Apple Silicon
  • Backend-specific request/response formatting
  • Automatic server health checking and startup

Key architectural decision: Each backend handles its own API format differences:

  • vLLM: /v1/chat/completions endpoint, pre-loads model with name "olmocr"
  • MLX-VLM: /responses endpoint, lazy-loads model using actual path on first request
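Roughly, the abstraction looks like the sketch below. The class and method names (InferenceBackend, VLLMBackend, MLXVLMBackend, BackendConfig, get_backend, build_request, parse_response, get_endpoint_path) come from this PR; the bodies and signatures are illustrative assumptions, not the exact contents of olmocr/backends.py.

```python
# Illustrative sketch of the backend abstraction; details are assumptions.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BackendConfig:
    model: str                  # HF repo id or local model path
    port: Optional[int] = None  # None -> backend default (vLLM: 30024, MLX: 8000)


class InferenceBackend(ABC):
    """Uniform interface the pipeline codes against for any backend."""

    def __init__(self, config: BackendConfig) -> None:
        self.config = config

    @abstractmethod
    def get_endpoint_path(self) -> str:
        """Path appended to the server base URL."""

    @abstractmethod
    def build_request(self, prompt: str, image_url: str, max_tokens: int) -> dict[str, Any]:
        """Translate a page query into this backend's request body."""

    @abstractmethod
    def parse_response(self, body: dict[str, Any]) -> str:
        """Extract the generated text from this backend's response body."""


class VLLMBackend(InferenceBackend):
    """OpenAI Chat Completions format; model pre-loaded under served name "olmocr"."""

    def get_endpoint_path(self) -> str:
        return "/v1/chat/completions"  # later adjusted in this PR to "/chat/completions"

    def build_request(self, prompt: str, image_url: str, max_tokens: int) -> dict[str, Any]:
        return {
            "model": "olmocr",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]}],
        }

    def parse_response(self, body: dict[str, Any]) -> str:
        return body["choices"][0]["message"]["content"]


class MLXVLMBackend(InferenceBackend):
    """OpenAI Responses format; model lazy-loaded from its actual path on first request."""

    def get_endpoint_path(self) -> str:
        return "/responses"

    def build_request(self, prompt: str, image_url: str, max_tokens: int) -> dict[str, Any]:
        # Request shape is an assumption based on the OpenAI Responses API.
        return {
            "model": self.config.model,
            "max_output_tokens": max_tokens,
            "input": [{"role": "user", "content": [
                {"type": "input_text", "text": prompt},
                {"type": "input_image", "image_url": image_url},
            ]}],
        }

    def parse_response(self, body: dict[str, Any]) -> str:
        return body["output"][0]["content"][0]["text"]


def get_backend(name: str, config: BackendConfig) -> InferenceBackend:
    """Factory used by the pipeline to select a backend by name."""
    registry = {"vllm": VLLMBackend, "mlx-vlm": MLXVLMBackend}
    if name not in registry:
        raise ValueError(f"Unknown backend {name!r}; expected 'vllm' or 'mlx-vlm'")
    return registry[name](config)
```

The pipeline only ever calls the three abstract methods, so adding another backend means adding one subclass and one registry entry.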

Pipeline Integration (olmocr/pipeline.py) - +145/-50 lines

  • Integrated backend abstraction into main inference pipeline
  • Backend-agnostic request building and response parsing
  • Model download happens before server startup
  • Correct model path handling for each backend

Configuration (olmocr/config.py) - +41 lines

  • Added backend field: "vllm" (default) or "mlx-vlm"
  • Added MLX-specific options: mlx_quantization, mlx_kv_bits
  • Platform validation for MLX backend:
    • Checks for macOS (Darwin)
    • Validates Apple Silicon architecture (arm64/aarch64)
    • Verifies mlx-vlm package installation
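A minimal sketch of that validation, assuming it lives in PipelineConfig.__post_init__ as described in the commit messages below. Only the backend-related fields are shown, and the error messages are illustrative:

```python
# Illustrative sketch of the MLX platform validation; not the exact config.py code.
import importlib.util
import platform
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineConfig:
    backend: str = "vllm"                     # "vllm" or "mlx-vlm"
    mlx_quantization: Optional[str] = None    # e.g. "4bit", "8bit"
    mlx_kv_bits: Optional[int] = None         # 1, 2, 4, or 8

    def __post_init__(self) -> None:
        if self.backend not in ("vllm", "mlx-vlm"):
            raise ValueError(f"backend must be 'vllm' or 'mlx-vlm', got {self.backend!r}")
        if self.backend == "mlx-vlm":
            if platform.system() != "Darwin":
                raise RuntimeError("The mlx-vlm backend requires macOS (Darwin).")
            if platform.machine().lower() not in ("arm64", "aarch64"):
                raise RuntimeError("The mlx-vlm backend requires Apple Silicon (arm64).")
            if importlib.util.find_spec("mlx_vlm") is None:
                raise RuntimeError("mlx-vlm is not installed; run: pip install olmocr[mlx]")
```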

Model Conversion Utility (olmocr/convert_to_mlx.py) - 218 lines

  • CLI tool for converting HuggingFace models to MLX format
  • Supports 4-bit and 8-bit quantization
  • Configurable group size for quantization (default: 64)
  • Platform validation and clear error messages
  • Progress logging for multi-step conversion process

Usage:

python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
  --output ~/models/olmocr-mlx --quantize 4 --group-size 64

Dependencies (pyproject.toml, uv.lock)

  • Upgraded transformers: 4.55.2 → 4.57.0+
  • Added MLX optional dependency: pip install olmocr[mlx]
  • Requires mlx-vlm>=0.3.5
  • Added CLI entry point: olmocr command

Documentation (docs/source/mlx-backend.md) - 482 lines

Comprehensive guide covering:

  • System requirements: macOS 15.0+, Apple Silicon
  • Installation instructions
  • Quick start with pre-quantized models
  • Configuration options and CLI usage
  • Model selection (4-bit vs 8-bit quantization)
  • Performance optimization tips
  • Troubleshooting common issues
  • API differences between vLLM and MLX-VLM
  • Performance benchmarks on different Mac models

Pre-quantized Models

Ready-to-use models available on HuggingFace:

  • mlx-community/olmOCR-2-7B-1025-mlx-4bit
  • mlx-community/olmOCR-2-7B-1025-mlx-8bit

Usage Example

# Install with MLX support
pip install olmocr[mlx]

# Run with 4-bit quantized model
olmocr ~/workspace \
  --pdfs sample.pdf \
  --backend mlx-vlm \
  --model mlx-community/olmOCR-2-7B-1025-mlx-4bit

API Differences

| Feature | vLLM | MLX-VLM |
| --- | --- | --- |
| Endpoint | `/v1/chat/completions` | `/responses` |
| Model loading | Pre-load at startup | Lazy load on first request |
| Request format | OpenAI Chat Completions | OpenAI Responses |
| Response path | `choices[0].message.content` | `output[0].content[0].text` |
| Guided decoding | ✅ Yes | ❌ No (post-validation) |
| Default port | 30024 | 8000 |

Limitations

  • macOS 15.0+ on Apple Silicon only: MLX-VLM requires M-series chips
  • No guided decoding: Responses validated after generation, may require retries
  • Single GPU: No multi-GPU support (uses unified memory)

Testing

Tested on:

  • macOS 15.2 (Sequoia)
  • Apple M4 Pro
  • MLX-VLM 0.3.5
  • Pre-quantized 4-bit and 8-bit models

Files changed: 9 files (+1414/-81)


Overall Changes

Total: 16 files changed, +9,484 insertions, -151 deletions

Breaking Changes

None - all changes are additive with safe defaults. Existing vLLM usage continues to work unchanged.

Testing

All changes have been tested locally. New test coverage added for:

  • Markdown output path handling
  • Programmatic API usage
  • MLX backend functionality on Apple Silicon

Related Issues

  • Addresses user requests for Apple Silicon support
  • Enables cloud-free local inference
  • Improves programmatic API usability

When using --markdown with absolute PDF paths, markdown files were
incorrectly written to the source PDF directory instead of the workspace.
This occurred because os.path.join(workspace, "markdown", "/absolute/path")
discards the workspace prefix when given an absolute path.

Changes:
- Extract only parent directory name from absolute paths to make them relative
- Example: /path/to/pdfs/2008/file.pdf -> workspace/markdown/2008/file.md
- Add comprehensive test suite (TestMarkdownPathHandling) with 4 test cases
- Tests cover various path depths, edge cases, and document the original bug

This ensures markdown files are stored in workspace/markdown/ as documented,
while preserving the folder structure of input PDFs.
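A minimal sketch of the fix (the helper name and exact structure are illustrative; the real change is inline in olmocr/pipeline.py):

```python
# Illustrative sketch: relativize an absolute PDF path before joining it under
# workspace/markdown/, so os.path.join does not discard the workspace prefix.
import os


def markdown_output_path(workspace: str, pdf_path: str) -> str:
    # os.path.join(workspace, "markdown", "/abs/path.pdf") would return
    # "/abs/path.pdf" outright, which is the bug being fixed here.
    base = os.path.splitext(os.path.basename(pdf_path))[0] + ".md"
    if os.path.isabs(pdf_path):
        # Keep only the immediate parent directory name to preserve grouping,
        # e.g. /path/to/pdfs/2008/file.pdf -> workspace/markdown/2008/file.md
        parent = os.path.basename(os.path.dirname(pdf_path))
        return os.path.join(workspace, "markdown", parent, base)
    return os.path.join(workspace, "markdown", os.path.dirname(pdf_path), base)


assert markdown_output_path("workspace", "/path/to/pdfs/2008/file.pdf") == \
    os.path.join("workspace", "markdown", "2008", "file.md")
```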
Add a type-safe, composable Python API for running the olmOCR pipeline
programmatically, eliminating the need for subprocess calls.

Key features:
- New PipelineConfig dataclass with 25+ configuration options
- run_pipeline() async function as main programmatic entry point
- Custom prompt support via custom_prompt parameter
- Full backward compatibility - CLI unchanged, delegates to shared impl
- All existing features maintained (retries, batching, server management)

Example usage:
```python
import asyncio
from olmocr import run_pipeline, PipelineConfig

config = PipelineConfig(
    workspace="./workspace",
    pdfs=["doc1.pdf", "doc2.pdf"],
    custom_prompt="Extract text from this legal document...",
    markdown=True,
    workers=10
)
asyncio.run(run_pipeline(config))
```

Changes:
- Created olmocr/config.py with PipelineConfig dataclass
- Extracted _main_impl() from main() to share logic between CLI and API
- Added run_pipeline() as programmatic entry point
- Added _config_to_args() helper to convert config to argparse.Namespace
- Added custom_prompt parameter to build_page_query()
- Threaded custom prompt through process_page() call stack
- Updated __init__.py to export PipelineConfig and run_pipeline
- Updated test mocks to accept custom_prompt parameter

Backward compatibility:
- CLI interface unchanged - main() delegates to _main_impl()
- All default behaviors preserved
- All existing flags and options work identically
- Custom prompt optional - defaults to original prompt if not provided
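A sketch of the _config_to_args() idea mentioned in the change list above. The dataclass here is trimmed to a few fields (the real PipelineConfig has 25+ options), and the helper body is an assumption:

```python
# Illustrative sketch of _config_to_args(): project the dataclass onto an
# argparse.Namespace so the existing CLI-oriented internals can be reused
# unchanged. Not the actual olmocr/pipeline.py code.
import argparse
import dataclasses
from typing import Optional


@dataclasses.dataclass
class PipelineConfig:  # trimmed to a few fields for the example
    workspace: str
    markdown: bool = False
    custom_prompt: Optional[str] = None


def _config_to_args(config: PipelineConfig) -> argparse.Namespace:
    # _main_impl() can then read args.workspace, args.custom_prompt, etc.
    # exactly as it would after argparse parsing of a CLI invocation.
    return argparse.Namespace(**dataclasses.asdict(config))


args = _config_to_args(PipelineConfig(workspace="./workspace", markdown=True))
print(args.workspace, args.markdown, args.custom_prompt)  # ./workspace True None
```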

Update dependency lock file to reflect current package state.

* feat(backends): add inference backend abstraction layer

Introduce InferenceBackend abstract base class with implementations
for vLLM (NVIDIA GPUs) and MLX-VLM (Apple Silicon). This abstraction
enables olmOCR to support multiple inference backends through a
unified interface.

Key components:
- BackendConfig dataclass for unified backend configuration
- VLLMBackend: OpenAI Chat Completions API, guided decoding support
- MLXVLMBackend: OpenAI Responses API, lazy model loading
- get_backend() factory for backend instantiation

Each backend handles:
- Server process lifecycle management
- Health check endpoints
- Request/response format translation
- Model-specific validation

Breaking changes: None (additive change)

* feat(pipeline): integrate backend abstraction for multi-backend support

Replace hardcoded vLLM logic with backend-agnostic implementation
using the new InferenceBackend abstraction. This enables seamless
switching between vLLM and MLX-VLM backends.

Key changes:
- Thread backend instance through processing pipeline
- Use backend.build_request() and backend.parse_response()
- Dynamic endpoint paths via backend.get_endpoint_path()
- Backend-aware platform checks (skip CUDA for MLX)
- Fix critical model path handling bug:
  * vLLM: Use served name "olmocr" (model pre-loaded at startup)
  * MLX-VLM: Use actual model path (lazy loading on first request)
- Add CLI support: --backend, --mlx_quantization, --mlx_kv_bits
- Backend-specific port defaults (vLLM: 30024, MLX: 8000)

CLI additions:
- --backend {vllm,mlx-vlm}: Select inference backend
- --custom_prompt: Override default OCR prompt
- --mlx_quantization: MLX model quantization (4bit, 8bit, etc.)
- --mlx_kv_bits: MLX KV-cache quantization bits

Breaking changes: None (default behavior unchanged)
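Sketched, the backend-agnostic call looks roughly like this; httpx is used purely for illustration, and the real pipeline's retry and post-validation logic is omitted:

```python
# Illustrative only: how process_page can stay backend-agnostic by delegating
# request building, endpoint selection, and response parsing to the backend.
from typing import Any

import httpx  # stand-in HTTP client for this sketch


async def query_backend(
    backend: Any,    # an InferenceBackend instance returned by get_backend()
    base_url: str,   # e.g. "http://localhost:30024/v1" for the internal vLLM server
    prompt: str,
    image_url: str,
    max_tokens: int = 8000,
) -> str:
    request_body = backend.build_request(prompt, image_url, max_tokens)
    url = base_url.rstrip("/") + backend.get_endpoint_path()
    async with httpx.AsyncClient(timeout=300) as client:
        response = await client.post(url, json=request_body)
        response.raise_for_status()
        return backend.parse_response(response.json())
```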

* feat(config): add backend selection and platform validation

Extend PipelineConfig with backend configuration options and
platform-specific validation for MLX-VLM backend.

New configuration fields:
- backend: str = "vllm" - Select inference backend
- mlx_quantization: Optional[str] - MLX quantization (4bit, 8bit, etc.)
- mlx_kv_bits: Optional[int] - KV-cache quantization bits (1, 2, 4, 8)

Validation:
- Ensure backend is "vllm" or "mlx-vlm"
- MLX-specific checks in __post_init__:
  * Verify platform is macOS (Darwin)
  * Verify architecture is ARM64/Apple Silicon
  * Check mlx-vlm package installation

Provides early, clear error messages when attempting to use
MLX backend on unsupported platforms.

Breaking changes: None (additive with safe defaults)

* feat(mlx): add model conversion utility for MLX format

Add convert_to_mlx.py utility that wraps mlx_vlm.convert to simplify
converting olmOCR models from HuggingFace to MLX format.

Features:
- Convert models from HuggingFace Hub or local paths
- Support for quantization (4-bit, 8-bit with configurable group size)
- Platform validation (macOS + Apple Silicon only)
- Optional upload to HuggingFace Hub
- Clear usage instructions and progress logging

Command-line interface:
  python -m olmocr.convert_to_mlx MODEL --output PATH [--quantize 4]

Usage example:
  python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
    --output ~/models/olmocr-mlx --quantize 4 --group-size 64

Implementation details:
- Calls mlx_vlm.convert() directly with q_bits and q_group_size
- Default group size: 64 (same as mlx-community models)
- Validates Apple Silicon before attempting conversion

Dependencies: Requires mlx-vlm>=0.3.5 (installed via olmocr[mlx])
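For reference, a hedged sketch of the wrapper's core call. The commit message above says the utility calls mlx_vlm.convert() with q_bits and q_group_size; the keyword names and call shape below mirror that description and should be verified against the installed mlx-vlm release:

```python
# Illustrative core of convert_to_mlx; argument names are assumptions based on
# the commit message and may not match every mlx-vlm release.
import platform

import mlx_vlm  # installed via: pip install olmocr[mlx]


def convert_model(hf_path: str, output_path: str, quantize: int = 4, group_size: int = 64) -> None:
    if platform.system() != "Darwin" or platform.machine().lower() not in ("arm64", "aarch64"):
        raise RuntimeError("MLX conversion requires macOS on Apple Silicon.")
    mlx_vlm.convert(
        hf_path,                  # e.g. "allenai/olmOCR-2-7B-1025" or a local path
        mlx_path=output_path,     # where the converted MLX weights are written
        quantize=True,
        q_bits=quantize,          # 4 or 8
        q_group_size=group_size,  # default 64, matching the mlx-community models
    )
```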

* build: upgrade transformers and add MLX optional dependency

Update dependencies to support both vLLM and MLX-VLM backends.

Changes:
- Upgrade transformers: 4.55.2 → 4.57.0+
  * Ensures compatibility with latest HuggingFace models
  * Required for both training and inference backends

- Add MLX optional dependency group:
  * mlx-vlm>=0.3.5 for Apple Silicon inference
  * Install with: pip install olmocr[mlx]

- Add CLI entry point:
  * olmocr = "olmocr.pipeline:cli"
  * Enables `olmocr` command after installation

Breaking changes: None (transformers upgrade is compatible)

* docs: add comprehensive MLX backend guide

Add detailed documentation for using olmOCR with MLX-VLM backend
on Apple Silicon Macs. Integrated into Sphinx documentation site.

Location: docs/source/mlx-backend.md (added to Getting Started section)

Contents:
- Overview of MLX-VLM vs vLLM backends
- System requirements (M1/M2/M3/M4, macOS 12.0+, 16GB+ RAM)
- Installation instructions
- Quick start guide with pre-quantized models
- Configuration options and CLI flags
- Model selection guide (4-bit vs 8-bit quantization)
- Performance optimization tips
- Troubleshooting section
- API differences between vLLM and MLX-VLM
- Current limitations and workarounds
- Performance benchmarks on different Mac models

Key information:
- Default port: 8000 (vs 30024 for vLLM)
- API endpoint: /responses (vs /v1/chat/completions for vLLM)
- No guided decoding support (uses post-validation instead)
- Pre-quantized models available:
  * mlx-community/olmOCR-2-7B-1025-mlx-4bit (~2GB)
  * mlx-community/olmOCR-2-7B-1025-mlx-8bit (~4GB)

Target audience: Users with Apple Silicon Macs wanting on-device
inference without cloud costs or NVIDIA GPU requirements.

* chore: add workspace/ to .gitignore

Ignore workspace/ directory used for test runs and pipeline output.
Similar to existing localworkspace/* entry.

* docs: update minimum macOS version to 15.0+ (Sequoia)

Update system requirements to require macOS 15.0+ instead of 12.0+.
This reflects the tested and recommended minimum version for MLX-VLM
backend support.
@aryasaatvik (Author) commented:

Used about 16 GB of peak memory with the 8-bit quant (--backend mlx-vlm --model mlx-community/olmOCR-2-7B-1025-mlx-8bit):

2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - FINAL METRICS SUMMARY
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - ================================================================================
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Total elapsed time: 105.75 seconds
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Total Server Input tokens: 10,524
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Total Server Output tokens: 3,543
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Finished input tokens: 10,524
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Finished output tokens: 3,543
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Completed pages: 7
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Failed pages: 0
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Page Failure rate: 0.00%
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO -
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Pages finished by attempt number:
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Attempt 0: 7 pages (100.0%) - Cumulative: 7 (100.0%)
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Server Input tokens/sec rate: 99.52
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - Server Output tokens/sec rate: 33.50
2025-10-30 01:21:23,214 - olmocr.pipeline - INFO - Finished Input tokens/sec rate: 99.52
2025-10-30 01:21:23,214 - olmocr.pipeline - INFO - Finished Output tokens/sec rate: 33.50

Change VLLMBackend.get_endpoint_path() from "/v1/chat/completions"
to "/chat/completions" to avoid double /v1 in URL construction.

This fixes external vLLM servers (like DeepInfra) which were getting
404 errors due to malformed URLs:
- Before: https://api.deepinfra.com/v1/openai/v1/chat/completions (404)
- After:  https://api.deepinfra.com/v1/openai/chat/completions (works)

Internal vLLM servers still work correctly because the base URL already
includes /v1 (set in pipeline.py:1326):
- Internal: http://localhost:30024/v1/chat/completions ✅
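A quick sketch of the resulting URL construction (the helper is hypothetical; it just illustrates why the shorter endpoint path composes correctly with both base URLs):

```python
# Illustrative: why returning "/chat/completions" (not "/v1/chat/completions")
# from VLLMBackend.get_endpoint_path() avoids the doubled /v1 for external servers.
def build_url(base_url: str, endpoint_path: str) -> str:
    return base_url.rstrip("/") + endpoint_path


# Internal server: the base URL already ends in /v1 (set in pipeline.py)
assert build_url("http://localhost:30024/v1", "/chat/completions") == \
    "http://localhost:30024/v1/chat/completions"

# External server (e.g. DeepInfra): the base URL carries its own prefix
assert build_url("https://api.deepinfra.com/v1/openai", "/chat/completions") == \
    "https://api.deepinfra.com/v1/openai/chat/completions"
```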
Remove global pdf_render_max_workers BoundedSemaphore that was created
at module import time, causing RuntimeError when running batched
extractions with multiple event loops.

Solution:
- Create semaphore in _main_impl() where event loop exists
- Pass through call chain: worker -> process_pdf -> process_page
- Each event loop now gets its own semaphore instance

This fixes: RuntimeError: <BoundedSemaphore> is bound to a different event loop

Fixes batched processing in bharatlex and other multi-loop scenarios.
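A minimal sketch of the pattern (names are illustrative; the real code threads the semaphore through worker -> process_pdf -> process_page as described above):

```python
# Illustrative: create the render semaphore inside the running event loop
# instead of at module import time, so each asyncio.run() gets its own instance
# and no "bound to a different event loop" RuntimeError can occur.
import asyncio


async def process_pdf(pdf_path: str, render_semaphore: asyncio.BoundedSemaphore) -> None:
    async with render_semaphore:   # limits concurrent PDF rendering
        await asyncio.sleep(0)     # placeholder for the real rendering/inference work


async def _main_impl(pdf_paths: list[str], max_render_workers: int = 8) -> None:
    # Created here, inside the event loop, rather than as a module-level global.
    render_semaphore = asyncio.BoundedSemaphore(max_render_workers)
    await asyncio.gather(*(process_pdf(p, render_semaphore) for p in pdf_paths))


# Two separate event loops: each run creates its own semaphore.
asyncio.run(_main_impl(["a.pdf"]))
asyncio.run(_main_impl(["b.pdf"]))
```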
…or improved concurrency handling

- Introduced max_concurrent_work_items parameter in PipelineConfig to control the number of work items processed concurrently, enhancing flexibility for local GPU and external API usage.
- Updated _main_impl to utilize this new parameter, allowing for dynamic semaphore configuration based on user input.
- Added command-line argument support for max_concurrent_work_items to facilitate user configuration.
…nt processing

- Introduced max_tokens parameter in PipelineConfig to specify the maximum tokens generated per page, allowing for better handling of dense documents.
- Updated process_page function to utilize max_tokens from command-line arguments, defaulting to 8000 if not specified.
- Added command-line argument support for max_tokens to improve user configurability.
@jakep-allenai (Collaborator) commented:

I don't think I can merge in a 10k line diff here.

@Systemcluster commented:

> I don't think I can merge in a 10k line diff here.

Considering that 8k of those lines are in uv.lock, do the other changes look reasonable?

@jakep-allenai (Collaborator) commented:

No sorry, it makes too many changes. The overall pipeline.py is currently only 1,400 lines. Tripling the complexity is not something we can take on. You could support a new backend with 0 lines of code by simply giving instructions to launch an OpenAI-API-compatible endpoint, and then pointing the current pipeline to it with the --server argument.
