Add MLX-VLM backend support and improvements #374

Conversation
When using --markdown with absolute PDF paths, markdown files were incorrectly written to the source PDF directory instead of the workspace. This occurred because os.path.join(workspace, "markdown", "/absolute/path") discards the workspace prefix when given an absolute path.
Changes:
- Extract only the parent directory name from absolute paths to make them relative
- Example: /path/to/pdfs/2008/file.pdf -> workspace/markdown/2008/file.md
- Add comprehensive test suite (TestMarkdownPathHandling) with 4 test cases
- Tests cover various path depths, edge cases, and document the original bug
This ensures markdown files are stored in workspace/markdown/ as documented, while preserving the folder structure of input PDFs.
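For illustration, a minimal sketch of the path behavior described in this commit; the variable names and the exact fix shown are illustrative, not the actual diff:

```python
import os

workspace = "./workspace"
pdf_path = "/path/to/pdfs/2008/file.pdf"

# os.path.join drops all earlier components when a later component is absolute,
# so the workspace prefix is silently discarded:
broken = os.path.join(workspace, "markdown", pdf_path)
print(broken)  # /path/to/pdfs/2008/file.pdf

# Fix (as described above): keep only the parent directory name, making the
# path relative before joining it under workspace/markdown/.
parent_dir = os.path.basename(os.path.dirname(pdf_path))            # "2008"
md_name = os.path.splitext(os.path.basename(pdf_path))[0] + ".md"   # "file.md"
fixed = os.path.join(workspace, "markdown", parent_dir, md_name)
print(fixed)  # ./workspace/markdown/2008/file.md
```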
Add a type-safe, composable Python API for running OlmoCR pipeline
programmatically, eliminating the need for subprocess calls.
Key features:
- New PipelineConfig dataclass with 25+ configuration options
- run_pipeline() async function as main programmatic entry point
- Custom prompt support via custom_prompt parameter
- Full backward compatibility - CLI unchanged, delegates to shared impl
- All existing features maintained (retries, batching, server management)
Example usage:
```python
import asyncio
from olmocr import run_pipeline, PipelineConfig
config = PipelineConfig(
    workspace="./workspace",
    pdfs=["doc1.pdf", "doc2.pdf"],
    custom_prompt="Extract text from this legal document...",
    markdown=True,
    workers=10,
)
asyncio.run(run_pipeline(config))
```
Changes:
- Created olmocr/config.py with PipelineConfig dataclass
- Extracted _main_impl() from main() to share logic between CLI and API
- Added run_pipeline() as programmatic entry point
- Added _config_to_args() helper to convert config to argparse.Namespace
- Added custom_prompt parameter to build_page_query()
- Threaded custom prompt through process_page() call stack
- Updated __init__.py to export PipelineConfig and run_pipeline
- Updated test mocks to accept custom_prompt parameter
Backward compatibility:
- CLI interface unchanged - main() delegates to _main_impl()
- All default behaviors preserved
- All existing flags and options work identically
- Custom prompt optional - defaults to original prompt if not provided
Update dependency lock file to reflect current package state.
* feat(backends): add inference backend abstraction layer
Introduce InferenceBackend abstract base class with implementations
for vLLM (NVIDIA GPUs) and MLX-VLM (Apple Silicon). This abstraction
enables olmOCR to support multiple inference backends through a
unified interface.
Key components:
- BackendConfig dataclass for unified backend configuration
- VLLMBackend: OpenAI Chat Completions API, guided decoding support
- MLXVLMBackend: OpenAI Responses API, lazy model loading
- get_backend() factory for backend instantiation
Each backend handles:
- Server process lifecycle management
- Health check endpoints
- Request/response format translation
- Model-specific validation
Breaking changes: None (additive change)
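As a rough sketch of the shape this abstraction could take (the method and class names come from the commit messages in this PR; the request/response details shown follow the OpenAI Chat Completions format and are assumptions, not the actual olmocr/backends.py code):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class BackendConfig:
    # Unified backend configuration; the field set here is illustrative.
    model: str
    port: int
    mlx_quantization: Optional[str] = None
    mlx_kv_bits: Optional[int] = None


class InferenceBackend(ABC):
    """Common interface implemented by VLLMBackend and MLXVLMBackend."""

    def __init__(self, config: BackendConfig) -> None:
        self.config = config

    @abstractmethod
    def get_endpoint_path(self) -> str:
        """Path appended to the server base URL."""

    @abstractmethod
    def build_request(self, prompt: str, image_b64: str) -> Dict[str, Any]:
        """Translate a page query into this backend's request format."""

    @abstractmethod
    def parse_response(self, response: Dict[str, Any]) -> str:
        """Extract the generated text from this backend's response format."""


class VLLMBackend(InferenceBackend):
    def get_endpoint_path(self) -> str:
        return "/v1/chat/completions"

    def build_request(self, prompt: str, image_b64: str) -> Dict[str, Any]:
        # OpenAI Chat Completions request with an inline base64 page image.
        return {
            "model": "olmocr",  # served name; the model is pre-loaded at startup
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        }

    def parse_response(self, response: Dict[str, Any]) -> str:
        return response["choices"][0]["message"]["content"]


def get_backend(name: str, config: BackendConfig) -> InferenceBackend:
    # Factory for backend instantiation; the MLX-VLM implementation is analogous.
    backends = {"vllm": VLLMBackend}
    return backends[name](config)
```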
* feat(pipeline): integrate backend abstraction for multi-backend support
Replace hardcoded vLLM logic with backend-agnostic implementation
using the new InferenceBackend abstraction. This enables seamless
switching between vLLM and MLX-VLM backends.
Key changes:
- Thread backend instance through processing pipeline
- Use backend.build_request() and backend.parse_response()
- Dynamic endpoint paths via backend.get_endpoint_path()
- Backend-aware platform checks (skip CUDA for MLX)
- Fix critical model path handling bug:
* vLLM: Use served name "olmocr" (model pre-loaded at startup)
* MLX-VLM: Use actual model path (lazy loading on first request)
- Add CLI support: --backend, --mlx_quantization, --mlx_kv_bits
- Backend-specific port defaults (vLLM: 30024, MLX: 8000)
CLI additions:
- --backend {vllm,mlx-vlm}: Select inference backend
- --custom_prompt: Override default OCR prompt
- --mlx_quantization: MLX model quantization (4bit, 8bit, etc.)
- --mlx_kv_bits: MLX KV-cache quantization bits
Breaking changes: None (default behavior unchanged)
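For context, a minimal sketch of how the backend-agnostic request flow described above might look; httpx is simply the HTTP client chosen for this sketch, and the function name is hypothetical:

```python
import httpx  # HTTP client chosen for this sketch only


async def query_backend(backend, base_url: str, prompt: str, image_b64: str) -> str:
    # Build the request body in whatever format the selected backend expects.
    payload = backend.build_request(prompt, image_b64)

    # The endpoint path differs per backend (vLLM vs MLX-VLM), so ask the backend.
    url = base_url + backend.get_endpoint_path()

    async with httpx.AsyncClient(timeout=300.0) as client:
        response = await client.post(url, json=payload)
        response.raise_for_status()
        data = response.json()

    # Translate the backend-specific response back into plain page text.
    return backend.parse_response(data)
```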
* feat(config): add backend selection and platform validation
Extend PipelineConfig with backend configuration options and
platform-specific validation for MLX-VLM backend.
New configuration fields:
- backend: str = "vllm" - Select inference backend
- mlx_quantization: Optional[str] - MLX quantization (4bit, 8bit, etc.)
- mlx_kv_bits: Optional[int] - KV-cache quantization bits (1, 2, 4, 8)
Validation:
- Ensure backend is "vllm" or "mlx-vlm"
- MLX-specific checks in __post_init__:
* Verify platform is macOS (Darwin)
* Verify architecture is ARM64/Apple Silicon
* Check mlx-vlm package installation
Provides early, clear error messages when attempting to use
MLX backend on unsupported platforms.
Breaking changes: None (additive with safe defaults)
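A minimal sketch of this kind of __post_init__ validation, assuming the field names from this commit; the error messages and exact checks are illustrative:

```python
import importlib.util
import platform
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineConfig:
    backend: str = "vllm"
    mlx_quantization: Optional[str] = None
    mlx_kv_bits: Optional[int] = None
    # ... remaining pipeline options omitted for brevity ...

    def __post_init__(self) -> None:
        if self.backend not in ("vllm", "mlx-vlm"):
            raise ValueError(f"backend must be 'vllm' or 'mlx-vlm', got {self.backend!r}")

        if self.mlx_kv_bits is not None and self.mlx_kv_bits not in (1, 2, 4, 8):
            raise ValueError("mlx_kv_bits must be one of 1, 2, 4, 8")

        if self.backend == "mlx-vlm":
            # MLX only runs on Apple Silicon Macs, so fail early with a clear error.
            if platform.system() != "Darwin" or platform.machine() != "arm64":
                raise RuntimeError("The mlx-vlm backend requires macOS on Apple Silicon (arm64).")
            if importlib.util.find_spec("mlx_vlm") is None:
                raise RuntimeError("mlx-vlm is not installed; install it with `pip install olmocr[mlx]`.")
```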
* feat(mlx): add model conversion utility for MLX format
Add convert_to_mlx.py utility that wraps mlx_vlm.convert to simplify
converting olmOCR models from HuggingFace to MLX format.
Features:
- Convert models from HuggingFace Hub or local paths
- Support for quantization (4-bit, 8-bit with configurable group size)
- Platform validation (macOS + Apple Silicon only)
- Optional upload to HuggingFace Hub
- Clear usage instructions and progress logging
Command-line interface:
python -m olmocr.convert_to_mlx MODEL --output PATH [--quantize 4]
Usage example:
python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
--output ~/models/olmocr-mlx --quantize 4 --group-size 64
Implementation details:
- Calls mlx_vlm.convert() directly with q_bits and q_group_size
- Default group size: 64 (same as mlx-community models)
- Validates Apple Silicon before attempting conversion
Dependencies: Requires mlx-vlm>=0.3.5 (installed via olmocr[mlx])
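A rough sketch of such a wrapper; the commit only confirms that mlx_vlm.convert() is called with q_bits and q_group_size, so the remaining keyword arguments and the wrapper shown here are assumptions:

```python
import platform

import mlx_vlm  # provided by mlx-vlm>=0.3.5 (pip install olmocr[mlx])


def convert_model(hf_model: str, output_dir: str, q_bits: int = 4, group_size: int = 64) -> None:
    # MLX conversion is only supported on Apple Silicon Macs.
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        raise RuntimeError("Conversion to MLX requires macOS on Apple Silicon (arm64).")

    # Per the commit message, the utility calls mlx_vlm.convert() with q_bits and
    # q_group_size; the other keyword names below are assumptions for this sketch.
    mlx_vlm.convert(
        hf_model,
        mlx_path=output_dir,
        quantize=True,
        q_bits=q_bits,
        q_group_size=group_size,
    )


if __name__ == "__main__":
    convert_model("allenai/olmOCR-2-7B-1025", "~/models/olmocr-mlx", q_bits=4)
```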
* build: upgrade transformers and add MLX optional dependency
Update dependencies to support both vLLM and MLX-VLM backends.
Changes:
- Upgrade transformers: 4.55.2 → 4.57.0+
* Ensures compatibility with latest HuggingFace models
* Required for both training and inference backends
- Add MLX optional dependency group:
* mlx-vlm>=0.3.5 for Apple Silicon inference
* Install with: pip install olmocr[mlx]
- Add CLI entry point:
* olmocr = "olmocr.pipeline:cli"
* Enables `olmocr` command after installation
Breaking changes: None (transformers upgrade is compatible)
* docs: add comprehensive MLX backend guide
Add detailed documentation for using olmOCR with MLX-VLM backend
on Apple Silicon Macs. Integrated into Sphinx documentation site.
Location: docs/source/mlx-backend.md (added to Getting Started section)
Contents:
- Overview of MLX-VLM vs vLLM backends
- System requirements (M1/M2/M3/M4, macOS 12.0+, 16GB+ RAM)
- Installation instructions
- Quick start guide with pre-quantized models
- Configuration options and CLI flags
- Model selection guide (4-bit vs 8-bit quantization)
- Performance optimization tips
- Troubleshooting section
- API differences between vLLM and MLX-VLM
- Current limitations and workarounds
- Performance benchmarks on different Mac models
Key information:
- Default port: 8000 (vs 30024 for vLLM)
- API endpoint: /responses (vs /v1/chat/completions for vLLM)
- No guided decoding support (uses post-validation instead)
- Pre-quantized models available:
* mlx-community/olmOCR-2-7B-1025-mlx-4bit (~2GB)
* mlx-community/olmOCR-2-7B-1025-mlx-8bit (~4GB)
Target audience: Users with Apple Silicon Macs wanting on-device
inference without cloud costs or NVIDIA GPU requirements.
* chore: add workspace/ to .gitignore
Ignore workspace/ directory used for test runs and pipeline output.
Similar to existing localworkspace/* entry.
* docs: update minimum macOS version to 15.0+ (Sequoia)
Update system requirements to require macOS 15.0+ instead of 12.0+.
This reflects the tested and recommended minimum version for MLX-VLM
backend support.
Used around ~16 GB of peak memory with 8-bit quant:
2025-10-30 01:21:23,213 - olmocr.pipeline - INFO - FINAL METRICS SUMMARY
Change VLLMBackend.get_endpoint_path() from "/v1/chat/completions" to "/chat/completions" to avoid a double /v1 in URL construction. This fixes external vLLM servers (like DeepInfra), which were getting 404 errors due to malformed URLs:
- Before: https://api.deepinfra.com/v1/openai/v1/chat/completions (404)
- After: https://api.deepinfra.com/v1/openai/chat/completions (works)
Internal vLLM servers still work correctly because the base URL already includes /v1 (set in pipeline.py:1326):
- Internal: http://localhost:30024/v1/chat/completions ✅
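The fix is easiest to see as plain concatenation of base URL and endpoint path; a small sketch using the URLs from this commit message:

```python
def build_url(base_url: str, endpoint_path: str) -> str:
    # The full request URL is simply the base URL plus the backend endpoint path.
    return base_url.rstrip("/") + endpoint_path


# Internal vLLM server: the base URL already carries /v1 (set in pipeline.py).
assert build_url("http://localhost:30024/v1", "/chat/completions") == \
    "http://localhost:30024/v1/chat/completions"

# External server (e.g. DeepInfra): the old "/v1/chat/completions" path produced
# a doubled /v1 and a 404; "/chat/completions" yields the correct URL.
assert build_url("https://api.deepinfra.com/v1/openai", "/chat/completions") == \
    "https://api.deepinfra.com/v1/openai/chat/completions"
```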
Remove the global pdf_render_max_workers BoundedSemaphore that was created at module import time, causing a RuntimeError when running batched extractions with multiple event loops.
Solution:
- Create the semaphore in _main_impl(), where the event loop exists
- Pass it through the call chain: worker -> process_pdf -> process_page
- Each event loop now gets its own semaphore instance
This fixes: RuntimeError: <BoundedSemaphore> is bound to a different event loop
Fixes batched processing in bharatlex and other multi-loop scenarios.
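A minimal sketch of the pattern behind this fix (function bodies are placeholders): the semaphore is created inside the running event loop in _main_impl() and passed down, rather than at module import time:

```python
import asyncio


async def process_page(sem: asyncio.BoundedSemaphore, page: int) -> None:
    # Rendering concurrency is limited by the semaphore passed in from the loop.
    async with sem:
        await asyncio.sleep(0)  # placeholder for the actual PDF rendering work


async def _main_impl(pages: list[int], max_render_workers: int = 8) -> None:
    # Created here, inside the running loop, so repeated asyncio.run() calls
    # (batched extractions) each get their own semaphore instance.
    pdf_render_sem = asyncio.BoundedSemaphore(max_render_workers)
    await asyncio.gather(*(process_page(pdf_render_sem, p) for p in pages))


# Each asyncio.run() creates a fresh event loop; a module-level semaphore would be
# bound to the first loop and raise RuntimeError on the second run.
asyncio.run(_main_impl([1, 2, 3]))
asyncio.run(_main_impl([4, 5, 6]))
```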
…or improved concurrency handling
- Introduced max_concurrent_work_items parameter in PipelineConfig to control the number of work items processed concurrently, enhancing flexibility for local GPU and external API usage.
- Updated _main_impl to utilize this new parameter, allowing for dynamic semaphore configuration based on user input.
- Added command-line argument support for max_concurrent_work_items to facilitate user configuration.
…nt processing
- Introduced max_tokens parameter in PipelineConfig to specify the maximum tokens generated per page, allowing for better handling of dense documents.
- Updated process_page function to utilize max_tokens from command-line arguments, defaulting to 8000 if not specified.
- Added command-line argument support for max_tokens to improve user configurability.
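For example, the two new options could be combined through the programmatic API like this (a usage sketch; only the field names are taken from the commits above, the values are arbitrary):

```python
import asyncio

from olmocr import PipelineConfig, run_pipeline

config = PipelineConfig(
    workspace="./workspace",
    pdfs=["dense_report.pdf"],
    max_concurrent_work_items=2,  # cap work items processed in parallel
    max_tokens=8000,              # per-page generation budget (default per the commit)
)
asyncio.run(run_pipeline(config))
```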
I don't think I can merge in a 10k line diff here.
Considering that 8k of those lines are in
No sorry, it makes too many changes. The overall pipeline.py is currently only 1,400 lines. Tripling the complexity is not something we can take on. You could support a new backend with 0 lines of code, by simply giving instructions to launch an OpenAI-API-compatible endpoint, and then you can point the current pipeline to it with the
Summary
This PR adds four improvements to olmOCR:
1. Fix Markdown Output Path Bug
Commit: f3198d2
Fixes a bug where markdown output paths were incorrectly generated when using absolute PDF paths. Previously, the path calculation would fail or produce incorrect paths.
Changes:
- olmocr/pipeline.py
- tests/test_pipeline.py
Files changed: 2 files (+94/-2)
2. Add Programmatic Python API
Commit: 2b3530e
Adds a clean programmatic Python API for olmOCR, making it easier to integrate into Python applications without using the CLI.
Features:
- PipelineConfig dataclass for type-safe configuration
- run_pipeline() async function for programmatic use
- olmocr.__init__.py exports for easy imports
Usage:
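(The usage snippet below mirrors the example from the API commit message earlier in this PR; values are illustrative.)

```python
import asyncio

from olmocr import PipelineConfig, run_pipeline

config = PipelineConfig(
    workspace="./workspace",
    pdfs=["doc1.pdf", "doc2.pdf"],
    markdown=True,
)
asyncio.run(run_pipeline(config))
```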
Files changed: 4 files (+283/-68)
3. Add MLX-VLM Backend Support for Apple Silicon
Commit: 3192ed9
Adds MLX-VLM backend support for olmOCR, enabling efficient inference on Apple Silicon Macs (M1/M2/M3/M4) without requiring NVIDIA GPUs or cloud services.
Motivation
Enable on-device inference on Apple Silicon Macs without NVIDIA GPUs or cloud costs.
Changes
Backend Abstraction Layer (olmocr/backends.py) - 458 lines
- InferenceBackend for multi-backend support
- VLLMBackend implementation for NVIDIA GPUs
- MLXVLMBackend implementation for Apple Silicon
Key architectural decision: Each backend handles its own API format differences:
- vLLM: /v1/chat/completions endpoint, pre-loads model with name "olmocr"
- MLX-VLM: /responses endpoint, lazy-loads model using actual path on first request
Pipeline Integration (olmocr/pipeline.py) - +145/-50 lines
Configuration (olmocr/config.py) - +41 lines
- backend field: "vllm" (default) or "mlx-vlm"
- New fields: mlx_quantization, mlx_kv_bits
Model Conversion Utility (olmocr/convert_to_mlx.py) - 218 lines
Usage:
python -m olmocr.convert_to_mlx allenai/olmOCR-2-7B-1025 \
    --output ~/models/olmocr-mlx --quantize 4 --group-size 64
Dependencies (pyproject.toml, uv.lock)
- Optional dependency group: pip install olmocr[mlx]
- New CLI entry point: olmocr command
Documentation (docs/source/mlx-backend.md) - 482 lines
Pre-quantized Models
Ready-to-use models available on HuggingFace:
- mlx-community/olmOCR-2-7B-1025-mlx-4bit
- mlx-community/olmOCR-2-7B-1025-mlx-8bit
Usage Example
API Differences
| | vLLM | MLX-VLM |
|---|---|---|
| Endpoint | /v1/chat/completions | /responses |
| Response text field | choices[0].message.content | output[0].content[0].text |

Limitations
Testing
Tested on:
Files changed: 9 files (+1414/-81)
Overall Changes
Total: 16 files changed, +9484 insertions, -151 deletions
Breaking Changes
None - all changes are additive with safe defaults. Existing vLLM usage continues to work unchanged.
Testing
All changes have been tested locally. New test coverage added for:
Related Issues