High-fidelity document-to-Markdown conversion pipeline for Claude Code
Convert PDF, DOCX, and PPTX files to structured Markdown with image extraction, multi-stage quality control, and LLM-ready image analysis preparation.
Quick Start | Architecture | Usage | Claude Code Integration | Configuration
Anthropic's copyright filter blocks most direct PDF reads in Claude Code. Even when reads succeed, raw PDF parsing loses tables, headings, and images. This pipeline solves both problems:
- Zero-token Python tier extracts text and images with full structural fidelity
- Optional LLM tier generates expert image descriptions using 8 specialist personas
- Multi-stage QC catches table collapse, heading hierarchy errors, and missing content
- SHA-256 registry tracks every conversion, preventing duplicate work
The result: Markdown files that Claude Code can read, reason about, and reference with full access to every word, table, heading, and figure from the source document.
| Feature | Description |
|---|---|
| Unified router | Single entry point handles PDF, DOCX, PPTX, and TXT |
| Multi-extractor PDF | pymupdf4llm (default), pdfplumber (cross-validation), MinerU (complex layouts) |
| Office conversion | DOCX via pandoc + python-docx; PPTX via python-pptx with recursive group shape extraction |
| Chart rendering | LibreOffice → PDF → pdftoppm at 300 DPI for SmartArt and embedded charts |
| Image deduplication | SHA-256 hashing skips duplicate images across pages |
| Blank detection | 3-tier detection: file size, pixel statistics, near-black analysis |
| Per-image classification | 8-heuristic chain classifies each image as substantive or decorative |
| Vector content detection | pymupdf get_drawings() identifies diagrams, SmartArt, shape-based figures |
| Structural QC engine | Automated checks for table collapse, heading hierarchy, YAML metadata, encoding errors |
| Persona activation matrix | Maps 24+ image types to 8 expert personas for targeted LLM analysis |
| Conversion registry | JSON registry with SHA-256 hashes, image metadata, conversion timestamps |
| Image indexing | Per-file and project-level testable image indexes |
| Claude Code hook | Enforces "never read raw PDF" policy at the tool level |
| MinerU fallback | Auto-switches to MinerU when cross-validation failure rate exceeds 40% |
| DOCX table styling | Professional styling for pandoc-generated Word documents |
┌─────────────────────────────┐
│ run-pipeline.py │
│ (unified router) │
└──────────┬──────────────────┘
│
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────┐
│ PDF │ │ DOCX │ │ PPTX │
└────┬─────┘ └──────┬───────┘ └────┬─────┘
│ │ │
═══════════════════════════════════════════════════════════════════
TIER 1: Python (zero LLM tokens)
═══════════════════════════════════════════════════════════════════
│ │ │
▼ ▼ ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ convert-paper │ │ convert-office │ │ convert-office │
│ pymupdf4llm │ │ pandoc+docx │ │ python-pptx │
└────────┬───────┘ └────────┬───────┘ └────────┬───────┘
│ │ │
         └───────────────────┼────────────────────┘
│
┌───────────▼───────────┐
│ Step 1b: Cross-Val │ (PDF only: pdfplumber)
│ Step 2: Structural │ (QC gate — must PASS)
│ Step 3: Image Prep │ (persona activation)
│ Step 6c: Image Index │ (SUB/DEC classification)
└───────────┬───────────┘
│
═══════════════════════════════════════════════════════════════════
TIER 2: Claude (LLM — optional, manual)
═══════════════════════════════════════════════════════════════════
│
┌───────────▼───────────┐
│ Step 4: IMAGE NOTEs │ (8 expert personas)
│ Step 5: Content QC │ (fidelity check)
│ Step 6: Final Review │ (human-in-the-loop)
└───────────────────────┘
Tier 1 runs entirely via Python. No API calls, no tokens consumed. It extracts text, images, and metadata; runs structural QC; and prepares the analysis manifest that tells Tier 2 which expert personas should examine each image.
Tier 2 is optional and uses Claude's vision capabilities to generate multi-expert image descriptions. This tier is invoked manually through Claude Code's agent system or the included skill definition.
# Core dependencies
pip install pymupdf pymupdf4llm pdfplumber python-docx python-pptx Pillow numpy
# Pandoc (required for DOCX text extraction)
brew install pandoc # macOS
sudo apt install pandoc # Ubuntu/Debian
# Optional: LibreOffice (for chart/SmartArt rendering in PPTX)
brew install --cask libreoffice # macOS
# Optional: MinerU (for complex/scanned PDFs)
# See https://github.com/opendatalab/MinerU for installation

git clone https://github.com/orangefineblue/doc2md.git
cd doc2md
# Copy scripts to your preferred location
cp scripts/*.py ~/.local/bin/   # or any directory in your PATH

# Convert a PDF
python3 scripts/run-pipeline.py paper.pdf -o paper.md -i images/
# Convert a DOCX
python3 scripts/run-pipeline.py report.docx -o report.md
# Convert a PPTX
python3 scripts/run-pipeline.py slides.pptx -o slides.md
# Convert with organized output directory
python3 scripts/run-pipeline.py paper.pdf --target-dir ./converted/

The unified router (run-pipeline.py) auto-detects the file format and selects the appropriate extractor:
python3 run-pipeline.py <input-file> [options]

| Option | Description |
|---|---|
| `-o, --output` | Output markdown file path |
| `-i, --images` | Image output directory |
| `-s, --short-name` | Short name for file references |
| `--target-dir` | Organized output directory (moves source to `_originals/`) |
| `--force-extractor` | Override extractor selection (`pymupdf4llm`, `tesseract`, `mineru`) |
| `--skip-xval` | Skip cross-validation step |
| `--dry-run` | Test without moving files |
| `--generate-testable-index` | Generate project-level image index |
# Standard (pymupdf4llm + pdfplumber cross-validation)
python3 run-pipeline.py paper.pdf -o paper.md -i paper_images/
# Force MinerU for complex layouts
python3 run-pipeline.py scanned-doc.pdf --force-extractor mineru -o output.md
# Skip cross-validation for faster processing
python3 run-pipeline.py simple.pdf -o simple.md --skip-xval

Extractor selection logic:
| Document Type | Default Extractor | Fallback Chain |
|---|---|---|
| Digital PDF (>50 chars/page avg) | pymupdf4llm | markitdown → calibre |
| Scanned PDF (<50 chars/page avg) | tesseract | mineru → zerox |
| Complex PDF (>40% cross-val failures) | Auto-switches to MinerU | — |
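The routing heuristic in the table above can be sketched as follows. This is a minimal illustration, not run-pipeline.py's actual API: the function name is invented, and only the 50 chars/page and 40% thresholds come from the table.

```python
def select_extractor(chars_per_page, xval_failure_rate=0.0):
    """Pick an extractor from average text density, mirroring the table above."""
    if xval_failure_rate > 0.40:   # complex layout: cross-validation mostly failing
        return "mineru"
    if chars_per_page < 50:        # scanned PDF: little embedded text
        return "tesseract"
    return "pymupdf4llm"           # digital PDF default

print(select_extractor(812.0))   # digital paper -> pymupdf4llm
print(select_extractor(12.5))    # scanned document -> tesseract
```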
# Standard (pandoc for text, python-docx for images)
python3 run-pipeline.py report.docx -o report.md
# With organized output
python3 run-pipeline.py report.docx --target-dir ./reports/

# Standard (python-pptx with recursive group shape extraction)
python3 run-pipeline.py deck.pptx -o deck.md
# Charts and SmartArt are rendered via LibreOffice when available
python3 run-pipeline.py charts.pptx --target-dir ./presentations/

XLSX files use a lightweight text-only path (no image pipeline):
# Via markitdown (recommended)
pip install markitdown
markitdown spreadsheet.xlsx > spreadsheet.md

When you specify --target-dir, the pipeline organizes all output:
target-dir/
paper.md # Converted markdown
paper_images/ # Extracted images
paper_manifest.json # Image manifest with metadata
paper_image-index.md # Image classification index
_originals/ # Source files moved here
paper.pdf
PIPELINE-REPORT.md # Visual conversion report
ISSUE-LOG.md # Tracked issues (appended per conversion)
The full pipeline runs these steps in sequence:
| Step | Name | Tool | Description |
|---|---|---|---|
| 0 | Extractor Router | Python | Detect format, measure text density, select extractor |
| 1 | Text + Image Extraction | Python | Run selected extractor (pymupdf4llm, convert-office, etc.) |
| 1b | Cross-Validation | Python | Compare extraction against pdfplumber (PDF only) |
| 1c | Early Image Index | Python | Pre-QC image index for MinerU output |
| 2 | Structural QC | Python | GATE — must PASS before proceeding |
| 3 | Image Analysis Prep | Python | Persona activation matrix, analysis manifest |
| 4 | IMAGE NOTEs | Claude | Multi-expert image descriptions (8 personas) |
| 5 | Content Fidelity QC | Claude | Verify no text was lost in conversion |
| 6a | Number Extraction | Python | Extract numerical data (PDF only) |
| 6c | Image Index | Python | Per-image SUB/DEC classification with 8 heuristics |
| 7-13 | File Organization | Python | Move, rename, registry update, visual report |
Steps 0-3, 6a, and 6c run automatically. Steps 4-5 require Claude Code (Tier 2).
Each extracted image passes through an 8-heuristic classification chain:
- Blank detection — 3-tier: file size (<2KB), pixel statistics, near-black analysis
- Dimension check — Minimum size thresholds
- Aspect ratio — Extreme ratios suggest decorative elements (banners, rules)
- Journal branding — Small logos, publisher marks
- Color block detection — Solid/near-solid color fills
- Low-density badge — Small images with minimal visual information
- Page position heuristics — Header/footer regions
- Vector content detection — pymupdf `get_drawings()` count + area analysis
Images classified as substantive (SUB) proceed to Tier 2 analysis. Images classified as decorative (DEC) are skipped, saving LLM tokens.
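As a rough sketch, the first two blank-detection tiers might look like this. Thresholds and the function name are illustrative; the real pipeline also runs the near-black analysis on decoded image data via Pillow/NumPy.

```python
def looks_blank(file_size, pixels, min_bytes=2048, stddev_floor=2.0):
    """Tier 1: tiny files are blank. Tier 2: near-zero pixel variance is blank."""
    if file_size < min_bytes:                  # tier 1: file-size check (<2KB)
        return True
    n = len(pixels)
    mean = sum(pixels) / n
    std = (sum((p - mean) ** 2 for p in pixels) / n) ** 0.5
    return std < stddev_floor                  # tier 2: flat pixel statistics

print(looks_blank(1024, [255] * 100))      # tiny file
print(looks_blank(50_000, [255] * 100))    # uniform white page
print(looks_blank(50_000, [0, 255] * 50))  # high-contrast content
```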
For substantive images, the pipeline maps each image type to relevant expert personas:
| Image Type | Always Active | Conditionally Active |
|---|---|---|
| Kaplan-Meier | Statistician, Viz Critic | Clinical Trialist, Epidemiologist, Health Economist |
| Forest Plot | Statistician, Viz Critic | Clinical Trialist, Regulatory Analyst |
| Tornado Diagram | Health Economist, Statistician, Viz Critic | Regulatory Analyst |
| Decision Tree | Model Architect, Health Economist | Clinical Trialist, Regulatory Analyst |
| Flow Chart | Viz Critic | Clinical Trialist (CONSORT), Regulatory (PRISMA), Model Architect |
| Scatter Plot | Statistician, Viz Critic | Health Economist (CE plane), Epidemiologist |
The full matrix covers 24+ image types across 8 personas. The prepare-image-analysis.py
script generates an analysis-manifest.json with per-image persona assignments, template
skeletons, and section context.
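A minimal sketch of the activation lookup, using two rows of the matrix above. The data structure and function names are illustrative, not prepare-image-analysis.py's actual API.

```python
# Excerpt of the persona activation matrix (illustrative structure).
PERSONA_MATRIX = {
    "kaplan_meier": (["Statistician", "Viz Critic"],
                     ["Clinical Trialist", "Epidemiologist", "Health Economist"]),
    "forest_plot":  (["Statistician", "Viz Critic"],
                     ["Clinical Trialist", "Regulatory Analyst"]),
}

def activate_personas(image_type, context_flags=()):
    """Always-active personas plus any conditional ones enabled by context."""
    always, conditional = PERSONA_MATRIX.get(image_type, (["Viz Critic"], []))
    return always + [p for p in conditional if p in context_flags]

print(activate_personas("kaplan_meier", context_flags={"Clinical Trialist"}))
```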
qc-structural.py runs automated quality checks and acts as a gate — the pipeline
stops if QC fails.
- YAML header validation — Required fields: `source_file`, `conversion_date`, `conversion_tool`, `fidelity_standard`, `document_type`
- Section/heading count — Detects missing or collapsed sections
- Table column consistency — Flags tables with inconsistent column counts
- Table collapse detection — Detects multi-column tables collapsed into fewer cells (numeric density heuristic)
- Reference numbering — Validates `[1]`-`[N]` sequential references
- Encoding errors — Catches mojibake and broken Unicode
- Image index completeness — Cross-references manifest against extracted files
- Manifest consistency — Validates manifest JSON against image index table
- Markdown syntax — Checks for common formatting errors
| Code | Meaning | Pipeline Action |
|---|---|---|
| 0 | PASS | Continue to next step |
| 1 | FAIL | Pipeline stops — fix required |
| 2 | WARN | Fix and rerun (do not proceed on WARN) |
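A caller can wire the gate off these exit codes; here is a sketch (the subprocess invocation, script path, and helper names are illustrative — only the 0/1/2 contract comes from the table above):

```python
import subprocess

def gate_action(returncode):
    """Map qc-structural.py exit codes (table above) to a pipeline action."""
    return {0: "continue", 1: "stop", 2: "fix-and-rerun"}.get(returncode, "stop")

def run_qc_gate(md_path):
    """Run structural QC and return True only on PASS."""
    result = subprocess.run(["python3", "qc-structural.py", md_path])
    action = gate_action(result.returncode)
    if action != "continue":
        print(f"QC gate: {action}")  # both FAIL and WARN block the pipeline
    return action == "continue"
```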
The included hook intercepts Read tool calls in Claude Code and redirects
PDF/DOCX/PPTX reads to their converted Markdown equivalents.
Setup:
- Copy the hook script:
cp hooks/enforce-pdf-conversion.sh ~/.claude/hooks/
chmod +x ~/.claude/hooks/enforce-pdf-conversion.sh

- Register in ~/.claude/settings.json:
{
"hooks": {
"PreToolUse": [
{
"matcher": "Read",
"hooks": [
{
"type": "command",
"command": "~/.claude/hooks/enforce-pdf-conversion.sh"
}
]
}
]
}
}

How it works:
- Hook intercepts every `Read` tool call
- If the file is a PDF/DOCX/PPTX:
  - Computes SHA-256 hash
  - Looks up the hash in the conversion registry
  - If found: redirects to the registered `.md` file
  - If not found: checks for a co-located `.md` (same directory, same name)
  - If no `.md` exists: blocks the read and prints the conversion command
- All other file types pass through unchanged
- Every interception is logged to ~/.claude/pipeline/hook-interceptions.log
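In Python terms, the hook's decision logic looks roughly like this. The real hook is a bash script that uses jq; the function name and the registry shape here are illustrative assumptions.

```python
import hashlib
from pathlib import Path

def resolve_read(path, registry):
    """Return the path Claude should read instead, or None to block the read."""
    p = Path(path)
    if p.suffix.lower() not in {".pdf", ".docx", ".pptx"}:
        return str(p)                                  # other types pass through
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    for entry in registry:                             # registry lookup by hash
        if entry["sha256"] == digest:
            return entry["output_md"]
    sibling = p.with_suffix(".md")                     # co-located .md fallback
    return str(sibling) if sibling.exists() else None  # None => block the read
```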
The included SKILL.md defines a Claude Code skill that orchestrates the
complete pipeline with step-by-step instructions for both tiers.
Setup:
cp skill/SKILL.md ~/.claude/skills/convert-documents/SKILL.md

The skill provides:
- Quick-start commands for each format
- Step-by-step orchestration instructions
- Expert persona reference table
- QC loop enforcement (fix and rerun until zero issues)
- Image analysis prompt templates
The pipeline maintains a JSON registry at ~/.claude/pipeline/conversion_registry.json.
Each entry records:
{
"sha256": "a1b2c3...",
"source_file": "/path/to/original.pdf",
"output_md": "/path/to/converted.md",
"pipeline_version": "3.2.0",
"extractor": "pymupdf4llm",
"conversion_date": "2025-01-15T10:30:00Z",
"pages": 47,
"image_index_path": "/path/to/image-index.md",
"total_images_detected": 30,
"substantive_images": 22,
"has_testable_images": true
}

The registry enables:
- Deduplication — Same file (by hash) is never converted twice
- Hook lookup — The Claude Code hook finds converted `.md` by hash
- Audit trail — Full provenance for every conversion
For cases where automatic classification is wrong, create an
image-index-overrides.json alongside the image index:
{
"page_5_img_3.png": {
"classification": "SUB",
"reason": "Manual override: contains relevant diagram"
}
}

The pipeline applies overrides during image index generation (Step 6c).
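The override merge can be sketched as follows. The merge semantics are inferred from the description above, and the function name is illustrative.

```python
import json
from pathlib import Path

def apply_overrides(classifications, overrides_path):
    """Replace automatic SUB/DEC labels with any manual overrides on disk."""
    path = Path(overrides_path)
    if path.exists():
        for name, override in json.loads(path.read_text()).items():
            classifications[name] = override["classification"]
    return classifications
```

Automatic labels pass through untouched when no overrides file exists, so the step is safe to run unconditionally.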
Every converted Markdown file includes a YAML frontmatter header:
---
source_file: paper.pdf
source_format: pdf
conversion_date: "2025-01-15T10:30:00Z"
conversion_tool: pymupdf4llm
pipeline_version: "3.2.0"
fidelity_standard: zero_missing_text
document_type: academic_paper
pages: 47
domain: health_economics
---

| File | Lines | Description |
|---|---|---|
| `scripts/run-pipeline.py` | 7,749 | Unified pipeline router, image classification, file organization, registry management |
| `scripts/convert-paper.py` | 1,329 | PDF text/image extraction via pymupdf4llm, multi-panel splitting, sparse page rendering |
| `scripts/convert-office.py` | 3,122 | DOCX/PPTX conversion, recursive shape extraction, chart rendering, PUA Unicode mapping |
| `scripts/qc-structural.py` | 1,211 | Structural QC engine: YAML validation, table collapse detection, encoding checks |
| `scripts/prepare-image-analysis.py` | 634 | Persona activation matrix, analysis manifest generation, template skeletons |
| `scripts/convert-mineru.py` | 228 | MinerU fallback wrapper for complex/scanned PDFs (CPU-only) |
| `scripts/style-docx-tables.py` | 262 | Professional DOCX styling for pandoc output (table colors, borders, code blocks) |
| `hooks/enforce-pdf-conversion.sh` | 276 | Claude Code PreToolUse hook: intercepts PDF/Office reads, redirects to Markdown |
| `skill/SKILL.md` | 1,354 | Claude Code skill definition: full pipeline orchestration with QC loops |
Total: ~16,165 lines
| Package | Purpose |
|---|---|
| PyMuPDF (fitz) | PDF parsing, image extraction, vector detection |
| pymupdf4llm | LLM-optimized Markdown extraction from PDF |
| pdfplumber | Cross-validation of PDF extraction |
| python-docx | DOCX image extraction and styling |
| python-pptx | PPTX text and image extraction |
| Pillow | Image processing, blank detection, format conversion |
| NumPy | Pixel-level image analysis (near-black detection) |
| Pandoc | DOCX text extraction to Markdown |
| jq | JSON processing in the hook script |
| Package | Purpose |
|---|---|
| LibreOffice | Chart/SmartArt rendering (PPTX) |
| MinerU | Complex/scanned PDF fallback extractor |
| Tesseract | OCR for scanned documents |
| MarkItDown | XLSX and fallback PDF conversion |
Python 3.8+ is required. The codebase uses dataclasses, typing.Literal,
and pathlib features available from Python 3.8 onward.
# Full pipeline with organized output
python3 scripts/run-pipeline.py \
~/papers/smith-2024-cost-effectiveness.pdf \
--target-dir ~/converted/smith-2024/
# Output structure:
# ~/converted/smith-2024/
# smith-2024-cost-effectiveness.md
# smith-2024-cost-effectiveness_images/
# smith-2024-cost-effectiveness_manifest.json
# smith-2024-cost-effectiveness_image-index.md
# _originals/smith-2024-cost-effectiveness.pdf
# PIPELINE-REPORT.md

# Charts are rendered via LibreOffice at 300 DPI
python3 scripts/run-pipeline.py \
~/presentations/quarterly-review.pptx \
--target-dir ~/converted/quarterly/
# SmartArt and charts appear as high-resolution PNG images
# in the _images/ directory with type_guess="chart" or "diagram"

# Convert all PDFs in a directory
for f in ~/papers/*.pdf; do
python3 scripts/run-pipeline.py "$f" \
--target-dir ~/converted/ \
--skip-xval
done
# Generate project-level image index
python3 scripts/run-pipeline.py --generate-testable-index ~/converted/

python3 scripts/run-pipeline.py paper.pdf \
--target-dir ~/converted/ \
  --dry-run

| Issue | Cause | Fix |
|---|---|---|
| `FAIL: No YAML header block found` | Extractor produced malformed output | Check the source file is valid; try `--force-extractor mineru` |
| Step 2 WARN: table collapse | Multi-column tables lost columns in conversion | QC inserts HTML WARNING comments; fix manually or re-extract |
| MinerU fallback triggered | >40% of pages failed cross-validation | Expected for complex layouts; MinerU handles these better |
| `ValueError: min() iterable argument is empty` | pymupdf4llm bug on certain table layouts | Fixed by disabling layout mode; should not recur |
| Hook blocks PDF read | No converted `.md` found | Run the pipeline first: `python3 run-pipeline.py <file>` |
| Near-black images not detected | Anti-aliased rendering creates subtle gradients | Blank detection includes a pixel-percentage pass that catches these |
| Script | Code | Meaning |
|---|---|---|
| `run-pipeline.py` | 0 | Success |
| `run-pipeline.py` | 1 | General failure |
| `run-pipeline.py` | 3 | Extractor crash (pymupdf4llm) |
| `qc-structural.py` | 0 | QC PASS |
| `qc-structural.py` | 1 | QC FAIL |
| `qc-structural.py` | 2 | QC WARN |
| `convert-mineru.py` | 1 | MinerU not installed |
| `convert-mineru.py` | 2 | Conversion failed |
Why not just use markitdown?
MarkItDown is excellent for simple documents but loses table structure,
heading hierarchy, and images in complex PDFs. This pipeline uses
pymupdf4llm for superior table and multi-column support, with pdfplumber
cross-validation to catch extraction errors.
Why a 2-tier architecture? LLM tokens are expensive. The Python tier handles everything that can be done deterministically (text extraction, image classification, QC) at zero token cost. The LLM tier is reserved for tasks that genuinely require visual understanding (image descriptions) or natural language judgement (content fidelity verification).
Why 8 expert personas? A single "describe this image" prompt produces generic descriptions. Domain-specific personas (e.g., a Statistician analyzing a Kaplan-Meier curve) produce descriptions that capture methodologically relevant details like confidence intervals, at-risk tables, and crossing hazard curves.
Why SHA-256 everywhere? File names change. File contents don't. Hash-based deduplication and registry lookup means the pipeline never re-converts a document it has already processed, even if the file is moved, renamed, or copied to a different directory.
Contributions are welcome. Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Run the QC checks on any modified scripts
- Submit a pull request with a clear description of changes
git clone https://github.com/orangefineblue/doc2md.git
cd doc2md
pip install -r requirements.txt # when available
# Run structural QC on a test conversion
python3 scripts/run-pipeline.py tests/fixtures/sample.pdf -o /tmp/test.md
python3 scripts/qc-structural.py /tmp/test.md --verbose

When reporting a bug, please include:
- The source file format (PDF/DOCX/PPTX)
- The extractor used (check pipeline output)
- The full error message or QC failure output
- Python version (`python3 --version`)
MIT License. See LICENSE for details.
- claude-code-orchestration-protocol — A zero-read orchestrator protocol for Claude Code that manages context window usage, delegates work to sub-agents, and runs QC loops until zero issues remain. Designed to work alongside doc2md for complex multi-document workflows where context rot is a concern.
Built for use with Claude Code by Anthropic. Uses PyMuPDF, pdfplumber, MinerU, and Pandoc for document processing.
doc2md is designed for researchers, analysts, and anyone who needs high-fidelity document conversion in LLM-powered workflows.