doc2md

Summary

World-class README for the doc2md open-source document conversion pipeline. Covers architecture, installation, usage, Claude Code integration, and all component files.

doc2md

High-fidelity document-to-Markdown conversion pipeline for Claude Code

Convert PDF, DOCX, and PPTX files to structured Markdown with image extraction, multi-stage quality control, and LLM-ready image analysis preparation.

Quick Start | Architecture | Usage | Claude Code Integration | Configuration

Why doc2md?

Anthropic's copyright filter blocks most direct PDF reads in Claude Code. Even when reads succeed, raw PDF parsing loses tables, headings, and images. This pipeline solves both problems:

Zero-token Python tier extracts text and images with full structural fidelity
Optional LLM tier generates expert image descriptions using 8 specialist personas
Multi-stage QC catches table collapse, heading hierarchy errors, and missing content
SHA-256 registry tracks every conversion, preventing duplicate work

The result: Markdown files that Claude Code can read, reason about, and reference with full access to every word, table, heading, and figure from the source document.

Features

Feature	Description
Unified router	Single entry point handles PDF, DOCX, PPTX, and TXT
Multi-extractor PDF	pymupdf4llm (default), pdfplumber (cross-validation), MinerU (complex layouts)
Office conversion	DOCX via pandoc + python-docx; PPTX via python-pptx with recursive group shape extraction
Chart rendering	LibreOffice → PDF → pdftoppm at 300 DPI for SmartArt and embedded charts
Image deduplication	SHA-256 hashing skips duplicate images across pages
Blank detection	3-tier detection: file size, pixel statistics, near-black analysis
Per-image classification	8-heuristic chain classifies each image as substantive or decorative
Vector content detection	pymupdf `get_drawings()` identifies diagrams, SmartArt, shape-based figures
Structural QC engine	Automated checks for table collapse, heading hierarchy, YAML metadata, encoding errors
Persona activation matrix	Maps 24+ image types to 8 expert personas for targeted LLM analysis
Conversion registry	JSON registry with SHA-256 hashes, image metadata, conversion timestamps
Image indexing	Per-file and project-level testable image indexes
Claude Code hook	Enforces "never read raw PDF" policy at the tool level
MinerU fallback	Auto-switches to MinerU when cross-validation failure rate exceeds 40%
DOCX table styling	Professional styling for pandoc-generated Word documents

Architecture

                              ┌─────────────────────────────┐
                              │       run-pipeline.py        │
                              │      (unified router)        │
                              └──────────┬──────────────────┘
                                         │
                    ┌────────────────────┼────────────────────┐
                    ▼                    ▼                    ▼
              ┌──────────┐      ┌──────────────┐      ┌──────────┐
              │   PDF     │      │    DOCX      │      │   PPTX   │
              └────┬─────┘      └──────┬───────┘      └────┬─────┘
                   │                   │                    │
 ═══════════════════════════════════════════════════════════════════
  TIER 1: Python (zero LLM tokens)
 ═══════════════════════════════════════════════════════════════════
                   │                   │                    │
                   ▼                   ▼                    ▼
          ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
          │ convert-paper  │  │ convert-office  │  │ convert-office  │
          │  pymupdf4llm   │  │  pandoc+docx    │  │  python-pptx   │
          └────────┬───────┘  └────────┬───────┘  └────────┬───────┘
                   │                   │                    │
                   └───────────┬───────┘────────────────────┘
                               │
                   ┌───────────▼───────────┐
                   │  Step 1b: Cross-Val   │  (PDF only: pdfplumber)
                   │  Step 2:  Structural  │  (QC gate — must PASS)
                   │  Step 3:  Image Prep  │  (persona activation)
                   │  Step 6c: Image Index │  (SUB/DEC classification)
                   └───────────┬───────────┘
                               │
 ═══════════════════════════════════════════════════════════════════
  TIER 2: Claude (LLM — optional, manual)
 ═══════════════════════════════════════════════════════════════════
                               │
                   ┌───────────▼───────────┐
                   │  Step 4: IMAGE NOTEs  │  (8 expert personas)
                   │  Step 5: Content QC   │  (fidelity check)
                   │  Step 6: Final Review │  (human-in-the-loop)
                   └───────────────────────┘

Tier 1 runs entirely via Python. No API calls, no tokens consumed. It extracts text, images, and metadata; runs structural QC; and prepares the analysis manifest that tells Tier 2 which expert personas should examine each image.

Tier 2 is optional and uses Claude's vision capabilities to generate multi-expert image descriptions. This tier is invoked manually through Claude Code's agent system or the included skill definition.

Quick Start

Prerequisites

# Core dependencies
pip install pymupdf pymupdf4llm pdfplumber python-docx python-pptx Pillow numpy

# Pandoc (required for DOCX text extraction)
brew install pandoc        # macOS
sudo apt install pandoc    # Ubuntu/Debian

# Optional: LibreOffice (for chart/SmartArt rendering in PPTX)
brew install --cask libreoffice    # macOS

# Optional: MinerU (for complex/scanned PDFs)
# See https://github.com/opendatalab/MinerU for installation

Install

git clone https://github.com/orangefineblue/doc2md.git
cd doc2md

# Copy scripts to your preferred location
cp scripts/*.py ~/.local/bin/    # or any directory in your PATH

Run

# Convert a PDF
python3 scripts/run-pipeline.py paper.pdf -o paper.md -i images/

# Convert a DOCX
python3 scripts/run-pipeline.py report.docx -o report.md

# Convert a PPTX
python3 scripts/run-pipeline.py slides.pptx -o slides.md

# Convert with organized output directory
python3 scripts/run-pipeline.py paper.pdf --target-dir ./converted/

Usage

Basic Conversion

The unified router (run-pipeline.py) auto-detects file format and selects the appropriate extractor:

python3 run-pipeline.py <input-file> [options]

Option	Description
`-o`, `--output`	Output markdown file path
`-i`, `--images`	Image output directory
`-s`, `--short-name`	Short name for file references
`--target-dir`	Organized output directory (moves source to `_originals/`)
`--force-extractor`	Override extractor selection (pymupdf4llm, tesseract, mineru)
`--skip-xval`	Skip cross-validation step
`--dry-run`	Test without moving files
`--generate-testable-index`	Generate project-level image index

PDF Conversion

# Standard (pymupdf4llm + pdfplumber cross-validation)
python3 run-pipeline.py paper.pdf -o paper.md -i paper_images/

# Force MinerU for complex layouts
python3 run-pipeline.py scanned-doc.pdf --force-extractor mineru -o output.md

# Skip cross-validation for faster processing
python3 run-pipeline.py simple.pdf -o simple.md --skip-xval

Extractor selection logic:

Document Type	Default Extractor	Fallback Chain
Digital PDF (>50 chars/page avg)	pymupdf4llm	markitdown → calibre
Scanned PDF (<50 chars/page avg)	tesseract	mineru → zerox
Complex PDF (>40% cross-val failures)	Auto-switches to MinerU	—

DOCX Conversion

# Standard (pandoc for text, python-docx for images)
python3 run-pipeline.py report.docx -o report.md

# With organized output
python3 run-pipeline.py report.docx --target-dir ./reports/

PPTX Conversion

# Standard (python-pptx with recursive group shape extraction)
python3 run-pipeline.py deck.pptx -o deck.md

# Charts and SmartArt are rendered via LibreOffice when available
python3 run-pipeline.py charts.pptx --target-dir ./presentations/

XLSX Conversion

XLSX files use a lightweight text-only path (no image pipeline):

# Via markitdown (recommended)
pip install markitdown
markitdown spreadsheet.xlsx > spreadsheet.md

Organized Output (`--target-dir`)

When you specify --target-dir, the pipeline organizes all output:

target-dir/
  paper.md                    # Converted markdown
  paper_images/               # Extracted images
  paper_manifest.json         # Image manifest with metadata
  paper_image-index.md        # Image classification index
  _originals/                 # Source files moved here
    paper.pdf
  PIPELINE-REPORT.md          # Visual conversion report
  ISSUE-LOG.md                # Tracked issues (appended per conversion)

Pipeline Steps

The full pipeline runs these steps in sequence:

Step	Name	Tool	Description
0	Extractor Router	Python	Detect format, measure text density, select extractor
1	Text + Image Extraction	Python	Run selected extractor (pymupdf4llm, convert-office, etc.)
1b	Cross-Validation	Python	Compare extraction against pdfplumber (PDF only)
1c	Early Image Index	Python	Pre-QC image index for MinerU output
2	Structural QC	Python	GATE — must PASS before proceeding
3	Image Analysis Prep	Python	Persona activation matrix, analysis manifest
4	IMAGE NOTEs	Claude	Multi-expert image descriptions (8 personas)
5	Content Fidelity QC	Claude	Verify no text was lost in conversion
6a	Number Extraction	Python	Extract numerical data (PDF only)
6c	Image Index	Python	Per-image SUB/DEC classification with 8 heuristics
7-13	File Organization	Python	Move, rename, registry update, visual report

Steps 0-3 and 6 run automatically. Steps 4-5 require Claude Code (Tier 2).

Image Classification

Each extracted image passes through an 8-heuristic classification chain:

Blank detection — 3-tier: file size (<2KB), pixel statistics, near-black analysis
Dimension check — Minimum size thresholds
Aspect ratio — Extreme ratios suggest decorative elements (banners, rules)
Journal branding — Small logos, publisher marks
Color block detection — Solid/near-solid color fills
Low-density badge — Small images with minimal visual information
Page position heuristics — Header/footer regions
Vector content detection — pymupdf get_drawings() count + area analysis

Images classified as substantive (SUB) proceed to Tier 2 analysis. Images classified as decorative (DEC) are skipped, saving LLM tokens.

Persona Activation Matrix

For substantive images, the pipeline maps each image type to relevant expert personas:

Image Type	Always Active	Conditionally Active
Kaplan-Meier	Statistician, Viz Critic	Clinical Trialist, Epidemiologist, Health Economist
Forest Plot	Statistician, Viz Critic	Clinical Trialist, Regulatory Analyst
Tornado Diagram	Health Economist, Statistician, Viz Critic	Regulatory Analyst
Decision Tree	Model Architect, Health Economist	Clinical Trialist, Regulatory Analyst
Flow Chart	Viz Critic	Clinical Trialist (CONSORT), Regulatory (PRISMA), Model Architect
Scatter Plot	Statistician, Viz Critic	Health Economist (CE plane), Epidemiologist

The full matrix covers 24+ image types across 8 personas. The prepare-image-analysis.py script generates an analysis-manifest.json with per-image persona assignments, template skeletons, and section context.

Structural QC Engine

qc-structural.py runs automated quality checks and acts as a gate — the pipeline stops if QC fails.

Checks Performed

YAML header validation — Required fields: source_file, conversion_date, conversion_tool, fidelity_standard, document_type
Section/heading count — Detects missing or collapsed sections
Table column consistency — Flags tables with inconsistent column counts
Table collapse detection — Detects multi-column tables collapsed into fewer cells (numeric density heuristic)
Reference numbering — Validates [1]-[N] sequential references
Encoding errors — Catches mojibake and broken Unicode
Image index completeness — Cross-references manifest against extracted files
Manifest consistency — Validates manifest JSON against image index table
Markdown syntax — Checks for common formatting errors

Exit Codes

Code	Meaning	Pipeline Action
0	PASS	Continue to next step
1	FAIL	Pipeline stops — fix required
2	WARN	Fix and rerun (do not proceed on WARN)

Claude Code Integration

Hook: Enforce MD-First Reading

The included hook intercepts Read tool calls in Claude Code and redirects PDF/DOCX/PPTX reads to their converted Markdown equivalents.

Setup:

Copy the hook script:

cp hooks/enforce-pdf-conversion.sh ~/.claude/hooks/
chmod +x ~/.claude/hooks/enforce-pdf-conversion.sh

Register in ~/.claude/settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Read",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/enforce-pdf-conversion.sh"
          }
        ]
      }
    ]
  }
}

How it works:

Hook intercepts every Read tool call
If the file is a PDF/DOCX/PPTX:
- Computes SHA-256 hash
- Looks up the hash in the conversion registry
- If found: redirects to the registered .md file
- If not found: checks for a co-located .md (same directory, same name)
- If no .md exists: blocks the read and prints the conversion command
All other file types pass through unchanged
Every interception is logged to ~/.claude/pipeline/hook-interceptions.log

Skill: Full Pipeline Orchestration

The included SKILL.md defines a Claude Code skill that orchestrates the complete pipeline with step-by-step instructions for both tiers.

Setup:

cp skill/SKILL.md ~/.claude/skills/convert-documents/SKILL.md

The skill provides:

Quick-start commands for each format
Step-by-step orchestration instructions
Expert persona reference table
QC loop enforcement (fix and rerun until zero issues)
Image analysis prompt templates

Configuration

Conversion Registry

The pipeline maintains a JSON registry at ~/.claude/pipeline/conversion_registry.json. Each entry records:

{
  "sha256": "a1b2c3...",
  "source_file": "/path/to/original.pdf",
  "output_md": "/path/to/converted.md",
  "pipeline_version": "3.2.0",
  "extractor": "pymupdf4llm",
  "conversion_date": "2025-01-15T10:30:00Z",
  "pages": 47,
  "image_index_path": "/path/to/image-index.md",
  "total_images_detected": 30,
  "substantive_images": 22,
  "has_testable_images": true
}

The registry enables:

Deduplication — Same file (by hash) is never converted twice
Hook lookup — The Claude Code hook finds converted .md by hash
Audit trail — Full provenance for every conversion

Image Index Overrides

For cases where automatic classification is wrong, create an image-index-overrides.json alongside the image index:

{
  "page_5_img_3.png": {
    "classification": "SUB",
    "reason": "Manual override: contains relevant diagram"
  }
}

The pipeline applies overrides during image index generation (Step 6c).

Output Metadata

Every converted Markdown file includes a YAML frontmatter header:

---
source_file: paper.pdf
source_format: pdf
conversion_date: "2025-01-15T10:30:00Z"
conversion_tool: pymupdf4llm
pipeline_version: "3.2.0"
fidelity_standard: zero_missing_text
document_type: academic_paper
pages: 47
domain: health_economics
---

Component Reference

File	Lines	Description
`scripts/run-pipeline.py`	7,749	Unified pipeline router, image classification, file organization, registry management
`scripts/convert-paper.py`	1,329	PDF text/image extraction via pymupdf4llm, multi-panel splitting, sparse page rendering
`scripts/convert-office.py`	3,122	DOCX/PPTX conversion, recursive shape extraction, chart rendering, PUA Unicode mapping
`scripts/qc-structural.py`	1,211	Structural QC engine: YAML validation, table collapse detection, encoding checks
`scripts/prepare-image-analysis.py`	634	Persona activation matrix, analysis manifest generation, template skeletons
`scripts/convert-mineru.py`	228	MinerU fallback wrapper for complex/scanned PDFs (CPU-only)
`scripts/style-docx-tables.py`	262	Professional DOCX styling for pandoc output (table colors, borders, code blocks)
`hooks/enforce-pdf-conversion.sh`	276	Claude Code PreToolUse hook: intercepts PDF/Office reads, redirects to Markdown
`skill/SKILL.md`	1,354	Claude Code skill definition: full pipeline orchestration with QC loops

Total: ~16,165 lines

Dependencies

Required

Package	Purpose
PyMuPDF (`fitz`)	PDF parsing, image extraction, vector detection
pymupdf4llm	LLM-optimized Markdown extraction from PDF
pdfplumber	Cross-validation of PDF extraction
python-docx	DOCX image extraction and styling
python-pptx	PPTX text and image extraction
Pillow	Image processing, blank detection, format conversion
NumPy	Pixel-level image analysis (near-black detection)
Pandoc	DOCX text extraction to Markdown
jq	JSON processing in the hook script

Optional

Package	Purpose
LibreOffice	Chart/SmartArt rendering (PPTX)
MinerU	Complex/scanned PDF fallback extractor
Tesseract	OCR for scanned documents
MarkItDown	XLSX and fallback PDF conversion

Python Version

Python 3.8+ is required. The codebase uses dataclasses, typing.Literal, and pathlib features available from Python 3.8 onward.

Examples

Convert an Academic Paper

# Full pipeline with organized output
python3 scripts/run-pipeline.py \
  ~/papers/smith-2024-cost-effectiveness.pdf \
  --target-dir ~/converted/smith-2024/

# Output structure:
# ~/converted/smith-2024/
#   smith-2024-cost-effectiveness.md
#   smith-2024-cost-effectiveness_images/
#   smith-2024-cost-effectiveness_manifest.json
#   smith-2024-cost-effectiveness_image-index.md
#   _originals/smith-2024-cost-effectiveness.pdf
#   PIPELINE-REPORT.md

Convert a Slide Deck with Charts

# Charts are rendered via LibreOffice at 300 DPI
python3 scripts/run-pipeline.py \
  ~/presentations/quarterly-review.pptx \
  --target-dir ~/converted/quarterly/

# SmartArt and charts appear as high-resolution PNG images
# in the _images/ directory with type_guess="chart" or "diagram"

Batch Conversion

# Convert all PDFs in a directory
for f in ~/papers/*.pdf; do
  python3 scripts/run-pipeline.py "$f" \
    --target-dir ~/converted/ \
    --skip-xval
done

# Generate project-level image index
python3 scripts/run-pipeline.py --generate-testable-index ~/converted/

Dry Run (Preview Without Moving Files)

python3 scripts/run-pipeline.py paper.pdf \
  --target-dir ~/converted/ \
  --dry-run

Troubleshooting

Common Issues

Issue	Cause	Fix
`FAIL: No YAML header block found`	Extractor produced malformed output	Check source file is valid; try `--force-extractor mineru`
Step 2 WARN: table collapse	Multi-column tables lost columns in conversion	QC inserts HTML WARNING comments; fix manually or re-extract
MinerU fallback triggered	>40% of pages failed cross-validation	Expected for complex layouts; MinerU handles these better
`ValueError: min() iterable argument is empty`	pymupdf4llm bug on certain table layouts	Fixed by disabling layout mode; should not recur
Hook blocks PDF read	No converted `.md` found	Run the pipeline first: `python3 run-pipeline.py <file>`
Near-black images not detected	Anti-aliased rendering creates subtle gradients	Pipeline uses 4-tier detection including pixel-percentage pass

Exit Codes

Script	Code	Meaning
`run-pipeline.py`	0	Success
`run-pipeline.py`	1	General failure
`run-pipeline.py`	3	Extractor crash (pymupdf4llm)
`qc-structural.py`	0	QC PASS
`qc-structural.py`	1	QC FAIL
`qc-structural.py`	2	QC WARN
`convert-mineru.py`	1	MinerU not installed
`convert-mineru.py`	2	Conversion failed

Design Decisions

Why not just use markitdown? MarkItDown is excellent for simple documents but loses table structure, heading hierarchy, and images in complex PDFs. This pipeline uses pymupdf4llm for superior table and multi-column support, with pdfplumber cross-validation to catch extraction errors.

Why a 2-tier architecture? LLM tokens are expensive. The Python tier handles everything that can be done deterministically (text extraction, image classification, QC) at zero token cost. The LLM tier is reserved for tasks that genuinely require visual understanding (image descriptions) or natural language judgement (content fidelity verification).

Why 8 expert personas? A single "describe this image" prompt produces generic descriptions. Domain-specific personas (e.g., a Statistician analyzing a Kaplan-Meier curve) produce descriptions that capture methodologically relevant details like confidence intervals, at-risk tables, and crossing hazard curves.

Why SHA-256 everywhere? File names change. File contents don't. Hash-based deduplication and registry lookup means the pipeline never re-converts a document it has already processed, even if the file is moved, renamed, or copied to a different directory.

Contributing

Contributions are welcome. Please:

Fork the repository
Create a feature branch (git checkout -b feature/my-feature)
Run the QC checks on any modified scripts
Submit a pull request with a clear description of changes

Development Setup

git clone https://github.com/orangefineblue/doc2md.git
cd doc2md
pip install -r requirements.txt  # when available

# Run structural QC on a test conversion
python3 scripts/run-pipeline.py tests/fixtures/sample.pdf -o /tmp/test.md
python3 scripts/qc-structural.py /tmp/test.md --verbose

Reporting Issues

When reporting a bug, please include:

The source file format (PDF/DOCX/PPTX)
The extractor used (check pipeline output)
The full error message or QC failure output
Python version (python3 --version)

License

MIT License. See LICENSE for details.

Related Projects

claude-code-orchestration-protocol — A zero-read orchestrator protocol for Claude Code that manages context window usage, delegates work to sub-agents, and runs QC loops until zero issues remain. Designed to work alongside doc2md for complex multi-document workflows where context rot is a concern.

Acknowledgements

Built for use with Claude Code by Anthropic. Uses PyMuPDF, pdfplumber, MinerU, and Pandoc for document processing.

doc2md is designed for researchers, analysts, and anyone who needs high-fidelity document conversion in LLM-powered workflows.

Report a Bug | Request a Feature

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
claude-code		claude-code
hooks		hooks
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

License

neoncapy/doc2md

Folders and files

Latest commit

History

Repository files navigation

Summary

doc2md

Why doc2md?

Features

Architecture

Quick Start

Prerequisites

Install

Run

Usage

Basic Conversion

PDF Conversion

DOCX Conversion

PPTX Conversion

XLSX Conversion

Organized Output (--target-dir)

Pipeline Steps

Image Classification

Persona Activation Matrix

Structural QC Engine

Checks Performed

Exit Codes

Claude Code Integration

Hook: Enforce MD-First Reading

Skill: Full Pipeline Orchestration

Configuration

Conversion Registry

Image Index Overrides

Output Metadata

Component Reference

Dependencies

Required

Optional

Python Version

Examples

Convert an Academic Paper

Convert a Slide Deck with Charts

Batch Conversion

Dry Run (Preview Without Moving Files)

Troubleshooting

Common Issues

Exit Codes

Design Decisions

Contributing

Development Setup

Reporting Issues

License

Related Projects

Acknowledgements

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Organized Output (`--target-dir`)

Packages