Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean markdown, structured JSON IR, and Docling output.
- Native Rust core - fast, no runtime dependencies
- Three output modes: markdown, structured JSON IR, Docling JSON
- CLI and SDK for Python, Node/Bun, and Rust
- Sheet, slide, and page selection
- Document property extraction
No install needed - run directly:
uvx officemd markdown report.docx
npx office-md markdown report.docx
bunx office-md markdown report.docx

Install globally:
uv tool install officemd

Or add as a dependency:
uv add officemd
npm install office-md
# or
bun add office-md
cargo install officemd_cli

All three surfaces expose a CLI named `officemd` (Python, Rust) or `office-md` (Node/Bun).
officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd markdown deck.pptx --pages 1-3
officemd render report.docx
officemd diff old.docx new.docx

The Rust CLI has additional subcommands:
officemd stream report.docx # stream to stdout (supports stdin via -)
officemd convert report.docx --output out.md # write to file
officemd inspect report.pdf --output-format json --pretty

| Flag | Description |
|---|---|
| `--format` | Force document format (docx, xlsx, csv, pptx, pdf) |
| `--pages` | Select pages/slides/sheets by index (e.g. "1,3-5") |
| `--sheets` | Select sheets by name or index (e.g. "Sales,1-2") |
| `--include-document-properties` | Include document metadata in output |
| `--markdown-style` | Output style: compact (default) or human |
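
The index syntax accepted by `--pages` (for example "1,3-5") can be illustrated with a small sketch. This is not officemd's own parser: `parse_selection` is a hypothetical helper, and it assumes ranges are inclusive and that duplicates collapse to their first occurrence. It handles numeric indices only, so sheet names as accepted by `--sheets` are out of scope.

```python
def parse_selection(spec: str) -> list[int]:
    """Expand a selection string such as "1,3-5" into a list of indices.

    Hypothetical helper for illustration; inclusive ranges are an assumption.
    """
    indices: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            indices.extend(range(start, end + 1))  # inclusive upper bound
        else:
            indices.append(int(part))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(indices))


print(parse_selection("1,3-5"))
```

Under these assumptions, `--pages "1,3-5"` would select items 1, 3, 4, and 5.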
from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes
content = Path("report.docx").read_bytes()
print(markdown_from_bytes(content, format="docx"))
print(extract_ir_json(content, format="docx"))
print(docling_from_bytes(content, format="docx"))

import { readFileSync } from "node:fs";
import { markdownFromBytes, extractIrJson, doclingFromBytes } from "office-md";
const content = readFileSync("report.docx");
console.log(markdownFromBytes(content, "docx"));
console.log(extractIrJson(content, "docx"));
console.log(doclingFromBytes(content, "docx"));

| Format | Extension | Markdown | JSON IR | Docling |
|---|---|---|---|---|
| Word | .docx | yes | yes | yes |
| Excel | .xlsx | yes | yes | yes |
| CSV | .csv | yes | yes | - |
| PowerPoint | .pptx | yes | yes | yes |
| PDF | .pdf | yes | yes | - |
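
The support matrix above can be encoded as a lookup so that a script rejects an unsupported combination before reading any file. The following is a minimal sketch, not part of officemd: `SUPPORTED_OUTPUTS`, `detect_format`, `supports`, `convert`, and the output labels `"markdown"`, `"json_ir"`, and `"docling"` are hypothetical names. It also assumes the Python functions accept every value listed for `--format` through their `format=` argument.

```python
from pathlib import Path

# Output modes per format, transcribed from the support table.
# The mode labels are hypothetical names chosen for this sketch.
SUPPORTED_OUTPUTS = {
    "docx": {"markdown", "json_ir", "docling"},
    "xlsx": {"markdown", "json_ir", "docling"},
    "csv": {"markdown", "json_ir"},
    "pptx": {"markdown", "json_ir", "docling"},
    "pdf": {"markdown", "json_ir"},
}


def detect_format(path: str) -> str:
    """Map a file extension to a format name, case-insensitively."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix not in SUPPORTED_OUTPUTS:
        raise ValueError(f"unsupported extension: {path!r}")
    return suffix


def supports(fmt: str, output: str) -> bool:
    """Return True if the table lists `output` as available for `fmt`."""
    return output in SUPPORTED_OUTPUTS.get(fmt, set())


def convert(path: str, output: str = "markdown") -> str:
    """Convert one file, failing early on unsupported combinations."""
    fmt = detect_format(path)
    if not supports(fmt, output):
        raise ValueError(f"{output} output is not available for {fmt}")
    # Imported lazily so the helpers above work without officemd installed.
    from officemd import docling_from_bytes, extract_ir_json, markdown_from_bytes

    functions = {
        "markdown": markdown_from_bytes,
        "json_ir": extract_ir_json,
        "docling": docling_from_bytes,
    }
    content = Path(path).read_bytes()
    return functions[output](content, format=fmt)
```

With officemd installed, `convert("report.docx")` would return markdown, while `convert("data.csv", "docling")` raises `ValueError` without touching the file.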
cargo nextest run --workspace
cargo clippy --workspace --all-targets --exclude officemd_pdf -- -D warnings

For JS and Python tests, see examples/README.md.
PDF extraction vendors pdf-inspector by Firecrawl (MIT).
PDF primitives come from lopdf by J-F-Liu (MIT).
Apache 2.0 - see LICENSE.