ThomAub/officemd

OfficeMD

Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean markdown and structured JSON IR, and DOCX, XLSX, and PPTX into Docling output.

  • Native Rust core - fast, no runtime dependencies
  • Three output modes: markdown, structured JSON IR, Docling JSON
  • CLI and SDK for Python, Node/Bun, and Rust
  • Sheet, slide, and page selection
  • Document property extraction

Quick Start

No install needed - run directly:

uvx officemd markdown report.docx
npx office-md markdown report.docx
bunx office-md markdown report.docx

Install

Python

uv tool install officemd

Or add as a dependency:

uv add officemd

Node / Bun

npm install office-md
# or
bun add office-md

Rust

cargo install officemd_cli

CLI

All three distributions expose a CLI: officemd (Python, Rust) or office-md (Node/Bun). The examples below use officemd.

officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd markdown deck.pptx --pages 1-3
officemd render report.docx
officemd diff old.docx new.docx

The Rust CLI has additional subcommands:

officemd stream report.docx                    # stream to stdout (supports stdin via -)
officemd convert report.docx --output out.md   # write to file
officemd inspect report.pdf --output-format json --pretty

Common options

| Flag | Description |
|------|-------------|
| --format | Force document format (docx, xlsx, csv, pptx, pdf) |
| --pages | Select pages/slides/sheets by index (e.g. "1,3-5") |
| --sheets | Select sheets by name or index (e.g. "Sales,1-2") |
| --include-document-properties | Include document metadata in output |
| --markdown-style | Output style: compact (default) or human |
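
When calling the CLI from a script, the flags above can be assembled into an argument list. The following is a minimal sketch, not part of OfficeMD: build_markdown_command and its parameter names are hypothetical, and it assumes that every common option is accepted by the markdown subcommand and that --include-document-properties is a plain switch without a value.

```python
from typing import List, Optional


def build_markdown_command(
    path: str,
    *,
    sheets: Optional[str] = None,
    pages: Optional[str] = None,
    style: Optional[str] = None,
    include_properties: bool = False,
    force_format: Optional[str] = None,
    executable: str = "officemd",  # use "office-md" for the Node/Bun CLI
) -> List[str]:
    """Build an argv list for `officemd markdown` (hypothetical helper)."""
    cmd = [executable, "markdown", path]
    if force_format:
        cmd += ["--format", force_format]
    if pages:
        cmd += ["--pages", pages]
    if sheets:
        cmd += ["--sheets", sheets]
    if include_properties:
        # Assumption: this flag is a boolean switch.
        cmd.append("--include-document-properties")
    if style:
        cmd += ["--markdown-style", style]
    return cmd


# Example: the same call as `officemd markdown budget.xlsx --sheets "Summary,Q1"`.
print(build_markdown_command("budget.xlsx", sheets="Summary,Q1"))
```

The resulting list can be passed to subprocess.run(cmd, check=True, capture_output=True, text=True). Passing a list instead of a shell string avoids quoting problems with sheet names that contain spaces or commas.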

SDK

Python

from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes

content = Path("report.docx").read_bytes()

print(markdown_from_bytes(content, format="docx"))
print(extract_ir_json(content, format="docx"))
print(docling_from_bytes(content, format="docx"))
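
The same call can be applied to a whole directory. The sketch below is an illustration under stated assumptions: guess_format and convert_tree are hypothetical helpers, the extension map is copied from the Supported Formats table, and markdown_from_bytes is assumed to accept the format keyword exactly as in the example above. The converter is injectable so the loop can be exercised without OfficeMD installed.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

# Extension -> format string, taken from the Supported Formats table.
SUPPORTED_EXTENSIONS = {
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".csv": "csv",
    ".pptx": "pptx",
    ".pdf": "pdf",
}


def guess_format(path: Path) -> Optional[str]:
    """Return the format string for a path, or None if unsupported."""
    return SUPPORTED_EXTENSIONS.get(path.suffix.lower())


def convert_tree(
    root: Path,
    convert: Optional[Callable[..., str]] = None,
) -> Dict[Path, str]:
    """Convert every supported file under root to markdown (hypothetical helper)."""
    if convert is None:
        # Imported lazily so the helper can be tested with a stand-in converter.
        from officemd import markdown_from_bytes

        convert = markdown_from_bytes
    results: Dict[Path, str] = {}
    for path in sorted(root.rglob("*")):
        fmt = guess_format(path)
        if fmt is None or not path.is_file():
            continue  # skip directories and unsupported extensions
        results[path] = convert(path.read_bytes(), format=fmt)
    return results
```

Calling convert_tree(Path("docs")) would return a mapping from each supported file to its markdown text; files with other extensions are skipped silently.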

JavaScript

import { readFileSync } from "node:fs";
import { markdownFromBytes, extractIrJson, doclingFromBytes } from "office-md";

const content = readFileSync("report.docx");

console.log(markdownFromBytes(content, "docx"));
console.log(extractIrJson(content, "docx"));
console.log(doclingFromBytes(content, "docx"));

Supported Formats

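| Format | Extension | Markdown | JSON IR | Docling |
|--------|-----------|----------|---------|---------|
| Word | .docx | yes | yes | yes |
| Excel | .xlsx | yes | yes | yes |
| CSV | .csv | yes | yes | - |
| PowerPoint | .pptx | yes | yes | yes |
| PDF | .pdf | yes | yes | - |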
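
A caller that picks an output mode at runtime may want to check this matrix before requesting Docling output. The lookup below is a sketch that transcribes the table by hand; OUTPUT_SUPPORT, available_outputs, and the mode names "markdown", "json_ir", and "docling" are hypothetical labels, not identifiers from the OfficeMD API.

```python
from typing import Dict, List

# Hand-copied from the Supported Formats table; "-" is recorded as False.
OUTPUT_SUPPORT: Dict[str, Dict[str, bool]] = {
    "docx": {"markdown": True, "json_ir": True, "docling": True},
    "xlsx": {"markdown": True, "json_ir": True, "docling": True},
    "csv": {"markdown": True, "json_ir": True, "docling": False},
    "pptx": {"markdown": True, "json_ir": True, "docling": True},
    "pdf": {"markdown": True, "json_ir": True, "docling": False},
}


def available_outputs(fmt: str) -> List[str]:
    """List the output modes the table marks as supported for a format."""
    try:
        row = OUTPUT_SUPPORT[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return [mode for mode, supported in row.items() if supported]


print(available_outputs("csv"))
print(available_outputs("pptx"))
```

Because the table is duplicated by hand here, it would need updating whenever the project adds Docling support for another format.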

Development

cargo nextest run --workspace
cargo clippy --workspace --all-targets --exclude officemd_pdf -- -D warnings

For JS and Python tests, see examples/README.md.

Acknowledgements

PDF extraction vendors pdf-inspector by Firecrawl (MIT).

PDF primitives come from lopdf by J-F-Liu (MIT).

License

Apache 2.0 - see LICENSE.
