OfficeMD

Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean markdown or structured JSON IR; DOCX, XLSX, and PPTX can also be exported as Docling JSON.

Native Rust core - fast, no runtime dependencies
Three output modes: markdown, structured JSON IR, Docling JSON
CLI and SDK for Python, Node/Bun, and Rust
Sheet, slide, and page selection
Document property extraction

Quick Start

No install needed - run directly:

uvx officemd markdown report.docx
npx office-md markdown report.docx
bunx office-md markdown report.docx

Install

Python

uv tool install officemd

Or add as a dependency:

uv add officemd

Node / Bun

npm install office-md
# or
bun add office-md

Rust

cargo install officemd_cli

CLI

Each distribution (Python, Node/Bun, Rust) ships a CLI: officemd for Python and Rust, office-md for Node/Bun.

officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd markdown deck.pptx --pages 1-3
officemd render report.docx
officemd diff old.docx new.docx

The Rust CLI has additional subcommands:

officemd stream report.docx                    # stream to stdout (supports stdin via -)
officemd convert report.docx --output out.md   # write to file
officemd inspect report.pdf --output-format json --pretty

Common options

Flag	Description
`--format`	Force document format (docx, xlsx, csv, pptx, pdf)
`--pages`	Select pages/slides/sheets by index (e.g. "1,3-5")
`--sheets`	Select sheets by name or index (e.g. "Sales,1-2")
`--include-document-properties`	Include document metadata in output
`--markdown-style`	Output style: `compact` (default) or `human`
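
The flags in the table above can also be assembled from a script. The following is a minimal sketch, assuming the officemd CLI is on PATH and that every value-taking flag accepts its value as a separate argument (as `--sheets "Summary,Q1"` does above). The helper name `build_markdown_command` is hypothetical and not part of OfficeMD, and treating `--include-document-properties` as a bare switch is an assumption.

```python
from typing import List, Optional


def build_markdown_command(
    path: str,
    sheets: Optional[str] = None,
    pages: Optional[str] = None,
    style: Optional[str] = None,
    include_properties: bool = False,
) -> List[str]:
    """Build an argv list for `officemd markdown` (hypothetical helper)."""
    cmd = ["officemd", "markdown", path]
    if sheets is not None:
        # Sheet names or indexes, e.g. "Sales,1-2"
        cmd += ["--sheets", sheets]
    if pages is not None:
        # Page/slide/sheet indexes, e.g. "1,3-5"
        cmd += ["--pages", pages]
    if style is not None:
        # Only the two styles listed in the options table are accepted here
        if style not in ("compact", "human"):
            raise ValueError(f"unknown markdown style: {style!r}")
        cmd += ["--markdown-style", style]
    if include_properties:
        # Assumption: this flag is a switch that takes no value
        cmd.append("--include-document-properties")
    return cmd
```

Passing the result to `subprocess.run(cmd, capture_output=True, text=True, check=True)` would run the conversion, assuming the markdown subcommand prints its result to stdout. Building the argument list separately keeps quoting out of the picture, since no shell is involved.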

SDK

Python

from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes

content = Path("report.docx").read_bytes()

print(markdown_from_bytes(content, format="docx"))
print(extract_ir_json(content, format="docx"))
print(docling_from_bytes(content, format="docx"))
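
When processing a mix of files, the `format` argument can be derived from the file extension. This is a minimal sketch, assuming the accepted format strings are the ones listed for `--format` (docx, xlsx, csv, pptx, pdf). `detect_format` and `convert_file` are hypothetical helpers, not part of the SDK; the converter is passed in so that any of the three functions shown above can be used.

```python
from pathlib import Path
from typing import Any, Callable

# Extension -> format string. Assumed to match the values accepted by --format.
FORMATS = {
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".csv": "csv",
    ".pptx": "pptx",
    ".pdf": "pdf",
}


def detect_format(path: Path) -> str:
    """Map a file extension to a format string (hypothetical helper)."""
    suffix = path.suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return FORMATS[suffix]


def convert_file(path: Path, converter: Callable[..., Any]) -> Any:
    """Read the file and call converter(content, format=...)."""
    return converter(path.read_bytes(), format=detect_format(path))
```

For example, `convert_file(Path("report.docx"), markdown_from_bytes)` would make the same call as the first `print` line above, with the format inferred from the extension.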

JavaScript

import { readFileSync } from "node:fs";
import { markdownFromBytes, extractIrJson, doclingFromBytes } from "office-md";

const content = readFileSync("report.docx");

console.log(markdownFromBytes(content, "docx"));
console.log(extractIrJson(content, "docx"));
console.log(doclingFromBytes(content, "docx"));

Supported Formats

Format	Extension	Markdown	JSON IR	Docling
Word	.docx	yes	yes	yes
Excel	.xlsx	yes	yes	yes
CSV	.csv	yes	yes	-
PowerPoint	.pptx	yes	yes	yes
PDF	.pdf	yes	yes	-
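
Because Docling output is not available for CSV and PDF, a caller may want to check the table before requesting it. The sketch below transcribes the table into a lookup; the mode labels (`markdown`, `json_ir`, `docling`) and the `supports` helper are hypothetical names chosen for illustration, not OfficeMD identifiers.

```python
# Output modes per format, transcribed from the Supported Formats table.
SUPPORTED_OUTPUTS = {
    "docx": {"markdown", "json_ir", "docling"},
    "xlsx": {"markdown", "json_ir", "docling"},
    "csv": {"markdown", "json_ir"},
    "pptx": {"markdown", "json_ir", "docling"},
    "pdf": {"markdown", "json_ir"},
}


def supports(fmt: str, mode: str) -> bool:
    """Return True if the table lists `mode` for `fmt` (hypothetical helper)."""
    return mode in SUPPORTED_OUTPUTS.get(fmt.lower(), set())
```

A caller could guard a Docling conversion with `if supports(fmt, "docling"):` and fall back to JSON IR otherwise.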

Development

cargo nextest run --workspace
cargo clippy --workspace --all-targets --exclude officemd_pdf -- -D warnings

For JS and Python tests, see examples/README.md.

Acknowledgements

PDF extraction vendors pdf-inspector by Firecrawl (MIT).

PDF primitives are provided by lopdf by J-F-Liu (MIT).

License

Apache 2.0 - see LICENSE.
