openstax-converter is a CLI that converts OpenStax book sources (CNXML + collection XML) into an LLM-friendly dataset for RAG:
- documents.jsonl — chunked documents with rich metadata
- md/ — Markdown previews with local media links and LaTeX
- figures.json — aggregated figure index
- unhandled_tags.json — debug report for unhandled CNXML tags that contained text
The goal is pragmatic: preserve textbook structure, keep provenance back to the original sources, and produce chunks that work well for retrieval + prompting.
- Parses an OpenStax *.collection.xml (TOC) and preserves hierarchy: book → chapter → module → section.
- Converts each CNXML module into Markdown suitable for retrieval.
- Emits chunks as JSONL documents:
  - front_matter: the first top-level module (often TOC/authors/license). Kept for provenance/navigation, but tagged so retrieval can exclude it.
  - overview: module-level Learning Objectives + Summary (only if present).
  - section: one JSONL document per section chunk (nested sections split into smaller chunks).
  - module: fallback single chunk for non-standard modules without sections.
- Converts MathML → LaTeX.
- Renders core CNXML blocks into Markdown (paragraphs, sections, lists, tables, figures, notes, exercises/solutions, etc.).
- Produces unhandled_tags.json as a safety/debug report for tags that contained text but were not explicitly handled.
Extracts structured metadata useful for RAG:
- breadcrumbs (book/chapter/module/section)
- learning objectives (when present)
- terms (when present)
- figures (id, src, alt, caption)
- links (internal/external; emitted per chunk for future graph building)
Conversion writes into the --out directory:
- book.json: book metadata + full TOC
- documents.jsonl: one JSON object per chunk (overview/section/module/front_matter)
- figures.json: aggregated figure index (all figures across the book)
- unhandled_tags.json: tags with text that were not rendered explicitly
- md/: Markdown previews (one .md per JSONL document)
- md/media/: copied media/ folder so images resolve locally in previews
Each documents.jsonl record contains (field names may evolve while the project is in active development):
- doc_id: stable chunk id (<book_id>:<module_id>#<chunk>)
- book_id, module_id
- title
- breadcrumb: hierarchy path (book → chapter → module → section)
- doc_type: module (current dataset type)
- section_id: for section chunks (when available)
- content_md: Markdown with LaTeX math
- learning_objectives: overview/module only (if present)
- terms: extracted terms (if present)
- figures: list of figure objects used in that chunk
- links: list of link objects found in that chunk
- source: provenance (CNXML path, uuid, chunk type, section_id/chunk_id, etc.)
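As a rough illustration of consuming these records, the sketch below reads documents.jsonl line by line, splits a doc_id into its parts, and flags front-matter chunks so they can be skipped at retrieval time. The helper names are hypothetical, and the key used for the chunk type (source["chunk_type"]) is an assumption; the field list above only says that the chunk type lives somewhere under source, so check a real record before relying on it.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(jsonl_path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a documents.jsonl file."""
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_doc_id(doc_id: str) -> tuple[str, str, str]:
    """Split '<book_id>:<module_id>#<chunk>' into (book_id, module_id, chunk)."""
    book_and_module, _, chunk = doc_id.partition("#")
    book_id, _, module_id = book_and_module.partition(":")
    return book_id, module_id, chunk


def is_front_matter(record: dict) -> bool:
    """Return True for chunks tagged as front matter."""
    # ASSUMPTION: the chunk type is stored under source["chunk_type"].
    # Adjust the key if the actual records use a different name.
    return record.get("source", {}).get("chunk_type") == "front_matter"
```

A retrieval index could then be built from `[r for r in iter_documents(path) if not is_front_matter(r)]`.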
- Internal links (OpenStax cross-refs):
  {"kind":"internal","target_id":"fs-id...","anchor":"fs-id...","label":"...","section_id":"..."}
- External links:
  {"kind":"external","url":"https://...","label":"...","section_id":"..."}
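One possible first step toward a document graph is to collect, for each chunk, the targets of its internal links. The sketch below uses only the doc_id, links, kind, and target_id fields shown above; the function name is hypothetical.

```python
from collections import defaultdict


def build_internal_link_index(records: list[dict]) -> dict[str, set[str]]:
    """Map each doc_id to the set of target_id values of its internal links."""
    index: dict[str, set[str]] = defaultdict(set)
    for record in records:
        for link in record.get("links", []):
            # External links carry a url instead of a target_id, so skip them.
            if link.get("kind") == "internal" and link.get("target_id"):
                index[record["doc_id"]].add(link["target_id"])
    return dict(index)
```

The target_id values appear to be element ids rather than doc_id values, so turning this index into chunk-to-chunk edges would likely need an extra lookup from element id to the chunk that contains it, which is not shown here.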
Each figures entry typically contains:
- id: CNXML figure id
- src: normalized path (usually media/...)
- alt
- caption
figures.json is an aggregated index across the entire book.
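A small sketch of how the aggregated index might be used, for example to look up a figure by id or to list figures without alt text. It assumes figures.json is a JSON array of figure objects with the fields listed above; the top-level layout is not specified here, so treat that as an assumption. Both function names are hypothetical.

```python
import json
from pathlib import Path


def load_figure_index(figures_path: Path) -> dict[str, dict]:
    """Load figures.json and key each figure object by its id."""
    # ASSUMPTION: the file holds a JSON array of figure objects.
    figures = json.loads(figures_path.read_text(encoding="utf-8"))
    return {fig["id"]: fig for fig in figures if "id" in fig}


def figures_missing_alt(index: dict[str, dict]) -> list[str]:
    """Return the sorted ids of figures whose alt text is empty or absent."""
    return sorted(fig_id for fig_id, fig in index.items() if not fig.get("alt"))
```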
From your local clone of this repository:
uv run openstax-converter convert /path/to/osbooks-*/collections/<book>.collection.xml --out ./out/<book>
Example (OpenStax Calculus Volume 1):
uv run openstax-converter convert \
/path/to/osbooks-calculus-bundle/collections/calculus-volume-1.collection.xml \
--out ./out/calculus-v1
Show CLI help:
uv run openstax-converter --help
- This converter is optimized for retrieval + instruction-following, not pixel-perfect textbook rendering.
- unhandled_tags.json is your first stop if you suspect missing content.
- Figures are extracted into structured metadata and also rendered into Markdown previews.
- links are emitted per chunk so you can later build a document graph (internal cross-references + external sources).
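If you want to convert several books from a script, the convert command from the usage section can be wrapped with subprocess. This is only a sketch: the function names are hypothetical, and it uses nothing beyond the convert subcommand and the --out flag shown earlier.

```python
import subprocess
from pathlib import Path


def build_convert_command(collection_xml: Path, out_dir: Path) -> list[str]:
    """Build the argv list for a single 'convert' invocation."""
    return [
        "uv", "run", "openstax-converter", "convert",
        str(collection_xml),
        "--out", str(out_dir),
    ]


def run_convert(collection_xml: Path, out_dir: Path) -> None:
    """Run the converter and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_convert_command(collection_xml, out_dir), check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.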
- openstax_converter/pipeline/: conversion pipeline (TOC parsing, chunking, aggregation, writers)
- openstax_converter/cnxml/: CNXML renderer (blocks/inline), MathML converter, splitting
- openstax_converter/models.py: Pydantic models for JSONL schema
This repo is managed with uv.
Common tasks:
# lint/format
uv run ruff check .
uv run ruff format .
- Code in this repository is licensed under Apache-2.0.
- OpenStax content has its own license and attribution requirements. Make sure you comply with the applicable OpenStax license when using or redistributing converted materials.