openstax-converter

openstax-converter is a CLI that converts OpenStax book sources (CNXML + collection XML) into an LLM-friendly dataset for RAG:

documents.jsonl — chunked documents with rich metadata
md/ — Markdown previews with local media links and LaTeX
figures.json — aggregated figure index
unhandled_tags.json — debug report for unhandled CNXML tags that contained text

The goal is pragmatic: preserve textbook structure, keep provenance back to the original sources, and produce chunks that work well for retrieval + prompting.

What it does

Structure and chunking

Parses an OpenStax *.collection.xml file (the TOC) and preserves its hierarchy: book → chapter → module → section.
Converts each CNXML module into Markdown suitable for retrieval.
Emits chunks as JSONL documents:
- front_matter — the first top-level module (often TOC/authors/license). Kept for provenance/navigation, but tagged so retrieval can exclude it.
- overview — module-level Learning Objectives + Summary (only if present).
- section — one JSONL document per section chunk (nested sections split into smaller chunks).
- module — fallback single chunk for non-standard modules without sections.
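As a rough illustration of how the front_matter tag can be used at retrieval time, the sketch below drops front-matter records before indexing. The key name `chunk_type` inside `source` is an assumption (the README only says that `source` carries the chunk type), so check a real record and adjust it.

```python
from typing import Iterable, Iterator

# Assumption: the chunk type lives at record["source"]["chunk_type"].
# The exact key name is hypothetical; inspect documents.jsonl to confirm it.
CHUNK_TYPE_KEY = "chunk_type"


def retrievable(
    records: Iterable[dict],
    exclude: frozenset = frozenset({"front_matter"}),
) -> Iterator[dict]:
    """Yield only the records whose chunk type is not in `exclude`."""
    for record in records:
        chunk_type = record.get("source", {}).get(CHUNK_TYPE_KEY)
        if chunk_type not in exclude:
            yield record
```

Records without a recognizable chunk type are kept rather than dropped, so a wrong key name shows up as "nothing was filtered" instead of silently losing content.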

Rendering

Converts MathML → LaTeX.
Renders core CNXML blocks into Markdown (paragraphs, sections, lists, tables, figures, notes, exercises/solutions, etc.).
Produces unhandled_tags.json as a safety/debug report for tags that contained text but were not explicitly handled.

Extraction

Extracts structured metadata useful for RAG:

breadcrumbs (book/chapter/module/section)
learning objectives (when present)
terms (when present)
figures (id, src, alt, caption)
links (internal/external; emitted per chunk for future graph building)

Output

Conversion writes into the --out directory:

book.json — book metadata + full TOC
documents.jsonl — one JSON object per chunk (overview/section/module/front_matter)
figures.json — aggregated figure index (all figures across the book)
unhandled_tags.json — tags with text that were not rendered explicitly
md/ — Markdown previews (one .md per JSONL document)
md/media/ — copied media/ folder so images resolve locally in previews
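A quick way to sanity-check a conversion run is to confirm that the paths listed above exist. This is a minimal sketch that only tests for presence; the helper name `missing_outputs` is hypothetical and not part of the CLI.

```python
from pathlib import Path

# Paths taken from the output list above, relative to the --out directory.
EXPECTED_OUTPUTS = (
    "book.json",
    "documents.jsonl",
    "figures.json",
    "unhandled_tags.json",
    "md",
    "md/media",
)


def missing_outputs(out_dir: str) -> list:
    """Return the expected output paths that do not exist under out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).exists()]
```

For example, `missing_outputs("./out/calculus-v1")` returns an empty list when every expected file and folder is present.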

JSONL schema

Each record in documents.jsonl contains the following fields (names may change while the project is in active development):

doc_id — stable chunk id (<book_id>:<module_id>#<chunk>)
book_id, module_id
title
breadcrumb — hierarchy path (book → chapter → module → section)
doc_type — module (current dataset type)
section_id — for section chunks (when available)
content_md — Markdown with LaTeX math
learning_objectives — overview/module only (if present)
terms — extracted terms (if present)
figures — list of figure objects used in that chunk
links — list of link objects found in that chunk
source — provenance (CNXML path, uuid, chunk type, section_id/chunk_id, etc.)
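The sketch below shows one way to read records and split a doc_id into its parts, following the `<book_id>:<module_id>#<chunk>` pattern listed above. It assumes that book_id contains no `:` and that the chunk part contains no `#`; the helper names are illustrative and not part of the package.

```python
import json
from typing import Iterable, Iterator, NamedTuple


class DocId(NamedTuple):
    book_id: str
    module_id: str
    chunk: str


def parse_doc_id(doc_id: str) -> DocId:
    """Split '<book_id>:<module_id>#<chunk>' into its three parts."""
    prefix, sep, chunk = doc_id.rpartition("#")
    if not sep:
        raise ValueError(f"no '#<chunk>' suffix in {doc_id!r}")
    book_id, sep, module_id = prefix.partition(":")
    if not sep:
        raise ValueError(f"no '<book_id>:' prefix in {doc_id!r}")
    return DocId(book_id, module_id, chunk)


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL input: one JSON object per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)
```

Because `iter_records` accepts any iterable of strings, an open file handle works directly: `with open("out/<book>/documents.jsonl", encoding="utf-8") as f: records = list(iter_records(f))`.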

Links

Internal links (OpenStax cross-refs):

{"kind":"internal","target_id":"fs-id...","anchor":"fs-id...","label":"...","section_id":"..."}

External links:

{"kind":"external","url":"https://...","label":"...","section_id":"..."}
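To show how these link objects could feed a document graph, the sketch below groups link targets by the chunk that contains them. It relies only on the fields shown above (`kind`, `target_id`, `url`) plus `doc_id` and `links` from the record schema; the function name `collect_links` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def collect_links(
    records: Iterable[dict],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Map each doc_id to its internal target ids and its external URLs."""
    internal = defaultdict(set)
    external = defaultdict(set)
    for record in records:
        for link in record.get("links", []):
            kind = link.get("kind")
            if kind == "internal":
                internal[record["doc_id"]].add(link["target_id"])
            elif kind == "external":
                external[record["doc_id"]].add(link["url"])
    return dict(internal), dict(external)
```

Note that an internal `target_id` points at a CNXML element id, not at a doc_id. Turning these sets into chunk-to-chunk edges needs an extra lookup from element ids to the chunks that contain them, which this sketch does not attempt.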

Figures

Each entry in a chunk's figures list typically contains:

id — CNXML figure id
src — normalized path (usually media/...)
alt
caption

figures.json is an aggregated index across the entire book.
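A possible way to use the aggregated index is to key it by figure id and resolve each figure to its copied file under md/. The sketch assumes figures.json is a JSON array of objects with the fields listed above, and that `src` is relative to the md/ folder (as in `media/...`); both are assumptions to verify against real output.

```python
import json
from pathlib import Path
from typing import Dict, Iterable


def index_figures(figures: Iterable[dict]) -> Dict[str, dict]:
    """Build an id -> figure lookup; a later duplicate id overwrites an earlier one."""
    return {figure["id"]: figure for figure in figures}


def load_figure_index(out_dir: str) -> Dict[str, dict]:
    """Load <out_dir>/figures.json, assuming its top level is a list of figures."""
    text = (Path(out_dir) / "figures.json").read_text(encoding="utf-8")
    return index_figures(json.loads(text))


def preview_image_path(out_dir: str, figure: dict) -> Path:
    """Resolve a figure's src against the md/ preview folder (assumed layout)."""
    return Path(out_dir) / "md" / figure["src"]
```

If figures.json turns out to be an object rather than an array, only `load_figure_index` needs to change.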

Usage

From your local clone of this repository:

uv run openstax-converter convert /path/to/osbooks-*/collections/<book>.collection.xml --out ./out/<book>

Example (OpenStax Calculus Volume 1):

uv run openstax-converter convert \
  /path/to/osbooks-calculus-bundle/collections/calculus-volume-1.collection.xml \
  --out ./out/calculus-v1

Show CLI help:

uv run openstax-converter --help

Notes for RAG

This converter is optimized for retrieval + instruction-following, not pixel-perfect textbook rendering.
unhandled_tags.json is your first stop if you suspect missing content.
Figures are extracted into structured metadata and also rendered into Markdown previews.
links are emitted per chunk so you can later build a document graph (internal cross-references + external sources).

Development

Project layout (high level)

openstax_converter/pipeline/ — conversion pipeline (TOC parsing, chunking, aggregation, writers)
openstax_converter/cnxml/ — CNXML renderer (blocks/inline), MathML converter, splitting
openstax_converter/models.py — Pydantic models for JSONL schema

Tooling

This repo is managed with uv.

Common tasks:

# lint/format
uv run ruff check .
uv run ruff format .

License and content attribution

Code in this repository is licensed under Apache-2.0.
OpenStax content has its own license and attribution requirements. Make sure you comply with the applicable OpenStax license when using or redistributing converted materials.
