OpenStax CNXML → Markdown/JSONL converter for RAG & LLM (MathML → LaTeX, media, TOC)

tsurikow/openstax-converter

openstax-converter

openstax-converter is a CLI that converts OpenStax book sources (CNXML + collection XML) into an LLM-friendly dataset for RAG:

  • documents.jsonl — chunked documents with rich metadata
  • md/ — Markdown previews with local media links and LaTeX
  • figures.json — aggregated figure index
  • unhandled_tags.json — debug report for unhandled CNXML tags that contained text

The goal is pragmatic: preserve textbook structure, keep provenance back to the original sources, and produce chunks that work well for retrieval + prompting.


What it does

Structure and chunking

  • Parses an OpenStax *.collection.xml (TOC) and preserves hierarchy: book → chapter → module → section.
  • Converts each CNXML module into Markdown suitable for retrieval.
  • Emits chunks as JSONL documents:
    • front_matter — the first top-level module (often TOC/authors/license). Kept for provenance/navigation, but tagged so retrieval can exclude it.
    • overview — module-level Learning Objectives + Summary (only if present).
    • section — one JSONL document per section chunk (nested sections split into smaller chunks).
    • module — fallback single chunk for non-standard modules without sections.

Rendering

  • Converts MathML → LaTeX.
  • Renders core CNXML blocks into Markdown (paragraphs, sections, lists, tables, figures, notes, exercises/solutions, etc.).
  • Produces unhandled_tags.json as a safety/debug report for tags that contained text but were not explicitly handled.
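
To give a feel for what the MathML → LaTeX step involves, here is a minimal sketch that handles a handful of presentation MathML elements with the standard library. It is not the converter's own implementation; the function name `mathml_to_latex` and the set of supported tags are assumptions made for illustration, and real OpenStax content uses many more elements.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}mfrac'."""
    return tag.rsplit("}", 1)[-1]


def mathml_to_latex(node: ET.Element) -> str:
    """Recursively convert a small subset of presentation MathML to LaTeX.

    Hypothetical helper for illustration only; unknown tags fall through
    to plain concatenation of their children.
    """
    tag = _local(node.tag)
    kids = [mathml_to_latex(child) for child in node]
    if tag in ("mi", "mn", "mo"):
        # Leaf tokens: identifiers, numbers, operators.
        return (node.text or "").strip()
    if tag == "mfrac" and len(kids) == 2:
        return r"\frac{%s}{%s}" % (kids[0], kids[1])
    if tag == "msup" and len(kids) == 2:
        return "%s^{%s}" % (kids[0], kids[1])
    if tag == "msub" and len(kids) == 2:
        return "%s_{%s}" % (kids[0], kids[1])
    if tag == "msqrt":
        return r"\sqrt{%s}" % "".join(kids)
    # <math>, <mrow>, and anything unrecognised: join the children.
    return "".join(kids)
```

Parse a MathML fragment with `ET.fromstring(...)` and pass the root element to the function. A production converter would also need to escape special characters and cover tables, fences, and accents.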

Extraction

Extracts structured metadata useful for RAG:

  • breadcrumbs (book/chapter/module/section)
  • learning objectives (when present)
  • terms (when present)
  • figures (id, src, alt, caption)
  • links (internal/external; emitted per chunk for future graph building)

Output

Conversion writes into the --out directory:

  • book.json — book metadata + full TOC
  • documents.jsonl — one JSON object per chunk (overview/section/module/front_matter)
  • figures.json — aggregated figure index (all figures across the book)
  • unhandled_tags.json — tags with text that were not rendered explicitly
  • md/ — Markdown previews (one .md per JSONL document)
  • md/media/ — copied media/ folder so images resolve locally in previews

JSONL schema

Each documents.jsonl record contains (field names may evolve while the project is in active development):

  • doc_id — stable chunk id (<book_id>:<module_id>#<chunk>)
  • book_id, module_id
  • title
  • breadcrumb — hierarchy path (book → chapter → module → section)
  • doc_type — module (current dataset type)
  • section_id — for section chunks (when available)
  • content_md — Markdown with LaTeX math
  • learning_objectives — overview/module only (if present)
  • terms — extracted terms (if present)
  • figures — list of figure objects used in that chunk
  • links — list of link objects found in that chunk
  • source — provenance (CNXML path, uuid, chunk type, section_id/chunk_id, etc.)
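
As a rough sketch of how a consumer might read these records, the helpers below parse the documented doc_id format, iterate over JSONL lines, and drop front-matter chunks before indexing. The top-level field names come from the list above; the key that holds the chunk type inside `source` is not spelled out here, so `chunk_type` is an assumed name and is exposed as a parameter.

```python
import json
from typing import Iterable, Iterator


def parse_doc_id(doc_id: str) -> tuple[str, str, str]:
    """Split '<book_id>:<module_id>#<chunk>' into its three parts."""
    book_id, rest = doc_id.split(":", 1)
    module_id, chunk = rest.split("#", 1)
    return book_id, module_id, chunk


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line (works on an open file)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def retrievable(records: Iterable[dict], chunk_type_key: str = "chunk_type") -> list[dict]:
    """Drop front-matter chunks.

    `chunk_type_key` is an ASSUMED key name inside `source`; check a real
    record and adjust it if the converter uses a different name.
    """
    return [
        rec for rec in records
        if rec.get("source", {}).get(chunk_type_key) != "front_matter"
    ]
```

In practice you would open `documents.jsonl` from the --out directory with `encoding="utf-8"` and pass the file handle to `iter_records`.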

Links

  • Internal links (OpenStax cross-refs):
    {"kind":"internal","target_id":"fs-id...","anchor":"fs-id...","label":"...","section_id":"..."}
  • External links:
    {"kind":"external","url":"https://...","label":"...","section_id":"..."}
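
The link objects above are enough to start on a simple document graph. The sketch below groups link targets by the chunk that contains them, using only the `kind`, `target_id`, and `url` fields shown in the examples. The function name `build_link_index` is hypothetical. Resolving a `target_id` to the chunk that contains that anchor needs an extra lookup that this README does not describe, so that step is left out.

```python
from collections import defaultdict
from typing import Iterable


def build_link_index(
    records: Iterable[dict],
) -> tuple[dict[str, set[str]], dict[str, set[str]]]:
    """Return (internal, external) maps from doc_id to link targets.

    internal: doc_id -> set of CNXML target ids referenced by that chunk
    external: doc_id -> set of URLs referenced by that chunk
    """
    internal: dict[str, set[str]] = defaultdict(set)
    external: dict[str, set[str]] = defaultdict(set)
    for rec in records:
        doc_id = rec["doc_id"]
        for link in rec.get("links", []):
            kind = link.get("kind")
            if kind == "internal" and link.get("target_id"):
                internal[doc_id].add(link["target_id"])
            elif kind == "external" and link.get("url"):
                external[doc_id].add(link["url"])
    return dict(internal), dict(external)
```

Chunks without links simply do not appear in either map.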

Figures

Each figures entry typically contains:

  • id — CNXML figure id
  • src — normalized path (usually media/...)
  • alt
  • caption

figures.json is an aggregated index across the entire book.
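
The exact layout of figures.json is not documented here, so the sketch below does not read that file. Instead it rebuilds a comparable id-keyed index from the per-chunk `figures` lists, which can be handy for attaching captions to retrieved chunks. The function name `index_figures` and the added `doc_id` back-reference are assumptions for illustration.

```python
from typing import Iterable


def index_figures(records: Iterable[dict]) -> dict[str, dict]:
    """Map figure id -> figure entry, keeping the first occurrence.

    Each entry is a copy of the chunk's figure object with the owning
    chunk's doc_id added (an illustrative extra, not a converter field).
    """
    index: dict[str, dict] = {}
    for rec in records:
        for fig in rec.get("figures", []):
            fig_id = fig.get("id")
            if fig_id and fig_id not in index:
                index[fig_id] = {**fig, "doc_id": rec.get("doc_id")}
    return index
```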


Usage

From your local clone of this repository:

uv run openstax-converter convert /path/to/osbooks-*/collections/<book>.collection.xml --out ./out/<book>

Example (OpenStax Calculus Volume 1):

uv run openstax-converter convert \
  /path/to/osbooks-calculus-bundle/collections/calculus-volume-1.collection.xml \
  --out ./out/calculus-v1

Show CLI help:

uv run openstax-converter --help

Notes for RAG

  • This converter is optimized for retrieval + instruction-following, not pixel-perfect textbook rendering.
  • unhandled_tags.json is your first stop if you suspect missing content.
  • Figures are extracted into structured metadata and also rendered into Markdown previews.
  • links are emitted per chunk so you can later build a document graph (internal cross-references + external sources).

Development

Project layout (high level)

  • openstax_converter/pipeline/ — conversion pipeline (TOC parsing, chunking, aggregation, writers)
  • openstax_converter/cnxml/ — CNXML renderer (blocks/inline), MathML converter, splitting
  • openstax_converter/models.py — Pydantic models for JSONL schema

Tooling

This repo is managed with uv.

Common tasks:

# lint/format
uv run ruff check .
uv run ruff format .

License and content attribution

  • Code in this repository is licensed under Apache-2.0.
  • OpenStax content has its own license and attribution requirements. Make sure you comply with the applicable OpenStax license when using or redistributing converted materials.
