Docling Provenance Extension #2897

ceberam · 2026-01-20T13:39:52Z

Why we need it

Docling began as a PDF parser and evolved into a universal document understanding engine that can read .docx, .pptx, HTML, etc.
Content in the DoclingDocument is represented by DocItem objects, each of which carries provenance information:

class DocItem(BaseModel):
    prov: list[ProvenanceItem] = []

ProvenanceItem is tailored to layout sources (PDF, HTML, Word, PowerPoint), expressing a location within a document:

class ProvenanceItem(BaseModel):
    page_no: Annotated[int, Field(description="Page number")]
    bbox: Annotated[BoundingBox, Field(description="Bounding box")]
    charspan: Annotated[tuple[int, int], Field(description="Character span (0-indexed)")]

Once Docling began supporting non‑layout media, such as audio tracks transcribed through ASR and WebVTT subtitles (with others likely to follow), these fields stopped making sense: a subtitle has no pages or bounding boxes, but it does have start and end times.

The challenge

We must be able to attach provenance that is not layout‑based without breaking existing data.
Existing codebases (both in Docling and downstream projects) rely on the prov field.
A naïve approach, such as replacing prov with a union type, would force every consumer to add explicit type checks.
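To make the cost of the naïve approach concrete, here is a minimal sketch assuming pydantic v2. The names LayoutProv, TimedProv, NaiveItem and first_page are hypothetical stand-ins invented for illustration; they are not part of Docling.

```python
from typing import Optional, Union

from pydantic import BaseModel


class LayoutProv(BaseModel):
    """Hypothetical, simplified stand-in for ProvenanceItem."""

    page_no: int


class TimedProv(BaseModel):
    """Hypothetical time-based provenance."""

    start_time: float
    end_time: float


class NaiveItem(BaseModel):
    # prov is widened in place, so every existing reader of prov is affected.
    prov: list[Union[LayoutProv, TimedProv]] = []


def first_page(item: NaiveItem) -> Optional[int]:
    """Code that used to read item.prov[0].page_no now needs a type guard."""
    for p in item.prov:
        if isinstance(p, LayoutProv):
            return p.page_no
    return None
```

Without the isinstance guard, accessing page_no on a TimedProv entry would raise AttributeError, which is the kind of breakage the proposal tries to avoid.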

The proposed solution

Keep the legacy prov field unchanged; it is still used for layout sources.

Add a new, generic provenance field:

class DocItem(BaseModel):
    prov: list[ProvenanceItem] = []
    source: list[ProvenanceType] = []

Define ProvenanceType as a discriminated union that can represent any media type. For now, we would only address those types that cannot already be represented by ProvenanceItem:

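from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field

class ProvenanceBase(BaseModel):
    kind: str  # discriminator; each subclass narrows it to a Literal

class ProvenanceTrack(ProvenanceBase):
    kind: Literal["track"] = "track"
    start_time: float
    end_time: float

class ProvenanceVideo(ProvenanceBase):
    kind: Literal["video"] = "video"
    start_time: float
    end_time: float
    # …additional video‑specific fields…

# pydantic selects the concrete class from the value of "kind"
ProvenanceType = Annotated[Union[ProvenanceTrack, ProvenanceVideo], Field(discriminator='kind')]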
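As an illustration of how the discriminator behaves when loading data, the following sketch assumes pydantic v2 and repeats the proposed models so that it runs on its own. The adapter variable, the sample payload and the rejects_unknown_kind helper are hypothetical.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class ProvenanceBase(BaseModel):
    kind: str


class ProvenanceTrack(ProvenanceBase):
    kind: Literal["track"] = "track"
    start_time: float
    end_time: float


class ProvenanceVideo(ProvenanceBase):
    kind: Literal["video"] = "video"
    start_time: float
    end_time: float


ProvenanceType = Annotated[
    Union[ProvenanceTrack, ProvenanceVideo], Field(discriminator="kind")
]

# TypeAdapter validates a bare type without wrapping it in a model.
adapter = TypeAdapter(list[ProvenanceType])

raw = [
    {"kind": "track", "start_time": 1.0, "end_time": 2.5},
    {"kind": "video", "start_time": 0.0, "end_time": 10.0},
]
# Each dict is routed to the class whose "kind" literal matches.
parsed = adapter.validate_python(raw)


def rejects_unknown_kind() -> bool:
    """Return True if a payload with an unregistered kind fails validation."""
    try:
        adapter.validate_python(
            [{"kind": "page", "start_time": 0.0, "end_time": 1.0}]
        )
    except ValidationError:
        return True
    return False
```

Because the choice is driven by kind, consumers can branch on a single string field rather than probing which attributes happen to exist.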

Populate source with the appropriate subclass for each new media type.
- For WebVTT subtitles → ProvenanceTrack
- For future video files → ProvenanceVideo
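For the WebVTT case, populating source could look roughly like the sketch below. It assumes pydantic v2 and that start_time and end_time are expressed in seconds, which the proposal does not state. The helper track_from_timing and the regular expression are hypothetical and only handle the cue timing line, not a full WebVTT file.

```python
import re
from typing import Literal, Optional

from pydantic import BaseModel


class ProvenanceTrack(BaseModel):
    kind: Literal["track"] = "track"
    start_time: float  # assumed to be seconds
    end_time: float  # assumed to be seconds


# WebVTT timestamps are "hh:mm:ss.ttt" or "mm:ss.ttt"; the hours part is optional.
_TS = r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})"
_TIMING = re.compile(rf"^\s*{_TS}\s+-->\s+{_TS}")


def _seconds(h: Optional[str], m: str, s: str, ms: str) -> float:
    """Convert the captured timestamp parts to seconds."""
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def track_from_timing(line: str) -> ProvenanceTrack:
    """Build a ProvenanceTrack from a cue timing line; cue settings are ignored."""
    match = _TIMING.match(line)
    if match is None:
        raise ValueError(f"not a WebVTT cue timing line: {line!r}")
    groups = match.groups()
    return ProvenanceTrack(
        start_time=_seconds(*groups[:4]),
        end_time=_seconds(*groups[4:]),
    )
```

A backend would then append the returned object to the source list of the DocItem created for that cue.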
Optional metadata for WebVTT: any cue‑level annotations (speaker, styling) can be stored in the generic meta field of DocItem.
Serialization: keep the current behavior of omitting empty arrays, as is already done for comments.
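One way to omit empty arrays on output is a wrap-mode model serializer, sketched below under the assumption of pydantic v2. This is not necessarily how Docling implements it today; SketchItem and its text field are hypothetical.

```python
from typing import Literal

from pydantic import BaseModel, model_serializer


class ProvenanceTrack(BaseModel):
    kind: Literal["track"] = "track"
    start_time: float
    end_time: float


class SketchItem(BaseModel):
    """Hypothetical, reduced stand-in for DocItem."""

    text: str
    prov: list[dict] = []  # simplified; the real field holds ProvenanceItem
    source: list[ProvenanceTrack] = []

    @model_serializer(mode="wrap")
    def _drop_empty_lists(self, handler):
        # Run the default serialization first, then filter out empty lists.
        data = handler(self)
        return {key: value for key, value in data.items() if value != []}
```

With this in place, a layout-only item serializes without a source key and an audio-only item without a prov key, so existing serialized documents would not gain a new field.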

Backward‑compatibility strategy

When this feature is released, prov and source co-exist. New code should use source for non-layout formats, and legacy code continues to work unchanged.
Eventually, with a major release (none is planned at the moment), the content of prov is moved to source, ProvenanceType is extended to include ProvenanceItem, and prov is marked deprecated.
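During the co-existence phase, a consumer that wants all provenance regardless of media type could read both fields. The sketch below assumes pydantic v2 and uses reduced models (bbox and charspan are left out, and source only admits ProvenanceTrack). The helper iter_provenance is hypothetical and not part of the proposal.

```python
from typing import Iterator, Literal, Union

from pydantic import BaseModel


class ProvenanceItem(BaseModel):
    """Simplified: bbox and charspan are omitted in this sketch."""

    page_no: int


class ProvenanceTrack(BaseModel):
    kind: Literal["track"] = "track"
    start_time: float
    end_time: float


class DocItem(BaseModel):
    prov: list[ProvenanceItem] = []
    source: list[ProvenanceTrack] = []


def iter_provenance(
    item: DocItem,
) -> Iterator[Union[ProvenanceItem, ProvenanceTrack]]:
    """Yield layout provenance first, then the new generic provenance."""
    yield from item.prov
    yield from item.source


# Legacy code that only reads prov keeps working, even for an audio item.
audio_item = DocItem(source=[ProvenanceTrack(start_time=0.0, end_time=2.0)])
legacy_pages = [p.page_no for p in audio_item.prov]
```

Here legacy_pages is simply empty for the audio item, so no type check is needed in code written before source existed.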
