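Why we need it
Docling began as a PDF parser and evolved into a universal document understanding engine that can also read .docx, .pptx, HTML, and other formats.
Content in the DoclingDocument is represented by DocItem objects, each of which carries provenance information in the prov field. ProvenanceItem is tailored to layout sources (PDF, HTML, Word, PowerPoint), expressing a location within a document.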
When Docling started to support non-layout media, such as audio tracks (through ASR), WebVTT subtitles, and others that may come soon, the layout fields of ProvenanceItem simply stopped making sense: a subtitle has no pages or boxes, but it does have start and end times.
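To make the mismatch concrete, the sketch below contrasts a layout-style provenance record with what a WebVTT cue actually offers. `LayoutProvenance` is a simplified stand-in for `ProvenanceItem` (the field names `page_no`, `bbox`, and `charspan` are assumptions for illustration), and `parse_vtt_timestamp` is a hypothetical helper, not a Docling function.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LayoutProvenance:
    """Simplified stand-in for a layout-based provenance record."""
    page_no: int
    bbox: Tuple[float, float, float, float]  # left, top, right, bottom
    charspan: Tuple[int, int]


def parse_vtt_timestamp(ts: str) -> float:
    """Convert a WebVTT timestamp ('hh:mm:ss.ttt' or 'mm:ss.ttt') to seconds."""
    parts = ts.strip().split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"not a WebVTT timestamp: {ts!r}")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


# A WebVTT cue timing line: it carries a time range and nothing else.
cue_timing = "00:00:01.500 --> 00:00:04.000"
start_raw, end_raw = cue_timing.split("-->")
start, end = parse_vtt_timestamp(start_raw), parse_vtt_timestamp(end_raw)

# There is no page, box, or character span to put into LayoutProvenance;
# the only meaningful location for this cue is the pair (start, end).
cue_location = (start, end)
```

Any attempt to fill `LayoutProvenance` for this cue would require inventing a page number and a bounding box, which is exactly the problem the proposal addresses.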
The challenge
We must be able to attach provenance that is not layout‑based without breaking existing data.
Existing codebases (both in Docling and downstream projects) rely on the prov field.
A naïve approach, such as replacing prov with a union type, would force every consumer to add explicit type checks.
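The following sketch illustrates why turning prov into a union would be a breaking change. All class and function names are hypothetical; `pages_used` stands in for any downstream code that assumes every entry carries layout fields.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class LayoutProv:  # hypothetical stand-in for the existing layout provenance
    page_no: int


@dataclass
class TimeProv:  # hypothetical time-based provenance
    start_time: float
    end_time: float


def pages_used(prov: List[LayoutProv]) -> List[int]:
    """Typical legacy consumer: assumes every entry has a page number."""
    return [p.page_no for p in prov]


def pages_used_union(prov: List[Union[LayoutProv, TimeProv]]) -> List[int]:
    """What every consumer would have to become if prov were a union."""
    return [p.page_no for p in prov if isinstance(p, LayoutProv)]


mixed = [LayoutProv(page_no=1), TimeProv(start_time=0.0, end_time=2.5)]

try:
    pages_used(mixed)  # legacy code meets a time-based entry
    legacy_ok = True
except AttributeError:
    legacy_ok = False  # TimeProv has no attribute 'page_no'
```

Keeping prov purely layout-based avoids this: existing consumers never see an entry without `page_no`.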
The proposed solution
Keep the legacy prov field unchanged, still used for layout sources.
Add a new, generic provenance field named source.
Define ProvenanceType as a discriminated union that can represent any media type. For now, we would only address those types that cannot already be addressed with ProvenanceItem.
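A minimal sketch of what such a discriminated union might look like, written with standard-library dataclasses to stay self-contained (Docling's own models are pydantic-based, so the real definition would differ). The `kind` discriminator, the `start_time`/`end_time` fields, and the `load_provenance` helper are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal, Union


@dataclass
class ProvenanceTrack:
    """Time-based provenance, e.g. for a subtitle cue (fields are assumed)."""
    start_time: float
    end_time: float
    kind: Literal["track"] = "track"


@dataclass
class ProvenanceVideo:
    """Placeholder for a future video provenance (fields are assumed)."""
    start_time: float
    end_time: float
    kind: Literal["video"] = "video"


ProvenanceType = Union[ProvenanceTrack, ProvenanceVideo]

# The discriminator value selects the concrete class when loading data.
_BY_KIND = {"track": ProvenanceTrack, "video": ProvenanceVideo}


def load_provenance(data: Dict[str, Any]) -> ProvenanceType:
    """Build the right subclass from a plain dict using its 'kind' tag."""
    try:
        cls = _BY_KIND[data["kind"]]
    except KeyError:
        raise ValueError(f"unknown or missing provenance kind: {data!r}") from None
    return cls(**data)


prov = load_provenance({"kind": "track", "start_time": 1.5, "end_time": 4.0})
```

The point of the tag is that a loader can pick the concrete class without inspecting which fields happen to be present, so adding a new media type means adding one class and one table entry.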
Populate source with the appropriate subclass for each new media type.
For WebVTT subtitles → ProvenanceTrack
For future video files → ProvenanceVideo
Optional metadata for WebVTT: any cue‑level annotations (speaker, styling) can be stored in the generic meta field of DocItem.
Serialization: keep the current behavior of omitting empty arrays, as is done for comments.
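The steps above can be sketched together as follows. `SketchDocItem` is a hypothetical, heavily reduced stand-in for DocItem, the `ProvenanceTrack` fields are assumed, and `to_dict` only imitates the "omit empty arrays" behavior; none of this is the actual Docling implementation.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ProvenanceTrack:  # assumed shape: a start/end time pair in seconds
    start_time: float
    end_time: float


@dataclass
class SketchDocItem:  # hypothetical stand-in for DocItem
    text: str
    prov: List[Dict[str, Any]] = field(default_factory=list)     # legacy, layout
    source: List[ProvenanceTrack] = field(default_factory=list)  # new, non-layout
    meta: Dict[str, Any] = field(default_factory=dict)           # cue annotations

    def to_dict(self) -> Dict[str, Any]:
        """Serialize to a plain dict, leaving out empty lists."""
        return {k: v for k, v in asdict(self).items() if v != []}


# One WebVTT cue: timing goes into source, cue-level annotations into meta.
cue = SketchDocItem(
    text="Hello, world.",
    source=[ProvenanceTrack(start_time=1.5, end_time=4.0)],
    meta={"speaker": "Alice"},
)
serialized_cue = cue.to_dict()

# A layout item keeps using prov; its empty source list is not serialized.
paragraph = SketchDocItem(text="Intro", prov=[{"page_no": 1}])
serialized_paragraph = paragraph.to_dict()
```

Because empty lists are dropped, a document produced from a PDF serializes exactly as before (no source key appears), and a subtitle document carries no empty prov arrays.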
Backward‑compatibility strategy
When this feature is released: both prov and source co-exist. New code should use source for non-layout formats. Legacy code continues to work unchanged.
Eventually, with a major release (no plan at the moment): prov content is moved to source, ProvenanceType is extended with ProvenanceItem, and prov is marked deprecated.
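Since no migration is planned yet, the following is only a rough sketch of what moving prov content into source could look like on serialized data. The `migrate_item` function, the `kind` tag, and its `"layout"` value are assumptions for illustration, not part of the proposal.

```python
from typing import Any, Dict


def migrate_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a serialized item with prov entries moved into source.

    The 'kind' tag and its 'layout' value are assumptions for illustration.
    """
    migrated = {k: v for k, v in item.items() if k not in ("prov", "source")}
    moved = [{**p, "kind": "layout"} for p in item.get("prov", [])]
    source = moved + list(item.get("source", []))
    if source:  # keep omitting empty arrays
        migrated["source"] = source
    return migrated


legacy = {"text": "Title", "prov": [{"page_no": 1, "charspan": [0, 5]}]}
migrated = migrate_item(legacy)
```

The input dict is left untouched, so a reader could apply such a step on load while still accepting documents written before the change.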