Commit 33089cc

committed
Update docs for media
1 parent a3d2b4b commit 33089cc

3 files changed: +192 −0 lines changed

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
# Processing audio and video

Docling's ASR (Automatic Speech Recognition) pipeline lets you convert audio and video files into a structured [`DoclingDocument`](../concepts/docling_document.md) — the same intermediate representation used for PDFs, DOCX files, and everything else. From there you can export to Markdown, JSON, HTML, or DocTags, and plug the result directly into RAG pipelines, summarizers, or search indexes.

Under the hood, Docling uses [Whisper Turbo](https://github.com/openai/whisper) for transcription. On Apple Silicon it automatically selects `mlx-whisper` for optimized local inference; on all other hardware it falls back to native Whisper. You don't configure this — it just picks the right backend.
## Supported formats

| Type  | Formats                       |
|-------|-------------------------------|
| Audio | WAV, MP3, M4A, AAC, OGG, FLAC |
| Video | MP4, AVI, MOV                 |
For video files, Docling extracts the audio track automatically before transcription. You don't need to run FFmpeg manually.

!!! note "ffmpeg required"

    Some audio formats (M4A, AAC, OGG, FLAC) and all video formats require `ffmpeg` to be installed and available on your `PATH`. Install it with your system package manager — e.g. `brew install ffmpeg` on macOS or `apt-get install ffmpeg` on Debian-based Linux.
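If you want to fail fast before handing a file to the converter, a quick preflight check for `ffmpeg` takes a few lines of standard-library Python (the helper name and error message here are illustrative, not part of Docling's API):

```python
import shutil


def require_tool(name: str = "ffmpeg") -> str:
    """Return the tool's path, or raise a clear error if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"{name} not found on PATH - install it, e.g. "
            f"'brew install {name}' (macOS) or 'apt-get install {name}' (Debian/Ubuntu)"
        )
    return path
```

Call `require_tool()` once at startup so users see an actionable message instead of a cryptic decoding failure mid-conversion.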
## Installation

The ASR pipeline is an optional extra. Install it alongside the base package:

```bash
pip install "docling[asr]"
```

Or with `uv`:

```bash
uv add "docling[asr]"
```
## Basic usage

```python
from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert(Path("recording.mp3"))
doc = result.document

# Export to Markdown
print(doc.export_to_markdown())
```
The same code works for video — pass an `.mp4`, `.mov`, or `.avi` path and Docling handles the rest.

### Exporting to different formats

`result.document` is a `DoclingDocument`. You can export it to any supported format:

```python
doc.export_to_markdown()  # Markdown
doc.export_to_dict()      # JSON-serializable dict
doc.export_to_html()      # HTML
doc.export_to_doctags()   # DocTags
```

See [Serialization](../concepts/serialization.md) for more on export options.
## Understanding the output

The ASR pipeline produces **paragraph-level** Markdown with timestamps per segment:

```
[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde

[time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
```

This structured output is immediately suitable as input to a vector embedding model, a summarizer, or any other downstream stage.
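If a downstream stage needs the segments back as data rather than text, the `[time: start-end]` prefix is straightforward to parse. A minimal sketch (the function name is ours, not part of Docling's API):

```python
import re

SEGMENT_RE = re.compile(r"\[time:\s*([\d.]+)-([\d.]+)\]\s*(.*)")


def parse_segments(markdown: str) -> list[tuple[float, float, str]]:
    """Extract (start_seconds, end_seconds, text) from timestamped Markdown."""
    segments = []
    for line in markdown.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match:
            segments.append((float(match[1]), float(match[2]), match[3]))
    return segments


sample = "[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde"
print(parse_segments(sample))
# → [(0.0, 4.0, 'Shakespeare on Scenery by Oscar Wilde')]
```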
## A practical use case: searchable meeting archives

A common problem in engineering teams: every all-hands, customer call, and design review gets recorded. The recordings accumulate on Google Drive or S3. Nobody watches them. Nobody can search them. Institutional knowledge is locked inside audio files.

Docling solves the ingestion step. Pair it with a vector store and you have a queryable knowledge base over your entire audio archive.

### Standalone transcription script

For a full working example, see the [example-docling-media](https://github.com/TejasQ/example-docling-media) repository, which processes a directory of audio/video files and writes each transcript to a Markdown file.

The core of that project is ~30 lines:
```python
from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline


def main():
    audio_path = Path("videoplayback.mp3")

    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    result = converter.convert(audio_path)
    md = result.document.export_to_markdown()
    Path("transcript.md").write_text(md)
    print(md)


if __name__ == "__main__":
    main()
```
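Extending the script to a whole directory is mostly path plumbing. The file-selection step might look like this (a sketch; the suffix list comes from the formats table above, and the helper name is ours):

```python
from pathlib import Path

MEDIA_SUFFIXES = {".wav", ".mp3", ".m4a", ".aac", ".ogg", ".flac", ".mp4", ".avi", ".mov"}


def find_media_files(root: Path) -> list[Path]:
    """Recursively collect audio/video files, sorted for a stable processing order."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_SUFFIXES
    )
```

Loop over the result and reuse a single `DocumentConverter` across files, so the model is only loaded once.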
### Building a RAG pipeline with LangChain

Docling integrates with LangChain via `DoclingLoader`, which wraps `DocumentConverter` and handles chunking automatically. To build a retrieval pipeline over your audio archive:

```python
from langchain_community.vectorstores import FAISS
from langchain_docling import DoclingLoader
from langchain_openai import OpenAIEmbeddings

# Load and chunk all audio files in a directory
loader = DoclingLoader("recordings/")
docs = loader.load()

# Embed and index
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Query in natural language
results = retriever.invoke("What did we decide about the auth service in Q3?")
```

See the [LangChain integration guide](../integrations/langchain.md) for more details on `DoclingLoader` options.
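If you'd rather index transcripts without pulling in LangChain, a naive chunker over the exported lines is only a few lines of Python. This is a sketch using a character budget as a rough proxy for tokens, not Docling's own chunker — see [Chunking](../concepts/chunking.md) for the real thing:

```python
def chunk_lines(lines: list[str], max_chars: int = 800) -> list[str]:
    """Greedily pack transcript lines into chunks of at most max_chars each."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in lines:
        # Flush the current chunk when adding this line would exceed the budget
        if current and size + len(line) > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk keeps its `[time: ...]` prefixes, so retrieved passages can be traced back to a position in the recording.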
## Customizing the ASR model

`asr_model_specs.WHISPER_TURBO` is the default and recommended starting point — it balances speed and accuracy for most use cases. To use a different model size, pass an alternative spec from `docling.datamodel.asr_model_specs`:

```python
from docling.datamodel import asr_model_specs

pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE_V3
```

Available specs depend on your installed version. Check `dir(asr_model_specs)` for the full list.
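Filtering the module's names is a quick way to see what your version ships. The snippet below uses a stand-in namespace so it runs anywhere; with Docling installed, substitute `from docling.datamodel import asr_model_specs` (spec names other than `WHISPER_TURBO` and `WHISPER_LARGE_V3` are placeholders here):

```python
from types import SimpleNamespace

# Stand-in for docling.datamodel.asr_model_specs; the attribute values don't
# matter here, only the WHISPER_* naming convention.
asr_model_specs = SimpleNamespace(
    WHISPER_TINY="tiny", WHISPER_TURBO="turbo", WHISPER_LARGE_V3="large-v3"
)

available = sorted(n for n in dir(asr_model_specs) if n.startswith("WHISPER"))
print(available)
# → ['WHISPER_LARGE_V3', 'WHISPER_TINY', 'WHISPER_TURBO']
```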
## Limitations

| Limitation                    | Workaround |
|-------------------------------|------------|
| No SRT/WebVTT subtitle output | Use the `openai-whisper` CLI: `whisper audio.mp3 --output_format srt` |
| No speaker diarization        | Use [`pyannote-audio`](https://github.com/pyannote/pyannote-audio) as a pre- or post-processing step |
| No word-level timestamps      | Not available in current export formats |

For knowledge-retrieval use cases (RAG, search, summarization), paragraph-level Markdown is usually all you need. The limitations above matter primarily for subtitle generation workflows.
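If you do need subtitles, you can bridge the gap yourself: parse the `[time: start-end]` lines back into `(start, end, text)` segments and render them as SRT. The timestamp formatter is the only fiddly part — SRT wants `HH:MM:SS,mmm` with a comma before the milliseconds. A sketch (these helpers are ours, not Docling's):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 5.28 -> '00:00:05,280'."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as the body of an .srt file."""
    blocks = [
        f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        for i, (start, end, text) in enumerate(segments, start=1)
    ]
    return "\n".join(blocks)


print(srt_timestamp(5.28))  # → 00:00:05,280
```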
## See also

- [Minimal ASR pipeline example](../examples/minimal_asr_pipeline.py)
- [Supported formats](supported_formats.md)
- [Chunking](../concepts/chunking.md)
- [LangChain integration](../integrations/langchain.md)
- [LlamaIndex integration](../integrations/llamaindex.md)
- [Full code repo example](https://github.com/TejasQ/example-docling-media)

docs/usage/supported_formats.md

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ Below you can find a listing of all supported input and output formats.
 | HTML, XHTML | |
 | CSV | |
 | PNG, JPEG, TIFF, BMP, WEBP | Image formats |
+| WAV, MP3, M4A, AAC, OGG, FLAC | Audio formats (requires `asr` extra — see [Processing audio and video](processing_audio_media.md)) |
+| MP4, AVI, MOV | Video formats — audio track is extracted and transcribed (requires `asr` extra and `ffmpeg`) |
 | WebVTT | Web Video Text Tracks format for displaying timed text |

 Schema-specific support:

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@ nav:
   - Usage:
     - Advanced options: usage/advanced_options.md
     - Supported formats: usage/supported_formats.md
+    - Audio & video: usage/processing_audio_media.md
     - Enrichment features: usage/enrichments.md
     - Vision models: usage/vision_models.md
     - Model catalog: usage/model_catalog.md
