# Processing audio and video

Docling's ASR (Automatic Speech Recognition) pipeline lets you convert audio and video files into a structured [`DoclingDocument`](../concepts/docling_document.md) — the same intermediate representation used for PDFs, DOCX files, and everything else. From there you can export to Markdown, JSON, HTML, or DocTags, and plug the result directly into RAG pipelines, summarizers, or search indexes.

Under the hood, Docling uses [Whisper Turbo](https://github.com/openai/whisper) for transcription. On Apple Silicon it automatically selects `mlx-whisper` for optimized local inference; on all other hardware it falls back to native Whisper. You don't configure this — it just picks the right backend.

## Supported formats

| Type | Formats |
|------|---------|
| Audio | WAV, MP3, M4A, AAC, OGG, FLAC |
| Video | MP4, AVI, MOV |

For video files, Docling extracts the audio track automatically before transcription. You don't need to run FFmpeg manually.

!!! note "ffmpeg required"
    Some audio formats (M4A, AAC, OGG, FLAC) and all video formats require `ffmpeg` to be installed and available on your `PATH`. Install it with your system package manager — e.g. `brew install ffmpeg` on macOS or `apt-get install ffmpeg` on Debian-based Linux.
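If you want to fail fast before conversion, you can check for `ffmpeg` from Python. This is a small sketch of ours, not part of Docling's API:

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found; install it before converting video or M4A/AAC/OGG/FLAC audio.")
```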
| 18 | + |
## Installation

The ASR pipeline is an optional extra. Install it alongside the base package:

```bash
pip install "docling[asr]"
```

Or with `uv`:

```bash
uv add "docling[asr]"
```
| 32 | + |
## Basic usage

```python
from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert(Path("recording.mp3"))
doc = result.document

# Export to Markdown
print(doc.export_to_markdown())
```

The same code works for video — pass an `.mp4`, `.mov`, or `.avi` path and Docling handles the rest.
| 64 | + |
### Exporting to different formats

`result.document` is a `DoclingDocument`. You can export it to any supported format:

```python
doc.export_to_markdown()  # Markdown
doc.export_to_dict()      # JSON-serializable dict
doc.export_to_html()      # HTML
doc.export_to_doctags()   # DocTags
```

See [Serialization](../concepts/serialization.md) for more on export options.
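The dict export is plain JSON-serializable data, so persisting it is one `json.dumps` away. A minimal helper — the function name is ours, not Docling's:

```python
import json
from pathlib import Path


def save_doc_json(data: dict, path: str) -> None:
    """Write a JSON-serializable dict (e.g. from doc.export_to_dict()) to disk."""
    Path(path).write_text(
        json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Call it as `save_doc_json(doc.export_to_dict(), "transcript.json")` after a conversion.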
| 77 | + |
## Understanding the output

The ASR pipeline produces **paragraph-level** Markdown with timestamps per segment:

```
[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde

[time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
```

This structured output is immediately suitable as input to a vector embedding model, a summarizer, or any other downstream stage.
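Because the timestamp prefix is regular, downstream code can split the transcript back into (start, end, text) segments. A minimal parser for the format shown above — the regex encodes our reading of that format, not a Docling API:

```python
import re

# Matches lines like "[time: 5.28-9.96] This is a LibriVox recording."
SEGMENT_RE = re.compile(r"^\[time:\s*([\d.]+)-([\d.]+)\]\s*(.+)$")


def parse_segments(markdown: str) -> list[dict]:
    """Split ASR Markdown into {'start', 'end', 'text'} segment dicts."""
    segments = []
    for line in markdown.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match:
            segments.append({
                "start": float(match.group(1)),
                "end": float(match.group(2)),
                "text": match.group(3),
            })
    return segments
```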
| 89 | + |
| 90 | +## A practical use case: searchable meeting archives |
| 91 | + |
| 92 | +A common problem in engineering teams: every all-hands, customer call, and design review gets recorded. The recordings accumulate on Google Drive or S3. Nobody watches them. Nobody can search them. Institutional knowledge is locked inside audio files. |
| 93 | + |
| 94 | +Docling solves the ingestion step. Pair it with a vector store and you have a queryable knowledge base over your entire audio archive. |
| 95 | + |
| 96 | +### Standalone transcription script |
| 97 | + |
| 98 | +For a full working example, see the [example-docling-media](https://github.com/TejasQ/example-docling-media) repository, which processes a directory of audio/video files and writes each transcript to a Markdown file. |
| 99 | + |
| 100 | +The core of that project is ~30 lines: |
| 101 | + |
| 102 | +```python |
| 103 | +from pathlib import Path |
| 104 | + |
| 105 | +from docling.datamodel import asr_model_specs |
| 106 | +from docling.datamodel.base_models import InputFormat |
| 107 | +from docling.datamodel.pipeline_options import AsrPipelineOptions |
| 108 | +from docling.document_converter import AudioFormatOption, DocumentConverter |
| 109 | +from docling.pipeline.asr_pipeline import AsrPipeline |
| 110 | + |
| 111 | + |
| 112 | +def main(): |
| 113 | + audio_path = Path("videoplayback.mp3") |
| 114 | + |
| 115 | + pipeline_options = AsrPipelineOptions() |
| 116 | + pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO |
| 117 | + |
| 118 | + converter = DocumentConverter( |
| 119 | + format_options={ |
| 120 | + InputFormat.AUDIO: AudioFormatOption( |
| 121 | + pipeline_cls=AsrPipeline, |
| 122 | + pipeline_options=pipeline_options, |
| 123 | + ) |
| 124 | + } |
| 125 | + ) |
| 126 | + |
| 127 | + result = converter.convert(audio_path) |
| 128 | + md = result.document.export_to_markdown() |
| 129 | + Path("transcript.md").write_text(md) |
| 130 | + print(md) |
| 131 | + |
| 132 | + |
| 133 | +if __name__ == "__main__": |
| 134 | + main() |
| 135 | +``` |
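The repository linked above extends this single-file script to a whole directory. The discovery step is plain `pathlib`; the extension set below mirrors the supported-formats table (a sketch under that assumption, not the repo's exact code):

```python
from pathlib import Path

# Extensions from the supported-formats table above.
MEDIA_EXTS = {".wav", ".mp3", ".m4a", ".aac", ".ogg", ".flac", ".mp4", ".avi", ".mov"}


def find_media_files(root: str) -> list[Path]:
    """Recursively collect audio/video files under root, sorted for stable order."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in MEDIA_EXTS)
```

Each returned path can then be fed to `converter.convert(...)` in a loop, writing one Markdown transcript per input file.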
| 136 | + |
### Building a RAG pipeline with LangChain

Docling integrates with LangChain via `DoclingLoader`, which wraps `DocumentConverter` and handles chunking automatically. To build a retrieval pipeline over your audio archive:

```python
from pathlib import Path

from langchain_community.vectorstores import FAISS
from langchain_docling import DoclingLoader
from langchain_openai import OpenAIEmbeddings

# DoclingLoader takes a file path or a list of paths, so glob the
# directory explicitly. To route audio through the ASR pipeline, you can
# also pass an ASR-configured DocumentConverter via the loader's
# `converter` argument.
paths = [str(p) for p in Path("recordings/").glob("*.mp3")]
loader = DoclingLoader(file_path=paths)
docs = loader.load()

# Embed and index
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Query in natural language
results = retriever.invoke("What did we decide about the auth service in Q3?")
```

See the [LangChain integration guide](../integrations/langchain.md) for more details on `DoclingLoader` options.
| 159 | + |
## Customizing the ASR model

`asr_model_specs.WHISPER_TURBO` is the default and recommended starting point — it balances speed and accuracy for most use cases. To use a different model size, pass an alternative spec from `docling.datamodel.asr_model_specs`:

```python
from docling.datamodel import asr_model_specs

pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE_V3
```

Available specs depend on your installed version. Check `dir(asr_model_specs)` for the full list.
| 171 | + |
## Limitations

| Limitation | Workaround |
|-----------|------------|
| No SRT/WebVTT subtitle output | Use the `openai-whisper` CLI: `whisper audio.mp3 --output_format srt` |
| No speaker diarization | Use [`pyannote-audio`](https://github.com/pyannote/pyannote-audio) as a pre- or post-processing step |
| No word-level timestamps | Not available in current export formats |

For knowledge-retrieval use cases (RAG, search, summarization), paragraph-level Markdown is usually all you need. The limitations above matter primarily for subtitle generation workflows.
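That said, if you already have a Docling transcript and don't want to re-run Whisper just for subtitles, the timestamped lines can be reshaped into SRT by hand. A rough sketch, assuming segment dicts derived from the `[time: start-end]` format shown earlier (the helper names are ours):

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render {'start', 'end', 'text'} segment dicts as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = to_srt_timestamp(seg["start"])
        end = to_srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text']}\n")
    return "\n".join(blocks)
```

Note that these are still paragraph-level cues, so for broadcast-quality subtitles the Whisper CLI workaround above remains the better tool.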
| 181 | + |
## See also

- [Minimal ASR pipeline example](../examples/minimal_asr_pipeline.py)
- [Supported formats](supported_formats.md)
- [Chunking](../concepts/chunking.md)
- [LangChain integration](../integrations/langchain.md)
- [LlamaIndex integration](../integrations/llamaindex.md)
- [Full code repo example](https://github.com/TejasQ/example-docling-media)