Stable Timestamps — End-to-End Journey & Final Conclusion
Date: 2026-03-06
Branch: feature/stable-timestamps-clean (PR against ggml-org/whisper.cpp master)
The Problem
whisper.cpp produces word-level token timestamps, but they are often inaccurate — tokens land
inside silence gaps rather than tightly around the spoken word. This causes subtitles to appear
or disappear at the wrong time.
Metric used throughout: pct_words_overlap — percentage of decoded word tokens whose
timestamp range overlaps a silence region in the audio. Lower is better. 0% = perfect (every
word lands on actual speech). Measured on a 5-minute synthetic audio file (synth_5min.wav)
with a known ground-truth silence timeline.
Reference benchmark: stable-ts (Python library wrapping faster-whisper) achieves 5.7%
word overlap on the same file. Goal: match or beat it inside whisper.cpp.
Baseline whisper.cpp (no changes): 41.1% word overlap, 161 segments vs 46 expected.
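For concreteness, the overlap metric boils down to interval arithmetic. A minimal sketch (hypothetical helper names, not the actual benchmark harness), with all times in centiseconds:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Interval { int64_t t0, t1; }; // centiseconds

// True if the word's timestamp range [t0, t1) overlaps any silence interval.
static bool overlaps_silence(const Interval & w, const std::vector<Interval> & silence) {
    for (const auto & s : silence) {
        if (w.t0 < s.t1 && s.t0 < w.t1) return true;
    }
    return false;
}

// pct_words_overlap: percentage of word intervals that touch silence. Lower is better.
static double pct_words_overlap(const std::vector<Interval> & words,
                                const std::vector<Interval> & silence) {
    if (words.empty()) return 0.0;
    int n = 0;
    for (const auto & w : words) {
        if (overlaps_silence(w, silence)) n++;
    }
    return 100.0 * n / words.size();
}
```

With a known ground-truth silence timeline this is all the harness needs per decode variant.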
Iteration History
v1 — Post-hoc snapping only
Snap token boundaries to nearest silence edge after decode. Wrong snapping algorithm. Result:
worse than baseline in some cases. Abandoned.
v2 — Constrained decoding + snapping
Added --stable-timestamps flag. During decode, suppress tokens that would land in silence
(constrained beam search). Then snap remaining token boundaries.
Result: 22.6% word overlap, 45 segments. Much better but not close to stable-ts 5.7%.
Root cause: the decoder still processes full 30-second windows, silence included. Hallucinated
words fill the silence gaps and cannot all be snapped away.
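To make the v2 idea concrete, here is a rough sketch of what suppressing silence timestamp tokens during decoding could look like (hypothetical; not the PR's actual implementation). Whisper's timestamp tokens advance in 20 ms steps, so timestamp token i corresponds to i * 2 centiseconds within the window:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct Interval { int64_t t0, t1; }; // centiseconds, relative to the window

// Hypothetical v2-style constraint: before sampling, mask the logit of every
// timestamp token whose time falls inside a known silence interval, so the
// beam can never place a boundary there.
static void suppress_silence_timestamps(std::vector<float> & ts_logits,
                                        const std::vector<Interval> & silence) {
    for (size_t i = 0; i < ts_logits.size(); i++) {
        const int64_t t = (int64_t) i * 2; // 20 ms per timestamp token
        for (const auto & s : silence) {
            if (t >= s.t0 && t < s.t1) {
                ts_logits[i] = -INFINITY; // token can never be sampled
                break;
            }
        }
    }
}
```

The limitation the root cause describes survives this masking: text tokens themselves are not constrained, so hallucinated words can still be emitted between allowed timestamps.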
v3 — VAD concatenation + timestamp remapping
Added --vad integration. VAD strips silence, concatenates speech, decoder sees one stream.
Built a vad_mapping_table (processed_time → original_time) to remap timestamps back.
Result: 10.3% word overlap, 45 segments. Better, but segments still spanned original silence
gaps because the decoder saw a continuous stream and produced cross-boundary segments.
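The v3 remapping can be illustrated with a small sketch (the names and shape here are assumptions; the PR's actual vad_mapping_table may differ). Each speech segment contributes one entry mapping its position in the concatenated stream back to its original start time:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical shape of a processed_time -> original_time mapping table.
// Times in centiseconds.
struct VadMapEntry { int64_t proc_t0, orig_t0, len; };

// speech: (start, end) pairs of the detected speech segments in original time.
static std::vector<VadMapEntry> build_mapping(const std::vector<std::pair<int64_t, int64_t>> & speech) {
    std::vector<VadMapEntry> map;
    int64_t proc = 0;
    for (const auto & s : speech) {
        const int64_t len = s.second - s.first;
        map.push_back({proc, s.first, len});
        proc += len; // silence is stripped, so processed time only advances by speech
    }
    return map;
}

// Map a timestamp in the concatenated (processed) stream back to original time.
static int64_t proc_to_orig(int64_t t, const std::vector<VadMapEntry> & map) {
    for (const auto & e : map) {
        if (t < e.proc_t0 + e.len) {
            return e.orig_t0 + (t - e.proc_t0);
        }
    }
    return map.empty() ? t : map.back().orig_t0 + map.back().len; // clamp past the end
}
```

The failure mode follows directly from this shape: a decoded segment whose tokens straddle two map entries corresponds to a span that crosses an original silence gap.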
v4 — Per-segment VAD decode (current PR)
Key insight: instead of concatenating all speech and decoding once, decode each VAD segment
independently and add a fixed offset to timestamps. This is exactly how stable-ts/faster-whisper
works internally.
Changes: removed the concatenation + mapping table infrastructure (~200 lines). Added
whisper_full_vad_segments() — loops over VAD segments, calls whisper_full_with_state() per
segment, shifts all token timestamps by offset_cs = seg_start_centiseconds.
Result: 0.89% word overlap, 46 segments. Beats stable-ts (5.7%).
The Core Question for the PR
whisper.cpp already exposes public APIs for VAD and per-segment decode. Could users achieve the
same result without any internal changes? If yes, the PR's value is "convenience", not
"new capability."
Test
Built user_vad_decode.cpp on upstream/master (no PR changes). Uses only public APIs:

```cpp
// 1. Run VAD to get speech segments (all defaults, no tuning)
auto vad_params = whisper_vad_default_params();
auto vad_segs   = whisper_vad_segments_from_samples(vctx, vad_params, pcm.data(), n_samples);
int  n_segs     = whisper_vad_segments_n_segments(vad_segs);

// 2. Decode each segment independently
for (int i = 0; i < n_segs; i++) {
    float t0 = whisper_vad_segments_get_segment_t0(vad_segs, i); // centiseconds
    float t1 = whisper_vad_segments_get_segment_t1(vad_segs, i);
    int start = (int)(t0 / 100.0f * WHISPER_SAMPLE_RATE);
    int end   = (int)(t1 / 100.0f * WHISPER_SAMPLE_RATE);

    whisper_full_with_state(ctx, state, params, pcm.data() + start, end - start);

    // 3. Read timestamps and shift by segment start
    int64_t offset_ms = (int64_t)(t0 / 100.0f * 1000.0f);
    for (int s = 0; s < whisper_full_n_segments_from_state(state); s++) {
        int64_t seg_t0 = whisper_full_get_segment_t0_from_state(state, s) * 10 + offset_ms;
        int64_t seg_t1 = whisper_full_get_segment_t1_from_state(state, s) * 10 + offset_ms;
        // collect tokens similarly...
    }
}
```

That's the entire pattern. ~50 lines of application code, no whisper.cpp changes.
Performance: The Chunking Problem
Problem: whisper's encoder always processes a fixed 30-second mel window regardless of input
length. A 2-second VAD segment still costs one full encoder pass. With 46 segments on a 5-minute
file, that's 46 encoder runs vs ~10 for full-audio decode — roughly 4-5x slower.
Solution: greedy bin-packing of VAD segments into ~25s chunks.
Group adjacent VAD segments until the next one would push the chunk past ~25 seconds. Decode
each chunk as a single whisper_full_with_state() call. Each sub-segment within the chunk still
gets its own offset (chunk_start + position_within_chunk). Encoder runs drop from 46 → ~10,
matching full-audio speed, while still avoiding long silence-only windows that cause
hallucinations.
```cpp
// Pack VAD segments into ~25 s chunks
std::vector<std::vector<int>> chunks;
std::vector<int> cur;
float cur_dur = 0;
for (int i = 0; i < n_segs; i++) {
    float dur = (t1[i] - t0[i]) / 100.0f; // seconds
    if (!cur.empty() && cur_dur + dur > 25.0f) {
        chunks.push_back(cur); cur.clear(); cur_dur = 0;
    }
    cur.push_back(i); cur_dur += dur;
}
if (!cur.empty()) chunks.push_back(cur);

// For each chunk: slice PCM from chunk_t0 to chunk_t1, decode once,
// offset each result segment by chunk_t0.
```

This is the same strategy faster-whisper uses internally ("chunking").
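The slicing and offsetting described in the comment above reduces to two unit conversions, sketched here as hypothetical helpers (whisper segment timestamps are in centiseconds, hence the * 10 to reach milliseconds; WHISPER_SAMPLE_RATE is 16000):

```cpp
#include <cassert>
#include <cstdint>

#define WHISPER_SAMPLE_RATE 16000 // defined in whisper.h

// Convert a chunk boundary in centiseconds (as returned by the VAD getters)
// into a PCM sample index for slicing.
static int cs_to_sample(float t_cs) {
    return (int)(t_cs / 100.0f * WHISPER_SAMPLE_RATE);
}

// Shift a decoded timestamp (centiseconds, relative to the chunk start) back
// to absolute milliseconds in the original audio.
static int64_t chunk_local_cs_to_abs_ms(int64_t t_cs, float chunk_t0_cs) {
    return t_cs * 10 + (int64_t)(chunk_t0_cs * 10.0f);
}
```

Getting these conversions right per sub-segment is exactly the non-obvious part the next paragraph refers to.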
This is still pure user-space — no whisper.cpp internals needed. However, it adds another
~30 lines of non-obvious logic (bin-packing, PCM slicing per chunk, per-sub-segment offsets).
This reinforces the PR's value: --vad handles all of this automatically with correct behavior.
Final Results
| Metric | baseline | v2 | v3 | v4 (PR) | user_vad (public API) | stable-ts |
|---|---|---|---|---|---|---|
| n_segments | 161 | 45 | 45 | 46 | 46 | 46 |
| n_words_overlap_any | 745 | 144 | 52 | 5 | 6 | 22 |
| pct_words_overlap % | 41.1 | 22.6 | 10.3 | 0.89 | 1.27 | 5.7 |
| pass_segments_threshold | False | False | False | True | True | False |
Conclusion
The capability is user-replicable with ~50 lines of application code.
Per-segment VAD decoding via public APIs achieves 1.27% word overlap — well below stable-ts
(5.7%) and essentially equivalent to the PR's v4 (0.89%).
The 0.38 percentage point gap between user_vad and v4 comes from the snapping step:
whisper_stable_snap_segments() nudges token boundaries away from silence edges after decode.
This is only available via the PR's new whisper-stable module.
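As a rough illustration of the snapping idea (hypothetical sketch; the PR's whisper_stable_snap_segments() may differ in detail): a token boundary that lands inside a silence interval is moved to the nearer edge of that interval.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Interval { int64_t t0, t1; }; // centiseconds

// Hypothetical silence-edge snapping: if a boundary falls strictly inside a
// silence interval, move it to whichever edge of that interval is closer.
// Boundaries already on speech (or exactly on an edge) are left alone.
static int64_t snap_boundary(int64_t t, const std::vector<Interval> & silence) {
    for (const auto & s : silence) {
        if (t > s.t0 && t < s.t1) {
            return (t - s.t0 <= s.t1 - t) ? s.t0 : s.t1;
        }
    }
    return t;
}
```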
What the PR uniquely adds:
- `--stable-timestamps --vad` CLI flags — zero user code required
- `whisper_stable_snap_segments()` — silence-edge boundary snapping (closes 1.27% → 0.89%)
What end-users can do today without the PR (simple solution):
- Call `whisper_vad_segments_from_samples()` to get speech segments
- Loop: call `whisper_full_with_state()` per segment with `params.vad = false`
- Add `offset_ms = segment_start_seconds * 1000` to every token timestamp
- Result beats stable-ts with no internal whisper.cpp changes needed
PR justification: Convenience + snapping. The loop pattern is user-space. The snap module is
the differentiated value-add and closes the remaining quality gap to best-in-class (0.89%).