Stable timestamps #13

@thewh1teagle

Stable Timestamps — End-to-End Journey & Final Conclusion

Date: 2026-03-06
Branch: feature/stable-timestamps-clean (PR against ggml-org/whisper.cpp master)


The Problem

whisper.cpp produces word-level token timestamps, but they are often inaccurate — tokens land
inside silence gaps rather than tightly around the spoken word. This causes subtitles to appear
or disappear at the wrong time.

Metric used throughout: pct_words_overlap — percentage of decoded word tokens whose
timestamp range overlaps a silence region in the audio. Lower is better. 0% = perfect (every
word lands on actual speech). Measured on a 5-minute synthetic audio file (synth_5min.wav)
with a known ground-truth silence timeline.
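The metric can be sketched as a simple interval-intersection count. This is an illustrative reimplementation, not the benchmark script's actual code; the function name and interval representation are assumptions:

```cpp
#include <utility>
#include <vector>

// pct_words_overlap sketch: percentage of decoded words whose [t0, t1)
// range intersects any ground-truth silence region. Times in seconds.
static double pct_words_overlap(
        const std::vector<std::pair<double, double>> & words,
        const std::vector<std::pair<double, double>> & silences) {
    int overlapping = 0;
    for (const auto & w : words) {
        for (const auto & s : silences) {
            // two half-open ranges intersect iff each starts before the other ends
            if (w.first < s.second && s.first < w.second) {
                overlapping++;
                break;
            }
        }
    }
    return words.empty() ? 0.0 : 100.0 * overlapping / (double) words.size();
}
```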

Reference benchmark: stable-ts (Python library wrapping faster-whisper) achieves 5.7%
word overlap on the same file. Goal: match or beat it inside whisper.cpp.

Baseline whisper.cpp (no changes): 41.1% word overlap, 161 segments vs 46 expected.


Iteration History

v1 — Post-hoc snapping only

Snap token boundaries to the nearest silence edge after decode. The snapping algorithm was
wrong; results were sometimes worse than baseline. Abandoned.

v2 — Constrained decoding + snapping

Added --stable-timestamps flag. During decode, suppress tokens that would land in silence
(constrained beam search). Then snap remaining token boundaries.
Result: 22.6% word overlap, 45 segments. Much better but not close to stable-ts 5.7%.
Root cause: the decoder still processes full 30-second windows including silence. Hallucinated
words fill the silence gaps, and not all of them can be snapped away.

v3 — VAD with timestamp remapping

Added --vad integration. VAD strips silence, concatenates speech, decoder sees one stream.
Built a vad_mapping_table (processed_time → original_time) to remap timestamps back.
Result: 10.3% word overlap, 45 segments. Better, but segments still spanned original silence
gaps because the decoder saw a continuous stream and produced cross-boundary segments.
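The v3 remapping idea can be sketched as a sorted lookup table: each entry pairs where a stretch of speech starts in the concatenated stream with where it sat in the original audio. The struct and names below are illustrative assumptions, not the removed vad_mapping_table code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One entry per concatenated speech stretch. Times in centiseconds.
struct vad_map_entry { int64_t processed_t0; int64_t orig_t0; };

// Map a timestamp from the concatenated (processed) stream back to the
// original audio. table must be sorted by processed_t0.
static int64_t remap_to_original(int64_t t_processed,
        const std::vector<vad_map_entry> & table) {
    // find the last entry whose processed_t0 <= t_processed
    auto it = std::upper_bound(table.begin(), table.end(), t_processed,
        [](int64_t t, const vad_map_entry & e) { return t < e.processed_t0; });
    if (it == table.begin()) return t_processed; // before any mapped speech
    --it;
    return it->orig_t0 + (t_processed - it->processed_t0);
}
```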

v4 — Per-segment VAD decode (current PR)

Key insight: instead of concatenating all speech and decoding once, decode each VAD segment
independently and add a fixed offset to timestamps. This is exactly how stable-ts/faster-whisper
works internally.

Changes: removed the concatenation + mapping table infrastructure (~200 lines). Added
whisper_full_vad_segments() — loops over VAD segments, calls whisper_full_with_state() per
segment, shifts all token timestamps by offset_cs = seg_start_centiseconds.

Result: 0.89% word overlap, 46 segments. Beats stable-ts (5.7%).


The Core Question for the PR

whisper.cpp already exposes public APIs for VAD and per-segment decode. Could users achieve the
same result without any internal changes? If yes, the PR's value is "convenience", not
"new capability."

Test

Built user_vad_decode.cpp on upstream/master (no PR changes). Uses only public APIs:

// 1. Run VAD to get speech segments (all defaults, no tuning)
auto vad_params = whisper_vad_default_params();
auto vad_segs = whisper_vad_segments_from_samples(vctx, vad_params, pcm.data(), n_samples);
int n_segs = whisper_vad_segments_n_segments(vad_segs);

// 2. Decode each segment independently
for (int i = 0; i < n_segs; i++) {
    float t0 = whisper_vad_segments_get_segment_t0(vad_segs, i); // centiseconds
    float t1 = whisper_vad_segments_get_segment_t1(vad_segs, i);

    int start = (int)(t0 / 100.0f * WHISPER_SAMPLE_RATE);
    int end   = (int)(t1 / 100.0f * WHISPER_SAMPLE_RATE);

    whisper_full_with_state(ctx, state, params, pcm.data() + start, end - start);

    // 3. Read timestamps and shift by segment start
    int64_t offset_ms = (int64_t)(t0 / 100.0f * 1000.0f);
    for (int s = 0; s < whisper_full_n_segments_from_state(state); s++) {
        int64_t seg_t0 = whisper_full_get_segment_t0_from_state(state, s) * 10 + offset_ms;
        int64_t seg_t1 = whisper_full_get_segment_t1_from_state(state, s) * 10 + offset_ms;
        // collect tokens similarly...
    }
}

That's the entire pattern. ~50 lines of application code, no whisper.cpp changes.


Performance: The Chunking Problem

Problem: whisper's encoder always processes a fixed 30-second mel window regardless of input
length. A 2-second VAD segment still costs one full encoder pass. With 46 segments on a 5-minute
file, that's 46 encoder runs vs ~10 for full-audio decode — roughly 4-5x slower.

Solution: greedy bin-packing of VAD segments into ~25s chunks.

Group adjacent VAD segments until the next one would push the chunk past ~25 seconds. Decode
each chunk as a single whisper_full_with_state() call. Each sub-segment within the chunk still
gets its own offset (chunk_start + position_within_chunk). Encoder runs drop from 46 → ~10,
matching full-audio speed, while still avoiding long silence-only windows that cause
hallucinations.

// Pack VAD segments into ~25s chunks.
// t0[i] / t1[i]: start / end of VAD segment i, in centiseconds.
std::vector<std::vector<int>> chunks;
std::vector<int> cur;
float cur_dur = 0;
for (int i = 0; i < n_segs; i++) {
    float dur = (t1[i] - t0[i]) / 100.0f; // seconds
    if (!cur.empty() && cur_dur + dur > 25.0f) {
        chunks.push_back(cur); cur.clear(); cur_dur = 0;
    }
    cur.push_back(i); cur_dur += dur;
}
if (!cur.empty()) chunks.push_back(cur);

// For each chunk: slice PCM from chunk_t0 to chunk_t1, decode once,
// offset each result segment by chunk_t0.
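The slice/offset arithmetic elided above is straightforward. A sketch, continuing the packing loop (t0/t1 are the per-segment centisecond times as before; struct and function names are illustrative, and sample_rate would be WHISPER_SAMPLE_RATE, i.e. 16000):

```cpp
#include <cstdint>
#include <vector>

// PCM slice and timestamp offset for one packed chunk of VAD segments.
struct chunk_slice { int start_sample; int end_sample; int64_t offset_ms; };

static chunk_slice slice_for_chunk(const std::vector<int> & chunk,
        const std::vector<float> & t0, const std::vector<float> & t1,
        int sample_rate) {
    float c_t0 = t0[chunk.front()]; // chunk start, centiseconds
    float c_t1 = t1[chunk.back()];  // chunk end, centiseconds
    return {
        (int)(c_t0 / 100.0f * sample_rate),
        (int)(c_t1 / 100.0f * sample_rate),
        (int64_t)(c_t0 * 10.0f), // centiseconds -> milliseconds
    };
}
// decode: whisper_full_with_state(ctx, state, params,
//         pcm.data() + s.start_sample, s.end_sample - s.start_sample);
// then add s.offset_ms to every returned timestamp.
```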

This is the same strategy faster-whisper uses internally ("chunking").

This is still pure user-space — no whisper.cpp internals needed. However, it adds another
~30 lines of non-obvious logic (bin-packing, PCM slicing per chunk, per-sub-segment offsets).
This reinforces the PR's value: --vad handles all of this automatically with correct behavior.


Final Results

| Metric                  | baseline | v2    | v3    | v4 (PR) | user_vad (public API) | stable-ts |
|-------------------------|----------|-------|-------|---------|-----------------------|-----------|
| n_segments              | 161      | 45    | 45    | 46      | 46                    | 46        |
| n_words_overlap_any     | 745      | 144   | 52    | 5       | 6                     | 22        |
| pct_words_overlap %     | 41.1     | 22.6  | 10.3  | 0.89    | 1.27                  | 5.7       |
| pass_segments_threshold | False    | False | False | True    | True                  | False     |

Conclusion

The capability is user-replicable with ~50 lines of application code.

Per-segment VAD decoding via public APIs achieves 1.27% word overlap — well below stable-ts
(5.7%) and essentially equivalent to the PR's v4 (0.89%).

The 0.38 percentage point gap between user_vad and v4 comes from the snapping step:
whisper_stable_snap_segments() nudges token boundaries away from silence edges after decode.
This is only available via the PR's new whisper-stable module.
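For illustration only, the snapping idea can be sketched as: if a word boundary falls inside a silence region, nudge it to the nearest edge of that region. This is an assumption about the concept, not the implementation of whisper_stable_snap_segments():

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Snap a boundary timestamp (ms) out of silence to the nearest silence edge.
// silences_ms holds [start, end) silence regions in milliseconds.
static int64_t snap_boundary(int64_t t_ms,
        const std::vector<std::pair<int64_t, int64_t>> & silences_ms) {
    for (const auto & s : silences_ms) {
        if (t_ms > s.first && t_ms < s.second) {
            // move to whichever edge of the silence region is closer
            return (t_ms - s.first < s.second - t_ms) ? s.first : s.second;
        }
    }
    return t_ms; // already on speech, leave unchanged
}
```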

What the PR uniquely adds:

  1. --stable-timestamps --vad CLI flags — zero user code required
  2. whisper_stable_snap_segments() — silence-edge boundary snapping (closes 1.27% → 0.89%)

What end-users can do today without the PR (simple solution):

  • Call whisper_vad_segments_from_samples() to get speech segments
  • Loop: call whisper_full_with_state() per segment with params.vad = false
  • Add offset_ms = segment_start_seconds * 1000 to every token timestamp
  • Result beats stable-ts with no internal whisper.cpp changes needed

PR justification: Convenience + snapping. The loop pattern is user-space. The snap module is
the differentiated value-add and closes the remaining quality gap to best-in-class (0.89%).
