Stable Timestamps — End-to-End Journey & Final Conclusion
Date: 2026-03-06
Branch: feature/stable-timestamps-clean (PR against ggml-org/whisper.cpp master)
The Problem
whisper.cpp produces word-level token timestamps, but they are often inaccurate — tokens land
inside silence gaps rather than tightly around the spoken word. This causes subtitles to appear
or disappear at the wrong time.
Metric used throughout: pct_words_overlap — percentage of decoded word tokens whose
timestamp range overlaps a silence region in the audio. Lower is better. 0% = perfect (every
word lands on actual speech). Measured on a 5-minute synthetic audio file (synth_5min.wav)
with a known ground-truth silence timeline.
Reference benchmark: stable-ts (Python library wrapping faster-whisper) achieves 5.7%
word overlap on the same file. Goal: match or beat it inside whisper.cpp.
Baseline whisper.cpp (no changes): 41.1% word overlap, 161 segments vs 46 expected.
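For concreteness, the overlap metric boils down to interval arithmetic. A minimal sketch (hypothetical helper names, not the actual benchmark harness), with all times in centiseconds:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Interval { int64_t t0, t1; }; // centiseconds

// True if the word's timestamp range [t0, t1) overlaps any silence interval.
static bool overlaps_silence(const Interval & w, const std::vector<Interval> & silence) {
    for (const auto & s : silence) {
        if (w.t0 < s.t1 && s.t0 < w.t1) return true;
    }
    return false;
}

// pct_words_overlap: percentage of word intervals that touch silence. Lower is better.
static double pct_words_overlap(const std::vector<Interval> & words,
                                const std::vector<Interval> & silence) {
    if (words.empty()) return 0.0;
    int n = 0;
    for (const auto & w : words) {
        if (overlaps_silence(w, silence)) n++;
    }
    return 100.0 * n / words.size();
}
```

With a known ground-truth silence timeline this is all the harness needs per decode variant.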
Iteration History
v1 — Post-hoc snapping only
Snap token boundaries to nearest silence edge after decode. Wrong snapping algorithm. Result:
worse than baseline in some cases. Abandoned.
v2 — Constrained decoding + snapping
Added --stable-timestamps flag. During decode, suppress tokens that would land in silence
(constrained beam search). Then snap remaining token boundaries.
Result: 22.6% word overlap, 45 segments. Much better but not close to stable-ts 5.7%.
Root cause: the decoder still processes full 30-second windows, silence included. Hallucinated
words fill the silence gaps and cannot all be snapped away.
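To make the v2 idea concrete, here is a rough sketch of what suppressing silence timestamp tokens during decoding could look like (hypothetical; not the PR's actual implementation). Whisper's timestamp tokens advance in 20 ms steps, so timestamp token i corresponds to i * 2 centiseconds within the window:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct Interval { int64_t t0, t1; }; // centiseconds, relative to the window

// Hypothetical v2-style constraint: before sampling, mask the logit of every
// timestamp token whose time falls inside a known silence interval, so the
// beam can never place a boundary there.
static void suppress_silence_timestamps(std::vector<float> & ts_logits,
                                        const std::vector<Interval> & silence) {
    for (size_t i = 0; i < ts_logits.size(); i++) {
        const int64_t t = (int64_t) i * 2; // 20 ms per timestamp token
        for (const auto & s : silence) {
            if (t >= s.t0 && t < s.t1) {
                ts_logits[i] = -INFINITY; // token can never be sampled
                break;
            }
        }
    }
}
```

The limitation the root cause describes survives this masking: text tokens themselves are not constrained, so hallucinated words can still be emitted between allowed timestamps.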
v3 — VAD concatenation + timestamp remapping
Added --vad integration. VAD strips silence, concatenates speech, decoder sees one stream.
Built a vad_mapping_table (processed_time → original_time) to remap timestamps back.
Result: 10.3% word overlap, 45 segments. Better, but segments still spanned original silence
gaps because the decoder saw a continuous stream and produced cross-boundary segments.
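The v3 remapping can be illustrated with a small sketch (the names and shape here are assumptions; the PR's actual vad_mapping_table may differ). Each speech segment contributes one entry mapping its position in the concatenated stream back to its original start time:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical shape of a processed_time -> original_time mapping table.
// Times in centiseconds.
struct VadMapEntry { int64_t proc_t0, orig_t0, len; };

// speech: (start, end) pairs of the detected speech segments in original time.
static std::vector<VadMapEntry> build_mapping(const std::vector<std::pair<int64_t, int64_t>> & speech) {
    std::vector<VadMapEntry> map;
    int64_t proc = 0;
    for (const auto & s : speech) {
        const int64_t len = s.second - s.first;
        map.push_back({proc, s.first, len});
        proc += len; // silence is stripped, so processed time only advances by speech
    }
    return map;
}

// Map a timestamp in the concatenated (processed) stream back to original time.
static int64_t proc_to_orig(int64_t t, const std::vector<VadMapEntry> & map) {
    for (const auto & e : map) {
        if (t < e.proc_t0 + e.len) {
            return e.orig_t0 + (t - e.proc_t0);
        }
    }
    return map.empty() ? t : map.back().orig_t0 + map.back().len; // clamp past the end
}
```

The failure mode follows directly from this shape: a decoded segment whose tokens straddle two map entries corresponds to a span that crosses an original silence gap.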
v4 — Per-segment VAD decode (current PR)
Key insight: instead of concatenating all speech and decoding once, decode each VAD segment
independently and add a fixed offset to timestamps. This is exactly how stable-ts/faster-whisper
works internally.
Changes: removed the concatenation + mapping table infrastructure (~200 lines). Added
whisper_full_vad_segments() — loops over VAD segments, calls whisper_full_with_state() per
segment, shifts all token timestamps by offset_cs = seg_start_centiseconds.
Result: 0.89% word overlap, 46 segments. Beats stable-ts (5.7%).
The Core Question for the PR
whisper.cpp already exposes public APIs for VAD and per-segment decode. Could users achieve the
same result without any internal changes? If yes, the PR's value is "convenience", not
"new capability."
Test
Built user_vad_decode.cpp on upstream/master (no PR changes). Uses only public APIs:

```cpp
// 1. Run VAD to get speech segments (all defaults, no tuning)
auto vad_params = whisper_vad_default_params();
auto vad_segs   = whisper_vad_segments_from_samples(vctx, vad_params, pcm.data(), n_samples);
int  n_segs     = whisper_vad_segments_n_segments(vad_segs);

// 2. Decode each segment independently
for (int i = 0; i < n_segs; i++) {
    float t0 = whisper_vad_segments_get_segment_t0(vad_segs, i); // centiseconds
    float t1 = whisper_vad_segments_get_segment_t1(vad_segs, i);
    int start = (int)(t0 / 100.0f * WHISPER_SAMPLE_RATE);
    int end   = (int)(t1 / 100.0f * WHISPER_SAMPLE_RATE);

    whisper_full_with_state(ctx, state, params, pcm.data() + start, end - start);

    // 3. Read timestamps and shift by segment start
    int64_t offset_ms = (int64_t)(t0 / 100.0f * 1000.0f);
    for (int s = 0; s < whisper_full_n_segments_from_state(state); s++) {
        int64_t seg_t0 = whisper_full_get_segment_t0_from_state(state, s) * 10 + offset_ms;
        int64_t seg_t1 = whisper_full_get_segment_t1_from_state(state, s) * 10 + offset_ms;
        // collect tokens similarly...
    }
}
```

That's the entire pattern. ~50 lines of application code, no whisper.cpp changes.
Performance: The Chunking Problem
Problem: whisper's encoder always processes a fixed 30-second mel window regardless of input
length. A 2-second VAD segment still costs one full encoder pass. With 46 segments on a 5-minute
file, that's 46 encoder runs vs ~10 for full-audio decode — roughly 4-5x slower.
Solution: greedy bin-packing of VAD segments into ~25s chunks.
Group adjacent VAD segments until the next one would push the chunk past ~25 seconds. Decode
each chunk as a single whisper_full_with_state() call. Each sub-segment within the chunk still
gets its own offset (chunk_start + position_within_chunk). Encoder runs drop from 46 → ~10,
matching full-audio speed, while still avoiding long silence-only windows that cause
hallucinations.
```cpp
// Pack VAD segments into ~25 s chunks
std::vector<std::vector<int>> chunks;
std::vector<int> cur;
float cur_dur = 0;
for (int i = 0; i < n_segs; i++) {
    float dur = (t1[i] - t0[i]) / 100.0f; // seconds
    if (!cur.empty() && cur_dur + dur > 25.0f) {
        chunks.push_back(cur); cur.clear(); cur_dur = 0;
    }
    cur.push_back(i); cur_dur += dur;
}
if (!cur.empty()) chunks.push_back(cur);

// For each chunk: slice PCM from chunk_t0 to chunk_t1, decode once,
// offset each result segment by chunk_t0.
```

This is the same strategy faster-whisper uses internally ("chunking").
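The slicing and offsetting described in the comment above reduces to two unit conversions, sketched here as hypothetical helpers (whisper segment timestamps are in centiseconds, hence the * 10 to reach milliseconds; WHISPER_SAMPLE_RATE is 16000):

```cpp
#include <cassert>
#include <cstdint>

#define WHISPER_SAMPLE_RATE 16000 // defined in whisper.h

// Convert a chunk boundary in centiseconds (as returned by the VAD getters)
// into a PCM sample index for slicing.
static int cs_to_sample(float t_cs) {
    return (int)(t_cs / 100.0f * WHISPER_SAMPLE_RATE);
}

// Shift a decoded timestamp (centiseconds, relative to the chunk start) back
// to absolute milliseconds in the original audio.
static int64_t chunk_local_cs_to_abs_ms(int64_t t_cs, float chunk_t0_cs) {
    return t_cs * 10 + (int64_t)(chunk_t0_cs * 10.0f);
}
```

Getting these conversions right per sub-segment is exactly the non-obvious part the next paragraph refers to.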
This is still pure user-space — no whisper.cpp internals needed. However, it adds another
~30 lines of non-obvious logic (bin-packing, PCM slicing per chunk, per-sub-segment offsets).
This reinforces the PR's value: --vad handles all of this automatically with correct behavior.
Final Results
| Metric | baseline | v2 | v3 | v4 (PR) | user_vad (public API) | stable-ts |
|---|---|---|---|---|---|---|
| n_segments | 161 | 45 | 45 | 46 | 46 | 46 |
| n_words_overlap_any | 745 | 144 | 52 | 5 | 6 | 22 |
| pct_words_overlap % | 41.1 | 22.6 | 10.3 | 0.89 | 1.27 | 5.7 |
| pass_segments_threshold | False | False | False | True | True | False |
Conclusion
The capability is user-replicable with ~50 lines of application code.
Per-segment VAD decoding via public APIs achieves 1.27% word overlap — well below stable-ts
(5.7%) and essentially equivalent to the PR's v4 (0.89%).
The 0.38 percentage point gap between user_vad and v4 comes from the snapping step:
whisper_stable_snap_segments() nudges token boundaries away from silence edges after decode.
This is only available via the PR's new whisper-stable module.
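As a rough illustration of the snapping idea (hypothetical sketch; the PR's whisper_stable_snap_segments() may differ in detail): a token boundary that lands inside a silence interval is moved to the nearer edge of that interval.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Interval { int64_t t0, t1; }; // centiseconds

// Hypothetical silence-edge snapping: if a boundary falls strictly inside a
// silence interval, move it to whichever edge of that interval is closer.
// Boundaries already on speech (or exactly on an edge) are left alone.
static int64_t snap_boundary(int64_t t, const std::vector<Interval> & silence) {
    for (const auto & s : silence) {
        if (t > s.t0 && t < s.t1) {
            return (t - s.t0 <= s.t1 - t) ? s.t0 : s.t1;
        }
    }
    return t;
}
```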
What the PR uniquely adds:
- `--stable-timestamps --vad` CLI flags — zero user code required
- `whisper_stable_snap_segments()` — silence-edge boundary snapping (closes 1.27% → 0.89%)
What end-users can do today without the PR (simple solution):
- Call `whisper_vad_segments_from_samples()` to get speech segments
- Loop: call `whisper_full_with_state()` per segment with `params.vad = false`
- Add `offset_ms = segment_start_seconds * 1000` to every token timestamp
- Result beats stable-ts with no internal whisper.cpp changes needed
PR justification: Convenience + snapping. The loop pattern is user-space. The snap module is
the differentiated value-add and closes the remaining quality gap to best-in-class (0.89%).