Commit ab381e3
thewh1teagle and claude committed
stable-timestamps: per-segment VAD decoding for subtitle-quality timestamps
Replace concatenate-decode-remap pipeline with per-segment VAD decoding, matching how stable-ts/faster-whisper works. Each VAD speech segment is decoded independently and timestamps are offset by the segment's original start time, so no mapping table or interpolation is needed.

Results on 5-min synthetic audio (46 utterances, 7x 20s pauses):
  pct_words_overlap: 0.89% (vs 5.7% stable-ts, 22.6% previous v2)
  n_words_overlap: 5 (vs 22 stable-ts, 144 previous v2)
  Wall time: 22.8s (vs 43.2s stable-ts; 1.9x faster via Metal)

Code removed:
- whisper_vad() concatenation + mapping table building
- vad_time_mapping struct, vad_mapping_table, has_vad_segments from state
- map_processed_to_original_time() in whisper.cpp
- whisper_stable_map_processed_to_original() in whisper-stable.cpp
- mapping params from whisper_stable_snap_segments()

Code added:
- whisper_full_vad_segments(): ~70-line per-segment decode loop
- whisper_full_parallel() with VAD delegates to whisper_full()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 9234e47 commit ab381e3
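
The per-segment approach described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual whisper_full_vad_segments() code: the SpeechSegment and Word types, the DecodeFn callback, and decode_per_segment() are all hypothetical names, and the decoder is assumed to return word times relative to the start of the slice it is given.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not the whisper.cpp API.
struct SpeechSegment { double start_s; double end_s; };      // VAD output, seconds in the original audio
struct Word { std::string text; double t0_s; double t1_s; }; // word with start/end time in seconds

// Assumed decoder callback: gets the samples of one segment and returns
// words timed relative to the start of that slice.
using DecodeFn = std::function<std::vector<Word>(const std::vector<float> &)>;

// Decode each VAD segment on its own and shift the resulting timestamps by
// the segment's start time, so no mapping table or interpolation is needed.
std::vector<Word> decode_per_segment(const std::vector<float> & pcm,
                                     int sample_rate,
                                     const std::vector<SpeechSegment> & segments,
                                     const DecodeFn & decode) {
    std::vector<Word> out;
    for (const SpeechSegment & seg : segments) {
        assert(seg.start_s >= 0.0 && seg.end_s >= seg.start_s);

        // Convert segment bounds to sample indices, clamped to the buffer.
        const std::size_t first = std::min(pcm.size(), static_cast<std::size_t>(seg.start_s * sample_rate));
        const std::size_t last  = std::min(pcm.size(), static_cast<std::size_t>(seg.end_s   * sample_rate));
        const std::vector<float> slice(pcm.begin() + first, pcm.begin() + last);

        // Segment-relative times become original-audio times with one addition.
        for (Word w : decode(slice)) {
            w.t0_s += seg.start_s;
            w.t1_s += seg.start_s;
            out.push_back(std::move(w));
        }
    }
    return out;
}
```

Because silence between segments is never fed to the decoder, a word's timestamp cannot drift into a pause; the only remapping step is the addition of seg.start_s.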

43 files changed, +70373 -930 lines changed

examples/cli/cli.cpp

Lines changed: 1 addition & 7 deletions
@@ -295,7 +295,7 @@ static void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params
     // Voice Activity Detection (VAD) parameters
     fprintf(stderr, "\nVoice Activity Detection (VAD) options:\n");
     fprintf(stderr, " --vad [%-7s] enable Voice Activity Detection (VAD)\n", params.vad ? "true" : "false");
-    fprintf(stderr, " --stable-timestamps [%-7s] enable stable timestamps (requires --vad-model)\n", params.stable_timestamps ? "true" : "false");
+    fprintf(stderr, " --stable-timestamps [%-7s] enable stable timestamps\n", params.stable_timestamps ? "true" : "false");
     fprintf(stderr, " -vm FNAME, --vad-model FNAME [%-7s] VAD model path\n", params.vad_model.c_str());
     fprintf(stderr, " -vt N, --vad-threshold N [%-7.2f] VAD threshold for speech recognition\n", params.vad_threshold);
     fprintf(stderr, " -vspd N, --vad-min-speech-duration-ms N [%-7d] VAD min speech duration (0.0-1.0)\n", params.vad_min_speech_duration_ms);
@@ -1005,12 +1005,6 @@ int main(int argc, char ** argv) {
         exit(0);
     }
 
-    if (params.stable_timestamps && params.vad_model.empty()) {
-        fprintf(stderr, "error: --stable-timestamps requires --vad-model\n");
-        whisper_print_usage(argc, argv, params);
-        return 2;
-    }
-
     if (params.no_prints) {
         whisper_log_set(cb_log_disable, NULL);
     }
60.3 MB
Binary file not shown.