Add faster-whisper (ctranslate2) as option for Whisper annotation workflow #1017
entn-at wants to merge 11 commits into lhotse-speech:master
Conversation
Would it be possible to combine whisper and faster-whisper into a single CLI/method, and add faster-whisper as an optional flag that can be enabled or disabled by default? The internals of the two calls can stay separate, but from a user's perspective a single entry point makes more sense, since they provide the same functionality. I'm thinking of it as two backends behind the same user-facing wrapper.
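A minimal sketch of the "two backends, one wrapper" idea. All names here are hypothetical placeholders, not Lhotse's actual API: one user-facing function dispatches on a boolean flag while the backend internals stay separate.

```python
def _annotate_with_whisper(audio):
    # Placeholder for the original OpenAI Whisper path.
    return {"backend": "whisper", "audio": audio}


def _annotate_with_faster_whisper(audio):
    # Placeholder for the CTranslate2 / faster-whisper path.
    return {"backend": "faster-whisper", "audio": audio}


def annotate_with_whisper(audio, faster_whisper=True):
    """User-facing wrapper: same signature regardless of backend."""
    backend = (
        _annotate_with_faster_whisper if faster_whisper else _annotate_with_whisper
    )
    return backend(audio)


print(annotate_with_whisper("clip.wav")["backend"])  # faster-whisper
print(annotate_with_whisper("clip.wav", faster_whisper=False)["backend"])  # whisper
```

The flag defaults to the faster backend, matching the suggestion above that it be enabled by default.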
entn-at force-pushed the branch from 1610802 to 0f5a2e1
Thanks for the quick initial review! I combined whisper and faster-whisper into a single CLI/method with a flag to choose the backend.

Quick benchmark on mini-librispeech dev-clean-2 (RTX 2080 Ti): OpenAI Whisper vs. faster-whisper/ctranslate2.
I quickly compared the results of the old and new Whisper implementations on a 60 s clip from AMI. In that clip, I noticed that faster-whisper tends to skip short, isolated, noisy utterances such as "Okay" or "Thank you", probably due to VAD (which is OK, I guess). However, the time boundaries seem off compared to the original implementation; please see the screenshot. Do you think it's possible to fix that? Maybe more accurate timing information is exposed somewhere in faster-whisper and just isn't being used here? Otherwise a lot of silence/non-speech ends up in the supervisions. Note: the top plot is from the original Whisper, the bottom plot from faster-whisper.
pzelasko left a comment
I ran into a few issues running it; could you make the suggested changes below to fix them?
    @click.option(
        "-d", "--device", default="cpu", help="Device on which to run the inference."
    )
    @click.option(
Please change to:

    @click.option(
        "--faster-whisper/--normal-whisper",
        default=True,
        help="If True, use faster-whisper's implementation based on CTranslate2.",
    )

Otherwise it can't be turned off.
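For illustration, the `--x/--y` syntax in click defines one boolean option with two paired switches. The same idea can be sketched with stdlib `argparse` (a hypothetical stand-in, not the PR's actual code), which shows why both switch names are needed to let users turn the option off again:

```python
import argparse

parser = argparse.ArgumentParser()
# Two switches writing to the same destination, like click's
# "--faster-whisper/--normal-whisper" paired flag.
parser.add_argument("--faster-whisper", dest="faster_whisper",
                    action="store_true", help="Use the CTranslate2 backend.")
parser.add_argument("--normal-whisper", dest="faster_whisper",
                    action="store_false", help="Use the original backend.")
parser.set_defaults(faster_whisper=True)

print(parser.parse_args([]).faster_whisper)                    # True (default on)
print(parser.parse_args(["--normal-whisper"]).faster_whisper)  # False (turned off)
```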
    )
    @click.option(
        "--faster-whisper-compute-type",
        default="float16",
Suggested change:

    -    default="float16",
    +    default="auto",

Otherwise it won't work on (some?) CPUs.
        device_index=device_index,
        compute_type=compute_type,
        num_workers=num_workers,
        download_root=download_root,
Since the change that enables this option is not yet released on pip, I suggest a small workaround here; otherwise it cannot be run:

    + opt_kwargs = {}
    + if download_root is not None:
    +     opt_kwargs["download_root"] = download_root
      model = WhisperModel(
          model_name,
          device=device,
          device_index=device_index,
          compute_type=compute_type,
          num_workers=num_workers,
    -     download_root=download_root,
    +     **opt_kwargs,
      )
    - model.logger.setLevel(logging.WARNING)
    + if hasattr(model, "logger"):
    +     model.logger.setLevel(logging.WARNING)

Note that I also suggested a check for `logger`; on my installation, the model did not have a `logger` attribute.
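The two defensive patterns in that diff can be shown in isolation. This sketch uses a stub class in place of `faster_whisper.WhisperModel` (so it runs without the library installed): forward `download_root` only when it is set, and guard the logger access with `hasattr`.

```python
import logging


class StubModel:
    """Stand-in for WhisperModel; like older faster-whisper releases,
    it has no `logger` attribute and no `download_root` parameter
    unless one is explicitly passed."""

    def __init__(self, name, **kwargs):
        self.name = name
        self.kwargs = kwargs


def load_model(model_name, download_root=None):
    # Only forward the kwarg when the user actually set it, so older
    # releases that lack the parameter still work.
    opt_kwargs = {}
    if download_root is not None:
        opt_kwargs["download_root"] = download_root
    model = StubModel(model_name, **opt_kwargs)
    # Guard attribute access that may not exist on all versions.
    if hasattr(model, "logger"):
        model.logger.setLevel(logging.WARNING)
    return model


print("download_root" in load_model("base").kwargs)  # False: kwarg omitted when unset
print(load_model("base", download_root="/models").kwargs)  # {'download_root': '/models'}
```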
Sorry for the delay, I've been quite busy. I'll pick this up shortly and address the requested changes.
@entn-at any updates on this?

This PR adds a second Whisper annotation workflow that uses faster-whisper, a CTranslate2-based implementation (see https://github.com/entn-at/lhotse/tree/feature/whisper-ctranslate2). It is considerably faster and uses far less memory.
This implementation also produces word start and end times. I'm still investigating whether they are accurate enough in general to be used as alignments.
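As an illustration of how such word timings could become alignments: faster-whisper can return per-word start/end times (e.g. via `word_timestamps=True` on `transcribe`). The sketch below converts word objects into simple `(symbol, start, duration)` alignment entries; `Word` is a stub standing in for the library's own word objects, and the conversion is a hypothetical example, not the PR's code.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Stub mirroring the fields of faster-whisper's word objects.
    start: float
    end: float
    word: str


def words_to_alignment(words):
    """Return (symbol, start, duration) tuples, rounded to 10 ms."""
    return [
        (w.word.strip(), round(w.start, 2), round(w.end - w.start, 2))
        for w in words
    ]


words = [Word(0.00, 0.42, " okay"), Word(0.50, 0.91, " thank"), Word(0.91, 1.20, " you")]
print(words_to_alignment(words))
# [('okay', 0.0, 0.42), ('thank', 0.5, 0.41), ('you', 0.91, 0.29)]
```

Whether the raw timings are tight enough for this to be useful is exactly the open question above.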