Add faster-whisper (ctranslate2) as option for Whisper annotation workflow #1017
entn-at wants to merge 11 commits into lhotse-speech:master
Conversation
Would it be possible to combine whisper and faster-whisper into a single CLI/method, and add faster-whisper as an optional flag that can be enabled or disabled by default? The internals of the two calls can stay separate, but from a user's perspective a single entry point makes more sense, since they provide the same functionality. I'm thinking of it as two backends behind the same user-facing wrapper.
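A minimal sketch of the "two backends, one wrapper" idea. All names here are hypothetical placeholders, not Lhotse's actual API: one user-facing function dispatches on a boolean flag while the backend internals stay separate.

```python
def _annotate_with_whisper(audio):
    # Placeholder for the original OpenAI Whisper path.
    return {"backend": "whisper", "audio": audio}


def _annotate_with_faster_whisper(audio):
    # Placeholder for the CTranslate2 / faster-whisper path.
    return {"backend": "faster-whisper", "audio": audio}


def annotate_with_whisper(audio, faster_whisper=True):
    """User-facing wrapper: same signature regardless of backend."""
    backend = (
        _annotate_with_faster_whisper if faster_whisper else _annotate_with_whisper
    )
    return backend(audio)


print(annotate_with_whisper("clip.wav")["backend"])  # faster-whisper
print(annotate_with_whisper("clip.wav", faster_whisper=False)["backend"])  # whisper
```

The flag defaults to the faster backend, matching the suggestion above that it be enabled by default.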
entn-at force-pushed the branch from 1610802 to 0f5a2e1
Thanks for the quick initial review! I combined whisper and faster-whisper into a single CLI/method with a flag to choose the backend.

Quick benchmark on mini-librispeech dev-clean-2 (RTX 2080 Ti): OpenAI Whisper vs. faster-whisper/ctranslate2.
I quickly compared the results of the old and new Whisper implementations on a 60 s clip from AMI. In that clip, I noticed that faster-whisper tends to skip short, isolated, noisy utterances such as "Okay" or "Thank you", probably due to VAD (which is OK, I guess). However, the time boundaries seem off compared to the original implementation; please see the screenshot. Do you think it's possible to fix that? Maybe more accurate timing information is exposed somewhere in faster-whisper and just isn't being used here? Otherwise a lot of silence/non-speech ends up in the supervisions. Note: the top plot is from the original Whisper, the bottom plot from faster-whisper.
pzelasko left a comment
I ran into a few issues running it; could you make the suggested changes below to fix them?
    @click.option(
        "-d", "--device", default="cpu", help="Device on which to run the inference."
    )
    @click.option(
Please change to:

    @click.option(
        "--faster-whisper/--normal-whisper",
        default=True,
        help="If True, use faster-whisper's implementation based on CTranslate2.",
    )

Otherwise it can't be turned off.
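For illustration, the `--x/--y` syntax in click defines one boolean option with two paired switches. The same idea can be sketched with stdlib `argparse` (a hypothetical stand-in, not the PR's actual code), which shows why both switch names are needed to let users turn the option off again:

```python
import argparse

parser = argparse.ArgumentParser()
# Two switches writing to the same destination, like click's
# "--faster-whisper/--normal-whisper" paired flag.
parser.add_argument("--faster-whisper", dest="faster_whisper",
                    action="store_true", help="Use the CTranslate2 backend.")
parser.add_argument("--normal-whisper", dest="faster_whisper",
                    action="store_false", help="Use the original backend.")
parser.set_defaults(faster_whisper=True)

print(parser.parse_args([]).faster_whisper)                    # True (default on)
print(parser.parse_args(["--normal-whisper"]).faster_whisper)  # False (turned off)
```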
    )
    @click.option(
        "--faster-whisper-compute-type",
        default="float16",
Suggested change:

    -    default="float16",
    +    default="auto",

Otherwise it won't work on (some?) CPUs.
        device_index=device_index,
        compute_type=compute_type,
        num_workers=num_workers,
        download_root=download_root,
Since the change that enables this option is not yet released on pip, I suggest a small workaround here; otherwise it cannot be run:

    + opt_kwargs = {}
    + if download_root is not None:
    +     opt_kwargs["download_root"] = download_root
      model = WhisperModel(
          model_name,
          device=device,
          device_index=device_index,
          compute_type=compute_type,
          num_workers=num_workers,
    -     download_root=download_root,
    +     **opt_kwargs,
      )
    - model.logger.setLevel(logging.WARNING)
    + if hasattr(model, "logger"):
    +     model.logger.setLevel(logging.WARNING)

Note that I also suggested a check for `logger`; on my installation, the model did not have a `logger` attribute.
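The two defensive patterns in that diff can be shown in isolation. This sketch uses a stub class in place of `faster_whisper.WhisperModel` (so it runs without the library installed): forward `download_root` only when it is set, and guard the logger access with `hasattr`.

```python
import logging


class StubModel:
    """Stand-in for WhisperModel; like older faster-whisper releases,
    it has no `logger` attribute and no `download_root` parameter
    unless one is explicitly passed."""

    def __init__(self, name, **kwargs):
        self.name = name
        self.kwargs = kwargs


def load_model(model_name, download_root=None):
    # Only forward the kwarg when the user actually set it, so older
    # releases that lack the parameter still work.
    opt_kwargs = {}
    if download_root is not None:
        opt_kwargs["download_root"] = download_root
    model = StubModel(model_name, **opt_kwargs)
    # Guard attribute access that may not exist on all versions.
    if hasattr(model, "logger"):
        model.logger.setLevel(logging.WARNING)
    return model


print("download_root" in load_model("base").kwargs)  # False: kwarg omitted when unset
print(load_model("base", download_root="/models").kwargs)  # {'download_root': '/models'}
```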
Sorry for the delay, I've been quite busy. I'll pick this up shortly and address the requested changes.
@entn-at any updates on this?

This PR adds a second Whisper annotation workflow that uses faster-whisper, a CTranslate2-based implementation (see https://github.com/entn-at/lhotse/tree/feature/whisper-ctranslate2). It is considerably faster and uses far less memory.
This implementation also produces word start and end times. I'm still investigating whether they are accurate enough in general to be used as alignments.
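As an illustration of how such word timings could become alignments: faster-whisper can return per-word start/end times (e.g. via `word_timestamps=True` on `transcribe`). The sketch below converts word objects into simple `(symbol, start, duration)` alignment entries; `Word` is a stub standing in for the library's own word objects, and the conversion is a hypothetical example, not the PR's code.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Stub mirroring the fields of faster-whisper's word objects.
    start: float
    end: float
    word: str


def words_to_alignment(words):
    """Return (symbol, start, duration) tuples, rounded to 10 ms."""
    return [
        (w.word.strip(), round(w.start, 2), round(w.end - w.start, 2))
        for w in words
    ]


words = [Word(0.00, 0.42, " okay"), Word(0.50, 0.91, " thank"), Word(0.91, 1.20, " you")]
print(words_to_alignment(words))
# [('okay', 0.0, 0.42), ('thank', 0.5, 0.41), ('you', 0.91, 0.29)]
```

Whether the raw timings are tight enough for this to be useful is exactly the open question above.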