Commit 7c2414f

Merge pull request #39 from NavodPeiris/dev
added fp16 support when running regular whisper on gpu
2 parents f8a4d03 + d94bc22 commit 7c2414f

7 files changed

Lines changed: 80 additions & 273 deletions

File tree

README.md

Lines changed: 24 additions & 243 deletions
Large diffs are not rendered by default.

examples/transcribe.py

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 from speechlib import Transcriptor

-file = "obama1.wav" # your audio file
+file = "obama_zach.wav" # your audio file
 voices_folder = "voices" # voices folder containing voice samples for recognition
 language = "en" # language code
 log_folder = "logs" # log folder for storing transcripts

library.md

Lines changed: 36 additions & 16 deletions
@@ -1,3 +1,9 @@
+### Run your IDE as administrator
+
+you will get following error if administrator permission is not there:
+
+**OSError: [WinError 1314] A required privilege is not held by the client**
+
 ### Requirements

 * Python 3.8 or greater
@@ -31,13 +37,13 @@ This library does speaker diarization, speaker recognition, and transcription on

 This library contains following audio preprocessing functions:

-1. convert mp3 to wav
+1. convert other audio formats to wav

 2. convert stereo wav file to mono

 3. re-encode the wav file to have 16-bit PCM encoding

-Transcriptor method takes 6 arguments.
+Transcriptor method takes 7 arguments.

 1. file to transcribe

@@ -47,9 +53,11 @@ Transcriptor method takes 6 arguments.

 4. model size ("tiny", "small", "medium", "large", "large-v1", "large-v2", "large-v3")

-5. voices_folder (contains speaker voice samples for speaker recognition)
+5. ACCESS_TOKEN: huggingface acccess token (also get permission to access `pyannote/speaker-diarization@2.1`)
+
+6. voices_folder (contains speaker voice samples for speaker recognition)

-6. quantization: this determine whether to use int8 quantization or not. Quantization may speed up the process but lower the accuracy.
+7. quantization: this determine whether to use int8 quantization or not. Quantization may speed up the process but lower the accuracy.

 voices_folder should contain subfolders named with speaker names. Each subfolder belongs to a speaker and it can contain many voice samples. This will be used for speaker recognition to identify the speaker.

@@ -64,26 +72,34 @@ transcript will also indicate the timeframe in seconds where each speaker speaks
 ```
 from speechlib import Transcriptor

-file = "obama_zach.wav"
-voices_folder = "voices"
-language = "en"
-log_folder = "logs"
-modelSize = "medium"
+file = "obama_zach.wav" # your audio file
+voices_folder = "voices" # voices folder containing voice samples for recognition
+language = "en" # language code
+log_folder = "logs" # log folder for storing transcripts
+modelSize = "tiny" # size of model to be used [tiny, small, medium, large-v1, large-v2, large-v3]
 quantization = False # setting this 'True' may speed up the process but lower the accuracy
+ACCESS_TOKEN = "your huggingface access token" # get permission to access pyannote/speaker-diarization@2.1 on huggingface

-transcriptor = Transcriptor(file, log_folder, language, modelSize, voices_folder, quantization)
+# quantization only works on faster-whisper
+transcriptor = Transcriptor(file, log_folder, language, modelSize, ACCESS_TOKEN, voices_folder, quantization)

-res = transcriptor.transcribe()
+# use normal whisper
+res = transcriptor.whisper()
+
+# use faster-whisper (simply faster)
+res = transcriptor.faster_whisper()

 res --> [["start", "end", "text", "speaker"], ["start", "end", "text", "speaker"]...]
 ```

+#### if you don't want speaker names: keep voices_folder as an empty string ""
+
 start: starting time of speech in seconds
 end: ending time of speech in seconds
 text: transcribed text for speech during start and end
 speaker: speaker of the text

-voices_folder structure:
+#### voices folder structure:
 ```
 voices_folder
 |---> person1
@@ -116,15 +132,16 @@ supported language names:
 from speechlib import PreProcessor

 file = "obama1.mp3"
-
+#initialize
+prep = PreProcessor()
 # convert mp3 to wav
-wav_file = PreProcessor.convert_to_wav(file)
+wav_file = prep.convert_to_wav(file)

 # convert wav file from stereo to mono
-PreProcessor.convert_to_mono(wav_file)
+prep.convert_to_mono(wav_file)

 # re-encode wav file to have 16-bit PCM encoding
-PreProcessor.re_encode(wav_file)
+prep.re_encode(wav_file)
 ```

 ### Performance
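
Review note on the preprocessing steps documented in this hunk: the stereo-to-mono step can be illustrated with the standard library alone. This is a minimal sketch, not the `PreProcessor` implementation, which this diff does not show. The function name `stereo_to_mono` is hypothetical, and the sketch assumes the input is already 16-bit PCM stereo WAV.

```python
import array
import sys
import wave


def stereo_to_mono(src_path: str, dst_path: str) -> None:
    """Average the two channels of a 16-bit PCM stereo WAV into one channel."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM stereo input")
        frame_rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))

    # WAV data is little-endian; swap bytes on big-endian hosts.
    if sys.byteorder == "big":
        samples.byteswap()

    # Samples are interleaved L, R, L, R, ...; average each pair.
    mono = array.array(
        "h",
        ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)),
    )

    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
```

Averaging the channels keeps the output within the 16-bit range without clipping. Other input formats would first need the conversion and re-encoding steps listed in the documentation.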
@@ -170,6 +187,9 @@ metrics for faster-whisper "large" model:
 transcription time: 343s
 ```

+#### why not using pyannote/speaker-diarization-3.1, speechbrain >= 1.0.0, faster-whisper >= 1.0.0:
+
+because older versions give more accurate transcriptions. this was tested.

 This library uses following huggingface models:

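
Review note on the `res` format documented in library.md: the result is described as a list of `[start, end, text, speaker]` rows, with times in seconds. A minimal sketch of consuming that shape follows. The helper `format_transcript` and the sample rows are hypothetical, and the `"unknown"` label for an empty speaker is an assumption, since the docs do not say what the speaker field holds when `voices_folder` is `""`.

```python
from typing import List, Sequence


def format_transcript(segments: Sequence[Sequence]) -> List[str]:
    """Render [start, end, text, speaker] rows as readable lines."""
    lines = []
    for start, end, text, speaker in segments:
        # Assumed fallback label; the library's actual placeholder is not documented here.
        label = speaker if speaker else "unknown"
        lines.append(
            f"[{float(start):.1f}s - {float(end):.1f}s] {label}: {str(text).strip()}"
        )
    return lines


# Made-up rows in the documented shape, for illustration only.
sample = [[0.0, 4.2, " Hello there. ", "obama"], [4.2, 6.0, "Hi.", ""]]
for line in format_transcript(sample):
    print(line)
```

`float()` is applied to the timestamps so the helper works whether they arrive as numbers or as numeric strings.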

requirements.txt

Lines changed: 9 additions & 8 deletions
@@ -1,8 +1,9 @@
-transformers
-torch
-torchaudio
-pydub
-pyannote.audio
-speechbrain
-accelerate
-faster-whisper
+transformers==4.36.2
+torch==2.1.2
+torchaudio==2.1.2
+pydub==0.25.1
+pyannote.audio==3.1.1
+speechbrain==0.5.16
+accelerate==0.26.1
+faster-whisper==0.10.1
+openai-whisper==20231117
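
Review note on the new pins: since the commit moves every dependency to an exact `==` version, it can be useful to check an environment against them. A minimal sketch using only `importlib.metadata` from the standard library (Python 3.8+) follows. The helper names `parse_pins` and `find_mismatches` are hypothetical and not part of speechlib.

```python
from importlib import metadata
from typing import Dict, Optional


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Extract `name==version` pins, ignoring comments and unpinned lines."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs from its pin to that version (None if absent)."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches
```

Usage would be along the lines of `find_mismatches(parse_pins(open("requirements.txt").read()))`, where an empty result means the environment matches the pinned versions.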

setup.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@

 setup(
     name="speechlib",
-    version="1.1.0",
+    version="1.1.2",
     description="speechlib is a library that can do speaker diarization, transcription and speaker recognition on an audio file to create transcripts with actual speaker names. This library also contain audio preprocessor functions.",
     packages=find_packages(),
     long_description=long_description,

setup_instruction.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ for publishing:
 pip install twine

 for install locally for testing:
-pip install dist/speechlib-1.1.0-py3-none-any.whl
+pip install dist/speechlib-1.1.2-py3-none-any.whl

 finally run:
 twine upload dist/*

speechlib/transcribe.py

Lines changed: 8 additions & 3 deletions
@@ -32,9 +32,14 @@ def transcribe(file, language, model_size, whisper_type, quantization):
         Exception("Language code not supported.\nThese are the supported languages:\n", model.supported_languages)
     else:
         try:
-            model = whisper.load_model(model_size)
-            result = model.transcribe(file, language=language)
-            res = result["text"]
+            if torch.cuda.is_available():
+                model = whisper.load_model(model_size, device="cuda")
+                result = model.transcribe(file, language=language, fp16=True)
+                res = result["text"]
+            else:
+                model = whisper.load_model(model_size, device="cpu")
+                result = model.transcribe(file, language=language, fp16=False)
+                res = result["text"]

             return res
         except Exception as err:
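
Review note on the hunk above: the two new branches differ only in the `device` string and the `fp16` flag, so the selection could be made once and the load and transcribe calls shared. A minimal sketch of that shape follows. The names `pick_device` and `transcribe_with_whisper` are hypothetical and not part of speechlib. The `whisper.load_model(..., device=...)` and `model.transcribe(..., fp16=...)` calls are the same ones the diff already uses.

```python
from typing import Tuple


def pick_device(cuda_available: bool) -> Tuple[str, bool]:
    """Return (device, fp16) for openai-whisper: half precision only on GPU."""
    return ("cuda", True) if cuda_available else ("cpu", False)


def transcribe_with_whisper(file: str, language: str, model_size: str) -> str:
    """Load the model on the best available device and return the transcript text."""
    # Imported lazily so pick_device stays usable without torch/whisper installed.
    import torch
    import whisper

    device, fp16 = pick_device(torch.cuda.is_available())
    model = whisper.load_model(model_size, device=device)
    result = model.transcribe(file, language=language, fp16=fp16)
    return result["text"]
```

Behavior is intended to match the committed code: fp16 on CUDA, fp32 on CPU. Keeping the decision in one small function also makes it testable without a GPU.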
