
Commit 45d273d

gorkachea and eustlb authored
📚 docs(granite-speech): add comprehensive usage examples (#42125)
📚 docs(granite-speech): add comprehensive usage examples (#42125)

* 📚 docs(granite-speech): add comprehensive usage examples

  Resolves the TODO (@alex-jw-brooks) by adding complete usage documentation for the Granite Speech model now that it is released and compatible with transformers. Added examples for:
  - Basic speech transcription
  - Speech-to-text with additional context
  - Batch processing multiple audio files
  - Tips for best results (audio format, LoRA adapter, memory optimization)

  This helps users get started with the Granite Speech multimodal model by providing practical, copy-paste-ready code examples for common use cases. Replaces the TODO comment on line 44 with ~100 lines of comprehensive documentation following the patterns used in other multimodal model docs.

* Address review feedback: add chat template usage and move model-specific tips

  - Added proper chat template formatting in the second example (per @zucchini-nlp feedback)
  - Removed generic LLM tips (temperature, batch size, memory)
  - Moved Granite Speech-specific tips (audio format, LoRA adapter) to the Usage tips section

  This keeps the documentation focused on model-specific features rather than general LLM knowledge.

* docs: use datasets library for working audio examples

  Address review feedback by replacing placeholder audio paths with real examples using the hf-internal-testing/librispeech_asr_dummy dataset. This makes all code examples copy-paste ready and reproducible.
  - Add datasets import to all three examples
  - Replace 'path/to/audio.wav' with actual dataset loading
  - Ensure proper audio sampling rate handling

  Co-authored-by: eustlb <[email protected]>

* 📚 docs: use modern chat template pattern with tokenize=True for audio

---------

Co-authored-by: eustlb <[email protected]>
1 parent 672dc07 commit 45d273d

File tree

1 file changed (+108 −1 lines)

docs/source/en/model_doc/granite_speech.md

Lines changed: 108 additions & 1 deletion
@@ -40,8 +40,115 @@ This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9
## Usage tips

- This model bundles its own LoRA adapter, which will be automatically loaded and enabled/disabled as needed during inference calls. Be sure to install [PEFT](https://github.com/huggingface/peft) to ensure the LoRA is correctly applied!
- The model expects 16kHz sampling rate audio. The processor will automatically resample if needed.
- The LoRA adapter is automatically enabled when audio features are present and disabled for text-only inputs, so you don't need to manage it manually.
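To see what the 16 kHz expectation means in practice, here is a stdlib-only sketch of naive linear-interpolation resampling. It is purely illustrative — in real use, let the processor or `datasets.Audio` handle resampling, since proper resamplers also apply low-pass filtering:

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Naively resample a 1-D signal to dst_rate via linear interpolation.

    Illustrative only: real resamplers low-pass filter to avoid aliasing.
    """
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i in the input signal
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# One second of 44.1 kHz audio becomes 16000 samples at 16 kHz
one_second = [0.0] * 44100
print(len(resample_linear(one_second, 44100)))  # 16000
```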
## Usage example

Granite Speech is a multimodal speech-to-text model that can transcribe audio and respond to text prompts. Here's how to use it:

### Basic Speech Transcription

```python
from transformers import GraniteSpeechForConditionalGeneration, GraniteSpeechProcessor
from datasets import load_dataset, Audio
import torch

# Load model and processor
model = GraniteSpeechForConditionalGeneration.from_pretrained(
    "ibm-granite/granite-3.2-8b-speech",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = GraniteSpeechProcessor.from_pretrained("ibm-granite/granite-3.2-8b-speech")

# Load audio from dataset (16kHz sampling rate required)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio = ds["audio"][0]["array"]

# Process audio
inputs = processor(audio=audio, return_tensors="pt").to(model.device)

# Generate transcription
generated_ids = model.generate(**inputs, max_new_tokens=256)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```

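Note that `generate` on decoder-only models returns the prompt tokens followed by the newly generated tokens, so the decoded string can include the prompt text. A common pattern is to slice each sequence by the prompt length before decoding (in real code, slice the tensor, e.g. `generated_ids[:, inputs["input_ids"].shape[1]:]`); a minimal sketch of the idea with plain lists:

```python
def strip_prompt(generated_ids, prompt_len):
    """Drop the echoed prompt tokens, keeping only the newly generated ones."""
    return [seq[prompt_len:] for seq in generated_ids]

# Two sequences whose first 3 tokens are the prompt
batch = [[1, 2, 3, 10, 11], [4, 5, 6, 12, 13]]
print(strip_prompt(batch, 3))  # [[10, 11], [12, 13]]
```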
### Speech-to-Text with Chat Template

For instruction-following with audio, use the chat template with audio directly in the conversation format:

```python
from transformers import GraniteSpeechForConditionalGeneration, GraniteSpeechProcessor
from datasets import load_dataset, Audio
import torch

model = GraniteSpeechForConditionalGeneration.from_pretrained(
    "ibm-granite/granite-3.2-8b-speech",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = GraniteSpeechProcessor.from_pretrained("ibm-granite/granite-3.2-8b-speech")

# Load audio from dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio = ds["audio"][0]

# Prepare conversation with audio and text
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio},
            {"type": "text", "text": "Transcribe the following audio:"},
        ],
    }
]

# Apply chat template with audio - the processor handles both tokenization and
# audio processing; return_dict=True makes the result unpackable into generate()
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate transcription
generated_ids = model.generate(**inputs, max_new_tokens=512)
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output_text)
```

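The conversation format is a list of role/content turns, where each content entry is typed `"audio"` or `"text"` and carries its payload under the same key as its type. A small structural check like the following hypothetical helper (not part of transformers) can catch malformed messages before they reach `apply_chat_template`:

```python
def validate_conversation(conversation):
    """Minimal structural check for the chat message format (hypothetical helper)."""
    for turn in conversation:
        assert turn["role"] in {"system", "user", "assistant"}, "unknown role"
        for part in turn["content"]:
            assert part["type"] in {"audio", "text"}, "unknown content type"
            # Each part carries its payload under the same key as its type
            assert part["type"] in part, "missing payload"
    return True

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "<waveform placeholder>"},
            {"type": "text", "text": "Transcribe the following audio:"},
        ],
    }
]
print(validate_conversation(conversation))  # True
```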
### Batch Processing

Process multiple audio files efficiently:

```python
from transformers import GraniteSpeechForConditionalGeneration, GraniteSpeechProcessor
from datasets import load_dataset, Audio
import torch

model = GraniteSpeechForConditionalGeneration.from_pretrained(
    "ibm-granite/granite-3.2-8b-speech",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = GraniteSpeechProcessor.from_pretrained("ibm-granite/granite-3.2-8b-speech")

# Load multiple audio samples from dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio_samples = [ds["audio"][i]["array"] for i in range(3)]

# Process batch
inputs = processor(audio=audio_samples, return_tensors="pt", padding=True).to(model.device)

# Generate for all inputs
generated_ids = model.generate(**inputs, max_new_tokens=256)
transcriptions = processor.batch_decode(generated_ids, skip_special_tokens=True)

for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")
```
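When sanity-checking batch transcriptions against the dataset's reference `text` field, a word-level edit distance (the numerator of word error rate) is a handy metric; a stdlib-only sketch:

```python
def word_edits(ref, hyp):
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # edits against an empty reference prefix
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cost = 0 if rw == hw else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1]

print(word_edits("the cat sat", "the cat sat"))                    # 0
print(word_edits("the cat sat on the mat", "the bat sat on mat"))  # 2
```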

- <!-- TODO (@alex-jw-brooks) Add an example here once the model compatible with the transformers implementation is released -->

## GraniteSpeechConfig
