Summary
Setting audio_in_sample_rate=8000 in PipelineParams (as recommended by the Twilio WebSocket integration guide) silently breaks Smart Turn v3 end-of-turn detection. The model receives 8kHz audio but its internal WhisperFeatureExtractor hardcodes sampling_rate=16000, causing it to interpret speech at 2x speed with shifted pitch. This leads to aggressive/incorrect turn predictions with no errors or warnings.
The Documentation Conflict
The Twilio guide recommends:
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_in_sample_rate=8000,
        audio_out_sample_rate=8000,
    ),
)

The Smart Turn v3 README states:
Smart Turn takes 16kHz PCM audio as input.
Since Smart Turn is auto-injected as the default stop strategy via UserTurnStrategies.__post_init__, anyone following the Twilio docs will unknowingly break turn detection.
Root Cause
- TwilioFrameSerializer decodes Twilio's 8kHz μ-law audio and resamples it to the pipeline's configured audio_in_sample_rate
- When audio_in_sample_rate=16000 (the default), the serializer upsamples to 16kHz, so Smart Turn works correctly
- When audio_in_sample_rate=8000, the serializer outputs 8kHz PCM, so Smart Turn receives 8kHz audio
- LocalSmartTurnAnalyzerV3._predict_endpoint uses WhisperFeatureExtractor(sampling_rate=16000) with no internal resampling
- The model processes 8kHz audio as if it were 16kHz: no error, no warning, just wrong classifications
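To make the mismatch concrete, here is a minimal, standard-library-only sketch of what happens when samples captured at 8kHz are read as if they were 16kHz. It does not touch pipecat or the Smart Turn model; the helper names (sine_wave, apparent_frequency) and the 440 Hz test tone are assumptions chosen purely for illustration.

```python
import math


def sine_wave(freq_hz, sample_rate, duration_s):
    """Generate a sine tone as a list of float samples."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def apparent_frequency(samples, assumed_rate):
    """Estimate frequency by counting positive-going zero crossings.

    The result depends entirely on the rate the consumer *assumes*,
    not on the rate the audio was actually captured at.
    """
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    duration_s = len(samples) / assumed_rate
    return crossings / duration_s


# One second of a 440 Hz tone, captured at 8 kHz (hypothetical test signal).
tone = sine_wave(440.0, 8000, 1.0)

# Interpreted at the true rate: full duration, pitch close to 440 Hz.
true_duration = len(tone) / 8000
true_freq = apparent_frequency(tone, 8000)

# Interpreted as 16 kHz: half the duration and twice the apparent pitch.
assumed_duration = len(tone) / 16000
assumed_freq = apparent_frequency(tone, 16000)

print(f"read as  8 kHz: {true_duration:.2f}s, ~{true_freq:.0f} Hz")
print(f"read as 16 kHz: {assumed_duration:.2f}s, ~{assumed_freq:.0f} Hz")
```

The same buffer yields half the duration and double the apparent frequency when the consumer assumes 16kHz, which is the "2x speed with shifted pitch" effect described in the summary.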
Evidence
We ran A/B tests comparing Smart Turn predictions on the same utterances at 8kHz vs 16kHz:
- 6 out of 20 utterances had classification flips (Complete ↔ Incomplete)
- Max probability delta: 0.9391 (a confident "complete" at 8kHz flipped to confident "incomplete" at 16kHz)
- Mean turn duration dropped 51% in production (2.33s → 1.14s) when using 8kHz
- Phone numbers were fragmented across multiple turns — the model couldn't recognize digit sequences as incomplete at 8kHz
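For anyone wanting to repeat the comparison, the two headline metrics can be computed roughly as follows. This is a sketch, not the script we ran: the function names are hypothetical, and the 0.5 decision threshold for "Complete" is an assumption rather than a value taken from the Smart Turn code.

```python
def compare_predictions(probs_8k, probs_16k, threshold=0.5):
    """Compare per-utterance 'complete' probabilities from two runs.

    Returns (number of classification flips, max absolute probability delta).
    The threshold is an assumed decision boundary, not a documented constant.
    """
    if len(probs_8k) != len(probs_16k):
        raise ValueError("both runs must cover the same utterances")
    pairs = list(zip(probs_8k, probs_16k))
    # A flip is any utterance that lands on different sides of the threshold.
    flips = sum((a > threshold) != (b > threshold) for a, b in pairs)
    max_delta = max((abs(a - b) for a, b in pairs), default=0.0)
    return flips, max_delta


def relative_drop(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before
```

As a sanity check on the production figure above, relative_drop(2.33, 1.14) comes out at roughly 0.51, consistent with the reported 51% drop in mean turn duration.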
Community Impact
Multiple Discord threads and issues show users hitting this without identifying the root cause:
- "Chipmunk audio" reports with Twilio + OpenAI Realtime (audio at 2x speed)
- "Issues with Local Smart Turn Analyzer" — Smart Turn marking turns COMPLETE incorrectly
- "Premature turn finalization" — bot replying twice per utterance
- Telnyx users finding mysterious "2x mismatch" workarounds
- Support staff giving the correct workaround ("just don't set it") without explaining the Smart Turn connection
Suggested Fixes
- Update Twilio/telephony docs to warn against setting audio_in_sample_rate=8000 when using Smart Turn, or recommend only setting audio_out_sample_rate=8000
- Add a runtime warning in LocalSmartTurnAnalyzerV3 when set_sample_rate is called with a value != 16000
- Add internal resampling in the Smart Turn analyzer to handle non-16kHz input gracefully
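The second and third suggestions could look something like the sketch below. It is deliberately standalone: ensure_smart_turn_rate and resample_linear are hypothetical helpers, not existing pipecat functions, and the linear interpolation is only a placeholder for whatever resampler the project would actually use (a production fix would want a proper band-limited resampler).

```python
import warnings

SMART_TURN_RATE = 16000  # input rate stated in the Smart Turn v3 README


def resample_linear(samples, src_rate, dst_rate):
    """Resample a sequence of floats by linear interpolation (illustrative only)."""
    n = len(samples)
    if src_rate == dst_rate or n == 0:
        return list(samples)
    out_len = int(n * dst_rate / src_rate)
    step = src_rate / dst_rate
    out = []
    for j in range(out_len):
        # Clamp to the last sample so we never extrapolate past the input.
        pos = min(j * step, n - 1)
        i = int(pos)
        nxt = min(i + 1, n - 1)
        frac = pos - i
        out.append(samples[i] * (1.0 - frac) + samples[nxt] * frac)
    return out


def ensure_smart_turn_rate(samples, sample_rate):
    """Warn on a rate mismatch and return audio at 16 kHz (hypothetical guard)."""
    if sample_rate != SMART_TURN_RATE:
        warnings.warn(
            f"Smart Turn expects {SMART_TURN_RATE} Hz audio but received "
            f"{sample_rate} Hz; resampling before inference.",
            RuntimeWarning,
            stacklevel=2,
        )
        return resample_linear(samples, sample_rate, SMART_TURN_RATE)
    return list(samples)
```

Even the warning alone would have surfaced this problem immediately; the resampling step would additionally make the 8kHz configuration safe to use.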
Workaround
Leave audio_in_sample_rate at the default (16000). Only set audio_out_sample_rate=8000 for telephony. The TwilioFrameSerializer handles the 8kHz → 16kHz upsampling internally.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_out_sample_rate=8000,  # Twilio needs 8kHz output
        # Do NOT set audio_in_sample_rate=8000: it breaks Smart Turn v3
    ),
)

Environment
- pipecat-ai 0.0.103
- LocalSmartTurnAnalyzerV3 (bundled ONNX weights)
- TwilioFrameSerializer
- Silero VAD with stop_secs=0.2