Smart Turn v3 silently breaks when audio_in_sample_rate=8000 (Twilio/telephony) #3844

@drew-royster


Summary

Setting audio_in_sample_rate=8000 in PipelineParams (as recommended by the Twilio WebSocket integration guide) silently breaks Smart Turn v3 end-of-turn detection. The model receives 8kHz audio, but its internal WhisperFeatureExtractor hardcodes sampling_rate=16000, so the audio is treated as 16kHz: speech appears to run at 2x speed with its pitch shifted up an octave. The result is premature or otherwise incorrect turn predictions, with no errors or warnings.

The Documentation Conflict

The Twilio guide recommends:

task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_in_sample_rate=8000,
        audio_out_sample_rate=8000,
    ),
)

The Smart Turn v3 README states:

Smart Turn takes 16kHz PCM audio as input.

Since Smart Turn is auto-injected as the default stop strategy via UserTurnStrategies.__post_init__, anyone following the Twilio docs will unknowingly break turn detection.

Root Cause

  1. TwilioFrameSerializer decodes Twilio's 8kHz μ-law audio and resamples it to the pipeline's configured audio_in_sample_rate
  2. When audio_in_sample_rate=16000 (the default), the serializer upsamples to 16kHz — Smart Turn works correctly
  3. When audio_in_sample_rate=8000, the serializer outputs 8kHz PCM — Smart Turn receives 8kHz audio
  4. LocalSmartTurnAnalyzerV3._predict_endpoint uses WhisperFeatureExtractor(sampling_rate=16000) with no internal resampling
  5. The model processes 8kHz audio as if it were 16kHz — no error, no warning, just wrong classifications
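
The effect in steps 4 and 5 can be reproduced without the model. The following is a minimal, standard-library-only sketch; the helper names sine_wave and estimate_frequency are illustrative and not part of pipecat or Smart Turn. It generates a 200 Hz tone sampled at 8kHz, then measures its frequency while assuming a 16kHz rate, which is the same mislabeling the feature extractor is subject to.

```python
import math


def sine_wave(freq_hz, sample_rate, num_samples, phase=0.1):
    """Return num_samples of a sine tone sampled at sample_rate.

    The small phase offset keeps samples away from exact zeros so that
    sign changes are unambiguous.
    """
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate + phase)
        for n in range(num_samples)
    ]


def estimate_frequency(samples, assumed_rate):
    """Estimate pitch from zero crossings, trusting assumed_rate."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration_s = len(samples) / assumed_rate
    return crossings / (2 * duration_s)


# One second of a 200 Hz tone, actually sampled at 8kHz.
tone_8k = sine_wave(200, 8000, 8000)

correct = estimate_frequency(tone_8k, 8000)      # rate labeled correctly
mislabeled = estimate_frequency(tone_8k, 16000)  # 8kHz samples treated as 16kHz

print(f"labeled 8kHz:  ~{correct:.0f} Hz over {len(tone_8k) / 8000:.2f}s")
print(f"labeled 16kHz: ~{mislabeled:.0f} Hz over {len(tone_8k) / 16000:.2f}s")
```

The same samples read as twice the pitch and half the duration once the rate is mislabeled, which matches the "2x speed" symptom in the summary.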

Evidence

We ran A/B tests comparing Smart Turn predictions on the same utterances at 8kHz vs 16kHz:

  • 6 out of 20 utterances had classification flips (Complete ↔ Incomplete)
  • Max probability delta: 0.9391 (a confident "complete" at 8kHz flipped to confident "incomplete" at 16kHz)
  • Mean turn duration dropped 51% in production (2.33s → 1.14s) when using 8kHz
  • Phone numbers were fragmented across multiple turns — the model couldn't recognize digit sequences as incomplete at 8kHz
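
For anyone who wants to repeat the comparison, the flip count and max delta can be computed from paired probabilities roughly as follows. This is a sketch, not the script we ran: compare_predictions is a hypothetical helper, the 0.5 threshold is an assumption about where Complete/Incomplete is cut, and the example values are placeholders rather than the measurements reported above.

```python
def compare_predictions(pairs, threshold=0.5):
    """Summarize paired completion probabilities.

    pairs: list of (prob_at_8k, prob_at_16k) for the same utterance.
    threshold: assumed cutoff above which a turn counts as Complete.
    """
    flips = 0
    max_delta = 0.0
    for prob_8k, prob_16k in pairs:
        # A flip means the two rates land on opposite sides of the cutoff.
        if (prob_8k > threshold) != (prob_16k > threshold):
            flips += 1
        max_delta = max(max_delta, abs(prob_8k - prob_16k))
    return {"total": len(pairs), "flips": flips, "max_delta": max_delta}


# Placeholder values for illustration only, NOT the measurements from this issue.
example_pairs = [(0.90, 0.20), (0.80, 0.70), (0.30, 0.60)]
summary = compare_predictions(example_pairs)
print(summary)
```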

Community Impact

Multiple Discord threads and issues show users hitting this without identifying the root cause:

  • "Chipmunk audio" reports with Twilio + OpenAI Realtime (audio at 2x speed)
  • "Issues with Local Smart Turn Analyzer" — Smart Turn marking turns COMPLETE incorrectly
  • "Premature turn finalization" — bot replying twice per utterance
  • Telnyx users finding mysterious "2x mismatch" workarounds
  • Support staff giving the correct workaround ("just don't set it") without explaining the Smart Turn connection

Suggested Fixes

  1. Update Twilio/telephony docs to warn against setting audio_in_sample_rate=8000 when using Smart Turn, or recommend only setting audio_out_sample_rate=8000
  2. Add a runtime warning in LocalSmartTurnAnalyzerV3 when set_sample_rate is called with a value != 16000
  3. Add internal resampling in the Smart Turn analyzer to handle non-16kHz input gracefully
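
Fixes 2 and 3 could be combined into a small guard in front of inference. The sketch below is standalone and does not use pipecat's actual classes: prepare_for_turn_model and resample_linear are hypothetical names, and the linear interpolation is only a stand-in for a proper resampler (it applies no anti-aliasing filter).

```python
import warnings

TARGET_RATE = 16000  # rate the Smart Turn v3 README specifies


def resample_linear(samples, source_rate, target_rate=TARGET_RATE):
    """Resample a list of floats by linear interpolation.

    Simple stand-in for a real resampler; no anti-aliasing filter is applied.
    """
    if source_rate == target_rate or not samples:
        return list(samples)
    out_len = round(len(samples) * target_rate / source_rate)
    step = source_rate / target_rate
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)  # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def prepare_for_turn_model(samples, sample_rate):
    """Warn on a rate mismatch and return audio at TARGET_RATE."""
    if sample_rate != TARGET_RATE:
        warnings.warn(
            f"Turn model expects {TARGET_RATE} Hz audio but got {sample_rate} Hz; "
            "resampling before inference.",
            RuntimeWarning,
            stacklevel=2,
        )
    return resample_linear(samples, sample_rate)


# Four samples at 8kHz become eight samples at 16kHz, with a warning emitted.
audio_8k = [0.0, 2.0, 4.0, 6.0]
audio_16k = prepare_for_turn_model(audio_8k, 8000)
print(len(audio_8k), "->", len(audio_16k))
```

Even if only the warning were adopted, the mismatch would stop being silent, which is the main problem for anyone following the telephony docs.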

Workaround

Leave audio_in_sample_rate at the default (16000). Only set audio_out_sample_rate=8000 for telephony. The TwilioFrameSerializer handles the 8kHz → 16kHz upsampling internally.

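from pipecat.pipeline.task import PipelineParams, PipelineTask

# `pipeline` is the Pipeline built earlier in the bot setup.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_out_sample_rate=8000,  # Twilio needs 8kHz output
        # Do NOT set audio_in_sample_rate=8000; it breaks Smart Turn v3
    ),
)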

Environment

  • pipecat-ai 0.0.103
  • LocalSmartTurnAnalyzerV3 (bundled ONNX weights)
  • TwilioFrameSerializer
  • Silero VAD with stop_secs=0.2
