An extensible, lightweight browser SDK for building AI voice agents. CompositeVoice provides a unified interface for Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS) providers with support for both REST and WebSocket communication patterns.
```bash
npm install @lukeocodes/composite-voice
# or
pnpm add @lukeocodes/composite-voice
# or
yarn add @lukeocodes/composite-voice
```

Install provider SDKs as needed:
```bash
# For OpenAI providers
pnpm add openai

# For Anthropic LLM
pnpm add @anthropic-ai/sdk

# For Deepgram providers
pnpm add @deepgram/sdk
```

Quick start:

```typescript
import { CompositeVoice, NativeSTT, NativeTTS } from '@lukeocodes/composite-voice';
import { OpenAILLM } from '@lukeocodes/composite-voice/providers/llm/openai';
// Create a simple voice agent using browser APIs
const agent = new CompositeVoice({
  mode: 'composite',
  stt: new NativeSTT({ language: 'en-US' }),
  llm: new OpenAILLM({
    apiKey: 'your-api-key',
    model: 'gpt-4',
  }),
  tts: new NativeTTS({ voice: 'Google US English' }),
  audio: {
    input: { sampleRate: 16000 },
    output: { bufferSize: 4096 },
  },
});

// Initialize the agent
await agent.initialize();

// Listen for events
agent.on('transcription.final', (event) => {
  console.log('You said:', event.text);
});

agent.on('llm.complete', (event) => {
  console.log('AI responded:', event.text);
});

agent.on('agent.stateChange', (event) => {
  console.log('State changed:', event.previousState, '->', event.state);
});

// Start listening for user input
await agent.startListening();

// When done, stop listening
await agent.stopListening();

// Clean up
await agent.dispose();
```

Using separate cloud providers:

```typescript
import { CompositeVoice } from '@lukeocodes/composite-voice';
import { DeepgramSTT } from '@lukeocodes/composite-voice/providers/stt/deepgram';
import { OpenAILLM } from '@lukeocodes/composite-voice/providers/llm/openai';
import { ElevenLabsTTS } from '@lukeocodes/composite-voice/providers/tts/elevenlabs';
const agent = new CompositeVoice({
  mode: 'composite',
  stt: new DeepgramSTT({
    apiKey: process.env.DEEPGRAM_API_KEY,
    model: 'nova-2',
    language: 'en-US',
  }),
  llm: new OpenAILLM({
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4-turbo',
    temperature: 0.7,
    systemPrompt: 'You are a helpful voice assistant.',
  }),
  tts: new ElevenLabsTTS({
    apiKey: process.env.ELEVENLABS_API_KEY,
    voice: 'adam',
  }),
});

await agent.initialize();
await agent.startListening();
```

Or use a single all-in-one provider:

```typescript
import { CompositeVoice } from '@lukeocodes/composite-voice';
import { DeepgramAura } from '@lukeocodes/composite-voice/providers/all-in-one/deepgram';
const agent = new CompositeVoice({
  mode: 'all-in-one',
  provider: new DeepgramAura({
    apiKey: process.env.DEEPGRAM_API_KEY,
    model: 'aura-asteria-en',
    systemPrompt: 'You are a helpful assistant.',
  }),
});

await agent.initialize();
await agent.startListening();
```

CompositeVoice supports two modes:
- Composite mode: uses separate providers for STT, LLM, and TTS. This provides maximum flexibility and allows mixing providers from different services.

  ```
  User Speech → STT Provider → LLM Provider → TTS Provider → Audio Output
  ```

- All-in-one mode: uses a single provider that handles the entire pipeline (STT → LLM → TTS). This provides lower latency and simpler configuration.

  ```
  User Speech → All-in-One Provider → Audio Output
  ```
The SDK uses a type-safe event system to communicate with your application:
Agent events:

- `agent.ready`: SDK is initialized and ready
- `agent.stateChange`: Agent state changed
- `agent.error`: System-level error occurred

Transcription events:

- `transcription.start`: Transcription started
- `transcription.interim`: Partial transcription (streaming only)
- `transcription.final`: Complete transcription
- `transcription.error`: Transcription error

LLM events:

- `llm.start`: LLM processing started
- `llm.chunk`: Text chunk received (streaming)
- `llm.complete`: LLM response complete
- `llm.error`: LLM error

TTS events:

- `tts.start`: TTS generation started
- `tts.audio`: Audio chunk ready
- `tts.metadata`: Audio metadata received
- `tts.complete`: TTS generation complete
- `tts.error`: TTS error

Audio events:

- `audio.capture.start`: Microphone capture started
- `audio.capture.stop`: Microphone capture stopped
- `audio.capture.error`: Audio capture error
- `audio.playback.start`: Audio playback started
- `audio.playback.end`: Audio playback ended
- `audio.playback.error`: Audio playback error
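For example, interim transcriptions and LLM chunks can drive a live UI, and the error events can share one handler. A minimal sketch, assuming the streaming events carry a `text` field as the quick-start events do (the DOM element is illustrative):

```typescript
const captionEl = document.getElementById('caption')!; // hypothetical caption element

// Show partial results as they stream in, then swap in the final transcript
agent.on('transcription.interim', (event) => {
  captionEl.textContent = event.text;
});
agent.on('transcription.final', (event) => {
  captionEl.textContent = event.text;
});

// Accumulate the streamed LLM response chunk by chunk
let response = '';
agent.on('llm.chunk', (event) => {
  response += event.text; // assumes chunks carry a `text` field like other events
});

// Centralize error handling across the pipeline
const logError = (source: string) => (event: unknown) => console.error(`[${source}]`, event);
agent.on('agent.error', logError('agent'));
agent.on('transcription.error', logError('stt'));
agent.on('llm.error', logError('llm'));
agent.on('tts.error', logError('tts'));
```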
The agent transitions through these states:
- `idle`: Not initialized
- `ready`: Initialized and ready for interaction
- `listening`: Actively capturing audio
- `thinking`: Processing input with LLM
- `speaking`: Playing back audio response
- `error`: Error state (can recover)
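Because state changes are emitted as `agent.stateChange` events, a UI can track them directly. A minimal sketch (the status element and labels are illustrative):

```typescript
const statusEl = document.getElementById('status')!; // hypothetical status element

// Map each documented agent state to a user-facing label
const labels: Record<string, string> = {
  idle: 'Starting…',
  ready: 'Tap to talk',
  listening: 'Listening…',
  thinking: 'Thinking…',
  speaking: 'Speaking…',
  error: 'Something went wrong',
};

agent.on('agent.stateChange', (event) => {
  statusEl.textContent = labels[event.state] ?? event.state;
});
```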
STT providers:

- `NativeSTT`: Browser Web Speech API (no API key required)
- `DeepgramSTT`: Deepgram streaming STT (requires `@deepgram/sdk`)
- `OpenAISTT`: OpenAI Whisper (requires `openai`)

LLM providers:

- `OpenAILLM`: OpenAI GPT models (requires `openai`)
- `AnthropicLLM`: Anthropic Claude models (requires `@anthropic-ai/sdk`)

TTS providers:

- `NativeTTS`: Browser Speech Synthesis API (no API key required)
- `DeepgramTTS`: Deepgram streaming TTS (requires `@deepgram/sdk`)
- `ElevenLabsTTS`: ElevenLabs voices (requires SDK)

All-in-one providers:

- `Deepgram`: Complete voice agent pipeline (requires `@deepgram/sdk`)
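Providers with the same role are interchangeable, so swapping Claude in for GPT is a one-line change. A sketch, assuming `AnthropicLLM` accepts the same common options and follows the same subpath import convention as `OpenAILLM`:

```typescript
// Assumed import path, following the pattern of the other providers
import { AnthropicLLM } from '@lukeocodes/composite-voice/providers/llm/anthropic';

const llm = new AnthropicLLM({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: 'claude-3-5-sonnet-20241022', // any Claude model id
  systemPrompt: 'You are a helpful voice assistant.',
});
```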
You can create custom providers by extending the base classes:
```typescript
import { BaseSTTProvider } from '@lukeocodes/composite-voice';

class MyCustomSTT extends BaseSTTProvider {
  protected async onInitialize(): Promise<void> {
    // Initialize your provider (open connections, load models, etc.)
  }

  protected async onDispose(): Promise<void> {
    // Clean up resources
  }

  async transcribe(audio: Blob): Promise<string> {
    // Implement transcription logic
    return 'transcribed text';
  }
}
```
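The custom provider then drops in wherever a built-in one is accepted, following the same config shape as the quick start:

```typescript
const agent = new CompositeVoice({
  mode: 'composite',
  stt: new MyCustomSTT(),
  llm: new OpenAILLM({ apiKey: 'your-api-key', model: 'gpt-4' }),
  tts: new NativeTTS({ voice: 'Google US English' }),
});
```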
Check the examples directory for complete, standalone example applications:

- Basic Browser - Simple HTML/JS with native browser APIs
- Vite + TypeScript - Modern setup with real providers
- Custom Provider - Coming soon
- All-in-One - Coming soon
Each example has its own README with detailed setup instructions.
Browser support:

- Chrome/Edge: Full support
- Firefox: Full support (with limitations on the Web Speech API)
- Safari: Partial support (Web Speech API limited)
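Because Web Speech API coverage varies, you may want to feature-detect before relying on the native providers. This is a plain browser-API check, independent of the SDK:

```typescript
// Check for the browser APIs behind NativeSTT and NativeTTS
const hasSpeechRecognition =
  'SpeechRecognition' in window || 'webkitSpeechRecognition' in window;
const hasSpeechSynthesis = 'speechSynthesis' in window;

if (!hasSpeechRecognition || !hasSpeechSynthesis) {
  // Fall back to cloud providers (e.g., DeepgramSTT / DeepgramTTS)
}
```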
Contributions are welcome! Please read our contributing guidelines first.
MIT © Luke Oliff
A warts-and-all experiment in developing a complex architecture almost entirely by AI-prompting a code editor: Cursor using claude-4.5-sonnet. See my prompt log for exported prompts.