
Add Web Audio API as LLM configuration for catalog loader #228

@jchris

Description

Problem

The current LLM catalog system supports only text and image generation models, with no support for audio generation or processing. This limits the types of AI-powered applications that can be built within the vibes ecosystem.

Current LLM Configuration Structure

The app currently loads LLM configurations from JSON files in app/llms/:

app/llms/
├── callai.json     # Text generation models
├── fireproof.json  # Database-related configs
└── image-gen.json  # Image generation models
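
For reference, a minimal sketch of how such a directory of JSON configs could be read. The loader's actual code is not shown in this issue, so loadCatalog and its return shape are illustrative:

import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

// Read every *.json config under app/llms/ and return the parsed list;
// a new web-audio.json would be picked up without loader changes.
export async function loadCatalog(dir = 'app/llms'): Promise<unknown[]> {
  const files = (await readdir(dir)).filter((f) => f.endsWith('.json'));
  return Promise.all(
    files.map(async (f) => JSON.parse(await readFile(path.join(dir, f), 'utf8')))
  );
}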

Proposed Addition

Add a new web-audio.json configuration file to support audio generation and processing models:

{
  "name": "Web Audio APIs",
  "description": "Audio generation and processing models",
  "models": [
    {
      "id": "elevenlabs-speech",
      "name": "ElevenLabs Text-to-Speech",
      "description": "High-quality voice synthesis",
      "type": "audio-generation",
      "category": "speech",
      "endpoint": "https://api.elevenlabs.io/v1/text-to-speech",
      "inputTypes": ["text"],
      "outputTypes": ["audio/mpeg", "audio/wav"],
      "capabilities": ["text-to-speech", "voice-cloning"],
      "pricing": {
        "model": "per-character",
        "free_tier": 10000,
        "paid_tiers": ["starter", "creator", "pro"]
      }
    },
    {
      "id": "openai-whisper",
      "name": "OpenAI Whisper",
      "description": "Speech recognition and transcription",
      "type": "audio-processing",
      "category": "transcription",
      "endpoint": "https://api.openai.com/v1/audio/transcriptions",
      "inputTypes": ["audio/mpeg", "audio/wav", "audio/mp4"],
      "outputTypes": ["text"],
      "capabilities": ["speech-to-text", "translation", "language-detection"],
      "pricing": {
        "model": "per-minute",
        "rate": 0.006
      }
    },
    {
      "id": "musicgen",
      "name": "Meta MusicGen",
      "description": "AI music generation from text prompts",
      "type": "audio-generation",
      "category": "music",
      "endpoint": "https://api.replicate.com/v1/predictions",
      "inputTypes": ["text"],
      "outputTypes": ["audio/wav"],
      "capabilities": ["music-generation", "melody-continuation"],
      "pricing": {
        "model": "per-generation",
        "rate": 0.025
      }
    },
    {
      "id": "bark-audio",
      "name": "Suno Bark",
      "description": "Text-to-audio generation with voice effects",
      "type": "audio-generation", 
      "category": "speech-effects",
      "endpoint": "https://api.replicate.com/v1/predictions",
      "inputTypes": ["text"],
      "outputTypes": ["audio/wav"],
      "capabilities": ["text-to-speech", "sound-effects", "background-music"],
      "pricing": {
        "model": "per-generation",
        "rate": 0.035
      }
    }
  ],
  "categories": {
    "speech": {
      "name": "Speech Synthesis",
      "description": "Text-to-speech and voice generation"
    },
    "transcription": {
      "name": "Speech Recognition",
      "description": "Audio-to-text transcription and analysis"
    },
    "music": {
      "name": "Music Generation",
      "description": "AI-powered music and sound creation"
    },
    "speech-effects": {
      "name": "Audio Effects",
      "description": "Voice effects and audio manipulation"
    }
  },
  "use_cases": [
    "Voice-over generation for videos",
    "Podcast transcription and editing",
    "Music creation for games and apps",
    "Accessibility features (text-to-speech)",
    "Language learning applications",
    "Audio content analysis",
    "Real-time voice effects",
    "Audio-based chatbots and assistants"
  ]
}
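
To type-check this file at load time, the top-level shape implied by the JSON above could be captured as follows (field names mirror the proposal; this interface does not exist in the repo yet):

export interface WebAudioCatalog {
  name: string;
  description: string;
  models: AudioModel[]; // see Implementation Details below
  categories: Record<string, { name: string; description: string }>;
  use_cases: string[];
}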

Benefits

For Developers

  • Audio-First Apps: Enable creation of voice-based applications
  • Accessibility: Built-in text-to-speech and speech-to-text capabilities
  • Creative Tools: Music and audio effect generation for creative apps
  • Multimodal Experiences: Combine text, image, and audio in a single application

For Users

  • Voice Interfaces: Natural voice interaction with AI applications
  • Audio Content: Podcast generators, voice-over tools, music creators
  • Accessibility: Screen readers, voice navigation, audio descriptions
  • Entertainment: Interactive audio experiences, voice games

For Ecosystem

  • Model Diversity: Expand beyond text/image to audio capabilities
  • Market Expansion: Reach audio-focused use cases and industries
  • Innovation: Enable new types of AI applications
  • Standardization: Consistent audio API patterns across the platform

Implementation Details

Catalog Loader Integration

// Extend existing catalog loader to include audio models
interface AudioModel extends BaseModel {
  type: 'audio-generation' | 'audio-processing';
  category: 'speech' | 'transcription' | 'music' | 'speech-effects';
  inputTypes: string[];
  outputTypes: string[];
  capabilities: string[];
}
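
Consumers of the merged catalog will need to tell audio entries apart from text/image ones; a small type guard (illustrative, not existing code) keeps that check in one place:

// Narrow a catalog entry to AudioModel based on its declared type.
export function isAudioModel(model: BaseModel): model is AudioModel {
  return model.type === 'audio-generation' || model.type === 'audio-processing';
}

// e.g. const audioModels = catalog.models.filter(isAudioModel);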

Web Audio API Integration

// Add audio handling utilities
export interface AudioConfig {
  sampleRate?: number;
  channels?: number;
  bitDepth?: number;
  format: 'wav' | 'mp3' | 'ogg';
}

export async function generateAudio(
  prompt: string,
  modelId: string,
  config?: AudioConfig
): Promise<AudioBuffer> {
  // Audio generation implementation
}

export async function processAudio(
  audioBuffer: AudioBuffer,
  modelId: string,
  operation: string
): Promise<string | AudioBuffer> {
  // Audio processing implementation
}
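
As a starting point, generateAudio could be fleshed out roughly as below. This is a hedged sketch: lookupModel is a hypothetical catalog accessor, the request payload is a placeholder (each provider's real API differs and requires auth headers), and decoding uses the standard Web Audio decodeAudioData:

// Sketch only: fetch raw audio bytes from the model endpoint and
// decode them into an AudioBuffer with the Web Audio API.
export async function generateAudio(
  prompt: string,
  modelId: string,
  config?: AudioConfig
): Promise<AudioBuffer> {
  const model = lookupModel(modelId); // hypothetical catalog accessor
  const res = await fetch(model.endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, ...config }), // placeholder payload shape
  });
  if (!res.ok) throw new Error(`Audio generation failed: ${res.status}`);
  const ctx = new AudioContext();
  return ctx.decodeAudioData(await res.arrayBuffer());
}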

Component Integration

// Audio player/recorder components for vibes apps
export function AudioPlayer({ src, controls }: AudioPlayerProps) {
  // Web Audio API integration
}

export function AudioRecorder({ onRecording }: AudioRecorderProps) {
  // MediaRecorder API integration
}
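
The recorder half maps naturally onto the MediaRecorder API. A minimal sketch follows; the prop name matches the stub above, the assumed prop shape is labeled, and the recording is handed back as a Blob, so callers would decode it before passing it to processAudio:

import { useRef } from 'react';

interface AudioRecorderProps {
  onRecording: (audio: Blob) => void; // assumed prop shape
}

export function AudioRecorder({ onRecording }: AudioRecorderProps) {
  const recorderRef = useRef<MediaRecorder | null>(null);

  const start = async () => {
    // Prompt for microphone access and begin capturing
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const recorder = new MediaRecorder(stream);
    const chunks: Blob[] = [];
    recorder.ondataavailable = (e) => chunks.push(e.data);
    recorder.onstop = () => {
      stream.getTracks().forEach((t) => t.stop()); // release the mic
      onRecording(new Blob(chunks, { type: recorder.mimeType }));
    };
    recorder.start();
    recorderRef.current = recorder;
  };

  return (
    <div>
      <button onClick={start}>Record</button>
      <button onClick={() => recorderRef.current?.stop()}>Stop</button>
    </div>
  );
}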

Technical Considerations

Browser Compatibility

  • Web Audio API support across modern browsers
  • Fallbacks for audio format compatibility
  • Progressive enhancement for audio features

Performance

  • Streaming audio for large generations
  • Audio compression and optimization
  • Efficient audio buffer management

Security

  • CORS handling for audio resources
  • Secure audio upload/download
  • User permission handling for microphone access (see the capability sketch below)
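
A capability check covering the compatibility and permission points above might look like this (the calls are standard Web APIs; the function itself is illustrative):

// Returns true only if the browser supports audio capture and the user
// grants microphone access; the probe stream is released immediately.
export async function canRecordAudio(): Promise<boolean> {
  if (typeof AudioContext === 'undefined' || !navigator.mediaDevices?.getUserMedia) {
    return false; // no Web Audio / capture support; degrade gracefully
  }
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach((t) => t.stop());
    return true;
  } catch {
    return false; // permission denied or no input device
  }
}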

Use Cases and Examples

Voice-Powered Chat Assistant

// Example vibe using audio LLMs
function VoiceChatBot() {
  const handleSpeech = async (audioBuffer) => {
    // Use Whisper for speech-to-text
    const text = await processAudio(audioBuffer, 'openai-whisper', 'transcribe');

    // Generate a text reply
    const reply = await generateText(text, 'claude-sonnet-4');

    // Convert the reply to speech
    const audioResponse = await generateAudio(reply, 'elevenlabs-speech');

    // Play the audio response
    playAudio(audioResponse);
  };

  return <AudioRecorder onRecording={handleSpeech} />;
}

Music Generation Tool

function MusicGenerator() {
  const [prompt, setPrompt] = useState('');
  const [generatedMusic, setGeneratedMusic] = useState(null);
  
  const generateMusic = async () => {
    const music = await generateAudio(prompt, 'musicgen', {
      format: 'wav',
      sampleRate: 44100
    });
    setGeneratedMusic(music);
  };
  
  return (
    <div>
      <input 
        value={prompt} 
        onChange={(e) => setPrompt(e.target.value)}
        placeholder="Describe the music you want..."
      />
      <button onClick={generateMusic}>Generate Music</button>
      {/* assumes the generated buffer is exposed as a playable src */}
      {generatedMusic && <AudioPlayer src={generatedMusic} controls />}
    </div>
  );
}

Migration and Rollout

Phase 1: Configuration Setup

  1. Create web-audio.json configuration file
  2. Update catalog loader to include audio models
  3. Add basic audio types and interfaces

Phase 2: Core Audio Support

  1. Implement audio generation/processing functions
  2. Add Web Audio API utilities
  3. Create basic audio components

Phase 3: Model Integration

  1. Integrate first audio model (e.g., ElevenLabs TTS)
  2. Test end-to-end audio workflows
  3. Create example audio-powered vibes

Phase 4: Ecosystem Expansion

  1. Add multiple audio model providers
  2. Create audio component library
  3. Document audio development patterns

Success Metrics

  • Number of audio-powered vibes created
  • Audio model API usage and adoption
  • Developer feedback on audio tooling
  • User engagement with audio features
  • Performance benchmarks for audio operations

Related Issues

  • Enhanced multimedia capabilities across the platform
  • Accessibility improvements with audio support
  • Creative tooling expansion beyond text/image

Next Steps

  1. Research audio model APIs and pricing
  2. Design audio configuration schema
  3. Create proof-of-concept audio integration
  4. Plan audio component architecture
  5. Test browser audio API compatibility
