feat(cli): introduce experimental voice mode architecture skeleton #20779
Open
Sangini-spec wants to merge 1 commit into google-gemini:main from Sangini-spec:feat/voice-architecture-skeleton
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| { | ||
| "permissions": { | ||
| "allow": [ | ||
| "Bash(done)", | ||
| "Bash(xargs grep -l \"mode\")", | ||
| "Bash(xargs grep -l \"hook\\\\|middleware\\\\|plugin\")", | ||
| "Bash(xargs grep -l \"mode\\\\|Mode\")", | ||
| "Bash(xargs grep -l \"SecurityModel\\\\|sandbox\")", | ||
| "Bash(xargs -I {} bash -c 'echo \"\"=== {} ===\"\" && head -50 \"\"{}\"\"')", | ||
| "Bash(node --version)", | ||
| "Bash(npm install)", | ||
| "Bash(npm run build)", | ||
| "Bash(npm start)", | ||
| "Bash(node packages/cli/bundle/gemini.js --version)", | ||
| "Bash(node bundle/gemini.js --version)", | ||
| "Bash(node bundle/gemini.js --help)", | ||
| "Bash(npm run build --workspace=@google/gemini-cli-core)", | ||
| "Bash(npm run build --workspace=@google/gemini-cli)", | ||
| "Bash(node bundle/gemini.js --voice)", | ||
| "Bash(npm run bundle)", | ||
| "Bash(npm start -- --voice)", | ||
| "Bash(node bundle/gemini.js)", | ||
| "Bash(echo \"EXIT CODE: $?\")", | ||
| "Bash(npm run test --workspace=@google/gemini-cli-core)", | ||
| "Bash(npm run test --workspace=@google/gemini-cli)", | ||
| "Bash(npx vitest run packages/cli/src/gemini.test.tsx)", | ||
| "Bash(git checkout -- 'packages/cli/src/ui/components/__snapshots__/ConfigInitDisplay.test.tsx.snap')", | ||
| "Bash(git checkout -- package-lock.json)" | ||
| ] | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| /** | ||
| * @license | ||
| * Copyright 2026 Google LLC | ||
| * SPDX-License-Identifier: Apache-2.0 | ||
| */ | ||
|
|
||
| /** | ||
| * Contracts for the Hands-Free Voice Mode pipeline. | ||
| * | ||
| * Architecture: | ||
| * Mic → AudioInputProvider → SpeechToTextAdapter → [Gemini API] → TextToSpeechAdapter → Speaker | ||
| * | ||
| * Each interface is designed to be swappable — the initial implementation | ||
| * will use no-op stubs, and real backends (native audio, WebSocket bridges, | ||
| * MCP audio servers, Gemini Live API) can be plugged in later. | ||
| */ | ||
|
|
||
| /** PCM audio chunk emitted by an AudioInputProvider. */ | ||
| export interface AudioChunk { | ||
| /** Raw PCM sample data. */ | ||
| readonly samples: Buffer; | ||
| /** Sample rate in Hz (e.g. 16000). */ | ||
| readonly sampleRate: number; | ||
| /** Number of audio channels (1 = mono, 2 = stereo). */ | ||
| readonly channels: number; | ||
| } | ||
|
|
||
| /** Captures audio from a microphone or other input device. */ | ||
| export interface AudioInputProvider { | ||
| /** Begin capturing audio. Implementations should emit chunks via the callback. */ | ||
| start(onChunk: (chunk: AudioChunk) => void): Promise<void>; | ||
| /** Stop capturing and release resources. */ | ||
| stop(): Promise<void>; | ||
| /** Whether the provider is currently capturing. */ | ||
| isActive(): boolean; | ||
| } | ||
|
|
||
| /** Converts an audio chunk to text (speech-to-text). */ | ||
| export interface SpeechToTextAdapter { | ||
| /** Transcribe a single audio chunk. Returns the transcribed text. */ | ||
| transcribe(chunk: AudioChunk): Promise<string>; | ||
| } | ||
|
|
||
| /** Converts text to audible speech (text-to-speech). */ | ||
| export interface TextToSpeechAdapter { | ||
| /** Synthesize text into audio and play it back. Resolves when playback ends. */ | ||
| speak(text: string): Promise<void>; | ||
| /** Interrupt any in-progress playback. */ | ||
| cancel(): Promise<void>; | ||
| } | ||
|
|
||
| /** Configuration for a voice session. */ | ||
| export interface VoiceSessionConfig { | ||
| /** Sample rate in Hz for audio capture (default: 16000). */ | ||
| sampleRate?: number; | ||
| /** Locale/language code for STT/TTS (e.g. "en-US"). */ | ||
| locale?: string; | ||
| } | ||
|
|
||
| /** Lifecycle states for the voice mode controller. */ | ||
| export enum VoiceState { | ||
| Idle = 'idle', | ||
| Listening = 'listening', | ||
| Processing = 'processing', | ||
| Speaking = 'speaking', | ||
| Error = 'error', | ||
| } |
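
The header comment in this file says the initial implementation will use no-op stubs. The following is a minimal sketch of what such stubs could look like, not the PR's code: the class names (`NoopAudioInputProvider`, `NoopSpeechToTextAdapter`, `NoopTextToSpeechAdapter`) and the `spoken` array are hypothetical, the interfaces are re-declared locally so the block stands alone, and `Uint8Array` is assumed in place of Node's `Buffer` (which is a `Uint8Array` subclass).

```typescript
// Local copies of the contracts from the diff above, so the sketch is self-contained.
// Assumption: Uint8Array stands in for Buffer to avoid a Node-specific type.
interface AudioChunk {
  readonly samples: Uint8Array;
  readonly sampleRate: number;
  readonly channels: number;
}

interface AudioInputProvider {
  start(onChunk: (chunk: AudioChunk) => void): Promise<void>;
  stop(): Promise<void>;
  isActive(): boolean;
}

interface SpeechToTextAdapter {
  transcribe(chunk: AudioChunk): Promise<string>;
}

interface TextToSpeechAdapter {
  speak(text: string): Promise<void>;
  cancel(): Promise<void>;
}

// Hypothetical stub: never emits a chunk, only tracks whether it was started.
class NoopAudioInputProvider implements AudioInputProvider {
  private active = false;

  async start(_onChunk: (chunk: AudioChunk) => void): Promise<void> {
    this.active = true;
  }

  async stop(): Promise<void> {
    this.active = false;
  }

  isActive(): boolean {
    return this.active;
  }
}

// Hypothetical stub: always reports that nothing was heard.
class NoopSpeechToTextAdapter implements SpeechToTextAdapter {
  async transcribe(_chunk: AudioChunk): Promise<string> {
    return '';
  }
}

// Hypothetical stub: records the requested text instead of playing audio.
class NoopTextToSpeechAdapter implements TextToSpeechAdapter {
  readonly spoken: string[] = [];

  async speak(text: string): Promise<void> {
    this.spoken.push(text);
  }

  async cancel(): Promise<void> {
    // Nothing is ever playing, so there is nothing to interrupt.
  }
}
```

Returning an empty string from the speech-to-text stub pairs with the controller in the next file, which treats a whitespace-only transcript as silence and goes straight back to listening.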
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| /** | ||
| * @license | ||
| * Copyright 2026 Google LLC | ||
| * SPDX-License-Identifier: Apache-2.0 | ||
| */ | ||
|
|
||
| import { debugLogger } from '../utils/debugLogger.js'; | ||
| import type { | ||
| AudioInputProvider, | ||
| SpeechToTextAdapter, | ||
| TextToSpeechAdapter, | ||
| VoiceSessionConfig, | ||
| } from './types.js'; | ||
| import { VoiceState } from './types.js'; | ||
|
|
||
| /** | ||
| * Orchestrates the voice mode lifecycle. | ||
| * | ||
| * Wires together an AudioInputProvider, SpeechToTextAdapter, and | ||
| * TextToSpeechAdapter into a coherent listen→transcribe→respond→speak loop. | ||
| * | ||
| * This is a skeleton — real audio backends will be injected later. | ||
| * The controller is intentionally thin so it can be tested without hardware. | ||
| */ | ||
| export class VoiceModeController { | ||
| private state: VoiceState = VoiceState.Idle; | ||
| private readonly audioInput: AudioInputProvider; | ||
| private readonly stt: SpeechToTextAdapter; | ||
| private readonly tts: TextToSpeechAdapter; | ||
| private readonly config: VoiceSessionConfig; | ||
|
|
||
| constructor( | ||
| audioInput: AudioInputProvider, | ||
| stt: SpeechToTextAdapter, | ||
| tts: TextToSpeechAdapter, | ||
| config: VoiceSessionConfig = {}, | ||
| ) { | ||
| this.audioInput = audioInput; | ||
| this.stt = stt; | ||
| this.tts = tts; | ||
| this.config = config; | ||
| } | ||
|
|
||
| /** Current lifecycle state. */ | ||
| getState(): VoiceState { | ||
| return this.state; | ||
| } | ||
|
|
||
| /** | ||
| * Start the voice session. | ||
| * Opens the audio input and begins the listen loop. | ||
| */ | ||
| async start(): Promise<void> { | ||
| if (this.state !== VoiceState.Idle) { | ||
| debugLogger.warn( | ||
| `VoiceModeController.start() called in state "${this.state}", ignoring.`, | ||
| ); | ||
| return; | ||
| } | ||
|
|
||
| debugLogger.log( | ||
| `[voice] Starting voice mode (locale=${this.config.locale ?? 'default'}, ` + | ||
| `sampleRate=${String(this.config.sampleRate ?? 16000)})`, | ||
| ); | ||
|
|
||
| this.state = VoiceState.Listening; | ||
|
|
||
| await this.audioInput.start(async (chunk) => { | ||
| if (this.state !== VoiceState.Listening) return; | ||
|
|
||
| try { | ||
| this.state = VoiceState.Processing; | ||
| const transcript = await this.stt.transcribe(chunk); | ||
|
|
||
| if (transcript.trim().length === 0) { | ||
| this.state = VoiceState.Listening; | ||
| return; | ||
| } | ||
|
|
||
| debugLogger.log(`[voice] Transcript: "${transcript}"`); | ||
|
|
||
| // In the future this is where the transcript feeds into the Gemini | ||
| // conversation loop (GeminiClient.sendMessageStream). For now, we | ||
| // echo it back through TTS as a proof-of-lifecycle. | ||
| this.state = VoiceState.Speaking; | ||
| await this.tts.speak(transcript); | ||
| } catch (err) { | ||
| debugLogger.error('[voice] Error in voice pipeline:', err); | ||
| this.state = VoiceState.Error; | ||
| } finally { | ||
| if ( | ||
| this.state === VoiceState.Speaking || | ||
| this.state === VoiceState.Processing | ||
| ) { | ||
| this.state = VoiceState.Listening; | ||
| } | ||
| } | ||
| }); | ||
| } | ||
|
|
||
| /** Stop the voice session and release resources. */ | ||
| async stop(): Promise<void> { | ||
| debugLogger.log('[voice] Stopping voice mode.'); | ||
| await this.tts.cancel(); | ||
| await this.audioInput.stop(); | ||
| this.state = VoiceState.Idle; | ||
| } | ||
| } | ||
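
The class comment says the controller is kept thin so it can be tested without hardware. The sketch below illustrates one way that might be done, under stated assumptions: `MiniController` is a condensed stand-in for `VoiceModeController` (logging removed, string states instead of the `VoiceState` enum, a `trace` array added for inspection), and `FakeAudioInput` with its `emit` method is a hypothetical test double, not part of the PR.

```typescript
type State = 'idle' | 'listening' | 'processing' | 'speaking' | 'error';
type Chunk = { samples: Uint8Array; sampleRate: number; channels: number };
type OnChunk = (chunk: Chunk) => void | Promise<void>;

// Hypothetical fake: keeps the callback so a test can push chunks by hand.
class FakeAudioInput {
  private onChunk: OnChunk | undefined;

  async start(onChunk: OnChunk): Promise<void> {
    this.onChunk = onChunk;
  }

  async stop(): Promise<void> {
    this.onChunk = undefined;
  }

  // Awaits the callback so the caller can observe the finished state change.
  async emit(chunk: Chunk): Promise<void> {
    await this.onChunk?.(chunk);
  }
}

// Condensed stand-in for the controller loop in the diff above.
class MiniController {
  state: State = 'idle';
  readonly trace: State[] = [];
  private readonly input: FakeAudioInput;
  private readonly transcribe: (chunk: Chunk) => Promise<string>;
  private readonly speak: (text: string) => Promise<void>;

  constructor(
    input: FakeAudioInput,
    transcribe: (chunk: Chunk) => Promise<string>,
    speak: (text: string) => Promise<void>,
  ) {
    this.input = input;
    this.transcribe = transcribe;
    this.speak = speak;
  }

  private set(next: State): void {
    this.state = next;
    this.trace.push(next);
  }

  async start(): Promise<void> {
    if (this.state !== 'idle') return;
    this.set('listening');
    await this.input.start(async (chunk) => {
      if (this.state !== 'listening') return; // drop chunks while busy
      try {
        this.set('processing');
        const text = await this.transcribe(chunk);
        if (text.trim().length === 0) {
          this.set('listening');
          return;
        }
        this.set('speaking');
        await this.speak(text);
      } catch {
        this.set('error');
      } finally {
        if (this.state === 'speaking' || this.state === 'processing') {
          this.set('listening');
        }
      }
    });
  }

  async stop(): Promise<void> {
    await this.input.stop();
    this.set('idle');
  }
}

// Drives one chunk through the loop and returns what happened.
async function demo(): Promise<{ trace: State[]; spoken: string[] }> {
  const input = new FakeAudioInput();
  const spoken: string[] = [];
  const controller = new MiniController(
    input,
    async () => 'hello',
    async (text) => {
      spoken.push(text);
    },
  );
  await controller.start();
  await input.emit({ samples: new Uint8Array(2), sampleRate: 16000, channels: 1 });
  await controller.stop();
  return { trace: controller.trace, spoken };
}
```

One behavior worth noting when reading the diff: after a pipeline error the state stays at `Error`, later chunks are dropped by the `Listening` guard, and `start()` returns early because the state is not `Idle`, so only `stop()` brings the controller back to a startable state.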