Add FrameForge Studio voice command integration via Whisper.cpp + Llama.cpp#2

Merged

TheOriginalBytePlayer merged 5 commits into master from copilot/integrate-whisper-into-llama on Dec 29, 2025
Conversation


Copilot AI commented Dec 29, 2025

Description

Implements voice command system for FrameForge Studio using Whisper for speech-to-text and Llama for intent classification. Audio input is transcribed, parsed into structured JSON commands, validated, and sent to a 32-bit bridge via named pipes.

Architecture:

  • Schema: JSON command structure with 4 action groups (CAMERA_CONTROL, ACTOR_POSE, OBJECT_MGMT, SHOT_MGMT) and 14 verbs (PAN, TILT, DOLLY, etc.)
  • Intent Engine: Whisper transcribes audio → Llama classifies intent with structured prompt → outputs JSON with verb, subject, parameters
  • Validation: CommandValidator checks required parameters per verb, generates clarification requests for missing data
  • IPC: Cross-platform named pipes (Windows/Linux) with length-prefixed protocol

Key Features:

  • Handles misspellings (e.g., "PIN" → "PAN")
  • Supports complex pose commands with joint rotation arrays
  • Test mode for audio file processing
  • Server mode for IPC daemon

Example:

# Test mode
./frameforge-sidecar -wm whisper-base.en.bin -lm llama-3-8b.gguf -a command.wav

# Server mode
./frameforge-sidecar -wm whisper-base.en.bin -lm llama-3-8b.gguf -p frameforge_pipe

Input: "Pan the camera left"
Output: {"verb": "PAN", "subject": "Camera1", "parameters": {"direction": "LEFT"}}

Changes

  • Add tools/frameforge/: Schema, validator, JSON serialization, IPC implementation, main sidecar app
  • Add tests/test-frameforge-validator.cpp: Unit tests (9 test cases)
  • Update tools/CMakeLists.txt: Include frameforge subdirectory
  • Update tests/CMakeLists.txt: Add frameforge validator tests

Components:

  • frameforge-schema.{h,cpp}: Command schema with enum converters, action group mapping, required parameter lookup
  • frameforge-validator.{h,cpp}: Validation logic, JSON parsing, clarification request generation
  • frameforge-json.{h,cpp}: JSON serialization using vendored nlohmann/json
  • frameforge-ipc.{h,cpp}: Named pipe server/client with 1MB message limit
  • frameforge-sidecar.cpp: Main application integrating Whisper + Llama, WAV file reader, tokenization

Note: This PR contains AI-generated code that has been tested and validated. All tests pass with 100% success rate.

Original prompt

With whisper.cpp now a submodule of llama.cpp, review the original directive, which follows below, and integrate the whisper functionality as it describes into the same single project.

  1. The Schema: Define a JSON-based command schema. Every command must have a Verb (e.g., PAN, LEAN, ADD, DELETE, SHOT) and a Subject (e.g., Camera1, Tom). Verbs must have associated Parameters (Direction, Degrees, Speed, Target, PoseDescription).

  2. The Intent Engine: When audio is received and transcribed by Whisper, send the text to Llama-3 with a strict system prompt for Intent Classification:

'Analyze the following user input for FrameForge Studio. Map the input to one of these Action Groups: [CAMERA_CONTROL, ACTOR_POSE, OBJECT_MGMT, SHOT_MGMT].'

'Extract parameters. If the user says "PIN LEFT", match it to the nearest valid verb "PAN" with direction "LEFT".'

'If Action Group is ACTOR_POSE, generate a JSON array of joint rotations for the described pose.'

'Return ONLY a valid JSON object.'

  3. Command Validation Logic: Implement a C++ CommandValidator class that checks the LLM's JSON against the required slots for that specific verb. If a required slot (like Direction for PAN) is missing, return an error object asking the user for clarification.

  4. Communication: Maintain the 64-bit Sidecar as a resident process. Use a high-speed IPC (Inter-Process Communication) like Named Pipes to send the validated JSON objects to the 32-bit FrameForge Bridge.


Copilot AI changed the title [WIP] Integrate whisper functionality into llama.cpp project Add FrameForge Studio voice command integration via Whisper.cpp + Llama.cpp Dec 29, 2025
@TheOriginalBytePlayer TheOriginalBytePlayer marked this pull request as ready for review December 29, 2025 12:51
Copilot AI review requested due to automatic review settings December 29, 2025 12:51
@TheOriginalBytePlayer TheOriginalBytePlayer merged commit 27fd48b into master Dec 29, 2025
2 checks passed
@TheOriginalBytePlayer TheOriginalBytePlayer deleted the copilot/integrate-whisper-into-llama branch December 29, 2025 12:51

Copilot AI left a comment

Pull request overview

This PR adds voice command integration to FrameForge Studio by implementing a sidecar process that combines Whisper.cpp for speech-to-text transcription and Llama.cpp for intent classification. The system processes voice commands through a structured pipeline: audio transcription → intent classification → command validation → IPC transmission to a 32-bit bridge.

Key changes include:

  • Command schema system with 4 action groups and 14 verbs covering camera control, actor posing, object management, and shot management
  • CommandValidator with JSON parsing, parameter validation, and clarification request generation
  • Cross-platform IPC implementation using named pipes (Windows/Linux) with length-prefixed protocol

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 22 comments.

Show a summary per file
File Description
tools/frameforge/frameforge-schema.h Defines command schema enums (ActionGroup, Verb, Direction) and core data structures
tools/frameforge/frameforge-schema.cpp Implements enum conversions, action group mappings, and required parameter lookup with misspelling support
tools/frameforge/frameforge-validator.h Declares CommandValidator interface for validation and clarification generation
tools/frameforge/frameforge-validator.cpp Implements validation logic with JSON parsing and parameter checking
tools/frameforge/frameforge-json.h Declares JSON serialization/deserialization functions
tools/frameforge/frameforge-json.cpp Implements bidirectional JSON conversion for Command objects
tools/frameforge/frameforge-ipc.h Declares cross-platform IPC server/client classes
tools/frameforge/frameforge-ipc.cpp Implements named pipe communication with platform-specific code for Windows/Unix
tools/frameforge/frameforge-sidecar.cpp Main application integrating Whisper transcription, Llama classification, WAV processing, and command handling
tools/frameforge/CMakeLists.txt Build configuration linking common, llama, and whisper libraries
tools/CMakeLists.txt Adds frameforge subdirectory to build system
tools/frameforge/README.md Comprehensive documentation covering architecture, usage, examples, and development guidelines
tests/test-frameforge-validator.cpp Unit tests for schema, validation, and JSON serialization (9 test cases)
tests/CMakeLists.txt Adds frameforge validator test executable to test suite

Comment on lines +325 to +327
if (msg_size == 0 || msg_size > MAX_MESSAGE_SIZE) {
return "";
}

Copilot AI Dec 29, 2025

The message size validation allows sizes up to MAX_MESSAGE_SIZE (1MB), but there's no check to ensure msg_size is greater than 0 before attempting to read. While msg_size == 0 is checked, the allocation of an empty string and the read attempt could be avoided by returning early when msg_size is 0.

Owner
@copilot apply changes based on this feedback

Comment on lines +348 to +350
if (msg_size == 0 || msg_size > MAX_MESSAGE_SIZE) {
return "";
}

Copilot AI Dec 29, 2025

The same message size validation issue exists here. When msg_size is 0 or exceeds MAX_MESSAGE_SIZE, an early return happens after the check, but the logic could be clearer by checking msg_size > 0 before the upper bound check to avoid allocating an empty string.

Owner
@copilot apply changes based on this feedback

Comment on lines +36 to +39
1. If user says "PIN", map it to "PAN" verb
2. If Action Group is ACTOR_POSE, generate a JSON array of joint rotations for the described pose
3. Infer missing subjects when context is clear (e.g., "camera" for camera commands)
4. Return ONLY a valid JSON object with this structure:

Copilot AI Dec 29, 2025

The INTENT_SYSTEM_PROMPT describes mapping misspellings like "PIN" to "PAN", but this is already handled in the string_to_verb function. Having this instruction in the LLM prompt may be redundant and could cause confusion if the LLM tries to do the correction itself when it's already handled at the parsing layer.

Suggested change
- 1. If user says "PIN", map it to "PAN" verb
- 2. If Action Group is ACTOR_POSE, generate a JSON array of joint rotations for the described pose
- 3. Infer missing subjects when context is clear (e.g., "camera" for camera commands)
- 4. Return ONLY a valid JSON object with this structure:
+ 1. If Action Group is ACTOR_POSE, generate a JSON array of joint rotations for the described pose
+ 2. Infer missing subjects when context is clear (e.g., "camera" for camera commands)
+ 3. Return ONLY a valid JSON object with this structure:

Owner
@copilot Please implement this suggestion.

Comment on lines +157 to +162
pipe_fd_ = open(pipe_path.c_str(), O_RDWR | O_NONBLOCK);
if (pipe_fd_ < 0) {
std::cerr << "Failed to open named pipe: " << strerror(errno) << std::endl;
unlink(pipe_path.c_str());
return false;
}

Copilot AI Dec 29, 2025

The Unix pipe implementation opens the pipe with O_RDWR | O_NONBLOCK, but non-blocking mode could cause write and read operations to fail with EAGAIN/EWOULDBLOCK. The code doesn't handle these errors, which could lead to incomplete message transmission. Consider using blocking mode or implementing proper retry logic for non-blocking I/O.

Owner
@copilot apply changes based on this feedback

Comment on lines +224 to +227
// First get the required size
const int n_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true);
std::vector<llama_token> tokens(n_tokens);
if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(), true, true) < 0) {

Copilot AI Dec 29, 2025

The tokenization call uses negative return value handling that assumes the function returns a negative value to indicate the required buffer size. However, this pattern should be documented or verified against the llama.cpp API, as it's non-standard. The comment "First get the required size" suggests this is intentional, but the unusual pattern could lead to confusion.

Suggested change
- // First get the required size
- const int n_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true);
- std::vector<llama_token> tokens(n_tokens);
- if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(), true, true) < 0) {
+ // First get the required size by calling with a null buffer
+ const int n_tokens = llama_tokenize(vocab, prompt.c_str(), prompt.size(), nullptr, 0, true, true);
+ if (n_tokens < 0) {
+     fprintf(stderr, "Error: Failed to get token count for prompt\n");
+     return "";
+ }
+ std::vector<llama_token> tokens(n_tokens);
+ const int n_tokens_actual = llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), n_tokens, true, true);
+ if (n_tokens_actual < 0 || n_tokens_actual != n_tokens) {
Owner
@copilot please implement this suggestion

constexpr int WAV_HEADER_SIZE = 44;
constexpr int WAV_SAMPLE_RATE_OFFSET = 24;

char buf[256];

Copilot AI Dec 29, 2025

The WAV header reading uses a char buffer of size 256 but only reads WAV_HEADER_SIZE (44) bytes. The buffer size is unnecessarily large. Consider using a buffer sized exactly to WAV_HEADER_SIZE (44 bytes) to better reflect the actual usage.

Suggested change
- char buf[256];
+ char buf[WAV_HEADER_SIZE];

Owner
@copilot Please implement this suggestion

}

// Get sample rate from header (offset 24 in WAV spec)
sample_rate = *(int32_t *) (buf + WAV_SAMPLE_RATE_OFFSET);

Copilot AI Dec 29, 2025

The function uses pointer arithmetic (buf + WAV_SAMPLE_RATE_OFFSET) and casts to int32_t pointer without checking alignment. On some architectures, unaligned access can cause crashes or undefined behavior. Consider using memcpy to safely copy the 4 bytes into an int32_t variable.

Suggested change
- sample_rate = *(int32_t *) (buf + WAV_SAMPLE_RATE_OFFSET);
+ int32_t sample_rate_raw;
+ std::memcpy(&sample_rate_raw, buf + WAV_SAMPLE_RATE_OFFSET, sizeof(sample_rate_raw));
+ sample_rate = sample_rate_raw;

Owner
@copilot Please implement this suggestion

}
} else if (arg == "-t" || arg == "--threads") {
if (i + 1 < argc) {
params.n_threads = std::stoi(argv[++i]);

Copilot AI Dec 29, 2025

The stoi function can throw std::invalid_argument or std::out_of_range exceptions if the input is not a valid integer or is out of range. The code should validate the input or catch these exceptions to provide better error messages to the user.

Owner
@copilot apply changes based on this feedback

Comment on lines +139 to +140
- **Unix/Linux**: `/tmp/frameforge_pipe`


Copilot AI Dec 29, 2025

Using a fixed pipe name in /tmp for Unix (/tmp/frameforge_pipe) means any local user on the same host can open this IPC endpoint and send or read FrameForge control messages. This allows unauthorized local processes to impersonate the 32-bit bridge or eavesdrop on commands, breaking isolation between different users’ sessions. Consider creating the FIFO in a per-user directory and with restrictive permissions (e.g., owner-only) so that only the intended user/process can access this IPC channel.

Suggested change
- - **Unix/Linux**: `/tmp/frameforge_pipe`
+ - **Unix/Linux**: use a per-user FIFO, for example:
+   - `$XDG_RUNTIME_DIR/frameforge/frameforge_pipe`
+   - or `/tmp/frameforge-$UID/frameforge_pipe` if `XDG_RUNTIME_DIR` is not set
+ The per-user directory should be created with owner-only permissions (for example, `chmod 700` on the directory and `chmod 600` on the FIFO) so that only the intended user/process can access this IPC channel.

Owner

@copilot Please implement this suggestion

TheOriginalBytePlayer added a commit that referenced this pull request Dec 30, 2025
…olved-comments

Document unresolved code review comments from merged PRs #1 and #2