Add FrameForge Studio voice command integration via Whisper.cpp + Llama.cpp#2
Conversation
Co-authored-by: TheOriginalBytePlayer <[email protected]>
Co-authored-by: TheOriginalBytePlayer <[email protected]>
Co-authored-by: TheOriginalBytePlayer <[email protected]>
Co-authored-by: TheOriginalBytePlayer <[email protected]>
There was a problem hiding this comment.
Pull request overview
This PR adds voice command integration to FrameForge Studio by implementing a sidecar process that combines Whisper.cpp for speech-to-text transcription and Llama.cpp for intent classification. The system processes voice commands through a structured pipeline: audio transcription → intent classification → command validation → IPC transmission to a 32-bit bridge.
Key changes include:
- Command schema system with 4 action groups and 14 verbs covering camera control, actor posing, object management, and shot management
- CommandValidator with JSON parsing, parameter validation, and clarification request generation
- Cross-platform IPC implementation using named pipes (Windows/Linux) with length-prefixed protocol
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 22 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/frameforge/frameforge-schema.h | Defines command schema enums (ActionGroup, Verb, Direction) and core data structures |
| tools/frameforge/frameforge-schema.cpp | Implements enum conversions, action group mappings, and required parameter lookup with misspelling support |
| tools/frameforge/frameforge-validator.h | Declares CommandValidator interface for validation and clarification generation |
| tools/frameforge/frameforge-validator.cpp | Implements validation logic with JSON parsing and parameter checking |
| tools/frameforge/frameforge-json.h | Declares JSON serialization/deserialization functions |
| tools/frameforge/frameforge-json.cpp | Implements bidirectional JSON conversion for Command objects |
| tools/frameforge/frameforge-ipc.h | Declares cross-platform IPC server/client classes |
| tools/frameforge/frameforge-ipc.cpp | Implements named pipe communication with platform-specific code for Windows/Unix |
| tools/frameforge/frameforge-sidecar.cpp | Main application integrating Whisper transcription, Llama classification, WAV processing, and command handling |
| tools/frameforge/CMakeLists.txt | Build configuration linking common, llama, and whisper libraries |
| tools/CMakeLists.txt | Adds frameforge subdirectory to build system |
| tools/frameforge/README.md | Comprehensive documentation covering architecture, usage, examples, and development guidelines |
| tests/test-frameforge-validator.cpp | Unit tests for schema, validation, and JSON serialization (9 test cases) |
| tests/CMakeLists.txt | Adds frameforge validator test executable to test suite |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (msg_size == 0 || msg_size > MAX_MESSAGE_SIZE) { | ||
| return ""; | ||
| } |
There was a problem hiding this comment.
The message size validation allows sizes up to MAX_MESSAGE_SIZE (1MB), but there's no check to ensure msg_size is greater than 0 before attempting to read. While msg_size == 0 is checked, the allocation of an empty string and the read attempt could be avoided by returning early when msg_size is 0.
There was a problem hiding this comment.
@copilot apply changes based on this feedback
There was a problem hiding this comment.
@copilot apply changes based on this feedback
| if (msg_size == 0 || msg_size > MAX_MESSAGE_SIZE) { | ||
| return ""; | ||
| } |
There was a problem hiding this comment.
The same message size validation issue exists here. When msg_size is 0 or exceeds MAX_MESSAGE_SIZE, an early return happens after the check, but the logic could be clearer by checking msg_size > 0 before the upper bound check to avoid allocating an empty string.
There was a problem hiding this comment.
@copilot apply changes based on this feedback
| 1. If user says "PIN", map it to "PAN" verb | ||
| 2. If Action Group is ACTOR_POSE, generate a JSON array of joint rotations for the described pose | ||
| 3. Infer missing subjects when context is clear (e.g., "camera" for camera commands) | ||
| 4. Return ONLY a valid JSON object with this structure: |
There was a problem hiding this comment.
The INTENT_SYSTEM_PROMPT describes mapping misspellings like "PIN" to "PAN", but this is already handled in the string_to_verb function. Having this instruction in the LLM prompt may be redundant and could cause confusion if the LLM tries to do the correction itself when it's already handled at the parsing layer.
| 1. If user says "PIN", map it to "PAN" verb | |
| 2. If Action Group is ACTOR_POSE, generate a JSON array of joint rotations for the described pose | |
| 3. Infer missing subjects when context is clear (e.g., "camera" for camera commands) | |
| 4. Return ONLY a valid JSON object with this structure: | |
| 1. If Action Group is ACTOR_POSE, generate a JSON array of joint rotations for the described pose | |
| 2. Infer missing subjects when context is clear (e.g., "camera" for camera commands) | |
| 3. Return ONLY a valid JSON object with this structure: |
There was a problem hiding this comment.
@copilot Please implement this suggestion.
| pipe_fd_ = open(pipe_path.c_str(), O_RDWR | O_NONBLOCK); | ||
| if (pipe_fd_ < 0) { | ||
| std::cerr << "Failed to open named pipe: " << strerror(errno) << std::endl; | ||
| unlink(pipe_path.c_str()); | ||
| return false; | ||
| } |
There was a problem hiding this comment.
The Unix pipe implementation opens the pipe with O_RDWR | O_NONBLOCK, but non-blocking mode could cause write and read operations to fail with EAGAIN/EWOULDBLOCK. The code doesn't handle these errors, which could lead to incomplete message transmission. Consider using blocking mode or implementing proper retry logic for non-blocking I/O.
There was a problem hiding this comment.
@copilot apply changes based on this feedback
| // First get the required size | ||
| const int n_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true); | ||
| std::vector<llama_token> tokens(n_tokens); | ||
| if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(), true, true) < 0) { |
There was a problem hiding this comment.
The tokenization call uses negative return value handling that assumes the function returns a negative value to indicate the required buffer size. However, this pattern should be documented or verified against the llama.cpp API, as it's non-standard. The comment "First get the required size" suggests this is intentional, but the unusual pattern could lead to confusion.
| // First get the required size | |
| const int n_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true); | |
| std::vector<llama_token> tokens(n_tokens); | |
| if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(), true, true) < 0) { | |
| // First get the required size by calling with a null buffer | |
| const int n_tokens = llama_tokenize(vocab, prompt.c_str(), prompt.size(), nullptr, 0, true, true); | |
| if (n_tokens < 0) { | |
| fprintf(stderr, "Error: Failed to get token count for prompt\n"); | |
| return ""; | |
| } | |
| std::vector<llama_token> tokens(n_tokens); | |
| const int n_tokens_actual = llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), n_tokens, true, true); | |
| if (n_tokens_actual < 0 || n_tokens_actual != n_tokens) { |
There was a problem hiding this comment.
@copilot please implement this suggestion
| constexpr int WAV_HEADER_SIZE = 44; | ||
| constexpr int WAV_SAMPLE_RATE_OFFSET = 24; | ||
|
|
||
| char buf[256]; |
There was a problem hiding this comment.
The WAV header reading uses a char buffer of size 256 but only reads WAV_HEADER_SIZE (44) bytes. The buffer size is unnecessarily large. Consider using a buffer sized exactly to WAV_HEADER_SIZE (44 bytes) to better reflect the actual usage.
| char buf[256]; | |
| char buf[WAV_HEADER_SIZE]; |
There was a problem hiding this comment.
@copilot Please implement this suggestion
| } | ||
|
|
||
| // Get sample rate from header (offset 24 in WAV spec) | ||
| sample_rate = *(int32_t *) (buf + WAV_SAMPLE_RATE_OFFSET); |
There was a problem hiding this comment.
The function uses pointer arithmetic (buf + WAV_SAMPLE_RATE_OFFSET) and casts to int32_t pointer without checking alignment. On some architectures, unaligned access can cause crashes or undefined behavior. Consider using memcpy to safely copy the 4 bytes into an int32_t variable.
| sample_rate = *(int32_t *) (buf + WAV_SAMPLE_RATE_OFFSET); | |
| int32_t sample_rate_raw; | |
| std::memcpy(&sample_rate_raw, buf + WAV_SAMPLE_RATE_OFFSET, sizeof(sample_rate_raw)); | |
| sample_rate = sample_rate_raw; |
There was a problem hiding this comment.
@copilot Please implement this suggestion
| } | ||
| } else if (arg == "-t" || arg == "--threads") { | ||
| if (i + 1 < argc) { | ||
| params.n_threads = std::stoi(argv[++i]); |
There was a problem hiding this comment.
The stoi function can throw std::invalid_argument or std::out_of_range exceptions if the input is not a valid integer or is out of range. The code should validate the input or catch these exceptions to provide better error messages to the user.
There was a problem hiding this comment.
@copilot apply changes based on this feedback
| - **Unix/Linux**: `/tmp/frameforge_pipe` | ||
|
|
There was a problem hiding this comment.
Using a fixed pipe name in /tmp for Unix (/tmp/frameforge_pipe) means any local user on the same host can open this IPC endpoint and send or read FrameForge control messages. This allows unauthorized local processes to impersonate the 32-bit bridge or eavesdrop on commands, breaking isolation between different users’ sessions. Consider creating the FIFO in a per-user directory and with restrictive permissions (e.g., owner-only) so that only the intended user/process can access this IPC channel.
| - **Unix/Linux**: `/tmp/frameforge_pipe` | |
| - **Unix/Linux**: use a per-user FIFO, for example: | |
| - `$XDG_RUNTIME_DIR/frameforge/frameforge_pipe` | |
| - or `/tmp/frameforge-$UID/frameforge_pipe` if `XDG_RUNTIME_DIR` is not set | |
| The per-user directory should be created with owner-only permissions (for example, `chmod 700` on the directory and `chmod 600` on the FIFO) so that only the intended user/process can access this IPC channel. |
There was a problem hiding this comment.
@copilot Please implement this suggestion
Description
Implements voice command system for FrameForge Studio using Whisper for speech-to-text and Llama for intent classification. Audio input is transcribed, parsed into structured JSON commands, validated, and sent to a 32-bit bridge via named pipes.
Architecture:
Key Features:
Example:
Input: "Pan the camera left"
Output:
{"verb": "PAN", "subject": "Camera1", "parameters": {"direction": "LEFT"}}Changes
tools/frameforge/: Schema, validator, JSON serialization, IPC implementation, main sidecar apptests/test-frameforge-validator.cpp: Unit tests (9 test cases)tools/CMakeLists.txt: Include frameforge subdirectorytests/CMakeLists.txt: Add frameforge validator testsComponents:
frameforge-schema.{h,cpp}: Command schema with enum converters, action group mapping, required parameter lookupframeforge-validator.{h,cpp}: Validation logic, JSON parsing, clarification request generationframeforge-json.{h,cpp}: JSON serialization using vendored nlohmann/jsonframeforge-ipc.{h,cpp}: Named pipe server/client with 1MB message limitframeforge-sidecar.cpp: Main application integrating Whisper + Llama, WAV file reader, tokenizationNote: This PR contains AI-generated code that has been tested and validated. All tests pass with 100% success rate.
Original prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.