Add Support For Gemini TTS #81
Co-authored-by: groxaxo <[email protected]>
…tion Add OpenAI speech API endpoint with Gemini TTS backend
Walkthrough
Adds Text-to-Speech docs for /v1/audio/speech and implements a new
Changes
Sequence Diagram
sequenceDiagram
participant Client
participant Worker as /audio/speech Handler
participant Gemini as Gemini API
participant Converter as Format Converter
Client->>Worker: POST /audio/speech (model, input, voice, response_format)
Worker->>Worker: Map model → Gemini TTS\nMap voice → Gemini voice\nValidate input & voice
Worker->>Gemini: generateContent (speech synthesis request)
alt Success
Gemini-->>Worker: Base64-encoded audio or PCM payload
Worker->>Converter: Convert/unwrap to requested format (mp3/opus/aac/flac/wav/pcm)
Converter-->>Worker: Audio bytes
Worker-->>Client: 200 OK (audio bytes, Content-Type, CORS)
else API Error
Gemini-->>Worker: Error response
Worker-->>Client: Error response (preserve CORS)
end
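The client side of the diagram above can be sketched as follows. This is a minimal illustration, not code from the PR: `buildSpeechRequest` is a hypothetical helper, and the base URL and API key are placeholders the caller supplies. The request fields (`model`, `input`, `voice`, `response_format`) are the ones named in the diagram.

```javascript
// Hypothetical helper (not part of the PR): builds the arguments for a
// fetch() call against the /v1/audio/speech endpoint described above.
function buildSpeechRequest(baseUrl, apiKey, { model, input, voice, response_format }) {
  return {
    url: `${baseUrl}/v1/audio/speech`,
    init: {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      // JSON.stringify drops undefined fields, so optional parameters
      // such as response_format can simply be omitted by the caller.
      body: JSON.stringify({ model, input, voice, response_format }),
    },
  };
}

// Usage, assuming a runtime with a global fetch (for example Node 18+):
//   const { url, init } = buildSpeechRequest(
//     "https://your-endpoint.com", process.env.GEMINI_API_KEY,
//     { model: "tts-1", input: "Hello", voice: "alloy", response_format: "wav" });
//   const res = await fetch(url, init);
//   const audioBytes = new Uint8Array(await res.arrayBuffer());
```

Keeping request construction separate from the network call makes the payload easy to inspect without contacting the worker.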
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
README.md (1)
230-246: Consider fixing markdown list indentation for consistency.
The nested list items under audio/speech use 6-space indentation instead of the expected 4 spaces, which is inconsistent with markdown best practices.
Apply this diff to fix the indentation:
 - [x] `audio/speech` (Text-to-Speech)
   <details>
-  - [x] `model`
-      - `tts-1` => `gemini-2.5-flash-preview-tts`
-      - `tts-1-hd` => `gemini-2.5-pro-preview-tts`
-      - Can also specify Gemini model names directly
-  - [x] `input` (required)
-  - [x] `voice` (required)
-      - Supported: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
-      - Maps to Gemini voices: Puck, Charon, Kore, Fenrir, Aoede
-  - [x] `response_format`
-      - Supported: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm`
-      - Default: `mp3`
-  - [ ] `speed` (not yet implemented)
+  - [x] `model`
+    - `tts-1` => `gemini-2.5-flash-preview-tts`
+    - `tts-1-hd` => `gemini-2.5-pro-preview-tts`
+    - Can also specify Gemini model names directly
+  - [x] `input` (required)
+  - [x] `voice` (required)
+    - Supported: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
+    - Maps to Gemini voices: Puck, Charon, Kore, Fenrir, Aoede
+  - [x] `response_format`
+    - Supported: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm`
+    - Default: `mp3`
+  - [ ] `speed` (not yet implemented)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (2)
README.md (2 hunks)
src/worker.mjs (2 hunks)
🧰 Additional context used
🪛 Gitleaks (8.29.0)
README.md
[high] 157-158: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.
(curl-auth-header)
🪛 markdownlint-cli2 (0.18.1)
README.md
234-234: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
235-235: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
236-236: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
239-239: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
240-240: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
242-242: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
243-243: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
🔇 Additional comments (6)
README.md (1)
149-181: Excellent documentation for the new TTS feature.
The documentation clearly explains the endpoint usage, model mappings, and voice mappings. The example is helpful and comprehensive.
Note: The Gitleaks warning about the authorization token on line 157-158 is a false positive—this is example documentation with a placeholder value.
src/worker.mjs (5)
34-37: LGTM! Route handler follows existing patterns.
The new /audio/speech endpoint is correctly integrated into the routing logic, consistent with other endpoints in terms of method assertion and error handling.
149-160: Voice mapping is well-defined with clear documentation.
The mapping between OpenAI and Gemini voices is sensible and well-commented. Note that both nova and shimmer map to the same Gemini voice (Aoede), which is acceptable given Gemini's available voice options.
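For illustration, a voice table of this shape might look like the sketch below. The alloy→Puck, echo→Charon, nova→Aoede, and shimmer→Aoede pairs are stated in this review thread. The fable→Kore and onyx→Fenrir pairs are an assumption inferred from the order of the two lists in the README, and `mapVoice` with its error behaviour is hypothetical rather than the PR's actual code.

```javascript
// Sketch of an OpenAI → Gemini voice table. Entries marked "assumed"
// are inferred from list order in the README, not confirmed by the PR.
const VOICE_MAP = {
  alloy: "Puck",
  echo: "Charon",
  fable: "Kore",     // assumed pairing
  onyx: "Fenrir",    // assumed pairing
  nova: "Aoede",
  shimmer: "Aoede",  // shares Aoede with nova
};

// Hypothetical lookup helper: rejects anything outside the table.
// The real handler may be more permissive.
function mapVoice(voice) {
  if (typeof voice !== "string" ||
      !Object.prototype.hasOwnProperty.call(VOICE_MAP, voice)) {
    throw new Error(`Unsupported voice: ${String(voice)}`);
  }
  return VOICE_MAP[voice];
}
```

Using an own-property check avoids accidentally accepting inherited keys such as `toString` as voice names.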
161-189: Model mapping and input validation are correctly implemented.
The model name mapping handles all expected cases (OpenAI model names, Gemini model names, and defaults), and required fields are properly validated with appropriate error messages.
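The three cases described here could be sketched roughly as below. The `tts-1` and `tts-1-hd` targets come from the README excerpt in this thread. The choice of the flash model as the default, the `gemini-` prefix test, and the names `mapModel` and `validateSpeechBody` are assumptions made for the example.

```javascript
// Assumed default; the thread only says that a default exists.
const DEFAULT_TTS_MODEL = "gemini-2.5-flash-preview-tts";

// Hypothetical helper: maps an OpenAI-style model name to a Gemini TTS model.
function mapModel(model) {
  if (model === "tts-1") return "gemini-2.5-flash-preview-tts";
  if (model === "tts-1-hd") return "gemini-2.5-pro-preview-tts";
  // Gemini model names are passed through unchanged.
  if (typeof model === "string" && model.startsWith("gemini-")) return model;
  return DEFAULT_TTS_MODEL;
}

// Hypothetical validator: collects messages for missing required fields.
function validateSpeechBody(body) {
  const errors = [];
  if (typeof body?.input !== "string" || body.input.length === 0) {
    errors.push("input is required");
  }
  if (typeof body?.voice !== "string" || body.voice.length === 0) {
    errors.push("voice is required");
  }
  return errors;
}
```

Returning a list of messages rather than throwing on the first problem lets the handler report every missing field in one error response.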
191-233: Gemini API integration looks correct.
The request construction properly configures audio response with voice settings, and error handling preserves CORS headers while logging errors. The response validation ensures audio data is present before proceeding.
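As a rough sketch of the request and response shapes involved: the field names below follow the public Gemini `generateContent` documentation for speech generation as best understood here, and should be checked against the current API reference. The helper names `buildGeminiTtsBody` and `extractAudioBase64` are hypothetical and are not taken from `src/worker.mjs`.

```javascript
// Hypothetical helper: builds a generateContent body asking for audio
// output with a prebuilt voice.
function buildGeminiTtsBody(input, geminiVoice) {
  return {
    contents: [{ parts: [{ text: input }] }],
    generationConfig: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: { prebuiltVoiceConfig: { voiceName: geminiVoice } },
      },
    },
  };
}

// Hypothetical helper: pulls the base64 audio payload out of a parsed
// response, failing loudly when no audio is present.
function extractAudioBase64(response) {
  const data = response?.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
  if (typeof data !== "string" || data.length === 0) {
    throw new Error("Gemini response contained no audio data");
  }
  return data;
}
```

Validating the payload before conversion means a malformed upstream response surfaces as a clear error instead of an empty audio file.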
277-307: Hardcoded audio parameters are correct for Gemini's TTS output.
Gemini TTS API uses 24000 Hz sample rate, mono (1 channel), and 16-bit PCM encoding (LINEAR16) by default, which matches the implementation exactly. The parameters do not need to be made configurable or extracted from metadata for standard speech generation use cases.
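To make the role of these parameters concrete, here is a minimal sketch of wrapping raw PCM bytes in a standard 44-byte RIFF/WAVE header. The defaults (24000 Hz, 1 channel, 16 bits) are the values quoted above. The function name `pcmToWav` is hypothetical and the code is not copied from the PR.

```javascript
// Wraps raw little-endian PCM bytes (Uint8Array) in a canonical WAV header.
function pcmToWav(pcm, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const buffer = new ArrayBuffer(44 + pcm.length);
  const view = new DataView(buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size for PCM
  view.setUint16(20, 1, true);              // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true);     // size of the sample data

  new Uint8Array(buffer, 44).set(pcm);      // copy samples after the header
  return new Uint8Array(buffer);
}
```

With the defaults, the byte rate works out to 24000 × 1 × 2 = 48000 bytes per second, so one second of speech adds about 48 KB to the 44-byte header.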
Fix TTS audio format handling - limit to WAV/PCM, reject unsupported formats
Actionable comments posted: 1
🧹 Nitpick comments (1)
README.md (1)
157-158: Clarify the authorization header in the curl example.
The curl example shows Authorization: Bearer YOUR_GEMINI_API_KEY. While this is a placeholder, it may be clearer to use angle brackets (e.g., <YOUR_GEMINI_API_KEY>) to explicitly indicate it's a placeholder and should not be copied literally. This avoids potential confusion or accidental exposure if users copy the example without substitution.
 curl https://your-endpoint.com/v1/audio/speech \
-  -H "Authorization: Bearer YOUR_GEMINI_API_KEY" \
+  -H "Authorization: Bearer <YOUR_GEMINI_API_KEY>" \
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
README.md (2 hunks)
src/worker.mjs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/worker.mjs
🧰 Additional context used
🪛 Gitleaks (8.29.0)
README.md
[high] 157-158: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.
(curl-auth-header)
🪛 markdownlint-cli2 (0.18.1)
README.md
234-234: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
235-235: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
236-236: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
239-239: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
240-240: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
242-242: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
243-243: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
244-244: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
🔇 Additional comments (1)
README.md (1)
150-181: Comprehensive TTS documentation: verify alignment with implementation.
The new TTS section provides clear documentation with model mappings, voice mappings, and usage examples. The parameter detail list (lines 231–247) and the note about PCM format (line 244) and external conversion tools are helpful for users.
Please verify that:
- The model names (gemini-2.5-flash-preview-tts and gemini-2.5-pro-preview-tts) match those used in src/worker.mjs.
- The voice mappings (alloy→Puck, echo→Charon, etc.) match the VOICE_MAP implementation.
- The supported response formats (wav, pcm) and their behavior match the actual implementation.
Actionable comments posted: 0
🧹 Nitpick comments (1)
README.md (1)
239-240: Clarify voice mapping for shimmer in the supported endpoints section.
The list shows 6 OpenAI voices but only 5 Gemini voice names. While the TTS section (lines 175-180) makes it explicit that both nova and shimmer map to Aoede, this could be clearer in the supported endpoints section. Consider adding a note: "Maps to Gemini voices: Puck, Charon, Kore, Fenrir, Aoede (nova and shimmer both map to Aoede)".
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (2 hunks)
🧰 Additional context used
🪛 Gitleaks (8.29.0)
README.md
[high] 157-158: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.
(curl-auth-header)
🔇 Additional comments (2)
README.md (2)
157-158: Consider using environment variables for the Authorization header example.
The Gitleaks security scanner flagged the bare Authorization: Bearer YOUR_GEMINI_API_KEY header in the curl example. While this uses a placeholder, it's a best practice to avoid showing credential patterns directly in command examples, even with placeholders. Consider documenting the use of environment variables instead.
Example improvement:
curl https://your-endpoint.com/v1/audio/speech \
  -H "Authorization: Bearer $GEMINI_API_KEY" \
  ...
This makes it clearer that the key should never be hardcoded in commands.
242-244: Verify audio format support discrepancy.
The documentation states that "For mp3, opus, aac, or flac, use external conversion tools like ffmpeg," implying these formats are not natively supported. However, the AI summary indicates the implementation includes "converts returned audio to requested formats (mp3/opus/aac/flac/wav/pcm)."
This is a critical discrepancy—if the implementation supports these formats, the documentation should be updated to reflect that. If only wav/pcm are supported, the implementation summary should be corrected.
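If only wav and pcm are served, as the earlier commit "limit to WAV/PCM, reject unsupported formats" suggests, the check might look something like the sketch below. The default of `wav`, the 400 status, the error wording, and the name `checkResponseFormat` are assumptions for illustration and are not taken from the PR.

```javascript
// Formats the worker is assumed to serve natively after the fix commit.
const SUPPORTED_FORMATS = new Set(["wav", "pcm"]);

// Hypothetical check: returns a result object instead of throwing so the
// caller can turn a failure into an HTTP error response.
function checkResponseFormat(format = "wav") { // default is an assumption
  if (SUPPORTED_FORMATS.has(format)) {
    return { ok: true, format };
  }
  return {
    ok: false,
    status: 400,
    message:
      `Unsupported response_format "${format}". Supported: wav, pcm. ` +
      "Convert externally (for example with ffmpeg) for other formats.",
  };
}
```

Rejecting unsupported formats up front keeps the documentation and the behaviour aligned: a client asking for mp3 gets an explicit error rather than mislabelled PCM bytes.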
Speech endpoints for Gemini TTS Flash and Pro are integrated into the project.