@groxaxo groxaxo commented Nov 13, 2025

Integrates speech endpoints for the Gemini TTS Flash and Pro models into the project.

Summary by CodeRabbit

  • New Features

    • Added a Text-to-Speech endpoint to convert text to audio with multiple voice options, model aliases, and output formats (MP3, Opus, AAC, FLAC, WAV, PCM). Response headers preserve CORS. Playback speed parameter not yet available.
  • Documentation

    • Added comprehensive TTS docs with usage examples, model/voice mappings, parameter guidance, and response format notes.

netlify bot commented Nov 13, 2025

Deploy Preview for gemini-pro ready!

Name | Link
🔨 Latest commit | 8f637ab
🔍 Latest deploy log | https://app.netlify.com/projects/gemini-pro/deploys/6915a145731d8400079db6c0
😎 Deploy Preview | https://deploy-preview-81--gemini-pro.netlify.app

coderabbitai bot commented Nov 13, 2025

Walkthrough

Adds Text-to-Speech docs for /v1/audio/speech and implements a new /audio/speech POST handler in the worker that maps OpenAI-like models/voices to Gemini TTS, calls Gemini generateContent, converts returned audio to requested formats (mp3/opus/aac/flac/wav/pcm), and returns audio with CORS.

Changes

Cohort / File(s) | Summary
Documentation: README.md | Added TTS documentation describing the /v1/audio/speech endpoint with example usage, model mappings (tts-1, tts-1-hd and Gemini model names), voice mappings, parameter guidance (input, voice, response_format), and a note that speed is not yet implemented.
Speech Synthesis Implementation: src/worker.mjs | Added handleSpeech(req, apiKey) and routed POST /audio/speech to it. Introduced DEFAULT_SPEECH_MODEL = "gemini-2.5-flash-preview-tts" and VOICE_MAP (alloy→Puck, echo→Charon, fable→Kore, onyx→Fenrir, nova→Aoede, shimmer→Aoede). Implements Gemini generateContent request, validates input and voice, parses base64/PCM responses, supports output formats (mp3, opus, aac, flac, wav, pcm), adds convertPCMToWAV(pcmData) helper, and preserves CORS while returning audio or errors.
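
From the client's side, a call to the new endpoint could be assembled roughly as follows. This is a minimal sketch: `buildSpeechRequest` is a hypothetical helper that does not appear in the PR, and the base URL and key are placeholders. The body fields (`model`, `input`, `voice`, `response_format`) and the `mp3` default are taken from the summary and the README diff in this review.

```javascript
// Hypothetical helper (not part of the PR): builds the arguments for fetch()
// against the /v1/audio/speech endpoint described above.
function buildSpeechRequest(baseUrl, apiKey, options) {
  const { input, voice, model = "tts-1", response_format = "mp3" } = options;
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${baseUrl.replace(/\/+$/, "")}/v1/audio/speech`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, input, voice, response_format }),
    },
  };
}
```

The result can be passed straight to `fetch(url, init)`; the response body is then the audio bytes in the requested format.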

Sequence Diagram

sequenceDiagram
    participant Client
    participant Worker as /audio/speech Handler
    participant Gemini as Gemini API
    participant Converter as Format Converter

    Client->>Worker: POST /audio/speech (model, input, voice, response_format)
    Worker->>Worker: Map model → Gemini TTS\nMap voice → Gemini voice\nValidate input & voice
    Worker->>Gemini: generateContent (speech synthesis request)
    alt Success
        Gemini-->>Worker: Base64-encoded audio or PCM payload
        Worker->>Converter: Convert/unwrap to requested format (mp3/opus/aac/flac/wav/pcm)
        Converter-->>Worker: Audio bytes
        Worker-->>Client: 200 OK (audio bytes, Content-Type, CORS)
    else API Error
        Gemini-->>Worker: Error response
        Worker-->>Client: Error response (preserve CORS)
    end
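
The "generateContent (speech synthesis request)" step in the diagram could look roughly like the sketch below. The PR's actual payload is not shown in this review, so the helper names are hypothetical, and the field names follow the publicly documented Gemini speech generation request shape as an assumption about what the worker sends.

```javascript
// Hypothetical helpers (not from the PR) sketching the Gemini request.
// Assumption: the worker uses the REST generateContent endpoint with an
// AUDIO response modality and a prebuilt voice.
function geminiSpeechUrl(model) {
  return `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`;
}

function buildGeminiSpeechBody(input, geminiVoice) {
  return {
    contents: [{ parts: [{ text: input }] }],
    generationConfig: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: { prebuiltVoiceConfig: { voiceName: geminiVoice } },
      },
    },
  };
}
```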

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify correctness of base64 decoding and PCM handling.
  • Inspect convertPCMToWAV for header/sample calculations.
  • Confirm model/voice mapping fallbacks and validation logic.
  • Check error propagation and CORS header preservation.
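
For the first review point, base64 decoding of the returned audio can be checked against a small reference sketch like the one below. `extractAudioBytes` is a hypothetical name, and the response path (`candidates[0].content.parts[].inlineData.data`) is an assumption based on the usual Gemini response shape, not something confirmed by the diff.

```javascript
// Hypothetical helper (not from the PR): pulls the base64 audio payload out
// of a Gemini-style response object and decodes it to raw bytes.
function extractAudioBytes(geminiResponse) {
  const parts = geminiResponse?.candidates?.[0]?.content?.parts ?? [];
  const part = parts.find((p) => p.inlineData?.data);
  if (!part) {
    throw new Error("No audio data in Gemini response");
  }
  // atob yields a binary string; copy each char code into a byte array.
  const binary = atob(part.inlineData.data);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}
```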

Poem

🐰 I found a WAV beneath the hill,

Mapped voices, hummed a tiny trill.
Bytes and base64 danced in a row,
From text to tone the rabbit knows—
Hop, play, and let the audio flow! 🎶

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title 'Add Support For Gemini TTS' directly and clearly describes the main change: implementing Gemini Text-to-Speech functionality via a new /audio/speech endpoint with model/voice mappings.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
README.md (1)

230-246: Consider fixing markdown list indentation for consistency.

The nested list items under audio/speech use 6-space indentation instead of the expected 4 spaces, which is inconsistent with markdown best practices.

Apply this diff to fix the indentation:

 - [x] `audio/speech` (Text-to-Speech)
   <details>
 
-  - [x] `model`
-      - `tts-1` => `gemini-2.5-flash-preview-tts`
-      - `tts-1-hd` => `gemini-2.5-pro-preview-tts`
-      - Can also specify Gemini model names directly
-  - [x] `input` (required)
-  - [x] `voice` (required)
-      - Supported: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
-      - Maps to Gemini voices: Puck, Charon, Kore, Fenrir, Aoede
-  - [x] `response_format`
-      - Supported: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm`
-      - Default: `mp3`
-  - [ ] `speed` (not yet implemented)
+    - [x] `model`
+        - `tts-1` => `gemini-2.5-flash-preview-tts`
+        - `tts-1-hd` => `gemini-2.5-pro-preview-tts`
+        - Can also specify Gemini model names directly
+    - [x] `input` (required)
+    - [x] `voice` (required)
+        - Supported: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
+        - Maps to Gemini voices: Puck, Charon, Kore, Fenrir, Aoede
+    - [x] `response_format`
+        - Supported: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm`
+        - Default: `mp3`
+    - [ ] `speed` (not yet implemented)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe22245 and d6d63aa.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (2)
  • README.md (2 hunks)
  • src/worker.mjs (2 hunks)
🧰 Additional context used
🪛 Gitleaks (8.29.0)
README.md

[high] 157-158: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

🪛 markdownlint-cli2 (0.18.1)
README.md

234-236, 239-240, 242-243: Unordered list indentation
Expected: 4; Actual: 6

(MD007, ul-indent)

🔇 Additional comments (6)
README.md (1)

149-181: Excellent documentation for the new TTS feature.

The documentation clearly explains the endpoint usage, model mappings, and voice mappings. The example is helpful and comprehensive.

Note: The Gitleaks warning about the authorization token on line 157-158 is a false positive—this is example documentation with a placeholder value.

src/worker.mjs (5)

34-37: LGTM! Route handler follows existing patterns.

The new /audio/speech endpoint is correctly integrated into the routing logic, consistent with other endpoints in terms of method assertion and error handling.
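
The routing pattern described here, a path match followed by a method assertion, might be sketched as below. The worker's real router is not shown in this review, so `resolveRoute` and its return shape are purely illustrative.

```javascript
// Illustrative only: a path match plus method assertion, in the spirit of
// the routing described above. Names and return shape are hypothetical.
function resolveRoute(method, pathname) {
  if (pathname.endsWith("/audio/speech")) {
    if (method !== "POST") {
      return { status: 405, handler: null }; // method not allowed
    }
    return { status: 200, handler: "handleSpeech" };
  }
  return { status: 404, handler: null }; // unknown path
}
```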


149-160: Voice mapping is well-defined with clear documentation.

The mapping between OpenAI and Gemini voices is sensible and well-commented. Note that both nova and shimmer map to the same Gemini voice (Aoede), which is acceptable given Gemini's available voice options.
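
As a reference for this mapping, a lookup with validation could be written as follows. The table entries are copied from the walkthrough summary; `mapVoice` and its error message are hypothetical and may differ from the PR's code.

```javascript
// Voice table as listed in the walkthrough; nova and shimmer share Aoede.
const VOICE_MAP = {
  alloy: "Puck",
  echo: "Charon",
  fable: "Kore",
  onyx: "Fenrir",
  nova: "Aoede",
  shimmer: "Aoede",
};

// Hypothetical validator: rejects anything that is not an own key of the map.
function mapVoice(voice) {
  const known = typeof voice === "string" &&
    Object.prototype.hasOwnProperty.call(VOICE_MAP, voice);
  if (!known) {
    throw new Error(`Unsupported voice: ${String(voice)}`);
  }
  return VOICE_MAP[voice];
}
```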


161-189: Model mapping and input validation are correctly implemented.

The model name mapping handles all expected cases (OpenAI model names, Gemini model names, and defaults), and required fields are properly validated with appropriate error messages.
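
The three cases mentioned here (OpenAI aliases, Gemini names passed through, and a default) can be sketched as below. The alias targets and the default come from the README diff and the walkthrough; `resolveSpeechModel`, the `gemini-` prefix test, and the fallback for unrecognized names are assumptions rather than the PR's confirmed behavior.

```javascript
const DEFAULT_SPEECH_MODEL = "gemini-2.5-flash-preview-tts";

// Aliases as documented in the README diff.
const MODEL_ALIASES = {
  "tts-1": "gemini-2.5-flash-preview-tts",
  "tts-1-hd": "gemini-2.5-pro-preview-tts",
};

// Hypothetical resolver: alias first, then Gemini names as-is, else default.
function resolveSpeechModel(model) {
  if (typeof model !== "string" || model === "") {
    return DEFAULT_SPEECH_MODEL;
  }
  if (Object.prototype.hasOwnProperty.call(MODEL_ALIASES, model)) {
    return MODEL_ALIASES[model];
  }
  // Assumption: any name starting with "gemini-" is passed through unchanged.
  if (model.startsWith("gemini-")) {
    return model;
  }
  return DEFAULT_SPEECH_MODEL;
}
```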


191-233: Gemini API integration looks correct.

The request construction properly configures audio response with voice settings, and error handling preserves CORS headers while logging errors. The response validation ensures audio data is present before proceeding.


277-307: Hardcoded audio parameters are correct for Gemini's TTS output.

Gemini TTS API uses 24000 Hz sample rate, mono (1 channel), and 16-bit PCM encoding (LINEAR16) by default, which matches the implementation exactly. The parameters do not need to be made configurable or extracted from metadata for standard speech generation use cases.
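
For reviewers checking the header and sample calculations, a standard 44-byte PCM WAV header built from these parameters looks like the sketch below. It is an independent reference written for this note, not the PR's `convertPCMToWAV`; the function name and the defaulted parameters are hypothetical.

```javascript
// Reference sketch: wraps raw little-endian PCM in a canonical 44-byte
// RIFF/WAVE header. Defaults mirror the values quoted above.
function pcmToWavSketch(pcmData, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const buffer = new ArrayBuffer(44 + pcmData.length);
  const view = new DataView(buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcmData.length, true); // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                 // fmt chunk size for PCM
  view.setUint16(20, 1, true);                  // audio format 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcmData.length, true);     // data chunk size in bytes

  const wav = new Uint8Array(buffer);
  wav.set(pcmData, 44);                         // copy samples after header
  return wav;
}
```

With the defaults, the byte rate is 24000 × 1 × 2 = 48000 and the block align is 2, which are the two fields most often miscalculated.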

Copilot AI and others added 3 commits November 13, 2025 01:48
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
README.md (1)

157-158: Clarify the authorization header in the curl example.

The curl example shows Authorization: Bearer YOUR_GEMINI_API_KEY. While this is a placeholder, it may be clearer to use angle brackets (e.g., <YOUR_GEMINI_API_KEY>) to explicitly indicate it's a placeholder and should not be copied literally. This avoids potential confusion or accidental exposure if users copy the example without substitution.

 curl https://your-endpoint.com/v1/audio/speech \
-  -H "Authorization: Bearer YOUR_GEMINI_API_KEY" \
+  -H "Authorization: Bearer <YOUR_GEMINI_API_KEY>" \
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6d63aa and 033b40a.

📒 Files selected for processing (2)
  • README.md (2 hunks)
  • src/worker.mjs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/worker.mjs
🧰 Additional context used
🪛 Gitleaks (8.29.0)
README.md

[high] 157-158: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

🪛 markdownlint-cli2 (0.18.1)
README.md

234-236, 239-240, 242-244: Unordered list indentation
Expected: 4; Actual: 6

(MD007, ul-indent)

🔇 Additional comments (1)
README.md (1)

150-181: Comprehensive TTS documentation—verify alignment with implementation.

The new TTS section provides clear documentation with model mappings, voice mappings, and usage examples. The parameter detail list (lines 231–247) and the note about PCM format (line 244) and external conversion tools are helpful for users.

Please verify that:

  • The model names (gemini-2.5-flash-preview-tts and gemini-2.5-pro-preview-tts) match those used in src/worker.mjs.
  • The voice mappings (alloy→Puck, echo→Charon, etc.) match the VOICE_MAP implementation.
  • The supported response formats (wav, pcm) and their behavior match the actual implementation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
README.md (1)

239-240: Clarify voice mapping for shimmer in the supported endpoints section.

The list shows 6 OpenAI voices but only 5 Gemini voice names. While the TTS section (lines 175-180) makes it explicit that both nova and shimmer map to Aoede, this could be clearer in the supported endpoints section. Consider adding a note: "Maps to Gemini voices: Puck, Charon, Kore, Fenrir, Aoede (nova and shimmer both map to Aoede)".

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 033b40a and 8f637ab.

📒 Files selected for processing (1)
  • README.md (2 hunks)
🧰 Additional context used
🪛 Gitleaks (8.29.0)
README.md

[high] 157-158: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

🔇 Additional comments (2)
README.md (2)

157-158: Consider using environment variables for the Authorization header example.

The Gitleaks security scanner flagged the bare Authorization: Bearer YOUR_GEMINI_API_KEY header in the curl example. While this uses a placeholder, it's a best practice to avoid showing credential patterns directly in command examples, even with placeholders. Consider documenting the use of environment variables instead.

Example improvement:

curl https://your-endpoint.com/v1/audio/speech \
  -H "Authorization: Bearer $GEMINI_API_KEY" \
  ...

This makes it clearer that the key should never be hardcoded in commands.
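
The same idea applies to script-based clients. A minimal sketch, assuming a Node.js environment and the `GEMINI_API_KEY` variable name from the curl example; `authHeadersFromEnv` is a hypothetical helper:

```javascript
// Hypothetical helper: reads the key from the environment instead of
// hardcoding it, and fails loudly when it is missing.
function authHeadersFromEnv(env = process.env) {
  const key = env.GEMINI_API_KEY;
  if (!key) {
    throw new Error("GEMINI_API_KEY is not set");
  }
  return {
    Authorization: `Bearer ${key}`,
    "Content-Type": "application/json",
  };
}
```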


242-244: Verify audio format support discrepancy.

The documentation states that "For mp3, opus, aac, or flac, use external conversion tools like ffmpeg," implying these formats are not natively supported. However, the AI summary indicates the implementation includes "converts returned audio to requested formats (mp3/opus/aac/flac/wav/pcm)."

This is a critical discrepancy—if the implementation supports these formats, the documentation should be updated to reflect that. If only wav/pcm are supported, the implementation summary should be corrected.
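
One way to make the distinction explicit in code is a format table that separates formats served directly from formats that are only accepted. The sketch below assumes the README's reading, in which only wav and pcm are produced natively; `describeFormat`, the table names, and the content types are hypothetical and should be adjusted to whatever src/worker.mjs actually does.

```javascript
// Formats listed in the README as accepted values for response_format.
const SUPPORTED_FORMATS = ["mp3", "opus", "aac", "flac", "wav", "pcm"];

// Assumption (from the README wording): only these are produced natively.
// The content types are illustrative choices, not taken from the PR.
const NATIVE_CONTENT_TYPES = {
  wav: "audio/wav",
  pcm: "audio/L16; rate=24000; channels=1",
};

// Hypothetical helper: reports whether a requested format is native.
function describeFormat(responseFormat = "mp3") {
  if (!SUPPORTED_FORMATS.includes(responseFormat)) {
    throw new Error(`Unsupported response_format: ${responseFormat}`);
  }
  const contentType = NATIVE_CONTENT_TYPES[responseFormat] ?? null;
  return { format: responseFormat, native: contentType !== null, contentType };
}
```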
