docs(local model routing): add docs on how to use Gemma for local model routing #21365
Changes from 7 commits
e0d0d14
2524dc8
53b02a6
6c98228
1dfe9a0
bde3299
124be61
9564fbe
2576f8d
f88f2fe
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,144 @@ | ||
| # Local Model Routing (experimental) | ||
|
|
||
| Gemini CLI supports using a local model for | ||
| [routing decisions](../cli/model-routing.md). When configured, Gemini CLI | ||
| uses a locally running **Gemma** model to make routing decisions instead of | ||
| delegating them to a hosted model. | ||
|
|
||
| This feature can help reduce costs associated with hosted model usage while | ||
| offering similar routing decision latency and quality. | ||
|
|
||
| > **Note: Local model routing is currently an experimental feature.** | ||
|
|
||
| ## Setup | ||
|
|
||
| Using a Gemma model for routing decisions requires that an implementation of a | ||
| Gemma model be running locally on your machine, served behind an HTTP endpoint | ||
| and accessed via the Gemini API. | ||
|
|
||
| To serve the Gemma model, follow these steps: | ||
|
|
||
| ### Download the LiteRT-LM runtime | ||
|
|
||
| The [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM) runtime offers | ||
| pre-built binaries for serving models locally. Download the binary appropriate | ||
| for your system. | ||
|
|
||
| #### Windows | ||
|
|
||
| 1. Download | ||
| [lit.windows_x86_64.exe](https://github.com/google-ai-edge/LiteRT-LM/releases/download/v0.9.0-alpha03/lit.windows_x86_64.exe). | ||
| 2. Using GPU on Windows requires the DirectXShaderCompiler. Download the | ||
| [dxc zip from the latest release](https://github.com/microsoft/DirectXShaderCompiler/releases/download/v1.8.2505.1/dxc_2025_07_14.zip). | ||
| Unzip the archive and, from the architecture-appropriate `bin\` directory, | ||
| copy `dxil.dll` and `dxcompiler.dll` into the same location where you saved | ||
| `lit.windows_x86_64.exe`. | ||
| 3. (Optional) Test starting the runtime: | ||
| `.\lit.windows_x86_64.exe serve --verbose` | ||
|
|
||
| #### Linux | ||
|
|
||
| 1. Download | ||
| [lit.linux_x86_64](https://github.com/google-ai-edge/LiteRT-LM/releases/download/v0.9.0-alpha03/lit.linux_x86_64). | ||
| 2. Ensure the binary is executable: `chmod a+x lit.linux_x86_64` | ||
| 3. (Optional) Test starting the runtime: `./lit.linux_x86_64 serve --verbose` | ||
|
|
||
| #### macOS | ||
|
|
||
| 1. Download | ||
| [lit.macos_arm64](https://github.com/google-ai-edge/LiteRT-LM/releases/download/v0.9.0-alpha03/lit.macos_arm64). | ||
| 2. Ensure the binary is executable: `chmod a+x lit.macos_arm64` | ||
| 3. (Optional) Test starting the runtime: `./lit.macos_arm64 serve --verbose` | ||
|
|
||
|
Contributor
I ran this on a fresh macOS device today and got tripped up by macOS security settings. By default, macOS only allows binaries from "App Store and Known Developers", so when I tried to run the server it failed with a message that offered to move it to the trash. I had to go to Settings -> Privacy & Security and click "Allow Anyway" under the message saying that "lit.macos_arm64" was blocked to protect your Mac.
Contributor
Author
Added language to this effect. PTAL. |
||
| ### Download the Gemma Model | ||
|
|
||
| Before using Gemma, you will need to download the model (and agree to the Terms | ||
| of Service). You can do this with the LiteRT-LM runtime: | ||
|
|
||
| ```bash | ||
| $ ./lit.linux_x86_64 pull gemma3-1b-gpu-custom | ||
|
||
|
|
||
| [Legal] The model you are about to download is governed by | ||
| the Gemma Terms of Use and Prohibited Use Policy. Please review these terms and ensure you agree before continuing. | ||
|
|
||
| Full Terms: https://ai.google.dev/gemma/terms | ||
| Prohibited Use Policy: https://ai.google.dev/gemma/prohibited_use_policy | ||
|
|
||
| Do you accept these terms? (Y/N): Y | ||
|
|
||
| Terms accepted. | ||
| Downloading model 'gemma3-1b-gpu-custom' ... | ||
| Downloading... 968.6 MB | ||
| Download complete. | ||
| ``` | ||
|
|
||
| ### Start LiteRT-LM Runtime | ||
|
|
||
| Using the command appropriate to your system, start the LiteRT-LM runtime. | ||
| Configure the port that you want to use for your Gemma model. For the purposes | ||
| of this document, we will use port `9379`. | ||
|
|
||
| Example command for macOS: `./lit.macos_arm64 serve --port=9379 --verbose` | ||
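
The per-platform commands above can be assembled programmatically. The following is only a sketch, not part of Gemini CLI or LiteRT-LM: the helper name `serve_command` is hypothetical, and the binary names and the `serve --port=... --verbose` arguments are copied from this document rather than from the runtime's reference. It assumes the binary sits in the current directory.

```python
import platform

# Binary names as listed in the download steps of this document.
BINARIES = {
    ("Windows", "x86_64"): "lit.windows_x86_64.exe",
    ("Linux", "x86_64"): "lit.linux_x86_64",
    ("Darwin", "arm64"): "lit.macos_arm64",
}


def serve_command(system, machine, port=9379):
    """Return the argv list that starts the runtime (hypothetical helper).

    `system` and `machine` are the values reported by platform.system()
    and platform.machine().
    """
    # Windows reports the x86_64 architecture as "AMD64".
    arch = "x86_64" if machine in ("AMD64", "x86_64") else machine
    binary = BINARIES.get((system, arch))
    if binary is None:
        raise ValueError(f"no pre-built binary listed for {system}/{machine}")
    prefix = ".\\" if system == "Windows" else "./"
    return [prefix + binary, "serve", f"--port={port}", "--verbose"]
```

For example, `serve_command(platform.system(), platform.machine())` yields a list that can be passed to `subprocess.Popen`. Platforms not listed in this document raise `ValueError` rather than guessing a binary name.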
|
|
||
| ### (Optional) Verify Model Serving | ||
|
|
||
| Send a quick prompt to the model via HTTP to validate successful model serving. | ||
| This will cause the runtime to download the model and run it once. | ||
|
|
||
| You should see a short joke in the server output as an indicator of success. | ||
|
|
||
| #### Windows | ||
|
|
||
| ```powershell | ||
| # Run this in PowerShell to send a request to the server | ||
|
|
||
| $uri = "http://localhost:9379/v1beta/models/gemma3-1b-gpu-custom:generateContent" | ||
| $body = @{contents = @( @{ | ||
| role = "user" | ||
| parts = @( @{ text = "Tell me a joke." } ) | ||
| })} | ConvertTo-Json -Depth 10 | ||
|
|
||
| Invoke-RestMethod -Uri $uri -Method Post -Body $body -ContentType "application/json" | ||
| ``` | ||
|
|
||
| #### Linux/macOS | ||
|
|
||
| ```bash | ||
| $ curl "http://localhost:9379/v1beta/models/gemma3-1b-gpu-custom:generateContent" \ | ||
| -H 'Content-Type: application/json' \ | ||
| -X POST \ | ||
| -d '{"contents":[{"role":"user","parts":[{"text":"Tell me a joke."}]}]}' | ||
| ``` | ||
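
The same verification request can be sent from any platform with Python's standard library. This is a minimal sketch that mirrors the `curl` call above; the function names `build_request` and `send` are hypothetical, and because the response format is not described in this document, the parsed JSON is returned as-is.

```python
import json
import urllib.request


def build_request(host, model, prompt):
    """Build the POST request used above to verify model serving."""
    url = f"{host.rstrip('/')}/v1beta/models/{model}:generateContent"
    body = {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the runtime listening on port `9379`, calling `send(build_request("http://localhost:9379", "gemma3-1b-gpu-custom", "Tell me a joke."))` should trigger the same joke in the server output.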
|
|
||
| ## Configuration | ||
|
|
||
| To use a local Gemma model for routing, you must explicitly enable it in your | ||
| `settings.json`: | ||
|
|
||
| ```json | ||
| { | ||
| "experimental": { | ||
| "gemmaModelRouter": { | ||
| "enabled": true, | ||
| "classifier": { | ||
| "host": "http://localhost:9379", | ||
| "model": "gemma3-1b-gpu-custom" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| > Use the same port on which you started the LiteRT-LM runtime during setup. | ||
|
|
||
| ### Configuration schema | ||
|
|
||
| | Field | Type | Required | Description | | ||
| | :----------------- | :------ | :------- | :----------------------------------------------------------------------------------------- | | ||
| | `enabled` | boolean | Yes | Must be `true` to enable the feature. | | ||
| | `classifier` | object | Yes | The configuration for the local model endpoint. It includes the host and model specifiers. | | ||
| | `classifier.host` | string | Yes | The URL to the local model server. Should be `http://localhost:<port>`. | | ||
| | `classifier.model` | string | Yes | The model name to use for decisions. Must be `"gemma3-1b-gpu-custom"`. | | ||
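
The constraints in the table can be checked before starting Gemini CLI. The sketch below is illustrative only and is not how Gemini CLI validates its settings: `check_router_settings` is a hypothetical helper that applies the rules from the table to an already parsed `settings.json` dictionary.

```python
from urllib.parse import urlparse

# Model name required by the schema table in this document.
REQUIRED_MODEL = "gemma3-1b-gpu-custom"


def _is_local_http_url(host):
    """True if host looks like http://localhost:<port>."""
    if not isinstance(host, str):
        return False
    try:
        parsed = urlparse(host)
        return (
            parsed.scheme == "http"
            and parsed.hostname == "localhost"
            and parsed.port is not None
        )
    except ValueError:  # raised for a malformed or out-of-range port
        return False


def check_router_settings(settings):
    """Return a list of problems; an empty list means the table's rules hold."""
    experimental = settings.get("experimental")
    router = experimental.get("gemmaModelRouter") if isinstance(experimental, dict) else None
    if not isinstance(router, dict):
        return ["experimental.gemmaModelRouter is missing"]
    problems = []
    if router.get("enabled") is not True:
        problems.append("enabled must be true")
    classifier = router.get("classifier")
    if not isinstance(classifier, dict):
        problems.append("classifier must be an object")
        return problems
    if not _is_local_http_url(classifier.get("host")):
        problems.append("classifier.host should be http://localhost:<port>")
    if classifier.get("model") != REQUIRED_MODEL:
        problems.append(f"classifier.model must be {REQUIRED_MODEL!r}")
    return problems
```

Load the file with `json.load` and pass the result to `check_router_settings`. The example configuration shown earlier produces an empty list.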
|
|
||
| > **Note: You will need to restart Gemini CLI after configuration changes for | ||
| > local model routing to take effect.** | ||