# docs: added examples to address various generative AI application scenarios by using gateway api inference extension #812
**Merged:** k8s-ci-robot merged 13 commits into kubernetes-sigs:main from capri-xiyue:capri-xiyue/add-common-use-cases on May 22, 2025.
Commits (all by capri-xiyue):

- 14f98db added common cases
- 86f7399 added more details
- a042406 fixed comments
- 52354f4 changed file location
- c95e23a resolve merge conclicts
- 0cb4f52 fixed typo
- c8f2c0e Update site-src/guides/serve-multiple-lora-adapters.md
- 759841d Update site-src/guides/serve-multiple-lora-adapters.md
- b970ca9 Update mkdocs.yml
- 46b1b08 Update site-src/guides/serve-multiple-lora-adapters.md
- d17783d Update site-src/guides/serve-multiple-genai-models.md
- a49e1ec added subsession
- 6fa496f fixed wording
**File:** `site-src/guides/serve-multiple-genai-models.md`
# Serve multiple generative AI models

A company wants to deploy multiple large language models (LLMs) to serve different workloads.
For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application.
The company needs to ensure optimal serving performance for these LLMs.
Using Gateway API Inference Extension, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
You can then route requests based on the model name (such as "chatbot" and "recommender") and the `Criticality` property.

The following diagram illustrates how Gateway API Inference Extension routes requests to different models based on the model name.



The following conceptual example shows how to use the `HTTPRoute` object to route requests with a model name such as "chatbot" or "recommender" to the corresponding `InferencePool`.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llms
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - headers:
      - type: Exact
        # Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
        # copies the model name from the request body into this header.
        name: X-Gateway-Model-Name
        value: chatbot
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: gemma3
      kind: InferencePool
  - matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: recommender
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: deepseek-r1
      kind: InferencePool
```
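The `X-Gateway-Model-Name` match above only works because Body-Based Routing (BBR) runs before route matching and copies the `model` field from the JSON request body into that header. The sketch below shows that extraction conceptually; the real BBR component is an Envoy ext_proc server that parses JSON properly, and the `sed`-based parsing here is illustrative only:

```bash
# Illustrative only: mimic what Body-Based Routing does conceptually, i.e.
# pull the "model" field out of an OpenAI-style request body so the gateway
# can match on it as a header. The real BBR extension parses JSON properly;
# sed keeps this sketch dependency-free.
extract_model() {
  printf '%s' "$1" | sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

body='{"model": "chatbot", "prompt": "What is the color of the sky"}'
# The gateway would then inject: X-Gateway-Model-Name: chatbot
echo "X-Gateway-Model-Name: $(extract_model "$body")"
```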
Try it out:

1. Get the gateway IP:

   ```bash
   IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
   ```

2. Send a few requests to the "chatbot" model:

   ```bash
   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
   "model": "chatbot",
   "prompt": "What is the color of the sky",
   "max_tokens": 100,
   "temperature": 0
   }'
   ```
3. Send a few requests to the "recommender" model:

   ```bash
   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
   "model": "recommender",
   "prompt": "Give me restaurant recommendations in Paris",
   "max_tokens": 100,
   "temperature": 0
   }'
   ```
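Steps 2 and 3 differ only in the model name and prompt, so if you are testing several models, a small wrapper avoids repeating the payload. This helper is hypothetical, not part of the guide, and assumes `IP` and `PORT` from step 1:

```bash
# Hypothetical helper: builds the /v1/completions payload for any model.
# With DRY_RUN=1 it prints the payload instead of sending it, so you can
# inspect requests before the gateway is reachable.
send_completion() {
  model="$1"; prompt="$2"
  payload=$(printf '{"model": "%s", "prompt": "%s", "max_tokens": 100, "temperature": 0}' "$model" "$prompt")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$payload"
  else
    curl -i "${IP}:${PORT}/v1/completions" -H 'Content-Type: application/json' -d "$payload"
  fi
}

DRY_RUN=1 send_completion chatbot "What is the color of the sky"
```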
**File:** `site-src/guides/serve-multiple-lora-adapters.md`
# Serve LoRA adapters on a shared pool

A company wants to serve LLMs for document analysis and focuses on audiences in multiple languages, such as English and Spanish.
They have a fine-tuned LoRA adapter for each language, but need to use their GPU and TPU capacity efficiently.
You can use Gateway API Inference Extension to deploy dynamic LoRA fine-tuned adapters for each language (for example, `english-bot` and `spanish-bot`) on a common base model and accelerator.
This lets you reduce the number of required accelerators by densely packing multiple models into a shared pool.

The following diagram illustrates how Gateway API Inference Extension serves multiple LoRA adapters on a shared pool.



The following example shows how to densely serve multiple LoRA adapters, each with distinct workload performance objectives, on a common `InferencePool`.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: gemma3
spec:
  selector:
    pool: gemma3
```
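For the `selector` above to resolve to any endpoints, the model-server Pods must carry the `pool: gemma3` label. A minimal sketch of a matching Deployment follows; the image, model name, and flags are illustrative assumptions, not taken from this guide:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma3-server
spec:
  replicas: 2
  selector:
    matchLabels:
      pool: gemma3
  template:
    metadata:
      labels:
        pool: gemma3   # must match the InferencePool selector
    spec:
      containers:
      - name: model-server
        # Illustrative image and flags; any OpenAI-compatible server with
        # LoRA support could stand in here.
        image: vllm/vllm-openai:latest
        args: ["--model", "google/gemma-3-1b-it", "--enable-lora"]
```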
Suppose you have two LoRA adapters, `english-bot` and `spanish-bot`, for the Gemma3 base model.
You can create an `InferenceModel` resource for each adapter and associate it with the relevant `InferencePool` resource.
In this case, both adapters are associated with the `gemma3` InferencePool created above.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: english-bot
spec:
  modelName: english-bot
  criticality: Standard
  poolRef:
    name: gemma3
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: spanish-bot
spec:
  modelName: spanish-bot
  criticality: Critical
  poolRef:
    name: gemma3
```
Now, you can route requests from the gateway to the pool by using the `HTTPRoute` object.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  listeners:
  - protocol: HTTP
    port: 80
    name: http
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: routes-to-llms
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: gemma3
      kind: InferencePool
```
Try it out:
1. Get the gateway IP:

   ```bash
   IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
   ```

2. Send a few requests to the "english-bot" model:

   ```bash
   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
   "model": "english-bot",
   "prompt": "What is the color of the sky",
   "max_tokens": 100,
   "temperature": 0
   }'
   ```
3. Send a few requests to the "spanish-bot" model:

   ```bash
   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
   "model": "spanish-bot",
   "prompt": "¿De qué color es...?",
   "max_tokens": 100,
   "temperature": 0
   }'
   ```
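To confirm that both adapters are reachable through the same shared pool, you can script the two requests above. The sketch below is a hypothetical smoke test, not part of the guide: it prints each payload, and the commented `curl` line shows how you would actually send it using `IP` and `PORT` from step 1:

```bash
# Hypothetical smoke test: build one /v1/completions payload per adapter
# served from the shared gemma3 pool.
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": 16, "temperature": 0}' "$1" "$2"
}

for model in english-bot spanish-bot; do
  payload=$(build_payload "$model" "Hello")
  echo "POST /v1/completions -> $payload"
  # To actually send it (assumes IP and PORT from step 1):
  # curl -s "${IP}:${PORT}/v1/completions" -H 'Content-Type: application/json' -d "$payload"
done
```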