7 changes: 5 additions & 2 deletions mkdocs.yml
@@ -61,9 +61,12 @@ nav:
   - Guides:
     - User Guides:
       - Getting started: guides/index.md
+      - Use Cases:
+        - Serve Multiple GenAI models: guides/serve-multiple-genai-models.md
+        - Serve Multiple LoRA adapters: guides/serve-multiple-lora-adapters.md
       - Rollout:
-      - Adapter Rollout: guides/adapter-rollout.md
-      - InferencePool Rollout: guides/inferencepool-rollout.md
+        - Adapter Rollout: guides/adapter-rollout.md
+        - InferencePool Rollout: guides/inferencepool-rollout.md
       - Metrics: guides/metrics.md
     - Implementer's Guide: guides/implementers.md
   - Performance:
70 changes: 70 additions & 0 deletions site-src/guides/serve-multiple-genai-models.md
@@ -0,0 +1,70 @@
# Serve multiple generative AI models
A company wants to deploy multiple large language models (LLMs) to serve different workloads.
For example, they might deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application.
The company needs to ensure optimal serving performance for these LLMs.
Using the Gateway API Inference Extension, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
You can then route requests based on the model name (such as "chatbot" and "recommender") and the `Criticality` property.

The following diagram illustrates how Gateway API Inference Extension routes requests to different models based on the model name.
![Serving multiple generative AI models](../images/serve-mul-gen-AI-models.png)

The following conceptual example shows how to use an `HTTPRoute` object to route requests to the appropriate `InferencePool` based on the model name, such as "chatbot" or "recommender".
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llms
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - headers:
            # Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
            # is used to copy the model name from the request body to this header.
            - type: Exact
              name: X-Gateway-Model-Name
              value: chatbot
          path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: gemma3
          kind: InferencePool
    - matches:
        - headers:
            # Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
            # is used to copy the model name from the request body to this header.
            - type: Exact
              name: X-Gateway-Model-Name
              value: recommender
          path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: deepseek-r1
          kind: InferencePool
```
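
The header matches above send each model name to its own `InferencePool` backend.
To attach a serving priority to those model names, you can also define `InferenceModel` resources that map each name to its pool and set its `Criticality`.
The following is a minimal sketch: the pool names `gemma3` and `deepseek-r1` come from the route above, while the `Criticality` values are illustrative assumptions.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot
  criticality: Critical      # illustrative: latency-sensitive chat traffic
  poolRef:
    name: gemma3
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: recommender
spec:
  modelName: recommender
  criticality: Standard      # illustrative: less latency-sensitive recommendations
  poolRef:
    name: deepseek-r1
```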

Try it out:

1. Get the gateway IP:
```bash
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
```
2. Send a few requests to model "chatbot" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "chatbot",
"prompt": "What is the color of the sky",
"max_tokens": 100,
"temperature": 0
}'
```
3. Send a few requests to model "recommender" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "chatbot",
"prompt": "Give me restaurant recommendations in Paris",
"max_tokens": 100,
"temperature": 0
}'
```
99 changes: 99 additions & 0 deletions site-src/guides/serve-multiple-lora-adapters.md
@@ -0,0 +1,99 @@
# Serve LoRA adapters on a shared pool
A company wants to serve LLMs for document analysis to audiences in multiple languages, such as English and Spanish.
They have a fine-tuned LoRA adapter for each language, but need to use their GPU and TPU capacity efficiently.
You can use the Gateway API Inference Extension to deploy a dynamically loaded, fine-tuned LoRA adapter for each language (for example, `english-bot` and `spanish-bot`) on a common base model and accelerator.
This lets you reduce the number of required accelerators by densely packing multiple models into a shared pool.

The following diagram illustrates how Gateway API Inference Extension serves multiple LoRA adapters on a shared pool.
![Serving LoRA adapters on a shared pool](../images/serve-LoRA-adapters.png)

This example illustrates how you can densely serve multiple LoRA adapters with distinct workload performance objectives on a common `InferencePool`.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: gemma3
spec:
  selector:
    pool: gemma3
```
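
The `selector` above matches the labels on the model server Pods that back the pool.
As a rough illustration of that relationship, the following Deployment sketch assumes vLLM as the model server and loads the language-specific LoRA adapters at startup; the image, base model, and adapter paths are illustrative placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma3-vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      pool: gemma3
  template:
    metadata:
      labels:
        pool: gemma3                       # must match the InferencePool selector above
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # illustrative image tag
          args:
            - --model=google/gemma-3-1b-it          # illustrative base model
            - --enable-lora
            - --lora-modules
            - english-bot=/adapters/english-bot     # illustrative adapter paths
            - spanish-bot=/adapters/spanish-bot
          ports:
            - containerPort: 8000
```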
Suppose you have two LoRA adapters, `english-bot` and `spanish-bot`, for the Gemma3 base model.
You can create an `InferenceModel` resource for each adapter and associate it with the relevant `InferencePool` resource.
In this case, both adapters are associated with the `gemma3` `InferencePool` created above.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: english-bot
spec:
  modelName: english-bot
  criticality: Standard
  poolRef:
    name: gemma3
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: spanish-bot
spec:
  modelName: spanish-bot
  criticality: Critical
  poolRef:
    name: gemma3
```
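
To confirm that the pool and the adapter models are registered, you can list the resources; this quick check assumes the Inference Extension CRDs are installed in the cluster.

```bash
kubectl get inferencepools
kubectl get inferencemodels
```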
Now, you can route your requests from the gateway using the `HTTPRoute` object.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway   # illustrative; use the GatewayClass provided by your Gateway implementation
  listeners:
    - protocol: HTTP
      port: 80
      name: http
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llms
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: gemma3
          kind: InferencePool
```
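
Before sending traffic, it can help to confirm that the gateway has been assigned an address and that the route was accepted; this is a quick sanity check using standard Gateway API status fields.

```bash
kubectl get gateway inference-gateway
kubectl describe httproute routes-to-llms
```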

Try it out:

1. Get the gateway IP:
```bash
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
```
2. Send a few requests to model "english-bot" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "english-bot",
"prompt": "What is the color of the sky",
"max_tokens": 100,
"temperature": 0
}'
```
3. Send a few requests to model "spanish-bot" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "spanish-bot",
"prompt": "¿De qué color es...?",
"max_tokens": 100,
"temperature": 0
}'
```
Binary file added site-src/images/serve-LoRA-adapters.png
Binary file added site-src/images/serve-mul-gen-AI-models.png