
Commit 0d5107b: Add unit tests for request body

Parent: 1ba13f3


58 files changed (+2162, -501)
Lines changed: 8 additions & 0 deletions

```diff
@@ -0,0 +1,8 @@
+---
+name: Blank Issue
+about: Create a new issue from scratch
+title: ''
+labels: needs-triage
+assignees: ''
+
+---
```

.github/ISSUE_TEMPLATE/bug_request.md

Lines changed: 3 additions & 1 deletion

```diff
@@ -1,7 +1,9 @@
 ---
 name: Bug Report
 about: Report a bug you encountered
-labels: kind/bug
+title: ''
+labels: kind/bug, needs-triage
+assignees: ''
 
 ---
 
```

.github/ISSUE_TEMPLATE/config.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+blank_issues_enabled: false
```

.github/ISSUE_TEMPLATE/feature_request.md

Lines changed: 1 addition & 2 deletions

```diff
@@ -2,7 +2,7 @@
 name: Feature request
 about: Suggest an idea for this project
 title: ''
-labels: ''
+labels: needs-triage
 assignees: ''
 
 ---
@@ -12,4 +12,3 @@ assignees: ''
 **What would you like to be added**:
 
 **Why is this needed**:
-
```

.github/ISSUE_TEMPLATE/new-release.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -4,6 +4,7 @@ about: Propose a new release
 title: Release v0.x.0
 labels: ''
 assignees: ''
+
 ---
 
 - [Introduction](#introduction)
```

Makefile

Lines changed: 5 additions & 1 deletion

```diff
@@ -123,8 +123,12 @@ vet: ## Run go vet against code.
 test: manifests generate fmt vet envtest image-build ## Run tests.
 	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test $$(go list ./... | grep -v /e2e) -race -coverprofile cover.out
 
+.PHONY: test-unit
+test-unit: ## Run unit tests.
+	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test ./pkg/... -race -coverprofile cover.out
+
 .PHONY: test-integration
-test-integration: ## Run tests.
+test-integration: ## Run integration tests.
 	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test ./test/integration/epp/... -race -coverprofile cover.out
 
 .PHONY: test-e2e
```
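The new `test-unit` target runs everything under `./pkg/...`, which per the commit message now includes unit tests for request-body handling. A minimal sketch of the kind of table-driven test this target would pick up might look like the following; `extractModel` and its package are hypothetical stand-ins, not code from this commit:

```go
package handlers_test

import (
	"encoding/json"
	"testing"
)

// extractModel is a hypothetical stand-in for request-body parsing: it pulls
// the "model" field out of an OpenAI-style chat-completion request body.
func extractModel(body []byte) (string, error) {
	var req struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	return req.Model, nil
}

func TestExtractModel(t *testing.T) {
	tests := []struct {
		name    string
		body    string
		want    string
		wantErr bool
	}{
		{name: "valid body", body: `{"model": "llama3-8b", "messages": []}`, want: "llama3-8b"},
		{name: "missing model field", body: `{"messages": []}`, want: ""},
		{name: "malformed JSON", body: `{"model": `, wantErr: true},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			got, err := extractModel([]byte(tc.body))
			if (err != nil) != tc.wantErr {
				t.Fatalf("unexpected error state: %v", err)
			}
			if got != tc.want {
				t.Errorf("got %q, want %q", got, tc.want)
			}
		})
	}
}
```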

README.md

Lines changed: 54 additions & 1 deletion

```diff
@@ -1,4 +1,57 @@
-# Gateway API Inference Extension
+[![Go Report Card](https://goreportcard.com/badge/sigs.k8s.io/gateway-api-inference-extension)](https://goreportcard.com/report/sigs.k8s.io/gateway-api-inference-extension)
+[![Go Reference](https://pkg.go.dev/badge/sigs.k8s.io/gateway-api-inference-extension.svg)](https://pkg.go.dev/sigs.k8s.io/gateway-api-inference-extension)
+[![License](https://img.shields.io/github/license/kubernetes-sigs/gateway-api-inference-extension)](/LICENSE)
+
+# Gateway API Inference Extension (GIE)
+
+This project offers tools for AI Inference, enabling developers to build [Inference Gateways].
+
+[Inference Gateways]:#concepts-and-definitions
+
+## Concepts and Definitions
+
+The following are some key industry terms that are important to understand for
+this project:
+
+- **Model**: A generative AI model that has learned patterns from data and is
+  used for inference. Models vary in size and architecture, from smaller
+  domain-specific models to massive multi-billion-parameter neural networks that
+  are optimized for diverse language tasks.
+- **Inference**: The process of running a generative AI model, such as a large
+  language model or diffusion model, to generate text, embeddings, or other
+  outputs from input data.
+- **Model server**: A service (in our case, containerized) responsible for
+  receiving inference requests and returning predictions from a model.
+- **Accelerator**: Specialized hardware, such as Graphics Processing Units
+  (GPUs), that can be attached to Kubernetes nodes to speed up computations,
+  particularly for training and inference tasks.
+
+And the following are terms more specific to this project:
+
+- **Scheduler**: Makes decisions about which endpoint is optimal (best cost /
+  best performance) for an inference request based on `Metrics and Capabilities`
+  from [Model Serving](/docs/proposals/003-model-server-protocol/README.md).
+- **Metrics and Capabilities**: Data provided by model serving platforms about
+  performance, availability and capabilities to optimize routing. Includes
+  things like [Prefix Cache] status or [LoRA Adapters] availability.
+- **Endpoint Selector**: A `Scheduler` combined with `Metrics and Capabilities`
+  systems, together often referred to as an [Endpoint Selection Extension]
+  (also sometimes called an "endpoint picker", or "EPP").
+- **Inference Gateway**: A proxy/load-balancer coupled with an
+  `Endpoint Selector`. It provides optimized routing and load balancing for
+  serving Kubernetes self-hosted generative Artificial Intelligence (AI)
+  workloads. It simplifies the deployment, management, and observability of AI
+  inference workloads.
+
+For deeper insights and more advanced concepts, refer to our [proposals](/docs/proposals).
+
+[Inference]:https://www.digitalocean.com/community/tutorials/llm-inference-optimization
+[Gateway API]:https://github.com/kubernetes-sigs/gateway-api
+[Prefix Cache]:https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html
+[LoRA Adapters]:https://docs.vllm.ai/en/stable/features/lora.html
+[Endpoint Selection Extension]:https://gateway-api-inference-extension.sigs.k8s.io/#endpoint-selection-extension
+
+## Technical Overview
 
 This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an **inference gateway** - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level **AI Gateway** like LiteLLM, Solo AI Gateway, or Apigee.
 
```

api/v1alpha2/inferencemodel_types.go

Lines changed: 1 addition & 1 deletion

```diff
@@ -126,7 +126,7 @@ type PoolObjectReference struct {
 }
 
 // Criticality defines how important it is to serve the model compared to other models.
-// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional(use a pointer), and set no default.
+// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional (use a pointer), and set no default.
 // This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior.
 // +kubebuilder:validation:Enum=Critical;Standard;Sheddable
 type Criticality string
```
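The comment's "optional (use a pointer), set no default" guidance translates into a referencing field like the sketch below. The spec shape here is illustrative rather than copied from the commit; the point is that a pointer with `omitempty` and no default marker keeps "unset" distinct from every enum value:

```go
package v1alpha2

// Criticality is a bounded enum of load-balancing priorities.
type Criticality string

const (
	Critical  Criticality = "Critical"
	Standard  Criticality = "Standard"
	Sheddable Criticality = "Sheddable"
)

// Illustrative referencing spec: the pointer plus `omitempty` (and no
// kubebuilder default) means an absent field stays nil, so the enum can
// later be unioned with a oneOf-style field without a breaking change.
type InferenceModelSpec struct {
	// +optional
	Criticality *Criticality `json:"criticality,omitempty"`
}
```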

cmd/epp/main.go

Lines changed: 9 additions & 5 deletions

```diff
@@ -30,6 +30,7 @@ import (
 	"go.uber.org/zap/zapcore"
 	"google.golang.org/grpc"
 	healthPb "google.golang.org/grpc/health/grpc_health_v1"
+	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/rest"
 	"k8s.io/component-base/metrics/legacyregistry"
 	ctrl "sigs.k8s.io/controller-runtime"
@@ -140,14 +141,16 @@ func run() error {
 		return err
 	}
 
-	mgr, err := runserver.NewDefaultManager(*poolNamespace, *poolName, cfg)
+	poolNamespacedName := types.NamespacedName{
+		Name:      *poolName,
+		Namespace: *poolNamespace,
+	}
+	mgr, err := runserver.NewDefaultManager(poolNamespacedName, cfg)
 	if err != nil {
 		setupLog.Error(err, "Failed to create controller manager")
 		return err
 	}
 
-	ctx := ctrl.SetupSignalHandler()
-
 	// Set up mapper for metric scraping.
 	mapping, err := backendmetrics.NewMetricMapping(
 		*totalQueuedRequestsMetric,
@@ -162,14 +165,15 @@ func run() error {
 
 	pmf := backendmetrics.NewPodMetricsFactory(&backendmetrics.PodMetricsClientImpl{MetricMapping: mapping}, *refreshMetricsInterval)
 	// Setup runner.
+	ctx := ctrl.SetupSignalHandler()
+
 	datastore := datastore.NewDatastore(ctx, pmf)
 
 	serverRunner := &runserver.ExtProcServerRunner{
 		GrpcPort:                                 *grpcPort,
 		DestinationEndpointHintMetadataNamespace: *destinationEndpointHintMetadataNamespace,
 		DestinationEndpointHintKey:               *destinationEndpointHintKey,
-		PoolName:                                 *poolName,
-		PoolNamespace:                            *poolNamespace,
+		PoolNamespacedName:                       poolNamespacedName,
 		Datastore:                                datastore,
 		SecureServing:                            *secureServing,
 		CertPath:                                 *certPath,
```
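For context, `types.NamespacedName` is apimachinery's standard namespace/name pair, and controller-runtime's `client.ObjectKey` is an alias for it; bundling the two flag values into one struct removes the risk of swapping two bare string arguments. A minimal sketch of the pattern, with placeholder values rather than the commit's flags:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// One self-describing value instead of two positional strings,
	// so call sites cannot pass namespace and name in the wrong order.
	poolKey := types.NamespacedName{
		Namespace: "default",
		Name:      "vllm-llama3-8b-instruct",
	}

	// String() renders the conventional "namespace/name" form used in logs.
	fmt.Println(poolKey.String()) // default/vllm-llama3-8b-instruct
}
```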

config/charts/inferencepool/README.md

Lines changed: 13 additions & 1 deletion

````diff
@@ -2,7 +2,6 @@
 
 A chart to deploy an InferencePool and a corresponding EndpointPicker (epp) deployment.
 
-
 ## Install
 
 To install an InferencePool named `vllm-llama3-8b-instruct` that selects from endpoints with label `app: vllm-llama3-8b-instruct` and listening on port `8000`, you can run the following command:
@@ -23,6 +22,18 @@ $ helm install vllm-llama3-8b-instruct \
 
 Note that the provider name is needed to deploy provider-specific resources. If no provider is specified, then only the InferencePool object and the EPP are deployed.
 
+### Install for Triton TensorRT-LLM
+
+Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install for Triton TensorRT-LLM, e.g.,
+
+```txt
+$ helm install triton-llama3-8b-instruct \
+  --set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
+  --set inferencePool.modelServerType=triton-tensorrt-llm \
+  --set provider.name=[none|gke] \
+  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
+```
+
 ## Uninstall
 
 Run the following command to uninstall the chart:
@@ -38,6 +49,7 @@ The following table list the configurable parameters of the chart.
 | **Parameter Name**                        | **Description**                                                                                                         |
 |-------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|
 | `inferencePool.targetPortNumber`          | Target port number for the vllm backends, will be used to scrape metrics by the inference extension. Defaults to 8000.  |
+| `inferencePool.modelServerType`           | Type of the model servers in the pool, valid options are [vllm, triton-tensorrt-llm], default is vllm.                  |
 | `inferencePool.modelServers.matchLabels`  | Label selector to match vllm backends managed by the inference pool.                                                    |
 | `inferenceExtension.replicas`             | Number of replicas for the endpoint picker extension service. Defaults to `1`.                                          |
 | `inferenceExtension.image.name`           | Name of the container image used for the endpoint picker.                                                               |
````
