-
Notifications
You must be signed in to change notification settings - Fork 86
[RFC] OPEA Inference Microservices (OIM) #327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
2263e8d
[RFC] OPEA Inference Microservices (OIM)
poussa fa3ce75
review fixes
poussa 66e8fb5
review fixes
poussa ed10953
review fixes
poussa 67ae8cf
RFC: add OIM operator diagram
poussa 78b930a
review fixes, add picture
poussa File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| # RFC: OPEA Inference Microservices (OIM) | ||
|
|
||
| This RFC talks about creating OPEA inference microservices (OIM) similar to NVIDIA NIMs. | ||
|
|
||
| ## Author(s) | ||
|
|
||
| [Poussa, Sakari](https://github.com/poussa), | ||
| [Bhandaru, Malini](https://github.com/mkbhanda), | ||
| [Lehtonen, Markus](https://github.com/marquiz), | ||
| [Kanevsky, Alexander](https://github.com/kad), | ||
| [Lu, Lianhao](https://github.com/lianhao), | ||
| [Yang, Yuhan](https://github.com/PeterYang12) | ||
|
|
||
| ## Status | ||
|
|
||
| `Under Review` | ||
|
|
||
| ## Objective | ||
|
|
||
| OPEA seeks to ease enterprise GenAI adoption in a landscape that is fast moving, has skills gaps, and is cost conscious. A look at OPEA's GenAIExamples illustrates the choices a user must make: whether to use Docker or Kubernetes, the model serving framework, whether to optimize for latency or throughput or quantization, in addition to the hardware they want to run on. Can we abstract some of this complexity for inference services? | ||
|
|
||
| Nvidia's NIM does just this. It expects a user to only specify a model and what they would like to optimize for, their key performance indicator (KPI) type, be it low latency, high throughput or quantization. Nvidia abstracts away both the need to choose a model serving framework and determining its optimum parameters for the same to achieve the KPIs on provided hardware. | ||
|
|
||
| ## Motivation | ||
|
|
||
| Nvidia delivers this ease of use through tooling that first detects the runtime hardware, which is then used to filter for profiles matching the desired model and KPIs. A profile is a recipe specifying how to launch an inference service. It specifies not only which model server framework (such as TGI or vLLM or Ollama ) to use but also number of GPU cards, amount of memory, degree of tensor parallelism, batch size etc. Lastly the profile is used to launch the inference service on the given hardware. | ||
|
|
||
| The profiles are established through offline experimentation using tools like MLPerf. For a model, one or more model frameworks are explored on different hardware to determine settings providing the best performance for a given KPIs. Constructing profiles is non-trivial, resource intensive work. | ||
|
|
||
| ## Design Proposal | ||
|
|
||
| We seek to offer a similar ease of use experience in OPEA for the inference services. Nvidia’s open source NIM Operator is a tool that eases launching inference services. OPEA plans to extend it to deliver an Operator that provides similar functionality but also supports multi-vendor hardware and shall endeavor to work with Kubernetes and its flavors such as Red Hat OpenShift and SUSE Rancher, and their AI variants by working with our partners and potentially integrating with platforms such as KServe and RayAI. The first release will provide minimal functionality and over time with community contributions grow richer. We anticipate providers of AI platforms that operate on a multitude of hardware platforms will find the HW optimized profiles valuable in providing cost-effective performant inference services. | ||
yinghu5 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| With regard to profiles we shall start with supporting vLLM as the model framework and build profiles for a few popular models. Profiles may also contain more accurate resource usage information for Kubernetes deployments (allowing better Kubernetes scheduling decisions and preventing failures on resource contention). We do anticipate constant enhancements and new offerings in the model serving space and a steady influx of new models. | ||
|
|
||
| The following diagram shows the OIM operator main use cases including model caching, inference service creation and usage. | ||
|
|
||
|  | ||
|
|
||
| The work will entail | ||
|
|
||
| 1) Auto-detecting hardware type and what is allocated. | ||
yinghu5 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| 2) Taking the user specified model and profile (if not specified using a default profile) to retrieve configurations that are possible on the allocated hardware type and quantity. For instance, a configuration that leverages tensor parallelism is infeasible if only one GPU card is allocated. | ||
|
|
||
poussa marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| 3) Automating performance studies to identify optimum settings for a given hardware with various resource allocations for a given model and constructing profile specific configurations. These studies will be repeated to take advantage of new models, frameworks, and other enhancements. | ||
|
|
||
| 4) Providing an open-source tool, such as [ORAS](https://oras.land/), to discover and list available profiles/configurations for a model. | ||
|
|
||
| 5) Publishing the profiles and other artifacts to an OCI compatible registry, such as Docker Hub or ghcr.io. | ||
|
|
||
| ## Alternatives Considered | ||
|
|
||
| Creating everything from scratch was rejected because an open-source alternative exists and provides most of the features OPEA project needs. | ||
|
|
||
| ## Compatibility | ||
|
|
||
| OPEA project will try to stay as close as possible with the original APIs and behaviors. | ||
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.