Architecture for Jan to support multiple Inference Engines #1271
freelerobot started this conversation in Feature Ideas
Previous thread: #771

## Context

## Solution
I envision an architecture in Jan that has the following:

- **Models Extension**
  - Serves the `/models` API endpoint
- **Inference Extension**
  - Serves OpenAI-compatible endpoints (`/chat/completions`, later `/audio/speech`)
  - Routes each request to the right engine, based on the model's manifest (`model.json`); see the routing sketch after this list
- **Extension for each Inference Engine**
  - Implements the `/chat/completions` endpoint for its engine
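A minimal sketch of how the Inference Extension's routing could work in TypeScript. The `ModelManifest` and `InferenceEngineExtension` interfaces, the engine registry, and the manifest path are hypothetical illustrations, not Jan's actual extension API:

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical shapes for illustration; Jan's real extension API may differ.
interface ModelManifest {
  id: string;     // e.g. "llama2-70b-intel-bigdl"
  engine: string; // e.g. "nitro", "intel-bigdl", "openai"
}

interface ChatCompletionRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  stream?: boolean;
}

interface InferenceEngineExtension {
  // Each engine extension implements the OpenAI-compatible endpoint
  // and streams results back as SSE chunks.
  chatCompletions(req: ChatCompletionRequest): AsyncIterable<string>;
}

// Engine extensions register themselves here, keyed by engine name.
const engines = new Map<string, InferenceEngineExtension>();

async function loadManifest(modelId: string): Promise<ModelManifest> {
  // Read /jan/models/<id>/model.json (path per the file tree below).
  const raw = await readFile(`/jan/models/${modelId}/model.json`, "utf8");
  return JSON.parse(raw) as ModelManifest;
}

// The Inference Extension's routing step: look up the engine named in
// model.json and hand the request to the matching engine extension.
async function routeChatCompletions(
  req: ChatCompletionRequest
): Promise<AsyncIterable<string>> {
  const manifest = await loadManifest(req.model);
  const engine = engines.get(manifest.engine);
  if (!engine) {
    throw new Error(`No inference engine registered for "${manifest.engine}"`);
  }
  return engine.chatCompletions(req);
}
```

The key design point is that the Inference Extension never knows engine specifics; it only reads `model.json` and dispatches.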
## Example

### File Tree

```
/jan
  /models
    /llama2-70b
      llama2-gguf-q4_k_m.bin   # uses Nitro
      model.json
    /llama2-70b-intel-bigdl
      # pytorch files
      model.json
  /engines
    /nitro
      engine.json
    /openai
      engine.json
```
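### model.json for gpt4-32k-1603

A minimal sketch of what the `model.json` manifest for `gpt4-32k-1603` could contain. The field names (`id`, `object`, `engine`, `parameters`) are assumptions for illustration, not a confirmed schema:

```json
{
  "id": "gpt4-32k-1603",
  "object": "model",
  "engine": "openai",
  "parameters": {
    "max_tokens": 4096,
    "temperature": 0.7
  }
}
```

The `engine` field is the hook the Inference Extension keys on when routing.

### engine.json example for Nitro

Likewise, a hedged sketch of an `engine.json` for Nitro; the endpoint URL, port, and field names are illustrative assumptions:

```json
{
  "id": "nitro",
  "object": "engine",
  "endpoint": "http://127.0.0.1:3928/inferences/llamacpp/chat_completion",
  "supported_formats": ["gguf"]
}
```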
### Execution Path

1. A `/chat/completions` request arrives for `llama2-70b-intel-bigdl`.
2. The Inference Extension loads the `model.json` for `llama2-70b-intel-bigdl` and sees the engine is `intel-bigdl`.
3. The Inference Extension routes the request to the `intel-bigdl` Inference Engine Extension.
4. The `intel-bigdl` Inference Engine Extension takes in the `/chat/completions` request, runs inference, and returns the result through SSE.
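To make the path concrete, here is a hedged client-side sketch that would exercise it end to end. The host and port are assumptions (wherever Jan serves its OpenAI-compatible API locally), and the streaming format is standard OpenAI-style SSE:

```typescript
// Hedged sketch: POST a streaming chat completion to Jan's local server.
// The host/port are assumptions; adjust to your Jan configuration.
async function streamChat(): Promise<void> {
  const response = await fetch("http://127.0.0.1:1337/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama2-70b-intel-bigdl", // routed to the intel-bigdl engine
      messages: [{ role: "user", content: "Hello!" }],
      stream: true, // results come back as SSE chunks
    }),
  });
  if (!response.ok || !response.body) {
    throw new Error(`Request failed: ${response.status}`);
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // Each SSE event line looks like: "data: {...json chunk...}"
    process.stdout.write(decoder.decode(value));
  }
}

streamChat().catch(console.error);
```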