Merged
6 changes: 3 additions & 3 deletions README.md
@@ -3,6 +3,7 @@
[![Discord](https://img.shields.io/discord/1341627368581628004?logo=Discord&logoColor=%23ffffff&label=Discord&link=https%3A%2F%2Fdiscord.gg%2FmaMY7QjG)](https://discord.gg/Bzz9hax9Jq)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Echo9Zulu-yellow)](https://huggingface.co/Echo9Zulu)
[![Devices](https://img.shields.io/badge/Devices-CPU%2FGPU%2FNPU-blue)](https://github.com/openvinotoolkit/openvino)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/SearchSavior/OpenArc)

> [!NOTE]
> OpenArc is under active development.
@@ -41,11 +42,10 @@ Thanks to everyone on Discord for their continued support!
- [Converting Models to OpenVINO IR](#converting-models-to-openvino-ir)
- [Learning Resources](#learning-resources)
- [Acknowledgments](#acknowledgments)
- [Codebase Documentation](./docs/index.md)

## Features

**OpenArc 2.0** arrives with more endpoints, better UX, pipeline parallelism, NPU support, and much more!

- Multi-GPU pipeline parallelism
- CPU offload/Hybrid device
- NPU device support
@@ -183,7 +183,7 @@ openarc --help
> Need help installing drivers? [Join our Discord](https://discord.gg/Bzz9hax9Jq) or open an issue.

> [!NOTE]
> uv has a [pip interface](https://docs.astral.sh/uv/pip/) which is a drop-in replacement for pip, but faster. Pretty cool, and a good place to start.
> uv has a [pip interface](https://docs.astral.sh/uv/pip/) which is a drop-in replacement for pip, but faster. Pretty cool, and a good place to start learning uv.

## OpenArc CLI

Empty file removed docs/data_types.md
Empty file.
78 changes: 78 additions & 0 deletions docs/index.md
@@ -0,0 +1,78 @@
# OpenArc Documentation

Welcome to OpenArc documentation!

This document collects information about the codebase structure, APIs, architecture, and design patterns to help you explore the project.


- **[Server](./server.md)** - FastAPI server documentation with endpoint details
- **[Model Registration](./model_registration.md)** - How models are registered, loaded, and managed
- **[Worker Orchestration](./worker_orchestration.md)** - Worker system architecture and request routing
- **[Inference](./inference.md)** - Inference engines, class structure, and implementation details

### Architecture Overview

```
┌─────────────────┐
│     FastAPI     │  HTTP API Layer
│     Server      │  (OpenAI-compatible endpoints)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ WorkerRegistry  │  Request Routing & Orchestration
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  ModelRegistry  │  Model Lifecycle Management
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Inference    │  Engine-specific implementations
│     Engines     │  (OVGenAI, Optimum, OpenVINO)
└─────────────────┘
```

### Key Components

1. **Server** (`src/server/main.py`)
- FastAPI application with OpenAI-compatible endpoints
- Authentication middleware
- Request/response handling

2. **Model Registry** (`src/server/model_registry.py`)
- Model lifecycle management (load/unload)
- Status tracking
- Factory pattern for engine instantiation

3. **Worker Registry** (`src/server/worker_registry.py`)
- Per-model worker queues
- Request routing and orchestration
- Async packet processing

4. **Inference Engines** (`src/engine/`)
- **OVGenAI**: LLM, VLM, Whisper models
- **Optimum**: Embedding, Reranker models
- **OpenVINO**: Kokoro TTS models

## Supported Model Types

- **LLM**: Text-to-text language models
- **VLM**: Vision-language models (image-to-text)
- **Whisper**: Automatic speech recognition
- **Kokoro**: Text-to-speech
- **Embedding**: Text-to-vector embeddings
- **Reranker**: Document reranking

## Supported Libraries

- **OVGenAI**: OpenVINO GenAI pipeline (LLM, VLM, Whisper)
- **Optimum**: Optimum-Intel (Embedding, Reranker)
- **OpenVINO**: Native OpenVINO runtime (Kokoro TTS)

This project focuses on Intel devices, though we may expand to other frameworks/libraries in the future.



137 changes: 137 additions & 0 deletions docs/inference.md
@@ -0,0 +1,137 @@
# Inference Engines Documentation


OpenArc supports three inference engines, each optimized for different model types:

- **OVGenAI**: OpenVINO GenAI pipeline (LLM, VLM, Whisper)
- **Optimum**: Optimum-Intel (Embedding, Reranker)
- **OpenVINO**: Native OpenVINO runtime (Kokoro TTS)

## Engine Architecture

```
src/engine/
├── ov_genai/
│ ├── llm.py # OVGenAI_LLM
│ ├── vlm.py # OVGenAI_VLM
│ ├── whisper.py # OVGenAI_Whisper
│ ├── streamers.py # ChunkStreamer
│ ├── continuous_batch_llm.py
│ └── continuous_batch_vlm.py
├── optimum/
│ ├── optimum_llm.py # Optimum_LLM
│ ├── optimum_vlm.py # Optimum_VLM
│ ├── optimum_emb.py # Optimum_EMB
│ └── optimum_rr.py # Optimum_RR
└── openvino/
├── kokoro.py # OV_Kokoro
└── kitten.py
```

## Class Hierarchy

### OVGenAI Engine

#### OVGenAI_LLM (`src/engine/ov_genai/llm.py`)

Text-to-text language model using OpenVINO GenAI LLMPipeline.

**Key Features:**
- Supports OpenAI-compatible chat message format with chat templates
- Tool calling support (tools parameter in messages)
- Streaming and non-streaming generation modes
- Multiple input formats: pre-encoded input_ids, raw prompts, and chat messages
- ChunkStreamer for batched token streaming (chunk_size > 1)
- Performance metrics collection (ttft, throughput, etc.)
- Uses AutoTokenizer for encoding, model tokenizer for decoding

#### OVGenAI_VLM (`src/engine/ov_genai/vlm.py`)

Vision-language model using OpenVINO GenAI VLMPipeline.

**Key Features:**
- Supports OpenAI-compatible multimodal message format with embedded images
- Tool calling support (tools parameter in messages)
- Streaming and non-streaming generation modes
- Extracts base64-encoded images from OpenAI message format
- Converts images to OpenVINO tensors for inference
- Inserts model-specific vision tokens at image positions
- Supports multiple images per request with proper token indexing
- ChunkStreamer for batched token streaming (chunk_size > 1)
- Performance metrics collection (ttft, throughput, etc.)
- Uses chat templates with vision token insertion

**Vision Token Types:**
- `internvl2`: `<image>`
- `llava15`: `<image>`
- `llavanext`: `<image>`
- `minicpmv26`: `(<image>./</image>)`
- `phi3vision`: `<|image_{i}|>`
- `phi4mm`: `<|image_{i}|>`
- `qwen2vl`: `<|vision_start|><|image_pad|><|vision_end|>`
- `qwen25vl`: `<|vision_start|><|image_pad|><|vision_end|>`
- `gemma3`: `<start_of_image>`
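The mapping above can be sketched as a small lookup that also handles the indexed phi-style tokens. The table contents come straight from this document; the function itself and the 1-based indexing are illustrative assumptions:

```python
# Model-specific vision token templates, copied from the table above.
VISION_TOKENS = {
    "internvl2": "<image>",
    "llava15": "<image>",
    "llavanext": "<image>",
    "minicpmv26": "(<image>./</image>)",
    "phi3vision": "<|image_{i}|>",
    "phi4mm": "<|image_{i}|>",
    "qwen2vl": "<|vision_start|><|image_pad|><|vision_end|>",
    "qwen25vl": "<|vision_start|><|image_pad|><|vision_end|>",
    "gemma3": "<start_of_image>",
}

def vision_tokens(model_type: str, num_images: int) -> list:
    """Return one vision token per image; phi-style tokens are indexed from 1
    (an assumption for illustration)."""
    template = VISION_TOKENS[model_type]
    if "{i}" in template:
        return [template.format(i=i) for i in range(1, num_images + 1)]
    return [template] * num_images
```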

#### OVGenAI_Whisper (`src/engine/ov_genai/whisper.py`)

Automatic speech recognition using OpenVINO GenAI Whisper.

**Key Features:**
- Processes base64-encoded audio
- Returns transcribed text and metrics
- Non-streaming only (Whisper processes entire audio)

#### ChunkStreamer (`src/engine/ov_genai/streamers.py`)

Custom streamer for chunked token streaming. Uses the OpenVINO tokenizer, not AutoTokenizer, for decoding.

**Features:**
- Accumulates tokens into chunks
- Yields chunks when chunk_size reached
- Supports chunk_size > 1 for batched streaming
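The accumulation logic can be sketched as a standalone generator. The real ChunkStreamer plugs into the OpenVINO GenAI streamer interface; this version only illustrates the buffering behavior:

```python
from typing import Iterable, Iterator

# Minimal sketch of chunked streaming: accumulate decoded tokens and flush
# every `chunk_size` tokens. Not the actual ChunkStreamer implementation.
def chunk_stream(tokens: Iterable[str], chunk_size: int = 4) -> Iterator[str]:
    buffer = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            yield "".join(buffer)
            buffer.clear()
    if buffer:  # flush any trailing partial chunk
        yield "".join(buffer)
```

With `chunk_size=1` this degenerates to per-token streaming; larger chunks trade latency for fewer, larger network writes.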

### Optimum Engine

#### Optimum_EMB (`src/engine/optimum/optimum_emb.py`)

Text-to-vector embedding model using Optimum-Intel.

**Key Features:**
- Uses `OVModelForFeatureExtraction`
- Implements last token pooling for embeddings
- Normalizes embeddings (L2 normalization)
- Supports flexible tokenizer configuration

**Token Pooling:**
- Handles left-padding vs right-padding
- Extracts last non-padding token embedding
- Normalizes to unit vectors
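The pooling steps above can be sketched in NumPy. The tensor shapes and padding semantics here are assumptions for illustration (`hidden` is `(batch, seq, dim)`, `attention_mask` is `(batch, seq)` with 1 for real tokens):

```python
import numpy as np

# Sketch of last-token pooling with L2 normalization, as described above.
def last_token_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # If every sequence's final position is a real token, the batch is
    # left-padded and the last position is always valid.
    if bool(attention_mask[:, -1].all()):
        pooled = hidden[:, -1]
    else:
        # Right-padded: pick each sequence's last non-padding position.
        last_idx = attention_mask.sum(axis=1) - 1
        pooled = hidden[np.arange(hidden.shape[0]), last_idx]
    # L2-normalize to unit vectors.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / norms
```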

#### Optimum_RR (`src/engine/optimum/optimum_rr.py`)

Document reranking model using Optimum-Intel.

**Key Features:**
- Reranks documents based on query relevance
- Supports custom prefix/suffix/instruction
- Returns ranked document lists
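The reranking flow can be sketched as prompt construction plus a sort. The scoring function here is a stand-in for the cross-encoder model, and the prompt layout is an assumption; only the prefix/suffix/instruction knobs come from the feature list above:

```python
from typing import Callable, List, Tuple

# Illustrative sketch of reranking: build a scoring prompt per document and
# return documents ordered by relevance. `score_fn` stands in for the model.
def rerank(
    query: str,
    documents: List[str],
    score_fn: Callable[[str], float],
    prefix: str = "",
    suffix: str = "",
    instruction: str = "Given a query, judge document relevance.",
) -> List[Tuple[str, float]]:
    scored = []
    for doc in documents:
        prompt = f"{prefix}{instruction}\nQuery: {query}\nDocument: {doc}{suffix}"
        scored.append((doc, score_fn(prompt)))
    # Highest relevance first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```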

### OpenVINO Engine

#### OV_Kokoro (`src/engine/openvino/kokoro.py`)

Text-to-speech model using native OpenVINO runtime.

**Key Features:**
- Processes text in chunks (character_count_chunk)
- Generates audio tensors per chunk
- Supports voice selection and language codes
- Speed control for speech generation
- Returns WAV audio format
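The per-chunk processing above starts with splitting the input text by character budget. A minimal sketch, assuming chunks split on word boundaries (the exact splitting strategy is an assumption):

```python
# Illustrative sketch of character-count chunking for TTS, mirroring the
# `character_count_chunk` idea described above.
def chunk_text(text: str, character_count_chunk: int = 200) -> list:
    words = text.split()
    chunks, current = [], ""
    for word in words:
        candidate = f"{current} {word}".strip()
        if len(candidate) <= character_count_chunk:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be synthesized to an audio tensor and the tensors concatenated into the final WAV output.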

**Voice Support:**
- Multiple languages (English, Japanese, Chinese, Spanish, etc.)
- Multiple voices per language
- Gender-specific voices

101 changes: 101 additions & 0 deletions docs/model_registration.md
@@ -0,0 +1,101 @@
# Model Registration Documentation

This document describes the model registration system, lifecycle management, and architectural patterns.

## Overview

The Model Registry (`src/server/model_registry.py`) manages the lifecycle of all models in OpenArc using a registry pattern with async background loading and a factory pattern for engine instantiation.

## Architecture Patterns

### Registry Pattern

The `ModelRegistry` maintains a central dictionary of all loaded models, tracking their status and lifecycle state. It is a volatile, in-memory datastore used internally.

**Key Components:**
- **ModelRecord**: Tracks model state (LOADING, LOADED, FAILED)
- **Async Lock**: Ensures thread-safe concurrent access
- **Event System**: Callbacks for lifecycle events

### Factory Pattern

Models are instantiated via a factory that maps `(engine, model_type)` tuples to concrete engine classes.

The factory dynamically imports and instantiates the appropriate class based on configuration.
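A minimal sketch of such a factory is below. The class names match those documented elsewhere in these docs, but the key strings, module paths, and instantiation code are illustrative assumptions:

```python
import importlib

# Hypothetical (engine, model_type) -> (module, class) mapping; keys and
# module paths are illustrative, not OpenArc's actual configuration values.
FACTORY_MAP = {
    ("ovgenai", "llm"): ("src.engine.ov_genai.llm", "OVGenAI_LLM"),
    ("ovgenai", "vlm"): ("src.engine.ov_genai.vlm", "OVGenAI_VLM"),
    ("optimum", "embedding"): ("src.engine.optimum.optimum_emb", "Optimum_EMB"),
    ("openvino", "kokoro"): ("src.engine.openvino.kokoro", "OV_Kokoro"),
}

def create_engine(engine: str, model_type: str, **kwargs):
    """Dynamically import and instantiate the engine class for a config."""
    try:
        module_path, class_name = FACTORY_MAP[(engine, model_type)]
    except KeyError:
        raise ValueError(f"No engine registered for ({engine}, {model_type})")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)
```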

### Event System

The registry fires events when models are loaded or unloaded, allowing other components (like `WorkerRegistry`) to react:

```python
# Subscribe to events
registry.add_on_loaded(on_model_loaded)
registry.add_on_unloaded(on_model_unloaded)
```
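The subscription API shown above can be sketched as a plain callback list. The `add_on_loaded`/`add_on_unloaded` names follow the snippet; everything else is illustrative:

```python
from typing import Callable, List

# Minimal sketch of a callback-based event system; not the actual
# ModelRegistry implementation.
class EventRegistry:
    def __init__(self) -> None:
        self._on_loaded: List[Callable[[str], None]] = []
        self._on_unloaded: List[Callable[[str], None]] = []

    def add_on_loaded(self, callback: Callable[[str], None]) -> None:
        self._on_loaded.append(callback)

    def add_on_unloaded(self, callback: Callable[[str], None]) -> None:
        self._on_unloaded.append(callback)

    def _fire_loaded(self, model_id: str) -> None:
        # Invoked internally once a model finishes loading.
        for callback in self._on_loaded:
            callback(model_id)
```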

## Model Lifecycle

```
┌─────────────┐
│   REQUEST   │
│ LOAD MODEL  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   CREATE    │
│ MODEL RECORD│
│  (LOADING)  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    SPAWN    │
│  LOAD TASK  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   FACTORY   │
│ INSTANTIATE │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   UPDATE    │
│  STATUS TO  │
│   LOADED    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    FIRE     │
│  CALLBACKS  │
└─────────────┘
```
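The lifecycle above can be sketched as an async background load. The record shape, lock usage, and stand-in factory step here are illustrative simplifications, not the actual `ModelRegistry` code:

```python
import asyncio

# Illustrative sketch of the lifecycle diagram: create a LOADING record,
# spawn a background task, flip the status, then fire callbacks.
class Registry:
    def __init__(self) -> None:
        self.records = {}               # model_id -> status string
        self.lock = asyncio.Lock()
        self.on_loaded = []             # callbacks fired after a load

    async def load(self, model_id: str) -> None:
        async with self.lock:
            self.records[model_id] = "LOADING"            # create record
        asyncio.create_task(self._load_task(model_id))    # spawn load task

    async def _load_task(self, model_id: str) -> None:
        try:
            await asyncio.sleep(0)      # stand-in for factory instantiation
            status = "LOADED"
        except Exception:
            status = "FAILED"
        async with self.lock:
            self.records[model_id] = status               # update status
        for callback in self.on_loaded:                   # fire callbacks
            callback(model_id)
```

Because the load runs as a background task, the HTTP handler that requested the load can return immediately while the model finishes loading.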

## Key Classes

### ModelLoadConfig

Pydantic model defining model configuration.

### ModelRecord

Dataclass tracking a registered model's state, instance, and metadata. Distinguishes between private (internal) and public (API-exposed) fields.

### ModelRegistry

Central registry implementing:
- **Async Loading**: Background tasks for model loading/unloading
- **Status Tracking**: LOADING → LOADED (or FAILED) states
- **Factory Integration**: Delegates instantiation to factory
- **Event Notifications**: Fires callbacks on lifecycle changes

## Thread Safety

All registry operations are protected by an `asyncio.Lock` for safe concurrent access from async tasks. The registry maintains separate private model IDs while exposing public model names for API access.

## Integration

The `WorkerRegistry` subscribes to model lifecycle events to automatically spawn workers when models load and clean up when they unload.
8 changes: 0 additions & 8 deletions docs/openarc_server.md

This file was deleted.
