Merged
6 changes: 3 additions & 3 deletions README.md
@@ -3,6 +3,7 @@
[![Discord](https://img.shields.io/discord/1341627368581628004?logo=Discord&logoColor=%23ffffff&label=Discord&link=https%3A%2F%2Fdiscord.gg%2FmaMY7QjG)](https://discord.gg/Bzz9hax9Jq)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Echo9Zulu-yellow)](https://huggingface.co/Echo9Zulu)
[![Devices](https://img.shields.io/badge/Devices-CPU%2FGPU%2FNPU-blue)](https://github.com/openvinotoolkit/openvino)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/SearchSavior/OpenArc)

> [!NOTE]
> OpenArc is under active development.
@@ -41,11 +42,10 @@ Thanks to everyone on Discord for their continued support!
- [Converting Models to OpenVINO IR](#converting-models-to-openvino-ir)
- [Learning Resources](#learning-resources)
- [Acknowledgments](#acknowledgments)
- [Codebase Documentation](./docs/index.md)

## Features

**OpenArc 2.0** arrives with more endpoints, better UX, pipeline parallelism, NPU support, and much more!

- Multi-GPU pipeline parallelism
- CPU offload/Hybrid device
- NPU device support
@@ -183,7 +183,7 @@ openarc --help
> Need help installing drivers? [Join our Discord](https://discord.gg/Bzz9hax9Jq) or open an issue.

> [!NOTE]
> uv has a [pip interface](https://docs.astral.sh/uv/pip/) which is a drop-in replacement for pip, but faster. Pretty cool, and a good place to start.
> uv has a [pip interface](https://docs.astral.sh/uv/pip/) which is a drop-in replacement for pip, but faster. Pretty cool, and a good place to start learning uv.

## OpenArc CLI

Empty file removed docs/data_types.md
Empty file.
78 changes: 78 additions & 0 deletions docs/index.md
@@ -0,0 +1,78 @@
# OpenArc Documentation

Welcome to OpenArc documentation!

This document collects information about the codebase structure, APIs, architecture, and design patterns to help you explore the project.


- **[Server](./server.md)** - FastAPI server documentation with endpoint details
- **[Model Registration](./model_registration.md)** - How models are registered, loaded, and managed
- **[Worker Orchestration](./worker_orchestration.md)** - Worker system architecture and request routing
- **[Inference](./inference.md)** - Inference engines, class structure, and implementation details

### Architecture Overview

```
┌─────────────────┐
│     FastAPI     │  HTTP API Layer
│     Server      │  (OpenAI-compatible endpoints)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ WorkerRegistry  │  Request Routing & Orchestration
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  ModelRegistry  │  Model Lifecycle Management
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Inference    │  Engine-specific implementations
│     Engines     │  (OVGenAI, Optimum, OpenVINO)
└─────────────────┘
```

### Key Components

1. **Server** (`src/server/main.py`)
- FastAPI application with OpenAI-compatible endpoints
- Authentication middleware
- Request/response handling

2. **Model Registry** (`src/server/model_registry.py`)
- Model lifecycle management (load/unload)
- Status tracking
- Factory pattern for engine instantiation

3. **Worker Registry** (`src/server/worker_registry.py`)
- Per-model worker queues
- Request routing and orchestration
- Async packet processing

4. **Inference Engines** (`src/engine/`)
- **OVGenAI**: LLM, VLM, Whisper models
- **Optimum**: Embedding, Reranker models
- **OpenVINO**: Kokoro TTS models

## Supported Model Types

- **LLM**: Text-to-text language models
- **VLM**: Vision-language models (image-to-text)
- **Whisper**: Automatic speech recognition
- **Kokoro**: Text-to-speech
- **Embedding**: Text-to-vector embeddings
- **Reranker**: Document reranking

## Supported Libraries

- **OVGenAI**: OpenVINO GenAI pipeline (LLM, VLM, Whisper)
- **Optimum**: Optimum-Intel (Embedding, Reranker)
- **OpenVINO**: Native OpenVINO runtime (Kokoro TTS)

This project focuses on Intel devices, though we may expand to other frameworks/libraries in the future.



137 changes: 137 additions & 0 deletions docs/inference.md
@@ -0,0 +1,137 @@
# Inference Engines Documentation


OpenArc supports three inference engines, each optimized for different model types:

- **OVGenAI**: OpenVINO GenAI pipeline (LLM, VLM, Whisper)
- **Optimum**: Optimum-Intel (Embedding, Reranker)
- **OpenVINO**: Native OpenVINO runtime (Kokoro TTS)

## Engine Architecture

```
src/engine/
├── ov_genai/
│ ├── llm.py # OVGenAI_LLM
│ ├── vlm.py # OVGenAI_VLM
│ ├── whisper.py # OVGenAI_Whisper
│ ├── streamers.py # ChunkStreamer
│ ├── continuous_batch_llm.py
│ └── continuous_batch_vlm.py
├── optimum/
│ ├── optimum_llm.py # Optimum_LLM
│ ├── optimum_vlm.py # Optimum_VLM
│ ├── optimum_emb.py # Optimum_EMB
│ └── optimum_rr.py # Optimum_RR
└── openvino/
├── kokoro.py # OV_Kokoro
└── kitten.py
```

## Class Hierarchy

### OVGenAI Engine

#### OVGenAI_LLM (`src/engine/ov_genai/llm.py`)

Text-to-text language model using OpenVINO GenAI LLMPipeline.

**Key Features:**
- Supports OpenAI-compatible chat message format with chat templates
- Tool calling support (tools parameter in messages)
- Streaming and non-streaming generation modes
- Multiple input formats: pre-encoded input_ids, raw prompts, and chat messages
- ChunkStreamer for batched token streaming (chunk_size > 1)
- Performance metrics collection (ttft, throughput, etc.)
- Uses AutoTokenizer for encoding, model tokenizer for decoding

#### OVGenAI_VLM (`src/engine/ov_genai/vlm.py`)

Vision-language model using OpenVINO GenAI VLMPipeline.

**Key Features:**
- Supports OpenAI-compatible multimodal message format with embedded images
- Tool calling support (tools parameter in messages)
- Streaming and non-streaming generation modes
- Extracts base64-encoded images from OpenAI message format
- Converts images to OpenVINO tensors for inference
- Inserts model-specific vision tokens at image positions
- Supports multiple images per request with proper token indexing
- ChunkStreamer for batched token streaming (chunk_size > 1)
- Performance metrics collection (ttft, throughput, etc.)
- Uses chat templates with vision token insertion

**Vision Token Types:**
- `internvl2`: `<image>`
- `llava15`: `<image>`
- `llavanext`: `<image>`
- `minicpmv26`: `(<image>./</image>)`
- `phi3vision`: `<|image_{i}|>`
- `phi4mm`: `<|image_{i}|>`
- `qwen2vl`: `<|vision_start|><|image_pad|><|vision_end|>`
- `qwen25vl`: `<|vision_start|><|image_pad|><|vision_end|>`
- `gemma3`: `<start_of_image>`
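The mapping above can be sketched as a small lookup that also handles the indexed phi-style tokens. The table contents come straight from this document; the function itself and the 1-based indexing are illustrative assumptions:

```python
# Model-specific vision token templates, copied from the table above.
VISION_TOKENS = {
    "internvl2": "<image>",
    "llava15": "<image>",
    "llavanext": "<image>",
    "minicpmv26": "(<image>./</image>)",
    "phi3vision": "<|image_{i}|>",
    "phi4mm": "<|image_{i}|>",
    "qwen2vl": "<|vision_start|><|image_pad|><|vision_end|>",
    "qwen25vl": "<|vision_start|><|image_pad|><|vision_end|>",
    "gemma3": "<start_of_image>",
}

def vision_tokens(model_type: str, num_images: int) -> list:
    """Return one vision token per image; phi-style tokens are indexed from 1
    (an assumption for illustration)."""
    template = VISION_TOKENS[model_type]
    if "{i}" in template:
        return [template.format(i=i) for i in range(1, num_images + 1)]
    return [template] * num_images
```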

#### OVGenAI_Whisper (`src/engine/ov_genai/whisper.py`)

Automatic speech recognition using OpenVINO GenAI Whisper.

**Key Features:**
- Processes base64-encoded audio
- Returns transcribed text and metrics
- Non-streaming only (Whisper processes entire audio)

#### ChunkStreamer (`src/engine/ov_genai/streamers.py`)

Custom streamer for chunked token streaming. Uses the OpenVINO tokenizer, not AutoTokenizer, for decoding.

**Features:**
- Accumulates tokens into chunks
- Yields chunks when chunk_size reached
- Supports chunk_size > 1 for batched streaming
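The accumulation logic can be sketched as a standalone generator. The real ChunkStreamer plugs into the OpenVINO GenAI streamer interface; this version only illustrates the buffering behavior:

```python
from typing import Iterable, Iterator

# Minimal sketch of chunked streaming: accumulate decoded tokens and flush
# every `chunk_size` tokens. Not the actual ChunkStreamer implementation.
def chunk_stream(tokens: Iterable[str], chunk_size: int = 4) -> Iterator[str]:
    buffer = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            yield "".join(buffer)
            buffer.clear()
    if buffer:  # flush any trailing partial chunk
        yield "".join(buffer)
```

With `chunk_size=1` this degenerates to per-token streaming; larger chunks trade latency for fewer, larger network writes.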

### Optimum Engine

#### Optimum_EMB (`src/engine/optimum/optimum_emb.py`)

Text-to-vector embedding model using Optimum-Intel.

**Key Features:**
- Uses `OVModelForFeatureExtraction`
- Implements last token pooling for embeddings
- Normalizes embeddings (L2 normalization)
- Supports flexible tokenizer configuration

**Token Pooling:**
- Handles left-padding vs right-padding
- Extracts last non-padding token embedding
- Normalizes to unit vectors
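The pooling steps above can be sketched in NumPy. The tensor shapes and padding semantics here are assumptions for illustration (`hidden` is `(batch, seq, dim)`, `attention_mask` is `(batch, seq)` with 1 for real tokens):

```python
import numpy as np

# Sketch of last-token pooling with L2 normalization, as described above.
def last_token_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # If every sequence's final position is a real token, the batch is
    # left-padded and the last position is always valid.
    if bool(attention_mask[:, -1].all()):
        pooled = hidden[:, -1]
    else:
        # Right-padded: pick each sequence's last non-padding position.
        last_idx = attention_mask.sum(axis=1) - 1
        pooled = hidden[np.arange(hidden.shape[0]), last_idx]
    # L2-normalize to unit vectors.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / norms
```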

#### Optimum_RR (`src/engine/optimum/optimum_rr.py`)

Document reranking model using Optimum-Intel.

**Key Features:**
- Reranks documents based on query relevance
- Supports custom prefix/suffix/instruction
- Returns ranked document lists
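The reranking flow can be sketched as prompt construction plus a sort. The scoring function here is a stand-in for the cross-encoder model, and the prompt layout is an assumption; only the prefix/suffix/instruction knobs come from the feature list above:

```python
from typing import Callable, List, Tuple

# Illustrative sketch of reranking: build a scoring prompt per document and
# return documents ordered by relevance. `score_fn` stands in for the model.
def rerank(
    query: str,
    documents: List[str],
    score_fn: Callable[[str], float],
    prefix: str = "",
    suffix: str = "",
    instruction: str = "Given a query, judge document relevance.",
) -> List[Tuple[str, float]]:
    scored = []
    for doc in documents:
        prompt = f"{prefix}{instruction}\nQuery: {query}\nDocument: {doc}{suffix}"
        scored.append((doc, score_fn(prompt)))
    # Highest relevance first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```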

### OpenVINO Engine

#### OV_Kokoro (`src/engine/openvino/kokoro.py`)

Text-to-speech model using native OpenVINO runtime.

**Key Features:**
- Processes text in chunks (character_count_chunk)
- Generates audio tensors per chunk
- Supports voice selection and language codes
- Speed control for speech generation
- Returns WAV audio format
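The per-chunk processing above starts with splitting the input text by character budget. A minimal sketch, assuming chunks split on word boundaries (the exact splitting strategy is an assumption):

```python
# Illustrative sketch of character-count chunking for TTS, mirroring the
# `character_count_chunk` idea described above.
def chunk_text(text: str, character_count_chunk: int = 200) -> list:
    words = text.split()
    chunks, current = [], ""
    for word in words:
        candidate = f"{current} {word}".strip()
        if len(candidate) <= character_count_chunk:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be synthesized to an audio tensor and the tensors concatenated into the final WAV output.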

**Voice Support:**
- Multiple languages (English, Japanese, Chinese, Spanish, etc.)
- Multiple voices per language
- Gender-specific voices

101 changes: 101 additions & 0 deletions docs/model_registration.md
@@ -0,0 +1,101 @@
# Model Registration Documentation

This document describes the model registration system, lifecycle management, and architectural patterns.

## Overview

The Model Registry (`src/server/model_registry.py`) manages the lifecycle of all models in OpenArc using a registry pattern with async background loading and a factory pattern for engine instantiation.

## Architecture Patterns

### Registry Pattern

The `ModelRegistry` maintains a central dictionary of all loaded models, tracking their status and lifecycle state. It is a volatile, in-memory datastore used internally.

**Key Components:**
- **ModelRecord**: Tracks model state (LOADING, LOADED, FAILED)
- **Async Lock**: Ensures thread-safe concurrent access
- **Event System**: Callbacks for lifecycle events

### Factory Pattern

Models are instantiated via a factory that maps `(engine, model_type)` tuples to concrete engine classes.

The factory dynamically imports and instantiates the appropriate class based on configuration.
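A minimal sketch of such a factory is below. The class names match those documented elsewhere in these docs, but the key strings, module paths, and instantiation code are illustrative assumptions:

```python
import importlib

# Hypothetical (engine, model_type) -> (module, class) mapping; keys and
# module paths are illustrative, not OpenArc's actual configuration values.
FACTORY_MAP = {
    ("ovgenai", "llm"): ("src.engine.ov_genai.llm", "OVGenAI_LLM"),
    ("ovgenai", "vlm"): ("src.engine.ov_genai.vlm", "OVGenAI_VLM"),
    ("optimum", "embedding"): ("src.engine.optimum.optimum_emb", "Optimum_EMB"),
    ("openvino", "kokoro"): ("src.engine.openvino.kokoro", "OV_Kokoro"),
}

def create_engine(engine: str, model_type: str, **kwargs):
    """Dynamically import and instantiate the engine class for a config."""
    try:
        module_path, class_name = FACTORY_MAP[(engine, model_type)]
    except KeyError:
        raise ValueError(f"No engine registered for ({engine}, {model_type})")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)
```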

### Event System

The registry fires events when models are loaded or unloaded, allowing other components (like `WorkerRegistry`) to react:

```python
# Subscribe to events
registry.add_on_loaded(on_model_loaded)
registry.add_on_unloaded(on_model_unloaded)
```
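The subscription API shown above can be sketched as a plain callback list. The `add_on_loaded`/`add_on_unloaded` names follow the snippet; everything else is illustrative:

```python
from typing import Callable, List

# Minimal sketch of a callback-based event system; not the actual
# ModelRegistry implementation.
class EventRegistry:
    def __init__(self) -> None:
        self._on_loaded: List[Callable[[str], None]] = []
        self._on_unloaded: List[Callable[[str], None]] = []

    def add_on_loaded(self, callback: Callable[[str], None]) -> None:
        self._on_loaded.append(callback)

    def add_on_unloaded(self, callback: Callable[[str], None]) -> None:
        self._on_unloaded.append(callback)

    def _fire_loaded(self, model_id: str) -> None:
        # Invoked internally once a model finishes loading.
        for callback in self._on_loaded:
            callback(model_id)
```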

## Model Lifecycle

```
┌─────────────┐
│   REQUEST   │
│ LOAD MODEL  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   CREATE    │
│ MODEL RECORD│
│  (LOADING)  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    SPAWN    │
│  LOAD TASK  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   FACTORY   │
│ INSTANTIATE │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   UPDATE    │
│  STATUS TO  │
│   LOADED    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    FIRE     │
│  CALLBACKS  │
└─────────────┘
```
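The lifecycle above can be sketched as an async background load. The record shape, lock usage, and stand-in factory step here are illustrative simplifications, not the actual `ModelRegistry` code:

```python
import asyncio

# Illustrative sketch of the lifecycle diagram: create a LOADING record,
# spawn a background task, flip the status, then fire callbacks.
class Registry:
    def __init__(self) -> None:
        self.records = {}               # model_id -> status string
        self.lock = asyncio.Lock()
        self.on_loaded = []             # callbacks fired after a load

    async def load(self, model_id: str) -> None:
        async with self.lock:
            self.records[model_id] = "LOADING"            # create record
        asyncio.create_task(self._load_task(model_id))    # spawn load task

    async def _load_task(self, model_id: str) -> None:
        try:
            await asyncio.sleep(0)      # stand-in for factory instantiation
            status = "LOADED"
        except Exception:
            status = "FAILED"
        async with self.lock:
            self.records[model_id] = status               # update status
        for callback in self.on_loaded:                   # fire callbacks
            callback(model_id)
```

Because the load runs as a background task, the HTTP handler that requested the load can return immediately while the model finishes loading.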

## Key Classes

### ModelLoadConfig

Pydantic model defining model configuration.

### ModelRecord

Dataclass tracking a registered model's state, instance, and metadata. Distinguishes between private (internal) and public (API-exposed) fields.

### ModelRegistry

Central registry implementing:
- **Async Loading**: Background tasks for model loading/unloading
- **Status Tracking**: LOADING → LOADED (or FAILED) states
- **Factory Integration**: Delegates instantiation to factory
- **Event Notifications**: Fires callbacks on lifecycle changes

## Thread Safety

All registry operations are protected by an `asyncio.Lock` for safe concurrent access from async tasks. The registry maintains separate private model IDs while exposing public model names for API access.

## Integration

The `WorkerRegistry` subscribes to model lifecycle events to automatically spawn workers when models load and clean up when they unload.
8 changes: 0 additions & 8 deletions docs/openarc_server.md

This file was deleted.
