diff --git a/README.md b/README.md index 16420d9..2a04f84 100644 --- a/README.md +++ b/README.md @@ -3,6 +3,7 @@ [![Discord](https://img.shields.io/discord/1341627368581628004?logo=Discord&logoColor=%23ffffff&label=Discord&link=https%3A%2F%2Fdiscord.gg%2FmaMY7QjG)](https://discord.gg/Bzz9hax9Jq) [![Hugging Face](https://img.shields.io/badge/πŸ€—%20Hugging%20Face-Echo9Zulu-yellow)](https://huggingface.co/Echo9Zulu) [![Devices](https://img.shields.io/badge/Devices-CPU%2FGPU%2FNPU-blue)](https://github.com/openvinotoolkit/openvino) +[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/SearchSavior/OpenArc) > [!NOTE] > OpenArc is under active development. @@ -41,11 +42,10 @@ Thanks to everyone on Discord for their continued support! - [Converting Models to OpenVINO IR](#converting-models-to-openvino-ir) - [Learning Resources](#learning-resources) - [Acknowledgments](#acknowledgments) +- [Codebase Documentation](./docs/index.md) ## Features -**OpenArc 2.0** arrives with more endpoints, better UX, pipeline paralell, NPU support and much more! - - Multi GPU Pipeline Paralell - CPU offload/Hybrid device - NPU device support @@ -183,7 +183,7 @@ openarc --help > Need help installing drivers? [Join our Discord](https://discord.gg/Bzz9hax9Jq) or open an issue. > [!NOTE] -> uv has a [pip interface](https://docs.astral.sh/uv/pip/) which is a drop in replacement for pip, but faster. Pretty cool, and a good place to start. +> uv has a [pip interface](https://docs.astral.sh/uv/pip/) which is a drop in replacement for pip, but faster. Pretty cool, and a good place to start learning uv. ## OpenArc CLI diff --git a/docs/data_types.md b/docs/data_types.md deleted file mode 100644 index e69de29..0000000 diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..8ba52c9 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,78 @@ +# OpenArc Documentation + +Welcome to OpenArc documentation! 
+ +This document collects information about the codebase structure, APIs, architecture and design patterns to help you explore the codebase. + + +- **[Server](./server.md)** - FastAPI server documentation with endpoint details +- **[Model Registration](./model_registration.md)** - How models are registered, loaded, and managed +- **[Worker Orchestration](./worker_orchestration.md)** - Worker system architecture and request routing +- **[Inference](./inference.md)** - Inference engines, class structure, and implementation details + +### Architecture Overview + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FastAPI β”‚ HTTP API Layer +β”‚ Server β”‚ (OpenAI-compatible endpoints) +β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ WorkerRegistry β”‚ Request Routing & Orchestration +β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ ModelRegistry β”‚ Model Lifecycle Management +β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Inference β”‚ Engine-specific implementations +β”‚ Engines β”‚ (OVGenAI, Optimum, OpenVINO) +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +### Key Components + +1. **Server** (`src/server/main.py`) + - FastAPI application with OpenAI-compatible endpoints + - Authentication middleware + - Request/response handling + +2. **Model Registry** (`src/server/model_registry.py`) + - Model lifecycle management (load/unload) + - Status tracking + - Factory pattern for engine instantiation + +3. **Worker Registry** (`src/server/worker_registry.py`) + - Per-model worker queues + - Request routing and orchestration + - Async packet processing + +4. 
**Inference Engines** (`src/engine/`)
   - **OVGenAI**: LLM, VLM, Whisper models
   - **Optimum**: Embedding, Reranker models
   - **OpenVINO**: Kokoro TTS models

## Supported Model Types

- **LLM**: Text-to-text language models
- **VLM**: Vision-language models (image-to-text)
- **Whisper**: Automatic speech recognition
- **Kokoro**: Text-to-speech
- **Embedding**: Text-to-vector embeddings
- **Reranker**: Document reranking

## Supported Libraries

- **OVGenAI**: OpenVINO GenAI pipeline (LLM, VLM, Whisper)
- **Optimum**: Optimum-Intel (Embedding, Reranker)
- **OpenVINO**: Native OpenVINO runtime (Kokoro TTS)

This project currently targets Intel devices, though we may expand to other frameworks/libraries in the future.

diff --git a/docs/inference.md b/docs/inference.md new file mode 100644 index 0000000..41762ae --- /dev/null +++ b/docs/inference.md @@ -0,0 +1,137 @@

# Inference Engines Documentation

OpenArc supports three inference engines, each optimized for different model types:

- **OVGenAI**: OpenVINO GenAI pipeline (LLM, VLM, Whisper)
- **Optimum**: Optimum-Intel (Embedding, Reranker)
- **OpenVINO**: Native OpenVINO runtime (Kokoro TTS)

## Engine Architecture

```
src/engine/
β”œβ”€β”€ ov_genai/
β”‚   β”œβ”€β”€ llm.py                    # OVGenAI_LLM
β”‚   β”œβ”€β”€ vlm.py                    # OVGenAI_VLM
β”‚   β”œβ”€β”€ whisper.py                # OVGenAI_Whisper
β”‚   β”œβ”€β”€ streamers.py              # ChunkStreamer
β”‚   β”œβ”€β”€ continuous_batch_llm.py
β”‚   └── continuous_batch_vlm.py
β”œβ”€β”€ optimum/
β”‚   β”œβ”€β”€ optimum_llm.py            # Optimum_LLM
β”‚   β”œβ”€β”€ optimum_vlm.py            # Optimum_VLM
β”‚   β”œβ”€β”€ optimum_emb.py            # Optimum_EMB
β”‚   └── optimum_rr.py             # Optimum_RR
└── openvino/
    β”œβ”€β”€ kokoro.py                 # OV_Kokoro
    └── kitten.py
```

## Class Hierarchy

### OVGenAI Engine

#### OVGenAI_LLM (`src/engine/ov_genai/llm.py`)

Text-to-text language model using OpenVINO GenAI LLMPipeline.
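The engine accepts several input forms (pre-encoded token IDs, raw prompts, and OpenAI-style chat messages). A toy sketch of dispatching between them; the function and field names here are hypothetical, not the real implementation:

```python
from typing import Union

def normalize_input(request: Union[list, str]) -> dict:
    """Illustrative dispatch over the three input forms OVGenAI_LLM accepts."""
    if isinstance(request, list) and request and isinstance(request[0], int):
        return {"kind": "input_ids", "payload": request}   # pre-encoded token IDs
    if isinstance(request, str):
        return {"kind": "prompt", "payload": request}      # raw prompt text
    return {"kind": "messages", "payload": request}        # OpenAI chat messages

print(normalize_input([1, 2, 3])["kind"])                            # input_ids
print(normalize_input("Hello")["kind"])                              # prompt
print(normalize_input([{"role": "user", "content": "Hi"}])["kind"])  # messages
```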
**Key Features:**
- Supports OpenAI-compatible chat message format with chat templates
- Tool calling support (tools parameter in messages)
- Streaming and non-streaming generation modes
- Multiple input formats: pre-encoded input_ids, raw prompts, and chat messages
- ChunkStreamer for batched token streaming (chunk_size > 1)
- Performance metrics collection (ttft, throughput, etc.)
- Uses AutoTokenizer for encoding, model tokenizer for decoding

#### OVGenAI_VLM (`src/engine/ov_genai/vlm.py`)

Vision-language model using OpenVINO GenAI VLMPipeline.

**Key Features:**
- Supports OpenAI-compatible multimodal message format with embedded images
- Tool calling support (tools parameter in messages)
- Streaming and non-streaming generation modes
- Extracts base64-encoded images from OpenAI message format
- Converts images to OpenVINO tensors for inference
- Inserts model-specific vision tokens at image positions
- Supports multiple images per request with proper token indexing
- ChunkStreamer for batched token streaming (chunk_size > 1)
- Performance metrics collection (ttft, throughput, etc.)
- Uses chat templates with vision token insertion

**Vision Token Types:**
- `internvl2`: `<image>`
- `llava15`: `<image>`
- `llavanext`: `<image>`
- `minicpmv26`: `(<image>./</image>)`
- `phi3vision`: `<|image_{i}|>`
- `phi4mm`: `<|image_{i}|>`
- `qwen2vl`: `<|vision_start|><|image_pad|><|vision_end|>`
- `qwen25vl`: `<|vision_start|><|image_pad|><|vision_end|>`
- `gemma3`: `<start_of_image>`

#### OVGenAI_Whisper (`src/engine/ov_genai/whisper.py`)

Automatic speech recognition using OpenVINO GenAI Whisper.

**Key Features:**
- Processes base64-encoded audio
- Returns transcribed text and metrics
- Non-streaming only (Whisper processes entire audio)

#### ChunkStreamer (`src/engine/ov_genai/streamers.py`)

Custom streamer for chunked token streaming. Uses OpenVINO tokenizer, not AutoTokenizer, for decode.
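The chunking logic can be sketched standalone. The real ChunkStreamer plugs into the GenAI streaming callback and decodes with the OpenVINO tokenizer; the names below are illustrative only:

```python
class ChunkAccumulator:
    """Toy stand-in for ChunkStreamer: buffers token IDs and yields a
    decoded chunk once chunk_size tokens have accumulated."""

    def __init__(self, decode, chunk_size: int = 4):
        self.decode = decode          # e.g. the tokenizer's decode function
        self.chunk_size = chunk_size
        self.buffer: list = []

    def put(self, token_id: int):
        self.buffer.append(token_id)
        if len(self.buffer) >= self.chunk_size:
            return self.flush()
        return None                   # not enough tokens buffered yet

    def flush(self):
        chunk, self.buffer = self.decode(self.buffer), []
        return chunk

decode = lambda ids: "".join(f"<{i}>" for i in ids)   # fake decoder for the demo
acc = ChunkAccumulator(decode, chunk_size=2)
print(acc.put(1))   # None (still buffering)
print(acc.put(2))   # <1><2>
```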
+ +**Features:** +- Accumulates tokens into chunks +- Yields chunks when chunk_size reached +- Supports chunk_size > 1 for batched streaming + +### Optimum Engine + +#### Optimum_EMB (`src/engine/optimum/optimum_emb.py`) + +Text-to-vector embedding model using Optimum-Intel. + +**Key Features:** +- Uses `OVModelForFeatureExtraction` +- Implements last token pooling for embeddings +- Normalizes embeddings (L2 normalization) +- Supports flexible tokenizer configuration + +**Token Pooling:** +- Handles left-padding vs right-padding +- Extracts last non-padding token embedding +- Normalizes to unit vectors + +#### Optimum_RR (`src/engine/optimum/optimum_rr.py`) + +Document reranking model using Optimum-Intel. + +**Key Features:** +- Reranks documents based on query relevance +- Supports custom prefix/suffix/instruction +- Returns ranked document lists + +### OpenVINO Engine + +#### OV_Kokoro (`src/engine/openvino/kokoro.py`) + +Text-to-speech model using native OpenVINO runtime. + +**Key Features:** +- Processes text in chunks (character_count_chunk) +- Generates audio tensors per chunk +- Supports voice selection and language codes +- Speed control for speech generation +- Returns WAV audio format + +**Voice Support:** +- Multiple languages (English, Japanese, Chinese, Spanish, etc.) +- Multiple voices per language +- Gender-specific voices + +# \ No newline at end of file diff --git a/docs/model_registration.md b/docs/model_registration.md new file mode 100644 index 0000000..a001bc7 --- /dev/null +++ b/docs/model_registration.md @@ -0,0 +1,101 @@ +# Model Registration Documentation + +This document describes the model registration system, lifecycle management, and architectural patterns. + +## Overview + +The Model Registry (`src/server/model_registry.py`) manages the lifecycle of all models in OpenArc using a registry pattern with async background loading and a factory pattern for engine instantiation. 
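The pieces named above (the registry, async background loading, and the factory) fit together roughly as follows. This is a toy sketch with hypothetical names, not the real classes:

```python
import asyncio

# Illustrative (engine, model_type) -> constructor mapping, as the factory pattern describes
ENGINE_FACTORY = {
    ("ovgenai", "llm"): lambda cfg: f"OVGenAI_LLM({cfg['model_path']})",
}

class ToyRegistry:
    def __init__(self):
        self.models, self._lock = {}, asyncio.Lock()

    async def load(self, cfg):
        async with self._lock:
            self.models[cfg["model_name"]] = {"status": "LOADING"}
        # The real server lets this task run in the background so the HTTP
        # handler returns immediately; we await it here for determinism.
        await asyncio.create_task(self._do_load(cfg))

    async def _do_load(self, cfg):
        instance = ENGINE_FACTORY[(cfg["engine"], cfg["model_type"])](cfg)
        async with self._lock:
            self.models[cfg["model_name"]] = {"status": "LOADED", "instance": instance}

reg = ToyRegistry()
asyncio.run(reg.load({"model_name": "m", "engine": "ovgenai",
                      "model_type": "llm", "model_path": "/models/m"}))
print(reg.models["m"]["status"])   # LOADED
```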
+ +## Architecture Patterns + +### Registry Pattern + +The `ModelRegistry` maintains a central dictionary of all loaded models, tracking their status and lifecycle state. It is a volatile in memory datastore used internally. + +**Key Components:** +- **ModelRecord**: Tracks model state (LOADING, LOADED, FAILED) +- **Async Lock**: Ensures thread-safe concurrent access +- **Event System**: Callbacks for lifecycle events + +### Factory Pattern + +Models are instantiated via a factory that maps `(engine, model_type)` tuples to concrete engine classes: + +The factory dynamically imports and instantiates the appropriate class based on configuration. + +### Event System + +The registry fires events when models are loaded or unloaded, allowing other components (like `WorkerRegistry`) to react: + +```python +# Subscribe to events +registry.add_on_loaded(on_model_loaded) +registry.add_on_unloaded(on_model_unloaded) +``` + +## Model Lifecycle + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ REQUEST β”‚ +β”‚ LOAD MODEL β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ CREATE β”‚ +β”‚ MODEL RECORDβ”‚ +β”‚ (LOADING) β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ SPAWN β”‚ +β”‚ LOAD TASK β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FACTORY β”‚ +β”‚ INSTANTIATE β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ UPDATE β”‚ +β”‚ STATUS TO β”‚ +β”‚ LOADED β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FIRE β”‚ +β”‚ CALLBACKS β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +## Key Classes + +### ModelLoadConfig + +Pydantic model defining model configuration. 
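Its fields mirror the documented `/openarc/load` request body. A sketch of the shape, using a plain dataclass as a stand-in for the real Pydantic model (the defaults shown are assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelLoadConfigSketch:
    # Field names follow the /openarc/load request body; the real
    # ModelLoadConfig is a Pydantic model with validation on top.
    model_path: str
    model_name: str
    model_type: str            # "llm", "vlm", "whisper", "kokoro", "embedding", "reranker"
    engine: str                # "ovgenai", "optimum", "openvino"
    device: str = "CPU"
    runtime_config: dict = field(default_factory=dict)
    vlm_type: Optional[str] = None   # e.g. "qwen2vl"; only relevant for VLMs

cfg = ModelLoadConfigSketch(model_path="/path/to/model", model_name="my-model",
                            model_type="llm", engine="ovgenai", device="GPU.0")
print(cfg.model_name)   # my-model
```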
+ +### ModelRecord + +Dataclass tracking a registered model's state, instance, and metadata. Distinguishes between private (internal) and public (API-exposed) fields. + +### ModelRegistry + +Central registry implementing: +- **Async Loading**: Background tasks for model loading/unloading +- **Status Tracking**: LOADING β†’ LOADED β†’ FAILED states +- **Factory Integration**: Delegates instantiation to factory +- **Event Notifications**: Fires callbacks on lifecycle changes + +## Thread Safety + +All registry operations are protected by `asyncio.Lock` for thread-safe concurrent access. The registry maintains separate private model IDs while exposing public model names for API access. + +## Integration + +The `WorkerRegistry` subscribes to model lifecycle events to automatically spawn workers when models load and clean up when they unload. diff --git a/docs/openarc_server.md b/docs/openarc_server.md deleted file mode 100644 index ceae8af..0000000 --- a/docs/openarc_server.md +++ /dev/null @@ -1,8 +0,0 @@ -This document describes the features of OpenArc server. - - - - -## Model Lifecycle - - diff --git a/docs/server.md b/docs/server.md new file mode 100644 index 0000000..4d396dc --- /dev/null +++ b/docs/server.md @@ -0,0 +1,593 @@ +# OpenArc Server Documentation + +This document describes the FastAPI server implementation, endpoints, and API structure. 
+ +## Table of Contents + +- [Overview](#overview) +- [Server Architecture](#server-architecture) + - [Key Components](#key-components) +- [Authentication](#authentication) +- [CORS Configuration](#cors-configuration) +- [Endpoints](#endpoints) + - [OpenArc Internal Endpoints](#openarc-internal-endpoints) + - [`POST /openarc/load`](#post-openarcload) + - [`POST /openarc/unload`](#post-openarcunload) + - [`GET /openarc/status`](#get-openarcstatus) + - [`POST /openarc/bench`](#post-openarcbench) + - [OpenAI-Compatible Endpoints](#openai-compatible-endpoints) + - [`GET /v1/models`](#get-v1models) + - [`POST /v1/chat/completions`](#post-v1chatcompletions) + - [`POST /v1/completions`](#post-v1completions) + - [`POST /v1/audio/transcriptions`](#post-v1audiotranscriptions) + - [`POST /v1/audio/speech`](#post-v1audiospeech) + - [`POST /v1/embeddings`](#post-v1embeddings) + - [`POST /v1/rerank`](#post-v1rerank) +- [Request Models](#request-models) + - [OpenAIChatCompletionRequest](#openaichatcompletionrequest) + - [OpenAICompletionRequest](#openaicompletionrequest) + - [OpenAIWhisperRequest](#openaiwhisperrequest) + - [OpenAIKokoroRequest](#openaikokororequest) + - [EmbeddingsRequest](#embeddingsrequest) + - [RerankRequest](#rerankrequest) +- [Tool Calling Support](#tool-calling-support) + - [Parser Implementation](#parser-implementation) +- [Metrics](#metrics) +- [Startup Models](#startup-models) + +## Overview + +The OpenArc server is built with FastAPI and provides OpenAI-compatible endpoints for inference. The server is located in `src/server/main.py`. 
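Any OpenAI-compatible client can talk to it. As a stdlib-only sketch, here is how an authenticated request could be built by hand (the host, port, and model name are placeholders):

```python
import json, os, urllib.request

def build_chat_request(base_url: str, model: str, messages: list):
    """Build an authenticated /v1/chat/completions request for an OpenArc server."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Bearer token read from the same env var the server uses
            "Authorization": f"Bearer {os.environ.get('OPENARC_API_KEY', '')}",
        },
        method="POST",
    )

req = build_chat_request("http://localhost:8000", "my-model",
                         [{"role": "user", "content": "Hello!"}])
print(req.full_url)             # http://localhost:8000/v1/chat/completions
# urllib.request.urlopen(req)   # uncomment against a running server
```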
+ +## Server Architecture + + + +### Key Components + +- **FastAPI Application**: Main application instance with lifespan events +- **Model Registry**: Manages model lifecycle (load/unload) +- **Worker Registry**: Routes requests to appropriate workers +- **Authentication**: Bearer token authentication via `OPENARC_API_KEY` + +## Authentication + +All endpoints require authentication via Bearer token: + +```python +Authorization: Bearer +``` + +The API key is configured via the `OPENARC_API_KEY` environment variable. + +## Endpoints + +### OpenArc Internal Endpoints + +#### `POST /openarc/load` + +Load a model onto the server. + +**Request Body:** +```json +{ + "model_path": "/path/to/model", + "model_name": "my-model", + "model_type": "llm", + "engine": "ovgenai", + "device": "GPU.0", + "runtime_config": {}, + "vlm_type": null +} +``` + +**Response:** +```json +{ + "model_id": "unique-model-id", + "model_name": "my-model", + "status": "loaded" +} +``` + +**Status Codes:** +- `200`: Model loaded successfully +- `400`: Invalid request (e.g., model name already exists) +- `500`: Loading failed + +#### `POST /openarc/unload` + +Unload a model from the server. + +**Request Body:** +```json +{ + "model_name": "my-model" +} +``` + +**Response:** +```json +{ + "model_name": "my-model", + "status": "unloading" +} +``` + +**Status Codes:** +- `200`: Unload initiated +- `404`: Model not found +- `500`: Unload failed + +#### `GET /openarc/status` + +Get status of all loaded models. + +**Response:** +```json +{ + "total_loaded_models": 2, + "models": [ + { + "model_name": "my-model", + "model_type": "llm", + "engine": "ovgenai", + "device": "GPU.0", + "runtime_config": {}, + "status": "loaded", + "time_loaded": "2024-01-01T00:00:00" + } + ], + "openai_model_names": ["my-model"] +} +``` + +#### `POST /openarc/bench` + +Benchmark model performance with pre-encoded input IDs. 
+ +**Request Body:** +```json +{ + "model": "my-model", + "input_ids": [1, 2, 3, ...], + "max_tokens": 512, + "temperature": 1.0, + "top_p": 1.0, + "top_k": 50, + "repetition_penalty": 1.0 +} +``` + +**Response:** +```json +{ + "metrics": { + "ttft": 0.123, + "prefill_throughput": 100.5, + "decode_throughput": 50.2, + "decode_duration": 2.5, + "tpot": 0.025, + "input_token": 512, + "new_token": 128, + "total_token": 640 + } +} +``` + +### OpenAI-Compatible Endpoints + +#### `GET /v1/models` + +List all available models. + +**Response:** +```json +{ + "object": "list", + "data": [ + { + "id": "my-model", + "object": "model", + "created": 1704067200, + "owned_by": "OpenArc" + } + ] +} +``` + +#### `POST /v1/chat/completions` + +Chat completions endpoint for LLM and VLM models. + +**Request Body:** +```json +{ + "model": "my-model", + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Hello!"} + ], + "tools": [], + "stream": false, + "temperature": 1.0, + "max_tokens": 512, + "top_p": 1.0, + "top_k": 50, + "repetition_penalty": 1.0 +} +``` + +**Response (non-streaming):** +```json +{ + "id": "ov-abc123...", + "object": "chat.completion", + "created": 1704067200, + "model": "my-model", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Hello! How can I help you?" + }, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 10, + "completion_tokens": 8, + "total_tokens": 18 + }, + "metrics": { + "ttft": 0.123, + "prefill_throughput": 100.5, + "decode_throughput": 50.2 + } +} +``` + +**Streaming Response:** +Server-Sent Events (SSE) format: +``` +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", ...} +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", ...} +data: [DONE] +``` + +#### `POST /v1/completions` + +Text completions endpoint for LLM models (legacy endpoint). 
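Like chat completions, this endpoint accepts `stream: true` and emits SSE `data:` lines. A minimal client-side sketch for consuming such a stream:

```python
import json

def iter_sse_json(lines):
    """Yield parsed JSON payloads from SSE 'data:' lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # ignore comments / keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

sample = [
    'data: {"object": "text_completion", "choices": [{"text": " Pa"}]}',
    'data: {"object": "text_completion", "choices": [{"text": "ris."}]}',
    "data: [DONE]",
]
text = "".join(chunk["choices"][0]["text"] for chunk in iter_sse_json(sample))
print(text)   # " Paris." (leading space preserved)
```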
+ +**Request Body:** +```json +{ + "model": "my-model", + "prompt": "The capital of France is", + "stream": false, + "temperature": 1.0, + "max_tokens": 512 +} +``` + +**Response:** +```json +{ + "id": "ov-abc123...", + "object": "text_completion", + "created": 1704067200, + "model": "my-model", + "choices": [ + { + "index": 0, + "text": " Paris.", + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 5, + "completion_tokens": 2, + "total_tokens": 7 + } +} +``` + +#### `POST /v1/audio/transcriptions` + +Transcribe audio using Whisper models. + +**Request Body:** +```json +{ + "model": "whisper-model", + "audio_base64": "base64-encoded-audio-data" +} +``` + +**Response:** +```json +{ + "text": "Transcribed text here", + "metrics": { + "input_token": 100, + "new_token": 50, + "total_token": 150 + } +} +``` + +#### `POST /v1/audio/speech` + +Generate speech using Kokoro TTS models. + +**Request Body:** +```json +{ + "model": "kokoro-model", + "input": "Hello, world!", + "voice": "af_heart", + "speed": 1.0, + "language": "a", + "response_format": "wav" +} +``` + +**Response:** +Returns WAV audio file as binary stream with `Content-Type: audio/wav`. + +#### `POST /v1/embeddings` + +Generate text embeddings. + +**Request Body:** +```json +{ + "model": "embedding-model", + "input": "Text to embed", + "dimensions": null, + "encoding_format": "float", + "config": { + "max_length": 512, + "padding": true, + "truncation": true + } +} +``` + +**Response:** +```json +{ + "id": "ov-abc123...", + "object": "list", + "created": 1704067200, + "model": "embedding-model", + "data": [ + { + "index": 0, + "object": "embedding", + "embedding": [0.1, 0.2, ...] + } + ], + "usage": { + "prompt_tokens": 5, + "total_tokens": 5 + } +} +``` + +#### `POST /v1/rerank` + +Rerank documents based on a query. 
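How the prefix, suffix, and instruction wrap each query/document pair is model-specific. One plausible assembly, shown purely as an illustration (the template below is an assumption, not the engine's actual format):

```python
def build_rerank_inputs(query, documents, prefix="", suffix="", instruction=""):
    """Sketch: wrap each (query, document) pair into one scoring prompt.
    The exact template is model-specific; this layout is only illustrative."""
    return [
        f"{prefix}{instruction}\nQuery: {query}\nDocument: {doc}{suffix}"
        for doc in documents
    ]

inputs = build_rerank_inputs("search query", ["doc1", "doc2"],
                             instruction="Judge relevance.")
print(len(inputs))           # 2
print("doc2" in inputs[1])   # True
```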
+ +**Request Body:** +```json +{ + "model": "reranker-model", + "query": "search query", + "documents": ["doc1", "doc2", "doc3"], + "prefix": "<|im_start|>system\n...", + "suffix": "<|im_end|>\n...", + "instruction": "Given a search query..." +} +``` + +**Response:** +```json +{ + "id": "ov-abc123...", + "object": "list", + "created": 1704067200, + "model": "reranker-model", + "data": [ + { + "index": 0, + "object": "ranked_documents", + "ranked_documents": ["doc2", "doc1", "doc3"] + } + ], + "usage": { + "prompt_tokens": 50, + "total_tokens": 50 + } +} +``` + +## Request Models + +### OpenAIChatCompletionRequest +- `model`: str +- `messages`: List[Dict] +- `tools`: Optional[List[Dict]] +- `stream`: Optional[bool] +- `temperature`: Optional[float] +- `max_tokens`: Optional[int] +- `stop`: Optional[List[str]] +- `top_p`: Optional[float] +- `top_k`: Optional[int] +- `repetition_penalty`: Optional[float] +- `do_sample`: Optional[bool] +- `num_return_sequences`: Optional[int] + +### OpenAICompletionRequest +- `model`: str +- `prompt`: Union[str, List[str]] +- `stream`: Optional[bool] +- `temperature`: Optional[float] +- `max_tokens`: Optional[int] +- `stop`: Optional[List[str]] +- `top_p`: Optional[float] +- `top_k`: Optional[int] +- `repetition_penalty`: Optional[float] +- `do_sample`: Optional[bool] +- `num_return_sequences`: Optional[int] + +### OpenAIWhisperRequest +- `model`: str +- `audio_base64`: str + +### OpenAIKokoroRequest +- `model`: str +- `input`: str +- `voice`: Optional[str] +- `speed`: Optional[float] +- `language`: Optional[str] +- `response_format`: Optional[str] + +### EmbeddingsRequest +- `model`: str +- `input`: Union[str, List[str], List[List[str]]] +- `dimensions`: Optional[int] +- `encoding_format`: Optional[str] +- `user`: Optional[str] +- `config`: Optional[PreTrainedTokenizerConfig] + +### RerankRequest +- `model`: str +- `query`: str +- `documents`: List[str] +- `prefix`: Optional[str] +- `suffix`: Optional[str] +- `instruction`: 
Optional[str] + +## Tool Calling Support + +OpenArc supports OpenAI-compatible tool calling. Tools are parsed from model output using regex pattern matching for JSON objects containing `name` and `arguments` fields. + +Tool calls are detected in streaming and non-streaming modes: +- **Streaming**: Tool calls are detected incrementally and streamed as structured chunks +- **Non-streaming**: Tool calls are parsed from the final output + +### Parser Implementation + +The `parse_tool_calls()` function searches for JSON objects in the model's text output and converts them to OpenAI-compatible tool call format. + +**Input Format (Model Output):** + +The parser expects JSON objects embedded in the text with the following structure: + +```json +{ + "name": "function_name", + "arguments": { + "arg1": "value1", + "arg2": "value2" + } +} +``` + +**Input to the parser from a model:** + +``` +The user wants to know the weather. {"name": "get_weather", "arguments": {"location": "San Francisco", "units": "celsius"}} I'll check that for you. 
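A self-contained sketch of this extraction step, simplified relative to the server's implementation (the regex and `call_` ID format follow this section's description):

```python
import json, re, secrets

# One level of brace nesting, per the documented pattern
TOOL_CALL_RE = re.compile(r"\{(?:[^{}]|(?:\{[^{}]*\}))*\}")

def parse_tool_calls(text):
    """Extract OpenAI-style tool calls from model output; None if none found."""
    calls = []
    for candidate in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue                          # not valid JSON, skip
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            calls.append({
                "id": f"call_{secrets.token_hex(12)}",   # 24 hex chars
                "type": "function",
                "function": {"name": obj["name"],
                             "arguments": json.dumps(obj["arguments"])},
            })
    return calls or None

out = parse_tool_calls('Sure. {"name": "get_weather", "arguments": {"location": "SF"}} Done.')
print(out[0]["function"]["name"])   # get_weather
```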
+``` + +**Output Format (OpenAI-Compatible):** + +Parser returns a list of tool call objects in OpenAI format: + +```json +[ + { + "id": "call_abc123def456...", + "type": "function", + "function": { + "name": "get_weather", + "arguments": "{\"location\": \"San Francisco\", \"units\": \"celsius\"}" + } + } +] +``` + +**Parser Behavior:** + +- Searches for JSON objects using regex pattern: `\{(?:[^{}]|(?:\{[^{}]*\}))*\}` +- Validates that each JSON object contains both `name` and `arguments` fields +- Generates unique IDs in format `call_{24-char-hex}` +- Converts `arguments` to JSON string (required by OpenAI format) +- Returns `None` if no valid tool calls are found + +**Example Response (Non-Streaming):** + +When tool calls are detected, the response includes: + +```json +{ + "id": "ov-abc123...", + "object": "chat.completion", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": null, + "tool_calls": [ + { + "id": "call_abc123def456...", + "type": "function", + "function": { + "name": "get_weather", + "arguments": "{\"location\": \"San Francisco\", \"units\": \"celsius\"}" + } + } + ] + }, + "finish_reason": "tool_calls" + } + ] +} +``` + +**Example Response (Streaming):** + +Tool calls are streamed as structured chunks: + +``` +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"tool_calls": [{"index": 0, "id": "call_abc123...", "type": "function", "function": {"name": "get_weather", "arguments": ""}}]}}]} +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"tool_calls": [{"index": 0, "function": {"arguments": "{\"location\": \"San Francisco\"}"}}]}}]} +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}]} +data: [DONE] +``` + +## Metrics + +All inference endpoints return performance metrics: +- `ttft`: Time to first token +- `prefill_throughput`: Prefill 
tokens per second +- `decode_throughput`: Decode tokens per second +- `decode_duration`: Total decode duration +- `tpot`: Time per output token +- `input_token`: Number of input tokens +- `new_token`: Number of generated tokens +- `total_token`: Total tokens processed + +## Startup Models + +Models can be automatically loaded on server startup via the `OPENARC_STARTUP_MODELS` environment variable: + +```bash +export OPENARC_STARTUP_MODELS="model1,model2,model3" +``` + +The server will read `openarc_config.json` and load the specified models during startup. + diff --git a/docs/worker_orchestration.md b/docs/worker_orchestration.md new file mode 100644 index 0000000..e60e10a --- /dev/null +++ b/docs/worker_orchestration.md @@ -0,0 +1,59 @@ +# Worker Orchestration Documentation + +This document describes the worker system architecture, request routing, and how inference requests are processed. + +## Architecture + +``` +Request β†’ WorkerRegistry β†’ Model Queue β†’ Queue Worker β†’ InferWorker β†’ Model Instance +``` + +## WorkerPacket + +Dataclass representing an inference request packet flowing through the system. + + +## Error Handling + +### Inference Failures + +If inference fails (exception caught in InferWorker): +1. Error stored in `packet.response` as `"Error: ..."` +2. Metrics set to None +3. QueueWorker detects error response +4. Triggers model unload via `registry.register_unload()` +5. Worker loop exits +6. 
Server remains unblocked and no workers stall + +## Thread Safety + +- Queues are thread-safe (`asyncio.Queue`) +- WorkerRegistry uses `asyncio.Lock` for queue/task dictionary access +- Each model has its own queue and worker, ensuring isolation + +## Concurrency Model + +- **Per-Model Workers**: Each loaded model has its own dedicated worker +- **Async Queues**: Requests are queued and processed asynchronously +- **Parallel Processing**: Multiple models can process requests concurrently +- **Streaming Support**: Streaming uses separate queue mechanism + +## Design Patterns + +### Queue-Based Processing +- Decouples request submission from execution +- Enables backpressure handling +- Supports multiple concurrent requests per model + +### Worker Pattern +- Dedicated worker per model +- Long-running async loops +- Clean shutdown via None sentinel + +### Future-Based Communication +- Non-streaming uses `asyncio.Future` for result communication +- Enables async/await pattern + +### Queue-Based Streaming +- Streaming uses `asyncio.Queue` for token delivery +- Enables async iteration pattern diff --git a/project.md b/project.md deleted file mode 100644 index a73f7b6..0000000 --- a/project.md +++ /dev/null @@ -1,9 +0,0 @@ -### OpenArc src_codex: Guide to src with dicussion - -This document can be used as an entry point into the OpenArc codebase. - - - - - -