diff --git a/README.md b/README.md index 16420d9..2a04f84 100644 --- a/README.md +++ b/README.md @@ -3,6 +3,7 @@ [![Discord](https://img.shields.io/discord/1341627368581628004?logo=Discord&logoColor=%23ffffff&label=Discord&link=https%3A%2F%2Fdiscord.gg%2FmaMY7QjG)](https://discord.gg/Bzz9hax9Jq) [![Hugging Face](https://img.shields.io/badge/πŸ€—%20Hugging%20Face-Echo9Zulu-yellow)](https://huggingface.co/Echo9Zulu) [![Devices](https://img.shields.io/badge/Devices-CPU%2FGPU%2FNPU-blue)](https://github.com/openvinotoolkit/openvino) +[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/SearchSavior/OpenArc) > [!NOTE] > OpenArc is under active development. @@ -41,11 +42,10 @@ Thanks to everyone on Discord for their continued support! - [Converting Models to OpenVINO IR](#converting-models-to-openvino-ir) - [Learning Resources](#learning-resources) - [Acknowledgments](#acknowledgments) +- [Codebase Documentation](./docs/index.md) ## Features -**OpenArc 2.0** arrives with more endpoints, better UX, pipeline paralell, NPU support and much more! - - Multi GPU Pipeline Paralell - CPU offload/Hybrid device - NPU device support @@ -183,7 +183,7 @@ openarc --help > Need help installing drivers? [Join our Discord](https://discord.gg/Bzz9hax9Jq) or open an issue. > [!NOTE] -> uv has a [pip interface](https://docs.astral.sh/uv/pip/) which is a drop in replacement for pip, but faster. Pretty cool, and a good place to start. +> uv has a [pip interface](https://docs.astral.sh/uv/pip/) which is a drop in replacement for pip, but faster. Pretty cool, and a good place to start learning uv. ## OpenArc CLI diff --git a/docs/data_types.md b/docs/data_types.md deleted file mode 100644 index e69de29..0000000 diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..8ba52c9 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,78 @@ +# OpenArc Documentation + +Welcome to OpenArc documentation! 
+ +This document collects information about the codebase structure, APIs, architecture and design patterns to help you explore the codebase. + + +- **[Server](./server.md)** - FastAPI server documentation with endpoint details +- **[Model Registration](./model_registration.md)** - How models are registered, loaded, and managed +- **[Worker Orchestration](./worker_orchestration.md)** - Worker system architecture and request routing +- **[Inference](./inference.md)** - Inference engines, class structure, and implementation details + +### Architecture Overview + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FastAPI β”‚ HTTP API Layer +β”‚ Server β”‚ (OpenAI-compatible endpoints) +β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ WorkerRegistry β”‚ Request Routing & Orchestration +β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ ModelRegistry β”‚ Model Lifecycle Management +β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Inference β”‚ Engine-specific implementations +β”‚ Engines β”‚ (OVGenAI, Optimum, OpenVINO) +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +### Key Components + +1. **Server** (`src/server/main.py`) + - FastAPI application with OpenAI-compatible endpoints + - Authentication middleware + - Request/response handling + +2. **Model Registry** (`src/server/model_registry.py`) + - Model lifecycle management (load/unload) + - Status tracking + - Factory pattern for engine instantiation + +3. **Worker Registry** (`src/server/worker_registry.py`) + - Per-model worker queues + - Request routing and orchestration + - Async packet processing + +4. 
**Inference Engines** (`src/engine/`)
   - **OVGenAI**: LLM, VLM, Whisper models
   - **Optimum**: Embedding, Reranker models
   - **OpenVINO**: Kokoro TTS models

## Supported Model Types

- **LLM**: Text-to-text language models
- **VLM**: Vision-language models (image-to-text)
- **Whisper**: Automatic speech recognition
- **Kokoro**: Text-to-speech
- **Embedding**: Text-to-vector embeddings
- **Reranker**: Document reranking

## Supported Libraries

- **OVGenAI**: OpenVINO GenAI pipeline (LLM, VLM, Whisper)
- **Optimum**: Optimum-Intel (Embedding, Reranker)
- **OpenVINO**: Native OpenVINO runtime (Kokoro TTS)

This project currently targets Intel devices, though we may expand to other frameworks/libraries in the future.

diff --git a/docs/inference.md b/docs/inference.md new file mode 100644 index 0000000..41762ae --- /dev/null +++ b/docs/inference.md @@ -0,0 +1,137 @@

# Inference Engines Documentation

OpenArc supports three inference engines, each optimized for different model types:

- **OVGenAI**: OpenVINO GenAI pipeline (LLM, VLM, Whisper)
- **Optimum**: Optimum-Intel (Embedding, Reranker)
- **OpenVINO**: Native OpenVINO runtime (Kokoro TTS)

## Engine Architecture

```
src/engine/
β”œβ”€β”€ ov_genai/
β”‚   β”œβ”€β”€ llm.py                    # OVGenAI_LLM
β”‚   β”œβ”€β”€ vlm.py                    # OVGenAI_VLM
β”‚   β”œβ”€β”€ whisper.py                # OVGenAI_Whisper
β”‚   β”œβ”€β”€ streamers.py              # ChunkStreamer
β”‚   β”œβ”€β”€ continuous_batch_llm.py
β”‚   └── continuous_batch_vlm.py
β”œβ”€β”€ optimum/
β”‚   β”œβ”€β”€ optimum_llm.py            # Optimum_LLM
β”‚   β”œβ”€β”€ optimum_vlm.py            # Optimum_VLM
β”‚   β”œβ”€β”€ optimum_emb.py            # Optimum_EMB
β”‚   └── optimum_rr.py             # Optimum_RR
└── openvino/
    β”œβ”€β”€ kokoro.py                 # OV_Kokoro
    └── kitten.py
```

## Class Hierarchy

### OVGenAI Engine

#### OVGenAI_LLM (`src/engine/ov_genai/llm.py`)

Text-to-text language model using OpenVINO GenAI LLMPipeline.
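The engine accepts several input forms (pre-encoded token IDs, raw prompts, and OpenAI-style chat messages). A toy sketch of dispatching between them; the function and field names here are hypothetical, not the real implementation:

```python
from typing import Union

def normalize_input(request: Union[list, str]) -> dict:
    """Illustrative dispatch over the three input forms OVGenAI_LLM accepts."""
    if isinstance(request, list) and request and isinstance(request[0], int):
        return {"kind": "input_ids", "payload": request}   # pre-encoded token IDs
    if isinstance(request, str):
        return {"kind": "prompt", "payload": request}      # raw prompt text
    return {"kind": "messages", "payload": request}        # OpenAI chat messages

print(normalize_input([1, 2, 3])["kind"])                            # input_ids
print(normalize_input("Hello")["kind"])                              # prompt
print(normalize_input([{"role": "user", "content": "Hi"}])["kind"])  # messages
```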
**Key Features:**
- Supports OpenAI-compatible chat message format with chat templates
- Tool calling support (tools parameter in messages)
- Streaming and non-streaming generation modes
- Multiple input formats: pre-encoded input_ids, raw prompts, and chat messages
- ChunkStreamer for batched token streaming (chunk_size > 1)
- Performance metrics collection (ttft, throughput, etc.)
- Uses AutoTokenizer for encoding, model tokenizer for decoding

#### OVGenAI_VLM (`src/engine/ov_genai/vlm.py`)

Vision-language model using OpenVINO GenAI VLMPipeline.

**Key Features:**
- Supports OpenAI-compatible multimodal message format with embedded images
- Tool calling support (tools parameter in messages)
- Streaming and non-streaming generation modes
- Extracts base64-encoded images from OpenAI message format
- Converts images to OpenVINO tensors for inference
- Inserts model-specific vision tokens at image positions
- Supports multiple images per request with proper token indexing
- ChunkStreamer for batched token streaming (chunk_size > 1)
- Performance metrics collection (ttft, throughput, etc.)
- Uses chat templates with vision token insertion

**Vision Token Types:**
- `internvl2`: `<image>`
- `llava15`: `<image>`
- `llavanext`: `<image>`
- `minicpmv26`: `(<image>./</image>)`
- `phi3vision`: `<|image_{i}|>`
- `phi4mm`: `<|image_{i}|>`
- `qwen2vl`: `<|vision_start|><|image_pad|><|vision_end|>`
- `qwen25vl`: `<|vision_start|><|image_pad|><|vision_end|>`
- `gemma3`: `<start_of_image>`

#### OVGenAI_Whisper (`src/engine/ov_genai/whisper.py`)

Automatic speech recognition using OpenVINO GenAI Whisper.

**Key Features:**
- Processes base64-encoded audio
- Returns transcribed text and metrics
- Non-streaming only (Whisper processes entire audio)

#### ChunkStreamer (`src/engine/ov_genai/streamers.py`)

Custom streamer for chunked token streaming. Uses OpenVINO tokenizer, not AutoTokenizer, for decode.
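The chunking logic can be sketched standalone. The real ChunkStreamer plugs into the GenAI streaming callback and decodes with the OpenVINO tokenizer; the names below are illustrative only:

```python
class ChunkAccumulator:
    """Toy stand-in for ChunkStreamer: buffers token IDs and yields a
    decoded chunk once chunk_size tokens have accumulated."""

    def __init__(self, decode, chunk_size: int = 4):
        self.decode = decode          # e.g. the tokenizer's decode function
        self.chunk_size = chunk_size
        self.buffer: list = []

    def put(self, token_id: int):
        self.buffer.append(token_id)
        if len(self.buffer) >= self.chunk_size:
            return self.flush()
        return None                   # not enough tokens buffered yet

    def flush(self):
        chunk, self.buffer = self.decode(self.buffer), []
        return chunk

decode = lambda ids: "".join(f"<{i}>" for i in ids)   # fake decoder for the demo
acc = ChunkAccumulator(decode, chunk_size=2)
print(acc.put(1))   # None (still buffering)
print(acc.put(2))   # <1><2>
```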
+ +**Features:** +- Accumulates tokens into chunks +- Yields chunks when chunk_size reached +- Supports chunk_size > 1 for batched streaming + +### Optimum Engine + +#### Optimum_EMB (`src/engine/optimum/optimum_emb.py`) + +Text-to-vector embedding model using Optimum-Intel. + +**Key Features:** +- Uses `OVModelForFeatureExtraction` +- Implements last token pooling for embeddings +- Normalizes embeddings (L2 normalization) +- Supports flexible tokenizer configuration + +**Token Pooling:** +- Handles left-padding vs right-padding +- Extracts last non-padding token embedding +- Normalizes to unit vectors + +#### Optimum_RR (`src/engine/optimum/optimum_rr.py`) + +Document reranking model using Optimum-Intel. + +**Key Features:** +- Reranks documents based on query relevance +- Supports custom prefix/suffix/instruction +- Returns ranked document lists + +### OpenVINO Engine + +#### OV_Kokoro (`src/engine/openvino/kokoro.py`) + +Text-to-speech model using native OpenVINO runtime. + +**Key Features:** +- Processes text in chunks (character_count_chunk) +- Generates audio tensors per chunk +- Supports voice selection and language codes +- Speed control for speech generation +- Returns WAV audio format + +**Voice Support:** +- Multiple languages (English, Japanese, Chinese, Spanish, etc.) +- Multiple voices per language +- Gender-specific voices + +# \ No newline at end of file diff --git a/docs/model_registration.md b/docs/model_registration.md new file mode 100644 index 0000000..a001bc7 --- /dev/null +++ b/docs/model_registration.md @@ -0,0 +1,101 @@ +# Model Registration Documentation + +This document describes the model registration system, lifecycle management, and architectural patterns. + +## Overview + +The Model Registry (`src/server/model_registry.py`) manages the lifecycle of all models in OpenArc using a registry pattern with async background loading and a factory pattern for engine instantiation. 
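The pieces named above (the registry, async background loading, and the factory) fit together roughly as follows. This is a toy sketch with hypothetical names, not the real classes:

```python
import asyncio

# Illustrative (engine, model_type) -> constructor mapping, as the factory pattern describes
ENGINE_FACTORY = {
    ("ovgenai", "llm"): lambda cfg: f"OVGenAI_LLM({cfg['model_path']})",
}

class ToyRegistry:
    def __init__(self):
        self.models, self._lock = {}, asyncio.Lock()

    async def load(self, cfg):
        async with self._lock:
            self.models[cfg["model_name"]] = {"status": "LOADING"}
        # The real server lets this task run in the background so the HTTP
        # handler returns immediately; we await it here for determinism.
        await asyncio.create_task(self._do_load(cfg))

    async def _do_load(self, cfg):
        instance = ENGINE_FACTORY[(cfg["engine"], cfg["model_type"])](cfg)
        async with self._lock:
            self.models[cfg["model_name"]] = {"status": "LOADED", "instance": instance}

reg = ToyRegistry()
asyncio.run(reg.load({"model_name": "m", "engine": "ovgenai",
                      "model_type": "llm", "model_path": "/models/m"}))
print(reg.models["m"]["status"])   # LOADED
```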
+ +## Architecture Patterns + +### Registry Pattern + +The `ModelRegistry` maintains a central dictionary of all loaded models, tracking their status and lifecycle state. It is a volatile in memory datastore used internally. + +**Key Components:** +- **ModelRecord**: Tracks model state (LOADING, LOADED, FAILED) +- **Async Lock**: Ensures thread-safe concurrent access +- **Event System**: Callbacks for lifecycle events + +### Factory Pattern + +Models are instantiated via a factory that maps `(engine, model_type)` tuples to concrete engine classes: + +The factory dynamically imports and instantiates the appropriate class based on configuration. + +### Event System + +The registry fires events when models are loaded or unloaded, allowing other components (like `WorkerRegistry`) to react: + +```python +# Subscribe to events +registry.add_on_loaded(on_model_loaded) +registry.add_on_unloaded(on_model_unloaded) +``` + +## Model Lifecycle + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ REQUEST β”‚ +β”‚ LOAD MODEL β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ CREATE β”‚ +β”‚ MODEL RECORDβ”‚ +β”‚ (LOADING) β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ SPAWN β”‚ +β”‚ LOAD TASK β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FACTORY β”‚ +β”‚ INSTANTIATE β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ UPDATE β”‚ +β”‚ STATUS TO β”‚ +β”‚ LOADED β”‚ +β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FIRE β”‚ +β”‚ CALLBACKS β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +## Key Classes + +### ModelLoadConfig + +Pydantic model defining model configuration. 
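Its fields mirror the documented `/openarc/load` request body. A sketch of the shape, using a plain dataclass as a stand-in for the real Pydantic model (the defaults shown are assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelLoadConfigSketch:
    # Field names follow the /openarc/load request body; the real
    # ModelLoadConfig is a Pydantic model with validation on top.
    model_path: str
    model_name: str
    model_type: str            # "llm", "vlm", "whisper", "kokoro", "embedding", "reranker"
    engine: str                # "ovgenai", "optimum", "openvino"
    device: str = "CPU"
    runtime_config: dict = field(default_factory=dict)
    vlm_type: Optional[str] = None   # e.g. "qwen2vl"; only relevant for VLMs

cfg = ModelLoadConfigSketch(model_path="/path/to/model", model_name="my-model",
                            model_type="llm", engine="ovgenai", device="GPU.0")
print(cfg.model_name)   # my-model
```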
+ +### ModelRecord + +Dataclass tracking a registered model's state, instance, and metadata. Distinguishes between private (internal) and public (API-exposed) fields. + +### ModelRegistry + +Central registry implementing: +- **Async Loading**: Background tasks for model loading/unloading +- **Status Tracking**: LOADING β†’ LOADED β†’ FAILED states +- **Factory Integration**: Delegates instantiation to factory +- **Event Notifications**: Fires callbacks on lifecycle changes + +## Thread Safety + +All registry operations are protected by `asyncio.Lock` for thread-safe concurrent access. The registry maintains separate private model IDs while exposing public model names for API access. + +## Integration + +The `WorkerRegistry` subscribes to model lifecycle events to automatically spawn workers when models load and clean up when they unload. diff --git a/docs/openarc_server.md b/docs/openarc_server.md deleted file mode 100644 index ceae8af..0000000 --- a/docs/openarc_server.md +++ /dev/null @@ -1,8 +0,0 @@ -This document describes the features of OpenArc server. - - - - -## Model Lifecycle - - diff --git a/docs/server.md b/docs/server.md new file mode 100644 index 0000000..4d396dc --- /dev/null +++ b/docs/server.md @@ -0,0 +1,593 @@ +# OpenArc Server Documentation + +This document describes the FastAPI server implementation, endpoints, and API structure. 
+ +## Table of Contents + +- [Overview](#overview) +- [Server Architecture](#server-architecture) + - [Key Components](#key-components) +- [Authentication](#authentication) +- [CORS Configuration](#cors-configuration) +- [Endpoints](#endpoints) + - [OpenArc Internal Endpoints](#openarc-internal-endpoints) + - [`POST /openarc/load`](#post-openarcload) + - [`POST /openarc/unload`](#post-openarcunload) + - [`GET /openarc/status`](#get-openarcstatus) + - [`POST /openarc/bench`](#post-openarcbench) + - [OpenAI-Compatible Endpoints](#openai-compatible-endpoints) + - [`GET /v1/models`](#get-v1models) + - [`POST /v1/chat/completions`](#post-v1chatcompletions) + - [`POST /v1/completions`](#post-v1completions) + - [`POST /v1/audio/transcriptions`](#post-v1audiotranscriptions) + - [`POST /v1/audio/speech`](#post-v1audiospeech) + - [`POST /v1/embeddings`](#post-v1embeddings) + - [`POST /v1/rerank`](#post-v1rerank) +- [Request Models](#request-models) + - [OpenAIChatCompletionRequest](#openaichatcompletionrequest) + - [OpenAICompletionRequest](#openaicompletionrequest) + - [OpenAIWhisperRequest](#openaiwhisperrequest) + - [OpenAIKokoroRequest](#openaikokororequest) + - [EmbeddingsRequest](#embeddingsrequest) + - [RerankRequest](#rerankrequest) +- [Tool Calling Support](#tool-calling-support) + - [Parser Implementation](#parser-implementation) +- [Metrics](#metrics) +- [Startup Models](#startup-models) + +## Overview + +The OpenArc server is built with FastAPI and provides OpenAI-compatible endpoints for inference. The server is located in `src/server/main.py`. 
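Any OpenAI-compatible client can talk to it. As a stdlib-only sketch, here is how an authenticated request could be built by hand (the host, port, and model name are placeholders):

```python
import json, os, urllib.request

def build_chat_request(base_url: str, model: str, messages: list):
    """Build an authenticated /v1/chat/completions request for an OpenArc server."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Bearer token read from the same env var the server uses
            "Authorization": f"Bearer {os.environ.get('OPENARC_API_KEY', '')}",
        },
        method="POST",
    )

req = build_chat_request("http://localhost:8000", "my-model",
                         [{"role": "user", "content": "Hello!"}])
print(req.full_url)             # http://localhost:8000/v1/chat/completions
# urllib.request.urlopen(req)   # uncomment against a running server
```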
+ +## Server Architecture + + + +### Key Components + +- **FastAPI Application**: Main application instance with lifespan events +- **Model Registry**: Manages model lifecycle (load/unload) +- **Worker Registry**: Routes requests to appropriate workers +- **Authentication**: Bearer token authentication via `OPENARC_API_KEY` + +## Authentication + +All endpoints require authentication via Bearer token: + +```python +Authorization: Bearer +``` + +The API key is configured via the `OPENARC_API_KEY` environment variable. + +## Endpoints + +### OpenArc Internal Endpoints + +#### `POST /openarc/load` + +Load a model onto the server. + +**Request Body:** +```json +{ + "model_path": "/path/to/model", + "model_name": "my-model", + "model_type": "llm", + "engine": "ovgenai", + "device": "GPU.0", + "runtime_config": {}, + "vlm_type": null +} +``` + +**Response:** +```json +{ + "model_id": "unique-model-id", + "model_name": "my-model", + "status": "loaded" +} +``` + +**Status Codes:** +- `200`: Model loaded successfully +- `400`: Invalid request (e.g., model name already exists) +- `500`: Loading failed + +#### `POST /openarc/unload` + +Unload a model from the server. + +**Request Body:** +```json +{ + "model_name": "my-model" +} +``` + +**Response:** +```json +{ + "model_name": "my-model", + "status": "unloading" +} +``` + +**Status Codes:** +- `200`: Unload initiated +- `404`: Model not found +- `500`: Unload failed + +#### `GET /openarc/status` + +Get status of all loaded models. + +**Response:** +```json +{ + "total_loaded_models": 2, + "models": [ + { + "model_name": "my-model", + "model_type": "llm", + "engine": "ovgenai", + "device": "GPU.0", + "runtime_config": {}, + "status": "loaded", + "time_loaded": "2024-01-01T00:00:00" + } + ], + "openai_model_names": ["my-model"] +} +``` + +#### `POST /openarc/bench` + +Benchmark model performance with pre-encoded input IDs. 
+ +**Request Body:** +```json +{ + "model": "my-model", + "input_ids": [1, 2, 3, ...], + "max_tokens": 512, + "temperature": 1.0, + "top_p": 1.0, + "top_k": 50, + "repetition_penalty": 1.0 +} +``` + +**Response:** +```json +{ + "metrics": { + "ttft": 0.123, + "prefill_throughput": 100.5, + "decode_throughput": 50.2, + "decode_duration": 2.5, + "tpot": 0.025, + "input_token": 512, + "new_token": 128, + "total_token": 640 + } +} +``` + +### OpenAI-Compatible Endpoints + +#### `GET /v1/models` + +List all available models. + +**Response:** +```json +{ + "object": "list", + "data": [ + { + "id": "my-model", + "object": "model", + "created": 1704067200, + "owned_by": "OpenArc" + } + ] +} +``` + +#### `POST /v1/chat/completions` + +Chat completions endpoint for LLM and VLM models. + +**Request Body:** +```json +{ + "model": "my-model", + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Hello!"} + ], + "tools": [], + "stream": false, + "temperature": 1.0, + "max_tokens": 512, + "top_p": 1.0, + "top_k": 50, + "repetition_penalty": 1.0 +} +``` + +**Response (non-streaming):** +```json +{ + "id": "ov-abc123...", + "object": "chat.completion", + "created": 1704067200, + "model": "my-model", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Hello! How can I help you?" + }, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 10, + "completion_tokens": 8, + "total_tokens": 18 + }, + "metrics": { + "ttft": 0.123, + "prefill_throughput": 100.5, + "decode_throughput": 50.2 + } +} +``` + +**Streaming Response:** +Server-Sent Events (SSE) format: +``` +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", ...} +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", ...} +data: [DONE] +``` + +#### `POST /v1/completions` + +Text completions endpoint for LLM models (legacy endpoint). 
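Like chat completions, this endpoint accepts `stream: true` and emits SSE `data:` lines. A minimal client-side sketch for consuming such a stream:

```python
import json

def iter_sse_json(lines):
    """Yield parsed JSON payloads from SSE 'data:' lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # ignore comments / keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

sample = [
    'data: {"object": "text_completion", "choices": [{"text": " Pa"}]}',
    'data: {"object": "text_completion", "choices": [{"text": "ris."}]}',
    "data: [DONE]",
]
text = "".join(chunk["choices"][0]["text"] for chunk in iter_sse_json(sample))
print(text)   # " Paris." (leading space preserved)
```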
+ +**Request Body:** +```json +{ + "model": "my-model", + "prompt": "The capital of France is", + "stream": false, + "temperature": 1.0, + "max_tokens": 512 +} +``` + +**Response:** +```json +{ + "id": "ov-abc123...", + "object": "text_completion", + "created": 1704067200, + "model": "my-model", + "choices": [ + { + "index": 0, + "text": " Paris.", + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 5, + "completion_tokens": 2, + "total_tokens": 7 + } +} +``` + +#### `POST /v1/audio/transcriptions` + +Transcribe audio using Whisper models. + +**Request Body:** +```json +{ + "model": "whisper-model", + "audio_base64": "base64-encoded-audio-data" +} +``` + +**Response:** +```json +{ + "text": "Transcribed text here", + "metrics": { + "input_token": 100, + "new_token": 50, + "total_token": 150 + } +} +``` + +#### `POST /v1/audio/speech` + +Generate speech using Kokoro TTS models. + +**Request Body:** +```json +{ + "model": "kokoro-model", + "input": "Hello, world!", + "voice": "af_heart", + "speed": 1.0, + "language": "a", + "response_format": "wav" +} +``` + +**Response:** +Returns WAV audio file as binary stream with `Content-Type: audio/wav`. + +#### `POST /v1/embeddings` + +Generate text embeddings. + +**Request Body:** +```json +{ + "model": "embedding-model", + "input": "Text to embed", + "dimensions": null, + "encoding_format": "float", + "config": { + "max_length": 512, + "padding": true, + "truncation": true + } +} +``` + +**Response:** +```json +{ + "id": "ov-abc123...", + "object": "list", + "created": 1704067200, + "model": "embedding-model", + "data": [ + { + "index": 0, + "object": "embedding", + "embedding": [0.1, 0.2, ...] + } + ], + "usage": { + "prompt_tokens": 5, + "total_tokens": 5 + } +} +``` + +#### `POST /v1/rerank` + +Rerank documents based on a query. 
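How the prefix, suffix, and instruction wrap each query/document pair is model-specific. One plausible assembly, shown purely as an illustration (the template below is an assumption, not the engine's actual format):

```python
def build_rerank_inputs(query, documents, prefix="", suffix="", instruction=""):
    """Sketch: wrap each (query, document) pair into one scoring prompt.
    The exact template is model-specific; this layout is only illustrative."""
    return [
        f"{prefix}{instruction}\nQuery: {query}\nDocument: {doc}{suffix}"
        for doc in documents
    ]

inputs = build_rerank_inputs("search query", ["doc1", "doc2"],
                             instruction="Judge relevance.")
print(len(inputs))           # 2
print("doc2" in inputs[1])   # True
```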
+ +**Request Body:** +```json +{ + "model": "reranker-model", + "query": "search query", + "documents": ["doc1", "doc2", "doc3"], + "prefix": "<|im_start|>system\n...", + "suffix": "<|im_end|>\n...", + "instruction": "Given a search query..." +} +``` + +**Response:** +```json +{ + "id": "ov-abc123...", + "object": "list", + "created": 1704067200, + "model": "reranker-model", + "data": [ + { + "index": 0, + "object": "ranked_documents", + "ranked_documents": ["doc2", "doc1", "doc3"] + } + ], + "usage": { + "prompt_tokens": 50, + "total_tokens": 50 + } +} +``` + +## Request Models + +### OpenAIChatCompletionRequest +- `model`: str +- `messages`: List[Dict] +- `tools`: Optional[List[Dict]] +- `stream`: Optional[bool] +- `temperature`: Optional[float] +- `max_tokens`: Optional[int] +- `stop`: Optional[List[str]] +- `top_p`: Optional[float] +- `top_k`: Optional[int] +- `repetition_penalty`: Optional[float] +- `do_sample`: Optional[bool] +- `num_return_sequences`: Optional[int] + +### OpenAICompletionRequest +- `model`: str +- `prompt`: Union[str, List[str]] +- `stream`: Optional[bool] +- `temperature`: Optional[float] +- `max_tokens`: Optional[int] +- `stop`: Optional[List[str]] +- `top_p`: Optional[float] +- `top_k`: Optional[int] +- `repetition_penalty`: Optional[float] +- `do_sample`: Optional[bool] +- `num_return_sequences`: Optional[int] + +### OpenAIWhisperRequest +- `model`: str +- `audio_base64`: str + +### OpenAIKokoroRequest +- `model`: str +- `input`: str +- `voice`: Optional[str] +- `speed`: Optional[float] +- `language`: Optional[str] +- `response_format`: Optional[str] + +### EmbeddingsRequest +- `model`: str +- `input`: Union[str, List[str], List[List[str]]] +- `dimensions`: Optional[int] +- `encoding_format`: Optional[str] +- `user`: Optional[str] +- `config`: Optional[PreTrainedTokenizerConfig] + +### RerankRequest +- `model`: str +- `query`: str +- `documents`: List[str] +- `prefix`: Optional[str] +- `suffix`: Optional[str] +- `instruction`: 
Optional[str] + +## Tool Calling Support + +OpenArc supports OpenAI-compatible tool calling. Tools are parsed from model output using regex pattern matching for JSON objects containing `name` and `arguments` fields. + +Tool calls are detected in streaming and non-streaming modes: +- **Streaming**: Tool calls are detected incrementally and streamed as structured chunks +- **Non-streaming**: Tool calls are parsed from the final output + +### Parser Implementation + +The `parse_tool_calls()` function searches for JSON objects in the model's text output and converts them to OpenAI-compatible tool call format. + +**Input Format (Model Output):** + +The parser expects JSON objects embedded in the text with the following structure: + +```json +{ + "name": "function_name", + "arguments": { + "arg1": "value1", + "arg2": "value2" + } +} +``` + +**Input to the parser from a model:** + +``` +The user wants to know the weather. {"name": "get_weather", "arguments": {"location": "San Francisco", "units": "celsius"}} I'll check that for you. 
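A self-contained sketch of this extraction step, simplified relative to the server's implementation (the regex and `call_` ID format follow this section's description):

```python
import json, re, secrets

# One level of brace nesting, per the documented pattern
TOOL_CALL_RE = re.compile(r"\{(?:[^{}]|(?:\{[^{}]*\}))*\}")

def parse_tool_calls(text):
    """Extract OpenAI-style tool calls from model output; None if none found."""
    calls = []
    for candidate in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue                          # not valid JSON, skip
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            calls.append({
                "id": f"call_{secrets.token_hex(12)}",   # 24 hex chars
                "type": "function",
                "function": {"name": obj["name"],
                             "arguments": json.dumps(obj["arguments"])},
            })
    return calls or None

out = parse_tool_calls('Sure. {"name": "get_weather", "arguments": {"location": "SF"}} Done.')
print(out[0]["function"]["name"])   # get_weather
```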
+``` + +**Output Format (OpenAI-Compatible):** + +Parser returns a list of tool call objects in OpenAI format: + +```json +[ + { + "id": "call_abc123def456...", + "type": "function", + "function": { + "name": "get_weather", + "arguments": "{\"location\": \"San Francisco\", \"units\": \"celsius\"}" + } + } +] +``` + +**Parser Behavior:** + +- Searches for JSON objects using regex pattern: `\{(?:[^{}]|(?:\{[^{}]*\}))*\}` +- Validates that each JSON object contains both `name` and `arguments` fields +- Generates unique IDs in format `call_{24-char-hex}` +- Converts `arguments` to JSON string (required by OpenAI format) +- Returns `None` if no valid tool calls are found + +**Example Response (Non-Streaming):** + +When tool calls are detected, the response includes: + +```json +{ + "id": "ov-abc123...", + "object": "chat.completion", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": null, + "tool_calls": [ + { + "id": "call_abc123def456...", + "type": "function", + "function": { + "name": "get_weather", + "arguments": "{\"location\": \"San Francisco\", \"units\": \"celsius\"}" + } + } + ] + }, + "finish_reason": "tool_calls" + } + ] +} +``` + +**Example Response (Streaming):** + +Tool calls are streamed as structured chunks: + +``` +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"tool_calls": [{"index": 0, "id": "call_abc123...", "type": "function", "function": {"name": "get_weather", "arguments": ""}}]}}]} +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"tool_calls": [{"index": 0, "function": {"arguments": "{\"location\": \"San Francisco\"}"}}]}}]} +data: {"id": "ov-abc123...", "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}]} +data: [DONE] +``` + +## Metrics + +All inference endpoints return performance metrics: +- `ttft`: Time to first token +- `prefill_throughput`: Prefill 
tokens per second +- `decode_throughput`: Decode tokens per second +- `decode_duration`: Total decode duration +- `tpot`: Time per output token +- `input_token`: Number of input tokens +- `new_token`: Number of generated tokens +- `total_token`: Total tokens processed + +## Startup Models + +Models can be automatically loaded on server startup via the `OPENARC_STARTUP_MODELS` environment variable: + +```bash +export OPENARC_STARTUP_MODELS="model1,model2,model3" +``` + +The server will read `openarc_config.json` and load the specified models during startup. + diff --git a/docs/worker_orchestration.md b/docs/worker_orchestration.md new file mode 100644 index 0000000..e60e10a --- /dev/null +++ b/docs/worker_orchestration.md @@ -0,0 +1,59 @@ +# Worker Orchestration Documentation + +This document describes the worker system architecture, request routing, and how inference requests are processed. + +## Architecture + +``` +Request β†’ WorkerRegistry β†’ Model Queue β†’ Queue Worker β†’ InferWorker β†’ Model Instance +``` + +## WorkerPacket + +Dataclass representing an inference request packet flowing through the system. + + +## Error Handling + +### Inference Failures + +If inference fails (exception caught in InferWorker): +1. Error stored in `packet.response` as `"Error: ..."` +2. Metrics set to None +3. QueueWorker detects error response +4. Triggers model unload via `registry.register_unload()` +5. Worker loop exits +6. 
Server remains unblocked and no workers stall + +## Thread Safety + +- Queues are thread-safe (`asyncio.Queue`) +- WorkerRegistry uses `asyncio.Lock` for queue/task dictionary access +- Each model has its own queue and worker, ensuring isolation + +## Concurrency Model + +- **Per-Model Workers**: Each loaded model has its own dedicated worker +- **Async Queues**: Requests are queued and processed asynchronously +- **Parallel Processing**: Multiple models can process requests concurrently +- **Streaming Support**: Streaming uses separate queue mechanism + +## Design Patterns + +### Queue-Based Processing +- Decouples request submission from execution +- Enables backpressure handling +- Supports multiple concurrent requests per model + +### Worker Pattern +- Dedicated worker per model +- Long-running async loops +- Clean shutdown via None sentinel + +### Future-Based Communication +- Non-streaming uses `asyncio.Future` for result communication +- Enables async/await pattern + +### Queue-Based Streaming +- Streaming uses `asyncio.Queue` for token delivery +- Enables async iteration pattern diff --git a/project.md b/project.md deleted file mode 100644 index a73f7b6..0000000 --- a/project.md +++ /dev/null @@ -1,9 +0,0 @@ -### OpenArc src_codex: Guide to src with dicussion - -This document can be used as an entry point into the OpenArc codebase. - - - - - -