
realizar - Pure Rust ML Inference Engine


ML inference from scratch in Rust. GGUF/SafeTensors parsing, quantization (Q4_K, Q8_0), transformer inference. SIMD/GPU via Trueno.

Quick Start

cargo install realizar
realizar serve --demo --port 8080
curl -X POST http://localhost:8080/generate -d '{"prompt": "Hello", "max_tokens": 10}'
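The same endpoint can be called from Rust. Below is a minimal client sketch that assumes the reqwest crate (with the "blocking" and "json" features) and serde_json as dependencies, and simply reuses the request shape from the curl example above; field names may need adjusting against the actual API.

use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same request as the curl example above.
    let client = reqwest::blocking::Client::new();
    let resp: serde_json::Value = client
        .post("http://localhost:8080/generate")
        .json(&json!({ "prompt": "Hello", "max_tokens": 10 }))
        .send()?
        .json()?;
    println!("{resp}");
    Ok(())
}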

Features

| Category | Details |
|----------|---------|
| Formats | GGUF, SafeTensors, APR (native) |
| Quantization | Q4_0, Q8_0, Q4_K, Q5_K, Q6_K |
| Inference | Transformer, RoPE, KV cache, Flash Attention |
| Chat Templates | ChatML, LLaMA2, Mistral, Phi, Alpaca (auto-detect) |
| API | REST, streaming, Prometheus metrics |
| GPU | CUDA via trueno-gpu (pure Rust PTX) |
| Quality | 2,400+ tests, 95% function coverage |
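As a reference for the quantization formats listed above, the sketch below dequantizes a single Q8_0 block using the standard GGUF layout (32 signed 8-bit weights sharing one f16 scale). It is an illustration of the format only, not realizar's quantize.rs code; the half crate is assumed for the f16 scale.

use half::f16;

/// One GGUF Q8_0 block: a shared f16 scale plus 32 signed 8-bit weights
/// (layout per the GGUF spec; realizar's internal types may differ).
struct BlockQ80 {
    d: f16,       // scale
    qs: [i8; 32], // quantized weights
}

/// Dequantize a block: w[i] = d * qs[i].
fn dequantize_q8_0(block: &BlockQ80) -> [f32; 32] {
    let d = block.d.to_f32();
    let mut out = [0.0f32; 32];
    for (o, &q) in out.iter_mut().zip(block.qs.iter()) {
        *o = d * q as f32;
    }
    out
}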

Benchmarks

APR Format (Classical ML - Pure Rust)

| Model | Parameters | Latency | Throughput |
|-------|------------|---------|------------|
| Iris | 131 | 103ns | 9.6M inferences/sec |
| MNIST | 103K | 73µs | 13.6K inferences/sec |
| Large NN | 1M | 410µs | 2.4K inferences/sec |

GGUF Format (LLM Inference)

| Model | Size | Runtime | Backend | Throughput |
|-------|------|---------|---------|------------|
| Phi-2 Q4_K_M | 2.7B | realizar | RTX 4090 (CUDA) | 276 tok/s |
| Phi-2 Q4_K_M | 2.7B | llama.cpp | RTX 4090 (CUDA) | 256 tok/s |
| Phi-2 Q4_K_M | 2.7B | Ollama | RTX 4090 (CUDA) | 228 tok/s |
| Phi-2 Q4_K_M | 2.7B | realizar | CPU (AVX2) | ~15 tok/s |

realizar achieves 8-21% faster inference than llama.cpp/Ollama via pure Rust CUDA PTX generation (no LLVM, no nvcc).

The Complete Benchmark Matrix

Same model (Phi-2 2.7B Q4_K) across ALL runtimes and formats:

GGUF Format (Same Model)

| Runtime | Backend | p50 Latency | Throughput | Command |
|---------|---------|-------------|------------|---------|
| realizar | CUDA | ~3.6ms | 276 tok/s | --features cuda |
| llama.cpp | CUDA | 162ms | 256 tok/s | llama-server -ngl 99 |
| Ollama | CUDA | ~120ms | 228 tok/s | ollama serve |
| realizar | CPU | ~500ms | ~15 tok/s | cargo bench gguf_real |
| llama.cpp | CPU | ~3000ms | ~15 tok/s | llama-server -ngl 0 |

APR Format (Classical ML)

| Runtime | Backend | p50 Latency | Throughput | Command |
|---------|---------|-------------|------------|---------|
| realizar | CPU | 103ns-410µs | 2.4K-9.6M/s | cargo bench apr_real |

Note: realizar is a pure Rust implementation with CUDA support via trueno-gpu. With GPU acceleration, realizar achieves 8-21% faster inference than llama.cpp/Ollama while maintaining a pure Rust codebase (no C/C++ dependencies, no LLVM, no nvcc).

Run the full matrix yourself:

# 1. Start external servers
llama-server -m phi-2-q4_k_m.gguf --port 8082 -ngl 99  # GPU
llama-server -m phi-2-q4_k_m.gguf --port 8083 -ngl 0   # CPU
ollama serve &          # run in the background (or in a separate terminal)
ollama pull phi2:2.7b

# 2. Run full matrix benchmark
./scripts/bench-matrix.sh --full

# 3. Run internal APR vs GGUF comparison (same model)
cargo bench --bench comparative

# 4. Convert GGUF to APR and compare
realizar convert model.gguf --output model.apr  # Coming soon

Benchmark Matrix (ELI5)

What This Matrix Measures

Think of it like comparing cars:

| Runtime | Which "engine" runs your model? |
|---------|---------------------------------|
| realizar | Our pure Rust engine (this project) |
| llama.cpp | Popular C++ engine (industry standard) |
| Ollama | User-friendly wrapper around llama.cpp |

| Backend | Which "fuel" powers the engine? |
|---------|---------------------------------|
| CPU | Regular processor (slower, always works) |
| CUDA | NVIDIA GPU (fastest, needs GPU) |
| WGPU | Cross-platform GPU (good balance) |

| Format | Which "fuel type" for your model? |
|--------|-----------------------------------|
| GGUF | Quantized LLMs (smaller, fast) |
| APR | Our native format (fastest for small ML) |
| SafeTensors | HuggingFace format (full precision) |

Matrix Result = Runtime × Backend × Format

Examples:

  • "llama.cpp + CUDA + GGUF" = 256 tok/s on RTX 4090
  • "realizar + CPU + APR" = 9.6M inf/s for tiny models

Why This Matters:

  • Small Models (Iris, MNIST): Use APR format → nanosecond-to-microsecond latency
  • Large Models (LLMs): Use GGUF format → GPU acceleration essential
  • Production: Match your hardware to the right runtime/backend combo

Server Benchmark Results

| Server | Backend | Mean Latency (ms) | Throughput (tok/s) |
|--------|---------|-------------------|--------------------|
| realizar | CUDA | 3.6 | 276 |
| llama.cpp | CUDA | 162 | 256 |
| Ollama | CUDA | 120 | 228 |

Methodology: CV-based stopping per Hoefler & Belli (SC'15). RTX 4090, Phi-2 2.7B Q4_K_M.

Run Benchmarks

Quick Start

# Internal benchmarks (no external servers required)
cargo bench --bench apr_real      # APR format (classical ML)
cargo bench --bench gguf_real     # GGUF format (transformers)
cargo bench --bench comparative   # APR vs GGUF comparison

Benchmark Against llama.cpp and Ollama

Step 1: Start External Servers

# Terminal 1: llama.cpp with GPU (full GPU offload)
llama-server -m /path/to/phi-2-q4_k_m.gguf --host 127.0.0.1 --port 8082 -ngl 99

# Terminal 2: llama.cpp with CPU only
llama-server -m /path/to/phi-2-q4_k_m.gguf --host 127.0.0.1 --port 8083 -ngl 0

# Terminal 3: Ollama (uses GPU by default)
ollama serve   # Default port 11434
ollama pull phi2:2.7b  # Pull model first

Step 2: Run the Benchmark Matrix

# Full benchmark matrix (CV-based stopping, statistically significant)
./scripts/bench-matrix.sh --full

# Quick benchmark (fewer iterations)
./scripts/bench-matrix.sh --quick

# Programmatic benchmark via Rust
cargo bench --bench external_matrix --features bench-http

Step 3: View Results

Results are saved to benches/comparative/results/:

  • benchmark_matrix_TIMESTAMP.json - Raw data
  • benchmark_matrix_TIMESTAMP.md - Markdown table

Full Backend × Runtime Matrix

| What to Benchmark | Command |
|-------------------|---------|
| realizar (CPU) | cargo bench --bench apr_real |
| realizar (WGPU) | cargo bench --bench gguf_real --features gpu |
| llama.cpp (CPU) | Start server with -ngl 0, run ./scripts/bench-matrix.sh |
| llama.cpp (CUDA) | Start server with -ngl 99, run ./scripts/bench-matrix.sh |
| Ollama (GPU) | Start ollama serve, run ./scripts/bench-matrix.sh |

Methodology

All benchmarks follow Hoefler & Belli SC'15:

  • CV-based stopping: Iterate until the coefficient of variation drops below 10% (a minimal sketch of this stopping rule follows the list)
  • Warmup: 2-10 iterations discarded before measurement
  • Metrics: p50, p99 latency, throughput (tok/s), cold start
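
The sketch below illustrates the CV-based stopping rule described above: discard warmup runs, then keep collecting timing samples until the coefficient of variation (standard deviation divided by mean) falls below the target. It is a simplified illustration of the methodology, not the code used by bench-matrix.sh.

/// Run `measure` until the coefficient of variation of the collected samples
/// drops below `cv_target`, or `max_iters` is reached.
/// Warmup iterations are discarded before measurement begins.
fn run_until_stable<F: FnMut() -> f64>(
    mut measure: F,
    warmup: usize,
    min_iters: usize,
    max_iters: usize,
    cv_target: f64,
) -> Vec<f64> {
    for _ in 0..warmup {
        let _ = measure(); // discard warmup samples
    }
    let mut samples = Vec::new();
    while samples.len() < max_iters {
        samples.push(measure());
        if samples.len() >= min_iters {
            let n = samples.len() as f64;
            let mean = samples.iter().sum::<f64>() / n;
            let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
            let cv = var.sqrt() / mean;
            if cv < cv_target {
                break; // statistically stable per the CV criterion
            }
        }
    }
    samples
}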

Example Output

╔════════════════════════════════════════════════════════════════╗
║          Realizar Benchmark Matrix v1.1                        ║
╚════════════════════════════════════════════════════════════════╝

=== llama.cpp (GPU) ===
  [10/30] Latency: 114.2ms | TPS: 477.1
  CV stable at 0.048 after 10 iterations

=== Ollama (GPU) ===
  [12/30] Latency: 123.4ms | TPS: 258.6
  CV stable at 0.089 after 12 iterations

| Runtime | Backend | p50 Latency | p99 Latency | Throughput |
|---------|---------|-------------|-------------|------------|
| llama-cpp | gpu | 114.2ms | 161.0ms | 477.1 tok/s |
| ollama | gpu | 123.4ms | 145.2ms | 258.6 tok/s |

See docs/benchmarking-other-servers.md for full methodology.

Chat Templates

Format LLM conversations for different model families with automatic template detection:

use realizar::chat_template::{
    auto_detect_template, ChatMessage, ChatTemplateEngine
};

// Auto-detect template from model name
let template = auto_detect_template("Qwen2-0.5B-Instruct");

let messages = vec![
    ChatMessage::system("You are a helpful assistant."),
    ChatMessage::user("Hello!"),
];

let formatted = template.format_conversation(&messages)?;

Supported Formats:

| Format | Models | System Prompt |
|--------|--------|---------------|
| ChatML | Qwen2, Yi, OpenHermes | Yes |
| Llama2 | TinyLlama, Vicuna, LLaMA 2 | Yes |
| Mistral | Mistral-7B, Mixtral | No |
| Phi | Phi-2, Phi-3 | Yes |
| Alpaca | Alpaca, Guanaco | Yes |
| Raw | Fallback | Passthrough |
| Custom | Any (Jinja2) | Configurable |
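
For reference, ChatML-family models (first row above) frame conversations with <|im_start|>/<|im_end|> markers. Under that convention, the two-message example from the Rust snippet above would render roughly as follows (exact whitespace and the trailing assistant header may differ in realizar's output):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant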

See examples/chat_template.rs for complete usage.

Examples

# All examples
cargo run --example inference          # Basic inference demo
cargo run --example api_server         # HTTP server demo
cargo run --example chat_template      # Chat template formatting
cargo run --example gguf_loading       # Load GGUF models
cargo run --example apr_loading        # Load APR models
cargo run --example tokenization       # Tokenizer demo
cargo run --example safetensors_loading # SafeTensors demo
cargo run --example observability_demo  # Metrics demo
cargo run --example model_cache        # Caching demo

Usage

realizar serve --demo --port 8080     # Demo server
curl http://localhost:8080/health     # Health check
curl http://localhost:8080/metrics    # Prometheus

OpenAI-Compatible API

# Chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

# Streaming
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}],"stream":true}'

Debugging with Tracing

Use the X-Trace-Level header for inference debugging:

# Brick-level: token-by-token timing
curl -H "X-Trace-Level: brick" -X POST http://localhost:8080/v1/chat/completions ...

# Step-level: forward pass steps (embed, attention, mlp, lm_head)
curl -H "X-Trace-Level: step" -X POST http://localhost:8080/v1/chat/completions ...

# Layer-level: per-layer timing breakdown
curl -H "X-Trace-Level: layer" -X POST http://localhost:8080/v1/chat/completions ...

Response includes trace data:

{
  "choices": [...],
  "brick_trace": {
    "level": "brick",
    "operations": 5,
    "total_time_us": 12345,
    "breakdown": [{"name": "token_0", "time_us": 2469}, ...]
  }
}
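
To post-process a trace like the one above, the response body can be parsed as plain JSON. A minimal sketch assuming serde_json and the field names shown in the example response:

use serde_json::Value;

/// Print per-operation timings from a traced response body
/// (field names taken from the example response above).
fn print_brick_trace(body: &str) -> Result<(), serde_json::Error> {
    let v: Value = serde_json::from_str(body)?;
    if let Some(breakdown) = v["brick_trace"]["breakdown"].as_array() {
        for op in breakdown {
            println!(
                "{}: {}us",
                op["name"].as_str().unwrap_or("?"),
                op["time_us"].as_u64().unwrap_or(0)
            );
        }
    }
    Ok(())
}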

Installation

cargo install realizar                # From crates.io
cargo install --path .                # From source

Feature Flags

  • default = server + cli + gpu
  • cuda = NVIDIA CUDA support (pure Rust PTX, no nvcc)
  • minimal = Core inference only
  • bench-http = External server benchmarking

Architecture

realizar/
├── src/
│   ├── gguf.rs            # GGUF parser + transformer inference
│   ├── safetensors.rs     # SafeTensors parser
│   ├── apr.rs             # APR format (native)
│   ├── quantize.rs        # Q4_K, Q8_0 dequantization
│   ├── layers.rs          # Transformer layers
│   ├── tokenizer.rs       # BPE, SentencePiece
│   ├── chat_template.rs   # Chat templates (ChatML, LLaMA2, Mistral, etc.)
│   ├── api.rs             # REST endpoints
│   └── bench_preflight.rs # Deterministic benchmarking
└── benches/
    ├── apr_real.rs        # APR benchmarks
    ├── gguf_real.rs       # GGUF benchmarks
    ├── comparative.rs     # Format comparison
    └── external_matrix.rs # External server benchmarks
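
As background for the RoPE support listed under Features (implemented in the transformer layers above), the sketch below applies rotary position embeddings to one query/key vector using the standard pairwise rotation with base 10000. It illustrates the math only and is not realizar's layers.rs implementation.

/// Apply rotary position embeddings in place to the first `dim` elements of `x`
/// (must be even), for a token at position `pos`. Standard RoPE, base 10000.
fn apply_rope(x: &mut [f32], pos: usize, dim: usize) {
    assert!(dim % 2 == 0 && x.len() >= dim);
    for i in 0..dim / 2 {
        // Rotation angle for this dimension pair.
        let theta = (pos as f32) * 10000f32.powf(-2.0 * i as f32 / dim as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}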

Contributing

See CONTRIBUTING.md for development setup, testing, and code quality requirements.

License

MIT - Pragmatic AI Labs