ML inference from scratch in Rust. GGUF/SafeTensors parsing, quantization (Q4_K, Q8_0), transformer inference. SIMD/GPU via Trueno.
cargo install realizar
realizar serve --demo --port 8080
curl -X POST http://localhost:8080/generate -d '{"prompt": "Hello", "max_tokens": 10}'

| Category | Details |
|---|---|
| Formats | GGUF, SafeTensors, APR (native) |
| Quantization | Q4_0, Q8_0, Q4_K, Q5_K, Q6_K |
| Inference | Transformer, RoPE, KV cache, Flash Attention |
| Chat Templates | ChatML, LLaMA2, Mistral, Phi, Alpaca (auto-detect) |
| API | REST, streaming, Prometheus metrics |
| GPU | CUDA via trueno-gpu (pure Rust PTX) |
| Quality | 2,400+ tests, 95% function coverage |

Classical ML inference (APR, CPU):

| Model | Parameters | Latency | Throughput |
|---|---|---|---|
| Iris | 131 | 103ns | 9.6M inferences/sec |
| MNIST | 103K | 73µs | 13.6K inferences/sec |
| Large NN | 1M | 410µs | 2.4K inferences/sec |

LLM inference (Phi-2 2.7B Q4_K_M):

| Model | Size | Runtime | Backend | Throughput |
|---|---|---|---|---|
| Phi-2 Q4_K_M | 2.7B | realizar | RTX 4090 (CUDA) | 276 tok/s |
| Phi-2 Q4_K_M | 2.7B | llama.cpp | RTX 4090 (CUDA) | 256 tok/s |
| Phi-2 Q4_K_M | 2.7B | Ollama | RTX 4090 (CUDA) | 228 tok/s |
| Phi-2 Q4_K_M | 2.7B | realizar | CPU (AVX2) | ~15 tok/s |
realizar achieves 8-21% higher throughput than llama.cpp and Ollama via pure Rust CUDA PTX generation (no LLVM, no nvcc).
Same model (Phi-2 2.7B Q4_K) across ALL runtimes and formats:
GGUF format (same model):

| Runtime | Backend | p50 Latency | Throughput | Command |
|---|---|---|---|---|
| realizar | CUDA | ~3.6ms | 276 tok/s | --features cuda |
| llama.cpp | CUDA | 162ms | 256 tok/s | llama-server -ngl 99 |
| Ollama | CUDA | ~120ms | 228 tok/s | ollama serve |
| realizar | CPU | ~500ms | ~15 tok/s | cargo bench gguf_real |
| llama.cpp | CPU | ~3000ms | ~15 tok/s | llama-server -ngl 0 |

APR format (classical ML):

| Runtime | Backend | p50 Latency | Throughput | Command |
|---|---|---|---|---|
| realizar | CPU | 103ns-410µs | 2.4K-9.6M/s | cargo bench apr_real |
Note: realizar is a pure Rust implementation with CUDA support via trueno-gpu. With GPU acceleration, realizar achieves 8-21% faster inference than llama.cpp/Ollama while maintaining a pure Rust codebase (no C/C++ dependencies, no LLVM, no nvcc).
Run the full matrix yourself:
# 1. Start external servers
llama-server -m phi-2-q4_k_m.gguf --port 8082 -ngl 99 # GPU
llama-server -m phi-2-q4_k_m.gguf --port 8083 -ngl 0 # CPU
ollama serve &                                        # Start Ollama in the background (or in a separate terminal)
ollama pull phi2:2.7b
# 2. Run full matrix benchmark
./scripts/bench-matrix.sh --full
# 3. Run internal APR vs GGUF comparison (same model)
cargo bench --bench comparative
# 4. Convert GGUF to APR and compare
realizar convert model.gguf --output model.apr   # Coming soon

What This Matrix Measures

Think of it like comparing cars:

| Runtime | Which "engine" runs your model? |
|---|---|
| realizar | Our pure Rust engine (this project) |
| llama.cpp | Popular C++ engine (industry standard) |
| Ollama | User-friendly wrapper around llama.cpp |

| Backend | Which "fuel" powers the engine? |
|---|---|
| CPU | Regular processor (slower, always works) |
| CUDA | NVIDIA GPU (fastest, needs GPU) |
| WGPU | Cross-platform GPU (good balance) |

| Format | Which "fuel type" for your model? |
|---|---|
| GGUF | Quantized LLMs (smaller, fast) |
| APR | Our native format (fastest for small ML) |
| SafeTensors | HuggingFace format (full precision) |

Matrix Result = Runtime × Backend × Format

Example: "llama.cpp + CUDA + GGUF" = 256 tok/s on RTX 4090
         "realizar + CPU + APR"   = 9.6M inf/s for tiny models
Why This Matters:
- Small Models (Iris, MNIST): Use APR format → nanosecond latency
- Large Models (LLMs): Use GGUF format → GPU acceleration essential
- Production: Match your hardware to the right runtime/backend combo
| Server | Backend | Mean Latency (ms) | Throughput (tok/s) |
|---|---|---|---|
| realizar | CUDA | 3.6 | 276 |
| llama.cpp | CUDA | 162 | 256 |
| Ollama | CUDA | 120 | 228 |
Methodology: CV-based stopping per Hoefler & Belli SC'15. Hardware: RTX 4090; model: Phi-2 2.7B Q4_K_M.
# Internal benchmarks (no external servers required)
cargo bench --bench apr_real # APR format (classical ML)
cargo bench --bench gguf_real # GGUF format (transformers)
cargo bench --bench comparative    # APR vs GGUF comparison

Step 1: Start External Servers
# Terminal 1: llama.cpp with GPU (full GPU offload)
llama-server -m /path/to/phi-2-q4_k_m.gguf --host 127.0.0.1 --port 8082 -ngl 99
# Terminal 2: llama.cpp with CPU only
llama-server -m /path/to/phi-2-q4_k_m.gguf --host 127.0.0.1 --port 8083 -ngl 0
# Terminal 3: Ollama (uses GPU by default)
ollama serve # Default port 11434
ollama pull phi2:2.7b          # Pull model first

Step 2: Run the Benchmark Matrix
# Full benchmark matrix (CV-based stopping, statistically significant)
./scripts/bench-matrix.sh --full
# Quick benchmark (fewer iterations)
./scripts/bench-matrix.sh --quick
# Programmatic benchmark via Rust
cargo bench --bench external_matrix --features bench-http

Step 3: View Results
Results are saved to benches/comparative/results/:
- benchmark_matrix_TIMESTAMP.json: raw data
- benchmark_matrix_TIMESTAMP.md: Markdown table
| What to Benchmark | Command |
|---|---|
| realizar (CPU) | cargo bench --bench apr_real |
| realizar (WGPU) | cargo bench --bench gguf_real --features gpu |
| llama.cpp (CPU) | Start server with -ngl 0, run ./scripts/bench-matrix.sh |
| llama.cpp (CUDA) | Start server with -ngl 99, run ./scripts/bench-matrix.sh |
| Ollama (GPU) | Start ollama serve, run ./scripts/bench-matrix.sh |
All benchmarks follow Hoefler & Belli SC'15:
- CV-based stopping: Iterate until coefficient of variation < 10%
- Warmup: 2-10 iterations discarded before measurement
- Metrics: p50, p99 latency, throughput (tok/s), cold start
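The stopping rule itself is small. Below is a minimal sketch of the idea in Rust (illustrative only, not the project's bench_preflight implementation):

```rust
/// Illustrative CV-based stopping loop: keep measuring until the
/// coefficient of variation (stddev / mean) of the latency samples
/// drops below 10%, or a hard iteration cap is hit.
fn run_until_stable<F: FnMut() -> f64>(mut measure: F) -> Vec<f64> {
    const WARMUP: usize = 3;
    const MIN_SAMPLES: usize = 5;
    const MAX_SAMPLES: usize = 30;
    const CV_THRESHOLD: f64 = 0.10;

    // Discard warmup iterations before recording anything.
    for _ in 0..WARMUP {
        measure();
    }

    let mut samples = Vec::new();
    while samples.len() < MAX_SAMPLES {
        samples.push(measure());
        if samples.len() >= MIN_SAMPLES {
            let mean = samples.iter().sum::<f64>() / samples.len() as f64;
            let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>()
                / (samples.len() - 1) as f64;
            let cv = var.sqrt() / mean;
            if cv < CV_THRESHOLD {
                break; // Latency distribution is stable enough to report.
            }
        }
    }
    samples
}
```

The real harness additionally records p50/p99 latency, throughput, and cold start, per the list above.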
Example output:

┌────────────────────────────────────┐
│   Realizar Benchmark Matrix v1.1   │
└────────────────────────────────────┘
=== llama.cpp (GPU) ===
[10/30] Latency: 114.2ms | TPS: 477.1
CV stable at 0.048 after 10 iterations
=== Ollama (GPU) ===
[12/30] Latency: 123.4ms | TPS: 258.6
CV stable at 0.089 after 12 iterations
| Runtime | Backend | p50 Latency | p99 Latency | Throughput |
|---------|---------|-------------|-------------|------------|
| llama-cpp | gpu | 114.2ms | 161.0ms | 477.1 tok/s |
| ollama | gpu | 123.4ms | 145.2ms | 258.6 tok/s |
See docs/benchmarking-other-servers.md for full methodology.
Format LLM conversations for different model families with automatic template detection:
use realizar::chat_template::{
auto_detect_template, ChatMessage, ChatTemplateEngine
};
// Auto-detect template from model name
let template = auto_detect_template("Qwen2-0.5B-Instruct");
let messages = vec![
ChatMessage::system("You are a helpful assistant."),
ChatMessage::user("Hello!"),
];
let formatted = template.format_conversation(&messages)?;

Supported Formats:
| Format | Models | System Prompt |
|---|---|---|
| ChatML | Qwen2, Yi, OpenHermes | Yes |
| Llama2 | TinyLlama, Vicuna, LLaMA 2 | Yes |
| Mistral | Mistral-7B, Mixtral | No |
| Phi | Phi-2, Phi-3 | Yes |
| Alpaca | Alpaca, Guanaco | Yes |
| Raw | Fallback | Passthrough |
| Custom | Any (Jinja2) | Configurable |
See examples/chat_template.rs for complete usage.
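For reference, ChatML-style templates (Qwen2, Yi, OpenHermes) wrap each turn in `<|im_start|>`/`<|im_end|>` markers, so the two-message conversation in the snippet above formats roughly like this (illustrative output, not captured from realizar):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```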
# All examples
cargo run --example inference # Basic inference demo
cargo run --example api_server # HTTP server demo
cargo run --example chat_template # Chat template formatting
cargo run --example gguf_loading # Load GGUF models
cargo run --example apr_loading # Load APR models
cargo run --example tokenization # Tokenizer demo
cargo run --example safetensors_loading # SafeTensors demo
cargo run --example observability_demo # Metrics demo
cargo run --example model_cache          # Caching demo

realizar serve --demo --port 8080        # Demo server
curl http://localhost:8080/health # Health check
curl http://localhost:8080/metrics       # Prometheus

# Chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
# Streaming
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"Hello!"}],"stream":true}'Use the X-Trace-Level header for inference debugging:
# Brick-level: token-by-token timing
curl -H "X-Trace-Level: brick" -X POST http://localhost:8080/v1/chat/completions ...
# Step-level: forward pass steps (embed, attention, mlp, lm_head)
curl -H "X-Trace-Level: step" -X POST http://localhost:8080/v1/chat/completions ...
# Layer-level: per-layer timing breakdown
curl -H "X-Trace-Level: layer" -X POST http://localhost:8080/v1/chat/completions ...Response includes trace data:
{
"choices": [...],
"brick_trace": {
"level": "brick",
"operations": 5,
"total_time_us": 12345,
"breakdown": [{"name": "token_0", "time_us": 2469}, ...]
}
}
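For example, the per-token breakdown can be pulled out of the response with jq (field names taken from the example response above):

```bash
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Trace-Level: brick" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}],"max_tokens":10}' \
  | jq '.brick_trace.breakdown[] | "\(.name): \(.time_us)us"'
```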
cargo install realizar     # From crates.io
cargo install --path .     # From source

Feature flags:
- default = server + cli + gpu
- cuda = NVIDIA CUDA support (pure Rust PTX, no nvcc)
- minimal = core inference only
- bench-http = external server benchmarking
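Feature selection uses standard Cargo flags; a couple of examples using the feature names listed above:

```bash
# Core inference only, no server/CLI/GPU
cargo install realizar --no-default-features --features minimal

# Add NVIDIA CUDA support (pure Rust PTX, no nvcc required)
cargo install realizar --features cuda
```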
realizar/
├── src/
│   ├── gguf.rs              # GGUF parser + transformer inference
│   ├── safetensors.rs       # SafeTensors parser
│   ├── apr.rs               # APR format (native)
│   ├── quantize.rs          # Q4_K, Q8_0 dequantization
│   ├── layers.rs            # Transformer layers
│   ├── tokenizer.rs         # BPE, SentencePiece
│   ├── chat_template.rs     # Chat templates (ChatML, LLaMA2, Mistral, etc.)
│   ├── api.rs               # REST endpoints
│   └── bench_preflight.rs   # Deterministic benchmarking
└── benches/
    ├── apr_real.rs          # APR benchmarks
    ├── gguf_real.rs         # GGUF benchmarks
    ├── comparative.rs       # Format comparison
    └── external_matrix.rs   # External server benchmarks
See CONTRIBUTING.md for development setup, testing, and code quality requirements.
MIT - Pragmatic AI Labs