## Highlights
**Performance milestone: Candle parity achieved!** Realizar now matches the performance of HuggingFace's Candle for GGUF Q4_0 inference.
## Added
- **Q4_0×Q8_0 Integer SIMD Matmul** - 2x inference speedup for GGUF Q4_0 models (see the kernel sketch after this list)
  - Quantize activations to Q8_0 format for integer multiply-accumulate
  - Use `_mm256_maddubs_epi16` for AVX2 SIMD acceleration
  - Sign trick algorithm matching llama.cpp's approach
  - 2-block loop unrolling with prefetch hints
- **APR SIMD Matmul** - 5-7x inference speedup for APR transformer models (dispatch pattern sketched below)
  - Trueno Matrix/Vector SIMD acceleration
  - Scalar fallback for edge cases
  - APR now achieves near-GGUF parity: the gap to GGUF narrowed from 6-10x to 1.4-6x
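For illustration, here is a minimal sketch of the Q4_0×Q8_0 integer kernel described above: activations are quantized to Q8_0, and the per-block dot product uses `_mm256_maddubs_epi16` with the llama.cpp-style sign trick. The struct layouts, names, and f32 scales are assumptions for readability (GGUF stores f16 scales), and the 2-block unrolling and prefetch hints are omitted; this is a sketch of the technique, not Realizar's actual code.

```rust
use std::arch::x86_64::*;

/// Illustrative block layouts mirroring GGUF Q4_0 / Q8_0 (names assumed).
struct BlockQ40 { d: f32, qs: [u8; 16] } // 32 weights, two 4-bit nibbles per byte
struct BlockQ80 { d: f32, qs: [i8; 32] } // 32 quantized activations

/// Quantize 32 f32 activations into one Q8_0 block (scalar sketch).
fn quantize_q8_0(src: &[f32; 32]) -> BlockQ80 {
    let amax = src.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    let id = if d > 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; 32];
    for (q, v) in qs.iter_mut().zip(src) {
        *q = (v * id).round() as i8; // saturating cast keeps values in [-127, 127]
    }
    BlockQ80 { d, qs }
}

/// Q4_0 x Q8_0 dot product, AVX2 path.
/// SAFETY: caller must verify AVX2 + FMA support at runtime.
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn dot_q4_0_q8_0(xs: &[BlockQ40], ys: &[BlockQ80]) -> f32 {
    let mut acc = _mm256_setzero_ps();
    for (x, y) in xs.iter().zip(ys) {
        // Unpack 32 nibbles into 32 bytes, then recenter from [0, 15] to [-8, 7].
        let packed = _mm_loadu_si128(x.qs.as_ptr() as *const __m128i);
        let bytes = _mm256_set_m128i(_mm_srli_epi16::<4>(packed), packed);
        let q4 = _mm256_and_si256(bytes, _mm256_set1_epi8(0x0F));
        let w = _mm256_sub_epi8(q4, _mm256_set1_epi8(8));

        let a = _mm256_loadu_si256(y.qs.as_ptr() as *const __m256i);

        // Sign trick: maddubs multiplies (unsigned, signed) bytes, so move the
        // weight sign onto the activation: |w| * (sign(w) * a) == w * a.
        // |w| <= 8 and |a| <= 127, so the i16 pair sums cannot saturate.
        let w_abs = _mm256_sign_epi8(w, w);
        let a_sgn = _mm256_sign_epi8(a, w);
        let dot16 = _mm256_maddubs_epi16(w_abs, a_sgn);             // 16 x i16
        let dot32 = _mm256_madd_epi16(dot16, _mm256_set1_epi16(1)); // 8 x i32

        // Apply both block scales to the integer dot product.
        let scale = _mm256_set1_ps(x.d * y.d);
        acc = _mm256_fmadd_ps(scale, _mm256_cvtepi32_ps(dot32), acc);
    }
    // Horizontal sum of the eight accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}
```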
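And a sketch of the SIMD-with-scalar-fallback dispatch pattern the APR matmul entry describes. Trueno's actual Matrix/Vector API is not reproduced here; the function names and the plain row-major f32 matmul are hypothetical, chosen only to show runtime feature detection with a portable fallback for edge cases.

```rust
/// out[m x n] = a[m x k] * b[k x n], row-major. Hypothetical names; this
/// illustrates the dispatch pattern, not Realizar's or Trueno's API.
pub fn matmul(out: &mut [f32], a: &[f32], b: &[f32], m: usize, k: usize, n: usize) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2")
            && is_x86_feature_detected!("fma")
            && n % 8 == 0
        {
            // SAFETY: required CPU features were just verified at runtime.
            unsafe { matmul_avx2(out, a, b, m, k, n) };
            return;
        }
    }
    matmul_scalar(out, a, b, m, k, n); // edge cases: odd shapes, non-x86 targets
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn matmul_avx2(out: &mut [f32], a: &[f32], b: &[f32], m: usize, k: usize, n: usize) {
    use std::arch::x86_64::*;
    for i in 0..m {
        for j in (0..n).step_by(8) {
            // Eight output columns at once: broadcast a[i][p], FMA with b[p][j..j+8].
            let mut acc = _mm256_setzero_ps();
            for p in 0..k {
                let av = _mm256_set1_ps(a[i * k + p]);
                let bv = _mm256_loadu_ps(b.as_ptr().add(p * n + j));
                acc = _mm256_fmadd_ps(av, bv, acc);
            }
            _mm256_storeu_ps(out.as_mut_ptr().add(i * n + j), acc);
        }
    }
}

fn matmul_scalar(out: &mut [f32], a: &[f32], b: &[f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            out[i * n + j] = (0..k).map(|p| a[i * k + p] * b[p * n + j]).sum();
        }
    }
}
```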
## Changed
- **Aprender Dependency** - Updated from 0.14 to 0.20.1
  - Latest TransformerLM and MoE support
  - Improved APR format handling
## Performance
| Metric | Before | After |
|---|---|---|
| GGUF Q4_0 | 4.2-7.1 tok/s | 8.4-11.9 tok/s (2x) |
| APR tiny_64x1 | 500 µs | 66 µs (7.5x) |
| APR medium_256x4 | 48 ms | 9.0 ms (5.3x) |
| Throughput vs Candle | 55-72% | 91-120% |
| Throughput vs llama.cpp | 10-16% | 20-26% |
## Quality
- All 806 tests pass (with the `aprender-serve` feature)
- All falsification tests pass
- Clippy: 0 warnings