
v0.3.2: Q4_0×Q8_0 Integer SIMD + APR Performance

Released by @noahgift on 29 Dec, 23:39

Highlights

Performance milestone: Candle parity achieved! Realizar now matches the performance of HuggingFace's Candle for GGUF Q4_0 inference.

Added

  • Q4_0×Q8_0 Integer SIMD Matmul - 2x inference speedup for GGUF Q4_0 models (see the scalar sketch after this list)

    • Quantize activations to Q8_0 format for integer multiply-accumulate
    • Use _mm256_maddubs_epi16 for AVX2 SIMD acceleration
    • Sign trick algorithm matching llama.cpp's approach
    • 2-block loop unrolling with prefetch hints
  • APR SIMD Matmul - 5-7x inference speedup for APR transformer models (see the dispatch sketch after this list)

    • Trueno Matrix/Vector SIMD acceleration
    • Scalar fallback for edge cases
    • APR now achieves near-GGUF parity (within 1.4-6x of GGUF throughput, vs 6-10x before)
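
For reference, here is a minimal scalar sketch of the per-block arithmetic that the AVX2 kernel vectorizes with `_mm256_maddubs_epi16`. The block layouts follow GGUF's Q4_0/Q8_0 conventions (32 elements per block, 4-bit weights stored offset by 8), but the struct and function names are illustrative and the f16 block scales are simplified to f32:

```rust
/// Q4_0 weight block: 32 4-bit weights (stored offset by 8) plus a scale.
/// Mirrors GGUF's block_q4_0; the real scale is f16, simplified here to f32.
struct BlockQ40 {
    d: f32,
    qs: [u8; 16], // two 4-bit weights per byte
}

/// Q8_0 activation block: 32 signed 8-bit values plus a scale.
struct BlockQ80 {
    d: f32,
    qs: [i8; 32],
}

/// Quantize 32 f32 activations to Q8_0 so the inner loop can use
/// integer multiply-accumulate instead of f32 math.
fn quantize_q8_0(x: &[f32; 32]) -> BlockQ80 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    let inv_d = if d > 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; 32];
    for (q, &v) in qs.iter_mut().zip(x.iter()) {
        *q = (v * inv_d).round() as i8;
    }
    BlockQ80 { d, qs }
}

/// Integer dot product of one Q4_0 block with one Q8_0 block.
/// `_mm256_maddubs_epi16` multiplies unsigned-by-signed bytes, which is
/// why the SIMD kernel needs llama.cpp's sign trick; the sum it computes
/// is exactly this one.
fn dot_q4_0_q8_0(w: &BlockQ40, a: &BlockQ80) -> f32 {
    let mut acc: i32 = 0;
    for j in 0..16 {
        // Low nibbles hold elements 0..15, high nibbles elements 16..31.
        let lo = (w.qs[j] & 0x0F) as i32 - 8;
        let hi = (w.qs[j] >> 4) as i32 - 8;
        acc += lo * a.qs[j] as i32 + hi * a.qs[j + 16] as i32;
    }
    acc as f32 * w.d * a.d
}
```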

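The release notes don't show Trueno's Matrix/Vector API, so the following is only a generic sketch of the pattern the APR bullet describes: a runtime-dispatched AVX2 dot product with a scalar fallback for edge cases and non-x86 targets. All names here are illustrative, not Realizar's actual symbols:

```rust
/// Dot product with runtime SIMD dispatch and a scalar fallback.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: guarded by the runtime feature checks above.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}

/// Portable fallback, also used for targets without AVX2.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len());
    let mut acc = _mm256_setzero_ps();
    // Main loop: 8 f32 lanes per iteration with fused multiply-add.
    let chunks = n / 8;
    for i in 0..chunks {
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    // Horizontal reduction of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 8.
    for i in chunks * 8..n {
        sum += a[i] * b[i];
    }
    sum
}
```
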
Changed

  • Aprender Dependency - Updated from 0.14 to 0.20.1
    • Latest TransformerLM and MoE support
    • Improved APR format handling
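
In a downstream project this amounts to a one-line version bump (a hypothetical manifest excerpt; the actual Cargo.toml isn't shown in these notes):

```toml
[dependencies]
aprender = "0.20.1"  # was "0.14"
```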

Performance

| Metric | Before | After |
| --- | --- | --- |
| GGUF Q4_0 | 4.2-7.1 tok/s | 8.4-11.9 tok/s (2x) |
| APR tiny_64x1 | 500 µs | 66 µs (7.5x) |
| APR medium_256x4 | 48 ms | 9.0 ms (5.3x) |
| vs Candle | 55-72% | 91-120% |
| vs llama.cpp | 10-16% | 20-26% |

The "vs Candle" and "vs llama.cpp" rows give Realizar's throughput as a percentage of each engine's.

Quality

  • All 806 tests pass (with the aprender-serve feature)
  • All falsification tests pass
  • Clippy: 0 warnings
