This directory contains examples demonstrating Trueno's high-performance compute capabilities and comparisons with NumPy/PyTorch.

| Example | Description | Command |
|---|---|---|
| `quickstart.rs` | ⭐ Start here! All core features in one file | `cargo run --example quickstart` |
| `performance_demo.rs` | Compare Scalar vs SSE2/AVX backends | `cargo run --release --example performance_demo` |
| `matrix_operations.rs` | Matrix multiplication and transpose | `cargo run --release --example matrix_operations` |
| `activation_functions.rs` | Neural network activations (ReLU, Sigmoid, etc.) | `cargo run --release --example activation_functions` |
| `backend_detection.rs` | Auto-detection of available SIMD backends | `cargo run --release --example backend_detection` |
| `ml_similarity.rs` | Cosine similarity for ML applications | `cargo run --release --example ml_similarity` |
| `symmetric_eigen.rs` | Eigendecomposition for PCA/spectral analysis | `cargo run --release --example symmetric_eigen` |
| `hash_demo.rs` | SIMD-optimized hashing for KV stores | `cargo run --release --example hash_demo` |
| `gpu_batch_demo.rs` | GPU batch operations (requires `gpu` feature) | `cargo run --release --features gpu --example gpu_batch_demo` |
| `gpu_monitor_demo.rs` | GPU monitoring and metrics | `cargo run --release --features gpu --example gpu_monitor_demo` |
| `gpu_tiled_reduction.rs` | GPU tiled reduction operations | `cargo run --release --features gpu --example gpu_tiled_reduction` |
| `tiled_reduction_demo.rs` | TensorView and PartitionView demo | `cargo run --release --example tiled_reduction_demo` |
| `perf_tui.rs` | Interactive TUI performance dashboard | `cargo run --release --example perf_tui` |
| `regression_test.rs` | Numerical regression testing | `cargo run --release --example regression_test` |
| `vocab_bench.rs` | Vocabulary processing benchmark | `cargo run --release --example vocab_bench` |
| `profile_vocab.rs` | Vocabulary profiling | `cargo run --release --example profile_vocab` |
| `execution_graph.rs` | Execution path graph demo | `cargo run --release --features execution-graph --example execution_graph` |
| `ml_tuner_demo.rs` | ML-based kernel selection | `cargo run --release --features ml-tuner --example ml_tuner_demo` |
| `ml_tuner_evolution.rs` | ML tuner evolution demo | `cargo run --release --features ml-tuner --example ml_tuner_evolution` |
| `model_tracing.rs` | Model-level inference tracing | `cargo run --release --example model_tracing` |
| `tiling_demo.rs` | Tiling compute blocks | `cargo run --release --example tiling_demo` |
| `tile_profiler_demo.rs` | Tile profiler demo | `cargo run --release --example tile_profiler_demo` |
| `blis_benchmark.rs` | BLIS-style GEMM benchmark | `cargo run --release --example blis_benchmark` |
| `simd_comparison.rs` | SIMD backend comparison | `cargo run --release --example simd_comparison` |
| `simd_softmax_quant.rs` | SIMD softmax + quantization | `cargo run --release --example simd_softmax_quant` |
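
The `backend_detection` example listed above reports which SIMD backends are available. As a rough standalone illustration of the idea (not Trueno's actual detection code), the sketch below uses the standard library's `is_x86_feature_detected!` macro. The function name `best_simd_level` and the labels it returns are hypothetical.

```rust
/// Hypothetical helper (not part of Trueno's API): name the widest x86 SIMD
/// level the current CPU reports, falling back to "scalar" elsewhere.
fn best_simd_level() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Check from widest to narrowest so the first hit is the best one.
        if std::arch::is_x86_feature_detected!("avx512f") {
            return "avx512";
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::arch::is_x86_feature_detected!("avx") {
            return "avx";
        }
        if std::arch::is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }
    // Non-x86 targets, or x86 CPUs without any of the features above.
    "scalar"
}

fn main() {
    println!("best SIMD level: {}", best_simd_level());
}
```

The check happens at runtime, so one binary can pick a fast path on newer CPUs and still run on older ones.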

| Example | Description | Command |
|---|---|---|
| `benchmark_matrix_suite.rs` | Matrix operation benchmarks | `cargo run --release --example benchmark_matrix_suite` |
| `benchmark_matvec.rs` | Matrix-vector multiplication | `cargo run --release --example benchmark_matvec` |
| `benchmark_matvec_parallel.rs` | Parallel matrix-vector ops | `cargo run --release --example benchmark_matvec_parallel` |
| `benchmark_parallel.rs` | Parallel computation benchmarks | `cargo run --release --example benchmark_parallel` |

| Example | Description | Command |
|---|---|---|
| `ptx_quickstart` | ⭐ Start here! Basic PTX code generation | `cargo run -p trueno-gpu --example ptx_quickstart` |
| `gemm_kernel` | GEMM kernel generation (naive/tiled) | `cargo run -p trueno-gpu --example gemm_kernel` |
| `cuda_monitor` | GPU monitoring and metrics | `cargo run -p trueno-gpu --example cuda_monitor` |
| `flash_attention_cuda` | Flash Attention implementation | `cargo run -p trueno-gpu --example flash_attention_cuda` |
| `simple_attention_cuda` | Basic multi-head attention | `cargo run -p trueno-gpu --example simple_attention_cuda` |
| `q4k_gemm` | Quantized GEMM (Q4_K format) | `cargo run -p trueno-gpu --example q4k_gemm` |
| `q5k_q6k_gemm` | Q5_K/Q6_K quantized GEMM (PARITY-116/117) | `cargo run -p trueno-gpu --example q5k_q6k_gemm` |
| `register_allocation` | PTX register allocation demo | `cargo run -p trueno-gpu --example register_allocation` |
| `gpu_pixels_render` | GPU pixel rendering | `cargo run -p trueno-gpu --example gpu_pixels_render` |
| `dump_ptx` | Dump raw PTX output | `cargo run -p trueno-gpu --example dump_ptx` |
| `satd_kernels` | SATD (video codec) kernels | `cargo run -p trueno-gpu --example satd_kernels` |
| `lz4_compression` | 🗜️ LZ4 compression kernel | `cargo run -p trueno-gpu --example lz4_compression` |
| `lz4_file_compress` | 📦 LZ4 file compression CLI | `cargo run -p trueno-gpu --example lz4_file_compress -- bench` |
| `ptx_optimize` | PTX optimization passes demo | `cargo run -p trueno-gpu --example ptx_optimize` |
| `bench_kernel_gen` | Kernel generation benchmarks | `cargo run -p trueno-gpu --example bench_kernel_gen` |

Note: PTX generation examples work without a GPU. Runtime examples (cuda_monitor, flash_attention_cuda) require an NVIDIA GPU with CUDA drivers.

| Example | Description | Command |
|---|---|---|
| `dot_product_comparison.py` | ⚡ Dot product benchmark | `uv run examples/dot_product_comparison.py` |
| `matrix_multiply_comparison.py` | 🔢 Matrix multiplication benchmark | `uv run examples/matrix_multiply_comparison.py` |
| `activation_comparison.py` | 🧠 Activation functions benchmark | `uv run examples/activation_comparison.py` |

- Install UV (if not already installed):

      curl -LsSf https://astral.sh/uv/install.sh | sh

- Run any comparison example (UV handles dependencies automatically):

      uv run examples/dot_product_comparison.py
      uv run examples/matrix_multiply_comparison.py
      uv run examples/activation_comparison.py

Alternatively, use the benchmarks environment:

    cd benchmarks
    uv run ../examples/dot_product_comparison.py

To run the Rust demos:

    cargo run --release --example performance_demo
    cargo run --release --example matrix_operations

Key Insights (dot_product_comparison.py):
- Demonstrates compute-intensive operations where SIMD excels
- Shows NumPy vs PyTorch performance characteristics
- Highlights Trueno's 1.6x advantage over NumPy (11.9x over scalar)

Expected Output:

    Size       NumPy (μs)       PyTorch (μs)     Winner   Speedup
    --------------------------------------------------------------------------------
    100        0.82 ± 0.15      1.85 ± 0.22      NumPy    2.26x
    1,000      3.21 ± 0.18      6.45 ± 0.31      NumPy    2.01x
    10,000     25.67 ± 1.23     58.32 ± 2.45     NumPy    2.27x

Trueno Context:
- Trueno AVX-512: 11.9x faster than scalar
- Trueno AVX-512: 1.6x faster than NumPy
- Trueno AVX-512: 2.8x faster than PyTorch
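
To make the scalar-versus-SIMD gap above more concrete, here is a minimal sketch in plain Rust. It is not Trueno's implementation: `dot_scalar` and `dot_lanes` are hypothetical helpers. The second keeps eight independent accumulators, a layout that compilers can typically map onto vector lanes, whereas a single running sum forms one long dependency chain.

```rust
/// Plain scalar dot product: a single accumulator, one dependency chain.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Dot product with 8 independent accumulators (hypothetical helper).
/// Each accumulator plays the role of one SIMD lane.
fn dot_lanes(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    let mut a_chunks = a.chunks_exact(LANES);
    let mut b_chunks = b.chunks_exact(LANES);
    // Process full chunks of 8 elements, lane by lane.
    for (ca, cb) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        for i in 0..LANES {
            acc[i] += ca[i] * cb[i];
        }
    }
    // Handle the leftover elements that do not fill a whole chunk.
    let tail: f32 = a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder())
        .map(|(x, y)| x * y)
        .sum();
    acc.iter().sum::<f32>() + tail
}

fn main() {
    let a: Vec<f32> = (0..19).map(|i| i as f32).collect();
    let b = vec![2.0f32; 19];
    println!("scalar = {}, lanes = {}", dot_scalar(&a, &b), dot_lanes(&a, &b));
}
```

Because floating-point addition is not associative, the two versions can differ in the last bits on arbitrary data; they agree exactly here only because the inputs are small integers.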

Key Insights (matrix_multiply_comparison.py):
- Shows O(n³) complexity scaling
- Demonstrates when GPU acceleration becomes effective
- Highlights optimized BLAS libraries in NumPy

Expected Output:

    Size        NumPy Time      PyTorch Time     Winner   Speedup
    ------------------------------------------------------------------------------------------
    64×64       59.87 μs        125.34 μs        NumPy    2.09x
    128×128     434.23 μs       678.45 μs        NumPy    1.56x
    256×256     2.67 ms         3.45 ms          NumPy    1.29x
    512×512     19.82 ms        25.67 ms         NumPy    1.29x

Trueno Context:
- SIMD backend: ~7x faster than naive O(n³) for 128×128
- GPU backend: 2-10x faster than scalar for 500×500+
- Automatic backend selection based on matrix size
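
For reference, the "naive O(n³)" baseline mentioned above can be sketched as three nested loops, next to a tiled variant that does the same arithmetic in cache-sized blocks. Both functions are illustrative assumptions for square row-major matrices, not Trueno's kernels, and the tile size is a free parameter.

```rust
/// Naive row-major n×n matrix multiply: n³ multiply-adds.
fn matmul_naive(a: &[f32], b: &[f32], n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; n * n];
    for i in 0..n {
        for j in 0..n {
            let mut sum = 0.0f32;
            for k in 0..n {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    c
}

/// Same arithmetic, visited tile by tile so each block of `b` is reused
/// while it is still likely to be in cache (hypothetical sketch).
fn matmul_tiled(a: &[f32], b: &[f32], n: usize, tile: usize) -> Vec<f32> {
    assert!(tile > 0, "tile size must be positive");
    let mut c = vec![0.0f32; n * n];
    for i0 in (0..n).step_by(tile) {
        for k0 in (0..n).step_by(tile) {
            for j0 in (0..n).step_by(tile) {
                // Multiply one tile of `a` by one tile of `b`.
                for i in i0..(i0 + tile).min(n) {
                    for k in k0..(k0 + tile).min(n) {
                        let aik = a[i * n + k];
                        for j in j0..(j0 + tile).min(n) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    c
}

fn main() {
    let n = 4;
    let a: Vec<f32> = (0..n * n).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..n * n).map(|i| (i % 3) as f32).collect();
    println!("results match: {}", matmul_naive(&a, &b, n) == matmul_tiled(&a, &b, n, 2));
}
```

The innermost loop of the tiled version walks contiguous memory in both `b` and `c`, which is also the access pattern that vectorizes well.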

Key Insights (activation_comparison.py):
- Compares common ML activation functions
- Shows relative costs (ReLU << Tanh < Exp < Sigmoid, matching the timings below)
- Demonstrates SIMD benefits for transcendental functions

Expected Output:

    Activation  NumPy (μs)       PyTorch (μs)     Winner   Speedup
    ------------------------------------------------------------------------------------------
    ReLU        2.34 ± 0.12      5.67 ± 0.23      NumPy    2.42x
    Sigmoid     15.67 ± 0.45     32.34 ± 1.12     NumPy    2.06x
    Tanh        8.92 ± 0.34      18.45 ± 0.67     NumPy    2.07x
    Exp         12.45 ± 0.56     28.91 ± 1.23     NumPy    2.32x

Trueno Context:
- SIMD-optimized implementations
- 2-4x speedup for compute-intensive activations
- Zero Python overhead for ML inference
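
The cost ordering above follows from what each function computes: ReLU is a single comparison, while sigmoid and tanh evaluate transcendental functions. The scalar reference versions below are a sketch for illustration only; the names `relu`, `sigmoid`, `tanh_act`, and `apply` are assumptions and not Trueno's API.

```rust
/// ReLU: max(x, 0). A single comparison, no transcendental math.
fn relu(x: f32) -> f32 {
    x.max(0.0)
}

/// Logistic sigmoid: 1 / (1 + e^(-x)). One exp plus a division.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Hyperbolic tangent, delegated to the standard library.
fn tanh_act(x: f32) -> f32 {
    x.tanh()
}

/// Apply an activation element-wise (hypothetical helper).
fn apply(xs: &[f32], f: fn(f32) -> f32) -> Vec<f32> {
    xs.iter().map(|&x| f(x)).collect()
}

fn main() {
    let xs = [-2.0f32, 0.0, 2.0];
    println!("relu    = {:?}", apply(&xs, relu));
    println!("sigmoid = {:?}", apply(&xs, sigmoid));
    println!("tanh    = {:?}", apply(&xs, tanh_act));
}
```

A SIMD backend would evaluate the same formulas on several elements per instruction, which is where the gains for the `exp`-based functions come from.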
| Operation | Trueno vs Scalar | Trueno vs NumPy | Trueno vs PyTorch |
|---|---|---|---|
| Dot Product | 11.9x faster | 1.6x faster | 2.8x faster |
| Matrix Multiply | 7x faster (128×128) | ~1x (competitive) | ~1.5x faster |
| Activations | 2-4x faster | ~1x (competitive) | ~2x faster |
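
The "mean ± spread" timings and "Nx" speedups in this README can be read as simple statistics over repeated runs. The sketch below assumes that interpretation (mean and sample standard deviation in microseconds, speedup as a ratio of means); it is not the benchmark scripts' actual code, and `time_micros`, `mean_std`, and `speedup` are hypothetical names.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Mean and sample standard deviation of a slice of measurements.
fn mean_std(samples: &[f64]) -> (f64, f64) {
    assert!(samples.len() >= 2, "need at least two samples");
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var.sqrt())
}

/// Run `f` `runs` times and report per-call time in microseconds.
fn time_micros<F: FnMut()>(mut f: F, runs: usize) -> (f64, f64) {
    let samples: Vec<f64> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_secs_f64() * 1e6
        })
        .collect();
    mean_std(&samples)
}

/// Speedup of a candidate over a baseline, as a ratio of mean times.
fn speedup(baseline_us: f64, candidate_us: f64) -> f64 {
    baseline_us / candidate_us
}

fn main() {
    let data: Vec<f32> = (0..10_000).map(|i| i as f32).collect();
    let (mean, spread) = time_micros(
        || {
            // black_box keeps the optimizer from deleting the work.
            let s: f32 = black_box(&data).iter().sum();
            black_box(s);
        },
        100,
    );
    println!("sum over 10,000 floats: {mean:.2} ± {spread:.2} μs");
    println!("1.85 μs vs 0.82 μs: {:.2}x", speedup(1.85, 0.82));
}
```

Run timing code with `--release`; debug builds can distort the ratios considerably.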
✅ Ideal Use Cases:
- Real-time systems requiring predictable latency
- Embedded systems without Python runtime
- WebAssembly deployment (browser/edge)
- ML inference pipelines in Rust
- Systems programming with high-performance compute needs

Better served by NumPy/PyTorch:
- Rapid prototyping in Python
- Large ecosystem of Python ML libraries
- Training large neural networks (PyTorch GPU)
- Interactive data exploration (Jupyter notebooks)

- Comprehensive Benchmarks: See `benchmarks/README.md`
- Performance Analysis: See `docs/performance-analysis.md`
- API Documentation: See `docs/` directory
- Project README: See root `README.md`

To add new examples:
- Rust examples: Add to this directory with `.rs` extension
- Python examples: Add comparison scripts with NumPy/PyTorch
- Update this README: Document the new example
- Follow TDD: Ensure examples are well-tested
See CLAUDE.md for development guidelines.

Last Updated: 2026-01-25 | Version: v0.14.1 | Contact: GitHub Issues