Trueno Examples

This directory contains examples demonstrating Trueno's high-performance compute capabilities and comparisons with NumPy/PyTorch.

📁 Examples Overview

Rust Examples (Trueno Native)

Example	Description	Command
`quickstart.rs`	⭐ Start here! All core features in one file	`cargo run --example quickstart`
`performance_demo.rs`	Compare Scalar vs SSE2/AVX backends	`cargo run --release --example performance_demo`
`matrix_operations.rs`	Matrix multiplication and transpose	`cargo run --release --example matrix_operations`
`activation_functions.rs`	Neural network activations (ReLU, Sigmoid, etc.)	`cargo run --release --example activation_functions`
`backend_detection.rs`	Auto-detection of available SIMD backends	`cargo run --release --example backend_detection`
`ml_similarity.rs`	Cosine similarity for ML applications	`cargo run --release --example ml_similarity`
`symmetric_eigen.rs`	Eigendecomposition for PCA/spectral analysis	`cargo run --release --example symmetric_eigen`
`hash_demo.rs`	SIMD-optimized hashing for KV stores	`cargo run --release --example hash_demo`
`gpu_batch_demo.rs`	GPU batch operations (requires `gpu` feature)	`cargo run --release --features gpu --example gpu_batch_demo`
`gpu_monitor_demo.rs`	GPU monitoring and metrics	`cargo run --release --features gpu --example gpu_monitor_demo`
`gpu_tiled_reduction.rs`	GPU tiled reduction operations	`cargo run --release --features gpu --example gpu_tiled_reduction`
`tiled_reduction_demo.rs`	TensorView and PartitionView demo	`cargo run --release --example tiled_reduction_demo`
`perf_tui.rs`	Interactive TUI performance dashboard	`cargo run --release --example perf_tui`
`regression_test.rs`	Numerical regression testing	`cargo run --release --example regression_test`
`vocab_bench.rs`	Vocabulary processing benchmark	`cargo run --release --example vocab_bench`
`profile_vocab.rs`	Vocabulary profiling	`cargo run --release --example profile_vocab`
`execution_graph.rs`	Execution path graph demo	`cargo run --release --features execution-graph --example execution_graph`
`ml_tuner_demo.rs`	ML-based kernel selection	`cargo run --release --features ml-tuner --example ml_tuner_demo`
`ml_tuner_evolution.rs`	ML tuner evolution demo	`cargo run --release --features ml-tuner --example ml_tuner_evolution`
`model_tracing.rs`	Model-level inference tracing	`cargo run --release --example model_tracing`
`tiling_demo.rs`	Tiling compute blocks	`cargo run --release --example tiling_demo`
`tile_profiler_demo.rs`	Tile profiler demo	`cargo run --release --example tile_profiler_demo`
`blis_benchmark.rs`	BLIS-style GEMM benchmark	`cargo run --release --example blis_benchmark`
`simd_comparison.rs`	SIMD backend comparison	`cargo run --release --example simd_comparison`
`simd_softmax_quant.rs`	SIMD softmax + quantization	`cargo run --release --example simd_softmax_quant`

Benchmark Examples

Example	Description	Command
`benchmark_matrix_suite.rs`	Matrix operation benchmarks	`cargo run --release --example benchmark_matrix_suite`
`benchmark_matvec.rs`	Matrix-vector multiplication	`cargo run --release --example benchmark_matvec`
`benchmark_matvec_parallel.rs`	Parallel matrix-vector ops	`cargo run --release --example benchmark_matvec_parallel`
`benchmark_parallel.rs`	Parallel computation benchmarks	`cargo run --release --example benchmark_parallel`

CUDA/PTX Examples (trueno-gpu)

Example	Description	Command
`ptx_quickstart`	⭐ Start here! Basic PTX code generation	`cargo run -p trueno-gpu --example ptx_quickstart`
`gemm_kernel`	GEMM kernel generation (naive/tiled)	`cargo run -p trueno-gpu --example gemm_kernel`
`cuda_monitor`	GPU monitoring and metrics	`cargo run -p trueno-gpu --example cuda_monitor`
`flash_attention_cuda`	Flash Attention implementation	`cargo run -p trueno-gpu --example flash_attention_cuda`
`simple_attention_cuda`	Basic multi-head attention	`cargo run -p trueno-gpu --example simple_attention_cuda`
`q4k_gemm`	Quantized GEMM (Q4_K format)	`cargo run -p trueno-gpu --example q4k_gemm`
`q5k_q6k_gemm`	Q5_K/Q6_K quantized GEMM (PARITY-116/117)	`cargo run -p trueno-gpu --example q5k_q6k_gemm`
`register_allocation`	PTX register allocation demo	`cargo run -p trueno-gpu --example register_allocation`
`gpu_pixels_render`	GPU pixel rendering	`cargo run -p trueno-gpu --example gpu_pixels_render`
`dump_ptx`	Dump raw PTX output	`cargo run -p trueno-gpu --example dump_ptx`
`satd_kernels`	SATD (video codec) kernels	`cargo run -p trueno-gpu --example satd_kernels`
`lz4_compression`	🗜️ LZ4 compression kernel	`cargo run -p trueno-gpu --example lz4_compression`
`lz4_file_compress`	📦 LZ4 file compression CLI	`cargo run -p trueno-gpu --example lz4_file_compress -- bench`
`ptx_optimize`	PTX optimization passes demo	`cargo run -p trueno-gpu --example ptx_optimize`
`bench_kernel_gen`	Kernel generation benchmarks	`cargo run -p trueno-gpu --example bench_kernel_gen`

Note: PTX generation examples work without a GPU. Runtime examples (`cuda_monitor`, `flash_attention_cuda`) require an NVIDIA GPU with CUDA drivers.

Python Examples (NumPy/PyTorch Comparison)

Example	Description	Command
`dot_product_comparison.py`	⚡ Dot product benchmark	`uv run examples/dot_product_comparison.py`
`matrix_multiply_comparison.py`	🔢 Matrix multiplication benchmark	`uv run examples/matrix_multiply_comparison.py`
`activation_comparison.py`	🧠 Activation functions benchmark	`uv run examples/activation_comparison.py`

🚀 Quick Start

Running Python Comparisons

Install UV (if not already installed):

curl -LsSf https://astral.sh/uv/install.sh | sh

Run any comparison example (UV handles dependencies automatically):

uv run examples/dot_product_comparison.py
uv run examples/matrix_multiply_comparison.py
uv run examples/activation_comparison.py

Alternatively, use the benchmarks environment:

cd benchmarks
uv run ../examples/dot_product_comparison.py

Running Rust Examples

cargo run --release --example performance_demo
cargo run --release --example matrix_operations

📊 What Do These Examples Show?

Dot Product Comparison (`dot_product_comparison.py`)

Key Insights:

Demonstrates compute-intensive operations where SIMD excels
Shows NumPy vs PyTorch performance characteristics
Highlights Trueno's 1.6x advantage over NumPy (11.9x over scalar)

Expected Output:

Size          NumPy (μs)      PyTorch (μs)    Winner          Speedup
--------------------------------------------------------------------------------
100               0.82 ±  0.15      1.85 ±  0.22  NumPy           2.26x
1,000             3.21 ±  0.18      6.45 ±  0.31  NumPy           2.01x
10,000           25.67 ±  1.23     58.32 ±  2.45  NumPy           2.27x

Trueno Context:

Trueno AVX-512: 11.9x faster than scalar
Trueno AVX-512: 1.6x faster than NumPy
Trueno AVX-512: 2.8x faster than PyTorch
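
Each row of the sample output above reports a mean ± standard deviation per library, the faster library, and the ratio of the two means. A minimal sketch of how such a row could be produced with only the Python standard library is shown below. The function names (`time_us`, `format_row`), the repeat counts, and the pure-Python stand-in workloads are assumptions for illustration and are not taken from `dot_product_comparison.py`.

```python
import statistics
import timeit


def time_us(fn, repeats=5, number=1000):
    """Return (mean, stdev) of per-call time in microseconds over `repeats` runs."""
    runs = timeit.repeat(fn, repeat=repeats, number=number)
    per_call = [t / number * 1e6 for t in runs]
    return statistics.mean(per_call), statistics.stdev(per_call)


def format_row(label, a_name, a, b_name, b):
    """Format one row; `a` and `b` are (mean, stdev) tuples in microseconds."""
    (a_mean, a_std), (b_mean, b_std) = a, b
    winner = a_name if a_mean <= b_mean else b_name
    # Speedup is the slower mean divided by the faster mean.
    speedup = max(a_mean, b_mean) / min(a_mean, b_mean)
    return (f"{label:<12}{a_mean:10.2f} ± {a_std:5.2f}"
            f"{b_mean:10.2f} ± {b_std:5.2f}  {winner:<10}{speedup:6.2f}x")


# Recompute the winner and speedup for the size-100 sample row above.
row = format_row("100", "NumPy", (0.82, 0.15), "PyTorch", (1.85, 0.22))
print(row)

# Time two stand-in workloads (pure Python, so no NumPy/PyTorch is required).
xs = [float(i) for i in range(100)]
genexpr = time_us(lambda: sum(x * x for x in xs), repeats=3, number=200)
listcomp = time_us(lambda: [x * x for x in xs], repeats=3, number=200)
print(format_row("100", "genexpr", genexpr, "listcomp", listcomp))
```

Feeding the size-100 sample means into `format_row` gives 1.85 / 0.82 ≈ 2.26, which matches the Speedup column. Measured timings from `time_us` will differ from machine to machine.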

Matrix Multiplication Comparison (`matrix_multiply_comparison.py`)

Key Insights:

Shows O(n³) complexity scaling
Demonstrates when GPU acceleration becomes effective
Highlights optimized BLAS libraries in NumPy

Expected Output:

Size       NumPy Time           PyTorch Time         Winner       Speedup
------------------------------------------------------------------------------------------
64×64      59.87 μs             125.34 μs            NumPy          2.09x
128×128    434.23 μs            678.45 μs            NumPy          1.56x
256×256    2.67 ms              3.45 ms              NumPy          1.29x
512×512    19.82 ms             25.67 ms             NumPy          1.29x
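
As a rough check on the O(n³) claim, the sample NumPy timings above can be compared with the theoretical work ratio: doubling n multiplies the operation count by 8. The sketch below uses only the numbers from the table; the `flops` helper and the 2·n³ operation count are the usual textbook approximation, not something measured by the script.

```python
# Sample NumPy timings from the table above, converted to microseconds.
numpy_us = {64: 59.87, 128: 434.23, 256: 2670.0, 512: 19820.0}


def flops(n):
    """Approximate floating-point operations for an n×n by n×n product."""
    return 2 * n ** 3  # n^3 multiplies plus roughly n^3 additions


sizes = sorted(numpy_us)
ratios = {}
for small, large in zip(sizes, sizes[1:]):
    work_ratio = flops(large) / flops(small)        # 8.0 whenever n doubles
    time_ratio = numpy_us[large] / numpy_us[small]  # observed slowdown
    ratios[(small, large)] = time_ratio
    print(f"{small:>3} -> {large:<3}  work x{work_ratio:.1f}  time x{time_ratio:.2f}")

# Throughput implied by the largest sample: FLOPs per microsecond / 1e3 = GFLOP/s.
gflops_512 = flops(512) / numpy_us[512] / 1e3
print(f"512x512: {gflops_512:.1f} GFLOP/s")
```

With these sample values, each doubling of n costs between roughly 6x and 7.5x in time, somewhat below the theoretical 8x, which is what one would expect when larger matrices use the hardware more efficiently.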

Trueno Context:

SIMD backend: ~7x faster than naive O(n³) for 128×128
GPU backend: 2-10x faster than scalar for 500×500+
Automatic backend selection based on matrix size
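
To make "automatic backend selection based on matrix size" concrete, here is a purely hypothetical sketch of size-based dispatch. Only the 500×500 figure comes from the text above; the function name `pick_backend`, the backend labels, and the decision order are assumptions for illustration and do not describe Trueno's actual selection logic.

```python
def pick_backend(n, gpu_available, simd_available=True, gpu_min_dim=500):
    """Choose a backend label for an n×n matrix multiply (illustrative only).

    gpu_min_dim=500 mirrors the "500×500+" figure quoted above; everything
    else here is a hypothetical stand-in, not Trueno's real selection code.
    """
    if gpu_available and n >= gpu_min_dim:
        return "gpu"      # large problems can amortize host-to-device transfers
    if simd_available:
        return "simd"     # vectorized CPU path for small and medium sizes
    return "scalar"       # portable fallback


for n in (64, 128, 512, 1024):
    print(n, pick_backend(n, gpu_available=True))
```

The design point this illustrates is that a fixed per-call cost (such as copying data to the GPU) only pays off once the amount of compute is large enough, so a size threshold is a natural first heuristic.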

Activation Functions Comparison (`activation_comparison.py`)

Key Insights:

Compares common ML activation functions
Shows relative costs (ReLU << Tanh < Exp < Sigmoid in the sample output below)
Demonstrates SIMD benefits for transcendental functions

Expected Output:

Activation      NumPy (μs)      PyTorch (μs)    Winner       Speedup
------------------------------------------------------------------------------------------
ReLU                2.34 ±  0.12      5.67 ±  0.23  NumPy          2.42x
Sigmoid            15.67 ±  0.45     32.34 ±  1.12  NumPy          2.06x
Tanh                8.92 ±  0.34     18.45 ±  0.67  NumPy          2.07x
Exp                12.45 ±  0.56     28.91 ±  1.23  NumPy          2.32x

Trueno Context:

SIMD-optimized implementations
2-4x speedup for compute-intensive activations
Zero Python overhead for ML inference
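
The cost differences come from what each activation has to compute. The scalar reference definitions below are a sketch using only Python's `math` module; they are not the implementations benchmarked by `activation_comparison.py` or used inside Trueno. ReLU needs a single comparison, while Sigmoid needs an exponential plus an addition and a division, which is consistent with Sigmoid taking longer than Exp in the sample output above.

```python
import math


def relu(x):
    return x if x > 0.0 else 0.0          # one comparison, no transcendental call


def sigmoid(x):
    # Numerically stable form: never calls exp on a large positive argument.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    return math.tanh(x)                   # equals 2 * sigmoid(2 * x) - 1


values = [-2.0, -0.5, 0.0, 0.5, 2.0]
for name, fn in [("ReLU", relu), ("Sigmoid", sigmoid), ("Tanh", tanh), ("Exp", math.exp)]:
    print(f"{name:<8}", [round(fn(v), 4) for v in values])
```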

🎯 Performance Summary

Operation	Trueno vs Scalar	Trueno vs NumPy	Trueno vs PyTorch
Dot Product	11.9x faster	1.6x faster	2.8x faster
Matrix Multiply	7x faster (128×128)	~1x (competitive)	~1.5x faster
Activations	2-4x faster	~1x (competitive)	~2x faster

💡 When to Use Trueno

✅ Ideal Use Cases:

Real-time systems requiring predictable latency
Embedded systems without Python runtime
WebAssembly deployment (browser/edge)
ML inference pipelines in Rust
Systems programming with high-performance compute needs

⚠️ When NumPy/PyTorch May Be Better:

Rapid prototyping in Python
Large ecosystem of Python ML libraries
Training large neural networks (PyTorch GPU)
Interactive data exploration (Jupyter notebooks)

📚 More Resources

Comprehensive Benchmarks: See benchmarks/README.md
Performance Analysis: See docs/performance-analysis.md
API Documentation: See docs/ directory
Project README: See root README.md

🤝 Contributing

To add new examples:

Rust examples: Add to this directory with .rs extension
Python examples: Add comparison scripts with NumPy/PyTorch
Update this README: Document the new example
Follow TDD: Ensure examples are well-tested

See CLAUDE.md for development guidelines.

Last Updated: 2026-01-25
Version: v0.14.1
Contact: GitHub Issues
