This guide covers RHDL's execution backends, their performance characteristics, and how to benchmark your designs.
RHDL provides multiple simulation backends with different performance/flexibility tradeoffs:
| Backend | Type | Speed | Startup | Best For |
|---|---|---|---|---|
| Ruby Behavioral | Interpreted | Baseline | Immediate | Development, debugging |
| IR Interpreter | Native (Rust) | ~60K cycles/s | Immediate | Interactive debugging |
| IR JIT | Native (Cranelift) | ~200-600K cycles/s | 0.05-0.5s | Moderate simulations |
| IR Compiler | Native (AOT) | ~1-2M cycles/s | 5-8s | Long simulations |
| Verilator | External (C++) | ~5-6M cycles/s | 10-30s | Maximum performance |
| CIRCT/MLIR (Arcilator) | External (LLVM/CIRCT) | Workload-dependent | Toolchain-dependent | Native RTL parity and benchmarking |
RHDL includes rake tasks for benchmarking different system configurations:
Benchmarks the MOS 6502 CPU running Karateka game code with memory bridging:
rake bench[mos6502] # Default: 5M cycles
rake bench[mos6502,1000000] # Custom: 1M cycles

Sample Results (1M cycles):
| Backend | Init Time | Run Time | Rate | Speedup |
|---|---|---|---|---|
| Interpreter | - | - | - | (skipped >100K) |
| JIT | 0.06s | 4.32s | 0.23M/s | 1.0x |
| Compiler | 7.88s | 0.63s | 1.58M/s | 6.8x |
| Verilator | ~15s | ~0.18s | ~5.6M/s | ~24x |
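The Rate and Speedup columns follow from the cycle count and the measured run time. A minimal Ruby sketch of that arithmetic, using the JIT and Compiler rows above (the `rate_millions` and `speedup` helpers are illustrative names, not part of RHDL):

```ruby
# Illustrative helpers, not part of RHDL: recompute the Rate and Speedup
# columns from a cycle count and measured run times.
CYCLES = 1_000_000

# Cycles per second, expressed in millions (the "M/s" column).
def rate_millions(cycles, run_seconds)
  cycles.to_f / run_seconds / 1_000_000
end

# Speedup relative to a baseline run time (the JIT row in the table).
def speedup(baseline_seconds, run_seconds)
  baseline_seconds.to_f / run_seconds
end

jit_run = 4.32       # seconds, from the JIT row
compiler_run = 0.63  # seconds, from the Compiler row

puts format("JIT:      %.2fM/s", rate_millions(CYCLES, jit_run))
puts format("Compiler: %.2fM/s, %.1fx vs JIT",
            rate_millions(CYCLES, compiler_run), speedup(jit_run, compiler_run))
```

Because the run times in the table are rounded to two decimals, recomputed values can differ from the printed ones in the last digit.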
Benchmarks the complete Apple II system (CPU + memory + I/O) running Karateka:
rake bench[apple2] # Default: 5M cycles
rake bench[apple2,1000000] # Custom: 1M cycles

Sample Results (1M cycles):
| Backend | Init Time | Run Time | Rate | Speedup |
|---|---|---|---|---|
| Interpreter | - | - | - | (skipped >100K) |
| JIT | 0.05s | 17.38s | 0.06M/s | 1.0x |
| Compiler | 6.46s | 3.62s | 0.28M/s | 4.8x |
| Verilator | ~20s | ~0.18s | ~5.6M/s | ~97x |
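The bench tasks above take a target and an optional cycle count in brackets. To compare several cycle counts in one go, the invocations could be generated from Ruby. This is only a sketch: the `bench_tasks` helper is hypothetical, and only the `bench[target,cycles]` syntax is taken from this guide.

```ruby
# Illustrative sketch: build rake bench invocations for several targets
# and cycle counts. bench_tasks is a hypothetical helper, not an RHDL API.
TARGETS = %w[mos6502 apple2].freeze
CYCLE_COUNTS = [100_000, 1_000_000].freeze

# Returns task strings such as "bench[mos6502,100000]".
def bench_tasks(targets, cycle_counts)
  targets.product(cycle_counts).map do |target, cycles|
    "bench[#{target},#{cycles}]"
  end
end

bench_tasks(TARGETS, CYCLE_COUNTS).each do |task|
  puts "rake #{task}"
  # Uncomment to actually run each benchmark from the project root:
  # system("rake", task) or abort("#{task} failed")
end
```

Passing the task as a separate argument to `system` bypasses the shell, so the square brackets do not need quoting.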
Benchmarks the GameBoy running Prince of Persia ROM for a specified number of frames:
rake bench[gameboy] # Default: 1000 frames
rake bench[gameboy,100] # Custom: 100 frames

Sample Results (100 frames / 7M cycles):
| Backend | Init Time | Run Time | Speed | % Real-time |
|---|---|---|---|---|
| IR Compiler | 5.63s | 5.51s | 1.27 MHz | 30.4% |
| Verilator | ~15s | ~1.2s | ~5.8 MHz | ~138% |
The GameBoy runs at 4.19 MHz, so backends achieving >100% can run faster than real hardware.
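The Speed and % Real-time columns can be derived from the cycle count and run time. A minimal sketch, assuming the rounded figures quoted above (about 7M cycles for 100 frames and a 4.19 MHz clock); the helper names are illustrative, not part of RHDL:

```ruby
# Illustrative sketch, not an RHDL API: convert a benchmark result into
# emulated clock speed and percentage of real GameBoy speed.
GAMEBOY_CLOCK_HZ = 4_190_000.0 # 4.19 MHz, as stated in this guide

# Emulated clock rate in MHz (the "Speed" column).
def emulated_mhz(cycles, run_seconds)
  cycles.to_f / run_seconds / 1_000_000
end

# Emulated rate as a percentage of the real hardware clock.
def percent_realtime(cycles, run_seconds, clock_hz = GAMEBOY_CLOCK_HZ)
  cycles.to_f / run_seconds / clock_hz * 100
end

cycles = 7_000_000 # approximate cycle count for 100 frames
{ "IR Compiler" => 5.51, "Verilator" => 1.2 }.each do |backend, run_seconds|
  puts format("%-12s %.2f MHz, %.1f%% of real time",
              backend,
              emulated_mhz(cycles, run_seconds),
              percent_realtime(cycles, run_seconds))
end
```

With these rounded inputs the results land within a few tenths of the table values; the exact cycle count per frame and the precise clock frequency account for the remaining difference.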
Benchmarks low-level gate simulation:
rake bench:native[gates] # Gate-level toggle benchmark
rake bench:native[cpu8bit] # 8-bit CPU FastHarness benchmark

Benchmarks browser-style WASM execution paths for compiler vs arcilator vs verilator backends:
rake bench:web[apple2] # Apple II web WASM benchmark
rake bench:web[apple2,5000000] # Apple II custom cycle count
rake bench:web[riscv] # RISC-V xv6 web WASM benchmark
rake bench:web[riscv,100000] # RISC-V custom cycle count

Ruby Behavioral
- Developing new components
- Interactive debugging with detailed error messages
- Small test cases where performance isn't critical
- Validating correctness before synthesis
IR Interpreter
- Interactive CPU debugging with breakpoints
- Small simulation runs (<100K cycles)
- When you need visibility into instruction execution
IR JIT (Cranelift)
- Moderate length simulations (100K - 10M cycles)
- Good balance between startup time and sustained speed
- Running games interactively
IR Compiler (AOT)
- Long simulations (>1M cycles)
- Full system simulation
- When compilation time is acceptable for faster execution
- Batch testing
Verilator
- Maximum performance requirements
- Reference validation against RTL
- Running complex games at real-time or faster
- When you need cycle-accurate RTL simulation
CIRCT/MLIR (Arcilator)
- Native RTL simulation using FIRRTL/MLIR lowering
- Cross-validating behavior against Verilator and IR compiler backends
- Useful when your flow is already based on CIRCT tools (firtool, arcilator)
- Similar startup/runtime tradeoffs to other external RTL backends
The backends trade startup time against runtime performance; the slower a backend is to start, the faster it runs once started:
Startup Time
Fast ←───────────────────────────────────→ Slow
│ │
│ Interpreter JIT Compiler Verilator/CIRCT │
│ ↓ ↓ ↓ ↓ │
│ Slow Medium Fast Fastest │
│ │
Slow ←───────────────────────────────────→ Fast
Runtime Speed
Rule of thumb:
- For <100K cycles: Use Interpreter or JIT
- For 100K-1M cycles: Use JIT
- For 1M-10M cycles: Use Compiler
- For >10M cycles: Use Verilator or CIRCT/MLIR (if available)
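The thresholds above reflect a break-even point: a backend with a long init only pays off once the run is long enough to recover that cost. A minimal sketch of the calculation, using the MOS 6502 sample figures from this guide (the `Backend` struct and `break_even_cycles` helper are illustrative, not part of RHDL; measure your own design before relying on the numbers):

```ruby
# Illustrative sketch, not an RHDL API. Figures are the MOS 6502 sample
# results from this guide.
Backend = Struct.new(:name, :init_seconds, :cycles_per_second) do
  # Estimated wall-clock time: one-off init plus steady-state run time.
  def total_seconds(cycles)
    init_seconds + cycles / cycles_per_second.to_f
  end
end

JIT      = Backend.new("JIT", 0.06, 230_000)
COMPILER = Backend.new("Compiler", 7.88, 1_580_000)

# Cycle count at which the slower-starting backend catches up with the
# faster-starting one. Returns nil if it never does.
def break_even_cycles(fast_start, slow_start)
  per_cycle_gain = 1.0 / fast_start.cycles_per_second -
                   1.0 / slow_start.cycles_per_second
  return nil unless per_cycle_gain.positive?

  (slow_start.init_seconds - fast_start.init_seconds) / per_cycle_gain
end

cycles = break_even_cycles(JIT, COMPILER)
puts format("Compiler overtakes JIT after about %.1fM cycles", cycles / 1_000_000.0)
```

For these sample figures the break-even falls at roughly two million cycles, which is consistent with recommending the JIT below 1M cycles and the Compiler above it.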
Native backends require the Rust toolchain. Build all extensions with:
rake native:build # Build all Rust extensions
rake native:check # Verify availability
rake native:clean # Clean build artifacts

The build process compiles:
- ISA Simulator Native (MOS 6502)
- Netlist Interpreter (gate-level)
- Netlist JIT (gate-level Cranelift)
- Netlist Compiler (gate-level SIMD)
- IR Interpreter
- IR JIT (Cranelift)
- IR Compiler (AOT)
Verilator provides the fastest simulation by compiling Verilog to optimized C++:
# Ubuntu/Debian
sudo apt-get install verilator
# macOS
brew install verilator
# Verify installation
verilator --version

When Verilator is available, it's automatically included in benchmark comparisons.
CIRCT/MLIR flows use firtool and arcilator:
# Verify installation
firtool --version
arcilator --version

Availability and packaging vary by platform; see CIRCT release/install docs for your environment.
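The version checks above can also be scripted. The following sketch looks up each external tool on `PATH` from Ruby; `find_tool` is a hypothetical helper written for illustration, and how RHDL itself detects these tools is not covered here.

```ruby
# Illustrative helper, not part of RHDL: look up an executable on PATH,
# similar in spirit to the shell's `command -v`.
def find_tool(name, search_path = ENV.fetch("PATH", ""))
  search_path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Report which of the external simulators mentioned in this guide are installed.
%w[verilator firtool arcilator].each do |tool|
  location = find_tool(tool)
  puts "#{tool}: #{location || 'not found'}"
end
```

The optional second argument makes the lookup easy to test against a custom directory list.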
- Init Time: Time to initialize the backend (includes JIT/AOT compilation)
- Run Time: Time to execute the specified cycles
- Rate: Cycles per second (higher is better)
- Speed: For GameBoy, the emulated clock rate in MHz; % Real-time expresses it as a percentage of real hardware speed (4.19 MHz)
- Circuit Complexity: More gates/signals = slower simulation
- Memory Access Patterns: Frequent memory operations add overhead
- Backend Optimization Level: Compiler > JIT > Interpreter
- Hardware: CPU speed, cache size, SIMD support (AVX2/AVX512)
| System | IR JIT | IR Compiler | Verilator |
|---|---|---|---|
| MOS 6502 (CPU only) | 200-300K/s | 1.5-2M/s | 5-6M/s |
| Apple II (full system) | 50-100K/s | 250-400K/s | 5-6M/s |
| GameBoy (full system) | 400-600K/s | 1-1.5M/s | 5-6M/s |
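These ranges can be turned into a rough run-time estimate for a given cycle budget. A minimal sketch using the Apple II row (the `run_time_range` helper is illustrative, not part of RHDL, and the estimate excludes init time):

```ruby
# Illustrative sketch, not an RHDL API: estimate run time (excluding init)
# from the throughput ranges in the table above.
APPLE2_RANGES = {
  "IR JIT"      => 50_000..100_000,      # cycles per second
  "IR Compiler" => 250_000..400_000,
  "Verilator"   => 5_000_000..6_000_000
}.freeze

# Returns [best_case_seconds, worst_case_seconds] for a cycle count.
def run_time_range(cycles, rate_range)
  [cycles.to_f / rate_range.end, cycles.to_f / rate_range.begin]
end

cycles = 10_000_000
APPLE2_RANGES.each do |backend, range|
  best, worst = run_time_range(cycles, range)
  puts format("%-12s %.1f-%.1f s", backend, best, worst)
end
```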
- Compare backends: Large JIT-to-Compiler speedup suggests bytecode overhead
- Check init time: Long init suggests complex IR generation
- Monitor memory: High memory usage may indicate inefficient signal storage
- Reduce hierarchy depth: Flatter designs simulate faster
- Minimize wire fan-out: High fan-out increases update propagation
- Use registers wisely: Excessive registers add clock overhead
- Batch operations: Process multiple test vectors with SIMD backends
- Simulation - Detailed backend documentation
- Gate-Level Backend - Gate-level synthesis and simulation
- CLI Reference - Command-line tools