End-to-end Mamba selective state-space model accelerator targeting AWS F2 FPGA (Xilinx VU47P) via Catapult HLS.
CS217 - AI Accelerator Design Project
| Parameter | Value |
|---|---|
| d_model | 64 |
| d_inner | 128 |
| d_state | 16 |
| d_conv | 4 |
| dt_rank | 4 |
| Scan engines (P) | 4 |
| Chunk size (C) | 16 |
| TILE_D (GEMV parallelism) | 16 |
| Weight type | Q4.12 (ac_fixed<16,4,true>) |
| Activation type | Q8.8 (ac_fixed<16,8,true>) |
| Accumulator type | Q16.16 (ac_fixed<32,16,true>) |
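The Q-notation in the table reads as ⟨integer bits⟩.⟨fraction bits⟩ signed fixed point. A rough Python sketch of the quantization is below (illustrative only: it uses round-to-nearest with saturation, whereas `ac_fixed` defaults to truncation and wrap unless `AC_RND`/`AC_SAT` are requested):

```python
def quantize(x, total_bits, int_bits):
    """Round x onto a signed fixed-point grid that is total_bits wide with
    int_bits integer bits (roughly ac_fixed<total_bits, int_bits, true>,
    but with round-to-nearest and saturation instead of the ac defaults)."""
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))        # most negative raw code
    hi = (1 << (total_bits - 1)) - 1     # most positive raw code
    raw = max(lo, min(hi, round(x * scale)))
    return raw / scale

# Q4.12 weight: step 2^-12, representable range approx [-8, 8)
w = quantize(0.1234, 16, 4)
# Q8.8 activation: step 2^-8, range approx [-128, 128)
a = quantize(3.14159, 16, 8)
# Q16.16 accumulator: step 2^-16, range approx [-32768, 32768)
acc = quantize(w * a, 32, 16)
```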
Repository layout:

```
.
├── src/                                      # SystemC source (software model + HLS target)
│   ├── include/                              # Shared specs and type definitions
│   │   ├── MambaSpec.h                       # Mamba dimension constants
│   │   ├── MambaCoreSpec.h                   # Core address map (kAddr* constants)
│   │   └── AxiSpec.h                         # AXI types and address widths
│   └── Top/
│       ├── Top.h                             # Top-level module (PE + GB + DataBus)
│       ├── TolerantManagerFromFile.h         # AXI test driver
│       ├── testbench.cpp                     # Top-level testbench
│       ├── PEPartition/
│       │   ├── PEPartition.h                 # PE partition wrapper
│       │   └── PEModule/
│       │       ├── PEModule.h                # *** Main HLS synthesis target ***
│       │       │                             #   Mamba FSM, GEMV, fused decode,
│       │       │                             #   AXI receive, weight banking
│       │       ├── ActUnit/                  # (Legacy starter code)
│       │       ├── PECore/                   # (Legacy starter code)
│       │       └── MambaCore/                # Software golden model + shared units
│       │           ├── MambaCore.h           # Software golden model (SystemC)
│       │           ├── MambaDatapath.h       # Shared functions (adder tree, RMSNorm)
│       │           ├── GEMVUnit.h            # GEMV engine with bank macros
│       │           ├── ChunkedScanUnit.h     # Parallel prefix scan (P=4, C=16)
│       │           ├── test_datapath.cpp     # Unit test: GEMV, RMSNorm
│       │           ├── test_chunked_scan.cpp # Unit test: parallel scan
│       │           └── test_mambacore.cpp    # Integration test: full block
│       ├── GBPartition/
│       │   └── GBModule/
│       │       ├── GBControl/GBControl.h     # Host-PE orchestration FSM
│       │       ├── GBCore/GBCore.h           # Unified SRAM scratchpad
│       │       └── NMP/NMP.h                 # AXI routing
│       └── DataBus/DataBus.h                 # PE-GB communication buses
├── hls/                                      # HLS build directory (Catapult projects)
│   └── Top/                                  # Mirrors src/ hierarchy
├── design_top/                               # AWS F2 shell integration
│   ├── design/concat_Top.v                   # Generated RTL Verilog
│   └── verif/tests/                          # HW-sim .mem files and SV testbench inputs
├── scripts/
│   ├── hls/nvhls_exec.tcl                    # Catapult synthesis config
│   └── plot_*.py                             # Timing / error plotting utilities
├── tests/                                    # Generated per-test artifacts (active flow)
│   └── <name>/                               # .mem files, vectors/, golden outputs, config.json
├── golden_model/test_vectors/                # Legacy single-test vectors still used by older flows
├── reports/
│   ├── hls/                                  # HLS cycle reports and sim logs
│   └── aws/                                  # Vivado logs, FPGA logs, shared plots/
├── results/                                  # Sweep outputs and run-specific artifacts
│   └── runs/                                 # Per-sweep manifests, logs, timing plots
├── generate_test.py                          # Unified test generator (golden + .mem + vectors)
├── generate_axi_csv.py                       # Legacy AXI CSV generator for older flows
├── Makefile.tests                            # MambaCore unit test targets
├── build_all_hls.sh                          # Full bottom-up HLS synthesis
└── setup.csh                                 # Environment setup
```
```
/farmshare/home/classes/ee/admin/software/bin/rhel8.sh
source setup.csh
make -f Makefile.tests run_datapath     # GEMV, discretization
make -f Makefile.tests run_chunked_scan # Parallel prefix scan
make -f Makefile.tests run_mambacore    # Full MambaCore E2E
make -f Makefile.tests run_pemodule     # Full PEModule
```

Requires `source setup.csh` first (sets `$REPO_TOP` for test vector path resolution).
There are three different execution paths in this repo:
- Local C++ / SystemC tests: `src/**/testbench.cpp` and `src/**/test_*.cpp` are compiled into CPU executables such as `sim_test`, `test_mambacore`, and `test_pemodule`. Example: `make -f Makefile.tests run_pemodule` builds `src/Top/PEPartition/PEModule/sim_test` and runs it locally.
- AWS RTL hw-sim: `cd design_top && make hw_sim` runs the SystemVerilog testbench `design_top/verif/tests/design_top_base_test.sv` against the FPGA wrapper `design_top/design/design_top.sv` plus the generated RTL in the AWS shell simulation environment.
- Real FPGA run: `cd design_top && make run_fpga_test ...` launches `design_top/scripts/run_fpga_test.py`, which builds and runs the host application in `design_top/software/src/design_top.c` against an FPGA image that already contains `design_top/design/design_top.sv` plus the generated RTL. The timing metrics reported by FPGA runs come from registers implemented in `design_top/design/design_top.sv` and read back by the C host.
The active generated-artifact flow is `generate_test.py -> tests/<name>/`, which produces:

- `tests/<name>/mamba_axi_addrs.mem`
- `tests/<name>/mamba_axi_data.mem`
- `tests/<name>/golden_output.mem`
- `tests/<name>/config.json`
`golden_model/test_vectors/` still exists for older local/SystemC paths, but it is no longer the main path used by `run_fpga_test.py`.
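For quick inspection, the generated `.mem` files can be loaded from Python, assuming they follow the usual `$readmemh` convention of one hexadecimal word per line (a sketch; the exact file format and comment syntax produced by the generator are assumptions here):

```python
def load_mem(path):
    """Parse a $readmemh-style .mem file: one hex word per line,
    with '//' comments and blank lines ignored. Returns a list of ints."""
    words = []
    with open(path) as f:
        for line in f:
            line = line.split("//")[0].strip()  # strip trailing comments
            if line:
                words.append(int(line, 16))
    return words
```

For example, `zip(load_mem("tests/systemc_test/mamba_axi_addrs.mem"), load_mem("tests/systemc_test/mamba_axi_data.mem"))` would pair each AXI address with its data word, if the two files are written in lockstep.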
Unit and integration tests:

- `test_datapath` — GEMV banked MAC (ε = 0.05); discretization bitwise match vs. Python PWL golden
- `test_chunked_scan` — parallel prefix scan vs. sequential recurrence (ε = 0.05); self-contained, no external files
- `test_mambacore` — full MambaCore FSM E2E, bitwise match vs. Python PWL golden
- `test_pemodule` — full PEModule with streaming I/O, bitwise match vs. Python PWL golden
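The scan being tested exploits the fact that the first-order recurrence h[t] = a[t]·h[t−1] + b[t] is associative under the combine rule (A₂, B₂)∘(A₁, B₁) = (A₂A₁, A₂B₁ + B₂), so chunks can be reduced independently and stitched together with a short carry chain. A scalar floating-point Python sketch of the scheme (simplified; the hardware runs P=4 fixed-point engines over C=16 chunks):

```python
def scan_sequential(a, b):
    """Reference recurrence: h[t] = a[t]*h[t-1] + b[t], with h[-1] = 0."""
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def scan_chunked(a, b, C=16):
    """Two-phase chunked scan equivalent to scan_sequential."""
    chunks = [(a[s:s + C], b[s:s + C]) for s in range(0, len(a), C)]
    # Phase 1 (parallelizable across chunks): reduce chunk c to (A_c, B_c)
    # such that h_end = A_c * h_start + B_c.
    red = []
    for ca, cb in chunks:
        A, B = 1.0, 0.0
        for at, bt in zip(ca, cb):
            A, B = at * A, at * B + bt  # compose step (at, bt) after (A, B)
        red.append((A, B))
    # Phase 2a: short serial carry chain over len(a)/C chunk summaries.
    carries, h = [0.0], 0.0
    for A, B in red[:-1]:
        h = A * h + B
        carries.append(h)
    # Phase 2b (parallelizable across chunks): expand each chunk from its carry.
    out = []
    for (ca, cb), h in zip(chunks, carries):
        for at, bt in zip(ca, cb):
            h = at * h + bt
            out.append(h)
    return out
```

Only phase 2a is serial, and it touches one value per chunk rather than one per timestep, which is what the parallel engines buy.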
Generate the test vectors (requires NumPy):

```
bash run_sweep.sh --generate-only --pwl --regen

# Also generate intermediate .bin files needed by test_datapath:
python3 generate_test.py --name systemc_test --config-L 0 --config-Ngen 1 \
    --use-ac-math-pwl --emit-compat-intermediates
```

Run HLS synthesis, then copy the generated RTL into the FPGA design tree:

```
# Full bottom-up synthesis (all modules, ~12+ hours)
./build_all_hls.sh

# PEModule only (~8 hours)
cd hls/Top/PEPartition/PEModule && rm -rf Catapult && make hls

cd design_top && make copy_rtl
```

Module hierarchy:

```
Top
├── PEPartition
│   └── PEModule            ← Core Mamba datapath
│       ├── 16 weight BRAM banks
│       ├── Fused decode engine
│       ├── Chunked parallel scan (4 engines)
│       └── GEMV engine (16 parallel MAC lanes)
├── GBPartition
│   └── GBModule
│       ├── GBControl       ← Host-PE orchestration FSM
│       ├── GBCore          ← Unified SRAM scratchpad
│       └── NMP             ← AXI routing
└── DataBus                 ← PE-GB communication channels
```
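The GEMV tiling can be sketched in Python: with TILE_D = 16, each step feeds one element of x to 16 independent MAC lanes, one per weight bank, updating 16 output rows at once (a behavioral sketch of the tiling, not the HLS code):

```python
TILE_D = 16  # parallel MAC lanes; one weight bank feeds each lane

def gemv_tiled(W, x):
    """Compute y = W @ x, TILE_D output rows per tile.
    Within a tile, the lane MACs are independent of each other,
    so in hardware they run in parallel each cycle."""
    rows, cols = len(W), len(W[0])
    y = [0.0] * rows
    for base in range(0, rows, TILE_D):      # one tile of output rows
        for j in range(cols):                # stream x one element per step
            for lane in range(min(TILE_D, rows - base)):
                y[base + lane] += W[base + lane][j] * x[j]  # lane MAC
    return y
```

Banking the weights by lane is what makes this legal in hardware: the 16 concurrent reads each hit a different BRAM, so no port conflicts arise.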
To give the design more timing margin, the Vivado timing constraint is overridden in the external HDK tree:
- Files: `~/aws-fpga/hdk/common/shell_stable/build/scripts/aws_gen_clk_constraints.tcl` and `/home/ubuntu/aws-fpga/hdk/common/shell_stable/build/scripts/aws_clock_properties.tcl`
- Change: all recipes set `clk_main_a0_period = 5.208` (192 MHz target)
Build the FPGA bitstream, generate the AFI, and program the FPGA:
```
# SSH into your AWS F2 instance first, then source the AWS FPGA tooling
cd ~/aws-fpga
source hdk_setup.sh
source sdk_setup.sh

# Move into this repo's FPGA design directory and source the local setup
cd design_top
source setup.sh

make fpga_build        # takes about 2 hours
make generate_afi

# Wait for the AFI to become available
make check_afi_available

# Once available
make program_fpga
make run_fpga_test                           # defaults to tests/systemc_ngen4
make run_fpga_test FPGA_TEST=prefill16_ngen4
make run_fpga_test TEST_DIR=../tests/prefill128_ngen16
```
```
# From the repo root, run an automated FPGA sweep over (L, Ngen) configs
cd ..
bash run_sweep.sh --mode=full-grid
```

Pass `--pwl` to `run_sweep.sh` if you want the generated golden data to use the hardware-like `ac_math` piecewise-linear approximations and truncation behavior instead of the default exact NumPy math. This writes separate `tests/<name>_pwl/` vectors and is the better choice when you want generated tests that closely match the current RTL/FPGA implementation. However, since the operations are done in a scalar way, test generation is much slower than with the NumPy path.
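To see why PWL-generated goldens differ from exact math, here is a toy piecewise-linear `exp` in Python (illustrative only; `ac_math`'s actual segment tables, input ranges, and fixed-point truncation behavior differ):

```python
import math

def pwl_exp(x, segments=8, lo=-4.0, hi=0.0):
    """Toy piecewise-linear exp on [lo, hi]: sample exp at the segment
    endpoints and interpolate linearly in between, clamping outside."""
    x = max(lo, min(hi, x))
    step = (hi - lo) / segments
    i = min(int((x - lo) / step), segments - 1)   # segment index
    x0 = lo + i * step
    y0, y1 = math.exp(x0), math.exp(x0 + step)
    return y0 + (y1 - y0) * (x - x0) / step

# The PWL error is small but nonzero, which is why PWL-generated goldens
# can match the hardware bitwise while exact NumPy goldens need an epsilon.
err = max(abs(pwl_exp(t / 100) - math.exp(t / 100)) for t in range(-400, 1))
```

The hardware evaluates such approximations one scalar at a time, which is also why PWL test generation in Python is much slower than the vectorized NumPy path.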
For a fair CPU baseline, use the stripped-down floating-point NumPy benchmark path in generate_test.py. This skips artifact generation overhead and reports the average over timed runs after warmup:
```
python3 generate_test.py \
    --name bench_np_float_L128_N128 \
    --config-L 128 \
    --config-Ngen 128 \
    --benchmark-np \
    --benchmark-runs 15 \
    --benchmark-warmup 5
```

For the matching FPGA run, use the generated test directory and inspect the timing metrics printed in the host log:
```
python3 design_top/scripts/run_fpga_test.py \
    --test-dir tests/grid_L128_N128_s42 \
    --log-file reports/aws/grid_L128_N128.log \
    --clock-mhz 200
```

To generate the GB start-to-done timing heatmap, the decode-scaling plot, and the prefill-scaling plot from a full-grid sweep, point the script at the latest sweep run directory:
```
python3 scripts/plot_full_grid_timing.py \
    --run-dir results/runs/20260315_043655_full-grid_exact_s42_r1 \
    --output-prefix full_grid_timing
```

To plot decode error growth (mean and max absolute error vs. decode token) from the PEModule test log:
```
python3 scripts/plot_pemodule_decode_error_growth.py \
    --log reports/hls/pemodule_stream_test_updated.log
```