End-to-end Mamba selective state-space model accelerator targeting AWS F2 FPGA (Xilinx VU47P) via Catapult HLS.
CS217 - AI Accelerator Design Project
| Parameter | Value |
|---|---|
| d_model | 64 |
| d_inner | 128 |
| d_state | 16 |
| d_conv | 4 |
| dt_rank | 4 |
| Scan engines (P) | 4 |
| Chunk size (C) | 16 |
| TILE_D (GEMV parallelism) | 16 |
| Weight type | Q4.12 (ac_fixed<16,4,true>) |
| Activation type | Q8.8 (ac_fixed<16,8,true>) |
| Accumulator type | Q16.16 (ac_fixed<32,16,true>) |
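The Q-notation in the table reads as ⟨integer bits⟩.⟨fraction bits⟩ signed fixed point. A rough Python sketch of the quantization is below (illustrative only: it uses round-to-nearest with saturation, whereas `ac_fixed` defaults to truncation and wrap unless `AC_RND`/`AC_SAT` are requested):

```python
def quantize(x, total_bits, int_bits):
    """Round x onto a signed fixed-point grid that is total_bits wide with
    int_bits integer bits (roughly ac_fixed<total_bits, int_bits, true>,
    but with round-to-nearest and saturation instead of the ac defaults)."""
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))        # most negative raw code
    hi = (1 << (total_bits - 1)) - 1     # most positive raw code
    raw = max(lo, min(hi, round(x * scale)))
    return raw / scale

# Q4.12 weight: step 2^-12, representable range approx [-8, 8)
w = quantize(0.1234, 16, 4)
# Q8.8 activation: step 2^-8, range approx [-128, 128)
a = quantize(3.14159, 16, 8)
# Q16.16 accumulator: step 2^-16, range approx [-32768, 32768)
acc = quantize(w * a, 32, 16)
```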
Repository layout:

```
.
├── src/                                      # SystemC source (software model + HLS target)
│   ├── include/                              # Shared specs and type definitions
│   │   ├── MambaSpec.h                       # Mamba dimension constants
│   │   ├── MambaCoreSpec.h                   # Core address map (kAddr* constants)
│   │   └── AxiSpec.h                         # AXI types and address widths
│   └── Top/
│       ├── Top.h                             # Top-level module (PE + GB + DataBus)
│       ├── TolerantManagerFromFile.h         # AXI test driver
│       ├── testbench.cpp                     # Top-level testbench
│       ├── PEPartition/
│       │   ├── PEPartition.h                 # PE partition wrapper
│       │   └── PEModule/
│       │       ├── PEModule.h                # *** Main HLS synthesis target ***
│       │       │                             #   Mamba FSM, GEMV, fused decode,
│       │       │                             #   AXI receive, weight banking
│       │       ├── ActUnit/                  # (Legacy starter code)
│       │       ├── PECore/                   # (Legacy starter code)
│       │       └── MambaCore/                # Software golden model + shared units
│       │           ├── MambaCore.h           # Software golden model (SystemC)
│       │           ├── MambaDatapath.h       # Shared functions (adder tree, RMSNorm)
│       │           ├── GEMVUnit.h            # GEMV engine with bank macros
│       │           ├── ChunkedScanUnit.h     # Parallel prefix scan (P=4, C=16)
│       │           ├── test_datapath.cpp     # Unit test: GEMV, RMSNorm
│       │           ├── test_chunked_scan.cpp # Unit test: parallel scan
│       │           └── test_mambacore.cpp    # Integration test: full block
│       ├── GBPartition/
│       │   └── GBModule/
│       │       ├── GBControl/GBControl.h     # Host-PE orchestration FSM
│       │       ├── GBCore/GBCore.h           # Unified SRAM scratchpad
│       │       └── NMP/NMP.h                 # AXI routing
│       └── DataBus/DataBus.h                 # PE-GB communication buses
├── hls/                                      # HLS build directory (Catapult projects)
│   └── Top/                                  # Mirrors src/ hierarchy
├── design_top/                               # AWS F2 shell integration
│   ├── design/concat_Top.v                   # Generated RTL Verilog
│   └── verif/tests/                          # HW-sim .mem files and SV testbench inputs
├── scripts/
│   ├── hls/nvhls_exec.tcl                    # Catapult synthesis config
│   └── plot_*.py                             # Timing / error plotting utilities
├── tests/                                    # Generated per-test artifacts (active flow)
│   └── <name>/                               # .mem files, vectors/, golden outputs, config.json
├── golden_model/test_vectors/                # Legacy single-test vectors still used by older flows
├── reports/
│   ├── hls/                                  # HLS cycle reports and sim logs
│   └── aws/                                  # Vivado logs, FPGA logs, shared plots/
├── results/                                  # Sweep outputs and run-specific artifacts
│   └── runs/                                 # Per-sweep manifests, logs, timing plots
├── generate_test.py                          # Unified test generator (golden + .mem + vectors)
├── generate_axi_csv.py                       # Legacy AXI CSV generator for older flows
├── Makefile.tests                            # MambaCore unit test targets
├── build_all_hls.sh                          # Full bottom-up HLS synthesis
└── setup.csh                                 # Environment setup
```
```
/farmshare/home/classes/ee/admin/software/bin/rhel8.sh
source setup.csh
make -f Makefile.tests run_datapath     # GEMV, discretization
make -f Makefile.tests run_chunked_scan # Parallel prefix scan
make -f Makefile.tests run_mambacore    # Full MambaCore E2E
make -f Makefile.tests run_pemodule     # Full PEModule
```

Requires `source setup.csh` first (sets `$REPO_TOP` for test vector path resolution).
There are three different execution paths in this repo:
- Local C++ / SystemC tests: `src/**/testbench.cpp` and `src/**/test_*.cpp` are compiled into CPU executables such as `sim_test`, `test_mambacore`, and `test_pemodule`. Example: `make -f Makefile.tests run_pemodule` builds `src/Top/PEPartition/PEModule/sim_test` and runs it locally.
- AWS RTL hw-sim: `cd design_top && make hw_sim` runs the SystemVerilog testbench `design_top/verif/tests/design_top_base_test.sv` against the FPGA wrapper `design_top/design/design_top.sv` plus the generated RTL in the AWS shell simulation environment.
- Real FPGA run: `cd design_top && make run_fpga_test ...` launches `design_top/scripts/run_fpga_test.py`, which builds and runs the host application in `design_top/software/src/design_top.c` against an FPGA image that already contains `design_top/design/design_top.sv` plus the generated RTL. The timing metrics reported by FPGA runs come from registers implemented in `design_top/design/design_top.sv` and read back by the C host.
The active generated-artifact flow is `generate_test.py -> tests/<name>/`, which produces:

- `tests/<name>/mamba_axi_addrs.mem`
- `tests/<name>/mamba_axi_data.mem`
- `tests/<name>/golden_output.mem`
- `tests/<name>/config.json`
`golden_model/test_vectors/` still exists for older local/SystemC paths, but it is no longer the main path used by `run_fpga_test.py`.
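For quick inspection, the generated `.mem` files can be loaded from Python, assuming they follow the usual `$readmemh` convention of one hexadecimal word per line (a sketch; the exact file format and comment syntax produced by the generator are assumptions here):

```python
def load_mem(path):
    """Parse a $readmemh-style .mem file: one hex word per line,
    with '//' comments and blank lines ignored. Returns a list of ints."""
    words = []
    with open(path) as f:
        for line in f:
            line = line.split("//")[0].strip()  # strip trailing comments
            if line:
                words.append(int(line, 16))
    return words
```

For example, `zip(load_mem("tests/systemc_test/mamba_axi_addrs.mem"), load_mem("tests/systemc_test/mamba_axi_data.mem"))` would pair each AXI address with its data word, if the two files are written in lockstep.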
Unit and integration tests:

- `test_datapath` — GEMV banked MAC (ε = 0.05); discretization bitwise match vs. Python PWL golden
- `test_chunked_scan` — parallel prefix scan vs. sequential recurrence (ε = 0.05); self-contained, no external files
- `test_mambacore` — full MambaCore FSM E2E, bitwise match vs. Python PWL golden
- `test_pemodule` — full PEModule with streaming I/O, bitwise match vs. Python PWL golden
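The scan being tested exploits the fact that the first-order recurrence h[t] = a[t]·h[t−1] + b[t] is associative under the combine rule (A₂, B₂)∘(A₁, B₁) = (A₂A₁, A₂B₁ + B₂), so chunks can be reduced independently and stitched together with a short carry chain. A scalar floating-point Python sketch of the scheme (simplified; the hardware runs P=4 fixed-point engines over C=16 chunks):

```python
def scan_sequential(a, b):
    """Reference recurrence: h[t] = a[t]*h[t-1] + b[t], with h[-1] = 0."""
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def scan_chunked(a, b, C=16):
    """Two-phase chunked scan equivalent to scan_sequential."""
    chunks = [(a[s:s + C], b[s:s + C]) for s in range(0, len(a), C)]
    # Phase 1 (parallelizable across chunks): reduce chunk c to (A_c, B_c)
    # such that h_end = A_c * h_start + B_c.
    red = []
    for ca, cb in chunks:
        A, B = 1.0, 0.0
        for at, bt in zip(ca, cb):
            A, B = at * A, at * B + bt  # compose step (at, bt) after (A, B)
        red.append((A, B))
    # Phase 2a: short serial carry chain over len(a)/C chunk summaries.
    carries, h = [0.0], 0.0
    for A, B in red[:-1]:
        h = A * h + B
        carries.append(h)
    # Phase 2b (parallelizable across chunks): expand each chunk from its carry.
    out = []
    for (ca, cb), h in zip(chunks, carries):
        for at, bt in zip(ca, cb):
            h = at * h + bt
            out.append(h)
    return out
```

Only phase 2a is serial, and it touches one value per chunk rather than one per timestep, which is what the parallel engines buy.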
Generate the test vectors (requires NumPy):

```
bash run_sweep.sh --generate-only --pwl --regen

# Also generate intermediate .bin files needed by test_datapath:
python3 generate_test.py --name systemc_test --config-L 0 --config-Ngen 1 \
    --use-ac-math-pwl --emit-compat-intermediates
```

Run HLS synthesis, then copy the generated RTL into the FPGA design tree:

```
# Full bottom-up synthesis (all modules, ~12+ hours)
./build_all_hls.sh

# PEModule only (~8 hours)
cd hls/Top/PEPartition/PEModule && rm -rf Catapult && make hls

cd design_top && make copy_rtl
```

Module hierarchy:

```
Top
├── PEPartition
│   └── PEModule            ← Core Mamba datapath
│       ├── 16 weight BRAM banks
│       ├── Fused decode engine
│       ├── Chunked parallel scan (4 engines)
│       └── GEMV engine (16 parallel MAC lanes)
├── GBPartition
│   └── GBModule
│       ├── GBControl       ← Host-PE orchestration FSM
│       ├── GBCore          ← Unified SRAM scratchpad
│       └── NMP             ← AXI routing
└── DataBus                 ← PE-GB communication channels
```
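The GEMV tiling can be sketched in Python: with TILE_D = 16, each step feeds one element of x to 16 independent MAC lanes, one per weight bank, updating 16 output rows at once (a behavioral sketch of the tiling, not the HLS code):

```python
TILE_D = 16  # parallel MAC lanes; one weight bank feeds each lane

def gemv_tiled(W, x):
    """Compute y = W @ x, TILE_D output rows per tile.
    Within a tile, the lane MACs are independent of each other,
    so in hardware they run in parallel each cycle."""
    rows, cols = len(W), len(W[0])
    y = [0.0] * rows
    for base in range(0, rows, TILE_D):      # one tile of output rows
        for j in range(cols):                # stream x one element per step
            for lane in range(min(TILE_D, rows - base)):
                y[base + lane] += W[base + lane][j] * x[j]  # lane MAC
    return y
```

Banking the weights by lane is what makes this legal in hardware: the 16 concurrent reads each hit a different BRAM, so no port conflicts arise.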
To give the design more timing margin, the Vivado timing constraint is overridden in the external HDK tree:
- Files: `~/aws-fpga/hdk/common/shell_stable/build/scripts/aws_gen_clk_constraints.tcl` and `/home/ubuntu/aws-fpga/hdk/common/shell_stable/build/scripts/aws_clock_properties.tcl`
- Change: all recipes set `clk_main_a0_period = 5.208` (192 MHz target)
Build the FPGA bitstream, generate the AFI, and program the FPGA:
```
# SSH into your AWS F2 instance first, then source the AWS FPGA tooling
cd ~/aws-fpga
source hdk_setup.sh
source sdk_setup.sh

# Move into this repo's FPGA design directory and source the local setup
cd design_top
source setup.sh

make fpga_build        # takes about 2 hours
make generate_afi

# Wait for the AFI to become available
make check_afi_available

# Once available
make program_fpga
make run_fpga_test                           # defaults to tests/systemc_ngen4
make run_fpga_test FPGA_TEST=prefill16_ngen4
make run_fpga_test TEST_DIR=../tests/prefill128_ngen16
```
```
# From the repo root, run an automated FPGA sweep over (L, Ngen) configs
cd ..
bash run_sweep.sh --mode=full-grid
```

Pass `--pwl` to `run_sweep.sh` if you want the generated golden data to use the hardware-like `ac_math` piecewise-linear approximations and truncation behavior instead of the default exact NumPy math. This writes separate `tests/<name>_pwl/` vectors and is the better choice when you want generated tests that closely match the current RTL/FPGA implementation. However, since the operations are done in a scalar way, test generation is much slower than with the NumPy path.
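To see why PWL-generated goldens differ from exact math, here is a toy piecewise-linear `exp` in Python (illustrative only; `ac_math`'s actual segment tables, input ranges, and fixed-point truncation behavior differ):

```python
import math

def pwl_exp(x, segments=8, lo=-4.0, hi=0.0):
    """Toy piecewise-linear exp on [lo, hi]: sample exp at the segment
    endpoints and interpolate linearly in between, clamping outside."""
    x = max(lo, min(hi, x))
    step = (hi - lo) / segments
    i = min(int((x - lo) / step), segments - 1)   # segment index
    x0 = lo + i * step
    y0, y1 = math.exp(x0), math.exp(x0 + step)
    return y0 + (y1 - y0) * (x - x0) / step

# The PWL error is small but nonzero, which is why PWL-generated goldens
# can match the hardware bitwise while exact NumPy goldens need an epsilon.
err = max(abs(pwl_exp(t / 100) - math.exp(t / 100)) for t in range(-400, 1))
```

The hardware evaluates such approximations one scalar at a time, which is also why PWL test generation in Python is much slower than the vectorized NumPy path.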
For a fair CPU baseline, use the stripped-down floating-point NumPy benchmark path in generate_test.py. This skips artifact generation overhead and reports the average over timed runs after warmup:
```
python3 generate_test.py \
    --name bench_np_float_L128_N128 \
    --config-L 128 \
    --config-Ngen 128 \
    --benchmark-np \
    --benchmark-runs 15 \
    --benchmark-warmup 5
```

For the matching FPGA run, use the generated test directory and inspect the timing metrics printed in the host log:
```
python3 design_top/scripts/run_fpga_test.py \
    --test-dir tests/grid_L128_N128_s42 \
    --log-file reports/aws/grid_L128_N128.log \
    --clock-mhz 200
```

To generate the GB start-to-done timing heatmap, the decode-scaling plot, and the prefill-scaling plot from a full-grid sweep, point the script at the latest sweep run directory:
```
python3 scripts/plot_full_grid_timing.py \
    --run-dir results/runs/20260315_043655_full-grid_exact_s42_r1 \
    --output-prefix full_grid_timing
```

To plot decode error growth (mean and max absolute error vs. decode token) from the PEModule test log:
```
python3 scripts/plot_pemodule_decode_error_growth.py \
    --log reports/hls/pemodule_stream_test_updated.log
```