
ttalati/cs217-mamba-accelerator


Mamba SSM FPGA Accelerator

End-to-end Mamba selective state-space model accelerator targeting AWS F2 FPGA (Xilinx VU47P) via Catapult HLS.

CS217 - AI Accelerator Design Project

Parameters

Parameter                   Value
d_model                     64
d_inner                     128
d_state                     16
d_conv                      4
dt_rank                     4
Scan engines (P)            4
Chunk size (C)              16
TILE_D (GEMV parallelism)   16
Weight type                 Q4.12  (ac_fixed<16,4,true>)
Activation type             Q8.8   (ac_fixed<16,8,true>)
Accumulator type            Q16.16 (ac_fixed<32,16,true>)
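The three formats read as Qm.n: m integer bits (including sign) and n fraction bits. A minimal Python sketch of the quantization, assuming standard ac_fixed semantics — AC_TRN truncation (floor) is ac_fixed's default quantization mode, while the saturation below is purely illustrative (ac_fixed wraps on overflow unless AC_SAT is requested):

```python
import math

def to_fixed(x, total_bits=16, int_bits=4):
    """Quantize x roughly as ac_fixed<total_bits, int_bits, true> would
    under AC_TRN (truncation toward -inf). Saturation is illustrative:
    ac_fixed's default overflow mode is wrap (AC_WRAP), not saturate."""
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits
    raw = math.floor(x * scale)            # truncate toward -inf
    lo = -(1 << (total_bits - 1))          # most negative raw code
    hi = (1 << (total_bits - 1)) - 1       # most positive raw code
    return max(lo, min(hi, raw)) / scale

w = to_fixed(0.1234, 16, 4)     # Q4.12 weight:     range [-8, 8),   step 2^-12
a = to_fixed(3.14159, 16, 8)    # Q8.8 activation:  range [-128, 128), step 2^-8
s = to_fixed(w * a, 32, 16)     # Q16.16 accumulator
```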

Repository Structure

.
├── src/                              # SystemC source (software model + HLS target)
│   ├── include/                      # Shared specs and type definitions
│   │   ├── MambaSpec.h               # Mamba dimension constants
│   │   ├── MambaCoreSpec.h           # Core address map (kAddr* constants)
│   │   └── AxiSpec.h                 # AXI types and address widths
│   └── Top/
│       ├── Top.h                     # Top-level module (PE + GB + DataBus)
│       ├── TolerantManagerFromFile.h # AXI test driver
│       ├── testbench.cpp             # Top-level testbench
│       ├── PEPartition/
│       │   ├── PEPartition.h         # PE partition wrapper
│       │   └── PEModule/
│       │       ├── PEModule.h        # *** Main HLS synthesis target ***
│       │       │                     #     Mamba FSM, GEMV, fused decode,
│       │       │                     #     AXI receive, weight banking
│       │       ├── ActUnit/          # (Legacy starter code)
│       │       ├── PECore/           # (Legacy starter code)
│       │       └── MambaCore/        # Software golden model + shared units
│       │           ├── MambaCore.h       # Software golden model (SystemC)
│       │           ├── MambaDatapath.h   # Shared functions (adder tree, RMSNorm)
│       │           ├── GEMVUnit.h        # GEMV engine with bank macros
│       │           ├── ChunkedScanUnit.h # Parallel prefix scan (P=4, C=16)
│       │           ├── test_datapath.cpp     # Unit test: GEMV, RMSNorm
│       │           ├── test_chunked_scan.cpp # Unit test: parallel scan
│       │           └── test_mambacore.cpp    # Integration test: full block
│       ├── GBPartition/
│       │   └── GBModule/
│       │       ├── GBControl/GBControl.h # Host-PE orchestration FSM
│       │       ├── GBCore/GBCore.h       # Unified SRAM scratchpad
│       │       └── NMP/NMP.h             # AXI routing
│       └── DataBus/DataBus.h             # PE-GB communication buses
├── hls/                              # HLS build directory (Catapult projects)
│   └── Top/                          # Mirrors src/ hierarchy
├── design_top/                       # AWS F2 shell integration
│   ├── design/concat_Top.v          # Generated RTL Verilog
│   └── verif/tests/                 # HW-sim .mem files and SV testbench inputs
├── scripts/
│   ├── hls/nvhls_exec.tcl           # Catapult synthesis config
│   └── plot_*.py                    # Timing / error plotting utilities
├── tests/                            # Generated per-test artifacts (active flow)
│   └── <name>/                      # .mem files, vectors/, golden outputs, config.json
├── golden_model/test_vectors/        # Legacy single-test vectors still used by older flows
├── reports/
│   ├── hls/                          # HLS cycle reports and sim logs
│   └── aws/                          # Vivado logs, FPGA logs, shared plots/
├── results/                          # Sweep outputs and run-specific artifacts
│   └── runs/                        # Per-sweep manifests, logs, timing plots
├── generate_test.py                  # Unified test generator (golden + .mem + vectors)
├── generate_axi_csv.py              # Legacy AXI CSV generator for older flows
├── Makefile.tests                    # MambaCore unit test targets
├── build_all_hls.sh                  # Full bottom-up HLS synthesis
└── setup.csh                         # Environment setup

How to Build and Run

Environment Setup (Farmshare)

/farmshare/home/classes/ee/admin/software/bin/rhel8.sh
source setup.csh

Run Software Unit Tests

make -f Makefile.tests run_datapath       # GEMV, discretization
make -f Makefile.tests run_chunked_scan   # Parallel prefix scan
make -f Makefile.tests run_mambacore      # Full MambaCore E2E
make -f Makefile.tests run_pemodule       # Full PEModule

Requires source setup.csh first (sets $REPO_TOP for test vector path resolution).

What Runs Where

There are three different execution paths in this repo:

  • Local C++ / SystemC tests: src/**/testbench.cpp and src/**/test_*.cpp are compiled into CPU executables such as sim_test, test_mambacore, and test_pemodule. Example: make -f Makefile.tests run_pemodule builds src/Top/PEPartition/PEModule/sim_test and runs it locally.
  • AWS RTL hw-sim: cd design_top && make hw_sim runs the SystemVerilog testbench design_top/verif/tests/design_top_base_test.sv against the FPGA wrapper design_top/design/design_top.sv plus the generated RTL in the AWS shell simulation environment.
  • Real FPGA run: cd design_top && make run_fpga_test ... launches design_top/scripts/run_fpga_test.py, which builds and runs the host application in design_top/software/src/design_top.c against an FPGA image that already contains design_top/design/design_top.sv plus the generated RTL. The timing metrics used in FPGA runs come from registers implemented in design_top/design/design_top.sv and read back by the C host.

The active generated-artifact flow is generate_test.py -> tests/<name>/, which produces:

  • tests/<name>/mamba_axi_addrs.mem
  • tests/<name>/mamba_axi_data.mem
  • tests/<name>/golden_output.mem
  • tests/<name>/config.json

golden_model/test_vectors/ still exists for older local/SystemC paths, but it is no longer the main path used by run_fpga_test.py.
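The .mem artifacts presumably follow the Verilog $readmemh convention. A hypothetical decoder sketch — both the one-hex-word-per-line layout and the Q8.8 interpretation of golden_output.mem (taken from the activation type in the Parameters table) are assumptions, not documented formats:

```python
def decode_mem_q88(lines):
    """Decode $readmemh-style hex words (one per line, '//' comments
    allowed) as signed 16-bit Q8.8 values. The layout and encoding are
    guesses about this repo's .mem files, not confirmed by it."""
    values = []
    for line in lines:
        word = line.split("//")[0].strip()   # drop comments and blanks
        if not word:
            continue
        raw = int(word, 16)
        if raw >= 1 << 15:                   # sign-extend the 16-bit word
            raw -= 1 << 16
        values.append(raw / 256.0)           # Q8.8 -> float
    return values
```

Usage would look like `decode_mem_q88(open("tests/systemc_test/golden_output.mem"))`, if the assumptions above hold.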

Test Details

  • test_datapath — GEMV banked MAC (ε=0.05); discretization bitwise match vs Python PWL golden
  • test_chunked_scan — Parallel prefix scan vs sequential recurrence (ε=0.05). Self-contained, no external files.
  • test_mambacore — Full MambaCore FSM E2E, bitwise match vs Python PWL golden
  • test_pemodule — Full PEModule with streaming I/O, bitwise match vs Python PWL golden
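The scan under test computes the linear recurrence h[t] = a[t]*h[t-1] + b[t], whose composition operator is associative and therefore chunkable. A float-only Python sketch of the two-phase idea (independent per-chunk scans, then a carry pass) — a generic formulation for illustration, not the exact scheme in ChunkedScanUnit.h:

```python
def combine(e1, e2):
    # Composing h -> a1*h + b1 then h -> a2*h + b2 gives (a1*a2, a2*b1 + b2).
    a1, b1 = e1
    a2, b2 = e2
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b, h0=0.0):
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def chunked_scan(a, b, h0=0.0, C=16):
    out, carry = [], h0
    for s in range(0, len(a), C):
        # Phase 1: local inclusive scan inside the chunk
        # (the chunks are independent, so hardware can run them in parallel).
        acc, local = (1.0, 0.0), []
        for t in range(s, min(s + C, len(a))):
            acc = combine(acc, (a[t], b[t]))
            local.append(acc)
        # Phase 2: fold the incoming carry (the h value just before this
        # chunk) into every element, then take the new carry from the last.
        for ap, bp in local:
            out.append(ap * carry + bp)
        carry = out[-1]
    return out
```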

Generate the test vectors (requires NumPy):

bash run_sweep.sh --generate-only --pwl --regen
# Also generate intermediate .bin files needed by test_datapath:
python3 generate_test.py --name systemc_test --config-L 0 --config-Ngen 1 --use-ac-math-pwl --emit-compat-intermediates

Run HLS Synthesis

# Full bottom-up synthesis (all modules; 12+ hours)
./build_all_hls.sh

# PEModule only (~8 hours)
cd hls/Top/PEPartition/PEModule && rm -rf Catapult && make hls

Copy RTL for AWS Build

cd design_top && make copy_rtl

HLS Module Hierarchy

Top
├── PEPartition
│   └── PEModule          ← Core Mamba datapath
│       ├── 16 weight BRAM banks
│       ├── Fused decode engine
│       ├── Chunked parallel scan (4 engines)
│       └── GEMV engine (16 parallel MAC lanes)
├── GBPartition
│   └── GBModule
│       ├── GBControl     ← Host-PE orchestration FSM
│       ├── GBCore        ← Unified SRAM scratchpad
│       └── NMP           ← AXI routing
└── DataBus               ← PE-GB communication channels
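The GEMV engine's 16 MAC lanes correspond to computing TILE_D = 16 output rows per pass, each lane fed from its own weight bank. A pure-Python sketch of that tiling (the bank-per-lane mapping is an assumption about the design, and the actual HLS code is of course not structured like this):

```python
def gemv_tiled(W, x, TILE_D=16):
    """y = W @ x evaluated TILE_D rows per outer pass; each `lane`
    iteration stands in for one of 16 MAC lanes running in parallel."""
    rows, cols = len(W), len(x)
    y = [0.0] * rows
    for r0 in range(0, rows, TILE_D):            # one tile of rows per pass
        for lane in range(min(TILE_D, rows - r0)):
            acc = 0.0
            for c in range(cols):                # sequential MACs per lane
                acc += W[r0 + lane][c] * x[c]
            y[r0 + lane] = acc
    return y
```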

AWS Clock Constraint Override

To give the design more timing margin, the Vivado timing constraint is overridden in the external HDK tree:

  • Files: ~/aws-fpga/hdk/common/shell_stable/build/scripts/aws_gen_clk_constraints.tcl and /home/ubuntu/aws-fpga/hdk/common/shell_stable/build/scripts/aws_clock_properties.tcl
  • Change: All recipes set clk_main_a0_period = 5.208 ns (192 MHz target)

Build and Program FPGA

Build the FPGA bitstream, generate the AFI, and program the FPGA:

# SSH into your AWS F2 instance first, then source the AWS FPGA tooling
cd ~/aws-fpga
source hdk_setup.sh
source sdk_setup.sh

# Move into this repo's FPGA design directory and source the local setup
cd design_top
source setup.sh

make fpga_build          # should take about 2 hours
make generate_afi

# Wait for AFI to become available
make check_afi_available

# Once available
make program_fpga
make run_fpga_test                         # defaults to tests/systemc_ngen4
make run_fpga_test FPGA_TEST=prefill16_ngen4
make run_fpga_test TEST_DIR=../tests/prefill128_ngen16

# From the repo root, run an automated FPGA sweep over (L, Ngen) configs
cd ..
bash run_sweep.sh --mode=full-grid

Pass --pwl to run_sweep.sh if you want the generated golden data to use the hardware-like ac_math piecewise-linear approximations and truncation behavior instead of the default exact NumPy math. This writes separate tests/<name>_pwl/ vectors and is the better choice when you want generated tests that closely match the current RTL/FPGA implementation. Note that because the PWL operations are evaluated scalar-by-scalar, test generation is much slower than the NumPy path.
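For intuition, here is a generic piecewise-linear exp() over a clamped range. This only illustrates the *kind* of approximation ac_math's *_pwl functions use — the segment count, breakpoints, and input range below are arbitrary choices for the sketch, not ac_math's actual tables:

```python
import math

def exp_pwl(x, x_min=-4.0, x_max=0.0, segments=8):
    """Piecewise-linear exp() via chord interpolation between uniformly
    spaced breakpoints; inputs are clamped to [x_min, x_max]."""
    x = max(x_min, min(x_max, x))
    step = (x_max - x_min) / segments
    i = min(segments - 1, int((x - x_min) / step))   # segment index
    x0 = x_min + i * step
    y0, y1 = math.exp(x0), math.exp(x0 + step)       # table endpoint values
    return y0 + (y1 - y0) * (x - x0) / step          # linear interpolation
```

At the breakpoints the approximation is exact; in between it carries the chord error that makes PWL golden vectors differ from exact NumPy math.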

CPU Baseline And FPGA Comparison

For a fair CPU baseline, use the stripped-down floating-point NumPy benchmark path in generate_test.py. This skips artifact generation overhead and reports the average over timed runs after warmup:

python3 generate_test.py \
  --name bench_np_float_L128_N128 \
  --config-L 128 \
  --config-Ngen 128 \
  --benchmark-np \
  --benchmark-runs 15 \
  --benchmark-warmup 5
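The warmup-then-average protocol amounts to the following sketch (mapping --benchmark-runs / --benchmark-warmup onto these loop counts is an assumption based on the flag names):

```python
import time

def benchmark(fn, runs=15, warmup=5):
    """Mean wall-clock seconds per call of fn over `runs` timed calls,
    taken after `warmup` untimed calls to stabilize cache/allocator state."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs
```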

For the matching FPGA run, use the generated test directory and inspect the timing metrics printed by the host log:

python3 design_top/scripts/run_fpga_test.py \
  --test-dir tests/grid_L128_N128_s42 \
  --log-file reports/aws/grid_L128_N128.log \
  --clock-mhz 200

Results

To generate the GB start-to-done timing heatmap, decode-scaling plot, and prefill-scaling plot from a full-grid sweep, point --run-dir at the most recent sweep run directory:

python3 scripts/plot_full_grid_timing.py \
  --run-dir results/runs/20260315_043655_full-grid_exact_s42_r1 \
  --output-prefix full_grid_timing

To plot decode error growth (mean and max absolute error vs. decode token) from the PEModule test log:

python3 scripts/plot_pemodule_decode_error_growth.py \
  --log reports/hls/pemodule_stream_test_updated.log
