@gabe-l-hart

DRAFT STATUS

This PR will remain in Draft until the items in the discussion section are resolved.

Description

This PR is a draft implementation of the Structured State-Space Duality (SSD) described in the original Mamba-2 paper, which reframes the SSM_SCAN op as a pseudo-attention operation. The paper describes it in great detail, but the short version is that for a multi-token update, the recurrent formulation of SSM_SCAN is inefficient because it cannot parallelize over the sequence dimension the way an attention calculation can. With the SSD formulation, the logical attention matrix is decomposed into chunks and the state is updated at the chunk boundaries, allowing prefill to "jump" by the size of the chunk rather than proceed token-by-token.
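For readers without the paper handy, the core identity is that unrolling the recurrence turns the scan into a single masked matrix multiply. This is my paraphrase of the paper in the scalar-decay case, not notation from this PR:

```latex
% recurrence:  h_t = a_t h_{t-1} + B_t x_t,   y_t = C_t^\top h_t
% unrolled:    Y = M X, with the lower-triangular pseudo-attention matrix
M_{ts} =
\begin{cases}
  C_t^\top \left( \prod_{r=s+1}^{t} a_r \right) B_s, & t \ge s \\[4pt]
  0, & t < s
\end{cases}
```

Chunking partitions M into diagonal blocks (dense causal pseudo-attention within a chunk) and off-diagonal blocks, which never need to be materialized because their effect is absorbed into the state carried across chunk boundaries.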

Reference Links

Changes

  • Introduce new primitive operations in ggml (see the sketch after this list):

    • ggml_cumsum / ggml_cumsum_0: Perform a cumulative sum along a given dimension
    • ggml_tri_dims / ggml_tri / ggml_tri_keep: Apply a triangular mask to the given matrix
    • ggml_softplus: Perform the unary softplus operation
  • Implement an alternate path through llm_graph_context_mamba::build_mamba2_layer when a multi-token update is detected

    • This path is the core of the SSD implementation and avoids calling SSM_SCAN in favor of the chunked pseudo-attention formulation
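To make the new primitives concrete, here is a rough sketch of how they might compose into the "segment sum" at the heart of SSD. All three ops are introduced by this PR, so the exact signatures below are my assumptions modeled on ggml conventions (ggml_cumsum_0 summing along dim 0, ggml_tri keeping the lower triangle), and segsum_sketch is a hypothetical helper, not code from this branch:

```c
#include "ggml.h"

// Build the causal log-decay matrix for one head over a chunk of length L.
// dt: [L] raw per-token decay inputs (1-D, F32)
static struct ggml_tensor * segsum_sketch(
        struct ggml_context * ctx, struct ggml_tensor * dt) {
    const int64_t L = dt->ne[0];

    // softplus keeps the discretized decay rates positive
    struct ggml_tensor * a  = ggml_softplus(ctx, dt);
    // prefix sums along dim 0 (the token dimension): cs[t] = sum_{r<=t} a[r]
    struct ggml_tensor * cs = ggml_cumsum_0(ctx, a);

    // pairwise differences cs[i] - cs[j] give the log-decay from token j to i
    struct ggml_tensor * shape = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, L, L);
    struct ggml_tensor * rows  = ggml_repeat(ctx, cs, shape);               // rows[i][j] = cs[j]
    struct ggml_tensor * cols  = ggml_cont(ctx, ggml_transpose(ctx, rows)); // cols[i][j] = cs[i]
    struct ggml_tensor * diff  = ggml_sub(ctx, cols, rows);

    // keep only the causal half: token i may only see tokens j <= i
    return ggml_tri(ctx, diff);
}
```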

Discussion

There are a number of outstanding discussion points on this work that need to be resolved before moving it forward:

  1. Performance: Currently, this implementation appears to be significantly slower than simply using SSM_SCAN, which roundly defeats the purpose of the change! I suspect the slowdown comes from the number of ggml_permute / ggml_cont ops added to the graph, but I could use assistance eliminating them or identifying other sources of slowness.
  2. To chunk or not to chunk: This PR implements sub-ubatch chunking; I had it mostly working before the corresponding discussion on Qwen3Next. The inter-chunk update (sketched after this list) would be needed either way, so I didn't strip it out, but it would be fairly trivial to do so and might offer some performance improvement.
  3. Handling of repeat_interleave: Similar to the issue that came up when initially implementing NemotronH support, I believe that ggml_repeat behaves differently from mx.repeat, resulting in incorrect results for models with n_groups > 1 (tested with NemotronH).
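For context on point 2, the inter-chunk update is the piece that survives even without sub-ubatch chunking: the same recurrence carries the state from one ubatch to the next. In my paraphrase of the paper (scalar-decay case, chunk k covering tokens t_k through t_{k+1}-1):

```latex
h^{(k)} = \Big( \prod_{r=t_k}^{t_{k+1}-1} a_r \Big) h^{(k-1)}
        + \sum_{s=t_k}^{t_{k+1}-1} \Big( \prod_{r=s+1}^{t_{k+1}-1} a_r \Big) B_s x_s
```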

Testing

I've tested this locally with various members of the Granite 4 family and with nvidia/NVIDIA-Nemotron-Nano-9B-v2. For the Granite 4 models with n_groups == 1, I get nearly identical results to running purely with SSM_SCAN, but NemotronH still struggles due to the repeat_interleave issue (see above). I'll flesh out more testing results once we've worked through some of the above issues.

cc @compilade since I know this has been on your TODO list since the original mamba2 implementation.

@pwilkin commented Nov 3, 2025

Yeah, I had an issue with repeat_interleave too. Technically, repeat_interleave is equivalent to permute + repeat, but of course it introduces additional operations.
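For illustration, the difference and one possible emulation (a hypothetical helper using reshape + repeat rather than permute; not code from this PR): ggml_repeat tiles ([a, b] → [a, b, a, b]) while repeat_interleave duplicates each element in place ([a, b] → [a, a, b, b]).

```c
#include "ggml.h"

// Emulate repeat_interleave along dim 1: each of the g groups of x is
// duplicated n_rep times in place, producing rows ordered g0,g0,...,g1,g1,...
// instead of ggml_repeat's tiled ordering g0,g1,...,g0,g1,...
static struct ggml_tensor * repeat_interleave_dim1(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,     // [d, g], contiguous, F32
        int64_t               n_rep) {
    const int64_t d = x->ne[0];
    const int64_t g = x->ne[1];
    // insert a singleton dimension: [d, 1, g]
    struct ggml_tensor * x3 = ggml_reshape_3d(ctx, x, d, 1, g);
    // broadcast-copy along the new dimension: [d, n_rep, g]
    struct ggml_tensor * shape = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d, n_rep, g);
    struct ggml_tensor * xr = ggml_repeat(ctx, x3, shape);
    // flatten back to 2-D: [d, n_rep * g], now interleaved
    return ggml_reshape_2d(ctx, xr, d, n_rep * g);
}
```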

@pwilkin commented Nov 3, 2025

Regarding the chunking: won't this explode the graph a lot?

In the case of Delta Net attention, since you have to use a triangular solve there, you don't want the chunk size above 64 or performance drops drastically. But that means you're going to end up with 8 chunks for a typical ubatch size of 512.

The graph for Qwen3 Next already has 9000 nodes. I'm a bit afraid of doing chunking this way (and I know @ggerganov had strong objections too).

@gabe-l-hart (Author)

chunking: won't this explode the graph a lot?

Yep, it sure will. I also suspect this is one of the reasons the implementation is currently slower. I don't think SSD needs chunking for computational-complexity reasons the way Delta Net does, so I think it's mostly there to manage memory overhead.
