UPSTREAM PR #17737: CANN: implement the SSM_CONV operator #416

Open

loci-dev wants to merge 3 commits into main from upstream-PR17737-branch_0Marble-squash-commits

Conversation


@loci-dev loci-dev commented Dec 3, 2025

Mirrored from ggml-org/llama.cpp#17737

Description

We implement the SSM_CONV operator using depthwise 1D convolution, built on the high-level aclnnConvolution builtin.

The goal is to compute the following:

$$ y[i,j,k] = \sum_{l=0}^{dconv-1} w[l,i]\, x[l+j, i, k] $$

where the shape of $y$ is $[dinner, nt, ns]$, $x$ is $[dconv - 1 + nt, dinner, ns]$, and $w$ is $[dconv, dinner]$.

To implement this formula with aclnnConvolution, we reshape the tensors and set the groups parameter to d_inner so that the convolution is computed independently for each channel.
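The reduction to a grouped (depthwise) convolution can be sketched in NumPy. This is an illustration of the math only, not the CANN code; the names dconv, dinner, nt, ns follow the formula above, and np.correlate stands in for a per-channel "valid" 1D convolution with groups equal to the channel count:

```python
import numpy as np

# Shapes from the formula: w is [dconv, dinner],
# x is [dconv - 1 + nt, dinner, ns], y is [dinner, nt, ns].
dconv, dinner, nt, ns = 4, 3, 5, 2
rng = np.random.default_rng(0)
w = rng.standard_normal((dconv, dinner))
x = rng.standard_normal((dconv - 1 + nt, dinner, ns))

# Direct evaluation of y[i, j, k] = sum_l w[l, i] * x[l + j, i, k].
y_ref = np.zeros((dinner, nt, ns))
for i in range(dinner):
    for j in range(nt):
        for k in range(ns):
            y_ref[i, j, k] = sum(w[l, i] * x[l + j, i, k] for l in range(dconv))

# Depthwise view: each channel i is an independent 1D correlation of
# x[:, i, k] with the length-dconv filter w[:, i] ("valid" mode), which
# is what groups = dinner achieves in a convolution API.
y_dw = np.empty_like(y_ref)
for i in range(dinner):
    for k in range(ns):
        y_dw[i, :, k] = np.correlate(x[:, i, k], w[:, i], mode="valid")

assert np.allclose(y_ref, y_dw)
```

Since both paths produce identical results, the operator reduces to a standard grouped convolution call with one group per channel.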

Testing

We ran the test-backend-ops test suite for SSM_CONV on two different cards: 310P3 and 910B3.


For the 310P3 card, the cubeMathType parameter must be set to ALLOW_FP32_DOWN_PRECISION, which appears to cause the computation to be performed at reduced precision rather than in f32. As a result, the tests fail by a small margin (NMSE 0.000000114, above the allowed 1e-7). We overrode the max_nmse_err() method for test_ssm_conv to raise the maximum error to 1e-6, which allows the tests to pass.

On the 910B card, the operator runs natively in f32 and passes the tests at the original 1e-7 tolerance.

Co-authored-by: Aleksei Lobanov <[email protected]>
Co-authored-by: Sujin Kang <[email protected]>

loci-review bot commented Dec 3, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #416 - CANN SSM_CONV Operator Implementation

Overview

PR #416 implements the SSM_CONV operator for the CANN backend, adding support for state-space model convolution operations on Ascend NPUs. The changes introduce 137 new lines across 4 files with no deletions, representing a pure feature addition rather than a modification of existing code paths.

Performance Impact Analysis

Power Consumption: Analysis across all binaries shows 0.0% change in power consumption between versions. The measured values for key binaries remain identical:

  • libllama.so: 194,195 nJ (no change)
  • libggml-cpu.so: 116,810 nJ (no change)
  • llama-run: 218,940 nJ (no change)

Inference Performance: No functions in the core inference path (llama_decode, llama_encode, llama_tokenize) were modified. The new ggml_cann_ssm_conv function is an isolated addition to the CANN backend operator set and does not affect existing CPU or GPU inference paths. Tokens per second for standard transformer models remains unchanged.

Code Changes:

  • New function ggml_cann_ssm_conv implements depthwise 1D convolution using aclnnConvolution
  • Tensor reshaping logic converts between GGML layout (CLN format) and CANN NCL format
  • Platform-specific handling for Ascend 310P3 cards sets cubeMathType=1 for FP32 precision
  • Switch case additions in ggml_cann_compute_forward and ggml_backend_cann_supports_op register the new operator
  • Test tolerance adjustment from 1e-7 to 1e-6 accommodates 310P3 precision behavior
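The layout conversion in the second bullet can be pictured with NumPy axis reordering. This is an analogy for the metadata change only, not the backend code; concrete sizes are made up for illustration:

```python
import numpy as np

# GGML-style layout: x has logical shape [L, C, N] = [dconv-1+nt, dinner, ns].
L, C, N = 8, 3, 2
x_cln = np.arange(L * C * N, dtype=np.float32).reshape(L, C, N)

# Convolution APIs conventionally expect NCL: [batch, channels, length].
x_ncl = np.transpose(x_cln, (2, 1, 0))

assert x_ncl.shape == (N, C, L)
# Element correspondence is preserved: x_ncl[k, i, l] == x_cln[l, i, k].
assert x_ncl[1, 2, 5] == x_cln[5, 2, 1]
```

In NumPy the transpose is itself a stride-only view; an actual backend may or may not need to materialize the NCL layout depending on what the convolution kernel accepts.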

Scope: This PR exclusively affects state-space models (Mamba, RWKV architectures) running on CANN backend. Standard transformer models and non-CANN backends are unaffected. The implementation adds 123 lines of tensor manipulation and convolution setup code without modifying any existing operator implementations.

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from d15b30f to 738bfbf Compare December 4, 2025 06:13
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from f01b714 to 47d1dc9 Compare December 4, 2025 10:10
@loci-dev loci-dev force-pushed the main branch 16 times, most recently from ca4155f to b86b588 Compare December 5, 2025 22:08
@loci-dev loci-dev force-pushed the main branch 29 times, most recently from 1daebfe to 75a97fd Compare December 10, 2025 23:07

loci-review bot commented Dec 18, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #416

Analysis Scope: CANN backend SSM_CONV operator implementation
Versions Compared: afa63c13 vs ec4cac41
Project: llama.cpp (auroralabs-loci)


Summary

This PR adds SSM convolution operator support for the CANN backend without measurable performance impact. Analysis shows zero performance change across all 16 binaries and no function-level metrics available for comparison. The code introduces 109 new lines implementing ggml_cann_ssm_conv() using depthwise 1D convolution via the aclnnConvolution API. A correctness issue was identified: a missing break statement after the GGML_OP_OUT_PROD case causes fall-through into GGML_OP_SSM_CONV, affecting outer-product operations on the CANN backend.

Power Consumption: All binaries show 0.0% change. Four binaries have negligible absolute deltas under 1.1 nJ due to floating-point precision: libllama.so (+0.50 nJ), llama-cvector-generator (+0.21 nJ), llama-run (+0.52 nJ), llama-tts (+1.05 nJ). The remaining 12 binaries are identical.

Inference Impact: No tokenization or inference functions modified. Functions llama_decode, llama_encode, llama_tokenize remain unchanged, resulting in zero impact on tokens per second. The implementation enables Mamba-style model support on Ascend NPUs without affecting existing inference paths.

Code Changes: New operator adds F32-only SSM convolution with tensor reshaping from CLN to NCL format, depthwise convolution with groups=nr, and platform-specific handling for Ascend 310P cards requiring reduced precision (cubeMathType=1). Implementation uses zero-copy tensor views with stride manipulation for efficient memory access.
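The zero-copy, stride-manipulating view mentioned above can be illustrated with NumPy's as_strided, which, like the backend's tensor views, reinterprets an existing buffer by rewriting shape and stride metadata without moving data (an analogy, not the acl tensor API):

```python
import numpy as np

# A contiguous [3, 4] float32 buffer reinterpreted as a [4, 3] view by
# swapping the two strides: no data is copied, only metadata changes.
buf = np.arange(12, dtype=np.float32).reshape(3, 4)
view = np.lib.stride_tricks.as_strided(
    buf,
    shape=(4, 3),
    strides=(buf.strides[1], buf.strides[0]),
)

assert np.shares_memory(view, buf)   # same storage, zero copy
assert np.array_equal(view, buf.T)   # same elements, transposed layout
```

The trade-off is the usual one for strided views: reads through the view may be non-contiguous, but the conversion itself costs nothing, which is why backends favor it for pure layout changes.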
