UPSTREAM PR #16623: metal: TRI, FILL, EXPM1, SOFTPLUS #434

Open

loci-dev wants to merge 12 commits into main from upstream-PR16623-branch_gabe-l-hart-ggml-cumsum-tri

Conversation


@loci-dev loci-dev commented Dec 4, 2025

Mirrored from ggml-org/llama.cpp#16623

Description

EDIT: This PR has now been updated to match the kernel signatures in master and ggml-org/llama.cpp#17584. CUMSUM was already added in ggml-org/llama.cpp#17305 and there are several more ops added in #17063, so this PR now adds TRI, FILL, EXPM1, and SOFTPLUS for Metal.


This PR builds on some of the work by @pwilkin in #16095 and extends the CPU implementations of CUMSUM and TRI to Metal and CUDA. It also extends type support to F16 and BF16.

The goal of this PR is to establish these two ops in the interest of both the DELTA_NET op for Qwen3-Next and the chunked implementation of the State Space Duality form of SSM_SCAN for faster prefill.

I'm putting this up for review now in case it helps with the Qwen3-Next work and to gather feedback on the kernels. I'm fairly new to kernel development, so I suspect others will find significant optimizations for both Metal and CUDA.
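To make the two ops concrete, here is a minimal CPU-side sketch of their intended semantics on a row-major [rows x cols] F32 buffer. The function names, the innermost-dimension scan direction for CUMSUM, and the keep-diagonal convention for the lower-triangular mask are all assumptions for illustration, not the actual ggml implementations.

```cpp
#include <cstddef>

// Cumulative sum along the innermost dimension: dst[r][c] = sum of src[r][0..c].
static void cumsum_rows(const float * src, float * dst, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += src[r*cols + c];
            dst[r*cols + c] = acc;
        }
    }
}

// Lower-triangular masking in place: keep elements with c <= r, zero the rest.
static void tri_lower(float * dst, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            if (c > r) {
                dst[r*cols + c] = 0.0f;
            }
        }
    }
}
```

The GPU kernels parallelize these loops over rows/threadgroups, but the elementwise behavior should match this reference.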

gabe-l-hart and others added 12 commits December 3, 2025 08:18
The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This was added in the original draft, but later removed. With this, the
kernel now passes tests.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…n kernel

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* origin/master:
CUDA: generalized (mma) FA, add Volta support (#17505)
chat : reserve memory in compute_diffs and improve naming (#17729)
* origin/master:
server: strip content-length header on proxy (#17734)
server: move msg diffs tracking to HTTP thread (#17740)
examples : add missing code block end marker [no ci] (#17756)
common : skip model validation when --help is requested (#17755)
ggml-cpu : remove asserts always evaluating to false (#17728)
convert: use existing local chat_template if mistral-format model has one. (#17749)
cmake : simplify build info detection using standard variables (#17423)
ci : disable ggml-ci-x64-amd-* (#17753)
common: use native MultiByteToWideChar (#17738)
metal : use params per pipeline instance (#17739)
llama : fix sanity checks during quantization (#17721)
build : move _WIN32_WINNT definition to headers (#17736)
build: enable parallel builds in msbuild using MTT (#17708)
ggml-cpu: remove duplicate conditional check 'iid' (#17650)
Add a couple of file types to the text section (#17670)
convert : support latest mistral-common (fix conversion with --mistral-format) (#17712)
Use OpenAI-compatible `/v1/models` endpoint by default (#17689)
webui: Fix zero pasteLongTextToFileLen to disable conversion being overridden (#17445)
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

loci-review bot commented Dec 4, 2025

Explore the complete analysis in Version Insights

Performance Analysis Summary - PR #434

Overview

This PR adds Metal backend support for four tensor operations (TRI, FILL, EXPM1, SOFTPLUS) across 7 files with 265 new lines and zero deletions. Analysis shows no measurable performance impact on existing operations.

Performance Metrics

Function-Level Analysis:
All analyzed functions show identical or near-identical metrics between versions:

  • llama_decode: 732,360 ns response time (−3 ns change)
  • llama_tokenize: 394,120 ns response time (−1 ns change)
  • llama_model_load_from_file: 7,314,205 ns response time (+13 ns change)
  • ggml_graph_compute: 29,524 ns response time (0 ns change)

Power Consumption:
All binaries show 0.0% change in estimated power consumption:

  • libllama.so: 194,027 nJ (unchanged)
  • llama-run: 218,706 nJ (unchanged)
  • libggml-cpu.so: 117,027 nJ (unchanged)

Key Findings

Inference Performance Impact:
No impact on tokens per second. The core inference functions (llama_decode, llama_tokenize) show changes under 3 ns, which is within measurement noise. Given the reference that 2,000,000 ns degradation in llama_decode causes 7% tokens per second reduction, the observed 3 ns change represents approximately 0.00001% impact—effectively zero.

Code Changes:
The PR implements additive GPU acceleration for new operations without modifying existing code paths. Changes include:

  • Metal kernel implementations for TRI (triangular masking), FILL (constant value assignment), SOFTPLUS (log(1+exp(x))), and EXPM1 (exp(x)−1)
  • Operation dispatch integration in ggml_metal_op_encode_impl
  • Pipeline factory functions following existing patterns
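The two unary ops have well-known numerically stable scalar formulations; the sketch below shows them as a hypothetical CPU reference (the actual Metal kernels may use different rewrites). The identity softplus(x) = max(x, 0) + log1p(exp(-|x|)) avoids overflow of exp(x) for large positive x, and the library `expm1f` keeps precision for exp(x) - 1 when x is near zero.

```cpp
#include <cmath>

// log(1 + exp(x)), rewritten so exp() never overflows:
// for large x the result approaches x; for very negative x it approaches 0.
static inline float softplus_ref(float x) {
    return fmaxf(x, 0.0f) + log1pf(expf(-fabsf(x)));
}

// exp(x) - 1; expm1f avoids the catastrophic cancellation of expf(x) - 1.0f
// when x is close to zero.
static inline float expm1_ref(float x) {
    return expm1f(x);
}
```

A naive `logf(1.0f + expf(x))` would return inf for x around 90 in F32, which is exactly the range these stable forms are meant to handle.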

Impacted Functions:
None of the tokenization or inference functions are modified. New functions added:

  • ggml_metal_op_fill (41 lines)
  • ggml_metal_op_tri (57 lines)
  • ggml_metal_library_get_pipeline_tri (23 lines)

These functions are only invoked when TRI, FILL, SOFTPLUS, or EXPM1 operations appear in computation graphs, which does not occur in standard LLM inference paths.

Binary Impact:
No binaries show power consumption changes. The additions increase Metal backend binary size by approximately 150 KB for 12 TRI kernel variants but do not affect runtime execution of existing operations.

loci-dev force-pushed the main branch 16 times, most recently from df48f9e to cb46586 on December 6, 2025 at 12:13
loci-dev force-pushed the main branch 30 times, most recently from af1ee09 to 943ad50 on December 12, 2025 at 23:08