UPSTREAM PR #16623: metal: TRI, FILL, EXPM1, SOFTPLUS #434

Open

loci-dev wants to merge 12 commits into main from upstream-PR16623-branch_gabe-l-hart-ggml-cumsum-tri

Conversation


@loci-dev loci-dev commented Dec 4, 2025

Mirrored from ggml-org/llama.cpp#16623

Description

EDIT: This PR has now been updated to match the kernel signatures in master and ggml-org/llama.cpp#17584. CUMSUM was already added in ggml-org/llama.cpp#17305 and there are several more ops added in #17063, so this PR now adds TRI, FILL, EXPM1, and SOFTPLUS for Metal.


This PR builds on some of the work by @pwilkin in #16095 and extends the CPU implementations of CUMSUM and TRI to Metal and CUDA. It also extends type support to F16 and BF16.

The goal of this PR is to establish these two ops in the interest of both the DELTA_NET op for Qwen3-Next and the chunked implementation of the State Space Duality form of SSM_SCAN for faster prefill.

I'm putting this up for review now in case it helps with the Qwen3-Next work and to gather feedback on the kernels. I'm fairly new to kernel development, so I suspect others will find significant optimizations for both Metal and CUDA.
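To make the two ops concrete, here is a minimal CPU-side sketch of their intended semantics on a row-major [rows x cols] F32 buffer. The function names, the innermost-dimension scan direction for CUMSUM, and the keep-diagonal convention for the lower-triangular mask are all assumptions for illustration, not the actual ggml implementations.

```cpp
#include <cstddef>

// Cumulative sum along the innermost dimension: dst[r][c] = sum of src[r][0..c].
static void cumsum_rows(const float * src, float * dst, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += src[r*cols + c];
            dst[r*cols + c] = acc;
        }
    }
}

// Lower-triangular masking in place: keep elements with c <= r, zero the rest.
static void tri_lower(float * dst, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            if (c > r) {
                dst[r*cols + c] = 0.0f;
            }
        }
    }
}
```

The GPU kernels parallelize these loops over rows/threadgroups, but the elementwise behavior should match this reference.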

gabe-l-hart and others added 12 commits December 3, 2025 08:18
The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This was added in the original draft, but later removed. With this, the
kernel now passes tests.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…n kernel

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* origin/master:
CUDA: generalized (mma) FA, add Volta support (#17505)
chat : reserve memory in compute_diffs and improve naming (#17729)
* origin/master:
server: strip content-length header on proxy (#17734)
server: move msg diffs tracking to HTTP thread (#17740)
examples : add missing code block end marker [no ci] (#17756)
common : skip model validation when --help is requested (#17755)
ggml-cpu : remove asserts always evaluating to false (#17728)
convert: use existing local chat_template if mistral-format model has one. (#17749)
cmake : simplify build info detection using standard variables (#17423)
ci : disable ggml-ci-x64-amd-* (#17753)
common: use native MultiByteToWideChar (#17738)
metal : use params per pipeline instance (#17739)
llama : fix sanity checks during quantization (#17721)
build : move _WIN32_WINNT definition to headers (#17736)
build: enable parallel builds in msbuild using MTT (#17708)
ggml-cpu: remove duplicate conditional check 'iid' (#17650)
Add a couple of file types to the text section (#17670)
convert : support latest mistral-common (fix conversion with --mistral-format) (#17712)
Use OpenAI-compatible `/v1/models` endpoint by default (#17689)
webui: Fix zero pasteLongTextToFileLen to disable conversion being overridden (#17445)
Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

loci-review bot commented Dec 4, 2025

Explore the complete analysis in Version Insights

Performance Analysis Summary - PR #434

Overview

This PR adds Metal backend support for four tensor operations (TRI, FILL, EXPM1, SOFTPLUS) across 7 files with 265 new lines and zero deletions. Analysis shows no measurable performance impact on existing operations.

Performance Metrics

Function-Level Analysis:
All analyzed functions show identical or near-identical metrics between versions:

  • llama_decode: 732,360 ns response time (−3 ns change)
  • llama_tokenize: 394,120 ns response time (−1 ns change)
  • llama_model_load_from_file: 7,314,205 ns response time (+13 ns change)
  • ggml_graph_compute: 29,524 ns response time (0 ns change)

Power Consumption:
All binaries show 0.0% change in estimated power consumption:

  • libllama.so: 194,027 nJ (unchanged)
  • llama-run: 218,706 nJ (unchanged)
  • libggml-cpu.so: 117,027 nJ (unchanged)

Key Findings

Inference Performance Impact:
No impact on tokens per second. The core inference functions (llama_decode, llama_tokenize) show changes under 3 ns, which is within measurement noise. Given the reference that 2,000,000 ns degradation in llama_decode causes 7% tokens per second reduction, the observed 3 ns change represents approximately 0.00001% impact—effectively zero.

Code Changes:
The PR implements additive GPU acceleration for new operations without modifying existing code paths. Changes include:

  • Metal kernel implementations for TRI (triangular masking), FILL (constant value assignment), SOFTPLUS (log(1+exp(x))), and EXPM1 (exp(x)−1)
  • Operation dispatch integration in ggml_metal_op_encode_impl
  • Pipeline factory functions following existing patterns
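The two unary ops have well-known numerically stable scalar formulations; the sketch below shows them as a hypothetical CPU reference (the actual Metal kernels may use different rewrites). The identity softplus(x) = max(x, 0) + log1p(exp(-|x|)) avoids overflow of exp(x) for large positive x, and the library `expm1f` keeps precision for exp(x) - 1 when x is near zero.

```cpp
#include <cmath>

// log(1 + exp(x)), rewritten so exp() never overflows:
// for large x the result approaches x; for very negative x it approaches 0.
static inline float softplus_ref(float x) {
    return fmaxf(x, 0.0f) + log1pf(expf(-fabsf(x)));
}

// exp(x) - 1; expm1f avoids the catastrophic cancellation of expf(x) - 1.0f
// when x is close to zero.
static inline float expm1_ref(float x) {
    return expm1f(x);
}
```

A naive `logf(1.0f + expf(x))` would return inf for x around 90 in F32, which is exactly the range these stable forms are meant to handle.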

Impacted Functions:
None of the tokenization or inference functions are modified. New functions added:

  • ggml_metal_op_fill (41 lines)
  • ggml_metal_op_tri (57 lines)
  • ggml_metal_library_get_pipeline_tri (23 lines)

These functions are only invoked when TRI, FILL, SOFTPLUS, or EXPM1 operations appear in computation graphs, which does not occur in standard LLM inference paths.

Binary Impact:
No binaries show power consumption changes. The additions increase Metal backend binary size by approximately 150 KB for 12 TRI kernel variants but do not affect runtime execution of existing operations.

loci-dev force-pushed the main branch 16 times, most recently from df48f9e to cb46586 on December 6, 2025 at 12:13
loci-dev force-pushed the main branch 30 times, most recently from af1ee09 to 943ad50 on December 12, 2025 at 23:08