UPSTREAM PR #16623: metal: TRI, FILL, EXPM1, SOFTPLUS#434
Conversation
The kernel is not yet correct and is not optimized, but the code compiles and runs, so this will be the starting point now that the core op has been merged. Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This was added in the original draft, but later removed. With this, the kernel now passes tests. Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…n kernel Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* origin/master:
  CUDA: generalized (mma) FA, add Volta support (#17505)
  chat : reserve memory in compute_diffs and improve naming (#17729)
* origin/master:
  server: strip content-length header on proxy (#17734)
  server: move msg diffs tracking to HTTP thread (#17740)
  examples : add missing code block end marker [no ci] (#17756)
  common : skip model validation when --help is requested (#17755)
  ggml-cpu : remove asserts always evaluating to false (#17728)
  convert: use existing local chat_template if mistral-format model has one. (#17749)
  cmake : simplify build info detection using standard variables (#17423)
  ci : disable ggml-ci-x64-amd-* (#17753)
  common: use native MultiByteToWideChar (#17738)
  metal : use params per pipeline instance (#17739)
  llama : fix sanity checks during quantization (#17721)
  build : move _WIN32_WINNT definition to headers (#17736)
  build: enable parallel builds in msbuild using MTT (#17708)
  ggml-cpu: remove duplicate conditional check 'iid' (#17650)
  Add a couple of file types to the text section (#17670)
  convert : support latest mistral-common (fix conversion with --mistral-format) (#17712)
  Use OpenAI-compatible `/v1/models` endpoint by default (#17689)
  webui: Fix zero pasteLongTextToFileLen to disable conversion being overridden (#17445)
Performance Analysis Summary - PR #434

Overview: This PR adds Metal backend support for four tensor operations (TRI, FILL, EXPM1, SOFTPLUS) across 7 files, with 265 new lines and zero deletions. The analysis shows no measurable performance impact on existing operations.

Key Findings: The impacted functions are only invoked when TRI, FILL, SOFTPLUS, or EXPM1 operations appear in computation graphs, which does not occur in standard LLM inference paths.
Force-pushed from df48f9e to cb46586
Force-pushed from af1ee09 to 943ad50
Mirrored from ggml-org/llama.cpp#16623
Description
EDIT: This PR has now been updated to match the kernel signatures in `master` and ggml-org/llama.cpp#17584. CUMSUM was already added in ggml-org/llama.cpp#17305 and several more ops are added in #17063, so this PR now adds `TRI`, `FILL`, `EXPM1`, and `SOFTPLUS` for Metal.

This PR builds on some of the work by @pwilkin in #16095 and extends the CPU implementations of `CUMSUM` and `TRI` to Metal and CUDA. It also extends type support to `F16` and `BF16`.

The goal of this PR is to establish these ops in the interest of both the `DELTA_NET` op for Qwen3-Next and the chunked implementation of the State Space Duality form of `SSM_SCAN` for faster prefill.

I'm putting this up for review now in case it helps with the Qwen3-Next work and to get feedback on the kernels. I'm quite new to kernel development, so I suspect others may find significant optimizations for both Metal and CUDA.