
UPSTREAM PR #17764: ggml webgpu: unary op support, code refactoring, ops support #436

Open
loci-dev wants to merge 5 commits into main from upstream-PR17764-branch_reeselevine-master

Conversation


@loci-dev loci-dev commented Dec 4, 2025

Mirrored from ggml-org/llama.cpp#17764

This PR adds the following for the WebGPU backend:

  • Basic support for most unary operators
    • Updates the WGSL shader generation script to handle formatting functions for different unary operators
  • Some refactoring of the WebGPU backend code
    • Moves pipeline initialization/storage to std::map instead of statically sized arrays, since not all combinations of features (inplace, vectorized, etc.) are needed/used
    • Adds helper functions, e.g. CEIL_DIV, for common calculations to the WebGPU backend file
  • Finally, adds WebGPU backend operator support to ops.md

I know this PR is large, but all of the code is contained within the WebGPU backend. Parts of this code were also written by @abhijitramesh, @XXjcontiniXX, and @neha-ha, and I have reviewed their code.

reeselevine and others added 3 commits December 3, 2025 15:01
commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755
Author: Abhijit Ramesh <abhijitramesh2k@gmail.com>
Date:   Mon Dec 1 18:29:00 2025 -0800

    ggml webgpu: fix xielu parameter passing (#11)

    The XIELU operation was incorrectly using static_cast to convert
    float parameters to uint32_t, which converted numeric values instead
    of preserving IEEE 754 bit patterns. This caused incorrect values
    to be interpreted by the GPU shader.

    * Use reinterpret_cast to preserve float bit patterns when passing
      through uint32_t params buffer
    * Update WGSL shader parameter types from u32 to f32
    * Re-enable XIELU support (was disabled due to numerical issues)

    Fixes NMSE test failures for XIELU operation on WebGPU backend.

commit 5ca9b5e49ea7cddc9ab7c8b43a11a9c76a4dff4a
Author: neha-ha <137219201+neha-ha@users.noreply.github.com>
Date:   Tue Nov 18 12:17:00 2025 -0800

    Refactored pipelines and workgroup calculations (#10)

    * refactored pipelines

    * refactored workgroup calculation

    * removed commented out block of prior maps

    * Clean up ceiling division pattern

    ---------

    Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
    Co-authored-by: Reese Levine <reeselevine1@gmail.com>

Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:13:06 2025 -0700

    formatted embed wgsl and ggml-webgpu.cpp

commit e1f6baea31645e5d96ad53664acae856f74b96f4
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:08:37 2025 -0700

    implemented REPL_Template support and removed bug in unary operators kernel

commit 8c70b8fece445cdc9a8c660dbddbf201e52da2bb
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 15 16:14:20 2025 -0700

    responded and dealt with PR comments

commit f9282c660c10dec4487d434549bdb707a9cd9f37
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:41:41 2025 -0700

    removed unnecessary check for whether node->src[1] exists for unary operators

commit 4cf28d7dec41c29186d66152735b244c5699f9dc
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:32:45 2025 -0700

    All operators (including xielu) working

commit 74c6add1761a59d2c2ff60b60e8ad3c8300f6d3e
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:16:48 2025 -0700

    fixed autoconfig

commit 362749910be4f0120c8ffb21ceddeb7d2c088e51
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:10:46 2025 -0700

    removed vestigial files

commit cb0858333785757804c5104e59c4981843207c16
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:59:32 2025 -0700

    abides by editor-config

commit 5360e2852a4b51197d7d67d0a5d42e908b02d7ed
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:45:57 2025 -0700

    rms_norm double declaration bug atoned

commit 7b09baa4aa53711be5a126043670cc182c78bfcd
Merge: 8a6ec843 74b8fc1
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 11:50:03 2025 -0700

    resolving merge conflicts

commit 8a6ec843a50ab82f8cef59b4558eb63f318ba02d
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 8 18:06:47 2025 -0700

    unary operators pass ggml tests

commit c3ae38278a2db236adc5912c9140e4f0d63f2c19
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 1 16:22:40 2025 -0700

    neg passes backend test

commit aa1c9b2f8877a405470ca56709c42a1fd43713de
Author: James Contini <jamescontini@gmail.com>
Date:   Tue Sep 30 23:55:27 2025 -0700

    neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet though

Co-authored-by: James Contini <jamescontini@gmail.com>
Co-authored-by: Neha Abbas <neabbas@ucsc.edu>
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>

loci-review bot commented Dec 4, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #436

Overview

PR #436 adds WebGPU backend support for unary operators and refactors pipeline storage from static arrays to std::map. Analysis shows zero measurable performance impact across all 16 analyzed binaries. Power consumption remains unchanged, with deltas under 0.2 nJ (effectively zero). No function-level performance data was captured, indicating the changes do not affect existing execution paths.

Code Changes

The PR implements:

  • 15 unary operators (ABS, SGN, NEG, STEP, TANH, ELU, RELU, SIGMOID, GELU, SILU, HARDSWISH, HARDSIGMOID, EXP, GELU_ERF, XIELU) for WebGPU backend
  • Pipeline storage refactoring: replaced fixed arrays with std::map<int, webgpu_pipeline> structures
  • Hardcoded workgroup size of 288 threads (previously dynamic) as a workaround for WebGPU implementation bugs
  • New WGSL shader template system with 60+ shader variants
  • Helper macros: CEIL_DIV and ROUNDUP_POW2 for workgroup calculations

Key Findings

Inference Performance Impact:
No impact on tokens per second. The changes are isolated to the WebGPU backend initialization and operator dispatch paths. Core inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications. The unary operator support only activates when WebGPU backend is explicitly used and models contain supported activation functions.

Power Consumption:
All binaries show zero effective change:

  • libllama.so: +0.17 nJ (194,027 nJ baseline)
  • llama-run: -0.10 nJ (218,706 nJ baseline)
  • llama-cvector-generator: -0.14 nJ (249,105 nJ baseline)
  • Remaining 13 binaries: 0.0 nJ change

Performance-Critical Areas:
No functions from the identified performance-critical areas (llama_decode, ggml_backend_graph_compute, llama_model_load_from_file, memory management, tokenization) were modified. Changes are confined to WebGPU-specific code paths that only execute when WebGPU backend is active.

Technical Notes:
The hardcoded 288-thread workgroup size may underutilize capable GPUs but represents a necessary workaround. Pipeline refactoring reduces memory footprint by allocating only used variants. The std::map lookup overhead (O(log n)) is negligible given small map sizes (typically <10 entries).

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 3e4b499 to e81a7eb Compare December 5, 2025 13:17
reeselevine and others added 2 commits December 5, 2025 10:02

loci-review bot commented Dec 5, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #436

Overview

This PR adds WebGPU backend support for 16 unary operators and refactors pipeline management from static arrays to std::map structures. All changes are isolated to the WebGPU backend implementation.

Performance Impact

No performance changes detected. Analysis across all 16 binaries shows 0.0% change in power consumption and no measurable differences in response time or throughput metrics. The performance analysis returned "no_data" status, indicating no functions met the threshold criteria for performance change reporting.

Code Changes Analysis

The PR implements three main modifications:

1. Pipeline Storage Refactoring
Converts fixed-size arrays to dynamic maps for pipeline storage, reducing memory footprint by eliminating unused pipeline slots. This affects initialization only, not runtime execution paths.

2. Workgroup Size Standardization
Introduces WEBGPU_MAX_WG_SIZE constant (288) and helper macros CEIL_DIV and ROUNDUP_POW2 to address WebGPU implementation inconsistencies across platforms. Replaces 20+ manual calculations with consistent macro usage.

3. Unary Operation Support
Adds GPU acceleration for 16 operations (ABS, SGN, NEG, STEP, TANH, ELU, RELU, SIGMOID, GELU, GELU_QUICK, SILU, HARDSWISH, HARDSIGMOID, EXP, GELU_ERF, XIELU) supporting F32/F16 types and in-place/out-of-place execution modes. Generates 64 shader variants through template-based code generation.

Inference Impact

Tokens per second: No impact. Core inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes. The new unary operation paths are not exercised in the benchmark workload and only activate when models specifically use these activation functions.

Power Consumption

No change across all binaries. All 16 analyzed binaries report 0.0% power consumption change, confirming the refactoring is performance-neutral for existing workloads.

@loci-dev loci-dev force-pushed the main branch 13 times, most recently from 7d0b0c3 to e5edfa8 Compare December 7, 2025 01:37
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from af1ee09 to 943ad50 Compare December 12, 2025 23:08