
UPSTREAM PR #17764: ggml webgpu: unary op support, code refactoring, ops support #436

Open
loci-dev wants to merge 5 commits into main from upstream-PR17764-branch_reeselevine-master

Conversation


@loci-dev loci-dev commented Dec 4, 2025

Mirrored from ggml-org/llama.cpp#17764

This PR adds the following for the WebGPU backend:

  • Basic support for most unary operators
    • Updates the WGSL shader generation script to handle formatting functions for different unary operators
  • Some refactoring of the WebGPU backend code
    • Moves pipeline initialization/storage to std::map instead of statically sized arrays, since not all combinations of features (inplace, vectorized, etc.) are needed/used
    • Adds helper functions, e.g. CEIL_DIV, for common calculations to the WebGPU backend file
  • Finally, adds WebGPU backend operator support to ops.md

I know this PR is large, but all of the code is contained within the WebGPU backend. Parts of this code were also written by @abhijitramesh, @XXjcontiniXX, and @neha-ha, and I have reviewed their code.

reeselevine and others added 3 commits December 3, 2025 15:01
commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755
Author: Abhijit Ramesh <abhijitramesh2k@gmail.com>
Date:   Mon Dec 1 18:29:00 2025 -0800

    ggml webgpu: fix xielu parameter passing (#11)

    The XIELU operation was incorrectly using static_cast to convert
    float parameters to uint32_t, which converted numeric values instead
    of preserving IEEE 754 bit patterns. This caused incorrect values
    to be interpreted by the GPU shader.

    * Use reinterpret_cast to preserve float bit patterns when passing
      through uint32_t params buffer
    * Update WGSL shader parameter types from u32 to f32
    * Re-enable XIELU support (was disabled due to numerical issues)

    Fixes NMSE test failures for XIELU operation on WebGPU backend.

commit 5ca9b5e49ea7cddc9ab7c8b43a11a9c76a4dff4a
Author: neha-ha <137219201+neha-ha@users.noreply.github.com>
Date:   Tue Nov 18 12:17:00 2025 -0800

    Refactored pipelines and workgroup calculations (#10)

    * refactored pipelines

    * refactored workgroup calculation

    * removed commented out block of prior maps

    * Clean up ceiling division pattern

    ---------

    Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
    Co-authored-by: Reese Levine <reeselevine1@gmail.com>

Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:13:06 2025 -0700

    formatted embed wgsl and ggml-webgpu.cpp

commit e1f6baea31645e5d96ad53664acae856f74b96f4
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:08:37 2025 -0700

    implemented REPL_Template support and removed bug in unary operators kernel

commit 8c70b8fece445cdc9a8c660dbddbf201e52da2bb
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 15 16:14:20 2025 -0700

    responded and dealt with PR comments

commit f9282c660c10dec4487d434549bdb707a9cd9f37
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:41:41 2025 -0700

    removed unnecessary check for whether node->src[1] exists for unary operators

commit 4cf28d7dec41c29186d66152735b244c5699f9dc
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:32:45 2025 -0700

    All operators (including xielu) working

commit 74c6add1761a59d2c2ff60b60e8ad3c8300f6d3e
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:16:48 2025 -0700

    fixed autoconfig

commit 362749910be4f0120c8ffb21ceddeb7d2c088e51
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:10:46 2025 -0700

    removed vestigial files

commit cb0858333785757804c5104e59c4981843207c16
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:59:32 2025 -0700

    abides by editor-config

commit 5360e2852a4b51197d7d67d0a5d42e908b02d7ed
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:45:57 2025 -0700

    rms_norm double declaration bug atoned

commit 7b09baa4aa53711be5a126043670cc182c78bfcd
Merge: 8a6ec843 74b8fc1
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 11:50:03 2025 -0700

    resolving merge conflicts

commit 8a6ec843a50ab82f8cef59b4558eb63f318ba02d
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 8 18:06:47 2025 -0700

    unary operators pass ggml tests

commit c3ae38278a2db236adc5912c9140e4f0d63f2c19
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 1 16:22:40 2025 -0700

    neg passes backend test

commit aa1c9b2f8877a405470ca56709c42a1fd43713de
Author: James Contini <jamescontini@gmail.com>
Date:   Tue Sep 30 23:55:27 2025 -0700

    neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet though

Co-authored-by: James Contini <jamescontini@gmail.com>
Co-authored-by: Neha Abbas <neabbas@ucsc.edu>
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>

loci-review bot commented Dec 4, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #436

Overview

PR #436 adds WebGPU backend support for unary operators and refactors pipeline storage from static arrays to std::map. Analysis shows zero measurable performance impact across all 16 analyzed binaries. Power consumption remains unchanged, with deltas under 0.2 nJ (effectively zero). No function-level performance data was captured, indicating the changes do not affect existing execution paths.

Code Changes

The PR implements:

  • 15 unary operators (ABS, SGN, NEG, STEP, TANH, ELU, RELU, SIGMOID, GELU, SILU, HARDSWISH, HARDSIGMOID, EXP, GELU_ERF, XIELU) for WebGPU backend
  • Pipeline storage refactoring: replaced fixed arrays with std::map<int, webgpu_pipeline> structures
  • Hardcoded workgroup size of 288 threads (previously dynamic) as a workaround for WebGPU implementation bugs
  • New WGSL shader template system with 60+ shader variants
  • Helper macros: CEIL_DIV and ROUNDUP_POW2 for workgroup calculations

Key Findings

Inference Performance Impact:
No impact on tokens per second. The changes are isolated to the WebGPU backend initialization and operator dispatch paths. Core inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications. The unary operator support only activates when WebGPU backend is explicitly used and models contain supported activation functions.

Power Consumption:
All binaries show zero effective change:

  • libllama.so: +0.17 nJ (194,027 nJ baseline)
  • llama-run: -0.10 nJ (218,706 nJ baseline)
  • llama-cvector-generator: -0.14 nJ (249,105 nJ baseline)
  • Remaining 13 binaries: 0.0 nJ change

Performance-Critical Areas:
No functions from the identified performance-critical areas (llama_decode, ggml_backend_graph_compute, llama_model_load_from_file, memory management, tokenization) were modified. Changes are confined to WebGPU-specific code paths that only execute when WebGPU backend is active.

Technical Notes:
The hardcoded 288-thread workgroup size may underutilize capable GPUs but represents a necessary workaround. Pipeline refactoring reduces memory footprint by allocating only used variants. The std::map lookup overhead (O(log n)) is negligible given small map sizes (typically <10 entries).

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 3e4b499 to e81a7eb Compare December 5, 2025 13:17
reeselevine and others added 2 commits December 5, 2025 10:02

loci-review bot commented Dec 5, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #436

Overview

This PR adds WebGPU backend support for 16 unary operators and refactors pipeline management from static arrays to std::map structures. All changes are isolated to the WebGPU backend implementation.

Performance Impact

No performance changes detected. Analysis across all 16 binaries shows 0.0% change in power consumption and no measurable differences in response time or throughput metrics. The performance analysis returned "no_data" status, indicating no functions met the threshold criteria for performance change reporting.

Code Changes Analysis

The PR implements three main modifications:

1. Pipeline Storage Refactoring
Converts fixed-size arrays to dynamic maps for pipeline storage, reducing memory footprint by eliminating unused pipeline slots. This affects initialization only, not runtime execution paths.

2. Workgroup Size Standardization
Introduces WEBGPU_MAX_WG_SIZE constant (288) and helper macros CEIL_DIV and ROUNDUP_POW2 to address WebGPU implementation inconsistencies across platforms. Replaces 20+ manual calculations with consistent macro usage.

3. Unary Operation Support
Adds GPU acceleration for 16 operations (ABS, SGN, NEG, STEP, TANH, ELU, RELU, SIGMOID, GELU, GELU_QUICK, SILU, HARDSWISH, HARDSIGMOID, EXP, GELU_ERF, XIELU) supporting F32/F16 types and in-place/out-of-place execution modes. Generates 64 shader variants through template-based code generation.

Inference Impact

Tokens per second: No impact. Core inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes. The new unary operation paths are not exercised in the benchmark workload and only activate when models specifically use these activation functions.

Power Consumption

No change across all binaries. All 16 analyzed binaries report 0.0% power consumption change, confirming the refactoring is performance-neutral for existing workloads.

@loci-dev loci-dev force-pushed the main branch 13 times, most recently from 7d0b0c3 to e5edfa8 Compare December 7, 2025 01:37
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from af1ee09 to 943ad50 Compare December 12, 2025 23:08