UPSTREAM PR #16858: CUDA: Remove unneded bias/gate dims in fused mmvq by DajanaV · Pull Request #17 · auroralabs-loci/llama.cpp

DajanaV · 2025-10-31T11:04:30Z

Pointed out
here that only a single value is needed per target col per thread

…ounding differences (#16295) * tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences * apply similar error bounds to test_cpy

The JSON parser is temporarily kept only for backward compatibility. It reads the etag from old .json files to prevent unnecessary re-downloads for existing users. This legacy code can be removed in a future version. Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* metal : dynamic simdgroups for MV kernels * cont : minor

* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs * fix to ensure test-backend-ops check passes

`test-arg-parser.cpp` has been updated to work consistently, regardless of whether CURL or SSL support is available, and now always points to `ggml.ai`. The previous timeout test has been removed, but it can be added back by providing a dedicated URL under `ggml.ai`. Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Work on rope * Simplify inplace operation generation and combine mul/add generation * Work on rope variants * implement neox rope * rope complete * Add sub,div,glu operators * implement scale op * Update cpy shader to handle cont/more types * formatting * Update test vars printing for rope,rms_norm * Avoid ROPE hardcoded constants * Add TODO to change ROPE constants to enum Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix TODO comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix: skip empty sampling fields instead of coercing to 0 in chat API options * chore: update webui build output

* common : disable progress bar without a tty Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add missing headers Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* fix ccache key for ubuntu-cpu-cmake * set it for release as well [no ci]

…#16359) * Make a few GLM tensors not required layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6 resulting in the model not loading after conversion/quantization, this marks those tensors as not required which makes it work * Update llama-model.cpp layer.nextn.shared_head_norm also not required in case of future models

…6363)

…(#16345) * make ggml_vk_default_dispatcher support older vulkan headers * simpilfy with using

* feat: Add a setting to include model name used to generate the message * feat: UI improvements * feat: Save model info along with the database message entry creation * chore: Build webui static output

* feat: Improve code block theming * chore: update webui build output * chore: Update webui static build

…onditional rendering for Actions Dropdown for Chat Conversation Items (#16369) * fix: Render Conversation action dialogs as singletons from Chat Sidebar level * chore: update webui build output * fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup * chore: Update webui static build * fix: Always truncate conversation names * chore: Update webui static build

* common: introduce http.h for httplib-based client This change moves cpp-httplib based URL parsing and client setup into a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`. It is an iteration towards removing libcurl, while intentionally minimizing changes to existing code to guarantee the same behavior when `LLAMA_CURL` is used. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * tools : add missing WIN32_LEAN_AND_MEAN Signed-off-by: Adrien Gallouët <adrien@gallouet.fr> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>

* CI: Properly install rocwmma for hip builds on windows we now windows install rocwmma from ubuntu pacakges * CI: update linux rocm docker build to use rocm 7.0

…16075) * Fix to use hidden_size_per_head * Fix num heads * Fix array * Fix loading weights * Support old GGUF converted by the previous version of llama.cpp * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Move shared parameter definitions to the outside of loop * Not calculating n_embd_head_k,v by n_embd / n_head --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

…0 (#16221) * HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0 rocwmma 2.0.0 includes a bug in the code fakeing fp16 accumulation on CDNA * CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn

* update oneapi to 2025.2, use deep-learning-essentials to replace base-tool * update to 2025.2 use deeplearn essi to replace base toolkit * add missed dll * add deep learning essentials * add sycl-ls --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>

loci-review · 2025-10-31T12:43:03Z

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: CUDA MMVQ Optimization

Data Availability Status

The performance analysis tools returned empty results for the specified version comparison. This indicates that either performance profiling data is not available for these versions or the version identifiers do not match the performance database records.

Code Analysis Based on PR Changes

Based on the code review of PR #17, the following analysis covers the potential performance impacts:

Modified Function: `mul_mat_vec_q` (CUDA Kernel)

Location: ggml/src/ggml-cuda/mmvq.cu
Change Type: Memory optimization in fused matrix-vector quantized multiplication

Performance Impact Analysis by KPI

1. Tokens per Second

Potential Impact: Positive improvement expected

Analysis: The CUDA kernel optimization reduces memory overhead in the quantized matrix-vector multiplication path, which is critical for inference performance. While the specific performance metrics are not available from the tools, the code changes suggest:

Memory Access Optimization: Reduced array dimensions from [ncols_dst][rows_per_cuda_block] to [ncols_dst]
Register Pressure Reduction: Lower memory usage per thread block improves GPU occupancy
Loop Unrolling: Added #pragma unroll directives for compiler optimization

Reference Context: Given that a 2ms slowdown in llama_decode results in 7% fewer tokens per second on the reference system (smollm:135m, Intel i7-1255U), optimizations in the underlying CUDA kernels should provide measurable improvements in GPU-accelerated scenarios.

2. Power Consumption

Impacted Binaries:

llama-cli
llama-server
llama-bench
Any binary utilizing CUDA backend for quantized operations

Analysis: The memory optimization reduces GPU register usage and memory bandwidth requirements, potentially leading to lower power consumption during inference operations that utilize quantized matrix-vector multiplications.

3. Quantization Efficiency

Impacted Functions:

mul_mat_vec_q (CUDA kernel)
ggml_cuda_mul_mat_vec_q (host function)

Analysis: The optimization maintains quantization accuracy while improving memory efficiency. The change specifically targets the bias and gate handling in fused operations, which are common in transformer architectures with GLU activations.

4. Memory Usage

Improvements:

GPU Memory: Reduced per-thread memory allocation in CUDA kernels
Register Usage: Lower register pressure allows better thread occupancy
Memory Bandwidth: Simplified indexing patterns reduce memory access overhead

Specific Changes:

Eliminated rows_per_cuda_block dimension from bias arrays
Simplified memory access patterns from 2D to 1D indexing

5. Batch Processing

Potential Impact: Improved efficiency for batched operations

Analysis: The memory optimization benefits scale with batch size, as the reduced memory footprint per thread allows for better GPU utilization when processing multiple sequences simultaneously.

Action Items for Performance Improvement

Immediate Code-Level Optimizations

Verify CUDA Compilation Flags
- Ensure -O3 optimization level is enabled for CUDA compilation
- Validate that --use_fast_math is applied where appropriate
- Check compute capability targeting matches deployment hardware
Memory Access Pattern Validation
- Verify coalesced memory access patterns in the optimized kernel
- Ensure shared memory bank conflicts are minimized
- Validate alignment requirements for optimal memory throughput
Build System Enhancements
- Enable link-time optimization (LTO) for CUDA object files
- Verify proper NVCC optimization flags in CMake configuration
- Ensure debug symbols are stripped in release builds

Function-Specific Improvements

CUDA Kernel Optimization
- Consider further loop unrolling opportunities in the main computation loop
- Evaluate shared memory usage patterns for additional optimizations
- Assess potential for warp-level primitives usage
Host-Device Interface
- Validate memory transfer patterns between host and device
- Ensure optimal CUDA stream utilization
- Check for unnecessary synchronization points

Critical Functions Status

Based on the code analysis, the following critical functions are potentially impacted by the CUDA optimization:

mul_mat_vec_q: Direct optimization target with memory efficiency improvements
ggml_cuda_mul_mat_vec_q: Host function calling the optimized kernel
Quantized inference pipeline: Indirect benefits through improved CUDA backend performance

The optimization represents a targeted improvement in the CUDA quantized matrix-vector multiplication path, which is fundamental to efficient transformer inference on GPU hardware. The changes maintain functional correctness while reducing memory overhead and improving GPU utilization characteristics.

Fix - unsetenv()

* Basic JIT compilation for mul_mat, get_rows, and scale (#17) * scale jit working * preliminary working jit for getrows and mulmat, needs refining * simplified mul_mat preprocessing switch statement * get_rows fixes, mul_mat refinement * formatted + last edits * removed some extraneous prints * fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish * small fix * some changes, working * get_rows and mul_mat jit fixed and working * Update formatting * formatting * Add header --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on all-encompassing shader library * refactor argmax, set_rows * Refactor all but flashattention, mat mul * flashattention and matrix multiplication moved to new format * clean up preprocessing * Formatting * remove duplicate constants * Split large shaders into multiple static strings --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

…better shader parameter handling (#20173) * K quant speedup (#20) * Basic JIT compilation for mul_mat, get_rows, and scale (#17) * scale jit working * preliminary working jit for getrows and mulmat, needs refining * simplified mul_mat preprocessing switch statement * get_rows fixes, mul_mat refinement * formatted + last edits * removed some extraneous prints * fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish * small fix * some changes, working * get_rows and mul_mat jit fixed and working * Update formatting * formatting * Add header --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on all-encompassing shader library * refactor argmax, set_rows * Refactor all but flashattention, mat mul * no gibberish, all k quants added, merged * vec memory fix * q6_k matching metal on my machine, tests passing * Set tile size for q6_k separately * Separate out fast shaders --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com> * Move towards writeBuffer for params * Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups * Remove extra file * Formatting --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

jeffbolznv and others added 30 commits September 29, 2025 19:26

tests: override test_set_rows::max_nmse_err to allow for occasional r…

a74a0d6

…ounding differences (#16295) * tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences * apply similar error bounds to test_cpy

codeowners: add codeowners for opencl backend (#16344)

de41f2b

kleidiai : fix work size and threads sync for fp16 (#16246)

f1eb1cb

metal : dynamic simdgroups for MV kernels (#16340)

35fb824

* metal : dynamic simdgroups for MV kernels * cont : minor

cuda : Enable CUDA Graph usage for Nemotron Nano v2 (NemotronH) (#16328)

a014310

* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs * fix to ensure test-backend-ops check passes

ggml : bump version to 0.9.4 (ggml/1363)

075c015

ci : disable ccache for android (#16348)

2df5bcf

opencl: support ne3 in get_rows (#15866)

d1c84a6

Chatapi ignore empty sampling (#16330)

16b0ca0

* fix: skip empty sampling fields instead of coercing to 0 in chat API options * chore: update webui build output

opencl: support pad_ext (#15888)

7c156df

ci : fix ccache key for ubuntu-cpu-cmake (#16355)

b2ba81d

* fix ccache key for ubuntu-cpu-cmake * set it for release as well [no ci]

webui: Remove running llama-server within WebUI dev.sh script (#1…

aa9538a

…6363)

vulkan: make ggml_vk_default_dispatcher support older vulkan headers …

132d673

…(#16345) * make ggml_vk_default_dispatcher support older vulkan headers * simpilfy with using

Add optional setting for showing "Model used:" information (#16337)

4f15759

* feat: Add a setting to include model name used to generate the message * feat: UI improvements * feat: Save model info along with the database message entry creation * chore: Build webui static output

ci : use registry cache for docker builds (#16366)

1104ca1

Improve code block color theming (#16325)

2a9b633

* feat: Improve code block theming * chore: update webui build output * chore: Update webui static build

ci: Properly install rocwmma for hip builds (#16305)

1fe4e38

* CI: Properly install rocwmma for hip builds on windows we now windows install rocwmma from ubuntu pacakges * CI: update linux rocm docker build to use rocm 7.0

CI: reenable cdna in rocm docker builds (#16376)

c8dedc9

HIP: add IMbackK to codeowner (#16375)

95ce098

ci : fix clean-up of old logs (#16381)

bbd32bc

DajanaV force-pushed the main branch from 2bde1d9 to 741b26c Compare October 31, 2025 15:24

DajanaV added the outdated-summary Summary report is out of date label Oct 31, 2025

DajanaV force-pushed the main branch 20 times, most recently from b655780 to 94ec54d Compare November 3, 2025 20:09

DajanaV closed this Nov 3, 2025

DajanaV force-pushed the main branch from 94ec54d to 92c0c2f Compare November 3, 2025 23:53

DajanaV mentioned this pull request Nov 18, 2025

UPSTREAM PR #17342: Throughput improvement for small batch sizes #248

Open

loci-dev pushed a commit that referenced this pull request Jan 14, 2026

Merge pull request #17 from I-N-T-E-L/fix---unsetenv()

1c05c32

Fix - unsetenv()

loci-dev mentioned this pull request Mar 21, 2026

UPSTREAM PR #17342: Throughput improvement for small batch sizes #1279

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #16858: CUDA: Remove unneded bias/gate dims in fused mmvq#17

UPSTREAM PR #16858: CUDA: Remove unneded bias/gate dims in fused mmvq#17
DajanaV wants to merge 6887 commits intomainfrom
upstream-PR16858-branch_ORippler-osimons/remove_unneded_bias_gate_dims

DajanaV commented Oct 31, 2025

Uh oh!

loci-review bot commented Oct 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

DajanaV commented Oct 31, 2025

Uh oh!

loci-review bot commented Oct 31, 2025

Performance Analysis Summary: CUDA MMVQ Optimization

Data Availability Status

Code Analysis Based on PR Changes

Modified Function: mul_mat_vec_q (CUDA Kernel)

Performance Impact Analysis by KPI

1. Tokens per Second

2. Power Consumption

3. Quantization Efficiency

4. Memory Usage

5. Batch Processing

Action Items for Performance Improvement

Immediate Code-Level Optimizations

Function-Specific Improvements

Critical Functions Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Modified Function: `mul_mat_vec_q` (CUDA Kernel)