UPSTREAM PR #16884: CUDA: fuse rope + set_rows #21

Closed
DajanaV wants to merge 6905 commits into main from
upstream-PR16884-branch_am17an-cuda-add-rope-fusion
Conversation


@DajanaV DajanaV commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16884

Based on #16769.

On a 4090:

| Model | Test | t/s master | t/s cuda-rope-fusion | Speedup |
| --- | --- | --- | --- | --- |
| llama 8B Q4_K_M | tg32 | 134.90 | 136.07 | 1.01 |
| llama 8B Q4_K_M | tg64 | 131.41 | 132.84 | 1.01 |
| llama 8B Q4_K_M | tg128 | 130.54 | 131.87 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 167.18 | 168.23 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 161.00 | 161.90 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 158.84 | 159.83 | 1.01 |

allozaur and others added 30 commits October 1, 2025 12:08
* feat: Add a setting to include model name used to generate the message

* feat: UI improvements

* feat: Save model info along with the database message entry creation

* chore: Build webui static output
* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build
…onditional rendering for Actions Dropdown for Chat Conversation Items (#16369)

* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët <[email protected]>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
Signed-off-by: Adrien Gallouët <[email protected]>
* CI: Properly install rocwmma for hip builds

on windows we now install rocwmma from ubuntu packages

* CI: update linux rocm docker build to use rocm 7.0
…16075)

* Fix to use hidden_size_per_head

* Fix num heads

* Fix array

* Fix loading weights

* Support old GGUF converted by the previous version of llama.cpp

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Move shared parameter definitions to the outside of loop

* Not calculating n_embd_head_k,v by n_embd / n_head

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…0 (#16221)

* HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0

rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA

* CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn
* update oneapi to 2025.2, use deep-learning-essentials to replace base-tool

* update to 2025.2 use deeplearn essi to replace base toolkit

* add missed dll

* add deep learning essentials

* add sycl-ls

---------

Co-authored-by: Zhang Jianyu <[email protected]>
* First attempt

* No permute during convert (fixes qk tensors), proper norm application.

* RoPE = NeoX

* Coherence!

* Migrate xielu params from tensors to hyperparameters

* Simple CUDA kernel

* Revert stupid LLM refactorings

* Chat template support

* configchecker / flake8 errors

* Reorder unary.cu

* I do conclude that LLMs are, in fact, stupid.

* Fix after merge

* Final newline

* Make xIELU an UNARY_OP

* Final newline

* Correctly account for parameter shift

* Argh.

* Update ggml/src/ggml-cpu/unary-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Refactor: remove unused methods, inline and factorize softplus, add const modifiers

* Revert CUDA changes, implement xIELU as a separate OP

* Pesky newline

* Add float2half / half2float for F16 inputs/outputs

* CUDA variants, attempt 2

* Actually, attempt 3

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Missing convert header

* Proper formula and reference for xIELU in the comments.

* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Add tensor mappings for Apertus to global list instead

* Fix lazy on scalars

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Add comment about the constraints on positive/negative alpha

* Change `softplus` to `ggml_softplus`

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…389)

* do not use more threads than physically available

* ensure n_threads > 0

Co-authored-by: Jeff Bolz <[email protected]>

---------

Co-authored-by: Jeff Bolz <[email protected]>
…rolling (#16356)

Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <[email protected]>
* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output
…quest (#16405)

* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output
This commit updates the macos-13 runners to macos-15-intel.

The motivation for this change is that the macos-13 runners are scheduled
to be retired on 2025-12-04.

Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
…6354)

* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
reallocation is needed if a single chunk grows in size,
even if total allocation size stays the same or is lower
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>

loci-review bot commented Oct 31, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions

  • llama_decode: No performance changes (48,432,684 ns response time, 71 ns throughput, 54 ns bottleneck)
  • llama_encode: No performance changes (12,186,729 ns response time, 57 ns throughput, 40 ns bottleneck)
  • llama_tokenize: No performance changes (832,589 ns response time, 22 ns throughput, 17 ns bottleneck)

Supporting Functions

  • llama_model_load_from_file: No performance changes (330,045,660 ns response time)
  • llama_batch_init: No performance changes (257 ns response time)
  • ggml_backend_graph_compute: No performance changes (148 ns response time)

All critical functions show identical performance metrics between versions, with no modifications detected.

Key Performance Indicator Impact Analysis

1. Tokens Per Second

Impact: No Change

  • Critical Functions Status: llama_decode, llama_encode, and llama_tokenize show zero performance degradation
  • Reference Baseline: The 7% tokens/second reduction with 2ms llama_decode slowdown does not apply here
  • Affected Functions: None of the tokenization/inference pipeline functions have measurable changes

2. Power Consumption

Impact: Negligible

  • build.bin.libllama.so: 305,212 nJ (0.0% change from 305,211 nJ base)
  • build.bin.libggml-cpu.so: 151,692 nJ (0.0% change)
  • build.bin.libggml-base.so: 90,434 nJ (0.0% change)
  • build.bin.libggml.so: 6,339 nJ (0.0% change)

3. Quantization Efficiency

Impact: No Change

  • llama_model_quantize: Function performance unchanged
  • Quantization Pipeline: No modifications to quantization-related functions detected
  • Format Support: Q4_0, Q4_1, Q8_0 processing efficiency maintained

4. Memory Usage

Impact: No Change

  • KV Cache Functions: llama_memory_clear, llama_memory_seq_rm, llama_memory_seq_cp show no performance changes
  • Memory Allocation: ggml_gallocr_new, ggml_tallocr_alloc functions unchanged
  • Memory Management: Unified and recurrent memory systems maintain baseline performance

5. Batch Processing

Impact: No Change

  • Batch Functions: llama_batch_init, llama_batch_get_one, llama_batch_free show identical metrics
  • Parallel Processing: llama_decode batch processing performance unchanged
  • Dynamic Batching: No degradation in adaptive batch size management

Action Items

Code Optimization Focus

  • CUDA ROPE Fusion: The implemented ROPE + VIEW + SET_ROWS fusion in ggml-cuda.cu provides 1% GPU performance improvement without affecting CPU-based critical functions
  • Template Optimization: Consider reducing template instantiation overhead in rope.cu mixed-precision implementations
  • Validation Caching: Cache fusion eligibility checks in ggml_cuda_should_fuse_rope_set_rows() to reduce repeated validation overhead

Build System Enhancements

  • Compiler Optimization: Maintain current optimization flags as they preserve performance across all critical functions
  • Template Compilation: Monitor compilation time impact from expanded template parameters in CUDA kernels
  • Backend Selection: Ensure CUDA fusion optimizations don't interfere with CPU backend performance

Conclusion

The version comparison shows stable performance across all critical LLaMA.cpp functions. The CUDA ROPE fusion implementation provides GPU-specific optimizations without impacting CPU inference performance. No degradation detected in tokenization, memory management, or batch processing pipelines that would affect tokens per second throughput or power consumption.
