UPSTREAM PR #16805: ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 by DajanaV · Pull Request #28 · auroralabs-loci/llama.cpp

DajanaV · 2025-10-31T19:04:41Z

This PR is a small refactoring of ggml_compute_forward_rope_f32 and _f16.
I extracted the rotate_pairs logic to remove some duplicate code.

Also, I kept the current behavior for unsupported rope type - it defaults to normal type (use consecutive values for the pairs).

Later edit: I added some performance tests to test-backend-ops.
Here is the output from compare-llama-bench.py

Backend	GGML op	Op parameters	Bandwidth (GB/s) master	Bandwidth (GB/s) refactor-rope	Speedup
CPU	ROPE	type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	0.77	1.11	1.45
CPU	ROPE	type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	2.19	3.29	1.50
CPU	ROPE	type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	0.78	1.11	1.43
CPU	ROPE	type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	2.06	2.98	1.44
CPU	ROPE	type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	0.97	1.60	1.66
CPU	ROPE	type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	4.49	7.38	1.64
CPU	ROPE	type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	0.91	1.50	1.66
CPU	ROPE	type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	4.37	6.61	1.52
CPU	ROPE	type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	1.55	3.41	2.20
CPU	ROPE	type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	7.29	15.90	2.18
CPU	ROPE	type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	1.55	3.38	2.18
CPU	ROPE	type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	6.80	15.33	2.26
CPU	ROPE	type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	1.81	4.48	2.47
CPU	ROPE	type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	8.60	21.22	2.47
CPU	ROPE	type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	1.75	4.38	2.51
CPU	ROPE	type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	8.53	18.89	2.21
CPU	ROPE	type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	0.89	1.41	1.58
CPU	ROPE	type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	4.28	6.82	1.59
CPU	ROPE	type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	0.82	1.34	1.64
CPU	ROPE	type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	4.08	6.08	1.49
CPU	ROPE	type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	0.76	1.13	1.49
CPU	ROPE	type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	2.30	3.68	1.60
CPU	ROPE	type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	0.77	1.18	1.53
CPU	ROPE	type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	2.20	3.34	1.52
CPU	ROPE	type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	3.13	4.43	1.42
CPU	ROPE	type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	14.47	20.34	1.41
CPU	ROPE	type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	2.91	4.36	1.50
CPU	ROPE	type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	14.36	19.06	1.33
CPU	ROPE	type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	2.65	2.60	0.98
CPU	ROPE	type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	7.02	8.08	1.15
CPU	ROPE	type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	2.20	2.30	1.05
CPU	ROPE	type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	6.46	6.82	1.06
CPU	ROPE	type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	3.93	4.11	1.05
CPU	ROPE	type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	16.07	18.33	1.14
CPU	ROPE	type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	2.94	3.36	1.14
CPU	ROPE	type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	13.88	14.59	1.05
CPU	ROPE	type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	10.30	11.87	1.15
CPU	ROPE	type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	46.39	54.29	1.17
CPU	ROPE	type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	8.30	10.79	1.30
CPU	ROPE	type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	36.20	43.39	1.20
CPU	ROPE	type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	14.87	17.91	1.20
CPU	ROPE	type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	61.97	71.31	1.15
CPU	ROPE	type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	11.76	15.95	1.36
CPU	ROPE	type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	50.30	62.62	1.24
CPU	ROPE	type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	3.47	3.63	1.05
CPU	ROPE	type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	14.31	17.21	1.20
CPU	ROPE	type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	2.59	3.01	1.16
CPU	ROPE	type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	11.25	13.20	1.17
CPU	ROPE	type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	2.75	3.05	1.11
CPU	ROPE	type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	7.53	8.99	1.19
CPU	ROPE	type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	2.25	2.59	1.15
CPU	ROPE	type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	6.87	7.71	1.12
CPU	ROPE	type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0	11.45	12.20	1.07
CPU	ROPE	type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0	46.83	52.90	1.13
CPU	ROPE	type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0	9.31	10.61	1.14
CPU	ROPE	type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0	38.95	43.93	1.13

* common: introduce http.h for httplib-based client This change moves cpp-httplib based URL parsing and client setup into a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`. It is an iteration towards removing libcurl, while intentionally minimizing changes to existing code to guarantee the same behavior when `LLAMA_CURL` is used. Signed-off-by: Adrien Gallouët <[email protected]> * tools : add missing WIN32_LEAN_AND_MEAN Signed-off-by: Adrien Gallouët <[email protected]> --------- Signed-off-by: Adrien Gallouët <[email protected]> Signed-off-by: Adrien Gallouët <[email protected]>

* CI: Properly install rocwmma for hip builds on windows we now windows install rocwmma from ubuntu pacakges * CI: update linux rocm docker build to use rocm 7.0

…16075) * Fix to use hidden_size_per_head * Fix num heads * Fix array * Fix loading weights * Support old GGUF converted by the previous version of llama.cpp * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Move shared parameter definitions to the outside of loop * Not calculating n_embd_head_k,v by n_embd / n_head --------- Co-authored-by: Sigbjørn Skjæret <[email protected]>

…0 (#16221) * HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0 rocwmma 2.0.0 includes a bug in the code fakeing fp16 accumulation on CDNA * CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn

* update oneapi to 2025.2, use deep-learning-essentials to replace base-tool * update to 2025.2 use deeplearn essi to replace base toolkit * add missed dll * add deep learning essentials * add sycl-ls --------- Co-authored-by: Zhang Jianyu <[email protected]>

Signed-off-by: Xiaodong Ye <[email protected]>

* First attempt * No permute during convert (fixes qk tensors), proper norm application. * RoPE = NeoX * Coherence! * Migrate xielu params from tensors to hyperparameters * Simple CUDA kernel * Revert stupid LLM refactorings * Chat template support * configchecker / flake8 errors * Reorder unary.cu * I do conclude that LLMs are, in fact, stupid. * Fix after merge * Final newline * Make xIELU an UNARY_OP * Final newline * Correctly account for parameter shift * Argh. * Update ggml/src/ggml-cpu/unary-ops.cpp Co-authored-by: Georgi Gerganov <[email protected]> * Refactor: remove unused methods, inline and factorize softplus, add const modifiers * Revert CUDA changes, implement xIELU as a separate OP * Pesky newline * Add float2half / half2float for F16 inputs/outputs * CUDA variants, attempt 2 * Actually, attempt 3 * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <[email protected]> * Missing convert header * Proper formula and reference for xIELU in the comments. * Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <[email protected]> * Add tensor mappings for Apertus to global list instead * Fix lazy on scalars * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <[email protected]> * Add comment about the constraints on positive/negative alpha * Change `softplus` to `ggml_softplus` --------- Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: Johannes Gäßler <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <[email protected]> --------- Co-authored-by: Georgi Gerganov <[email protected]>

…389) * do not use more threads than physically available * ensure n_threads > 0 Co-authored-by: Jeff Bolz <[email protected]> --------- Co-authored-by: Jeff Bolz <[email protected]>

…rolling (#16356) Use <svelte:window bind:innerHeight> instead of manual resize listener Co-authored-by: Aleksander Grygier <[email protected]>

* fix: Include just the currently active message branches instead of all in chat completions request * chore: Build webui static output * chore: Formatting * chore: update webui build output

…GGML_KQ_MASK_PAD) (#16316)

…quest (#16405) * feat: Capture model name only after first token (streaming) or completed request (non-streaming) * chore: update webui build output * chore: update webui build output

This commit updates the macos-13 runners to macos-15-intel. The motivation for this changes is the macos-13 runners are scheduled to be retired on 2025-12-04. Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/

When computing sinks, the cm1 shader was looping r from 0 to Br rather than to rows_per_thread. I must have copied this from the scalar path (where it is correct), and somehow it wasn't causing failures on current drivers.

…6354) * vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers. The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed beyond that limit. This allows > 4GB buffers to be allocated on some implementations (e.g. NVIDIA) and tensors this large can be used for im2col and mul_mat. For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange. I'm not sure this check is ideal, but we always use these buffers as a single full size binding and the limit may be smaller than maxMemoryAllocationSize or maxBufferSize, so I think this is reasonable. Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range. The maxStorageBufferRange may be smaller than the maxBufferSize or maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and it's invalid usage if VK_WHOLE_SIZE computes a range larger than maxStorageBufferRange. With this change, it should be possible to generate videos using wan networks in stable-diffusion.cpp. * vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull

* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs * chore: update webui build output * chore: update webui build output

reallocation is needed if a single chunk grows in size, even if total allocation size stays the same or is lower

* initial commit for branch 3 * generalize `swa_checkpoint` to `ctx_checkpoint` this extends `llama-server`'s SWA checkpointing logic to include hybrid/recurrent models such as Jamba, Granite * oops * disable debug prints * keep backwards compat with `--swa-checkpoints` Co-authored-by: Georgi Gerganov <[email protected]> * update prompt re-processing message * fix off-by-one error per GG * keep `seq_rm` log per GG Co-authored-by: Georgi Gerganov <[email protected]> * server : fix checkpoint logic to support recurrent caches * server : cleanup and fixes --------- Co-authored-by: Georgi Gerganov <[email protected]>

* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls * feat: new flow in the chat template test suite for Magistral

* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <[email protected]>

* rpc : add support for multiple devices Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. closes: #15210 * fixes * use ggml_backend_reg_t * address review comments * fix llama-bench backend report * address review comments, change device naming * fix cmd order

Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.

loci-review · 2025-10-31T20:19:09Z

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp ROPE Optimization

Critical Function Performance Impact

Core Inference Functions - No Performance Changes

All primary inference functions show zero performance impact:

Function	Response Time Change	Throughput Change	Bottleneck Change	Modified
`llama_decode`	0 ns (48,432,560 ns)	0 ns (71 ns)	0 ns (54 ns)	No
`llama_encode`	0 ns (12,186,698 ns)	0 ns (57 ns)	0 ns (40 ns)	No
`llama_tokenize`	0 ns (832,591 ns)	0 ns (22 ns)	0 ns (17 ns)	No
`ggml_compute_forward_rope`	0 ns (17,992 ns)	0 ns (79 ns)	0 ns (48 ns)	No

Model Loading and Batch Processing - Stable Performance

Function	Response Time Change	Throughput Change	Bottleneck Change
`llama_model_load_from_file`	0 ns (330,046,720 ns)	0 ns (205 ns)	0 ns (36 ns)
`llama_batch_init`	0 ns (257 ns)	0 ns (200 ns)	0 ns (95 ns)

Key Performance Indicators Impact Analysis

1. Tokens Per Second - No Impact

Status: No changes detected in inference-critical functions

llama_decode: 0 ns change (maintains 48.4 ms response time)
llama_encode: 0 ns change (maintains 12.2 ms response time)
llama_tokenize: 0 ns change (maintains 833 μs response time)

Inference: Based on the reference that 2 ms slower llama_decode reduces tokens/second by 7%, the zero change in llama_decode performance indicates no impact on tokens per second.

2. Power Consumption - Negligible Change

Binary-Level Analysis:

build.bin.libllama.so: -0.0% change (305,212.44 nJ → 305,211.93 nJ)
build.bin.libggml-base.so: 0.0% change (90,434.19 nJ)
build.bin.libggml-cpu.so: 0.0% change (151,692.17 nJ)
build.bin.libggml.so: 0.0% change (6,339.24 nJ)

Impact: Negligible power consumption reduction across all binaries.

3. Quantization Efficiency - No Impact

Analysis: llama_model_quantize function shows no performance changes

Response Time: Stable at 330 ms
Throughput: Stable at 205 ns
Bottleneck: Stable at 36 ns

Impact: Quantization operations maintain identical performance characteristics.

4. Memory Usage - No Direct Impact

Memory Management Functions:

KV Cache Operations: No changes detected in memory management functions
Batch Allocation: llama_batch_init shows zero performance change
Model Loading: llama_model_load_from_file maintains stable 330 ms response time

Impact: Memory allocation patterns and cache management remain unchanged.

5. Batch Processing - No Impact

Batch Processing Functions:

llama_batch_init: 0 ns change across all metrics
llama_decode (batch processing): 0 ns change in 48.4 ms response time
Parallel Processing: No changes in batch allocation or processing efficiency

Impact: Batch processing performance remains stable.

ROPE Function Optimization Analysis

Template Consolidation Benefits

The ROPE refactoring in ggml/src/ggml-cpu/ops.cpp provides:

Code Reduction: 267 lines removed, 53 lines added (83% reduction)
Template Unification: Single ggml_compute_forward_rope_flt<T>() replaces separate f32/f16 functions
Performance Potential: PR benchmarks show 1.4x-2.5x speedup for f16 operations

Control Flow Simplification

Before: Complex nested conditionals for ROPE types
After: Clean switch statement with delegated rotate_pairs<T>() calls

switch (mode) {
  case GGML_ROPE_TYPE_NORMAL: rotate_pairs<T>(n_dims, 1, cache, src, dst_data, 1); break;
  case GGML_ROPE_TYPE_NEOX:   rotate_pairs<T>(n_dims, n_dims/2, cache, src, dst_data); break;
  case GGML_ROPE_TYPE_MROPE:  rotate_pairs<T>(n_dims, n_dims/2, cache, src, dst_data); break;
  case GGML_ROPE_TYPE_VISION: rotate_pairs<T>(ne0, n_dims, cache, src, dst_data); break;
}

Action Items for Performance Optimization

Build System Optimizations

Enable Link-Time Optimization (LTO)
```
set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
```
- Benefit: Maximize template inlining and cross-module optimization
Compiler Optimization Flags
```
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -march=native -mtune=native")
```
- Benefit: Enable aggressive optimization and CPU-specific instructions
Template Specialization Verification
- Action: Ensure template instantiation occurs at compile time
- Benefit: Eliminate runtime dispatch overhead

Code-Level Optimizations

SIMD Intrinsics Integration
- Target: rotate_pairs<T>() function in ROPE operations
- Implementation: Add explicit vectorization for trigonometric operations
- Expected Gain: 2x-4x improvement in ROPE computation
Memory Alignment Optimization
```
alignas(32) float cache[ne0 + CACHE_LINE_SIZE_F32];
```
- Benefit: Improve cache line utilization in ROPE operations
Branch Prediction Optimization
- Target: ROPE mode switch statement
- Implementation: Reorder cases by frequency of use
- Benefit: Reduce branch misprediction penalties

Performance Monitoring Focus Areas

ROPE Operation Latency
- Metric: Track ggml_compute_forward_rope_flt<T>() execution time
- Baseline: Current 18 μs response time
Template Instantiation Overhead
- Metric: Compare f32 vs f16 template performance
- Target: Verify expected 1.4x-2.5x speedup materialization
Memory Bandwidth Utilization
- Focus: Cache efficiency in rotate_pairs<T>() operations
- Metric: L1/L2 cache miss rates during ROPE computation

Conclusion

The ROPE optimization provides significant code quality improvements through template consolidation while maintaining stable performance across all critical inference functions. The zero impact on llama_decode, llama_encode, and llama_tokenize ensures no regression in tokens per second or overall inference performance. The template-based approach creates opportunities for future SIMD optimizations and compiler-level improvements that could materialize the 1.4x-2.5x performance gains demonstrated in the PR benchmarks.

angt and others added 30 commits October 1, 2025 20:22

ci: Properly install rocwmma for hip builds (#16305)

1fe4e38

* CI: Properly install rocwmma for hip builds on windows we now windows install rocwmma from ubuntu pacakges * CI: update linux rocm docker build to use rocm 7.0

CI: reenable cdna in rocm docker builds (#16376)

c8dedc9

HIP: add IMbackK to codeowner (#16375)

95ce098

ci : fix clean-up of old logs (#16381)

bbd32bc

ci: update vulkan ci (#16294)

f09aefa

ci : fix ubuntu-latest-cmake-rpc (disable ccache) (#16388)

72ee736

musa: update compile flags (#16265)

91a2a56

Signed-off-by: Xiaodong Ye <[email protected]>

test-barrier : do not use more threads than physically available (#16…

d64c810

…389) * do not use more threads than physically available * ensure n_threads > 0 Co-authored-by: Jeff Bolz <[email protected]> --------- Co-authored-by: Jeff Bolz <[email protected]>

fix: track viewportHeight via window.innerHeight to avoid unwanted sc…

5113efd

…rolling (#16356) Use <svelte:window bind:innerHeight> instead of manual resize listener Co-authored-by: Aleksander Grygier <[email protected]>

webui : Fix messages payload sent to chat completions (#16402)

136bda7

* fix: Include just the currently active message branches instead of all in chat completions request * chore: Build webui static output * chore: Formatting * chore: update webui build output

vulkan: in flash attention, bounds check against nem1 (don't rely on …

e308efd

…GGML_KQ_MASK_PAD) (#16316)

Capture model name only after first token (streaming) or completed re…

7723327

…quest (#16405) * feat: Capture model name only after first token (streaming) or completed request (non-streaming) * chore: update webui build output * chore: update webui build output

vulkan: Fix FA coopmat1 invalid array indexing (#16365)

0e1f838

When computing sinks, the cm1 shader was looping r from 0 to Br rather than to rows_per_thread. I must have copied this from the scalar path (where it is correct), and somehow it wasn't causing failures on current drivers.

Fix missing messages on sibling navigation (#16408)

84c8e30

* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs * chore: update webui build output * chore: update webui build output

ggml : fix graph reallocation with multiple chunks (#16396)

638d330

reallocation is needed if a single chunk grows in size, even if total allocation size stays the same or is lower

llama : fix shapes for bert/mpt q/k norm (#16409)

946f71e

metal : fix loop bound in ggml_mem_ranges (#16412)

606a73f

chat : support Magistral thinking (#16413)

128d522

* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls * feat: new flow in the chat template test suite for Magistral

rpc : check src buffer when copying tensor (#16421)

f392839

Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.

duduta added 4 commits October 31, 2025 18:32

templateify ggml_compute_forward_rope_f32 and _f16

29fbf41

abort when rope type not supported, remove GLM from test-rope

71efb38

add imrope branch to switch

2cfa7ae

add rope tests for perf

2a6c387

DajanaV temporarily deployed to PROD__AL_DEMO October 31, 2025 19:04 — with GitHub Actions Inactive

DajanaV force-pushed the main branch 20 times, most recently from b655780 to 94ec54d Compare November 3, 2025 20:09

DajanaV closed this Nov 3, 2025

DajanaV force-pushed the main branch from 94ec54d to 92c0c2f Compare November 3, 2025 23:53

DajanaV mentioned this pull request Nov 18, 2025

UPSTREAM PR #17342: Throughput improvement for small batch sizes #248

Open

loci-dev mentioned this pull request Mar 21, 2026

UPSTREAM PR #17342: Throughput improvement for small batch sizes #1279

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #16805: ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16#28

UPSTREAM PR #16805: ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16#28
DajanaV wants to merge 6909 commits intomainfrom
upstream-PR16805-branch_duduta-refactor-rope

DajanaV commented Oct 31, 2025

Uh oh!

loci-review bot commented Oct 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

DajanaV commented Oct 31, 2025

Uh oh!

loci-review bot commented Oct 31, 2025

Performance Analysis Summary: LLaMA.cpp ROPE Optimization

Critical Function Performance Impact

Core Inference Functions - No Performance Changes

Model Loading and Batch Processing - Stable Performance

Key Performance Indicators Impact Analysis

1. Tokens Per Second - No Impact

2. Power Consumption - Negligible Change

3. Quantization Efficiency - No Impact

4. Memory Usage - No Direct Impact

5. Batch Processing - No Impact

ROPE Function Optimization Analysis

Template Consolidation Benefits

Control Flow Simplification

Action Items for Performance Optimization

Build System Optimizations

Code-Level Optimizations

Performance Monitoring Focus Areas

Conclusion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants