
ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 #16805

Merged
ggerganov merged 7 commits into ggml-org:master from duduta:refactor-rope on Nov 11, 2025

Conversation

duduta (Contributor) commented Oct 27, 2025

This PR is a small refactoring of ggml_compute_forward_rope_f32 and _f16: I extracted the rotate_pairs logic to remove some duplicated code.

I also kept the current behavior for unsupported rope types: they silently default to the NORMAL type (consecutive values form the rotation pairs).
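The extracted helper can be sketched roughly like this. The name rotate_pairs comes from the PR, but the signature, the cache layout (interleaved cos/sin per pair), and the (step, offset) parameterization are illustrative, not the exact code merged into ggml/src/ggml-cpu/ops.cpp; plain casts stand in for the GGML_FP16_TO_FP32/GGML_FP32_TO_FP16 conversions the real f16 path needs:

```cpp
#include <cassert>

// Illustrative sketch of a templated pair-rotation helper.
// Each pair (x0, x1) is rotated by the angle whose cos/sin are
// precomputed in `cache` (interleaved: cache[2k] = cos, cache[2k+1] = sin).
// The pair layout is parameterized:
//   NORMAL rope: consecutive values  -> step = 2, offset = 1
//   NEOX rope:   split-half pairing  -> step = 1, offset = n_dims/2
template <typename T>
static void rotate_pairs(int n_pairs, int step, int offset,
                         const float * cache, const T * src, T * dst) {
    for (int k = 0; k < n_pairs; k++) {
        const float cos_theta = cache[2*k + 0];
        const float sin_theta = cache[2*k + 1];

        const int i = k*step;        // first element of the pair
        const int j = i + offset;    // second element of the pair

        const float x0 = (float) src[i];
        const float x1 = (float) src[j];

        dst[i] = (T)(x0*cos_theta - x1*sin_theta);
        dst[j] = (T)(x0*sin_theta + x1*cos_theta);
    }
}
```

Templating over T lets the f32 and f16 kernels share one loop body, with the type conversions confined to the loads and stores; the compiler instantiates a separate specialized copy per type, so there is no runtime dispatch cost.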

Later edit: I added some performance tests to test-backend-ops. Here is the output from compare-llama-bench.py:

| Backend | GGML op | Op parameters | Bandwidth (GB/s) master | Bandwidth (GB/s) refactor-rope | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| CPU | ROPE | type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 0.77 | 1.11 | 1.45 |
| CPU | ROPE | type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 2.19 | 3.29 | 1.50 |
| CPU | ROPE | type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 0.78 | 1.11 | 1.43 |
| CPU | ROPE | type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 2.06 | 2.98 | 1.44 |
| CPU | ROPE | type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 0.97 | 1.60 | 1.66 |
| CPU | ROPE | type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 4.49 | 7.38 | 1.64 |
| CPU | ROPE | type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 0.91 | 1.50 | 1.66 |
| CPU | ROPE | type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 4.37 | 6.61 | 1.52 |
| CPU | ROPE | type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 1.55 | 3.41 | 2.20 |
| CPU | ROPE | type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 7.29 | 15.90 | 2.18 |
| CPU | ROPE | type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 1.55 | 3.38 | 2.18 |
| CPU | ROPE | type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 6.80 | 15.33 | 2.26 |
| CPU | ROPE | type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 1.81 | 4.48 | 2.47 |
| CPU | ROPE | type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 8.60 | 21.22 | 2.47 |
| CPU | ROPE | type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 1.75 | 4.38 | 2.51 |
| CPU | ROPE | type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 8.53 | 18.89 | 2.21 |
| CPU | ROPE | type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 0.89 | 1.41 | 1.58 |
| CPU | ROPE | type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 4.28 | 6.82 | 1.59 |
| CPU | ROPE | type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 0.82 | 1.34 | 1.64 |
| CPU | ROPE | type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 4.08 | 6.08 | 1.49 |
| CPU | ROPE | type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 0.76 | 1.13 | 1.49 |
| CPU | ROPE | type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 2.30 | 3.68 | 1.60 |
| CPU | ROPE | type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 0.77 | 1.18 | 1.53 |
| CPU | ROPE | type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 2.20 | 3.34 | 1.52 |
| CPU | ROPE | type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 3.13 | 4.43 | 1.42 |
| CPU | ROPE | type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 14.47 | 20.34 | 1.41 |
| CPU | ROPE | type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.91 | 4.36 | 1.50 |
| CPU | ROPE | type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 14.36 | 19.06 | 1.33 |
| CPU | ROPE | type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 2.65 | 2.60 | 0.98 |
| CPU | ROPE | type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 7.02 | 8.08 | 1.15 |
| CPU | ROPE | type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.20 | 2.30 | 1.05 |
| CPU | ROPE | type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 6.46 | 6.82 | 1.06 |
| CPU | ROPE | type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 3.93 | 4.11 | 1.05 |
| CPU | ROPE | type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 16.07 | 18.33 | 1.14 |
| CPU | ROPE | type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.94 | 3.36 | 1.14 |
| CPU | ROPE | type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 13.88 | 14.59 | 1.05 |
| CPU | ROPE | type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 10.30 | 11.87 | 1.15 |
| CPU | ROPE | type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 46.39 | 54.29 | 1.17 |
| CPU | ROPE | type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 8.30 | 10.79 | 1.30 |
| CPU | ROPE | type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 36.20 | 43.39 | 1.20 |
| CPU | ROPE | type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 14.87 | 17.91 | 1.20 |
| CPU | ROPE | type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 61.97 | 71.31 | 1.15 |
| CPU | ROPE | type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 11.76 | 15.95 | 1.36 |
| CPU | ROPE | type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 50.30 | 62.62 | 1.24 |
| CPU | ROPE | type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 3.47 | 3.63 | 1.05 |
| CPU | ROPE | type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 14.31 | 17.21 | 1.20 |
| CPU | ROPE | type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.59 | 3.01 | 1.16 |
| CPU | ROPE | type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 11.25 | 13.20 | 1.17 |
| CPU | ROPE | type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 2.75 | 3.05 | 1.11 |
| CPU | ROPE | type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 7.53 | 8.99 | 1.19 |
| CPU | ROPE | type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.25 | 2.59 | 1.15 |
| CPU | ROPE | type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 6.87 | 7.71 | 1.12 |
| CPU | ROPE | type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 11.45 | 12.20 | 1.07 |
| CPU | ROPE | type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 46.83 | 52.90 | 1.13 |
| CPU | ROPE | type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 9.31 | 10.61 | 1.14 |
| CPU | ROPE | type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 38.95 | 43.93 | 1.13 |

github-actions bot added the "testing" (Everything test related) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Oct 27, 2025
ggerganov (Member) left a comment:

Can you demonstrate the performance is preserved?

```cpp
            break;
        default:
            // rope type not supported, silently default to NORMAL
            rotate_pairs<T>(n_dims, 1, cache, src, dst_data, 1);
```
ggerganov (Member):
Isn't it better to GGML_ABORT here?

duduta (Contributor, Author):

I thought so too. I was unsure because I saw that test-rope.cpp tests an unsupported (perhaps not yet implemented?) rope type. Shall I change both?

ggerganov (Member):

The GLM rope type was removed, so we should remove it from test-rope as well.

duduta (Contributor, Author):

Ok, will do that.
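The change agreed on above (abort on unsupported rope types instead of silently falling back to NORMAL) can be sketched as follows. This is an illustration, not the merged code: rope_pair_layout is a hypothetical helper, the mode constants mirror ggml's GGML_ROPE_TYPE_* values, and an exception stands in for GGML_ABORT (which terminates the process) so the rejection is observable:

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical dispatch: supported rope modes map to a pair layout
// (step, offset) for the rotation loop; anything else is rejected loudly.
// In ggml the default branch would call GGML_ABORT(...) instead of throwing.
static std::pair<int, int> rope_pair_layout(int mode, int n_dims) {
    switch (mode) {
        case 0: /* NORMAL */ return {2, 1};          // consecutive pairs
        case 2: /* NEOX   */ return {1, n_dims/2};   // split-half pairs
        default:
            throw std::runtime_error("rope type not supported");
    }
}
```

Failing fast here turns a silent wrong-result bug into an immediate, debuggable crash when a new rope mode is added without a CPU implementation.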

duduta (Contributor, Author) commented Oct 28, 2025

@ggerganov It seems there is actually a performance improvement, which I didn't expect (see my updated first comment). As a note, I just copy-pasted the default test_rope cases from test-backend-ops; I still don't know how to choose real-life relevant dimensions for the perf tests.
Later edit: I selected only a few cases, covering different attention head counts and head dimensions, and bumped the sequence length to 512 in the tensor shape.

@ggerganov ggerganov merged commit 73460f6 into ggml-org:master Nov 11, 2025
68 of 69 checks passed
@duduta duduta deleted the refactor-rope branch November 11, 2025 12:46
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
…g#16805)

* extract rotate_pairs logic from ggml_compute_forward_rope_f32

* templateify ggml_compute_forward_rope_f32 and _f16

* abort when rope type not supported, remove GLM from test-rope

* add imrope branch to switch

* add rope tests for perf

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026