
ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 #16805

Merged
ggerganov merged 7 commits into ggml-org:master from duduta:refactor-rope on Nov 11, 2025

Conversation

duduta (Contributor) commented Oct 27, 2025

This PR is a small refactoring of ggml_compute_forward_rope_f32 and _f16: I extracted the rotate_pairs logic to remove some duplicated code.

I also kept the current behavior for unsupported rope types: they silently default to the NORMAL type (consecutive values form the rotation pairs).
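The extracted helper can be sketched roughly like this. The name rotate_pairs comes from the PR, but the signature, the cache layout (interleaved cos/sin per pair), and the (step, offset) parameterization are illustrative, not the exact code merged into ggml/src/ggml-cpu/ops.cpp; plain casts stand in for the GGML_FP16_TO_FP32/GGML_FP32_TO_FP16 conversions the real f16 path needs:

```cpp
#include <cassert>

// Illustrative sketch of a templated pair-rotation helper.
// Each pair (x0, x1) is rotated by the angle whose cos/sin are
// precomputed in `cache` (interleaved: cache[2k] = cos, cache[2k+1] = sin).
// The pair layout is parameterized:
//   NORMAL rope: consecutive values  -> step = 2, offset = 1
//   NEOX rope:   split-half pairing  -> step = 1, offset = n_dims/2
template <typename T>
static void rotate_pairs(int n_pairs, int step, int offset,
                         const float * cache, const T * src, T * dst) {
    for (int k = 0; k < n_pairs; k++) {
        const float cos_theta = cache[2*k + 0];
        const float sin_theta = cache[2*k + 1];

        const int i = k*step;        // first element of the pair
        const int j = i + offset;    // second element of the pair

        const float x0 = (float) src[i];
        const float x1 = (float) src[j];

        dst[i] = (T)(x0*cos_theta - x1*sin_theta);
        dst[j] = (T)(x0*sin_theta + x1*cos_theta);
    }
}
```

Templating over T lets the f32 and f16 kernels share one loop body, with the type conversions confined to the loads and stores; the compiler instantiates a separate specialized copy per type, so there is no runtime dispatch cost.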

Later edit: I added some performance tests to test-backend-ops. Here is the output from compare-llama-bench.py:

| Backend | GGML op | Op parameters | Bandwidth (GB/s) master | Bandwidth (GB/s) refactor-rope | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| CPU | ROPE | type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 0.77 | 1.11 | 1.45 |
| CPU | ROPE | type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 2.19 | 3.29 | 1.50 |
| CPU | ROPE | type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 0.78 | 1.11 | 1.43 |
| CPU | ROPE | type=f16,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 2.06 | 2.98 | 1.44 |
| CPU | ROPE | type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 0.97 | 1.60 | 1.66 |
| CPU | ROPE | type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 4.49 | 7.38 | 1.64 |
| CPU | ROPE | type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 0.91 | 1.50 | 1.66 |
| CPU | ROPE | type=f16,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 4.37 | 6.61 | 1.52 |
| CPU | ROPE | type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 1.55 | 3.41 | 2.20 |
| CPU | ROPE | type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 7.29 | 15.90 | 2.18 |
| CPU | ROPE | type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 1.55 | 3.38 | 2.18 |
| CPU | ROPE | type=f16,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 6.80 | 15.33 | 2.26 |
| CPU | ROPE | type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 1.81 | 4.48 | 2.47 |
| CPU | ROPE | type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 8.60 | 21.22 | 2.47 |
| CPU | ROPE | type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 1.75 | 4.38 | 2.51 |
| CPU | ROPE | type=f16,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 8.53 | 18.89 | 2.21 |
| CPU | ROPE | type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 0.89 | 1.41 | 1.58 |
| CPU | ROPE | type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 4.28 | 6.82 | 1.59 |
| CPU | ROPE | type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 0.82 | 1.34 | 1.64 |
| CPU | ROPE | type=f16,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 4.08 | 6.08 | 1.49 |
| CPU | ROPE | type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 0.76 | 1.13 | 1.49 |
| CPU | ROPE | type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 2.30 | 3.68 | 1.60 |
| CPU | ROPE | type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 0.77 | 1.18 | 1.53 |
| CPU | ROPE | type=f16,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 2.20 | 3.34 | 1.52 |
| CPU | ROPE | type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 3.13 | 4.43 | 1.42 |
| CPU | ROPE | type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 14.47 | 20.34 | 1.41 |
| CPU | ROPE | type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.91 | 4.36 | 1.50 |
| CPU | ROPE | type=f16,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 14.36 | 19.06 | 1.33 |
| CPU | ROPE | type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 2.65 | 2.60 | 0.98 |
| CPU | ROPE | type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 7.02 | 8.08 | 1.15 |
| CPU | ROPE | type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.20 | 2.30 | 1.05 |
| CPU | ROPE | type=f32,ne_a=[128,12,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 6.46 | 6.82 | 1.06 |
| CPU | ROPE | type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 3.93 | 4.11 | 1.05 |
| CPU | ROPE | type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 16.07 | 18.33 | 1.14 |
| CPU | ROPE | type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.94 | 3.36 | 1.14 |
| CPU | ROPE | type=f32,ne_a=[128,12,512,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 13.88 | 14.59 | 1.05 |
| CPU | ROPE | type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 10.30 | 11.87 | 1.15 |
| CPU | ROPE | type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 46.39 | 54.29 | 1.17 |
| CPU | ROPE | type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 8.30 | 10.79 | 1.30 |
| CPU | ROPE | type=f32,ne_a=[128,32,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 36.20 | 43.39 | 1.20 |
| CPU | ROPE | type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 14.87 | 17.91 | 1.20 |
| CPU | ROPE | type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 61.97 | 71.31 | 1.15 |
| CPU | ROPE | type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 11.76 | 15.95 | 1.36 |
| CPU | ROPE | type=f32,ne_a=[128,64,512,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 50.30 | 62.62 | 1.24 |
| CPU | ROPE | type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 3.47 | 3.63 | 1.05 |
| CPU | ROPE | type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 14.31 | 17.21 | 1.20 |
| CPU | ROPE | type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.59 | 3.01 | 1.16 |
| CPU | ROPE | type=f32,ne_a=[64,8,512,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 11.25 | 13.20 | 1.17 |
| CPU | ROPE | type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 2.75 | 3.05 | 1.11 |
| CPU | ROPE | type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 7.53 | 8.99 | 1.19 |
| CPU | ROPE | type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 2.25 | 2.59 | 1.15 |
| CPU | ROPE | type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 6.87 | 7.71 | 1.12 |
| CPU | ROPE | type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0 | 11.45 | 12.20 | 1.07 |
| CPU | ROPE | type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0 | 46.83 | 52.90 | 1.13 |
| CPU | ROPE | type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0 | 9.31 | 10.61 | 1.14 |
| CPU | ROPE | type=f32,ne_a=[80,32,512,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0 | 38.95 | 43.93 | 1.13 |

github-actions bot added the "testing" (Everything test related) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Oct 27, 2025
ggerganov (Member) left a comment:

Can you demonstrate the performance is preserved?

```cpp
            break;
        default:
            // rope type not supported, silently default to NORMAL
            rotate_pairs<T>(n_dims, 1, cache, src, dst_data, 1);
```
ggerganov (Member):
Isn't it better to GGML_ABORT here?

duduta (Contributor, Author):

I thought so too. I was unsure because I saw that test-rope.cpp tests an unsupported (perhaps not yet implemented?) rope type. Shall I change both?

ggerganov (Member):

The GLM rope type was removed, so we should remove it from test-rope as well.

duduta (Contributor, Author):

Ok, will do that.
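The change agreed on above (abort on unsupported rope types instead of silently falling back to NORMAL) can be sketched as follows. This is an illustration, not the merged code: rope_pair_layout is a hypothetical helper, the mode constants mirror ggml's GGML_ROPE_TYPE_* values, and an exception stands in for GGML_ABORT (which terminates the process) so the rejection is observable:

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical dispatch: supported rope modes map to a pair layout
// (step, offset) for the rotation loop; anything else is rejected loudly.
// In ggml the default branch would call GGML_ABORT(...) instead of throwing.
static std::pair<int, int> rope_pair_layout(int mode, int n_dims) {
    switch (mode) {
        case 0: /* NORMAL */ return {2, 1};          // consecutive pairs
        case 2: /* NEOX   */ return {1, n_dims/2};   // split-half pairs
        default:
            throw std::runtime_error("rope type not supported");
    }
}
```

Failing fast here turns a silent wrong-result bug into an immediate, debuggable crash when a new rope mode is added without a CPU implementation.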

duduta (Contributor, Author) commented Oct 28, 2025

@ggerganov It seems there is actually a performance improvement, which I didn't expect (see my updated first comment). As a note, I just copy-pasted the default test_rope cases from test-backend-ops; I still don't know how to choose real-life relevant dimensions for the perf tests.
Later edit: I selected only a few cases, covering different attention head counts and head dimensions, and bumped the sequence length to 512 in the tensor shape.

@ggerganov ggerganov merged commit 73460f6 into ggml-org:master Nov 11, 2025
68 of 69 checks passed
@duduta duduta deleted the refactor-rope branch November 11, 2025 12:46
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
…g#16805)

* extract rotate_pairs logic from ggml_compute_forward_rope_f32

* templateify ggml_compute_forward_rope_f32 and _f16

* abort when rope type not supported, remove GLM from test-rope

* add imrope branch to switch

* add rope tests for perf

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026