Migrate Qwen3.5 to IMROPE #19443

Closed
pwilkin wants to merge 1 commit into ggml-org:master from pwilkin:qwen35-imrope
Member

@pwilkin pwilkin commented Feb 9, 2026

As @ngxson rightly noticed, Qwen3.5 actually inherits from Qwen3VL, not from Qwen3Next in terms of RoPE, so we need IMROPE and not NEOX.

@pwilkin pwilkin requested a review from ngxson February 9, 2026 00:01
@pwilkin pwilkin requested a review from CISC as a code owner February 9, 2026 00:01
Member Author

pwilkin commented Feb 9, 2026

Throwing this out there and going to sleep, will take a look tomorrow.

Contributor

ngxson commented Feb 9, 2026

Technically, IMROPE and MROPE are both equivalent to NEOX for text-only input, so at least there is no problem with text input.

Btw, I would appreciate it if you could try permuting IMROPE --> MROPE in the conversion script. Otherwise IMROPE may decrease perf in some cases. For reference, the layout of IMROPE vs MROPE:

if (is_imrope) { // qwen3vl apply interleaved mrope
    if (sector % 3 == 1 && sector < 3 * sections[1]) {
        theta = theta_h;
    } else if (sector % 3 == 2 && sector < 3 * sections[2]) {
        theta = theta_w;
    } else if (sector % 3 == 0 && sector < 3 * sections[0]) {
        theta = theta_t;
    } else {
        theta = theta_e;
    }

From what I understand, given an example input position layout [t, t, t, x, x, y, y] (with t = time dimension), the IMROPE layout will be: [t, x, y, t, x, y, t]

Member Author

pwilkin commented Feb 9, 2026

@ngxson will try tomorrow (unless they kill me at work).

@github-actions github-actions bot added the model Model specific label Feb 9, 2026
Member Author

pwilkin commented Feb 9, 2026

@ngxson I admit I'm feeling a bit out of my league here, but Opus claims that it can't be done:

IMROPE vs MROPE: Analysis and Equivalence

Background

Qwen3.5 uses IMROPE (Interleaved Multi-dimensional Rotary Position Embedding),
inherited from Qwen3VL. The IMROPE kernel is reportedly slower than MROPE,
possibly due to modulo operations in the index mapping. This document analyzes
whether MROPE can be used instead, and whether weight permutation can achieve
exact equivalence.

How RoPE Works (Quick Recap)

RoPE rotates pairs of elements (Q[2j], Q[2j+1]) by an angle:

theta_j = position * freq(j)

where freq(j) = theta_scale^j and theta_scale = rope_theta^(-2/n_rot). The
frequency is determined by the pair index j.
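
A minimal sketch of the recap above, assuming the usual RoPE definition theta_scale = rope_theta^(-2/n_rot). The names rope_freq and rope_rotate_pair are made up for illustration and are not ggml functions.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Frequency multiplier of pair j: theta_scale^j, theta_scale = rope_theta^(-2/n_rot).
// Hypothetical helper, not part of ggml.
double rope_freq(int j, int n_rot, double rope_theta) {
    assert(n_rot > 0 && n_rot % 2 == 0 && j >= 0 && j < n_rot / 2);
    const double theta_scale = std::pow(rope_theta, -2.0 / n_rot);
    return std::pow(theta_scale, j);
}

// Rotate one (x0, x1) pair by angle = pos * freq.
std::pair<double, double> rope_rotate_pair(double x0, double x1, double pos, double freq) {
    const double angle = pos * freq;
    const double c = std::cos(angle);
    const double s = std::sin(angle);
    return { x0 * c - x1 * s, x0 * s + x1 * c };
}
```

Pair 0 always has frequency 1, and higher pair indices rotate more slowly for the same position.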

MROPE and IMROPE

Both MROPE and IMROPE extend RoPE to multiple position dimensions (T=time,
H=height, W=width, E=extra) using a sections array (e.g., [11, 11, 10, 0]).
Each pair index j is assigned to one position dimension. The angle becomes:

theta_j = pos[dim(j)] * freq(j)

where dim(j) depends on the rope type.

MROPE: Chunked Layout

Assigns contiguous chunks of pair indices to each dimension:

sections = [11, 11, 10]
Pair indices:  [0-10] → T,  [11-21] → H,  [22-31] → W
Layout:        TTTTTTTTTTT HHHHHHHHHHH WWWWWWWWWW

IMROPE: Interleaved Layout

Assigns pair indices in round-robin fashion:

sections = [11, 11, 10]
Pair indices:  0→T, 1→H, 2→W, 3→T, 4→H, 5→W, 6→T, ...
Layout:        THWTHWTHWTHWTHWTHWTHWTHWTHWTHW TH

(W runs out after 10×3=30 slots, remaining 2 slots get T, H)
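
The two layouts can be reproduced with a small sketch that mirrors the branch structure of the kernel snippet quoted in this thread. The helper names rope_dim and rope_layout are hypothetical, chosen only for this example.

```cpp
#include <array>
#include <cassert>
#include <string>

// Dimension tag ('T', 'H', 'W', 'E') of pair index `sector`.
// `sections` holds the pair counts for T, H, W, E.
char rope_dim(int sector, const std::array<int, 4> & sections, bool is_imrope) {
    assert(sector >= 0);
    if (is_imrope) {
        // interleaved: round-robin over T, H, W while the section still has room
        if (sector % 3 == 1 && sector < 3 * sections[1]) return 'H';
        if (sector % 3 == 2 && sector < 3 * sections[2]) return 'W';
        if (sector % 3 == 0 && sector < 3 * sections[0]) return 'T';
        return 'E';
    }
    // chunked: contiguous ranges per dimension
    if (sector < sections[0])                             return 'T';
    if (sector < sections[0] + sections[1])               return 'H';
    if (sector < sections[0] + sections[1] + sections[2]) return 'W';
    return 'E';
}

// Concatenate the tags of all pair indices into a layout string.
std::string rope_layout(const std::array<int, 4> & sections, bool is_imrope) {
    const int n_pairs = sections[0] + sections[1] + sections[2] + sections[3];
    std::string out;
    for (int j = 0; j < n_pairs; ++j) {
        out += rope_dim(j, sections, is_imrope);
    }
    return out;
}
```

With sections = {11, 11, 10, 0}, rope_layout(sections, false) gives the chunked string and rope_layout(sections, true) gives the interleaved one ending in "TH".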

Implementation (ggml)

From ggml/src/ggml-cpu/ops.cpp, the ggml_mrope_cache_init function:

int sector = (i0 / 2) % sect_dims;

if (is_imrope) {
    if      (sector % 3 == 1 && sector < 3 * sections[1]) theta = theta_h;
    else if (sector % 3 == 2 && sector < 3 * sections[2]) theta = theta_w;
    else if (sector % 3 == 0 && sector < 3 * sections[0]) theta = theta_t;
    else                                                   theta = theta_e;
} else {
    if      (sector < sections[0])                         theta = theta_t;
    else if (sector < sections[0] + sections[1])           theta = theta_h;
    else if (sector < sections[0] + sections[1] + sections[2]) theta = theta_w;
    else                                                   theta = theta_e;
}

Crucially, all four thetas advance together at every iteration:

theta_t *= theta_scale;
theta_h *= theta_scale;
theta_w *= theta_scale;
theta_e *= theta_scale;

Each theta starts from its own position value (theta_t = pos_T, theta_h = pos_H,
theta_w = pos_W, theta_e = pos_E) and is multiplied by the same theta_scale at
every step, so at pair index j:

theta_d = pos_d * theta_scale^j    for d in {t, h, w, e}

The frequency factor theta_scale^j is identical across dimensions at any given
pair index. The only difference between IMROPE and MROPE is which position value
(pos_T, pos_H, pos_W) multiplies this shared frequency.

Text-Only Equivalence

For text-only input, all position dimensions have the same value:

pos_T = pos_H = pos_W = pos_text

Therefore:

theta_j = pos_text * theta_scale^j

This is identical regardless of which dimension is assigned to pair j. IMROPE
and MROPE produce bit-identical results for text-only input. No weight changes
are needed; just switch the rope type.

This was verified empirically: switching Qwen3.5 from IMROPE to MROPE produces
identical NMSE values (Dense: 8.94e-06, MoE: 9.36e-05).
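
The text-only argument can be checked with a simplified sketch of the angle selection. It assumes each running theta starts at its own position and is scaled by a shared theta_scale, and it leaves out the frequency scaling and YaRN handling of the real kernel. mrope_angles is a hypothetical name, not the ggml function.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Angle for every pair index of one token.
// pos = {pos_T, pos_H, pos_W, pos_E}; sections = pair counts per dimension.
std::vector<float> mrope_angles(const std::array<float, 4> & pos,
                                const std::array<int, 4> & sections,
                                int n_pairs, float theta_scale, bool is_imrope) {
    const int sect_dims = sections[0] + sections[1] + sections[2] + sections[3];
    assert(sect_dims > 0 && n_pairs >= 0);

    // every theta starts at its own position and shares the same scale factor
    float theta_t = pos[0], theta_h = pos[1], theta_w = pos[2], theta_e = pos[3];

    std::vector<float> angles;
    angles.reserve(n_pairs);
    for (int j = 0; j < n_pairs; ++j) {
        const int sector = j % sect_dims;
        float theta = theta_e; // fallback when no T/H/W branch matches
        if (is_imrope) {
            if      (sector % 3 == 1 && sector < 3 * sections[1]) theta = theta_h;
            else if (sector % 3 == 2 && sector < 3 * sections[2]) theta = theta_w;
            else if (sector % 3 == 0 && sector < 3 * sections[0]) theta = theta_t;
        } else {
            if      (sector < sections[0])                             theta = theta_t;
            else if (sector < sections[0] + sections[1])               theta = theta_h;
            else if (sector < sections[0] + sections[1] + sections[2]) theta = theta_w;
        }
        angles.push_back(theta);

        theta_t *= theta_scale;
        theta_h *= theta_scale;
        theta_w *= theta_scale;
        theta_e *= theta_scale;
    }
    return angles;
}
```

When pos_T, pos_H and pos_W are equal, both modes return the same vector. As soon as they differ (image tokens), the vectors diverge from pair 1 onward.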

Why Weight Permutation Cannot Achieve General Equivalence

The Approach

Bake a permutation P into Q/K weights so that element pair j moves to position
P(j), choosing P such that mrope_dim(P(j)) = imrope_dim(j) (the MROPE
dimension assignment at the new position matches IMROPE's at the old position).

The Problem

After permutation, the element originally at pair j is at position P(j) and
gets rotated by:

theta = pos[mrope_dim(P(j))] * freq(P(j))
      = pos[imrope_dim(j)]   * freq(P(j))    ← correct dimension

But the desired rotation was:

theta = pos[imrope_dim(j)] * freq(j)          ← correct frequency

Since P(j) ≠ j, we have freq(P(j)) ≠ freq(j). Example with sections=[11,11,10]:

| Original pair | IMROPE dim | Permuted to | MROPE dim | Freq match? |
|---|---|---|---|---|
| j=0 | T | P(0)=0 | T | freq(0)=freq(0) ✓ |
| j=1 | H | P(1)=11 | H | freq(11)≠freq(1) ✗ |
| j=2 | W | P(2)=22 | W | freq(22)≠freq(2) ✗ |
| j=3 | T | P(3)=1 | T | freq(1)≠freq(3) ✗ |

Can We Compensate in the Weights?

Pre-rotation: Bake a fixed rotation angle alpha_j into each pair:

total_angle = alpha_j + pos * freq(P(j))
desired     =           pos * freq(j)
→  alpha_j  = pos * (freq(j) - freq(P(j)))

alpha_j must be a constant (baked into weights), but pos varies per token.
No fixed alpha_j works.

General linear transform: Apply a fixed 2×2 matrix M_j per pair:

M_j @ Rotate(pos * freq(P(j))) = Rotate(pos * freq(j))
→  M_j = Rotate(pos * (freq(j) - freq(P(j))))    ← depends on pos

Again position-dependent — cannot be baked into weights.

Fundamental Reason

The correction freq(j) - freq(P(j)) is a property of the pair indices, but
it must be multiplied by pos (which varies per token) to get the angle
correction. No fixed weight transformation can compensate for a
position-dependent angle difference.
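
To make the mismatch concrete, here is a sketch that builds the permutation P for a given sections array and evaluates the resulting angle error pos * (freq(P(j)) - freq(j)). It assumes freq(j) = theta_scale^j; all function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Dimension index (0=T, 1=H, 2=W, 3=E) of a pair under either layout.
int rope_dim_index(int sector, const std::array<int, 4> & s, bool is_imrope) {
    if (is_imrope) {
        if (sector % 3 == 1 && sector < 3 * s[1]) return 1;
        if (sector % 3 == 2 && sector < 3 * s[2]) return 2;
        if (sector % 3 == 0 && sector < 3 * s[0]) return 0;
        return 3;
    }
    if (sector < s[0])               return 0;
    if (sector < s[0] + s[1])        return 1;
    if (sector < s[0] + s[1] + s[2]) return 2;
    return 3;
}

// P(j): send each IMROPE pair to the next free MROPE slot of the same dimension.
std::vector<int> imrope_to_mrope_perm(const std::array<int, 4> & s) {
    const int n_pairs = s[0] + s[1] + s[2] + s[3];
    std::array<int, 4> next = { 0, s[0], s[0] + s[1], s[0] + s[1] + s[2] };
    const std::array<int, 4> end = { s[0], s[0] + s[1], s[0] + s[1] + s[2], n_pairs };
    std::vector<int> perm(n_pairs);
    for (int j = 0; j < n_pairs; ++j) {
        const int d = rope_dim_index(j, s, true);
        assert(next[d] < end[d]); // only holds if both layouts give d the same pair count
        perm[j] = next[d]++;
    }
    return perm;
}

// Angle error after permutation: pos * (freq(P(j)) - freq(j)).
double permuted_angle_error(int j, int pj, double pos, double theta_scale) {
    return pos * (std::pow(theta_scale, pj) - std::pow(theta_scale, j));
}
```

For sections = {11, 11, 10, 0} this yields P(0)=0, P(1)=11, P(2)=22, P(3)=1. The error for j=1 is nonzero and doubles when pos doubles, which is why no constant baked into the weights can cancel it.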

Comparison with Normal ↔ NeoX RoPE Conversion

Normal ↔ NeoX RoPE conversion via weight permutation does work. The key
difference is what changes between the two formats.

Normal RoPE vs NeoX RoPE

Both use a single position dimension. The difference is which elements form a pair:

  • Normal (GPT-J): pairs consecutive elements: (q_{2j}, q_{2j+1})
  • NeoX: pairs first half with second half: (q_j, q_{j+n/2})

Example with n_rot=8, elements q0..q7:

Normal RoPE:                    NeoX RoPE:
Pair 0: (q0, q1)  freq(0)      Pair 0: (q0, q4)  freq(0)
Pair 1: (q2, q3)  freq(1)      Pair 1: (q1, q5)  freq(1)
Pair 2: (q4, q5)  freq(2)      Pair 2: (q2, q6)  freq(2)
Pair 3: (q6, q7)  freq(3)      Pair 3: (q3, q7)  freq(3)

Why It Works

The permutation [q0,q1,q2,q3,q4,q5,q6,q7] → [q0,q2,q4,q6,q1,q3,q5,q7]
rearranges elements so that NeoX sees:

Pair 0: (q0, q1)  freq(0)  ← same elements AND same freq as Normal pair 0
Pair 1: (q2, q3)  freq(1)  ← same elements AND same freq as Normal pair 1
Pair 2: (q4, q5)  freq(2)  ← same
Pair 3: (q6, q7)  freq(3)  ← same

Each pair keeps its pair index (and therefore its frequency). Only the
element arrangement within the head dimension changes.
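
A sketch of why this permutation works, using neutral names for the two pairing schemes: "adjacent" pairs (x[2j], x[2j+1]) and "split-half" pairs (x[j], x[j+n/2]); ggml's NEOX mode is the split-half one. The functions below are illustrative only and assume freq(j) = theta_scale^j.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Shared driver: rotate pair j = (x[i0(j)], x[i1(j)]) by pos * theta_scale^j.
template <typename IndexFn>
std::vector<double> rope_apply(const std::vector<double> & x, double pos,
                               double theta_scale, IndexFn idx) {
    const std::size_t n = x.size();
    assert(n % 2 == 0);
    std::vector<double> y(n);
    double theta = pos;
    for (std::size_t j = 0; j < n / 2; ++j) {
        const std::size_t i0 = idx(j, n, 0);
        const std::size_t i1 = idx(j, n, 1);
        const double c = std::cos(theta);
        const double s = std::sin(theta);
        y[i0] = x[i0] * c - x[i1] * s;
        y[i1] = x[i0] * s + x[i1] * c;
        theta *= theta_scale;
    }
    return y;
}

// Adjacent pairing: (x[2j], x[2j+1]).
std::vector<double> rope_adjacent(const std::vector<double> & x, double pos, double theta_scale) {
    return rope_apply(x, pos, theta_scale,
        [](std::size_t j, std::size_t, int k) { return 2 * j + k; });
}

// Split-half pairing: (x[j], x[j + n/2]).
std::vector<double> rope_split_half(const std::vector<double> & x, double pos, double theta_scale) {
    return rope_apply(x, pos, theta_scale,
        [](std::size_t j, std::size_t n, int k) { return j + k * (n / 2); });
}

// Permutation that keeps each pair at the same pair index j.
std::vector<double> adjacent_to_split_half(const std::vector<double> & x) {
    const std::size_t n = x.size();
    assert(n % 2 == 0);
    std::vector<double> out(n);
    for (std::size_t j = 0; j < n / 2; ++j) {
        out[j]         = x[2 * j];
        out[j + n / 2] = x[2 * j + 1];
    }
    return out;
}
```

Permuting first and applying split-half RoPE gives the same values as applying adjacent RoPE and permuting afterwards, because pair j keeps its index and therefore its frequency.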

Why IMROPE ↔ MROPE Is Different

Both IMROPE and MROPE already use the same pairing (NeoX-style). The
difference is which position dimension each pair index maps to. To fix the
dimension assignment, you must move entire pairs to different pair indices,
which changes the frequency. There is no within-pair rearrangement that can
fix a between-pair dimension assignment.

| Conversion | What differs | Permutation moves | Frequency preserved? |
|---|---|---|---|
| Normal ↔ NeoX | Which elements form a pair | Elements within pairs | ✓ Yes (pair index unchanged) |
| IMROPE ↔ MROPE | Which dimension per pair | Entire pairs to new indices | ✗ No (pair index changes) |

Recommendation for llama.cpp

For text-only Qwen3.5 models: use LLAMA_ROPE_TYPE_MROPE instead of
LLAMA_ROPE_TYPE_IMROPE. The results are identical and the kernel is faster.

If multimodal Qwen3.5 support is ever added (where position dimensions
differ), the IMROPE kernel would be required, or a runtime
permute→MROPE→unpermute approach could be explored (trading permutation cost
vs modulo cost in the kernel).

Contributor

JJJYmmm commented Feb 9, 2026

Agree, the rotation of channel i is r_i = pos[i] * freq[i]. If we want to change from thwthw... to ttt..hhh..www.., permuting the proj weights only changes the position of each channel, so the freq is mismatched. For example, move the second t from index 3 to index 1: the mrope kernel assigns it freq[1], but the right one is freq[3]. cc @ngxson

Contributor

JJJYmmm commented Feb 9, 2026

if modulo op is slower, how about the counter? 🧐

int mod3 = 0;
for (int64_t i0 = 0; i0 < ne0; i0 += 2) {
    int sector = (i0 / 2) % sect_dims;

    if (sector == 0) mod3 = 0; 

    if (is_imrope) {
        if (mod3 == 1 && sector < 3 * sections[1]) {
            theta = theta_h;
        } else if (mod3 == 2 && sector < 3 * sections[2]) {
            theta = theta_w;
        } else if (mod3 == 0 && sector < 3 * sections[0]) {
            theta = theta_t;
        } else {
            theta = theta_e;
        }
    }

    // ...

    if (++mod3 == 3) mod3 = 0;
}
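
A quick way to sanity-check the counter idea is to compare it against sector % 3 for every pair index, including the wrap-around at sect_dims. This is only a sketch with a hypothetical helper name; it says nothing about whether the counter is actually faster.

```cpp
#include <cassert>
#include <vector>

// Value of the running counter at each pair index, reset whenever sector wraps to 0.
std::vector<int> mod3_by_counter(int n_pairs, int sect_dims) {
    assert(n_pairs >= 0 && sect_dims > 0);
    std::vector<int> out;
    out.reserve(n_pairs);
    int mod3 = 0;
    for (int j = 0; j < n_pairs; ++j) {
        const int sector = j % sect_dims;
        if (sector == 0) mod3 = 0;  // restart the cycle at each section wrap
        out.push_back(mod3);        // should equal sector % 3
        if (++mod3 == 3) mod3 = 0;
    }
    return out;
}
```

Note that the loop still computes `% sect_dims` once per pair, so only the `% 3` operations are replaced.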

Contributor

ngxson commented Feb 9, 2026

@JJJYmmm hmm yeah right, I'll experiment to see what exactly the problem was. For now, at least test-backend-ops reports that imrope is significantly slower than mrope. I'll move this to an issue for further discussion.

Contributor

ngxson commented Feb 9, 2026

Alright, sorry, I realized that the test-backend-ops case is incorrect. The perf should be the same between imrope <> mrope, so no permutation is needed @JJJYmmm

Ref: #19464

Member Author

pwilkin commented Feb 9, 2026

Obsoleted by #19468

@pwilkin pwilkin closed this Feb 9, 2026