
UPSTREAM PR #17543: CANN: add support for partial RoPE and Vision mode#344

Open
loci-dev wants to merge 2 commits into main from upstream-PR17543-branch_noemotiovon-rope_dim
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17543

Add support for two important RoPE variants: partial rotation (rope_dims < ne0) and Vision mode rotation.

  1. Support for partial RoPE (rope_dims < ne0):

    • Split tensor into head (first rope_dims dimensions) and tail portions
    • Apply rotation only to head portion using RotaryPositionEmbedding operator
    • Copy unrotated tail portion directly from source to destination
    • Handle both contiguous and non-contiguous tensor layouts
  2. Support for Vision mode (GGML_ROPE_TYPE_VISION):

    • Set rope_dims = ne0 for Vision mode to rotate entire tensor
    • Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
    • No tail handling needed since entire tensor is rotated
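
The head/tail split in variant 1 can be sketched on the CPU as follows. This is a minimal illustration assuming NeoX-style pairing (element i rotates with element i + rope_dims/2); the actual rotation on the NPU is performed by the RotaryPositionEmbedding operator, not this code:

```cpp
#include <cmath>

// CPU sketch of partial RoPE on one contiguous F32 row: rotate only the
// first rope_dims elements, copy the tail (rope_dims..ne0) through unchanged.
// Pairing here follows the NeoX convention; the CANN operator's internal
// layout may differ.
void partial_rope_row(const float * src, float * dst, int ne0, int rope_dims,
                      int pos, float theta_base = 10000.0f) {
    const int half = rope_dims / 2;
    for (int i = 0; i < half; ++i) {
        const float theta = pos * std::pow(theta_base, -2.0f * i / rope_dims);
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const float x0 = src[i];
        const float x1 = src[i + half];
        dst[i]        = x0 * c - x1 * s;  // rotated head, first half of pair
        dst[i + half] = x0 * s + x1 * c;  // rotated head, second half of pair
    }
    for (int i = rope_dims; i < ne0; ++i) {
        dst[i] = src[i];                  // unrotated tail, copied verbatim
    }
}
```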

Implementation details:

  • Use has_tail flag to determine execution path: head/tail splitting when rope_dims < ne0, or full tensor rotation when rope_dims == ne0
  • Support both F32 and F16 data types with intermediate F32 conversion
  • Copy non-contiguous tensors to contiguous buffers before calling RotaryPositionEmbedding operator for compatibility
  • Improve cache invalidation logic to include rope_dims and indep_sects parameters
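
The has_tail decision above can be sketched as follows; rope_mode and the helper names are illustrative placeholders, not the backend's actual symbols:

```cpp
// Hedged sketch of the execution-path dispatch: Vision mode forces
// rope_dims = ne0, so the tail path is never taken there.
enum rope_mode { ROPE_MODE_NORMAL, ROPE_MODE_VISION };

int effective_rope_dims(rope_mode mode, int n_dims, int ne0) {
    // Vision mode rotates the entire row.
    return mode == ROPE_MODE_VISION ? ne0 : n_dims;
}

bool rope_has_tail(rope_mode mode, int n_dims, int ne0) {
    // Head/tail splitting is needed only when rotation is partial.
    return effective_rope_dims(mode, n_dims, ne0) < ne0;
}
```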

These enhancements enable the CANN backend to handle the RoPE configurations used in modern vision-language models and in models with partial rotation.
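
The contiguity handling mentioned above (copying non-contiguous tensors to contiguous buffers first) hinges on ggml's stride convention, where nb[i] is the byte stride of dimension i. A minimal check, sketched here rather than the backend's actual code:

```cpp
#include <cstddef>

// A 4-D tensor is contiguous when each byte stride equals the product of all
// lower dimension sizes times the element size (ggml row-major convention).
bool is_contiguous_4d(const long long ne[4], const size_t nb[4],
                      size_t type_size) {
    return nb[0] == type_size &&
           nb[1] == nb[0] * (size_t) ne[0] &&
           nb[2] == nb[1] * (size_t) ne[1] &&
           nb[3] == nb[2] * (size_t) ne[2];
}
```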

Make sure to read the contributing guidelines before submitting a PR

@loci-review

loci-review bot commented Nov 27, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #344

Analysis: This PR implements partial RoPE and vision mode support for the CANN backend across 3 files with 222 additions and 70 deletions. The changes modify the ggml_cann_rope function and related cache initialization logic in aclnn_ops.cpp, extend the ggml_cann_rope_cache structure in common.h, and update backend support logic in ggml-cann.cpp.

Performance Impact: No measurable performance changes detected. Power consumption analysis shows less than 0.001% variation across all binaries, with maximum absolute delta of 0.66 nJ in libllama.so. No functions show measurable changes in response time or throughput time between versions.

Inference Impact: No impact on tokens per second. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes. The modifications are isolated to CANN backend RoPE operations, which do not affect CPU-based tokenization or inference paths.

Code Changes: The implementation adds conditional logic for partial rotation (when rope_dims < ne0) by splitting tensors into head and tail portions. For F32 tensors, the head undergoes rotation via RotaryPositionEmbedding while the tail is copied directly. F16 tensors follow the same pattern with intermediate F32 conversion. Vision mode sets rope_dims = ne0 for full tensor rotation. The changes enable support for vision-language models without affecting existing full-rotation models, which bypass the new code path when has_tail == false.

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 9a74048 to af6127b on November 28, 2025 20:09
@loci-review

loci-review bot commented Nov 29, 2025

Explore the complete analysis inside the Version Insights

Performance Review Summary: PR #344 - CANN Backend Partial RoPE Support

Overview

PR #344 implements partial Rotary Position Embedding and Vision mode support in the CANN backend (ggml-cann library). The changes modify aclnn_ops.cpp (153 additions, 61 deletions) and ggml-cann.cpp (6 additions, 8 deletions) to enable head/tail tensor splitting for models where rope_dims < ne0.

Key Findings

Performance-Critical Function Impact

The modified ggml_cann_rope() function in aclnn_ops.cpp introduces conditional execution paths:

  • Full RoPE path (rope_dims == ne00): Execution remains unchanged with no performance delta
  • Partial RoPE path (rope_dims < ne00): Adds head buffer allocation, head-only rotation, head copy-back operation, and tail copy operation

For partial RoPE cases with typical attention dimensions (rope_dims=64, ne00=128, ne01=32, ne02=2048), the additional operations introduce approximately 160,000 ns of overhead per call from memory copy operations alone.
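
As a back-of-envelope check on that figure (tail_copy_bytes is a hypothetical helper, not backend code): the F32 tail for the quoted shape is 16 MiB per copy, and moving 16 MiB in ~160,000 ns corresponds to roughly 100 GB/s, a plausible device-memory copy rate.

```cpp
#include <cstddef>

// Size of the unrotated F32 tail that must be copied once per call:
// everything past rope_dims in dimension 0, across all rows.
long long tail_copy_bytes(long long rope_dims, long long ne00,
                          long long ne01, long long ne02) {
    const long long tail_elems = (ne00 - rope_dims) * ne01 * ne02;
    return tail_elems * (long long) sizeof(float);
}
// rope_dims=64, ne00=128, ne01=32, ne02=2048
//   -> 4,194,304 tail elements = 16,777,216 bytes (16 MiB)
```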

The aclnn_rope_cache_init() function signature change adds a rope_dims parameter, enabling correct cache sizing for partial rotation. Cache invalidation logic now includes a theta_scale_updated flag, improving cache correctness.

Inference Impact

Token Generation Rate: The changes affect only the CANN backend RoPE implementation within the GGML computation graph layer. The core inference functions llama_decode(), llama_encode(), and llama_tokenize() in the llama.cpp API layer are not modified. Token generation rate impact depends on:

  • Model architecture: Only models using partial RoPE on CANN backend are affected
  • Backend selection: CPU and other GPU backends remain unchanged
  • RoPE frequency: Impact scales with number of RoPE operations per token

For models using full RoPE or running on non-CANN backends, tokens per second remains unchanged.

Power Consumption

Power consumption analysis applies to binaries containing the modified CANN backend code. The additional copy operations in partial RoPE path increase cumulative execution time, resulting in higher power draw proportional to the throughput time increase. Binaries using full RoPE or non-CANN backends show no power consumption change.

@loci-dev loci-dev force-pushed the main branch 18 times, most recently from 333626d to 82b1c0b on December 1, 2025 19:10
@loci-dev loci-dev force-pushed the upstream-PR17543-branch_noemotiovon-rope_dim branch from e0d679c to 70c9ebc on December 3, 2025 02:14
@loci-review

loci-review bot commented Dec 3, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #344

Overview

This PR implements partial RoPE and Vision mode support in the CANN backend for Huawei Ascend NPUs. The changes modify RoPE operation handling in ggml-cann to support models where rotation applies to fewer dimensions than the tensor width (rope_dims < ne0). Analysis shows zero measurable performance impact on existing workloads, with all binaries maintaining identical execution profiles between versions.

Performance Metrics

Function-Level Changes: No functions show measurable changes in response time or throughput between base version 05f5e78f-eaba-4ef1-9f95-b27c0d1c042f and target version 42d74511-d7d3-4303-97b2-a631bb96353a.

Power Consumption: All 16 analyzed binaries show effectively zero change. The largest variation is 1 nanojoule in llama-tts (0.001% change), which falls within measurement noise. Core inference binaries libllama.so (194,075 nJ) and libggml.so (4,031 nJ) are unchanged.

Tokens Per Second Impact: No impact on inference throughput. The functions responsible for tokenization and inference (llama_decode, llama_encode, llama_tokenize) are not modified by this PR. Changes are isolated to the CANN backend RoPE implementation (ggml_cann_rope, aclnn_rope_cache_init), which only affects models running on Huawei Ascend hardware with CANN backend enabled.

Code Changes

The PR refactors ggml_cann_rope in aclnn_ops.cpp to handle partial rotation by splitting tensors into head (rotated) and tail (unrotated) portions. Key modifications include removing the n_dims == ne0 assertion, adding rope_dims parameter to cache initialization, and implementing tail copying logic. The implementation adds conditional execution paths: when rope_dims < ne00, the function allocates a contiguous buffer for the head portion, applies rotation, copies the result back, and copies the unrotated tail directly from source to destination.

For Vision mode, rope_dims is set equal to ne0, eliminating tail handling and maintaining identical execution to full rotation. The refactoring consolidates F16 and F32 type handling into a unified flow, reducing code duplication while adding functionality.

Conclusion

The PR successfully extends CANN backend model compatibility without affecting performance of existing workloads. The zero-impact metrics confirm that the refactoring maintains performance parity for full rotation cases while enabling support for partial RoPE configurations used in modern vision-language models.

@loci-dev loci-dev force-pushed the main branch 23 times, most recently from ca9e0d2 to 3ba49e2 on December 5, 2025 01:37
@loci-review

loci-review bot commented Dec 8, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #344

Overview

PR #344 adds partial RoPE and vision mode support to the CANN backend for Huawei Ascend NPUs. The implementation modifies RoPE operation handling in ggml-cann to support rope_dims < ne0 (partial rotation) and GGML_ROPE_TYPE_VISION mode. Analysis shows zero performance impact across all measured functions and binaries.

Performance Metrics

Core Inference Functions:

  • llama_decode: 732,214 ns (0 ns change)
  • llama_tokenize: 394,616 ns (0 ns change)
  • ggml_graph_compute: 29,345 ns (0 ns change)
  • llama_batch_init: 235 ns (0 ns change)

Power Consumption:
All 16 binaries show changes below 0.001%, with absolute deltas under 1 nanojoule. The libllama.so binary changed by 0.11 nJ (194,329.15 nJ to 194,329.26 nJ). The llama-tts binary decreased by 0.35 nJ.

Tokens Per Second Impact:
No impact on inference throughput. The llama_decode function shows a 0 ns change in both response time and throughput. For reference, a 2 ms llama_decode degradation corresponds to roughly a 7% tokens-per-second reduction in the reference model; since this PR shows a 0 ns change, inference performance remains unchanged.

Code Changes Analysis

The PR refactors ggml_cann_rope() from a simple switch statement to a 5-step pipeline: type conversion preparation, head tensor preparation, rotation execution, tail copying, and type conversion back. The implementation splits tensors into head (rotated) and tail (unrotated) portions when rope_dims < ne0.

For full tensor rotation (existing behavior), execution follows the original path with has_tail = false, avoiding additional allocations or copy operations. This explains the zero performance delta in measurements.

The cache initialization function now accepts rope_dims explicitly, sizing cache memory based on actual rotation dimensions rather than full tensor width. Cache invalidation logic includes theta_scale_updated parameter for correctness.
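
The enlarged cache key can be sketched as follows; the field and method names follow the summary above and are illustrative, not the real ggml_cann_rope_cache layout:

```cpp
// Illustrative cache key: the cached theta table is reused only when every
// component matches, so a change in rope_dims or indep_sects now forces a
// rebuild instead of silently reusing a stale table.
struct rope_cache_key {
    int   rope_dims   = -1;
    bool  indep_sects = false;
    float theta_scale = 0.0f;

    bool matches(int dims, bool indep, float scale) const {
        return rope_dims == dims && indep_sects == indep &&
               theta_scale == scale;
    }
};
```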

The backend support check removes the partial-RoPE rejection for non-310P devices while maintaining restrictions on 310P hardware.

