UPSTREAM PR #17543: CANN: add support for partial RoPE and Vision mode (#344)
Conversation
Explore the complete analysis inside Version Insights.

Performance Analysis Summary - PR #344

Analysis: This PR implements partial RoPE and vision mode support for the CANN backend across 3 files, with 222 additions and 70 deletions. The changes modify the CANN backend's RoPE operation handling.

Performance Impact: No measurable performance changes detected. Power consumption analysis shows less than 0.001% variation across all binaries, with a maximum absolute delta of 0.66 nJ in libllama.so. No functions show measurable changes in response time or throughput time between versions.

Inference Impact: No impact on tokens per second. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes. The modifications are isolated to CANN backend RoPE operations, which do not affect CPU-based tokenization or inference paths.

Code Changes: The implementation adds conditional logic for partial rotation (when rope_dims < ne0).
Force-pushed from 9a74048 to af6127b
Performance Review Summary: PR #344 - CANN Backend Partial RoPE Support

Overview

PR #344 implements partial Rotary Position Embedding and Vision mode support in the CANN backend.

Key Findings

Inference Impact

Token Generation Rate: The changes affect only the CANN backend RoPE implementation within the GGML computation graph layer. The core inference functions show no response time or throughput changes, and for models using full RoPE or running on non-CANN backends, tokens per second remains unchanged.

Power Consumption

Power consumption analysis applies to binaries containing the modified CANN backend code. The additional copy operations in the partial RoPE path increase cumulative execution time, resulting in higher power draw proportional to the throughput-time increase. Binaries using full RoPE or non-CANN backends show no power consumption change.
Force-pushed from 333626d to 82b1c0b
Force-pushed from e0d679c to 70c9ebc
Performance Analysis Summary - PR #344

Overview

This PR implements partial RoPE and Vision mode support in the CANN backend for Huawei Ascend NPUs, modifying the backend's RoPE operation handling.

Performance Metrics

Function-Level Changes: No functions show measurable changes in response time or throughput between the base and head versions.

Power Consumption: All 16 analyzed binaries show effectively zero change; the largest variation is 1 nanojoule.

Tokens Per Second Impact: No impact on inference throughput. The functions responsible for tokenization and inference show no measurable change.

Code Changes

The PR refactors the CANN RoPE implementation to split tensors into a rotated head and an unrotated tail when rope_dims < ne0. For Vision mode, rope_dims is set to ne0 so the entire tensor is rotated.

Conclusion

The PR successfully extends CANN backend model compatibility without affecting the performance of existing workloads. The zero-impact metrics confirm that the refactoring maintains performance parity for full rotation cases while enabling support for partial RoPE configurations used in modern vision-language models.
Force-pushed from ca9e0d2 to 3ba49e2
Add support for two important RoPE variants: partial rotation (rope_dims < ne0)
and Vision mode rotation.
1. Support for partial RoPE (rope_dims < ne0):
- Split tensor into head (first rope_dims dimensions) and tail portions
- Apply rotation only to head portion using RotaryPositionEmbedding operator
- Copy unrotated tail portion directly from source to destination
- Handle both contiguous and non-contiguous tensor layouts
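The head/tail split above can be sketched in a few lines. This is a minimal illustration, not the backend's actual code: the function name rope_partial_row is hypothetical, the rotation uses the "normal" adjacent-pair convention with a single angle, and real RoPE applies a different angle per pair and per position.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of partial RoPE on one row of width ne0: rotate only the first
// rope_dims elements, copy the remaining tail through unchanged.
static std::vector<float> rope_partial_row(const std::vector<float> & src,
                                           int rope_dims, float theta) {
    const int ne0 = (int) src.size();
    std::vector<float> dst(ne0);
    const float c = std::cos(theta), s = std::sin(theta);
    // head: rotate adjacent pairs (x0, x1) -> (x0*c - x1*s, x0*s + x1*c)
    for (int i = 0; i < rope_dims; i += 2) {
        dst[i]     = src[i] * c - src[i + 1] * s;
        dst[i + 1] = src[i] * s + src[i + 1] * c;
    }
    // tail: unrotated, copied directly from source to destination
    for (int i = rope_dims; i < ne0; ++i) {
        dst[i] = src[i];
    }
    return dst;
}
```

On device the tail copy is a separate operation (or a strided copy for non-contiguous layouts), which is why the PR handles both contiguous and non-contiguous cases explicitly.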
2. Support for Vision mode (GGML_ROPE_TYPE_VISION):
- Set rope_dims = ne0 for Vision mode to rotate entire tensor
- Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
- No tail handling needed since entire tensor is rotated
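The Vision-mode pairing can be sketched similarly; again a hypothetical helper with a single angle for illustration, where the real implementation varies the angle per pair and per position. The key difference from the split path is that element i is paired with element i + n_dims (n_dims = ne0/2), so every element of the row participates and no tail copy is needed.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of Vision-mode RoPE on one row: with n_dims = ne0/2, rotate
// element i against its partner at i + n_dims, covering the whole row.
static std::vector<float> rope_vision_row(const std::vector<float> & src,
                                          float theta) {
    const int ne0    = (int) src.size();
    const int n_dims = ne0 / 2; // Vision mode: rope_dims == ne0
    std::vector<float> dst(ne0);
    const float c = std::cos(theta), s = std::sin(theta);
    for (int i = 0; i < n_dims; ++i) {
        const float x0 = src[i];
        const float x1 = src[i + n_dims]; // partner lives half a row away
        dst[i]          = x0 * c - x1 * s;
        dst[i + n_dims] = x0 * s + x1 * c;
    }
    return dst;
}
```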
Implementation details:
- Use has_tail flag to determine execution path: head/tail splitting when
rope_dims < ne0, or full tensor rotation when rope_dims == ne0
- Support both F32 and F16 data types with intermediate F32 conversion
- Copy non-contiguous tensors to contiguous buffers before calling
RotaryPositionEmbedding operator for compatibility
- Improve cache invalidation logic to include rope_dims and indep_sects
parameters
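The has_tail dispatch described above reduces to a small predicate. A sketch, with hypothetical names (rope_params, rope_has_tail) rather than the backend's actual symbols:

```cpp
#include <cassert>

// Hypothetical parameter bundle for illustration only.
struct rope_params {
    int  ne0;        // total row width
    int  rope_dims;  // requested rotation width
    bool is_vision;  // GGML_ROPE_TYPE_VISION
};

// Vision mode forces rope_dims = ne0, so it never takes the tail path;
// otherwise the head/tail split is used exactly when rope_dims < ne0.
static bool rope_has_tail(const rope_params & p) {
    const int rope_dims = p.is_vision ? p.ne0 : p.rope_dims;
    return rope_dims < p.ne0;
}
```

When the predicate is false, execution stays on the original full-rotation path, which is why existing workloads see no change.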
These enhancements enable the CANN backend to handle the various RoPE configurations used in modern vision-language models and in models with partial rotation.
Performance Analysis Summary - PR #344

Overview

PR #344 adds partial RoPE and vision mode support to the CANN backend for Huawei Ascend NPUs. The implementation modifies the backend's RoPE operation handling.

Code Changes Analysis

For full tensor rotation (the existing behavior), execution follows the original path with has_tail = false, avoiding additional allocations or copy operations; this explains the zero performance delta in the measurements. The cache initialization function now accepts rope_dims explicitly, sizing cache memory based on the actual rotation dimensions rather than the full tensor width. The cache invalidation logic includes the theta_scale_updated parameter for correctness. The backend support check removes the partial-RoPE rejection for non-310P devices while maintaining the restrictions on 310P hardware.
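The cache-sizing change described here can be sketched as follows. The struct and function names (theta_cache, init_theta_cache) and the freq_base parameter are illustrative assumptions, not the backend's actual symbols; the point is that the cache holds one frequency per rotated pair and is invalidated when rope_dims changes.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical theta cache, sized by rope_dims rather than full tensor width.
struct theta_cache {
    int rope_dims = -1;             // -1 marks the cache as uninitialized
    std::vector<float> theta_scale; // one entry per rotated pair
};

static void init_theta_cache(theta_cache & cache, int rope_dims, float freq_base) {
    if (cache.rope_dims == rope_dims) {
        return; // still valid: rope_dims now participates in invalidation
    }
    cache.rope_dims = rope_dims;
    cache.theta_scale.resize(rope_dims / 2);
    for (int i = 0; i < rope_dims / 2; ++i) {
        // standard RoPE frequency: freq_base^(-2i / rope_dims)
        cache.theta_scale[i] = std::pow(freq_base, -2.0f * i / rope_dims);
    }
}
```

Sizing by rope_dims/2 instead of ne0/2 means partial-RoPE models allocate only what the rotated head actually needs.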
Mirrored from ggml-org/llama.cpp#17543