UPSTREAM PR #17584: Add support for CUMSUM and TRI for CUDA. #355
Conversation
Explore the complete analysis inside Version Insights.

**Performance Analysis Summary: PR #355 (CUDA CUMSUM and TRI Operations)**

**Overview:** PR #355 introduces CUDA kernel implementations for cumulative sum (CUMSUM) and triangular matrix (TRI) operations, adding 315 lines across 7 files. Analysis shows no measurable performance impact on existing inference paths, with a 0% power-consumption change across all 16 binaries.
Explore the complete analysis inside Version Insights.

**Performance Analysis Summary: PR #355**

**Overview:** PR #355 adds CUDA backend support for the CUMSUM and TRI operations through new kernel implementations. The changes introduce 315 lines of new code across 7 files without modifying existing functionality.

**Performance Impact:** No performance changes detected. All 16 analyzed binaries show a 0.0% change in power consumption, and no functions exhibit measurable Response Time or Throughput Time variations between versions.

**Inference Performance:** The core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. Token throughput is unaffected, as the new operations are not yet exercised in current workloads.

**Power Consumption:** All binaries maintain identical energy profiles.
**Code Implementation:** The PR implements two new CUDA operations:

- **CUMSUM:** Computes the cumulative sum using a two-phase algorithm: warp-level prefix sums followed by inter-warp accumulation. Supports F32, F16, and BF16 types.
- **TRI:** Applies triangular matrix masking for attention mechanisms, using a simple per-element comparison with configurable mask types (lower/upper, with optional diagonal).

**Integration:** Both operations are registered in the CUDA backend dispatch system and declared as supported operations. The implementation follows existing GGML CUDA patterns with proper type handling and bounds checking.

**Compilation Issue:** The half2 `warp_prefix_inclusive_sum` specialization contains a type-declaration error that will prevent compilation when FP16 is enabled.

**Conclusion:** This PR extends the CUDA backend without impacting existing performance. The new operations enable future model architectures that require cumulative sums or triangular masking to run fully on CUDA without CPU fallback.
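To illustrate the two-phase CUMSUM scheme described above, here is a minimal host-side C++ sketch. It is a serial emulation, not the PR's actual kernel: a real CUDA implementation would perform the warp phase with shuffle intrinsics such as `__shfl_up_sync`, and the function names and signatures here (`warp_prefix_inclusive_sum`'s loop form, `cumsum_row`) are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Emulated warp width; CUDA warps are 32 lanes.
constexpr std::size_t WARP_SIZE = 32;

// Phase 1: inclusive prefix sum within one warp-sized chunk.
// (A CUDA kernel would do this in parallel via __shfl_up_sync.)
static void warp_prefix_inclusive_sum(float * chunk, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        chunk[i] += chunk[i - 1];
    }
}

// Phase 2: inter-warp accumulation — add the running total of all
// preceding chunks to every element of the current chunk.
std::vector<float> cumsum_row(const std::vector<float> & src) {
    std::vector<float> dst(src);
    float carry = 0.0f;
    for (std::size_t base = 0; base < dst.size(); base += WARP_SIZE) {
        const std::size_t n = std::min(WARP_SIZE, dst.size() - base);
        warp_prefix_inclusive_sum(dst.data() + base, n);
        for (std::size_t i = 0; i < n; ++i) {
            dst[base + i] += carry;
        }
        carry = dst[base + n - 1];  // total of everything seen so far
    }
    return dst;
}
```

Splitting the scan into a fast intra-warp phase plus a carry phase is the standard GPU pattern: the first phase needs no shared memory at all, and only warp totals cross the warp boundary.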
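The TRI operation's per-element comparison can be sketched the same way. This is a hypothetical host-side model, assuming a mask-type enum with lower/upper variants and optional diagonal; the PR's actual parameterization may differ.

```cpp
#include <vector>

// Illustrative mask types: strict or diagonal-inclusive triangles.
enum class tri_type { lower, lower_diag, upper, upper_diag };

// One index comparison decides whether element (r, c) survives.
// This is why the kernel needs no cross-thread communication.
static bool tri_keep(tri_type t, long long r, long long c) {
    switch (t) {
        case tri_type::lower:      return c <  r;  // strictly below diagonal
        case tri_type::lower_diag: return c <= r;  // below, incl. diagonal
        case tri_type::upper:      return c >  r;
        case tri_type::upper_diag: return c >= r;
    }
    return false;
}

// Apply the triangular mask to a rows x cols matrix stored row-major,
// zeroing every element outside the selected triangle.
std::vector<float> tri_apply(const std::vector<float> & src,
                             long long rows, long long cols, tri_type t) {
    std::vector<float> dst(src.size(), 0.0f);
    for (long long r = 0; r < rows; ++r) {
        for (long long c = 0; c < cols; ++c) {
            if (tri_keep(t, r, c)) {
                dst[r * cols + c] = src[r * cols + c];
            }
        }
    }
    return dst;
}
```

In a CUDA kernel each thread would handle one element, so the whole operation reduces to an index comparison plus a conditional store.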
Force-pushed: d516828 to 0a006e7 (Compare)
Force-pushed: c217e38 to a73de67 (Compare)
Mirrored from ggml-org/llama.cpp#17584
Extracted and adapted kernels by @gabe-l-hart from ggml-org/llama.cpp#16623