UPSTREAM PR #16817: Implement SparseK Attention mechanism — new GGML operator with CPU backend (GPU planned next) #4

Closed
DajanaV wants to merge 3 commits into main from upstream-PR16817-branch_yael-works-feature/sparsek-attn-sycl

Conversation


DajanaV (Collaborator) commented Oct 28, 2025

Mirrored from ggml-org/llama.cpp#16817

New Attention Mechanism: SparseK Attention (CPU Backend)

This PR introduces a new attention mechanism called SparseK Attention, implemented from scratch as a new operator within the GGML framework, currently with CPU backend support.


Overview

SparseK Attention is a selective, efficient attention mechanism inspired by Flash Attention; it introduces additional sparsity through:

  • Top-K filtering – keeps only the strongest attention weights.
  • Local windowing – limits attention to a configurable local context.
  • Global stride – adds periodic global connections between tokens.
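
Illustratively, the local-window and global-stride controls can be combined into a single visibility predicate over a key index j for a query index i. This is a hedged sketch: the function name, the causal assumption, and the exact window/stride semantics are ours, not taken from the PR.

```cpp
// Hypothetical sketch of SparseK's sparsity pattern: key j is visible to
// query i if it is inside the local window of width win_local, or lies on
// a periodic global stride. Causal masking (j <= i) mirrors decoder attention.
bool sparsek_visible(int i, int j, int win_local, int stride_global) {
    if (j > i) return false;                                        // causal mask
    if (i - j < win_local) return true;                             // local window
    if (stride_global > 0 && j % stride_global == 0) return true;   // global stride
    return false;
}
```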

Implementation Details

  • Added new operator: GGML_OP_SPARSEK_ATTN defined in ggml.h and ggml.c.
  • Implemented construction function ggml_sparsek_attn() that creates a computation node with parameters (k_top, win_local, stride_global).
  • Added full CPU backend implementation in:
    • ggml-cpu/ops.h
    • ggml-cpu/ops.cpp
    • ggml-cpu.c

The CPU version includes:

  • Scaled dot-product computation QKᵀ / √d
  • Dynamic Top-K filtering
  • Softmax normalization
  • Multiplication with V
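
Taken together, the steps above can be sketched for a single query row. This is a simplified plain-C++ reference, not the PR's actual ops.cpp code; here top-k keeps the k_top largest scores by thresholding on the k_top-th largest value.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <functional>

// Simplified single-query SparseK step:
//   scores = q·K^T / sqrt(d), keep k_top largest, softmax, weight V.
std::vector<float> sparsek_row(const std::vector<float>& q,
                               const std::vector<std::vector<float>>& K,
                               const std::vector<std::vector<float>>& V,
                               int k_top) {
    const size_t T = K.size(), d = q.size();
    const float scale = 1.0f / std::sqrt((float) d);

    // Scaled dot-product QK^T / sqrt(d)
    std::vector<float> scores(T);
    for (size_t j = 0; j < T; ++j) {
        float s = 0.0f;
        for (size_t c = 0; c < d; ++c) s += q[c] * K[j][c];
        scores[j] = s * scale;
    }

    // Top-K filtering: mask everything below the k_top-th largest score.
    std::vector<float> tmp = scores;
    std::nth_element(tmp.begin(), tmp.begin() + (k_top - 1), tmp.end(),
                     std::greater<float>());
    const float thresh = tmp[k_top - 1];
    for (auto& s : scores) if (s < thresh) s = -INFINITY;

    // Numerically stable softmax over the surviving scores.
    float mx = *std::max_element(scores.begin(), scores.end());
    float sum = 0.0f;
    for (auto& s : scores) { s = std::exp(s - mx); sum += s; }
    for (auto& s : scores) s /= sum;

    // Weighted sum with V.
    std::vector<float> out(V[0].size(), 0.0f);
    for (size_t j = 0; j < T; ++j)
        for (size_t c = 0; c < out.size(); ++c)
            out[c] += scores[j] * V[j][c];
    return out;
}
```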

Next Steps

Our next goal is to extend SparseK Attention to the SYCL (GPU) backend in order to:

  • Measure and compare performance between CPU and GPU implementations.
  • Optimize kernel execution for sparse attention patterns.
  • Validate correctness and scaling on Intel GPUs.

We are submitting this initial CPU implementation first to ensure review, integration, and baseline correctness before introducing GPU acceleration.


Co-Authors

Co-authored-by: Yael Shuker ([email protected])
Co-authored-by: Gitty Burstein ([email protected])

…or definition and tensor creation, backend implementation pending to ggml.c/h

Co-authored-by: Yael Shuker <[email protected]>
Co-authored-by: Gitty Burstein <[email protected]>
@loci-review-dev

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: SparseK Attention Implementation (PR #4)

Key Findings

Performance Degradations Identified

  • Response Time: std::pow template function shows 0.066% degradation (108.11 ns vs 108.04 ns)
  • Throughput: std::regex _M_match_multiline char variant shows 0.110% degradation (39.49 ns vs 39.44 ns)
  • Bottleneck: std::regex _M_match_multiline wchar variant shows 0.173% degradation (25.05 ns vs 25.01 ns)
  • Power Consumption: Negligible increase of 0.0001% in libllama.so (0.42 nJ increase)

Core Function Impact Assessment

The performance degradations do not affect core llama.cpp functions. All degraded functions are C++ standard library components:

  • Template instantiation overhead in mathematical operations
  • Regex processing in standard library utilities
  • No impact on critical inference functions (model loading, tokenization, attention mechanisms, sampling)

Root Cause Analysis

Environmental Degradation: All affected functions remain byte-for-byte identical between versions, confirming performance changes stem from:

  • Memory layout modifications due to new SparseK attention code addition
  • Instruction cache pressure from increased binary size (+209 lines)
  • Altered branch prediction patterns in surrounding code

Flame Graph & CFG Analysis Insights

  • Template Overhead Dominance: 92.6% of std::pow execution time spent in template wrapper (100 ns) vs actual computation (8 ns)
  • Inefficient Memory Operations: Redundant stack store/load operations in argument processing
  • Identical Control Flow: No structural changes in degraded functions, confirming environmental impact

Critical Code Review Issues

High-Priority Algorithmic Bug:

// BROKEN: Incorrect top-k implementation
if (row[j] < row[k_top]) row[j] = -INFINITY;  // Uses k_top as index, not threshold

Missing Core Features:

  • Local windowing logic not implemented (parameters ignored)
  • Global stride mechanism not implemented
  • No SIMD optimizations for O(T²D) complexity operations

Actionable Steps

Immediate Critical Fixes (Priority 1)

  1. Fix Top-K Algorithm:

    • Implement proper k-th element selection using std::nth_element or sorting
    • Add bounds validation: k_top ≤ sequence_length
    • Add comprehensive unit tests for edge cases
  2. Implement Missing Features:

    • Add local windowing logic using win_local parameter
    • Implement global stride connections using stride_global parameter
    • Validate algorithm correctness against reference implementation
  3. Add Safety Measures:

    • Implement tensor dimension bounds checking
    • Add parameter validation in ggml_sparsek_attn()
    • Prevent buffer overflows in nested loops
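
The top-k correction suggested in step 1 could look like the following hedged sketch: it derives a threshold from the k_top-th largest value via std::nth_element instead of misusing k_top as an index, and validates bounds first.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <functional>

// Mask all but the k_top largest entries of a score row in-place.
// Bounds-checks k_top before selecting the threshold.
void topk_mask(std::vector<float>& row, int k_top) {
    if (k_top <= 0 || (size_t) k_top >= row.size()) return;  // nothing to mask
    std::vector<float> tmp = row;
    std::nth_element(tmp.begin(), tmp.begin() + (k_top - 1), tmp.end(),
                     std::greater<float>());
    const float thresh = tmp[k_top - 1];                     // k_top-th largest value
    for (auto& v : row) if (v < thresh) v = -INFINITY;
}
```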

Performance Optimization (Priority 2)

  1. Optimize Template Overhead:

    • Consider template specializations for common float-integer power operations
    • Eliminate redundant stack operations in std::pow wrapper
    • Evaluate constexpr evaluation for compile-time constants
  2. SparseK Attention Optimization:

    • Implement SIMD vectorization for dot product computations
    • Use cache-friendly memory access patterns
    • Add OpenMP parallelization for batch processing
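
As one illustration of the suggested batch parallelization (a sketch; the pragma is ignored when OpenMP is not enabled, and the function name is ours):

```cpp
#include <vector>
#include <cstddef>

// Hypothetical batch score loop: each query row is independent, so the outer
// loop parallelizes cleanly; inner dot products walk K row-major, which keeps
// memory access cache-friendly.
void batch_scores(const std::vector<std::vector<float>>& Q,
                  const std::vector<std::vector<float>>& K,
                  std::vector<std::vector<float>>& S) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t) Q.size(); ++i) {
        for (size_t j = 0; j < K.size(); ++j) {
            float s = 0.0f;
            for (size_t c = 0; c < Q[i].size(); ++c) s += Q[i][c] * K[j][c];
            S[i][j] = s;
        }
    }
}
```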

Code Quality Improvements (Priority 3)

  1. Documentation & Testing:

    • Add algorithm complexity analysis and usage documentation
    • Expand test coverage for various tensor dimensions and parameter combinations
    • Implement performance benchmarks against standard attention
  2. Build Optimization:

    • Monitor instruction cache impact of binary size growth
    • Consider function placement optimization to minimize cache pressure

Overall Assessment

Change Impact Evaluation

  • Functionality: Successfully adds new SparseK attention operator to GGML framework
  • Integration Quality: Clean integration following established GGML patterns
  • Performance Impact: Minimal environmental degradation (< 0.2%) with no core function impact
  • Correctness Risk: High due to broken top-k implementation requiring immediate fix

Maintainability Considerations

  • Positive: Follows GGML architectural patterns for operator extension
  • Positive: Comprehensive test infrastructure provides good foundation
  • Concern: Complex algorithm requires better documentation and validation
  • Concern: Missing core features may lead to confusion about operator capabilities

Future Performance Outlook

  • Short-term: Environmental performance impact should stabilize with future builds
  • Medium-term: Proper SIMD optimization will be critical for production performance
  • Long-term: GPU backend implementation will determine practical utility

Recommendation: The PR introduces valuable functionality but requires immediate algorithmic fixes before merge. The environmental performance impact on existing functions is acceptable and expected to resolve naturally. Focus should be on correctness and feature completeness rather than the minimal standard library performance variations.

@DajanaV DajanaV force-pushed the main branch 3 times, most recently from 1983956 to 326a60a on October 29, 2025 12:13
@DajanaV DajanaV added the dev-stale label (Stale dev environment — dashboard not accessible) on Oct 30, 2025
@DajanaV DajanaV deleted the branch main October 30, 2025 15:25
@DajanaV DajanaV closed this Oct 30, 2025
@DajanaV DajanaV deleted the upstream-PR16817-branch_yael-works-feature/sparsek-attn-sycl branch October 30, 2025 15:26
DajanaV pushed a commit that referenced this pull request Nov 5, 2025
* Add buffer label and enable dawn-specific toggles to turn off some checks

* Minor set_rows optimization (#4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Reese Levine <[email protected]>

* Comment on dawn toggles

* Remove some comments

* Implement overlap binary operators

* Revert "Implement overlap binary operators"

This reverts commit ed710b36f51ab3f53fa13db15c1685dc8678a32a.

* Disable support for non-contiguous binary_op tensors and leave note for future support

---------

Co-authored-by: neha-ha <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
DajanaV pushed a commit that referenced this pull request Nov 12, 2025
* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
loci-dev pushed a commit that referenced this pull request Nov 30, 2025
loci-dev pushed a commit that referenced this pull request Dec 3, 2025
* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (#9)


* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Fix .gitignore

* Add memory64 option and remove unneeded macros for setting threads to 1

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
loci-dev pushed a commit that referenced this pull request Jan 5, 2026
* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (including xielu) working

* removed unnecessary checking if node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and removed bug in unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Refactored pipelines and workgroup calculations (#10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Reese Levine <[email protected]>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in some cases

* Checkpoint

* formatting
loci-dev pushed a commit that referenced this pull request Jan 8, 2026
* FlashAttention (#13)

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Forgot one parameter unused

* Move some flashattn computation to f32 for correctness
loci-dev pushed a commit that referenced this pull request Feb 17, 2026
Fix the seg fault without NCCL