
UPSTREAM PR #17089: CUDA: fix MMQ stream-k fixup ne1 indices #126

Open
DajanaV wants to merge 1 commit into main from upstream-PR17089-branch_JohannesGaessler-cuda-fix-mmq-moe-6

Conversation


@DajanaV DajanaV commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17089

See the discussion starting with ikawrakow/ik_llama.cpp#728 (comment): the use of the MMQ MoE optimizations is resulting in increased perplexity on master. The problem is that the wrong indices are being used when determining which dst columns should receive the stream-k fixup. This is a very typical bug that I encounter during development, but unfortunately this is one of the rare cases where the impact is small enough to be overlooked. Generally speaking, the impact will be largest for the combination of small models and large GPUs where the SM count is not a power of 2 (the RTX 4090, for example, has 128 SMs). Example models:

| Model | GPU | PPL master | PPL PR |
| --- | --- | --- | --- |
| GraniteMoe 3b q4_0 | RTX 3090 | 10.5577 | 10.0799 |
| GraniteMoe 3b q4_0 | RTX 4090 | 10.0879 | 10.0877 |
| GraniteMoe 3b q4_0 | RTX 5090 | 15.4910 | 10.0853 |
| Qwen 3 30b q4_0 | RTX 3090 | 9.3418 | 9.2930 |
| Qwen 3 30b q4_0 | RTX 4090 | 9.2928 | 9.2928 |
| Qwen 3 30b q4_0 | RTX 5090 | 9.4052 | 9.2930 |

The bug in question has been on master since ggml-org/llama.cpp#13199, though the impact became larger with ggml-org/llama.cpp#15525, when the upper bound for the number of CUDA blocks was tightened and more stream-k seams ended up in tiles that are not being skipped.
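To see why a non-power-of-2 SM count produces more of these seams, here is a minimal host-side sketch of how stream-k slices the total tile work evenly across persistent blocks. This is a simplified model, not the actual ggml-cuda scheduler; the tile count and per-tile iteration count are assumed values chosen only for illustration:

```cpp
#include <cstdio>

int main() {
    const int  n_tiles    = 100;  // output tiles (assumed for illustration)
    const int  iters_tile = 16;   // k-loop iterations per tile (assumed)
    const int  n_sm       = 128;  // e.g. an RTX 4090 has 128 SMs
    const long total      = (long) n_tiles * iters_tile;

    int seams = 0;
    for (int block = 1; block < n_sm; ++block) {
        // First iteration of this block's slice of the total work.
        const long start = total*block / n_sm;
        // If the slice boundary is not aligned to a tile boundary, the tile
        // is split across two blocks and its partial results need a fixup.
        if (start % iters_tile != 0) {
            ++seams;
        }
    }
    printf("%d of %d slice boundaries fall mid-tile and require a fixup\n",
           seams, n_sm - 1);
    return 0;
}
```

Under these assumed sizes most slice boundaries land mid-tile; the exact counts in practice depend on the real tile and iteration geometry, which is why tightening the block bound in ggml-org/llama.cpp#15525 could push more seams into tiles that are not skipped.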


loci-review bot commented Nov 7, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Based on the comprehensive analysis of llama.cpp version 2a1990b0-e940-416f-8fdf-d96ef42a047a compared to base version 96963f27-2e61-4c7d-9307-c88635a4e8a6, the changes show minimal performance impact with one targeted bug fix.

Key Findings

Performance Metrics:

  • Highest Response Time change: _RegexMask constructor (+0.082%, 22.525 ns vs 22.507 ns)
  • Highest Throughput change: make_unique template for graph input position bucket (+0.117%, 104.450 ns vs 104.328 ns)
  • No core inference functions (llama_decode, llama_encode, llama_tokenize) were modified or showed performance changes
  • Tokens per second impact: None - inference performance remains unchanged as no tokenization/inference functions were affected

Power Consumption Analysis:

  • build.bin.llama-cvector-generator: Complete removal (314.26 μJ to 0 μJ, -100% change)
  • All other binaries show stable power consumption with negligible variations (<0.001%)
  • Overall system power efficiency maintained

Flame Graph and CFG Analysis:

  • _RegexMask constructor shows simple, atomic execution with single stack frame
  • No nested calls or complex branching detected
  • CFG comparison reveals identical assembly code between versions
  • The 0.01 ns timing difference represents measurement noise rather than functional changes

GitHub Code Review - Critical Bug Fix:
The primary change is a targeted fix in ggml/src/ggml-cuda/mmq.cuh line 3497:

// Before: ids_dst_shared[j] = ids_dst[col_low + j];
// After:  ids_dst_shared[j] = ids_dst[col_low + jt*mmq_x + j];

This corrects incorrect indexing in CUDA MMQ stream-k fixup operations for MoE models, resolving perplexity degradation issues on specific GPU configurations (RTX 4090, RTX 5090). The fix shows significant model quality improvements: GraniteMoe 3b perplexity improved from 15.49 to 10.09 on RTX 5090.
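To make the off-by-a-tile effect concrete, here is a hypothetical standalone illustration in plain C++ (not the actual mmq.cuh kernel); the values of mmq_x, col_low, and jt, and the contents of ids_dst, are made up for demonstration:

```cpp
#include <cstdio>

int main() {
    const int mmq_x   = 8;  // tile width in dst columns (assumed value)
    const int col_low = 0;  // first column of the fixup range in the tile (assumed)
    const int ne1     = 32; // total number of dst columns (assumed)

    int ids_dst[32];
    for (int i = 0; i < ne1; ++i) {
        ids_dst[i] = 100 + i; // dummy expert-to-dst-column mapping
    }

    const int jt = 2; // index of the tile along the column dimension
    for (int j = 0; j < mmq_x; ++j) {
        const int wrong = ids_dst[col_low + j];            // always reads tile 0's columns
        const int right = ids_dst[col_low + jt*mmq_x + j]; // reads tile jt's columns
        printf("j=%d  wrong=%d  right=%d\n", j, wrong, right);
    }
    return 0;
}
```

Without the jt*mmq_x offset, every tile's fixup reads the column indices belonging to tile 0, so partial sums are applied to the wrong dst columns whenever a stream-k seam lands in a tile with jt > 0.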

Conclusion:
The changes represent a high-quality maintenance update with a critical correctness fix for CUDA operations while maintaining stable performance across all core inference components.

2 similar comments

@DajanaV DajanaV force-pushed the main branch 24 times, most recently from a29809a to 973f45e on November 10, 2025 at 21:08
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from 701e6c7 to 6196a56 on November 16, 2025 at 01:36