
UPSTREAM PR #17089: CUDA: fix MMQ stream-k fixup ne1 indices #126

Open
DajanaV wants to merge 1 commit into main from upstream-PR17089-branch_JohannesGaessler-cuda-fix-mmq-moe-6

Conversation


@DajanaV DajanaV commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17089

See the discussion starting with ikawrakow/ik_llama.cpp#728 (comment): the use of the MMQ MoE optimizations is resulting in increased perplexity on master. The problem is that the wrong indices are being used when determining which dst columns should receive the stream-k fixup. This is a very typical bug that I encounter during development, but unfortunately this is one of the rare cases where the impact is small enough to be overlooked. Generally speaking, the impact will be largest for the combination of small models and large GPUs where the SM count is not a power of 2 (the RTX 4090, for example, has 128 SMs). Example models:

| Model | GPU | PPL master | PPL PR |
| --- | --- | --- | --- |
| GraniteMoe 3b q4_0 | RTX 3090 | 10.5577 | 10.0799 |
| GraniteMoe 3b q4_0 | RTX 4090 | 10.0879 | 10.0877 |
| GraniteMoe 3b q4_0 | RTX 5090 | 15.4910 | 10.0853 |
| Qwen 3 30b q4_0 | RTX 3090 | 9.3418 | 9.2930 |
| Qwen 3 30b q4_0 | RTX 4090 | 9.2928 | 9.2928 |
| Qwen 3 30b q4_0 | RTX 5090 | 9.4052 | 9.2930 |

The bug in question has been on master since ggml-org/llama.cpp#13199, though the impact became larger with ggml-org/llama.cpp#15525, when the upper bound for the number of CUDA blocks was tightened and more stream-k seams ended up in tiles that are not being skipped.
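To see why a non-power-of-2 SM count produces more of these seams, here is a minimal host-side sketch of how stream-k slices the total tile work evenly across persistent blocks. This is a simplified model, not the actual ggml-cuda scheduler; the tile count and per-tile iteration count are assumed values chosen only for illustration:

```cpp
#include <cstdio>

int main() {
    const int  n_tiles    = 100;  // output tiles (assumed for illustration)
    const int  iters_tile = 16;   // k-loop iterations per tile (assumed)
    const int  n_sm       = 128;  // e.g. an RTX 4090 has 128 SMs
    const long total      = (long) n_tiles * iters_tile;

    int seams = 0;
    for (int block = 1; block < n_sm; ++block) {
        // First iteration of this block's slice of the total work.
        const long start = total*block / n_sm;
        // If the slice boundary is not aligned to a tile boundary, the tile
        // is split across two blocks and its partial results need a fixup.
        if (start % iters_tile != 0) {
            ++seams;
        }
    }
    printf("%d of %d slice boundaries fall mid-tile and require a fixup\n",
           seams, n_sm - 1);
    return 0;
}
```

Under these assumed sizes most slice boundaries land mid-tile; the exact counts in practice depend on the real tile and iteration geometry, which is why tightening the block bound in ggml-org/llama.cpp#15525 could push more seams into tiles that are not skipped.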


loci-review bot commented Nov 7, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Based on the comprehensive analysis of llama.cpp version 2a1990b0-e940-416f-8fdf-d96ef42a047a compared to base version 96963f27-2e61-4c7d-9307-c88635a4e8a6, the changes show minimal performance impact with one targeted bug fix.

Key Findings

Performance Metrics:

  • Highest Response Time change: _RegexMask constructor (+0.082%, 22.525 ns vs 22.507 ns)
  • Highest Throughput change: make_unique template for graph input position bucket (+0.117%, 104.450 ns vs 104.328 ns)
  • No core inference functions (llama_decode, llama_encode, llama_tokenize) were modified or showed performance changes
  • Tokens per second impact: None - inference performance remains unchanged as no tokenization/inference functions were affected

Power Consumption Analysis:

  • build.bin.llama-cvector-generator: Complete removal (314.26 μJ to 0 μJ, -100% change)
  • All other binaries show stable power consumption with negligible variations (<0.001%)
  • Overall system power efficiency maintained

Flame Graph and CFG Analysis:

  • _RegexMask constructor shows simple, atomic execution with single stack frame
  • No nested calls or complex branching detected
  • CFG comparison reveals identical assembly code between versions
  • The 0.01 ns timing difference represents measurement noise rather than functional changes

GitHub Code Review - Critical Bug Fix:
The primary change is a targeted fix in ggml/src/ggml-cuda/mmq.cuh line 3497:

// Before: ids_dst_shared[j] = ids_dst[col_low + j];
// After:  ids_dst_shared[j] = ids_dst[col_low + jt*mmq_x + j];

This corrects incorrect indexing in CUDA MMQ stream-k fixup operations for MoE models, resolving perplexity degradation issues on specific GPU configurations (RTX 4090, RTX 5090). The fix shows significant model quality improvements: GraniteMoe 3b perplexity improved from 15.49 to 10.09 on RTX 5090.
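To make the off-by-a-tile effect concrete, here is a hypothetical standalone illustration in plain C++ (not the actual mmq.cuh kernel); the values of mmq_x, col_low, and jt, and the contents of ids_dst, are made up for demonstration:

```cpp
#include <cstdio>

int main() {
    const int mmq_x   = 8;  // tile width in dst columns (assumed value)
    const int col_low = 0;  // first column of the fixup range in the tile (assumed)
    const int ne1     = 32; // total number of dst columns (assumed)

    int ids_dst[32];
    for (int i = 0; i < ne1; ++i) {
        ids_dst[i] = 100 + i; // dummy expert-to-dst-column mapping
    }

    const int jt = 2; // index of the tile along the column dimension
    for (int j = 0; j < mmq_x; ++j) {
        const int wrong = ids_dst[col_low + j];            // always reads tile 0's columns
        const int right = ids_dst[col_low + jt*mmq_x + j]; // reads tile jt's columns
        printf("j=%d  wrong=%d  right=%d\n", j, wrong, right);
    }
    return 0;
}
```

Without the jt*mmq_x offset, every tile's fixup reads the column indices belonging to tile 0, so partial sums are applied to the wrong dst columns whenever a stream-k seam lands in a tile with jt > 0.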

Conclusion:
The changes represent a high-quality maintenance update with a critical correctness fix for CUDA operations while maintaining stable performance across all core inference components.

2 similar comments

@DajanaV DajanaV force-pushed the main branch 24 times, most recently from a29809a to 973f45e on November 10, 2025 at 21:08
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from 701e6c7 to 6196a56 on November 16, 2025 at 01:36