UPSTREAM PR #17089: CUDA: fix MMQ stream-k fixup ne1 indices #126
Conversation
Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Based on the comprehensive analysis of llama.cpp version

Key Findings

Performance Metrics:
Power Consumption Analysis:
Flame Graph and CFG Analysis:

GitHub Code Review - Critical Bug Fix:

// Before: ids_dst_shared[j] = ids_dst[col_low + j];
// After:  ids_dst_shared[j] = ids_dst[col_low + jt*mmq_x + j];

This corrects incorrect indexing in CUDA MMQ stream-k fixup operations for MoE models, resolving perplexity degradation issues on specific GPU configurations (RTX 4090, RTX 5090). The fix shows significant model quality improvements: GraniteMoe 3b perplexity improved from 15.49 to 10.09 on RTX 5090.

Conclusion:
2 similar comments
Force-pushed from a29809a to 973f45e (Compare)
Force-pushed from 701e6c7 to 6196a56 (Compare)
Mirrored from ggml-org/llama.cpp#17089
See the discussion starting with ikawrakow/ik_llama.cpp#728 (comment): the use of the MMQ MoE optimizations is resulting in increased perplexity on master. The problem is that the wrong indices are being used when determining which `dst` columns should be receiving the stream-k fixup. This is a very typical bug that I encounter during development, but unfortunately this is one of the rare cases where the impact is small enough to be overlooked. Generally speaking, the impact will be largest for the combination of small models and large GPUs where the SM count is not a power of 2 (the RTX 4090, for example, has 128 SMs). Example models:

The bug in question has been on master since ggml-org/llama.cpp#13199, though the impact became larger with ggml-org/llama.cpp#15525, when the upper bound for CUDA blocks was tightened and more stream-k seams ended up in tiles that are not being skipped.