UPSTREAM PR #18033: vulkan: use 4 rows for scalar FA large tile size (#567)
Conversation
Performance Analysis Summary: PR #567

Overview: This PR modifies the tile size configuration of the Vulkan Flash Attention scalar kernel.

Key Findings

Performance-Critical Area Impact: The modification affects the attention-mechanism computation path within the Vulkan backend.

Tokens Per Second Impact: No direct impact on tokens-per-second metrics. The modified function operates within the Vulkan GPU acceleration layer for attention computation and does not alter the response time or throughput of the core inference functions (llama_decode, llama_encode, llama_tokenize). Upstream testing indicates a 2-5% improvement in attention kernel performance, which may translate into a marginal inference speedup for Vulkan-accelerated workloads with attention-heavy models.

Power Consumption: This analysis applies to binaries using the Vulkan backend. The reduced tile size decreases the working set per shader invocation, potentially lowering GPU register pressure and memory bandwidth consumption. The simplified branching logic eliminates one bitwise operation per tile size determination.

Code Change Nature: The implementation is a targeted optimization based on empirical testing (upstream PR ggml-org/llama.cpp#18033). The change maintains functional correctness while improving GPU resource utilization for the head dimensions (64, 80, 96, 128) common in modern transformer architectures.
force-pushed from 45e0e28 to e9472cd
force-pushed from 874549c to cdb09db
force-pushed from 320a1fc to 1fc5e38
force-pushed from e8bf2a6 to 9c8623e
Mirrored from ggml-org/llama.cpp#18033
See ggml-org/llama.cpp#17715 (comment). I also tested locally on a 5090 in scalar mode, and it was a few percent faster on several models.