
vulkan: use 4 rows for scalar FA large tile size #18033

Status: Closed. jeffbolznv wants to merge 1 commit into ggml-org:master from jeffbolznv:fa_scalar_num_large_rows_4.
Conversation

@jeffbolznv (Contributor)

See #17715 (comment). I also tested locally on a 5090 in scalar mode and it was a few percent faster on several models.

jeffbolznv requested a review from 0cc4m as a code owner on December 14, 2025 at 17:16
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 14, 2025
jeffbolznv force-pushed the fa_scalar_num_large_rows_4 branch from 874549c to cdb09db on December 15, 2025 at 03:17
@0cc4m (Contributor) commented on Dec 16, 2025

Here are performance results from my hardware; sadly, it's not that easy. This change causes not only a few improvements, but also some regressions:

Nvidia RTX 3090

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 4902.07 ± 54.57 | 4830.42 ± 48.22 | -1.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 128.90 ± 0.36 | 127.93 ± 0.76 | -0.8% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 2854.52 ± 74.64 | 2816.07 ± 81.54 | -1.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 107.11 ± 0.20 | 107.00 ± 0.23 | -0.1% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | pp512 | 1638.61 ± 5.89 | 1629.11 ± 9.06 | -0.6% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 43.00 ± 0.10 | 42.80 ± 0.11 | -0.5% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | pp512 @ d8192 | 1275.58 ± 12.72 | 1268.69 ± 14.84 | -0.5% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 @ d8192 | 38.94 ± 0.05 | 38.76 ± 0.05 | -0.5% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 | 4502.68 ± 16.52 | 4457.83 ± 113.44 | -1.0% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 217.36 ± 1.54 | 216.61 ± 1.37 | -0.3% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 @ d8192 | 3427.78 ± 112.91 | 3423.60 ± 139.74 | -0.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 @ d8192 | 200.64 ± 1.18 | 201.54 ± 0.78 | +0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 2925.00 ± 13.97 | 2914.02 ± 12.99 | -0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 183.64 ± 2.37 | 183.01 ± 1.04 | -0.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 1899.39 ± 18.98 | 1883.39 ± 36.69 | -0.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 142.97 ± 0.52 | 143.91 ± 0.48 | +0.7% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 4098.53 ± 32.87 | 4063.04 ± 39.85 | -0.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 183.28 ± 0.31 | 182.21 ± 0.54 | -0.6% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 2956.39 ± 109.78 | 2937.67 ± 104.92 | -0.6% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 158.06 ± 0.92 | 158.04 ± 1.15 | -0.0% |
AMD Radeon 8060S

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 576.33 ± 2.46 | 590.85 ± 7.05 | +2.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 40.68 ± 0.25 | 41.21 ± 0.33 | +1.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 253.25 ± 9.99 | 267.05 ± 11.06 | +5.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 30.57 ± 0.09 | 30.59 ± 0.07 | +0.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 | 1457.75 ± 48.93 | 1469.42 ± 54.22 | +0.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 104.81 ± 0.65 | 104.71 ± 0.25 | -0.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 @ d8192 | 1201.81 ± 24.07 | 1209.22 ± 20.22 | +0.6% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 @ d8192 | 99.16 ± 0.72 | 99.23 ± 1.11 | +0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 730.09 ± 7.79 | 729.87 ± 10.90 | -0.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 88.40 ± 0.73 | 92.00 ± 0.74 | +4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 261.73 ± 6.90 | 244.74 ± 7.44 | -6.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 58.66 ± 0.21 | 36.31 ± 0.35 | -38.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1184.78 ± 39.27 | 1169.60 ± 37.53 | -1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 70.52 ± 0.06 | 72.19 ± 0.22 | +2.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 722.70 ± 11.20 | 714.92 ± 6.85 | -1.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 60.23 ± 0.28 | 48.85 ± 0.62 | -18.9% |
AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 678.54 ± 3.42 | 664.57 ± 1.19 | -2.1% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 76.31 ± 0.17 | 79.63 ± 0.32 | +4.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 168.58 ± 0.20 | 114.26 ± 0.40 | -32.2% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 50.94 ± 0.01 | 66.72 ± 0.20 | +31.0% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | pp512 | 183.65 ± 0.34 | 180.72 ± 0.15 | -1.6% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 24.45 ± 0.03 | 24.68 ± 0.13 | +0.9% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | pp512 @ d8192 | 31.56 ± 0.21 | 17.66 ± 0.06 | -44.0% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 @ d8192 | 11.56 ± 0.00 | 11.76 ± 0.05 | +1.7% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 | 1354.36 ± 1.83 | 1340.72 ± 12.10 | -1.0% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 111.42 ± 0.30 | 112.48 ± 0.04 | +1.0% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 @ d8192 | 1082.29 ± 4.54 | 1102.42 ± 3.87 | +1.9% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 @ d8192 | 105.24 ± 0.51 | 109.85 ± 0.18 | +4.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 600.24 ± 2.38 | 603.68 ± 1.84 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 83.88 ± 0.19 | 96.92 ± 0.39 | +15.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 163.04 ± 0.13 | 183.93 ± 0.23 | +12.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 57.34 ± 0.06 | 56.61 ± 0.11 | -1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1182.36 ± 6.58 | 1216.54 ± 2.23 | +2.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 130.10 ± 0.29 | 138.99 ± 0.22 | +6.8% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 568.38 ± 1.19 | 544.64 ± 0.87 | -4.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 115.65 ± 0.06 | 91.17 ± 0.16 | -21.2% |

I wanted to test Intel, too, but I currently have some kind of memory-retention issue with the driver; it gives me OOM on extended tests.

@jeffbolznv (Contributor, Author)

Hmm, I thought the testing had generally been positive at #17715 (comment). But maybe it won't be so easy.

@0cc4m (Contributor) commented on Dec 16, 2025

Yes, but I only checked 0 depth; I should have also run tests with context. How did you run it?

@jeffbolznv (Contributor, Author)

Pretty sure I just tested with 0 depth.

@0cc4m (Contributor) commented on Dec 16, 2025

Maybe we can pick the 8-row shader for larger context. I'll look into the difference in FA calls; maybe there's a better way to select the shader.

@0cc4m (Contributor) commented on Dec 21, 2025

Here's a perf logger comparison for gpt-oss 20B at 8192 context:

AMD 8060S (without coopmat)

| Operator | Calls | Before (us) | After (us) | % Diff | Total (us) |
| --- | --- | --- | --- | --- | --- |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,8448,8,1), v(64,8448,8,1), m(8448,1,1,1) | 7040 | 89.997 | 260.397 | +189.34% | 633578.880 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,8192,8,1), v(64,8192,8,1), m(8192,512,1,1) | 11 | 24449.200 | 46685.800 | +90.95% | 268941.200 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7680,8,1), v(64,7680,8,1), m(7680,512,1,1) | 11 | 22032.700 | 42849.600 | +94.48% | 242359.700 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7168,8,1), v(64,7168,8,1), m(7168,512,1,1) | 11 | 20533.300 | 40334.600 | +96.44% | 225866.300 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6656,8,1), v(64,6656,8,1), m(6656,512,1,1) | 11 | 18720.300 | 35757.400 | +91.01% | 205923.300 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6144,8,1), v(64,6144,8,1), m(6144,512,1,1) | 11 | 17113.300 | 30984.500 | +81.06% | 188246.300 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5632,8,1), v(64,5632,8,1), m(5632,512,1,1) | 11 | 15227.300 | 26941.100 | +76.93% | 167500.300 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,768,8,1), v(64,768,8,1), m(768,512,1,1) | 180 | 897.641 | 660.024 | -26.47% | 161575.380 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5120,8,1), v(64,5120,8,1), m(5120,512,1,1) | 11 | 12751.800 | 24672.400 | +93.48% | 140269.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4608,8,1), v(64,4608,8,1), m(4608,512,1,1) | 11 | 11245.700 | 21784.200 | +93.71% | 123702.700 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4096,8,1), v(64,4096,8,1), m(4096,512,1,1) | 11 | 10102.200 | 19305.700 | +91.10% | 111124.200 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,512,8,1), v(64,512,8,1), m(512,1,1,1) | 6144 | 17.847 | 11.390 | -36.18% | 109651.968 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3584,8,1), v(64,3584,8,1), m(3584,512,1,1) | 11 | 8143.530 | 16900.800 | +107.54% | 89578.830 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3072,8,1), v(64,3072,8,1), m(3072,512,1,1) | 11 | 6773.650 | 13735.700 | +102.78% | 74510.150 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,8448,8,1), v(64,8448,8,1), m(8448,1,1,1), GET_ROWS | 640 | 90.947 | 268.368 | +195.08% | 58206.080 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2560,8,1), v(64,2560,8,1), m(2560,512,1,1) | 11 | 5286.410 | 11354.300 | +114.78% | 58150.510 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,256,8,1), v(64,256,8,1), m(256,1,1,1) | 23 | 2313.160 | 7.088 | -99.69% | 53202.680 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2048,8,1), v(64,2048,8,1), m(2048,512,1,1) | 11 | 4470.970 | 8706.310 | +94.73% | 49180.670 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1536,8,1), v(64,1536,8,1), m(1536,512,1,1) | 11 | 3027.780 | 6117.290 | +102.04% | 33305.580 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,768,8,1), v(64,768,8,1), m(768,1,1,1) | 1536 | 19.645 | 13.026 | -33.69% | 30174.720 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,8192,8,1), v(64,8192,8,1), m(8192,512,1,1), GET_ROWS | 1 | 24743.400 | 47532.700 | +92.10% | 24743.400 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7680,8,1), v(64,7680,8,1), m(7680,512,1,1), GET_ROWS | 1 | 22573.500 | 42561.400 | +88.55% | 22573.500 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1024,8,1), v(64,1024,8,1), m(1024,512,1,1) | 11 | 1988.410 | 3338.500 | +67.90% | 21872.510 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7168,8,1), v(64,7168,8,1), m(7168,512,1,1), GET_ROWS | 1 | 20389.300 | 40421.300 | +98.25% | 20389.300 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,512,8,1), v(64,512,8,1), m(512,512,1,1) | 23 | 820.480 | 629.867 | -23.23% | 18871.040 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6656,8,1), v(64,6656,8,1), m(6656,512,1,1), GET_ROWS | 1 | 18694.100 | 36077.800 | +92.99% | 18694.100 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6144,8,1), v(64,6144,8,1), m(6144,512,1,1), GET_ROWS | 1 | 17491.800 | 32431.700 | +85.41% | 17491.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5632,8,1), v(64,5632,8,1), m(5632,512,1,1), GET_ROWS | 1 | 14932.300 | 26217.000 | +75.57% | 14932.300 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5120,8,1), v(64,5120,8,1), m(5120,512,1,1), GET_ROWS | 1 | 13265.100 | 24539.000 | +84.99% | 13265.100 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4608,8,1), v(64,4608,8,1), m(4608,512,1,1), GET_ROWS | 1 | 10850.000 | 21958.900 | +102.39% | 10850.000 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4096,8,1), v(64,4096,8,1), m(4096,512,1,1), GET_ROWS | 1 | 10422.500 | 19270.200 | +84.89% | 10422.500 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3584,8,1), v(64,3584,8,1), m(3584,512,1,1), GET_ROWS | 1 | 7999.520 | 17083.600 | +113.56% | 7999.520 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3072,8,1), v(64,3072,8,1), m(3072,512,1,1), GET_ROWS | 1 | 6575.640 | 13447.000 | +104.50% | 6575.640 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2560,8,1), v(64,2560,8,1), m(2560,512,1,1), GET_ROWS | 1 | 5305.800 | 11254.700 | +112.12% | 5305.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2048,8,1), v(64,2048,8,1), m(2048,512,1,1), GET_ROWS | 1 | 4343.480 | 8578.680 | +97.51% | 4343.480 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1536,8,1), v(64,1536,8,1), m(1536,512,1,1), GET_ROWS | 1 | 3096.960 | 6296.720 | +103.32% | 3096.960 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1024,8,1), v(64,1024,8,1), m(1024,512,1,1), GET_ROWS | 1 | 2701.920 | 3423.720 | +26.71% | 2701.920 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,512,8,1), v(64,512,8,1), m(512,512,1,1), GET_ROWS | 1 | 887.360 | 729.120 | -17.83% | 887.360 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,256,8,1), v(64,256,8,1), m(256,1,1,1), GET_ROWS | 1 | 17.080 | 6.680 | -60.89% | 17.080 |
AMD Radeon Pro VII

| Operator | Calls | Before (us) | After (us) | % Diff | Total (us) |
| --- | --- | --- | --- | --- | --- |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,8448,8,1), v(64,8448,8,1), m(8448,1,1,1) | 7040 | 99.331 | 331.368 | +233.60% | 699290.240 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,8192,8,1), v(64,8192,8,1), m(8192,512,1,1) | 11 | 36184.800 | 38566.100 | +6.58% | 398032.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,768,8,1), v(64,768,8,1), m(768,512,1,1) | 180 | 2147.570 | 1475.400 | -31.30% | 386562.600 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7680,8,1), v(64,7680,8,1), m(7680,512,1,1) | 11 | 33327.000 | 35919.900 | +7.78% | 366597.000 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7168,8,1), v(64,7168,8,1), m(7168,512,1,1) | 11 | 31083.700 | 33249.100 | +6.97% | 341920.700 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6656,8,1), v(64,6656,8,1), m(6656,512,1,1) | 11 | 28499.600 | 30737.000 | +7.85% | 313495.600 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6144,8,1), v(64,6144,8,1), m(6144,512,1,1) | 11 | 25826.900 | 28213.700 | +9.24% | 284095.900 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5632,8,1), v(64,5632,8,1), m(5632,512,1,1) | 11 | 23663.700 | 25713.100 | +8.66% | 260300.700 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,512,8,1), v(64,512,8,1), m(512,1,1,1) | 6144 | 38.679 | 18.701 | -51.65% | 237643.776 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5120,8,1), v(64,5120,8,1), m(5120,512,1,1) | 11 | 21262.400 | 23277.500 | +9.48% | 233886.400 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4608,8,1), v(64,4608,8,1), m(4608,512,1,1) | 11 | 18828.800 | 20799.300 | +10.47% | 207116.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4096,8,1), v(64,4096,8,1), m(4096,512,1,1) | 11 | 15850.900 | 18466.000 | +16.50% | 174359.900 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3584,8,1), v(64,3584,8,1), m(3584,512,1,1) | 11 | 13790.400 | 15984.200 | +15.91% | 151694.400 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3072,8,1), v(64,3072,8,1), m(3072,512,1,1) | 11 | 11378.700 | 13581.500 | +19.36% | 125165.700 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2560,8,1), v(64,2560,8,1), m(2560,512,1,1) | 11 | 9330.970 | 11146.200 | +19.45% | 102640.670 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2048,8,1), v(64,2048,8,1), m(2048,512,1,1) | 11 | 7256.960 | 8704.470 | +19.95% | 79826.560 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,8448,8,1), v(64,8448,8,1), m(8448,1,1,1), GET_ROWS | 640 | 99.852 | 330.433 | +230.92% | 63905.280 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,768,8,1), v(64,768,8,1), m(768,1,1,1) | 1536 | 41.078 | 21.552 | -47.53% | 63095.808 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1536,8,1), v(64,1536,8,1), m(1536,512,1,1) | 11 | 5409.510 | 6232.040 | +15.21% | 59504.610 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,512,8,1), v(64,512,8,1), m(512,512,1,1) | 23 | 1831.410 | 1344.900 | -26.56% | 42122.430 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1024,8,1), v(64,1024,8,1), m(1024,512,1,1) | 11 | 3602.490 | 3826.920 | +6.23% | 39627.390 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,8192,8,1), v(64,8192,8,1), m(8192,512,1,1), GET_ROWS | 1 | 36407.200 | 38794.400 | +6.56% | 36407.200 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7680,8,1), v(64,7680,8,1), m(7680,512,1,1), GET_ROWS | 1 | 33467.800 | 35914.400 | +7.31% | 33467.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7168,8,1), v(64,7168,8,1), m(7168,512,1,1), GET_ROWS | 1 | 31215.000 | 33216.500 | +6.41% | 31215.000 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6656,8,1), v(64,6656,8,1), m(6656,512,1,1), GET_ROWS | 1 | 28613.800 | 30707.400 | +7.32% | 28613.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6144,8,1), v(64,6144,8,1), m(6144,512,1,1), GET_ROWS | 1 | 25963.800 | 28149.800 | +8.42% | 25963.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5632,8,1), v(64,5632,8,1), m(5632,512,1,1), GET_ROWS | 1 | 23589.600 | 25710.100 | +8.99% | 23589.600 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5120,8,1), v(64,5120,8,1), m(5120,512,1,1), GET_ROWS | 1 | 21308.500 | 23325.600 | +9.47% | 21308.500 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4608,8,1), v(64,4608,8,1), m(4608,512,1,1), GET_ROWS | 1 | 18896.200 | 20814.600 | +10.15% | 18896.200 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4096,8,1), v(64,4096,8,1), m(4096,512,1,1), GET_ROWS | 1 | 15912.300 | 18446.700 | +15.93% | 15912.300 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3584,8,1), v(64,3584,8,1), m(3584,512,1,1), GET_ROWS | 1 | 13883.800 | 16001.300 | +15.25% | 13883.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3072,8,1), v(64,3072,8,1), m(3072,512,1,1), GET_ROWS | 1 | 11460.800 | 13592.200 | +18.60% | 11460.800 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2560,8,1), v(64,2560,8,1), m(2560,512,1,1), GET_ROWS | 1 | 9316.960 | 11144.000 | +19.61% | 9316.960 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2048,8,1), v(64,2048,8,1), m(2048,512,1,1), GET_ROWS | 1 | 7291.520 | 8712.640 | +19.49% | 7291.520 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1536,8,1), v(64,1536,8,1), m(1536,512,1,1), GET_ROWS | 1 | 5420.160 | 6224.320 | +14.84% | 5420.160 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,256,8,1), v(64,256,8,1), m(256,1,1,1) | 23 | 171.728 | 22.768 | -86.74% | 3949.744 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1024,8,1), v(64,1024,8,1), m(1024,512,1,1), GET_ROWS | 1 | 3605.760 | 3816.640 | +5.85% | 3605.760 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,512,8,1), v(64,512,8,1), m(512,512,1,1), GET_ROWS | 1 | 1946.880 | 1544.480 | -20.67% | 1946.880 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,256,8,1), v(64,256,8,1), m(256,1,1,1), GET_ROWS | 1 | 63.840 | 22.400 | -64.91% | 63.840 |
Nvidia RTX 3090

| Operator | Calls | Before (us) | After (us) | % Diff | Total (us) |
| --- | --- | --- | --- | --- | --- |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,256,8,1), v(64,256,8,1), m(256,1,1,1) | 23 | 48537.800 | 28816.100 | -40.63% | 1116369.400 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,8448,8,1), v(64,8448,8,1), m(8448,1,1,1) | 7040 | 48.538 | 135.512 | +179.19% | 341707.520 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,512,8,1), v(64,512,8,1), m(512,1,1,1) | 6144 | 18.465 | 12.315 | -33.31% | 113448.960 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,8192,8,1), v(64,8192,8,1), m(8192,512,1,1) | 11 | 9962.030 | 11596.100 | +16.40% | 109582.330 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7680,8,1), v(64,7680,8,1), m(7680,512,1,1) | 11 | 9382.070 | 10828.900 | +15.42% | 103202.770 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,768,8,1), v(64,768,8,1), m(768,512,1,1) | 180 | 563.706 | 588.714 | +4.44% | 101467.080 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7168,8,1), v(64,7168,8,1), m(7168,512,1,1) | 11 | 8689.940 | 10052.600 | +15.68% | 95589.340 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6656,8,1), v(64,6656,8,1), m(6656,512,1,1) | 11 | 8058.970 | 9286.190 | +15.23% | 88648.670 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6144,8,1), v(64,6144,8,1), m(6144,512,1,1) | 11 | 7426.700 | 8505.160 | +14.52% | 81693.700 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5632,8,1), v(64,5632,8,1), m(5632,512,1,1) | 11 | 6806.900 | 7740.690 | +13.72% | 74875.900 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5120,8,1), v(64,5120,8,1), m(5120,512,1,1) | 11 | 6181.240 | 6985.630 | +13.01% | 67993.640 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4608,8,1), v(64,4608,8,1), m(4608,512,1,1) | 11 | 5515.170 | 6220.520 | +12.79% | 60666.870 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4096,8,1), v(64,4096,8,1), m(4096,512,1,1) | 11 | 4809.540 | 5403.650 | +12.35% | 52904.940 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3584,8,1), v(64,3584,8,1), m(3584,512,1,1) | 11 | 4193.090 | 4679.680 | +11.60% | 46123.990 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3072,8,1), v(64,3072,8,1), m(3072,512,1,1) | 11 | 3589.120 | 4064.260 | +13.24% | 39480.320 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2560,8,1), v(64,2560,8,1), m(2560,512,1,1) | 11 | 2988.780 | 3375.380 | +12.94% | 32876.580 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,8448,8,1), v(64,8448,8,1), m(8448,1,1,1), GET_ROWS | 640 | 48.630 | 136.529 | +180.75% | 31123.200 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,768,8,1), v(64,768,8,1), m(768,1,1,1) | 1536 | 19.314 | 14.335 | -25.78% | 29666.304 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,512,8,1), v(64,512,8,1), m(512,512,1,1) | 23 | 1176.440 | 1127.470 | -4.16% | 27058.120 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2048,8,1), v(64,2048,8,1), m(2048,512,1,1) | 11 | 2396.810 | 2674.130 | +11.57% | 26364.910 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1536,8,1), v(64,1536,8,1), m(1536,512,1,1) | 11 | 1779.900 | 1993.080 | +11.98% | 19578.900 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1024,8,1), v(64,1024,8,1), m(1024,512,1,1) | 11 | 1192.400 | 1297.410 | +8.81% | 13116.400 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,8192,8,1), v(64,8192,8,1), m(8192,512,1,1), GET_ROWS | 1 | 9998.340 | 11567.100 | +15.69% | 9998.340 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7680,8,1), v(64,7680,8,1), m(7680,512,1,1), GET_ROWS | 1 | 9195.520 | 10782.700 | +17.26% | 9195.520 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,7168,8,1), v(64,7168,8,1), m(7168,512,1,1), GET_ROWS | 1 | 8519.680 | 9974.780 | +17.08% | 8519.680 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6656,8,1), v(64,6656,8,1), m(6656,512,1,1), GET_ROWS | 1 | 7960.580 | 9182.210 | +15.35% | 7960.580 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,6144,8,1), v(64,6144,8,1), m(6144,512,1,1), GET_ROWS | 1 | 7520.260 | 8569.860 | +13.96% | 7520.260 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5632,8,1), v(64,5632,8,1), m(5632,512,1,1), GET_ROWS | 1 | 6690.820 | 7724.030 | +15.44% | 6690.820 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,5120,8,1), v(64,5120,8,1), m(5120,512,1,1), GET_ROWS | 1 | 6100.990 | 6863.870 | +12.50% | 6100.990 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4608,8,1), v(64,4608,8,1), m(4608,512,1,1), GET_ROWS | 1 | 5401.600 | 6207.490 | +14.92% | 5401.600 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,4096,8,1), v(64,4096,8,1), m(4096,512,1,1), GET_ROWS | 1 | 4861.950 | 5522.430 | +13.58% | 4861.950 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3584,8,1), v(64,3584,8,1), m(3584,512,1,1), GET_ROWS | 1 | 4098.050 | 4643.840 | +13.32% | 4098.050 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,3072,8,1), v(64,3072,8,1), m(3072,512,1,1), GET_ROWS | 1 | 3531.780 | 3945.470 | +11.71% | 3531.780 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2560,8,1), v(64,2560,8,1), m(2560,512,1,1), GET_ROWS | 1 | 2956.290 | 3299.330 | +11.60% | 2956.290 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,2048,8,1), v(64,2048,8,1), m(2048,512,1,1), GET_ROWS | 1 | 2426.880 | 2709.500 | +11.65% | 2426.880 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1536,8,1), v(64,1536,8,1), m(1536,512,1,1), GET_ROWS | 1 | 1746.940 | 1927.170 | +10.32% | 1746.940 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,1024,8,1), v(64,1024,8,1), m(1024,512,1,1), GET_ROWS | 1 | 1131.520 | 1231.870 | +8.87% | 1131.520 |
| FLASH_ATTN_EXT dst(64,64,512,1), q(64,512,64,1), k(64,512,8,1), v(64,512,8,1), m(512,512,1,1), GET_ROWS | 1 | 615.424 | 618.496 | +0.50% | 615.424 |
| FLASH_ATTN_EXT dst(64,64,1,1), q(64,1,64,1), k(64,256,8,1), v(64,256,8,1), m(256,1,1,1), GET_ROWS | 1 | 81.920 | 36.864 | -55.00% | 81.920 |

Looks like the ubatch size (ne2) and the cache size (nek1) are the most relevant factors for whether 4 or 8 rows is better.
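
As a rough illustration of that observation (my own sketch, not code from this PR or the backend; the function name and cutoff values are hypothetical placeholders), a selection heuristic keyed on those two sizes might look like:

```cpp
#include <cstdint>

// Hypothetical sketch: pick the scalar FA row count from the query batch
// size and the KV cache length. The thresholds are illustrative, not tuned.
static uint32_t fa_scalar_num_rows(uint32_t n_batch, uint32_t n_kv) {
    // Long KV caches favored the 8-row shader in the logs above, for both
    // prompt processing and token generation.
    if (n_kv >= 4096) {
        return 8;
    }
    // With a short cache, single-token decode favored 4 rows; batched
    // prompt processing was mixed across GPUs, so keep 8 rows there.
    return n_batch > 1 ? 8 : 4;
}
```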

@jeffbolznv (Contributor, Author)

Using fewer rows is generally going to be worse for large batches (pp) because the KV cache has to be redundantly fetched more times. But it can be better for small batches (tg) when the grouped-query attention factor is small, or if the larger number of rows doesn't fit in the register file (the original case I think we were hitting on Intel). But it's clear that this change as-is is too simple.
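
To make the refetch cost concrete, here is a small standalone sketch (my own illustration, not backend code, and it ignores head count and GQA sharing): each group of `rows` query rows streams the entire KV cache once, so halving the row count roughly doubles KV traffic in the large-batch (pp) case.

```cpp
#include <cstdio>

int main() {
    const int n_q    = 512;   // pp512 ubatch: 512 query tokens
    const int kv_len = 8192;  // tokens resident in the KV cache

    const int row_opts[] = {4, 8};
    for (int rows : row_opts) {
        // ceil(n_q / rows) row groups, each streaming the full KV cache
        const int groups = (n_q + rows - 1) / rows;
        std::printf("rows=%d: %d KV passes, %d K/V tokens streamed\n",
                    rows, groups, groups * kv_len);
    }
    // rows=4 gives 128 passes (1048576 tokens) vs 64 passes (524288 tokens)
    // for rows=8: twice the KV traffic, which is why fewer rows hurts pp.
    return 0;
}
```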

@0cc4m (Contributor) commented on Dec 25, 2025

Superseded by #18280.
