Better CPU SWA #757

Merged: ikawrakow merged 1 commit into main from ik/cpu_swa_v2, Sep 4, 2025

@ikawrakow (Owner) commented Sep 4, 2025

This is an alternative, much simpler implementation of the CPU flash attention (FA) optimization for sliding window attention (SWA) models than the one in #702.

It achieves a prompt processing (PP) speedup similar to that of #702. For instance, at a context of 32k tokens on a Ryzen-7950X CPU, PP is about 1.6X faster for GPT-OSS-20B-MXFP4 and about 2.1X faster for Gemma3-12B-Q4_0 than on the main branch.

GPT-OSS-20B-MXFP4

Main branch

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 256 | 0 | 1.980 | 517.07 | 11.838 | 21.62 |
| 1024 | 256 | 1024 | 2.131 | 480.51 | 12.088 | 21.18 |
| 1024 | 256 | 2048 | 2.289 | 447.43 | 12.292 | 20.83 |
| 1024 | 256 | 3072 | 2.471 | 414.40 | 12.394 | 20.66 |
| 1024 | 256 | 4096 | 2.646 | 387.02 | 12.604 | 20.31 |
| 1024 | 256 | 5120 | 2.825 | 362.53 | 12.740 | 20.09 |
| 1024 | 256 | 6144 | 3.005 | 340.76 | 12.880 | 19.88 |
| 1024 | 256 | 7168 | 3.320 | 308.43 | 13.034 | 19.64 |
| 1024 | 256 | 8192 | 3.436 | 298.04 | 13.223 | 19.36 |
| 1024 | 256 | 9216 | 3.713 | 275.80 | 13.345 | 19.18 |
| 1024 | 256 | 10240 | 3.811 | 268.69 | 13.493 | 18.97 |
| 1024 | 256 | 11264 | 4.325 | 236.79 | 13.376 | 19.14 |
| 1024 | 256 | 12288 | 4.494 | 227.85 | 13.525 | 18.93 |
| 1024 | 256 | 13312 | 4.566 | 224.27 | 13.831 | 18.51 |
| 1024 | 256 | 14336 | 4.482 | 228.46 | 14.135 | 18.11 |
| 1024 | 256 | 15360 | 4.713 | 217.27 | 14.291 | 17.91 |
| 1024 | 256 | 16384 | 4.987 | 205.33 | 14.534 | 17.61 |
| 1024 | 256 | 17408 | 5.484 | 186.73 | 14.553 | 17.59 |
| 1024 | 256 | 18432 | 5.677 | 180.37 | 14.768 | 17.33 |
| 1024 | 256 | 19456 | 5.785 | 177.00 | 14.346 | 17.84 |
| 1024 | 256 | 20480 | 5.652 | 181.16 | 15.094 | 16.96 |
| 1024 | 256 | 21504 | 6.108 | 167.64 | 15.249 | 16.79 |
| 1024 | 256 | 22528 | 6.192 | 165.39 | 15.263 | 16.77 |
| 1024 | 256 | 23552 | 6.138 | 166.83 | 15.474 | 16.54 |
| 1024 | 256 | 24576 | 6.945 | 147.44 | 15.672 | 16.33 |
| 1024 | 256 | 25600 | 7.072 | 144.80 | 16.156 | 15.85 |
| 1024 | 256 | 26624 | 6.799 | 150.60 | 15.862 | 16.14 |
| 1024 | 256 | 27648 | 7.117 | 143.88 | 16.271 | 15.73 |
| 1024 | 256 | 28672 | 7.650 | 133.85 | 16.171 | 15.83 |
| 1024 | 256 | 29696 | 7.419 | 138.02 | 16.344 | 15.66 |
| 1024 | 256 | 30720 | 8.036 | 127.43 | 16.782 | 15.25 |
| 1024 | 256 | 31744 | 8.266 | 123.88 | 16.949 | 15.10 |

PR

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 256 | 0 | 1.993 | 513.76 | 11.842 | 21.62 |
| 1024 | 256 | 1024 | 2.053 | 498.86 | 12.038 | 21.27 |
| 1024 | 256 | 2048 | 2.133 | 480.17 | 12.245 | 20.91 |
| 1024 | 256 | 3072 | 2.232 | 458.81 | 12.430 | 20.59 |
| 1024 | 256 | 4096 | 2.314 | 442.56 | 12.599 | 20.32 |
| 1024 | 256 | 5120 | 2.404 | 425.97 | 12.746 | 20.08 |
| 1024 | 256 | 6144 | 2.497 | 410.12 | 12.900 | 19.84 |
| 1024 | 256 | 7168 | 2.587 | 395.89 | 13.040 | 19.63 |
| 1024 | 256 | 8192 | 2.682 | 381.83 | 13.180 | 19.42 |
| 1024 | 256 | 9216 | 2.774 | 369.18 | 13.319 | 19.22 |
| 1024 | 256 | 10240 | 2.894 | 353.79 | 13.594 | 18.83 |
| 1024 | 256 | 11264 | 3.172 | 322.85 | 13.663 | 18.74 |
| 1024 | 256 | 12288 | 3.271 | 313.05 | 13.779 | 18.58 |
| 1024 | 256 | 13312 | 3.251 | 314.96 | 13.777 | 18.58 |
| 1024 | 256 | 14336 | 3.624 | 282.56 | 14.228 | 17.99 |
| 1024 | 256 | 15360 | 3.511 | 291.66 | 14.155 | 18.08 |
| 1024 | 256 | 16384 | 3.504 | 292.25 | 14.518 | 17.63 |
| 1024 | 256 | 17408 | 3.793 | 269.98 | 14.338 | 17.85 |
| 1024 | 256 | 18432 | 3.972 | 257.79 | 14.488 | 17.67 |
| 1024 | 256 | 19456 | 4.127 | 248.12 | 14.776 | 17.33 |
| 1024 | 256 | 20480 | 3.798 | 269.60 | 15.039 | 17.02 |
| 1024 | 256 | 21504 | 4.409 | 232.24 | 14.810 | 17.29 |
| 1024 | 256 | 22528 | 4.194 | 244.14 | 15.571 | 16.44 |
| 1024 | 256 | 23552 | 4.313 | 237.42 | 15.386 | 16.64 |
| 1024 | 256 | 24576 | 4.526 | 226.24 | 15.575 | 16.44 |
| 1024 | 256 | 25600 | 4.482 | 228.45 | 15.735 | 16.27 |
| 1024 | 256 | 26624 | 4.482 | 228.49 | 16.177 | 15.82 |
| 1024 | 256 | 27648 | 4.537 | 225.69 | 16.216 | 15.79 |
| 1024 | 256 | 28672 | 4.756 | 215.33 | 16.305 | 15.70 |
| 1024 | 256 | 29696 | 4.735 | 216.27 | 16.765 | 15.27 |
| 1024 | 256 | 30720 | 5.095 | 200.97 | 16.782 | 15.25 |
| 1024 | 256 | 31744 | 5.113 | 200.28 | 16.992 | 15.07 |

Gemma3-12B-Q4_0

Main branch

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 256 | 0 | 5.397 | 189.72 | 30.771 | 8.32 |
| 1024 | 256 | 1024 | 5.708 | 179.39 | 32.954 | 7.77 |
| 1024 | 256 | 2048 | 6.055 | 169.11 | 32.955 | 7.77 |
| 1024 | 256 | 3072 | 6.329 | 161.79 | 33.215 | 7.71 |
| 1024 | 256 | 4096 | 6.698 | 152.89 | 33.520 | 7.64 |
| 1024 | 256 | 5120 | 7.102 | 144.19 | 33.847 | 7.56 |
| 1024 | 256 | 6144 | 7.693 | 133.11 | 34.230 | 7.48 |
| 1024 | 256 | 7168 | 7.924 | 129.22 | 34.547 | 7.41 |
| 1024 | 256 | 8192 | 8.320 | 123.08 | 34.860 | 7.34 |
| 1024 | 256 | 9216 | 9.005 | 113.71 | 35.180 | 7.28 |
| 1024 | 256 | 10240 | 9.486 | 107.95 | 35.468 | 7.22 |
| 1024 | 256 | 11264 | 10.555 | 97.02 | 35.710 | 7.17 |
| 1024 | 256 | 12288 | 9.521 | 107.55 | 35.885 | 7.13 |
| 1024 | 256 | 13312 | 10.266 | 99.75 | 36.158 | 7.08 |
| 1024 | 256 | 14336 | 11.016 | 92.95 | 36.846 | 6.95 |
| 1024 | 256 | 15360 | 10.807 | 94.75 | 37.342 | 6.86 |
| 1024 | 256 | 16384 | 11.195 | 91.47 | 37.230 | 6.88 |
| 1024 | 256 | 17408 | 11.334 | 90.35 | 37.460 | 6.83 |
| 1024 | 256 | 18432 | 11.666 | 87.77 | 37.777 | 6.78 |
| 1024 | 256 | 19456 | 12.065 | 84.88 | 38.069 | 6.72 |
| 1024 | 256 | 20480 | 12.427 | 82.40 | 38.585 | 6.63 |
| 1024 | 256 | 21504 | 12.763 | 80.23 | 38.870 | 6.59 |
| 1024 | 256 | 22528 | 13.118 | 78.06 | 39.021 | 6.56 |
| 1024 | 256 | 23552 | 13.476 | 75.99 | 39.306 | 6.51 |
| 1024 | 256 | 24576 | 14.218 | 72.02 | 39.605 | 6.46 |
| 1024 | 256 | 25600 | 14.196 | 72.13 | 40.374 | 6.34 |
| 1024 | 256 | 26624 | 14.555 | 70.36 | 40.674 | 6.29 |
| 1024 | 256 | 27648 | 15.319 | 66.84 | 40.468 | 6.33 |
| 1024 | 256 | 28672 | 15.260 | 67.10 | 40.658 | 6.30 |
| 1024 | 256 | 29696 | 15.550 | 65.85 | 41.093 | 6.23 |
| 1024 | 256 | 30720 | 16.079 | 63.69 | 41.418 | 6.18 |
| 1024 | 256 | 31744 | 16.510 | 62.02 | 41.926 | 6.11 |

PR

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 256 | 0 | 5.450 | 187.91 | 30.839 | 8.30 |
| 1024 | 256 | 1024 | 5.751 | 178.06 | 32.972 | 7.76 |
| 1024 | 256 | 2048 | 5.807 | 176.35 | 33.240 | 7.70 |
| 1024 | 256 | 3072 | 5.858 | 174.80 | 33.537 | 7.63 |
| 1024 | 256 | 4096 | 5.923 | 172.89 | 33.816 | 7.57 |
| 1024 | 256 | 5120 | 5.989 | 170.98 | 34.100 | 7.51 |
| 1024 | 256 | 6144 | 6.060 | 168.97 | 34.379 | 7.45 |
| 1024 | 256 | 7168 | 6.119 | 167.35 | 34.661 | 7.39 |
| 1024 | 256 | 8192 | 6.187 | 165.50 | 34.934 | 7.33 |
| 1024 | 256 | 9216 | 6.255 | 163.71 | 35.203 | 7.27 |
| 1024 | 256 | 10240 | 6.327 | 161.84 | 35.489 | 7.21 |
| 1024 | 256 | 11264 | 6.425 | 159.38 | 35.720 | 7.17 |
| 1024 | 256 | 12288 | 6.479 | 158.06 | 36.702 | 6.98 |
| 1024 | 256 | 13312 | 6.715 | 152.49 | 36.673 | 6.98 |
| 1024 | 256 | 14336 | 6.567 | 155.92 | 36.668 | 6.98 |
| 1024 | 256 | 15360 | 6.585 | 155.50 | 37.465 | 6.83 |
| 1024 | 256 | 16384 | 6.639 | 154.24 | 37.790 | 6.77 |
| 1024 | 256 | 17408 | 6.716 | 152.48 | 38.139 | 6.71 |
| 1024 | 256 | 18432 | 6.763 | 151.42 | 38.197 | 6.70 |
| 1024 | 256 | 19456 | 6.826 | 150.02 | 38.497 | 6.65 |
| 1024 | 256 | 20480 | 7.063 | 144.98 | 38.756 | 6.61 |
| 1024 | 256 | 21504 | 7.045 | 145.34 | 39.013 | 6.56 |
| 1024 | 256 | 22528 | 7.092 | 144.39 | 39.752 | 6.44 |
| 1024 | 256 | 23552 | 7.078 | 144.68 | 39.992 | 6.40 |
| 1024 | 256 | 24576 | 7.134 | 143.55 | 40.327 | 6.35 |
| 1024 | 256 | 25600 | 7.187 | 142.47 | 40.647 | 6.30 |
| 1024 | 256 | 26624 | 7.248 | 141.27 | 40.937 | 6.25 |
| 1024 | 256 | 27648 | 7.329 | 139.71 | 41.248 | 6.21 |
| 1024 | 256 | 28672 | 7.374 | 138.87 | 41.566 | 6.16 |
| 1024 | 256 | 29696 | 7.928 | 129.17 | 41.878 | 6.11 |
| 1024 | 256 | 30720 | 7.545 | 135.72 | 41.819 | 6.12 |
| 1024 | 256 | 31744 | 7.915 | 129.37 | 42.051 | 6.09 |
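
The 1.6X and 2.1X figures quoted in the description can be cross-checked against the last row (N_KV = 31744) of each table. The snippet below is only an illustrative sketch: the dictionary layout and the `speedup` helper are hypothetical names introduced here, and the S_PP values are copied by hand from the tables above.

```python
# S_PP (prompt processing speed, tokens/s) at N_KV = 31744,
# copied by hand from the "Main branch" and "PR" tables above.
pp_speed = {
    "GPT-OSS-20B-MXFP4": {"main": 123.88, "pr": 200.28},
    "Gemma3-12B-Q4_0": {"main": 62.02, "pr": 129.37},
}


def speedup(main_tps: float, pr_tps: float) -> float:
    """Hypothetical helper: ratio of PR throughput to main-branch throughput."""
    return pr_tps / main_tps


# Speedup per model, rounded to two decimals for display.
speedups = {
    model: round(speedup(v["main"], v["pr"]), 2) for model, v in pp_speed.items()
}

for model, s in speedups.items():
    print(f"{model}: {s:.2f}x")
```

This gives roughly 1.62x for GPT-OSS-20B-MXFP4 and 2.09x for Gemma3-12B-Q4_0, in line with the rounded figures in the description.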
