Vulkan: some improvement on mul_mat_iq2_xs #18031

Merged
0cc4m merged 3 commits into ggml-org:master from lovedheart:lovedheart-mul_mat_iq2_xs_improve
Dec 21, 2025

lovedheart (Contributor) commented Dec 14, 2025

Before:

Backend 1/3: Vulkan0
  Device description: AMD Radeon 780M Graphics
  Device memory: 73642 MB (69960 MB free)

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               3408 runs -   296.84 us/run - 117.44 MFLOP/run - 395.64 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               2982 runs -   363.47 us/run - 234.88 MFLOP/run - 646.21 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               2556 runs -   425.20 us/run - 352.32 MFLOP/run - 828.60 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               1704 runs -   649.41 us/run - 469.76 MFLOP/run - 723.37 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               1881 runs -   572.77 us/run - 587.20 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                428 runs -  3006.30 us/run - 939.52 MFLOP/run - 312.52 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                       90 runs - 11222.01 us/run -  60.13 GFLOP/run -   5.36 TFLOPS
  Backend Vulkan0: OK
Backend 2/3: Vulkan1
  Device description: NVIDIA GeForce RTX 5060 Ti
  Device memory: 15962 MB (15051 MB free)

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              20448 runs -    50.41 us/run - 117.44 MFLOP/run -   2.33 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              14484 runs -    70.38 us/run - 234.88 MFLOP/run -   3.34 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              10508 runs -    95.92 us/run - 352.32 MFLOP/run -   3.67 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               8094 runs -   125.17 us/run - 469.76 MFLOP/run -   3.75 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               4275 runs -   234.96 us/run - 587.20 MFLOP/run -   2.50 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               2675 runs -   377.51 us/run - 939.52 MFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                      728 runs -  1374.24 us/run -  60.13 GFLOP/run -  43.75 TFLOPS

After:

Backend 1/3: Vulkan0
  Device description: AMD Radeon 780M Graphics
  Device memory: 73642 MB (69960 MB free)

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               4260 runs -   274.42 us/run - 117.44 MFLOP/run - 427.96 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               2982 runs -   350.13 us/run - 234.88 MFLOP/run - 670.83 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               2272 runs -   488.74 us/run - 352.32 MFLOP/run - 720.87 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               1917 runs -   550.91 us/run - 469.76 MFLOP/run - 852.71 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               2223 runs -   464.25 us/run - 587.20 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               1391 runs -   721.72 us/run - 939.52 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                       92 runs - 11043.90 us/run -  60.13 GFLOP/run -   5.44 TFLOPS
  Backend Vulkan0: OK
Backend 2/3: Vulkan1
  Device description: NVIDIA GeForce RTX 5060 Ti
  Device memory: 15962 MB (15191 MB free)

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              20448 runs -    49.10 us/run - 117.44 MFLOP/run -   2.39 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              14484 runs -    70.39 us/run - 234.88 MFLOP/run -   3.34 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              10792 runs -    94.55 us/run - 352.32 MFLOP/run -   3.73 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               6603 runs -   152.09 us/run - 469.76 MFLOP/run -   3.09 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               5301 runs -   191.29 us/run - 587.20 MFLOP/run -   3.07 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               3210 runs -   314.95 us/run - 939.52 MFLOP/run -   2.98 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                      774 runs -  1294.60 us/run -  60.13 GFLOP/run -  46.45 TFLOPS
  Backend Vulkan1: OK
Backend 3/3: CPU
  Skipping CPU backend
3/3 backends passed
OK

Reference:

CUDA (5060 Ti)

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              22152 runs -    45.72 us/run - 117.44 MFLOP/run -   2.57 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              21300 runs -    47.85 us/run - 234.88 MFLOP/run -   4.91 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              15620 runs -    64.36 us/run - 352.32 MFLOP/run -   5.47 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              14697 runs -    68.42 us/run - 469.76 MFLOP/run -   6.87 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):              11115 runs -    90.04 us/run - 587.20 MFLOP/run -   6.52 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               7383 runs -   136.86 us/run - 939.52 MFLOP/run -   6.86 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                      782 runs -  1279.14 us/run -  60.13 GFLOP/run -  47.01 TFLOPS

ROCm (780M)

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               4260 runs -   285.02 us/run - 117.44 MFLOP/run - 412.05 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               2982 runs -   363.85 us/run - 234.88 MFLOP/run - 645.55 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               2556 runs -   432.75 us/run - 352.32 MFLOP/run - 814.14 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               2130 runs -   502.78 us/run - 469.76 MFLOP/run - 934.33 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               1881 runs -   561.89 us/run - 587.20 MFLOP/run -   1.05 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               1391 runs -   742.95 us/run - 939.52 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                       98 runs - 10378.29 us/run -  60.13 GFLOP/run -   5.79 TFLOPS
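
To make the before/after comparison easier to read, here is a minimal sketch that turns the Vulkan `us/run` figures above into speedup ratios. The timings are copied from the logs in this comment; the dictionary layout and the `speedup` helper are illustrative names, not part of the benchmark tool.

```python
# us/run values copied from the Vulkan benchmark output above, keyed by n.
# The dictionary layout and helper name are illustrative assumptions.
before = {
    "780M (Vulkan)": {1: 296.84, 2: 363.47, 3: 425.20, 4: 649.41,
                      5: 572.77, 8: 3006.30, 512: 11222.01},
    "RTX 5060 Ti (Vulkan)": {1: 50.41, 2: 70.38, 3: 95.92, 4: 125.17,
                             5: 234.96, 8: 377.51, 512: 1374.24},
}
after = {
    "780M (Vulkan)": {1: 274.42, 2: 350.13, 3: 488.74, 4: 550.91,
                      5: 464.25, 8: 721.72, 512: 11043.90},
    "RTX 5060 Ti (Vulkan)": {1: 49.10, 2: 70.39, 3: 94.55, 4: 152.09,
                             5: 191.29, 8: 314.95, 512: 1294.60},
}


def speedup(before_us, after_us):
    """Return before/after time; a value above 1 means 'after' is faster."""
    return before_us / after_us


speedups = {
    dev: {n: speedup(before[dev][n], after[dev][n]) for n in before[dev]}
    for dev in before
}

for dev, per_n in speedups.items():
    for n, ratio in per_n.items():
        print(f"{dev} n={n}: {ratio:.2f}x")
```

On these numbers the largest change is n=8 on the 780M (roughly 4x faster), while n=3 on the 780M and n=4 on the 5060 Ti come out slower than before. These are single runs, so small differences may just be noise.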

Refactor calculations for db values and grid data to optimize performance and reduce redundancy.
lovedheart marked this pull request as ready for review December 14, 2025 13:54
lovedheart requested a review from 0cc4m as a code owner December 14, 2025 13:54
github-actions bot added the Vulkan and ggml labels Dec 14, 2025
0cc4m (Contributor) commented Dec 15, 2025

Before I accidentally merge a PR in this state again, please fix your editor configuration to prevent tabs and trailing whitespace. This PR has them again, as you can see in the EditorConfig Checker CI.
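
For a quick local check before pushing, a small script along these lines can flag the two problems mentioned here. It is a rough sketch and not how the EditorConfig Checker itself works; the function name and the sample text are made up for illustration.

```python
def find_whitespace_issues(text):
    """Return (line_number, issue) pairs for tabs and trailing whitespace."""
    issues = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            issues.append((lineno, "tab"))
        if line != line.rstrip():
            issues.append((lineno, "trailing whitespace"))
    return issues


# Made-up sample: line 2 starts with a tab, line 3 ends with two spaces.
sample = "first line\n\tsecond line\nthird line  \n"
issues = find_whitespace_issues(sample)
for lineno, kind in issues:
    print(f"line {lineno}: {kind}")
```

Setting the editor to insert spaces and trim trailing whitespace on save avoids the problem at the source; a script like this only catches what slipped through.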

0cc4m (Contributor) commented Dec 21, 2025

Thank you!

0cc4m merged commit 4117ae5 into ggml-org:master Dec 21, 2025
66 of 67 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
* Some improvement on mul_mat_iq2_xs

Refactor calculations for db values and grid data to optimize performance and reduce redundancy.

* Fix trailing whitespace
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026