
AVX IQ Quants #7845

Merged
ggerganov merged 16 commits into ggml-org:master from netrunnereve:avx_iq
Jun 21, 2024
Conversation

@netrunnereve (Collaborator) commented Jun 10, 2024

I finally had the time to work on original AVX (pre-AVX2) versions of the IQ quant ggml_vec_dot functions for Sandy Bridge and Ivy Bridge users.

Master:

iq3_xxs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :    121.59
      avg cycles/32 vals   :    122.17
      float32 throughput   :      2.68 GB/s
      quantized throughput :      0.26 GB/s

iq4_nl
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     67.44
      avg cycles/32 vals   :     67.99
      float32 throughput   :      4.92 GB/s
      quantized throughput :      0.69 GB/s

iq3_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :    111.94
      avg cycles/32 vals   :    112.66
      float32 throughput   :      2.93 GB/s
      quantized throughput :      0.32 GB/s

iq2_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     89.03
      avg cycles/32 vals   :     89.54
      float32 throughput   :      3.72 GB/s
      quantized throughput :      0.30 GB/s

iq4_xs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     87.30
      avg cycles/32 vals   :     88.05
      float32 throughput   :      3.72 GB/s
      quantized throughput :      0.49 GB/s

PR:

iq3_xxs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     24.12
      avg cycles/32 vals   :     24.41
      float32 throughput   :     13.87 GB/s
      quantized throughput :      1.33 GB/s

iq4_nl
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     10.12
      avg cycles/32 vals   :     10.33
      float32 throughput   :     38.15 GB/s
      quantized throughput :      5.36 GB/s

iq3_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     25.13
      avg cycles/32 vals   :     25.22
      float32 throughput   :     12.72 GB/s
      quantized throughput :      1.37 GB/s

iq2_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     14.95
      avg cycles/32 vals   :     15.16
      float32 throughput   :     21.80 GB/s
      quantized throughput :      1.75 GB/s

iq4_xs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     13.88
      avg cycles/32 vals   :     13.94
      float32 throughput   :     25.43 GB/s
      quantized throughput :      3.38 GB/s

Some example benchmarks:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ4_XS - 4.25 bpw (Master) | 4.13 GiB | 8.03 B | CPU | 8 | pp512 | 1.78 ± 0.08 |
| llama 8B IQ4_XS - 4.25 bpw (Master) | 4.13 GiB | 8.03 B | CPU | 8 | tg128 | 1.60 ± 0.05 |
| llama 8B IQ4_XS - 4.25 bpw (PR) | 4.13 GiB | 8.03 B | CPU | 8 | pp512 | 10.95 ± 0.04 |
| llama 8B IQ4_XS - 4.25 bpw (PR) | 4.13 GiB | 8.03 B | CPU | 8 | tg128 | 7.72 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw (Master) | 2.42 GiB | 8.03 B | CPU | 8 | pp512 | 1.09 ± 0.00 |
| llama 8B IQ2_XS - 2.3125 bpw (Master) | 2.42 GiB | 8.03 B | CPU | 8 | tg128 | 1.02 ± 0.00 |
| llama 8B IQ2_XS - 2.3125 bpw (PR) | 2.42 GiB | 8.03 B | CPU | 8 | pp512 | 6.65 ± 0.00 |
| llama 8B IQ2_XS - 2.3125 bpw (PR) | 2.42 GiB | 8.03 B | CPU | 8 | tg128 | 5.47 ± 0.14 |

The scalar IQ code is really slow on my computer, even with an 8B model. Pretty much any K quant of equivalent size can beat it with a 30B model! I mostly followed the original AVX2 implementation and converted the new 256-bit instructions into two 128-bit ones where required.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 10, 2024
@mofosyne mofosyne added the Review Complexity : High Generally require indepth knowledge of LLMs or GPUs label Jun 12, 2024
@netrunnereve netrunnereve marked this pull request as ready for review June 16, 2024 03:01
@ggerganov (Member) left a comment

Make sure the test-backend-ops passes

@netrunnereve (Collaborator, Author)

> Make sure the test-backend-ops passes

Yeah, it runs and passes with -b CPU; interestingly enough, the test is skipped by default if no backend is set. I actually used test-quantize-fns when writing this PR.

./tests/test-backend-ops -b CPU
Testing 1 backends

Backend 1/1 (CPU)
  Backend name: CPU
  ABS(type=f32,ne_a=[128,10,10,10],v=0): OK
...
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,type_KV=q4_0): OK
  1270/1270 tests passed
  Backend CPU: OK

1/1 backends passed
OK

Considering how it only takes a minute to run, I think it's worth adding the CPU version of test-backend-ops to the CI.
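A CI step along those lines could look something like the following sketch (a hypothetical config fragment, not the project's actual workflow; the build directory and binary path are assumptions based on a typical CMake layout):

```shell
# Hypothetical CI step: build and run the CPU backend of test-backend-ops.
cmake -B build
cmake --build build --target test-backend-ops -j
./build/bin/test-backend-ops -b CPU
```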

@slaren (Member) commented Jun 16, 2024

The goal of test-backend-ops is to compare GPU backends against the CPU backend. When used with the CPU backend, it only compares the CPU backend with itself. It can still detect some issues such as NaN and Inf values, but generally I don't think it adds much value.

@ggerganov (Member)

I compared the AVX CPU vs the GPU results on my Linux box and the tests are passing. Should be good to merge.

@netrunnereve netrunnereve mentioned this pull request Jun 21, 2024
@ggerganov ggerganov merged commit 7d5e877 into ggml-org:master Jun 21, 2024
@netrunnereve netrunnereve deleted the avx_iq branch June 21, 2024 15:26
HoiV added a commit to HoiV/llama_dc.cpp that referenced this pull request Jun 24, 2024
Update hv/matmul up to:
commit 557b653 (HEAD -> master, origin/master, origin/HEAD)
Author: k.h.lai <adrian.k.h.lai@outlook.com>
Date:   Fri Jun 21 16:28:20 2024 +0800

    vulkan: detect multiple devices by deviceUUID instead of deviceID (ggml-org#8022)

commit 7d5e877
Author: Eve <139727413+netrunnereve@users.noreply.github.com>
Date:   Fri Jun 21 05:57:36 2024 +0000

    ggml : AVX IQ quants (ggml-org#7845)

...