
AVX IQ Quants #7845

Merged
ggerganov merged 16 commits into ggml-org:master from netrunnereve:avx_iq
Jun 21, 2024
Conversation

@netrunnereve (Collaborator) commented Jun 10, 2024

I finally had the time to work on original AVX (pre-AVX2) versions of the IQ quant ggml_vec_dot functions for Sandy Bridge and Ivy Bridge users.

Master:

iq3_xxs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :    121.59
      avg cycles/32 vals   :    122.17
      float32 throughput   :      2.68 GB/s
      quantized throughput :      0.26 GB/s

iq4_nl
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     67.44
      avg cycles/32 vals   :     67.99
      float32 throughput   :      4.92 GB/s
      quantized throughput :      0.69 GB/s

iq3_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :    111.94
      avg cycles/32 vals   :    112.66
      float32 throughput   :      2.93 GB/s
      quantized throughput :      0.32 GB/s

iq2_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     89.03
      avg cycles/32 vals   :     89.54
      float32 throughput   :      3.72 GB/s
      quantized throughput :      0.30 GB/s

iq4_xs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     87.30
      avg cycles/32 vals   :     88.05
      float32 throughput   :      3.72 GB/s
      quantized throughput :      0.49 GB/s

PR:

iq3_xxs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     24.12
      avg cycles/32 vals   :     24.41
      float32 throughput   :     13.87 GB/s
      quantized throughput :      1.33 GB/s

iq4_nl
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     10.12
      avg cycles/32 vals   :     10.33
      float32 throughput   :     38.15 GB/s
      quantized throughput :      5.36 GB/s

iq3_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     25.13
      avg cycles/32 vals   :     25.22
      float32 throughput   :     12.72 GB/s
      quantized throughput :      1.37 GB/s

iq2_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     14.95
      avg cycles/32 vals   :     15.16
      float32 throughput   :     21.80 GB/s
      quantized throughput :      1.75 GB/s

iq4_xs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     13.88
      avg cycles/32 vals   :     13.94
      float32 throughput   :     25.43 GB/s
      quantized throughput :      3.38 GB/s

Some example benchmarks:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ4_XS - 4.25 bpw (Master) | 4.13 GiB | 8.03 B | CPU | 8 | pp512 | 1.78 ± 0.08 |
| llama 8B IQ4_XS - 4.25 bpw (Master) | 4.13 GiB | 8.03 B | CPU | 8 | tg128 | 1.60 ± 0.05 |
| llama 8B IQ4_XS - 4.25 bpw (PR) | 4.13 GiB | 8.03 B | CPU | 8 | pp512 | 10.95 ± 0.04 |
| llama 8B IQ4_XS - 4.25 bpw (PR) | 4.13 GiB | 8.03 B | CPU | 8 | tg128 | 7.72 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw (Master) | 2.42 GiB | 8.03 B | CPU | 8 | pp512 | 1.09 ± 0.00 |
| llama 8B IQ2_XS - 2.3125 bpw (Master) | 2.42 GiB | 8.03 B | CPU | 8 | tg128 | 1.02 ± 0.00 |
| llama 8B IQ2_XS - 2.3125 bpw (PR) | 2.42 GiB | 8.03 B | CPU | 8 | pp512 | 6.65 ± 0.00 |
| llama 8B IQ2_XS - 2.3125 bpw (PR) | 2.42 GiB | 8.03 B | CPU | 8 | tg128 | 5.47 ± 0.14 |

The scalar IQ code is really slow on my computer, even with an 8B model. Pretty much any K quant of equivalent size can beat it with a 30B model! I mostly followed the original AVX2 implementation and converted the new 256-bit instructions into two 128-bit ones where required.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 10, 2024
@mofosyne mofosyne added the Review Complexity : High Generally require indepth knowledge of LLMs or GPUs label Jun 12, 2024
@netrunnereve netrunnereve marked this pull request as ready for review June 16, 2024 03:01
@ggerganov (Member) left a comment

Make sure the test-backend-ops passes

@netrunnereve (Collaborator, Author)

> Make sure the test-backend-ops passes

Yeah, it runs and passes with -b CPU; interestingly enough, the test is skipped by default if no backend is set. I actually used test-quantize-fns when writing this PR.

./tests/test-backend-ops -b CPU
Testing 1 backends

Backend 1/1 (CPU)
  Backend name: CPU
  ABS(type=f32,ne_a=[128,10,10,10],v=0): OK
...
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,type_KV=q4_0): OK
  1270/1270 tests passed
  Backend CPU: OK

1/1 backends passed
OK

Considering how it only takes a minute to run, I think it's worth adding the CPU version of test-backend-ops to the CI.
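A CI step along those lines could look something like the following sketch (a hypothetical config fragment, not the project's actual workflow; the build directory and binary path are assumptions based on a typical CMake layout):

```shell
# Hypothetical CI step: build and run the CPU backend of test-backend-ops.
cmake -B build
cmake --build build --target test-backend-ops -j
./build/bin/test-backend-ops -b CPU
```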

@slaren (Member) commented Jun 16, 2024

The goal of test-backend-ops is to compare GPU backends against the CPU backend. When used with the CPU backend, it only compares the CPU backend with itself. It can still detect some issues such as NaN and Inf values, but generally I don't think it adds much value.

@ggerganov (Member)

I compared the AVX CPU vs the GPU results on my Linux box and the tests are passing. Should be good to merge.

@netrunnereve netrunnereve mentioned this pull request Jun 21, 2024
@ggerganov ggerganov merged commit 7d5e877 into ggml-org:master Jun 21, 2024
@netrunnereve netrunnereve deleted the avx_iq branch June 21, 2024 15:26
HoiV added a commit to HoiV/llama_dc.cpp that referenced this pull request Jun 24, 2024
Update hv/matmul up to:
commit 557b653 (HEAD -> master, origin/master, origin/HEAD)
Author: k.h.lai <adrian.k.h.lai@outlook.com>
Date:   Fri Jun 21 16:28:20 2024 +0800

    vulkan: detect multiple devices by deviceUUID instead of deviceID (ggml-org#8022)

commit 7d5e877
Author: Eve <139727413+netrunnereve@users.noreply.github.com>
Date:   Fri Jun 21 05:57:36 2024 +0000

    ggml : AVX IQ quants (ggml-org#7845)

...