Conversation
ggml/src/ggml-cpu/ggml-cpu-hbm.cpp
I'm not sure this is correct, and I cannot test it. It may not work or build on the master branch either.
It could probably be removed; the normal CPU buffer type calls ggml_aligned_malloc, which already uses HBM, so at the moment this buffer type serves no purpose.
ggml/src/ggml-cpu/ggml-cpu.c
It looks to me like this was not written for dynamic repacking, only for the "native" Q4_0_N_M packing.
I left it commented out; it needs some work to be usable with dynamic repacking.
We should try to fix this to keep support for the aarch64 types in models with experts.
(force-pushed 36a0406 to 655a3fb)
Overall looks good. I am not sure about removing support for current Q4_0_x_x models, but I guess if we are going to do it, it is better to do it sooner rather than later.
Yes, that will be the main/difficult choice:
@slaren I still need your expertise so as not to make too many mistakes. I was looking for where the size is calculated. If that is the place, it looks to me like the size is not calculated correctly for llamafile and Q4_0 repacking:
Note: I'm trying to make it more generic to make it easier to reintegrate the AMX backend, so it may not be useful to fix it for now.
It's OK if we over-allocate a bit of memory for
Isn't
Yes, that is the case for Q4_0_N_M, so it is not critical for now, even if internally it is more of a Q8_0_N. But it may not work in other/future cases.
If we remove the old API and make the CPU backend accessible only through ggml-backend, then there will be a context that can be used to store the work buffer. Then the work buffer could simply be a |
So you confirm that, for now, this is where the size is calculated.
Yes, the size is calculated in the function |
(force-pushed 655a3fb to e772df4)
(force-pushed a411d95 to fd768e0)
(force-pushed fd768e0 to 16154eb)
I can't find how to enable C++17 for macOS-latest-swift / Xcode...
(force-pushed 16154eb to dc8adeb)
OK, now that #10570 is merged and C++17 is the default, I need to do some more work.
(force-pushed dc8adeb to 1b29245)
(force-pushed 1b29245 to 733f891)
@slaren @ggerganov what do you think of this refactor? I tried to make adding a "cpu-extra-buffer" simpler and more general. 🤞
slaren
left a comment
The design looks good, this is a good improvement. Just reformat the code according to the .clang-format file and remove outdated comments.
(force-pushed 8e5bd04 to b14b471)
I updated the size checks; it should be better like this. 🤞
We should add debug logs about what repacks are applied to what tensors, so that when running with |
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
(force-pushed be3c64b to 1221d13)
@ggerganov: I added two logs, is that what you were thinking?
Yes, perfect. Should we merge this or is there anything else you are planning to do?
For me, we can merge it (if the CI succeeds 🤞).
@slaren @ggerganov thanks for all your reviews and time.
Just saw the opened bug. Is the implication of this change that I should no longer be making Q4_0_N_M quants? They seem to be fully removed.
Yes, support for |
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
- clean Q4_0_N_M and IQ4_0_N_M
- remove from "file" tensor type
- allow only with dynamic repack
- extract cpu extra bufts and convert to C++
- hbm
- "aarch64"
- more generic use of extra buffer
- generalise extra_supports_op
- new API for "cpu-accel":
- amx
- aarch64
* clang-format
* Clean Q4_0_N_M ref
Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <[email protected]>
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <[email protected]>
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <[email protected]>
goal: consolidation of the cpu backend for reintegration of the AMX backend.