get_rows & dequantize function implementation for repacked weights of type q4_0 (q4_0x8) #3223
swetha097 wants to merge 15 commits into ggml-org:master
Conversation
@danbev Addressed the review comments.
Let's merge this after the refactoring in ggml-org/llama.cpp#13892 to avoid resolving the conflicts.
@swetha097 Would you be able to take a look at resolving the conflict reported here?
@danbev Resolved the conflicts.
@swetha097 Thanks! I think there will be a sync of ggml/llama.cpp shortly; hopefully this won't cause conflicts, but I'm not sure.
@swetha097 Hm, I am a bit confused about why this would make a difference for the performance. The only place that we use … (Lines 2543 to 2548 in 1e72e4b). How does this change lead to increased usage of the …
@ggerganov This tensor is used by both GGML_OP_MUL_MAT and GGML_OP_GET_ROWS. Because the original get_rows operation could not handle repacked weights, the tensor remained in a standard format, preventing GGML_OP_MUL_MAT from using its most optimized kernel (ggml_gemm_q4_0_8x8_q8_0); previously it fell back to the gemm4xN kernel instead.
@swetha097 Thanks, got it. @eddnjjn Could you help review this change?
ggml/src/ggml-cpu/repack.cpp (Outdated)

```cpp
const char * base_ptr_for_higher_dims_in_src0 = (const char *)src0->data + i11 * nb02 + i12 * nb03;

// Pointer to the first block_q4_0x8 of the identified row_group_idx
const block_q4_0x8 * p_first_repacked_block_of_group_x8 = (const block_q4_0x8 *)(base_ptr_for_higher_dims_in_src0 + row_group_idx * stride_between_actual_row_groups);
```
Wouldn't this fail if the type of this template is actually block_q4_0x4?
I think we should add support for testing extra buffer types to test-backend-ops before adding more complexity to this code.
Yes, I agree. We'll have to add tests before merging more stuff to the repacking.
I have updated the code with template changes that differentiate between the block_q4_0 variants (q4_0x8, q4_0x4). Please have a look at the changes and share your feedback.
Are there any other dependencies that should be resolved for the PR to move forward? Please share your thoughts.
The main concern here is that we don't have a way to test these changes, as noted earlier by slaren. The PR should wait until we extend test-backend-ops.
Given that llama.cpp already includes test-backend-ops, would it be acceptable to open this PR in llama.cpp directly?
This implements the GGML_OP_GET_ROWS operation specifically for the repacked (block-interleaved) 4-bit quantized format (q4_0x8) defined within ggml-cpu-aarch64.cpp.
The following gains were observed from the changes in this PR: they allow for increased usage of the GEMM function (ggml_gemm_q4_0_8x8_q8_0) for the q4_0 type defined within ggml-cpu-aarch64.cpp.
master branch commit - 82f461eaa4e6a1ba29fc0dbdaa415a9934ee8a1d
development (get_rows) branch commit - d6cc466bd48dd27474ecb00c3baba2e8a887f6c4
The PR was tested on an AMD Raphael 7600X, which supports the following flags:
Model for performance tests downloaded from https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-base.en.bin and quantized to q4_0.
This patch was also tested with the llama.cpp repository, and the perplexity of Q4_0 models was verified to be the same before and after the changes:
Model used for perplexity test quantized from - https://huggingface.co/meta-llama/Llama-2-7b